Wednesday, August 16, 2006

thingLang

Introducing thingLang, a simple, pragmatic API for determining the language of a book. thingLang uses LibraryThing's MARC records when it can. When it can't, it falls back on the Group Identifier embedded at the start of the ISBN. I'm releasing it for free, for both commercial and noncommercial use.* Aw, what the heck!

Examples:

http://www.librarything.com/api/thingLang.php?isbn=2070525570 (Harry Potter in French)
http://www.librarything.com/api/thingLang.php?isbn=9955081260 (The Hobbit in Lithuanian)

Rather than returning XML, which--shh!--I don't really like, thingLang returns a naked three-letter string, following the MARC language codes. There are two exceptions: (1) if the ISBN is invalid, it returns "invalid"; (2) if it really can't guess (see below), it returns "unknown." It works equally well with ISBN10 and ISBN13. It's not perfect, but it's probably good enough.
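For readers wondering what "invalid" might mean in practice, here is a minimal sketch of the standard ISBN check-digit tests in Python. It is an illustration only: the post doesn't say which checks thingLang actually runs, and the function names are made up for this example.

```python
DIGITS = "0123456789"


def clean_isbn(raw: str) -> str:
    """Drop hyphens and spaces; upper-case a trailing 'x' check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn: str) -> bool:
    """ISBN-10: weights 10 down to 1, total must be divisible by 11."""
    if len(isbn) != 10:
        return False
    if any(c not in DIGITS for c in isbn[:9]):
        return False
    if isbn[9] not in DIGITS + "X":
        return False
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, total must be divisible by 10."""
    if len(isbn) != 13 or any(c not in DIGITS for c in isbn):
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(raw: str) -> bool:
    """Accept either form, with or without hyphens."""
    isbn = clean_isbn(raw)
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

Running a check like this locally before calling the API would save a request for ISBNs that are obviously mistyped.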

UPDATE: If you add "&display=name" you'll get the language's name instead of the code, e.g., for the Greek New Testament.
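A minimal Python sketch of a client, assuming the endpoint still behaves as described in this post. The URL and the `isbn` and `display` parameters come from the examples above; the function names, the timeout, and the `(status, value)` return shape are choices made for this illustration, not part of the API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in the post; it may not be live any more.
BASE_URL = "http://www.librarything.com/api/thingLang.php"


def build_thinglang_url(isbn, display=None):
    """Build the request URL; pass display="name" for the language name."""
    params = {"isbn": isbn}
    if display is not None:
        params["display"] = display
    return BASE_URL + "?" + urlencode(params)


def interpret_response(body):
    """Map the plain-text body to a (status, value) pair.

    "invalid" and "unknown" are the two special strings the post mentions;
    anything else is treated as the language code (or name).
    """
    text = body.strip()
    if text in ("invalid", "unknown"):
        return (text, None)
    return ("ok", text)


def fetch_language(isbn, display=None):
    """Call the service once. Callers should keep to one request per second."""
    with urlopen(build_thinglang_url(isbn, display), timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
    return interpret_response(body)
```

Keeping the parsing in `interpret_response` separate from the network call makes it easy to test without hitting the service.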

The weeds. Using ISBNs to determine language is a tricky problem. Strictly speaking, ISBNs don't encode the language, but a queer mixture of language and region. The code 5, for example, is the "Evil Empire" code. Okay, they don't call it that, but that's what it is--Azerbaijan, Tajikistan, Armenia, Estonia, Georgia, etc. You know the language has seen a translation of Lenin's works, but that's about it. The same problem affects India (dozens of languages, with most of the LibraryThing books being English), Sri Lanka, and others.

Or take Egypt. Although most books published with Egyptian ISBNs are in Arabic, most of the Egyptian books logged on LibraryThing are merely from Egyptian publishers. In fact, they're mostly tourist guidebooks.

Or take Ethiopia. Of about 50 books with Ethiopian ISBNs, none are in Amharic or about Ethiopia. When Avon published its Science Fiction Hall of Fame (1970), was it just poaching numbers? (ISBNs, after all, cost money.)
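To make the "Group Identifier" idea concrete, here is a rough Python sketch of pulling the group off the front of an ISBN. The range table is a simplified approximation written for this example; the International ISBN Agency publishes the authoritative ranges, and nothing here is taken from thingLang's own code.

```python
# (first prefix, last prefix, length) for group identifiers.
# Simplified assumption for illustration; not the official range table.
GROUP_RANGES = [
    ("0", "5", 1),
    ("600", "649", 3),
    ("7", "7", 1),
    ("80", "94", 2),
    ("950", "989", 3),
    ("9900", "9989", 4),
    ("99900", "99999", 5),
]


def group_identifier(isbn):
    """Return the group identifier of an ISBN, or None if it can't be found."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) == 13:
        if not digits.startswith("978"):
            return None  # other prefixes use ranges this sketch doesn't model
        digits = digits[3:]  # the group sits right after the 978 prefix
    if len(digits) != 10:
        return None
    if any(c not in "0123456789" for c in digits[:9]):
        return None
    for first, last, length in GROUP_RANGES:
        prefix = digits[:length]
        # Equal-length digit strings compare the same way as numbers.
        if first <= prefix <= last:
            return prefix
    return None
```

With this table, the two example ISBNs at the top of the post come out as group 2 (the French-language area) and group 9955 (Lithuania), and anything starting with 5 lands in the catch-all group discussed above.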

So, Abby and I went through the numbers, running them against LibraryThing's holdings. If most of the books for a given code were in a single language, we use it. If not, we ignore it.
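The rule above can be sketched in a few lines of Python. This is a guess at the shape of the process, not the code Abby and I ran: "most" is read here as "more than half," the 0.5 threshold is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def dominant_language(language_counts, threshold=0.5):
    """Return the language holding more than `threshold` of the books, else None."""
    total = sum(language_counts.values())
    if total == 0:
        return None
    language, count = max(language_counts.items(), key=lambda item: item[1])
    return language if count / total > threshold else None


def build_group_table(holdings, threshold=0.5):
    """Build {group identifier: language} from (group, language) pairs.

    Groups with no clearly dominant language are left out, so a lookup
    on them falls through to "unknown".
    """
    counts = {}
    for group, language in holdings:
        counts.setdefault(group, Counter())[language] += 1
    table = {}
    for group, language_counts in counts.items():
        language = dominant_language(language_counts, threshold)
        if language is not None:
            table[group] = language
    return table
```

A group whose books are split four ways never clears the threshold and is dropped, which is the "we ignore it" case. A group that is nine-tenths one language is kept.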

The result is an API that works pretty well for LibraryThing and, we suspect, for many other sites. (We made this, in part, because BookMooch was looking for a solution, and we felt generous.)

*Terms. Don't hit it more than once/second. If you use it more than experimentally, you must put a notice somewhere on your website reasonably near what it's contributing, linking to LibraryThing.

6 Comments:

Anonymous Anonymous said...

Talking about languages, I must then repeat my post from the groups:

Tim, when languages were introduced, you had put Nynorsk (nno) as the choice for Norwegian on the short list, but after discussion in the Google group changed it to "general" Norsk (nor).
It's taken me a long time to notice that while it's now correct when you add new books manually (addnew.php) or edit directly from the catalog page, it still has Nynorsk on the short list if you use the card_edit.php page. Any chance of getting that corrected?

(I guess that's the reason most of the books on the language.php?l=nno page incorrectly have ended there)

8/16/2006 7:04 PM  
Anonymous Ottox said...

Oops, that other post was by me.

To get back to your post here, I must admit that I don't get the sentence "When Avon published its Science Fiction Hall of Fame (1970), was it just poaching numbers? (ISBNs, after all, cost money.)" so I don't know if it's worth mentioning that while ISBNs cost money in the US, they don't necessarily cost money in other places in the world. In Denmark, for example, you can get 10 or 100 numbers for free, 1000 numbers cost $42 and 10,000 only $265 - a bit cheaper than ten numbers in the US. I wouldn't be surprised if it was even cheaper in Ethiopia. ;)

8/16/2006 8:59 PM  
Blogger Jeremy Dunck said...

Please consider an optional parameter to change which kind of language code is returned.

IANA maintains the registry for the IETF's RFC 3066 language codes.

These are the codes used in HTTP's Accept-Language, XML's lang attribute, and many other internet standards.

8/17/2006 9:52 AM  
Blogger Jeremy Dunck said...

Oh, I see you have display= now.

So add display=rfc3066 if it's not too much trouble. :)

8/17/2006 10:08 AM  
Blogger LibraryThing said...

Someone give me a complete table between the two and I'll think about it...

8/17/2006 4:39 PM  
Blogger Michael Rodgers said...

Just sitting here thinking about the comment regarding Ethiopia and Amharic. I have a few cheap paperbacks published in Ethiopia. Most do not have ISBNs.

Also, a large portion of books published in Ethiopia are published in English, as it's quickly becoming the national language. All education after sixth grade is in English... so in this case it shouldn't be surprising that Ethiopian ISBNs are coming back as English books. But I agree with your assessment that ISBNs don't always help with language, especially in countries with multilingual populations.

8/23/2006 7:38 PM  
