Sunday, December 21, 2008

uClassify library mashup? (with prize!)

I keep up with the Museum of Modern Betas* and today it found something wonderful: uClassify.

uClassify is a place where you can build, train and use automatic classification systems. It's free, and can be handled either on the website or via an API. Of course, this sort of thing was possible before uClassify, but you needed specialized tools. Now anyone can do it—on a whim.

Their examples are geared toward the simple:
  • Text language. What language is some text in?
  • Gender. Did a man or a woman write the blog? It was made for genderanalyzer.com. (It's right only 63% of the time.)
  • Mood.
  • Which classical author does your text most resemble? Used on oFaust.com (this blog is Edgar Allan Poe).
Where did I lose the librarians? At mood? But wait, come back! The language classifier works very well. It managed to suss out Norwegian, Swedish and Dutch reviews of The Hobbit.** So what if the others are trivial? The idea is solid. Create a classification. Feed it data and the right answer. Watch it get better and better.
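
To make "feed it data and the right answer" concrete, here is a minimal sketch of a trainable text classifier. It uses naive Bayes as a stand-in; it is not uClassify's API or necessarily its algorithm, and the class name, the two labels and the sample texts are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocab = set()

    def train(self, text, label):
        """Feed the classifier one text together with its right answer."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return (label, probability) pairs, most likely first."""
        if not self.doc_counts:
            return []
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((label, v / z) for label, v in exp.items()),
                      key=lambda pair: -pair[1])


# Made-up training data: four tiny "descriptions" with their right answers.
nb = NaiveBayes()
nb.train("a wizard and a dragon guard the enchanted ring", "fiction")
nb.train("the hobbit leaves home on a quest with dwarves", "fiction")
nb.train("a history of the war and its political causes", "non-fiction")
nb.train("the biography of a president and his policies", "non-fiction")

ranked = nb.classify("a dragon and a wizard on a quest")
for label, prob in ranked:
    print(f"{label} ({prob:.1%})")
```

Every extra call to `train` adjusts the word counts, which is the sense in which such a classifier can get better as it sees more labeled examples.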

Now, I'm a skeptic of automatic classification in the library world. There's a big difference between spam/not-spam and, say, giving a book Library of Congress Subject Headings. But it's worth testing. And, even if "real" classification is not amenable to automatic processes, there must be other interesting book- and library-related projects.

The Prize! So, LibraryThing calls on the book and library worlds to create something cool with uClassify by February 1, 2009 and post it here. The winner gets Toby Segaran's Programming Collective Intelligence and a $100 gift certificate to Amazon or IndieBound. You can do it by hand or programmatically. If you use a lot of LibraryThing data, and it's not one of the sets we release openly, shoot me an email about what you're doing and I'll give you the green light.

Some ideas. My idea list...
  • Fiction vs. Non-Fiction. Feed it Amazon data, Common Knowledge or LT tags.***
  • DDC. Train it with Amazon's DDC numbers and book descriptions. Do ten thousand books and see how well it's guessing the rest.
  • Do a crosswalk, e.g., DDC to LCC, BISAC to DDC, DDC to Cutter, etc.
Merry data-driven Christmas!


*A website that tracks new "betas." Basically, it tracks new web 2.0 apps. It also keeps tabs on their popularity, according to Delicious bookmarks. LibraryThing is now number 12, beating out Gmail. Life isn't fair.
**Yes, we're going to get it going for reviews on the site itself. Give us some time. Cool as it is, we're pretty busy right now. Note: You can't give it the URL alone. You have to give it the text of the review.
***We may do this with tags. We already do it very crudely, using it only for book recommendations.


3 Comments:

Anonymous said...

You need to change the due date. (Or give us time machines.)

12/21/2008 5:01 PM  
kvista said...

Hello,

I'm writing to let you know that I took you up on your LibraryThing/Uclassify challenge. It was a lot of fun...

Here's what I did: My goal was to create a classifier that would automatically "tag" any book description based on actual LibraryThing tags. For example, if you paste the book description for "Truman" into UClassify, it should return to you LibraryThing tags that suit the book. This is one step more general than one of your ideas (fiction vs. non-fiction).

Here's how I did it:

(sorry, I did not find LibraryThing API calls that suited my needs, but I would have been happy to use them if they existed -- hopefully, I didn't miss them...)

1) Getting the tags. I manually extracted the "most popular tags" (e.g., "history") from the LibraryThing Zeitgeist page.

2) Getting the training examples for each tag. Since the LibraryThing URLs are predictable, I wrote a program that automatically fetched all HTML pages for each tag (e.g., "http://www.librarything.com/tag/history"). For each tag home page fetched, the program then fetched the LibraryThing book pages for each of the "books most frequently tagged". For example, the HTML pages for "Guns, Germs and Steel" through "Founding Brothers" were fetched for the history tag. The program stored each HTML page locally - this resulted in my "training examples" for UClassify. I think this was a good heuristic because it capitalizes on the wide audience agreement for suitability of a tag (thanks LibraryThing for sorting in this manner!). BTW, since I wasn't using any official API, I wrote my program to pause 1.5 seconds between HTTP calls, so that I didn't overly tax your server(s).
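
A throttled fetch loop like the one described in step 2 might look roughly like this. It is a sketch, not the program actually used: the function names are invented, and percent-encoding tags that contain spaces is an assumption about how the site's tag URLs work. Only the URL pattern and the 1.5 second pause come from the text above.

```python
import time
import urllib.parse
import urllib.request

TAG_URL = "http://www.librarything.com/tag/"  # URL pattern quoted in step 2


def tag_url(tag):
    """Build a tag page URL; percent-encoding of spaces is an assumption."""
    return TAG_URL + urllib.parse.quote(tag)


def http_get(url):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(urls, delay=1.5, fetch=http_get, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause between calls, not before the first one
        pages[url] = fetch(url)
    return pages


# Example (performs real HTTP requests, so it is left commented out):
# pages = fetch_politely(tag_url(t) for t in ["history", "biography"])
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network or actually waiting.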

3) Classifier creation and training. Now that I had the pages stored locally, I then wrote a separate program to programmatically create a classifier via UClassify, add classes for each of the LibraryThing tags, and then train each class using the first description (usually the Amazon description) for each book.

The classifier I created is now public -- you can find it at: http://www.uclassify.com/browse/kvista/tags

4) Testing. I then spot-tested the classifier using some book descriptions NOT used for training. This is a standard machine learning technique. (NOTE: I could have tested many books using another program, but I thought that was a bit overboard for this contest.)
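
For anyone who wants to go beyond spot tests, a held-out evaluation can be sketched in a few lines. This is an illustration of the general technique, not the program mentioned in the note: the function names, the toy examples and the keyword-based `toy_predict` are all hypothetical, and `predict` can be any function that maps a description to a single label.

```python
import random


def holdout_split(examples, n_train, seed=0):
    """Shuffle (text, label) pairs reproducibly and split into train/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(predict, test_examples):
    """Fraction of held-out examples whose predicted label is correct."""
    if not test_examples:
        raise ValueError("no held-out examples to test on")
    correct = sum(1 for text, label in test_examples if predict(text) == label)
    return correct / len(test_examples)


# Made-up data and a deliberately crude predictor, for illustration only.
examples = [
    ("the life of a president", "biography"),
    ("a president remembered", "biography"),
    ("causes of the great war", "history"),
    ("empires of the ancient world", "history"),
]


def toy_predict(text):
    """Hypothetical rule: anything mentioning a president is a biography."""
    return "biography" if "president" in text else "history"


train, test = holdout_split(examples, n_train=2)
print(f"trained on {len(train)}, tested on {len(test)}: "
      f"{accuracy(toy_predict, test):.0%} correct")
```

The point of the split is that the accuracy figure is computed only on descriptions the classifier never saw during training.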

To see the result of my work, try entering a book that is NOT listed in the most frequently tagged books for a given tag. For example, David McCullough's book "Truman" (http://www.librarything.com/work/14040) is not listed in the most frequently tagged "history" books, but when you run it through the classifier, it returns:

1. biography (17.0 %)
2. 20th century (10.9 %)
3. history (9.7 %)

which is pretty good! But don't think the classifier is perfect -- it is far from that (for example, it ranks "classics" and "christianity" ahead of "science fiction" for Isaac Asimov's Caves of Steel!). One really needs a better algorithm (pretty much a human) to pick out truly representative books for each class that aren't going to throw the classifier off. But, for a purely automated approach, it's not bad.

Try it out and let me know what you think!

-- kv (kvista7@gmail.com)

2/01/2009 5:44 AM  
Anonymous said...

Great initiative! But where is the follow up?

Please see
http://blog.uclassify.com/librarything-competition-follow-up/

2/22/2009 5:41 AM  
