Monday, July 20, 2009

LTFL: Non-ISBN Matching

Short Story. We've been going through so many big changes at LibraryThing lately that we let a pretty substantial improvement go by without giving it the fanfare it deserves: the LibraryThing for Libraries (LTFL) Cataloging Enhancements now pick up many non-ISBN items. All LibraryThing for Libraries libraries will see better coverage (5-15%), and academic libraries with older materials should be especially pleased.

Some examples:

The coolest thing about the LibraryThing office: Need a photo of an old book? Grab iPhone, swivel chair 180 degrees and shoot. Second coolest thing: The only hot Web 2.0 company with a 1774 edition of Terence.

Long Story. Our enhancements usually run on the basis of the ISBN. ISBNs are easy to pick out of the HTML without knowing the structure of the page (/[0-9Xx]{10,13}/, if you speak regular expressions*), and most books have them, so they're our primary way of knowing what content to load for a particular page.
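
For readers who don't speak regular expressions, here is a minimal Python sketch of what that loose pattern does to a page. The function name and the sample markup are made up for illustration; this is not the actual LTFL code.

```python
import re

# The loose pattern quoted in the post: any run of 10 to 13 digits or X's.
LOOSE_ISBN = re.compile(r"[0-9Xx]{10,13}")


def find_isbn_candidates(html: str) -> list[str]:
    """Return every 10-13 character run of digits/X found in the page text."""
    return LOOSE_ISBN.findall(html)


# Hypothetical catalog markup, invented for this example.
sample = '<td class="isbn">ISBN: 0765319853</td><td>9780765319852 (hbk.)</td>'
print(find_isbn_candidates(sample))
```

Note that the pattern needs no knowledge of the page layout, which is the point. It is also deliberately loose: an 11-digit number would match too, so the results are best treated as candidates rather than confirmed ISBNs.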

However, as a part of our reviews enhancement, we developed a JavaScript library called the LibraryThing Connector that, among other things, screen-scrapes the title and author of the book out of the HTML. This is what allows our reviews to work on any item a library owns, whether or not it is in LibraryThing or has an ISBN. It's tricky stuff, because it requires specific code for every type of library software that we provide reviews for.

To get title matching, therefore, we take the title and author extracted by the Connector and feed them to our own "What Work" fuzzy matching API. Of course, this method is far from foolproof, so we err on the side of caution, only loading enhancement data if we've got a strong match on both the title and the author. We haven't seen any false positives yet, and even with pretty strict matching, real-world stats show we're able to provide around 5-15% more content in the catalog. Academic libraries will get more of a boost out of this, because they tend to have a lot more non-ISBN items than public libraries.
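
The "strong match on both" rule can be sketched roughly as follows. This is not the "What Work" API: it swaps in Python's standard difflib as a stand-in similarity measure, and the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and repeated whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_strong_match(page_title: str, page_author: str,
                    work_title: str, work_author: str,
                    threshold: float = 0.9) -> bool:
    """Accept only if BOTH title and author clear the (assumed) threshold."""
    return (similarity(page_title, work_title) >= threshold
            and similarity(page_author, work_author) >= threshold)


# Same title and author, different capitalization and spacing: accepted.
print(is_strong_match("The Comedies of Terence", "Terence",
                      "the comedies of  terence", "TERENCE"))
# Same title, different author: rejected.
print(is_strong_match("The Comedies of Terence", "Terence",
                      "The Comedies of Terence", "Plautus"))
```

Requiring both fields to pass trades coverage for safety: a near-identical title by a different author loads nothing rather than the wrong thing.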

We did this because it's fun and useful and kind of magic, but more importantly because we want to constantly improve our products. LibraryThing for Libraries is a subscription service. Every year when it is time for a library to renew with us, we want it to be clear that they're getting something better from us than they were a year ago, and that even better things are in store for the future. It's more fun and challenging for us that way, and we know it works pretty well as a business strategy too.

In my mind a big reason why LibraryThing.com has succeeded is that a membership comes with an expectation of improvement. We don't call a membership an investment, but you get to expect that you will be able to do more and better and cooler things with LibraryThing over time, and that it will become more valuable to you. As a result of this, our members become deeply involved in the site and how it works, and if a LibraryThing membership is a great investment, members end up making an even greater investment of their knowledge and enthusiasm right back. It's a great thing to be a part of, so I hope it's a philosophy we can keep bringing to the library world as well. — Casey

*Pace Casey, who wrote this post, ISBNs are /([0-9]{9}[0-9X]|97[89][0-9]{10})/i !


7 Comments:

Blogger Andrew Timson said...

At first glance I think both regexes are defective, because they don't seem to allow for hyphens. That said, I don't speak regex enough to know how to fix them!

7/20/2009 5:27 PM  
Blogger Tim said...

Ha. Good point. I think we strip them out before we look... :)

7/20/2009 5:59 PM  
Blogger Katya said...

MARC ISBNs don't include hyphens, anyway, so I don't see why an OPAC would add them back in.

7/21/2009 8:15 AM  
Blogger Casey Durfee said...

Yeah, we do strip the hyphens. Actually what happens is we take the text on the page, throw away parts of the page that have timestamps and things like that which might coincidentally contain a valid ISBN, then with what's left we throw everything away that's not whitespace or [0-9xX]. Then we use a regex to pull out the ISBNs on what's left...

Hyphens frequently not only show up in library catalogs, but some systems actually require the hyphens to be in the ISBN for a link to an item by ISBN to work right. It's funky.
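
A rough Python sketch of the clean-up steps Casey describes in this comment, for anyone following along. The timestamp rule (a `ts=` URL parameter) and the function name are hypothetical, and the ISBN pattern is a word-bounded variant of the one in the post's footnote with the ISBN-13 branch tried first; the real code is not shown in the post.

```python
import re

# Hypothetical timestamp rule: a "ts=<digits>" URL parameter. Real pages
# would need rules tuned to each catalog system.
TIMESTAMP = re.compile(r"\bts=\d+")

# ISBN-13 (978/979 prefix) or ISBN-10 (nine digits plus a digit or X).
ISBN = re.compile(r"\b(97[89][0-9]{10}|[0-9]{9}[0-9X])\b", re.IGNORECASE)


def extract_isbns(page_text: str) -> list[str]:
    # 1. Throw away parts that might coincidentally look like an ISBN.
    text = TIMESTAMP.sub(" ", page_text)
    # 2. Keep only whitespace, digits and X; this also drops hyphens.
    text = re.sub(r"[^0-9xX\s]", "", text)
    # 3. Pull the ISBNs out of what is left.
    return [match.upper() for match in ISBN.findall(text)]


page = 'ISBN 0-7653-1985-3 <a href="/item?ts=1248107700">link</a> 978-0-7653-1985-2'
print(extract_isbns(page))
```

Without step 1, the ten-digit timestamp in the link would be picked up as if it were an ISBN-10, which is why the noisy parts go first.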

7/21/2009 12:35 PM  
Blogger Katya said...

"Hyphens frequently not only show up in library catalogs, but some systems actually require the hyphens to be in the ISBN for a link to an item by ISBN to work right. It's funky."

How very odd . . .

7/21/2009 12:36 PM  
Blogger Unknown said...

Does this change what I should be uploading? I currently extract books with MARC 020 matching the ISBN format, and upload ISBN, Title, and Author. Should I be uploading book data for items without ISBNs?

7/29/2009 9:00 AM  
Blogger Catreona said...

Ummm... I'm not a computer techy. But, I wonder if it would be possible to do something similar for non-library (that is, ordinary people) members? I don't have a whole lot of books with no ISBN, but I have enough that cataloging them on LT is sort of difficult. So, any help would be very greatly appreciated.

Thanks so much for the magic that is Library Thing!

8/28/2009 9:05 PM  
