Monday, February 26, 2007

New feed: Compare your library with LibraryThing

Over on Next Generation Catalogs for Libraries, NCSU's Emily Lynema asked me:
"Do you have any idea of the coverage of non-fiction, research materials in LT? Have you done any projects to look at overlap with a research institution (or with WorldCat)?"
No, we haven't. And I'm dying to find out, both for academic and non-academic libraries.

So I put together a feed of all of LibraryThing's unique ISBNs. With a little work, library programmers should be able to compare them against their holdings.

If you're not up to the task, but still want to find out how LibraryThing compares to your library, you can send me a file with ISBNs—just ISBNs or a more detailed dump—and I'll do the comparison.

See our Feeds and APIs page for the file, AllLibraryThingISBNs.xml.gz.
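
For anyone wanting a starting point, here is a rough sketch of the comparison in Python. It assumes the ISBNs have already been pulled out of the feed file and out of your own catalog into plain lists of strings (the layout of the feed file is not described here, so parsing it is left out). The names `normalize` and `compare` are made up for illustration, and "unique" here means ISBNs found in one list but not in the other.

```python
def normalize(raw):
    """Strip hyphens and spaces and upper-case a trailing 'x' check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def compare(list_a, list_b):
    """Compare two lists of ISBN strings.

    Returns, for each list, the number of distinct ISBNs ("total"), how many
    of those do not appear in the other list ("unique"), and that share as a
    percentage, plus the number of ISBNs the two lists have in common.
    """
    # Sets drop duplicates within each list; blank entries are skipped.
    a = {normalize(x) for x in list_a if x.strip()}
    b = {normalize(x) for x in list_b if x.strip()}
    common = a & b

    def stats(isbns):
        unique = len(isbns - common)
        percent = 100.0 * unique / len(isbns) if isbns else 0.0
        return {"total": len(isbns), "unique": unique, "percent_unique": percent}

    return {"a": stats(a), "b": stats(b), "in_common": len(common)}


if __name__ == "__main__":
    # Tiny made-up lists, just to show the shape of the result.
    library = ["0-306-40615-2", "0000000000", "080442957x"]
    librarything = ["0306406152", "1234567890"]
    print(compare(library, librarything))
```

This only matches exact ISBNs after normalization. It does nothing about different editions of the same work, which is where a service like xISBN would come in.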

Complications and opportunities:
  • I included only valid ISBNs.
  • It's a week or two old.
  • About 20% of LibraryThing books have invalid or no ISBN. Many of these have LCCNs. I suspect a high percentage are library-ish books.
  • I have turned all ISBN-13s in 978 format into ISBN-10s. There are a few bogus ones too, including the valid but numerically absurd 0000000000. (Bowker should auction that one off!)
  • There can be little doubt that LibraryThing is stronger in paperbacks and weaker in the formats libraries collect. It would therefore be very useful to run all ISBNs through OCLC's xISBN service*. (By definition, they're not going to be improved by running them through xISBN's "alternative service provider," thingISBN.) Unfortunately, I can't run them through xISBN on my own.
  • The feed is available for non-commercial use only. That basically means libraries and hobbyists. Other use is expressly prohibited.
  • I am guessing the overlap won't always be that impressive as a percentage. But these are the books people think enough of to own. They're going to move more than other library books.
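
As an illustration of what "valid" means in the list above, here is a sketch of the standard ISBN-10 check in Python: each of the ten characters is weighted from 10 down to 1, a final X counts as 10, and the weighted sum must be divisible by 11. This is the published check-digit rule, not necessarily the exact code used to build the feed, and the function name `is_valid_isbn10` is just for illustration.

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn):
    """Return True if the string passes the standard ISBN-10 checksum."""
    s = isbn.replace("-", "").replace(" ", "").upper()
    if len(s) != 10:
        return False
    # First nine characters must be digits; the last may be a digit or X.
    if not all(c in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS and s[9] != "X":
        return False
    total = 0
    for position, ch in enumerate(s):
        value = 10 if ch == "X" else int(ch)
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0
```

Under this rule 0000000000 does pass, since its weighted sum is zero, which is how a numerically absurd ISBN can still count as valid.
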
I'm looking forward to what people find out!

*Which is moving, but will not break.

10 Comments:

Blogger Robert J. said...

"Do you have any idea of the coverage of non-fiction, research materials in LT? Have you done any projects to look at overlap with a research institution (or with WorldCat)?"

I've got lots of research materials in my LT collection, but many of them will precisely be those without ISBNs (19th century, for example). I'd like to suggest, however, that an important component of the "research" value of LT will come to lie in the metadata (the cataloging) rather than in the objects. Dedicated collectors will be in a position to go well beyond even the standard full-level detail that may be found in academic library catalogs. We saw that already in the Talk thread on multiple authors (see my message #57 on relator codes). In a previous thread I noted the prospect of LT as a global union catalog for private libraries. There are all sorts of rarely used MARC fields that LT users would gladly fill, and fill very quickly if past performance is an indicator of the future. Why not pick a few and do a trial run. Illustrators? Book designers? Typographers? Paper engineers? Provenance? Perhaps some folks in the library community would offer suggestions on data they wish they could access ("I sure wish we had catalog access to all series with numbering peculiarities") -- just turn the army of Thingamabrarians loose and you'll probably have it in a week.

2/26/2007 7:14 PM  
Blogger James said...

Hi Tim,

Here are my results...

List 1 (ISBNs University of Waikato Library)
Total / Unique / Percentage
178460 / 133201 / 74.639
List 2 (ISBNs Library Thing)
Total / Unique / Percentage
1774322 / 1729063 / 97.449

Any other stats you particularly wanted?

We have approximately 500,000 Bib records, so that suggests we have a lot of stuff without ISBNs, or I made a mistake somewhere :) (hopefully the latter given that we do have lots of dvd's, cd's, videos, journals, etc) I just wouldn't have thought it would add up to more than half of our collection? or maybe that is just because has a number of isbn's that double up.

I probably can't give you the list of ISBN's without official permission but I can probably do a little more playing around with the data over here without getting in trouble :)

I'll try to post the perl code I used to do this comparison to our library blog http://librarycogs.blogspot.com
later today or tomorrow.

I didn't run xisbn over either dataset.

2/27/2007 8:44 PM  
Blogger James said...

The other stat I missed was of course, # of isbn's in common. Answer: 45,259

Which is actually a lot of books!

2/27/2007 9:00 PM  
Blogger Tim said...

James,

Interesting stuff. Some points:

1. The total is decent, but not as good as I'd like.
2. It looks a lot better once I found out your library is in New Zealand. Chances are, you're collecting non-US editions much of the time, which LT is somewhat weaker in. That's one reason I think xISBN would up the numbers.
3. No doubt using non-ISBN data--LCCNs, but also title and author--would up the numbers.
4. 50k ISBNs is a good base, however, and it can be assumed to be the most heavily-used part.

Thanks for giving the data a spin!

2/27/2007 10:23 PM  
Blogger James said...

Hi Tim,

I have now documented my progress on our blog:
librarycogs.blogspot.com/2007/02/compare-your-library-with-librarything.html

And posted the perlcode (mega-simple stuff) on my webpage so others can easily give this a whirl if they so desire.

In Response to your response :)

1. Assume you mean decent as in 'enough to tell you something'? I think given our location and other factors even if we had a million isbn's you'd still want to get a few more people sharing stats before you started drawing any conclusions. So I agree, we need to get a lot more people involved with this if you want to get decent stats...

2. XISBN would possibly help with US Editions, we are probably also getting a lot of NZ published material that isn't likely to have a lot of relevance to other areas of the world (making it less likely to be added to LibraryThing).

3. Yes, doing a comparison on more full records would be interesting but I probably wouldn't get support from the library to take on a project of that magnitude either...

4. I wonder if it is the most heavily used, or if the needs down under are just significantly different to LibraryThing's (I assume) largely North American user base. It would be interesting to know if those 45,000 are also very popular on librarything?

2/27/2007 11:33 PM  
Anonymous Anonymous said...

Cheers for the code James :-)

Here's the scores from Huddersfield...

List 1 (LibraryThing)
Total / Unique / Percentage
1774322 / 1714038 / 96.60%

List 2 (University of Huddersfield, UK)
Total / Unique / Percentage
240293 / 180009 / 74.91%

Total in common: 60284

It's interesting that we both have around 75% of unique ISBNs.

Here's our vital statistics...

293423 bib records on the catalogue
400487 items records
208952 bibs that have at least 1 ISBN
240240 bibs that have at least 1 item attached to them

If our list of ISBNs is of use to anyone, you can download it at http://161.112.232.18/isbns.zip

2/28/2007 6:02 PM  
Blogger James said...

Hi Dave,

Glad to see that at least someone has found my code useful.

I just found a major issue with the way I was normalising the ISBNs (Quite a few were being dropped entirely :) which explains why our total unique isbns was so low.

Here are my new results.

List 1 (LibraryThing)
Total / Unique / Percentage
1,774,320 / 1,700,943 / 95.86
List 2 (Waikato, NZ)
Total / Unique / Percentage
292,073 / 218,696 / 74.88
Total in common: 73,377

It doesn't change the percentages a lot. Interesting that Huddersfield has such similar numbers...

Out of interest I also did the comparison with your isbn list:
List 1 (Huddersfield, UK)
Total / Unique / Percentage
240293 / 201682 / 83.93
List 2 (Waikato, NZ)
Total / Unique / Percentage
292073 / 253462 / 86.78
Total in common: 38611

3/01/2007 8:36 PM  
Blogger Emily said...

Tim,

I may have spoken a bit hastily on ngc4lib. What I get for trying to finish something up on a Friday afternoon!

What method did you use to convert your ISBN-13 numbers to ISBN-10? For some reason, I thought I was safe just chopping off the beginning '978'. *grin*

-emily lynema

3/09/2007 10:51 PM  
Blogger Tim said...

Hey, no, both ISBN-10 and ISBN-13 have checksums. When you convert ISBN13 978s down to ISBN10s you need to recalculate the checksum digit. The methods are different. I can't remember the details of the algorithms, but ISBN10s are 0-9 plus X; ISBN13s are just 0-9.

I think I use some PHP code Blyberg posted (with modifications discussed on his blog). You're in Java, right? I'm sure there's a good library. Maybe ask on NGC4Lib.

You could get around it by chopping the 978 and then chopping ALL checksum off--both yours and mine. That's a pretty violent way to do it, though. And you'd have to trust that both sets are valid.
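
For reference, a minimal sketch of the conversion in Python, using the standard published ISBN-10 check-digit rule rather than the PHP code mentioned above: drop the 978 prefix and the old ISBN-13 check digit, then recompute the check character with weights 10 down to 2 over the remaining nine digits. The function name `isbn13_to_isbn10` is made up for illustration, and the sketch does not verify that the incoming ISBN-13 check digit is itself correct.

```python
def isbn13_to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 to ISBN-10, recalculating the check character."""
    s = isbn13.replace("-", "").replace(" ", "")
    if len(s) != 13 or not all(c in "0123456789" for c in s):
        raise ValueError("expected 13 digits")
    if not s.startswith("978"):
        raise ValueError("only 978-prefixed ISBN-13s have an ISBN-10 form")

    core = s[3:12]  # the nine digits between the prefix and the ISBN-13 check digit
    total = sum((10 - i) * int(d) for i, d in enumerate(core))  # weights 10..2
    check = (11 - total % 11) % 11
    # A check value of 10 is written as X, which is why ISBN-10s are 0-9 plus X.
    return core + ("X" if check == 10 else str(check))
```

Simply chopping off 978 keeps the ISBN-13 check digit in the last position, so the result will usually be an invalid ISBN-10 and will not match the feed.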

3/09/2007 11:01 PM  
Blogger Emily said...

Here are the results from comparison with North Carolina State University Libraries collection (including checksum calculations didn't change the results much). Thanks again to James for some perl comparison code that gave me a place to start.

List 1 (LibraryThing)
Total / Unique / Percentage
1774322 / 1542835 / 87%

List 2 (NCSU Libraries, Raleigh, NC)
Total / Unique / Percentage
798128 / 566644 / 71%

Total in common: 231484

Again, the unique percentage within our collection is surprisingly similar to what has been reported at other institutions.

Basic statistics for the NCSU Libraries collection:

1731810 bib records
798128 bibs that have at least 1 ISBN

That means that currently, only about 46% of our bib records have an ISBN. Lots of government docs, serials, and Early English Books Online in there.

Out of all of our bib records, 13.4% had ISBNs that overlapped with LibraryThing. Out of bib records with ISBNs, 29% overlapped with LibraryThing.

3/13/2007 12:02 PM  
