Sunday, February 01, 2009

The evil 3.26%

The question has arisen of why I advocate against OCLC's attempt to monopolize library data. Roy Tennant of OCLC, an intelligent, likeable man who, although we disagree on some issues, has done more for libraries than most, accused me of writing and talking about the issue because:
"... your entire business model is built on the fact that you can use catalog records for free that others created and not contribute anything back unless they pay (yes, there is a limited set of data available via an API, but then they need the chops to do something with it)."
Fair enough. Let's look at the numbers, and the argument.

I did a comprehensive analysis, available here as a text file, with both output and PHP code. If anyone doubts it, send me an email and I'll let you run the SQL queries yourself.

The numbers. As of 6:17pm Sunday, some 3.5 years after LibraryThing began, our members have added 35,831,904 books from 690 sources:
  • 85.48% came from bookstore data (almost exclusively Amazon).
  • 4.88% were entered manually by members
  • 9.63% were drawn from library sources
Now, where did that 9.63% come from?

These sources were in every case free and open Z39.50 connections our members accessed through us. Very frequently they accessed records of their own academic institution, but in any case, these members accessed these records alongside everyone else—libraries, museums, public agencies of one sort or another and all the students and scholars who use RefWorks, EndNote and other such services. Meanwhile LibraryThing has never been asked to stop accessing a source. On the contrary, libraries frequently ask to include themselves on our list of sources.

Of the 9.63%, by far the largest source is the US Library of Congress, the source of 2,203,182 books, or 6.15% of the total. The Library of Congress is a Federal organization, created for the benefit of the country and falling under the government-wide rule that public work is for the benefit of the public, and cannot be copyrighted or otherwise "owned." For as long as the technology has existed, the Library of Congress has allowed access to its cataloging data; the OCLC policy change will not affect that.* We are grateful the Library of Congress does this. But insofar as we are taxpayers and support the American notion of public ownership of public resources, I will not apologize for it. (On the contrary, I feel that OCLC should apologize for attempting to restrict and profit from public work.)

3.26%. That leaves 3.48%—more appropriately 3.26%**—the evil sliver upon which our "entire business model is built." Take a look at the top fifteen here:
  • Koninklijke Bibliotheek : 130,406 books (0.36%)
  • National Library of Scotland : 80,826 books (0.23%)
  • British Library (powered by Talis) : 80,205 books (0.22%)
  • Gemeinsamer Bibliotheksverbund (GBV) : 77,190 books (0.21%)
  • National Library of Australia : 72,896 books (0.2%)
  • Helsinki Metropolitan Libraries : 70,551 books (0.2%)
  • The Royal Library of Sweden (LIBRIS) : 63,430 books (0.18%)
  • Italian National Library Service : 60,643 books (0.17%)
  • Vlaamse Centrale Catalogus : 58,936 books (0.16%)
  • LIBRIS, svenska forskningsbibliotek : 54,339 books (0.15%)
  • ILCSO (Illinois Libraries) : 28,517 books (0.08%)
  • Yale University : 26,885 books (0.08%)
  • Det kongelige Bibliotek : 24,564 books (0.07%)
  • University of California : 20,098 books (0.06%)
  • Bibliotek.dk : 19,628 books (0.05%)
With 690 possible sources, it's a long, long tail. We take 2,087 records from the Russian State Library, 1,067 from the Magyar Országos Közös Katalógus, 286 from Princeton, 106 from Koç (in Izmir), 63 from Hong Kong Baptist, 4 from the Universidad Pública de Navarra, etc.
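For anyone who wants to check the arithmetic without running my SQL, here is a rough sketch in Python. It uses only the counts and percentages quoted in this post; the variable and function names are mine and purely illustrative, and this is not the PHP analysis linked above.

```python
# Rough check of the percentages quoted in the post.
# All names here are illustrative; only the figures come from the post.

TOTAL_BOOKS = 35_831_904  # total books added by members


def share(count, total=TOTAL_BOOKS):
    """Return count as a percentage of total, rounded to two places."""
    return round(count / total * 100, 2)


library_share = 9.63                    # all library sources, as quoted
loc_share = share(2_203_182)            # Library of Congress
british_library_share = share(80_205)   # British Library (via Talis)

# Library records from anywhere other than the Library of Congress.
non_federal = round(library_share - loc_share, 2)

# The same figure with the British Library records left out.
without_bl = round(non_federal - british_library_share, 2)

print(f"Library of Congress: {loc_share}%")
print(f"Non-federal library sources: {non_federal}%")
print(f"Excluding British Library: {without_bl}%")
```

The rounding is done at each step because the post quotes two-decimal figures; working from the raw counts throughout could shift the last digit.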

It should be apparent to anyone looking at the above that the 3.26% is largely about satisfying the needs of foreign LibraryThing members, a small percentage of our membership and hardly central to our "business model." Equally clear is the government orientation of the list: only one, Yale, is a private institution. The rest are all government agencies. Of course, no records actually came from OCLC itself!

All-in-all, library data from non-federal sources is a negligible component of LibraryThing's content. LibraryThing is not some big plot to capture library records. That idea is simply not in the figures.

Do we give back? What of the second half of the accusation, that we "not contribute anything back unless they pay," and the bit against APIs?

First, assuming Roy means LibraryThing data generally, it's absurd to suggest that because LibraryThing draws 3.26% of its data from free, unlicensed sources, our members' data and services are owned by OCLC or its members. OCLC no more owns members' tags and reviews on bibliographic metadata than Saudi Aramco owns the furniture I bring home in my car. Who in their right mind would ever accept a list of titles and authors from a library, if that meant ceding ownership over what you think about the book?

LibraryThing and OCLC both have terms. But LibraryThing's license terms are unlike OCLC's in a number of ways. LibraryThing members knew what they were getting, unlike OCLC members, who thought they were sharing with other libraries, but find themselves the lynchpin of a monopoly. From our inception LibraryThing has reserved a right to sell aggregate or anonymized data. We also sell some reviews, giving members the option to deny them to us. All our member data is non-exclusively licensed, so members can do anything they want with it outside of LibraryThing, and members can leave at any time. Neither is true of OCLC members' data under the Policy.

Cataloging data. That leaves LibraryThing cataloging data, of which we have three types. We don't have any legal responsibility to make it free, but we do so anyway.

First, we would be happy to offer downloads of original or modified MARC records! We haven't done so in order to avoid attracting a suit from OCLC. But perhaps we were mistaken. If OCLC would like us to start releasing our MARC records to others, someone should let us know. We will release them under the same terms they were given to us—freely.

Second, our Common Knowledge cataloging (series, awards, characters, etc.) is free and available to all. We can't think of a better way to provide it other than through an API, but we're all ears if Roy knows of a better way. And if OCLC would like to admit it to WorldCat, without subverting its always-free license, they don't even need our permission. Go on, OCLC, make my day!

Third, there's ThingISBN, which was directly patterned on OCLC's xISBN service. Despite Roy's criticism, they are identical in format and delivery, so if there's something wrong with its XML APIs, OCLC has only itself to blame. Indeed the only difference is cost: ThingISBN is completely free, both as an API and as a feed; xISBN, which member data creates, is sold back to members.

Stop killing the messenger. It's time for OCLC to recognize they made this mess, not others. They have perpetrated some astounding missteps, from attempting to sneak through a major rewrite of the core member policy in a few days without consultation, to a comic series of rewrites and policy reversals, culminating in withdrawing the policy entirely for discussion. (It now seems clear they did so on the heels of a member revolt, whether general or just of some key libraries.)

It's also important to see that, before OCLC started threatening companies and non-profits doing interesting but non-competing things with book data—notably LibLime, Open Library and LibraryThing—they had none of the problems they have now. Now, by attempting to control all book data, they've spurred the creation of LibLime's ‡Biblios system, a free, free-data alternative to OCLC and, well, sent me, Aaron Swartz of Open Library and dozens of prominent library bloggers into orbit.

Being caught so flat-footed can't feel nice. It must be hard feeling like royalty and discovering your subjects think themselves a confederacy. But this is no time for OCLC to start attacking the credibility of its opponents. Surely LibraryThing is an unusual case—a company that has an opinionated, crusading—okay, loud—president. But the thousands of librarians and other individuals who supported our calls, or raised other objections to the OCLC policy are not less well-motivated than OCLC and its employees. They do not love libraries less. They are, rather, concerned that OCLC's urge to control library metadata threatens longstanding library traditions of sharing, and sets libraries on a path of narrowness and restriction that will surely prove no benefit in this increasingly open, connected world.

*I need to write a blog post on this, but I was recently informed that whatever changes OCLC makes cannot touch federal libraries without explicit authorization. That is, federal law does not recognize clauses like "if you continue to use" or "we can change this at any time."
** It should more accurately be 3.26%, because we are getting our British Library records through Talis, who have a contract with the British Library.

27 Comments:

Anonymous Anonymous said...

Your transparency with regard to your internal data does a lot to give credence to your argument, and I commend you for it. Those who are implicitly or explicitly fighting against transparency might make note of that.

2/01/2009 11:13 PM  
Blogger Lilithcat said...

* 85.48% came from bookstore data (almost exclusively Amazon).
* 4.88% were entered manually by members
* 9.63% were drawn from library sources


Fascinating!

My stats are very different.

24.8% from bookstore data (yes, mostly Amazon)
11.5% manual entry
63.7% from libraries (overwhelmingly from the Library of Congress)

(By the way, when I was checking this, I noticed that a few sources were listed twice, just a difference in capitalization, and that I have "Washington Research Library Consort" and "Washington Research Library Consortium (D.C.)")

2/01/2009 11:42 PM  
Blogger Unknown said...

Snap.

2/02/2009 12:21 AM  
Blogger Roy Tennant said...

So...because Amazon has only recently acquired a minority stake in LibraryThing you think that means all the data you've sucked out of them over the years is retroactively yours? But whatever.

I just wonder when they'll cop to the fact that they have two web sites that do the same thing and they own more of one than the other. The fact remains that what I said stands. Less than 5% of your data was created on your site -- by your own admission.

Also, I hope you enjoyed our free beer, wine, and food at the OCLC Blog Salon where you accosted me to start an argument (quoted here), while wearing your insulting t-shirt. I don't want to put too fine of a point on it, but apparently we know quite a bit more about cooperation than you do. But that's OK, since it isn't your line of work.

2/02/2009 1:23 AM  
Blogger Tim said...

Roy,

1. We are certainly aware of Amazon's license on its data. We agreed to that license. We never agreed to the OCLC license, and therefore are not bound by it. That's how contracts work—we have to abide by the ones we actually agreed to. I look forward to a day when bibliographic data is entirely free. Forgive me if I misspeak, but I think you do not.

2. I wasn't quoting the argument. I don't have so good a memory. I was quoting your blog comment.

3. The 5% number relates to the bibliographic data. We do not and never have sold that data. What we own—or rather have some restricted rights over—are data about books—the metadata, if you please. The distinction may be clearer if you think on all the data that OCLC owns free and clear—everything related to Interlibrary Loan, for example, and all your code. OCLC has a lot of great services to sell to libraries. Why does it need to attempt to control all bibliographic information?

4. Actually, as you well remember, I removed the protest t-shirt (a popular item snapped up even by OCLC members councilors) immediately upon noticing the sponsor of the event and spoke to you in a plain brown button-down. If you insist, I will count myself a glass of wine and some cheese in your debt, and OCLC can give up the pretense of a "bloggers salon" and rename it "bloggers who agree with us salon." It has a ring to it.

As for the argument, I regret that you cannot see others as having opinions worthy of argument, but must resort to attacks on their integrity, something which I have avoided.

2/02/2009 2:27 AM  
Blogger Tim said...

>Lilithcat

It would be interesting to do it for "power users," or even to do it by user-count, not raw count. I strongly suspect that users like you are more library-aware.

In the near future we hope to move to a more complete manual entry, one that we can feed into open-data services as level-three records. When I had a cataloger over to look at Connexion and at Biblios, we grabbed two books off my Greek shelf, and neither was in OCLC! The same happened when we cataloged books at the Beverly Episcopal Church; I believe Katya entered about a dozen such books.

2/02/2009 2:33 AM  
Anonymous Anonymous said...

Of course, Tim, those percentages that you're quoting are the initial sources for the data, without taking into account the fact that users might then correct the Amazon data (e.g., Amazon didn't know my copy of A Short History of Nearly Everything was written by Bill Bryson, so I had to enter that particular datum manually).

Of course, this is probably a vanishingly small percentage, but it is still there.

2/02/2009 6:52 AM  
Anonymous Anonymous said...

Of course, it doesn't look like this will happen any time soon, but wouldn't it be great if LT, OCLC and some company like Liblime could all work together somehow?

What could they accomplish together?

-Nathan

2/02/2009 8:41 AM  
Blogger Barbara said...

One of Amazon's innovations was to make their data so free that it spread. This has strengthened Amazon (to a rather scary extent - they've added to their empire with POD and e-book formats, including the one Kindle uses, then told those playing elsewhere that their books won't be listed.) But it was a smart business decision to provide all that free data, because we tend to link back to its source. They leveraged that pool of stuff in an amazing way.

It's just strange that a cooperative of libraries (which are designed for sharing) feels they can't afford to do that because when people link back to libraries, it doesn't pay them anything. Libraries let people have books for free. So the only way to pay for the shared bibliographic information is to collect money from the libraries, and the only way to avoid them getting it cheaper (or free) elsewhere is to lock it up.

New information models are emerging that will require new thinking about revenue streams. It's happening all over the information industries - newspapers are on the front line and how the news survives will matter. (It will survive, even if news on paper doesn't.) Book publishing is feeling it, too, and the music industry is in shambles because it decided to go on lockdown. If nothing else, we should have learned that whether or not information wants to be free, it will leak - and we need to find new business / philanthropy models that actually generate energy from that leakage rather than try to dam it all up.

If libraries can't figure out how to benefit from the spread of information - then we aren't being imaginative enough.

2/02/2009 8:45 AM  
Blogger Tim said...

Barbara, I couldn't have said it better.

2/02/2009 8:49 AM  
Blogger Jeffrey Beall said...

The ironic thing about R. Tennant is that for many years he used his column in Library Journal and many other venues to trash cataloging and catalogers. He is the author of articles like "MARC must die" and "MARC exit strategies," both of which are indexed in WorldCat.org. Now, all of a sudden, he is the world's great defender of proprietary MARC data, the same data that he would just as soon have thrown in the rubbish heap several years ago. Now that he's being paid by OCLC, his tune has changed. The contentious nature of OCLC's recent public relations failures is a direct reflection of how Tennant does business. How long will OCLC let his divisiveness continue to damage libraries, librarians, and library users?

2/02/2009 9:12 AM  
Blogger ksclarke said...

I won't defend OCLC not wanting to make their catalog records freely available because I think they should be. Libraries contributed them and even though the aggregation of them was done by OCLC I think libraries would want them made available.

Likewise, I think LibraryThing should make its user-contributed reviews freely available. They are user contributed and despite the value that LibraryThing adds by aggregating them I think users would want them to be made freely available.

I really see the actions of LibraryThing and OCLC as being pretty similar... both are protecting their cash cows. Both are engaged in a war of words. How about you two just be friends, realize there really isn't much difference between you two, and release both sets of data to us users?

2/02/2009 10:31 AM  
Blogger Joe said...

While I agree with your position on the OCLC Terms of Service, Tim, I don't think your stats here add up.

I find the connection between "record pulled from" and "cataloging done by" highly suspect. There's enough copy cataloging out there that there's little reason to believe that "where the data came from" is actually "where the record was created." And after all, if the data is ownable at all, it seems it should be owned by the institution that wrote the record, not the one that hosted it when a LibraryThing user happened to come by. (And I agree, not by the network which makes copy cataloging realistic.)

(Of course, given the preference many libraries have for using DLC as their preferred source of copy cataloging, this might just shore up your argument about Library of Congress records being special... but actually reviewing the MARC records to establish the original source sounds like an errand that would make Don Quixote buy Sisyphus a beer.)

2/02/2009 11:53 AM  
Blogger Greg Schwartz said...

Not sure I really want to enter into this fray, but is Roy suggesting that OCLC's beer is actually "free as in kittens?"

2/02/2009 2:05 PM  
Blogger Caitlin said...

Likewise, I think LibraryThing should make its user-contributed reviews freely available. They are user contributed and despite the value that LibraryThing adds by aggregating them I think users would want them to be made freely available.

The users are the authors of those reviews. The reason LT doesn't just make it all freely available is because the authors of the reviews might not want that. There's an option on your profile settings to allow LT to keep the reviews on LT, give user reviews to non-commercial sites, or give the user reviews to both commercial and non-commercial sites. The reviews belong to the people who wrote them, and you can't just assume that "users would want them to be made freely available". Different people will feel differently about it. So no, I don't think the actions of LT and OCLC are similar.

2/02/2009 2:31 PM  
Blogger ksclarke said...

Caitlin, right... I can also imagine there are some libraries who don't want their cataloging records shared (for a variety of reasons). All I was suggesting was sharing the ones where the contributors want them shared (and perhaps making "share with everyone" the default on the form). If anything, this should be _easier_ for LT than for OCLC, being a newer org with this option explicitly recorded.

2/02/2009 2:48 PM  
Blogger Tim said...

I need to respond to much of this, but I wanted to leave a quick note. (I'm at a burger joint, so no time to write something long.)

I do NOT agree with going after Roy over the MARC issue. There is no inherent contradiction between wanting to kill off the MARC standard and having the OCLC position on ownership of metadata.

This goes to the larger point I want to make. Roy implied my position came from self-interest, not sincere conviction. I do not think that was fair, and it would be equally unfair if I were to accuse him of the same. Roy/OCLC's position is a wrong one, in my opinion, but sincere, well-intentioned people can disagree on issues of this sort.

I wasn't kidding when I said I consider Roy both very likable and a great asset to his field. It stung to have such an obviously well-intentioned, *nice* person think ill of me. That he extends this attack to the blogosphere, however, begs a response.

Insofar as the attack was on a public blog—a very prominent public blog—I had to say who said it. But I tried very hard not to personalize it, carefully hedging my—true—opinion of his sincerity and value to libraryland. I do so again here.

2/02/2009 3:09 PM  
Blogger Casey Durfee said...

"Less than 5% of your data was created on your site -- by your own admission."

Ah, the heart of the matter. The tens of millions of tags, reviews, ratings, common knowledge facts, conversations, book covers and so on our users have added don't count as real data. The work we've done on synthesizing that data and making it useful and fun doesn't count. Well, that way of seeing the world certainly explains why you hold us in low esteem. If bibliographic data was the only data users care about, why aren't they using Worldcat to tag and review things instead of our site? I'm no slouch in the programming department, but if Worldcat doesn't have better bibliographic data than we do, that's just sad, man. But if that really is the case, would you mind giving me a recommendation on LinkedIn at least?

Now, what percentage of Google's data is created on their site? If we're the antichrist, what does that make them (666 * 10^100 antichrists, I suppose)? Even if we were just harvesting and indexing library data, I don't see how that would make us more evil than Google.

Here's a common scenario: when I get a book from my local public library, after I'm done reading it I add it to LibraryThing using my public library's publicly available Z39.50 server as the source of data. When I return the library book, it's got my fingerprints and dog-ears in it. It's less valuable, and the whole process of getting the book to me and reshelving it costs the library several dollars. But in no sense is the MARC record less valuable when I use it, nor does it cost anybody anything. So why is using the book OK but using data about the book is wrong? Who's getting exploited? I can't wrap my mind around it.

Not only does the transaction not make the MARC record, my library or OCLC any less valuable, it's not the MARC record that makes the transaction valuable in the first place -- you can't get MARC records from LibraryThing even if that's what our users wanted, which they don't; most of them don't know or care what a MARC record is. The value is created by my adding to LibraryThing; none is destroyed -- the value came into being because I cared enough to record it and that there was a place I could fill that need. The value was created out of thin air, not taken from something or someone else.

Finally, regarding cooperation and service. I feel weird having to point this out, but here goes. Libraries work with us because they want to, Roy. They have no other reason to.

They buy our products and use our site because they're easy to use, they provide real value, and we give excellent customer service. Libraries contact us all the time wanting to be data sources on LibraryThing or have their locations and events show up on LibraryThing Local -- they want to be a part of what we're doing however they can. We're certainly not using any library against their wishes. I'm not sure who you think is getting exploited by us, but you'd be hard-pressed to find many librarians who think of us as exploiters. Seriously, ask around.

You're entitled to your feelings about us, but do realize they're just feelings -- they're not widely shared in the non-OCLC world, nor are they grounded in reality.

2/02/2009 9:02 PM  
Blogger Karen Coyle said...

I'm fascinated at the amount of data in LT that comes from publishers/booksellers via Amazon. The Open Library also loaded Amazon data, although I don't know what proportion of the whole that represents. Libraries are adamant that non-library data is of poor quality and virtually useless. Yet... here are two projects that are using that data. My gut feeling is that non-librarians find the Amazon metadata friendlier ("J. R. R. Tolkien" vs. "Tolkien, J. R. R. (John Ronald Reuel), 1892-1973"). Library data is stilted, old-fashioned, and fairly unattractive. That doesn't mean that library data isn't quality data -- it means that it's low on the 'user friendly' side. The LT stats show that there is a real need that is being met by the Amazon data, and that our book metadata world is not homogeneous because our book metadata needs vary.

2/03/2009 10:42 AM  
Blogger Tim said...

The use of Amazon has a few main sources:

1. It's the default. People don't change defaults. The Library of Congress is the second default choice--the rest you need to click somewhere, look at a list and add. People like defaults.

2. Amazon has the paperbacks that most people have. Libraries are much spottier there.

3. Amazon is *faster* than Z39.50 access to libraries, so adding from them can feel like a drag.

4. Some users think that library metadata is worse because it doesn't capitalize titles. Indeed, this is a frequently-reported "bug." Ouch!

Those are the main reasons, in rapidly descending order of importance.

2/03/2009 10:59 AM  
Blogger Karen Coyle said...

Interesting note about capitalizing titles: it turns out that's an anglo-centric practice. Do a search in bookstore data in other languages,* and the titles look more like the ones libraries create. Using sentence case has various advantages, including that it preserves the distinctive upper case on proper nouns. But, to English speakers, sentence case for titles looks odd.

* http://www.ibs.it, search on 'eco rosa'

2/03/2009 12:46 PM  
Blogger Sebastien said...

I'm still holding out hope that the LOC will have an increasingly structured and formidable role in digitizing the nation's and world's heritage, and can't help but feel that the Nation's library has a place in this discussion no less.

In the recent Guardian piece Wendy Grossman addressed the question, why can't you find a library book in your search engine? This leads me to wonder if free and open data could and would be used invariably better following the advent of the semantic web, or whatever shape or form the web's next generation will take.

2/03/2009 1:10 PM  
Blogger Alex said...

I'm a bit shocked at Roy's comment here. Why are you extending the personal attacks instead of dealing with the facts laid out? Answer the man, for Pete's sake!

2/03/2009 2:39 PM  
Blogger Prosfilaes said...

I have another reason for using Amazon. Over 10% of my books are GURPS books, and apparently the Library of Congress has decided to route the copies they keep to a vaguely labeled box in the cellar and put all those books in that box in one catalog entry: this stunningly useful one. Amazon always has the most popular books, and sufficiently good data on the in-print ones.

2/04/2009 11:09 AM  
Blogger PFSchaffner said...

-- Using foreign libraries as a source may not imply that one is a foreign user. I make heavy use of the National Library of Scotland myself (though resident in Michigan), because it has proven a reliable source of records for British books, just as I use Yale as a good choice for protestant theology. Likewise for other European libraries.

-- I use Amazon only when I have to (for paperbacks, mostly, and genre fiction), and I heavily edit every record. Including getting rid of that horrible capitalization!! I agree with 'wimble' that 'Amazon' as a source may conceal a lot of user-supplied data.

-- I for one find the lack of true MARC editing and MARC export and import function by far the most serious flaws in LT. I have only 1,000 books listed in LT, but 10,000 listed in my personal, handmade catalogue (SGML/XML). I would love to be able to convert those 10,000 to MARCxml and then to proper MARC and upload them (I contribute records to OCLC; why not to LT?)* ; or use LT as a Z39.50 client to pull down MARC records and reverse the process (MARC -> MARCxml -> personal XML). As it is, I have to catalog everything twice. Once at home, so I can get everything in proper fields, and include the fields that I want; and then again on LT, to share with the world.

These are all quibbles as regards the main argument, with which I do not disagree.

* I'm the entire tech services dept of a small community college library, as well as a non-TS librarian in a large state university library.

2/04/2009 4:02 PM  
Anonymous Anonymous said...

From the original post:
"Do we give back?...From our inception LibraryThing has reserved a right to sell aggregate or anonymized data. We also sell some reviews-giving members the option to deny them to us."

Tim, you forgot to mention the free covers you've been trying to give (back) away.

3/11/2009 7:56 PM  
