Monday, February 12, 2007

Library of Congress Authority Files, Open!

So begins the PDF announcing and detailing a major new development for the library-data world. Simon Spero, library-geek extraordinaire, has released a nearly complete copy of the Library of Congress Authority Files.

Get them here:

Simon assembled the files, available in MarcXML, by querying the Library of Congress' Authorities website one-by-one over months. He's a patient man.

As I've discussed before, Library of Congress data is both free and unfree. As a work of the US government, it cannot be copyrighted.* But the LC has traditionally restricted access, offering small amounts through public interfaces**, and selling larger amounts through its Cataloging Distribution Service. A small industry has developed where the CDS's buyers resell it commercially. Until now, nobody has decided to just... let it go.

I anticipate that Simon's action will draw some criticism. If the LC can't make money selling its cataloging, how will it support this vital work? This sentiment will grow stronger when Casey Bisson releases the full LC MARC data. But whether the data in question is authorities or other cataloging, I think this view is shortsighted.

As I see it, the failure of the LC and other libraries to get their data "out there" on the open web has hurt them far more deeply than their catalog sales could ever recoup. It has made them seem irrelevant, standing silent and apart from the great conversation, which grows more interesting with each passing year.

The first culprits are the online catalogs***, ugly, backward things lamed with session-based URLs. If you want to link to the LC, you can't. The URL you get will only work for you, for ten minutes. Linking--the very soul of the Web--is impossible.

The second culprit is how libraries have distributed the data itself. Amazon makes its book data accessible to all in a handy, universally-understood XML format. It's so easy and appealing, over 140,000 developers have signed up to receive it. Libraries by contrast generally make their data available (if they make it available at all) over a tricky and obscure protocol known as z39.50. And the data itself is in MARC, a rich but impenetrable spectrum of formats (e.g., DanMARC, the Danish MARC format!) used by and largely only understood by librarians.

With wretched web sites and unretrievable, unparseable data, libraries have lost vital ground. If the world worked right, Googling a book should turn up a library within the first few results. But libraries seldom make the top 100, and despite being the largest library on the planet and producing the lion's share of original cataloging, the Library of Congress is completely absent. In its place are Amazon, its peers and sites that use Amazon data.**** Libraries may know a lot, but simplicity, attractiveness and ubiquitous data have won out.

It's time to fight back. Libraries and library data can change the book web for the better. Three cheers to Simon for making a critical first step. Viva La Revolución, my brother.*****

*The LC reserves the right to copyright it outside of the United States. It's unclear if they ever have.
**In LibraryThing's case, through a z39.50 connection. Although the limits are not clearly specified, we've been given to understand that large-scale mining will not be tolerated.
***What library-techs call OPACs: Online Public Access Catalogs. The fact that someone still needs to add "Public Access" to "Online" is the problem in miniature. Does Google call itself a Public Access Search Engine?
****Don't get me wrong; Amazon is a great site, and should be up in the top results too.
*****Insofar as both Simon and I blogged the death of Milton Friedman, I suspect we're equally uneasy with revolutionary Spanish.


Blogger RJO said...

Great news!

Hmm, now what might we do with them...

2/12/2007 7:22 PM  
Anonymous Peter Murray said...

[quote]The first culprits are the online catalogs, ugly, backward things lamed with session-based URLs. If you want to link to the LC, you can't. The URL you get will only work for you, for ten minutes. Linking--the very soul of the Web--is impossible.[/quote]

Agreed with caveats. There is at least one major library automation vendor that does not use session-based URLs. It could, of course, be argued that said vendor's URLs are not clean.

2/12/2007 8:42 PM  
Anonymous Hugh Taylor said...

[quote]If you want to link to the LC, you can't.[/quote]

Not really true. You can express an LC OPAC search as a URL without needing a session ID to run the search. So you could embed the search as a link within, say, a reading list and go straight to the catalogue entry. Equally, you can install one of the available search engine plugins for the LC catalog (for Firefox users, at least - don't know anything about IE). There are at least three of these to be had.

It could be that the vendor Peter Murray was referring to is the one LC uses.

2/13/2007 12:14 PM  
Blogger john said...

You're right, if you have specialized knowledge of a particular OPAC's URL formation and/or install a plugin to Firefox, you can effect a search and/or link to an LC page.

I don't want to seem catty, but isn't that the example that proves the rule?

I found a book on the LC. I bookmarked it. Here it is,1&Search%5FArg=Pnin&Search%5FCode=TALL&CNT=25&PID=5229&SEQ=20070213125104&SID=1 It won't work.

2/13/2007 1:00 PM  
Blogger Adrienne said...

So does this mean LT will switch to the open authority records (and open MARC records, when those become available)?

2/13/2007 1:01 PM  
Blogger Tim said...

Oh, that was me, sorry.

2/13/2007 1:18 PM  
Blogger Simon Spero said...

Having spent a fair amount of time synthesizing URLs for LC (see, um, this page :):

You can generate semi-persistent URLs for bibliographic records in Voyager by replacing the PID and SEQ elements with DB=local.

You can't do this for the authority records, due to Endeavor never really quite finishing it.
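
The rewrite described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host and path below are placeholders, the parameter names are taken from the bookmarked URL quoted earlier in this thread, and appending DB=local at the end (rather than in the exact position of PID and SEQ) is assumed to be equivalent.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def make_semi_persistent(url: str) -> str:
    """Drop the session-specific PID and SEQ parameters and add DB=local."""
    parts = urlsplit(url)
    # Keep every query parameter except the two session-bound ones
    # (and any existing DB, so it is not duplicated).
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("PID", "SEQ", "DB")
    ]
    params.append(("DB", "local"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical session URL: host and path are placeholders, not the LC's.
session_url = (
    "http://catalog.example.org/cgi-bin/search"
    "?Search_Arg=Pnin&Search_Code=TALL&CNT=25"
    "&PID=5229&SEQ=20070213125104&SID=1"
)
print(make_semi_persistent(session_url))
```

Only PID and SEQ are touched, as described; other parameters such as SID are passed through unchanged.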

2/13/2007 1:44 PM  
Blogger Tim said...


What parts of the URL do I need to change when linking to Amazon, Google, Wikipedia or a blog?

2/13/2007 1:58 PM  
Anonymous cat wizard said...

It seems that Tim forgot what he himself observed when commenting on LC's new search: "On the plus side, the URLs appear permanent, rather than the LC's usual "expiring" URLs."
For example:
It's long and based on LCCN, but not session-based.

2/14/2007 2:40 AM  
Blogger Tim said...

Okay, fair enough. But click anything after that, and you've got a session-based one. (All the permanent URL does is get you in.) Of course, the URLs are all so complex that no regular user would ever know.

I think my point stands. Regular people expect to be able to link to pages on a library catalog like the LC's. Except under rare conditions, this won't work.

Take a look at a Google search for Except for a few help pages, these are all links to expired sessions. Google only shows one of them by default, having determined that all look the same--the expired session page always looking the same. So, people are TRYING to link to LC pages. They're just not able to.

2/14/2007 7:35 AM  
Blogger Janusman said...

Hmm... is there a similarly open version of the LCC Schedules? Ideally, one that would express its hierarchical structure. How about the OCLC/WLN Conspectus categories/divisions/etc. (which we are trying to use for collection analysis and recommending related books)

Innovative's webopac does have sessionless URLs, which has helped some libraries like us have stuff like permalinks to records. (BTW I agree there about the term "OPAC")

2/14/2007 10:33 AM  
Blogger Stuart said...

I take issue with the complaints about the URLs at work in LC's catalog. However ugly or impermanent they may be, LC's catalog is not the place anyone (but LC) should be linking.

I have elaborated on what I judge to be a better approach in my post at:

Stuart Weibel
OCLC Research

2/20/2007 8:48 AM  
Blogger ashuber said...

As far as I am aware, authority files are not available via z39.50. Does anyone know why this is so?

2/20/2007 9:06 AM  
Blogger Tim said...

Stuart: Thanks for the post. I responded at some length over there. I think, if I have the time, I'll try to work up a more comprehensive reply.

If I had my druthers, the Library of Congress would be the link, not WorldCat.

I favor things that are in their very nature free, open and accountable. OCLC for all its virtues is unable to be any of these. Free and open data would end OCLC's business model--you guys even license out the Dewey Decimal system!--and there can be no accountability to a massive private monopoly standing two or three removes from the institutions we borrow books from.

If I don't like how the LC uses its data, I can take it and do it myself, without fear of prosecution. (LibraryThing does just that.) And if I don't like their governance, I can write my congressman. OCLC keeps its data close to the vest, and as for complaints... well, I guess you're my congressman today. :)

2/20/2007 11:03 AM  
Blogger deanna said...

Do you know if Simon will update his one-time (wow!) gathering of the authorities?

6/21/2007 12:08 PM  
