Monday, December 03, 2007

MARCThing: A simple, self-contained MARC and Z39.50 application

Over the past couple of weeks, LibraryThing has been rolling out major improvements to our cataloging system—a new system for retrieving and parsing book information we're calling "MARCThing."

MARCThing is a major advance for LibraryThing. We've sunk months of development time into it, but we're not going to keep it to ourselves. We will be releasing all the code for non-commercial use in libraries and elsewhere.

When the dust settles, LibraryThing members will be able to draw on nearly 700 data sources worldwide, with greatly improved foreign character support and better data manipulation behind the scenes. With MARCThing underneath we will be able to introduce many new features and to reach a truly global audience. But we are confident that developers outside of LibraryThing will find many other, equally compelling uses for MARCThing, and make useful changes and extensions.

What it is. When I was given the task of improving LibraryThing's cataloging system and other things involving library data, I immediately thought of Solr, one of the most influential pieces of software to come out in the past couple of years. The big idea behind Solr is that it provides a "magic box" -- an easy, self-contained interface to some very powerful but complex technology, the Lucene search engine. Solr hides the messy details of Lucene from the developer and provides all sorts of extra goodies in a self-contained package. The net result is you can instantly stick an extremely powerful search engine into your project with almost no work. This combination of power and ease-of-use has quickly made it a developer favorite, and spawned all sorts of interesting projects that never would've come out without Solr.

I wanted my own magic box that would handle the two main standards libraries use to store and transfer cataloging data, the MARC record format and the Z39.50 search protocol, without anyone having to go into the details of how they work. And since I didn't want to have to find or build another magic box, ever, I wanted something that could be easily used from any programming language.

Writing it was pretty easy—I used Django for the web part, Pymarc for MARC, and PyZ3950 for the Z39.50 support. With a good software library, working with Z39.50 or MARC records isn't hard. The hard (or at least time-consuming) part of MARCThing was tracking down servers and dealing with oddball cases. There are many lists of Z39.50 servers out there, but the data is often incomplete, incorrect, or out of date. When you do find a Z39.50 server, oftentimes it's non-standard in some way, or only has limited functionality. So the process of connecting to libraries using Z39.50 is fraught with guesswork and manual fiddling. That's bad. The whole point of a standard should be to free you from guesswork.
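To give a feel for what a MARC library takes care of, here is a minimal sketch in plain Python that builds and then decodes one tiny record in the ISO 2709 transmission layout: a 24-character leader, a directory of 12-character entries, then the field data. It uses only the standard library; it is not Pymarc's API and not MARCThing's code. The helper names build_record and parse_record and the sample record are made up for illustration, and the sketch assumes UTF-8 data and ignores MARC8 and every oddball case.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator
SUBFIELD = b"\x1f"    # subfield delimiter


def build_record(fields):
    """Assemble a tiny MARC record from (tag, body_bytes) pairs (sample data only)."""
    directory = b""
    data = b""
    for tag, body in fields:
        body += FIELD_END
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += tag.encode("ascii") + b"%04d%05d" % (len(body), len(data))
        data += body
    directory += FIELD_END
    base = 24 + len(directory)       # where the field data begins
    total = base + len(data) + 1     # +1 for the record terminator
    leader = b"%05dnam a22%05d a 4500" % (total, base)
    return leader + directory + data + RECORD_END


def parse_record(raw):
    """Decode one record into a list of control and data fields."""
    base = int(raw[12:17])           # leader positions 12-16: base address of data
    directory = raw[24:base - 1]     # drop the directory's own terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # strip field terminator
        if tag < "010":
            # Control fields (001-009) have no indicators or subfields.
            fields.append((tag, body.decode("utf-8")))
        else:
            indicators = body[:2].decode("ascii")
            subfields = [
                (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                for chunk in body[2:].split(SUBFIELD)[1:]
            ]
            fields.append((tag, indicators, subfields))
    return fields


SAMPLE = build_record([
    ("001", b"12345"),
    ("245", b"14" + SUBFIELD + b"aThe hobbit /" + SUBFIELD + b"cJ.R.R. Tolkien."),
])

if __name__ == "__main__":
    for field in parse_record(SAMPLE):
        print(field)
```

Even this toy version has to juggle offsets, terminators and indicator bytes, which is the kind of bookkeeping a good library does for you.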

How to use it. Using MARCThing is simple. Either send it some MARC records, or tell it which Z39.50 server you want to search and what you want to search for, and get back XML (or one of a variety of other formats) that you can use in applications without having to know a lick about library cataloging. All the messy details (and there are a lot of them) are hidden from view. Everything just works. You don't need to know what a nonfiling indicator or a use attribute is, or the difference between MARC8 and UTF-8. You just need to know how to make an HTTP request.
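As a rough sketch of what "just an HTTP request" could look like from client code: the snippet below builds a search URL and turns an XML reply into plain dictionaries. Everything specific here is an assumption made for illustration, since the real interface has not been published: the endpoint URL, the parameter names (server, q, format), the server address and the shape of the XML are all hypothetical. The sample response is hard-coded so the sketch runs without touching the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; MARCThing's real URL scheme is not described in this post.
BASE_URL = "http://marcthing.example/search"


def build_search_url(server, query, fmt="xml"):
    """Build a GET URL naming the Z39.50 server to search and the search terms."""
    # Parameter names are invented for this sketch.
    params = {"server": server, "q": query, "format": fmt}
    return BASE_URL + "?" + urlencode(params)


# Made-up reply in an assumed XML shape, standing in for a real HTTP response.
SAMPLE_RESPONSE = """<records>
  <record>
    <title>The hobbit</title>
    <author>Tolkien, J. R. R.</author>
  </record>
</records>"""


def parse_records(xml_text):
    """Turn each <record> element into a dict of {child tag: text}."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


if __name__ == "__main__":
    print(build_search_url("z3950.example.org:210/catalog", "the hobbit"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["title"], "by", record["author"])
```

In a real client, the URL would be fetched with something like urllib.request.urlopen and the response body passed to parse_records; nothing in the calling code needs to know about MARC or Z39.50.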

What I hope is that this inspires and allows people not in the library world to do cool things with library data. It's sad that working with library data is such a hassle -- there are so many underused resources out there. I won't go too much into the technical problems with Z39.50 and MARC, but I do have a recommendation for anybody involved in implementing a standard or protocol in the library world. Go down to your local bookstore and grab 3 random people browsing the programming books. If you can't explain the basic idea in 10 minutes, or they can't sit down and write some basic code to use it in an hour or two, you've failed. It doesn't matter how perfect it is on paper -- it's not going to get used by anybody outside the library world, and even in the library world, it will only be implemented poorly.

Open source plans. LibraryThing was already the only major cataloging site that used any library data. (The rest use Amazon's data exclusively, a severe hurdle to book lovers in the US and an absolute barrier to those in most other countries.) MARCThing took us a long time to develop, and we have limited resources. We are not eager to give our competitors such a valuable tool -- they can get their own library geeks. At the same time, we are eager to encourage non-profit use and to license its non-competing commercial use for a token amount.

We're thinking of releasing the code under the Creative Commons Attribution-Noncommercial-Share Alike license, but it will depend on what people want to do with it. If you were bitten by a radioactive librarian and suddenly gained the power to search 700 libraries worldwide, what would you do?

Stay tuned; code is coming soon!


18 Comments:

Anonymous koffieyahoo said...

What about the interface specification? Could that be made available, so it becomes possible for people to write their own back-ends for other protocol sets, e.g. SRU with Dublin Core records?

12/04/2007 6:32 AM  
Blogger Biblio said...

MARC Thing!
You make my data sing!
You make everything...
Groovy...

12/04/2007 10:58 AM  
Blogger MMcM said...

Unless you have magical heuristics, you must have developed some rich meta-data to drive the “oddball cases.” Why not slap a UI on that and let us users fill out the directory? There are both librarians and network protocol experts here. Once NLS and biblio field details are checked, the new server can be introduced to LT. It really only makes slightly more sense to have your precious resources doing this than it would to have you do the work combining for us.

12/04/2007 10:59 AM  
Anonymous Paul said...

"If you were bitten by a radioactive librarian and suddenly gained the power to search 700 libraries worldwide, what would you do?"

--I can already do that (plus many more libraries) via Worldcat.org. Am I missing something here???

12/04/2007 1:31 PM  
Blogger C4bl3Fl4m3 said...

Re: Open Source/Creative Commons

I would absolutely make it open source. It's in keeping with the transparency that LT already has and we users love. (What's transparency but Open Source business operations?) It's giving back to the world. It's sharing. It's in keeping with everything that this website stands for.

I, personally, wouldn't use the code as I'm not a coder. But I am an advocate for Open Source/Free software and I have a number of code monkey friends. The beauty of open source is that you don't know what people are going to do with it. That can be scary but it's also so rewarding.

To take a geeky example... look at the music of Jonathan Coulton. He released a significant portion of his music for free on the web, as well as making it all CC. Because of this, he's able to make his living off of his music w/o needing a record deal... the publicity he gets from everyone using his music in WoW machinima videos and as inspiration for art (and word of mouth) is enough to keep him alive. Heck, someone used his works for part of a presentation at their church.

It's a leap, but think of what you'll give to the world. Yes, it was a lot of work, and it is tempting to say "go get your own library geeks", but, I mean, what would your kindergarten teacher say to that? :-)

12/04/2007 2:07 PM  
Anonymous D.F.Flanders said...

I say go with the attribution-noncommercial-sharealike at first and then go to Amazon and OCLC and get them to pay for the next phase of development. Upon the signing of those cheques then release it to them as attribution-sharealike, then let their developers maintain the trunk! Play the game.

12/05/2007 1:32 AM  
Anonymous andyl said...

I'm perfectly happy with CC-NC-SA.

It allows people writing free software to use the code as they see fit.

Although that licence prevents other commercial activity using MARCThing, there is absolutely nothing to stop someone emailing or phoning Tim up and arranging some alternate licensing. The responses may range from no way, to give us some dosh, to yeah go ahead we will let you do that for nothing.

BTW I also agree with mmcm that it would be cool for us to do some of the work in getting new libraries working.

12/05/2007 3:20 AM  
Blogger David said...

This sounds fantastic -- I'm looking forward to getting a glimpse of it!

That said, I agree with your basic statement on "implementing a standard or protocol in the library world" that "if you can't explain the basic idea in 10 minutes, or they can't sit down and write some basic code to use it in an hour or two, you've failed." But realistically, MARC and Z39.50 aren't Johnny-come-lately standards dreamed up by starry-eyed dotcom-era librarians. They are 30-40 years old, and were originally designed at a time when computers had only a fraction of the capabilities they have now. XML is certainly a lot more readable than MARC, but if someone had tried to create an XML-style system in 1968 they would have burned through their precious 3 MB of disk space before they finished cataloging the first floor of their library.

The longevity of these standards is a testimony to their usefulness, and also the reason they are such a pain to use now. In the intervening decades, people tried to shoehorn new types of data into the old standards without breaking them. (See, for example, Tim's comment about there being a specific designation for a festschrift and none for a CD -- has mainly to do with the fact that libraries in 1968 had a lot more festschriften than CDs.)

More recently-developed standards are often (although not always, of course) more intelligible to today's geeks.

But anyway, anything that makes it easier to work with archaic standards is definitely a Good Thing!

Re: licensing, I think your proposed Creative Commons license sounds good as long as you guys are willing to be flexible when needed. (I'm particularly sensitive to this because I used to work for a newspaper library -- technically a commercial organization. While our parent company may have had a lot of money, we scraped by on scraps. If such an institution figured out a way to do something cool and it didn't compete with LT, would be nice to know that reasonable licensing terms were available.)

12/06/2007 8:30 AM  
Blogger Casey said...

koffieyahoo, yes, it will be easy to add other back-ends to it. I'm really hoping somebody else will do SRW/SRU.

mmcm & andyl, I agree that it would be nice to open up the ability to have members add data sources. A lot of the initial push had to be done with back-room knowledge, but as MARCThing gets more mature I'd love it to be member-driven.

paul, worldcat.org only gives you limited metadata about the item back (and not in a format you can easily use in your own code). MARCThing extracts just about everything useful from the MARC record and returns it to you in your preferred format. OCLC could release something waaay better than MARCThing if they wanted to. I'm pretty sure they don't want to.

c4bl3fl4m3, I'm a strong believer in open source, but we have to do this in a way that won't hurt the site in the long run.

David, I absolutely agree that MARC and Z39.50 are products of their time, and viewed in that light, are pretty remarkable pieces of technology. My criticism is a lot more valid of newer stuff like SRU/SRW, NCIP, FRBR, or OpenURL (and as an exhortation for people to do more projects like MARCThing that give old and complex things new and simple interfaces). MARC and Z39.50 aren't so bad; endlessly reinventing them is.

For the licensing, I'm sure we can be flexible -- a lot of corporate libraries are in that situation. It's really about how the software is used rather than who's using it in my book. If it doesn't interfere or compete with LibraryThing's core business, I don't see us having a problem with it.

12/06/2007 12:27 PM  
Blogger phasefx said...

We're thinking of releasing the code under the Creative Commons Attribution-Noncommercial-Share Alike license, but it will depend on what people want to do with it.

Hrmm, I don't think the Creative Commons licenses are really geared toward software. Have you considered something like the GPL? However, an open source license doesn't preclude commercial activity, so you might not want that.

-- Jason

12/07/2007 9:37 AM  
Blogger Stephen said...

I stood up and shouted when I read your post.

12/07/2007 2:29 PM  
Blogger Amanda Ellis said...

I think you should get legal advice on whether to go with the CC licence. Last time I looked into it there were only a few test cases worldwide of the legal enforceability of a CC licence.

When you share intellectual property with a CC licence can you change your mind and then unshare? Or change to another type of licence? Or change to a later version of the same licence? The nitty-gritty of it all seems quite complicated.

Congrats on the MARCThing thing. I am forwarding the blog-post link to a library friend :)

12/09/2007 10:50 PM  
Blogger egh said...

Creative Commons' noncommercial license may be great for some things. It is a bad idea for software. It is not free, it is not open source, and it is not compatible with GPL. If you release this code under this license, it will not be useful for free software projects. For instance, it could not be distributed with Ubuntu/Debian.

To be clear: distributing software under a non-commercial license is not compatible with the FSF's definition of free software OR OSI's open source definition. It is not free software, it is not open source.

2/27/2008 5:09 PM  
Blogger Tim said...

Re: The last comment.

If a paragraph can, by the substitution of one or two words, be turned into a good approximation of a passage from Mao, it is likely to indicate a similar mental climate.

2/27/2008 10:43 PM  
Blogger Stephen said...

is this going to be released?

It's been a while - why are you sitting on it?

PS it's your work - release it under whatever license(s) you want.
If you are worried about competitors using your code without sharing enhancements back (like google does), you could try a licence that addresses 'software as a service'. I think GPL3 maybe?

Cheers.
Stephen

2/28/2008 6:55 AM  
Blogger egh said...

If you do not wish to release your software as free/open source, then by all means do not do so.

If you wish to read the Open Source Initiative's "Open source definition", you can do so here.

If you wish to read the Free Software Foundation's "Free software definition", you can do so here.

Both of these statements are incompatible with the non-commercial clause of the "by-nc-sa" license from CC.

I am not so much concerned with the usage of the term "open source" to mean something that is incompatible with the term as it is generally meant, but I am trying to let you know that releasing your software under a license like this will restrict its use in free/open source projects that use other commonly-used licenses (like the GPL).

2/29/2008 10:46 PM  
Blogger Andris said...

So you abandoned the plan to release it as open source. Sad, but understandable.

11/13/2008 3:08 PM  
Blogger Stephen said...

Hey! We need an update.

11/14/2008 5:03 AM  
