LibraryThing for Libraries: How it works / The five-second rule
The LibraryThing for Libraries widgets have a unique architecture. You install them on your OPAC's HTML pages, but the OPAC doesn't "do anything." All the work happens in the patron's browser, via JavaScript requests to the LibraryThing for Libraries servers. Only when the patron clicks through to a specific book does the library OPAC come into the picture again.
Your creaky OPAC can rest easy. All the database work and the statistical number-crunching that makes something like recommendations or tag browsing possible takes place elsewhere. You get beefy new functionality without a single extra OPAC request. (Of course, we think using a LibraryThing-enhanced catalog will be so fun—we don't mean that ironically—that patrons will spend more time browsing them.)
*BUT* before LibraryThing can take the work off your hands, it needs to know what ISBNs you have. So we ask for an export with ISBN data, and accept any format your OPAC makes.* And if a link to a book is to display the same title and author shown in your OPAC, LibraryThing needs those too. Exporting and uploading them, however, is impracticable: there are dozens of possible formats to parse, and anything that complicates the export process will limit our potential user base. LibraryThing for Libraries needs to be dirt-simple. It needs to be people-who-don't-even-know-HTML simple.
So, LibraryThing for Libraries hits your OPAC to collect titles and authors, "screen scraping" the pages. The question is: How fast can it go?
Good question, and one we've struggled with. In the search-engine industry, the standard maximum is one request/second. Google, Yahoo, AskJeeves, MSN (who?) and their peers use that as their benchmark, although you can ask them to speed up or slow down using standards like robots.txt. And they'll do it all day long, every day, without regard for how many other crawlers are hitting you too. In March LibraryThing was visited by 71 registered "bots." The greediest, Google, hit us 11,338,467 times--an average of 4 times/second--and took almost 200GB. As our total bandwidth was 650GB, you can understand why Google sometimes seems a bit, er, codependent.
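That per-second average is just arithmetic over the month. A quick sanity check on the figures above (assuming a 31-day March):

```python
# Back-of-the-envelope check on Google's March crawl rate.
SECONDS_IN_MARCH = 31 * 24 * 60 * 60  # 2,678,400 seconds

google_hits = 11_338_467
hits_per_second = google_hits / SECONDS_IN_MARCH
print(f"{hits_per_second:.1f} requests/second")  # prints "4.2 requests/second"
```

So "an average of 4 times/second" holds up, and that's the *average* over a whole month, nights and weekends included.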
Anyway, I wrongly believed that most OPACs could handle 1/second. After all, the libraries who've contacted us all have systems that cost hundreds of thousands or millions of dollars. And most have unspiderable "sessions," so LibraryThing wouldn't be competing with Google and its ilk.
Apparently I was wrong. Until Thursday, the requests were sporadic or round-robined across libraries, so the effective time between requests was more than a second. Thursday afternoon we threaded the process so requests could run mostly continuously and concurrently. This morning I heard back that LibraryThing was taking too much from one OPAC and slowing its performance. Yipes! The system in question serves a consortium of more than 25 libraries, so it's presumably not the slowest, worst OPAC out there! We yanked the spidering. They took it well, even so. We owe them.
So, the new rule will be one request/five seconds max. And I'll add logic to monitor how long each page takes to come in, and wait a multiple of that, so any performance problem is adjusted for in real time. The LibraryThing for Libraries interface--not yet publicly available--allows libraries to speed up or slow down the process. "Slow" stretches the interval to ten seconds between requests; "fast" tightens it to two seconds.
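The adaptive throttle described above is simple to sketch. This is a minimal illustration, not LibraryThing's actual code: the function names and the multiplier of three are my assumptions.

```python
import time

BASE_INTERVAL = 5.0   # new default: one request per five seconds
SLOW_INTERVAL = 10.0  # the "slow" setting
FAST_INTERVAL = 2.0   # the "fast" setting
MULTIPLIER = 3.0      # illustrative: wait a multiple of the last response time

def next_wait(last_elapsed, base=BASE_INTERVAL):
    """Seconds to sleep before the next request: never less than the base
    interval, and longer if the OPAC responded slowly last time."""
    return max(base, MULTIPLIER * last_elapsed)

def crawl(fetch, urls, base=BASE_INTERVAL):
    """Fetch each URL in turn, adapting the delay to server load.
    `fetch` is any callable that takes a URL and returns the page body."""
    elapsed = 0.0
    for url in urls:
        time.sleep(next_wait(elapsed, base))
        start = time.monotonic()
        page = fetch(url)
        elapsed = time.monotonic() - start
        yield url, page
```

The key property: a healthy server that answers in 200 ms gets one request every five seconds, but a struggling server that takes four seconds per page automatically gets twelve seconds of breathing room between requests.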
The new speed will mean longer waits before a library can see LibraryThing for Libraries in action. In our experience, we run about 50% coverage on US public libraries, so a 250,000-ISBN library will have roughly 125,000 overlapping ISBNs, and fetching all the titles and authors will take about a week. With almost three million ISBNs in LibraryThing already, we can show a library what the widgets will look like beforehand, so long as they understand the titles may not match theirs exactly.
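Where does "about a week" come from? It falls straight out of the numbers above:

```python
# Back-of-the-envelope: time to scrape a 250,000-ISBN library at
# one request every five seconds, assuming ~50% overlap with LibraryThing.
isbns = 250_000
overlap = int(isbns * 0.50)   # 125,000 ISBNs actually need fetching
seconds = overlap * 5          # one request per five seconds
days = seconds / 86_400
print(f"{overlap:,} pages, about {days:.1f} days")  # about 7.2 days
```

At the "fast" two-second setting that drops to under three days; at the "slow" ten-second setting it roughly doubles.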
We thank the dozen libraries who are participating in our initial tests of the system. We think everyone is going to be impressed with the result. We got the tag-browsing widget working last night, and it's absolutely fantastic. Altay, our JavaScript guru, is outdoing himself. And I celebrated with a big hunk of brie. I can't wait to finish it up and show it off at CIL and the Library of Congress next week.
*This is possible because ISBNs aren't just numbers, but numbers with structure. They are either ten digits long (the last of which may be an X) or thirteen digits starting with 978 or 979.** And the last digit is a checksum--a calculation based on the others. So ISBN 0747532699 is the first British edition of Harry Potter and the Philosopher's Stone, now selling for upwards of $1,000. But change a digit and you don't get another book--you get an error. The checksum won't work. If anything bad slips through anyway, running the ISBNs against LibraryThing's books tosses them out.
**I.e., ([0-9]{9}[0-9X]|(978|979)[0-9]{10}) in regular-expression land, where I live.
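The pattern-plus-checksum validation the footnotes describe looks like this in practice. A sketch in Python; the function names are mine, but the check-digit math is the standard ISBN algorithm:

```python
import re

# The footnote's pattern, anchored: ten characters or a 978/979 thirteen.
ISBN_RE = re.compile(r'^(?:[0-9]{9}[0-9X]|(?:978|979)[0-9]{10})$')

def isbn10_valid(isbn):
    """ISBN-10 checksum: digits weighted 10 down to 1 must sum to 0 mod 11.
    An 'X' in the last position stands for the value 10."""
    if not re.fullmatch(r'[0-9]{9}[0-9X]', isbn):
        return False
    values = [10 if c == 'X' else int(c) for c in isbn]
    return sum(v * w for v, w in zip(values, range(10, 0, -1))) % 11 == 0

def isbn13_valid(isbn):
    """ISBN-13 checksum: digits weighted 1,3,1,3,... must sum to 0 mod 10."""
    if not re.fullmatch(r'(?:978|979)[0-9]{10}', isbn):
        return False
    digits = [int(c) for c in isbn]
    return sum(d * (1 if i % 2 == 0 else 3)
               for i, d in enumerate(digits)) % 10 == 0

print(isbn10_valid("0747532699"))  # True: the Potter first edition above
print(isbn10_valid("0747532698"))  # False: one digit changed, checksum fails
```

Because 11 is prime, *any* single-digit change to an ISBN-10 breaks the mod-11 checksum, which is why a one-character typo yields an error rather than a different book.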
3 Comments:
Could this be sent from a library with access to an OpenURL server without the preliminary scraping?
When will we see some new stuff for users?
Good point. Soon. We're going to have free resources now that the Big Push is over.