Monday, February 05, 2007

Can subjects be relevancy ranked?

I wrote this up on the plane from San Francisco. (I was there on a secret, unbloggable mission!*) It's a bit involved and it doesn't "arrive" anywhere, but, if you're interested in subjects and relevancy ranking, it might be worth thinking about.

There are a couple of differences between user tagging ("free tagging," "social tagging," etc.) and traditional library classification. "Who does it?" is the most obvious difference, followed by whether the labeling takes place within a predefined ontology or is made up on the fly.

It's easy to ignore a third, and very critical, difference. Subject classifications, like the Library of Congress Subject Headings (LCSH), are essentially binary. They're non-overlapping buckets. Something either does or does not belong in a subject. There are no gradations of belonging.

The idea is, as Clay Shirky and David Weinberger have reminded us, rooted in the physical world. Subject classification escapes the physicality of shelf-order classification, in which a book must be shelved in a single place, but is still constrained by the physicality of the catalog card. A catalog card can only reference a certain number of subjects. Nobody wants a book to take up twenty cards. And the subject cards can only reference so many books. About 90% of all literature could fall under the LCSH subject Man-woman relationships. But it would make no sense to slot this 90% under that heading in a physical card catalog--the card catalog would instantly grow by 90%! And there seem to be very real differences in relevancy and "what-the-heck"-ness between real-life members of the "Man-woman relationships" LCSH: High Fidelity, Great Expectations, The Fountainhead, I Kissed Dating Goodbye, and The Official Hottie Hunting Guide.

If you're very selective, you can keep the numbers down. But, apart from the rule that the first subject is generally the primary one, there's no good way to relevancy rank the books belonging to a subject.

Tags can do it, because tens, hundreds or thousands of users applying tags creates a "statistics of meaning." So, 1984 is tagged dystopia 549 times, torture six times and Great Britain twice. The numbers can be turned into a ranking, so 1984 shows up high on a list of books about "dystopia," lower under "torture" and near the end of a list of books about Great Britain.
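
The count-to-ranking step above is simple enough to sketch. Here's a minimal version in Python--the tag counts are invented for illustration, not real LibraryThing data:

```python
# Turn raw (book, tag) counts into a per-tag relevancy ranking.
# The counts below are illustrative, not real LibraryThing numbers.
from collections import defaultdict

# (book, tag) -> number of users who applied that tag
tag_counts = {
    ("1984", "dystopia"): 549,
    ("1984", "torture"): 6,
    ("1984", "great britain"): 2,
    ("Brave New World", "dystopia"): 401,
    ("Tinker Tailor Soldier Spy", "great britain"): 88,
}

# Invert into tag -> [(count, book)] and sort descending, so each
# tag yields a relevancy-ranked list of books.
by_tag = defaultdict(list)
for (book, tag), n in tag_counts.items():
    by_tag[tag].append((n, book))
for tag in by_tag:
    by_tag[tag].sort(reverse=True)
```

With these numbers, 1984 tops the "dystopia" list but falls to the bottom under "great britain"--exactly the gradation of belonging that a binary subject heading can't express.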

This is all well-worn territory. My question is this: Is there any way to relevancy-rank books within subjects?

I was reminded of the question when checking out OCLC's new project, FictionFinder. I'll blog about the whole thing later, but for now know that you can search for an LCSH subject and get back a list of books belonging to it. (I can't link to the results, which are session-based.**) Check out the LCSH "City and Town Life" and the top book is Red Badge of Courage. Lacking a better method, FictionFinder lets popularity (the number of OCLC libraries with a copy) stand in for relevance. LibraryThing does the same, using our popularity numbers instead. The results are not systematically better (in this case Ulysses wins).

I tried two solutions:

The first was to tie into LibraryThing's tags. So, figure out what tags are most characteristic of books with the subject "Man-Woman Relationships," and then use the presence and number of these tags to rank the subject results. So, for example, "Man-Woman Relationships" has a global correlation with "relationships," "dating" and "romance," none of which are very prominent among the tags applied to Great Expectations, so it can fall low on the list.
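
The idea can be sketched roughly like this. The characteristic-tag weights and per-book tag counts here are made up--in practice the weights would come from a global correlation between the subject and tags across the whole catalog:

```python
# Sketch of ranking a subject's members by how strongly their tags
# match the subject's "characteristic" tags. All numbers are invented.

# tags globally correlated with "Man-Woman Relationships" -> weight
characteristic_tags = {"relationships": 1.0, "romance": 0.9, "dating": 0.8}

# book -> {tag: times applied}
book_tags = {
    "High Fidelity":            {"relationships": 120, "music": 300, "romance": 40},
    "Great Expectations":       {"classics": 500, "orphans": 30},
    "I Kissed Dating Goodbye":  {"dating": 210, "relationships": 90},
}

def subject_score(tags):
    """Weighted share of a book's tags that match the characteristic
    tags of the subject. 0.0 means no overlap at all."""
    total = sum(tags.values())
    hit = sum(characteristic_tags.get(t, 0.0) * n for t, n in tags.items())
    return hit / total if total else 0.0

ranked = sorted(book_tags, key=lambda b: subject_score(book_tags[b]),
                reverse=True)
```

On this toy data, Great Expectations scores zero--none of its tags overlap the subject's characteristic tags--so it falls to the bottom of the list, as intended.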

I got far enough down this road to know it wasn't going to help.

The second and more interesting algorithm was to see if books can be ranked within subjects without any other information. This would help OCLC, who are unlikely to pay for LibraryThing data, and any library that employs LCSH, most of which would have no "popularity" data to use either.

I hit upon the idea that subjects "reinforce" each other, and that this must leave a statistical signature. For example, it seems that "Love stories" and "Psychological fiction" are commonly applied to books about "Man-Woman Relationships," but that "Androgynous robot alone on an island -- Stories" is not. (Okay, that's not real, but the point stands.) Can these "related subjects" relevancy rank the subject itself?

I wish so, but I can't get it to work well enough. It works for some topics, but falls down for others, laughably.
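
For what it's worth, the reinforcement idea can be sketched as a subject co-occurrence count. The catalog below is invented; a real run would use millions of records:

```python
# Sketch of the "subjects reinforce each other" idea: score each book
# in a subject by how often its *other* subjects co-occur with that
# subject across the whole catalog. Catalog data is invented.
from collections import Counter
from itertools import combinations

catalog = {  # book -> set of LCSH-style subjects
    "High Fidelity":        {"Man-woman relationships", "Love stories"},
    "Great Expectations":   {"Man-woman relationships", "Orphans",
                             "Bildungsromans"},
    "Pride and Prejudice":  {"Man-woman relationships", "Love stories",
                             "Psychological fiction"},
}

# Count how often each subject pair co-occurs on a record.
cooc = Counter()
for subjects in catalog.values():
    for a, b in combinations(sorted(subjects), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def reinforcement(book, subject):
    """Sum of co-occurrence counts between `subject` and the book's
    other subjects -- higher means the subject is better 'reinforced'."""
    return sum(cooc[(subject, s)] for s in catalog[book] if s != subject)

members = [b for b, s in catalog.items()
           if "Man-woman relationships" in s]
ranked = sorted(members,
                key=lambda b: reinforcement(b, "Man-woman relationships"),
                reverse=True)
```

The failure mode I keep hitting shows up even here: a book whose other subjects are merely *common* (not actually related) gets reinforced just as much as one whose subjects genuinely cluster.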

Some ideas I've considered:
  • Treating subjects as links, and running some sort of "page-rank" style connection algorithm against them. Maybe this would bring out coincidences that simple statistics misses.
  • Using other library data, such as LCC and Dewey. This would be reminiscent of how I made LibraryThing's LCSH/LCC/Dewey recommendations.
  • Doing statistics on other fields, such as the title. So, for example, there's probably a statistical correlation between "Man-woman relationships" and books with "dating," "men and women" and "proposal" in the title.
None strike me as the silver bullet.
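
The first idea--treating subjects as links--would amount to something like power iteration over the subject co-occurrence graph. A bare-bones sketch, with an invented three-subject graph:

```python
# PageRank-style power iteration over a subject graph. The graph and
# parameters are illustrative; this is a sketch, not a tuned algorithm.
def pagerank(neighbors, damping=0.85, iters=50):
    """Plain power iteration on an adjacency dict (every node must
    appear as a key and have at least one neighbor)."""
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Mass flowing in from each neighbor, split over its degree.
            incoming = sum(rank[m] / len(neighbors[m])
                           for m in neighbors[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

subject_graph = {
    "Man-woman relationships": ["Love stories", "Psychological fiction"],
    "Love stories":            ["Man-woman relationships"],
    "Psychological fiction":   ["Man-woman relationships"],
}
ranks = pagerank(subject_graph)
```

A hub subject like "Man-woman relationships" ends up with the highest rank--which is exactly the worry: this measures a subject's centrality, not how relevant a particular book is to it, so it would still need to be combined with per-book signals.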

Anyway, my plane has landed--allowing me to do real work again--so I end in aporia. Ideas?

*I'm itching to blog it, but I have to hold off for now. I'll throw some pictures up soon, however. I'd never been to San Francisco before. What a wonderful wonderful town.
**One can understand why OPACs made in 1996 are session-based. How frustrating to see a new product built that way.

5 Comments:

Blogger Robert J. said...

Is there any way to relevancy-rank books within subjects?

This presumes that we know, for a particular user, what "relevant" means, no? While we might be able to judge relevance for the population as a whole, we can't be sure how well that captures what an individual user has in mind (and the individual user might not really know, anyway).

So, how about inverting the process and asking the user? When selecting "man-woman relationships" as a subject, offer five additional choices and say: "Do you want more titles closer to (A), to (B), to (C), to (D), or to (E)?" And then let the user interactively narrow the funnel, instead of presenting a pre-narrowed funnel.*

Google has something vaguely like this with certain searches: a kind of "More like this?" choice, IIRC.

But perhaps I misunderstand the task? (He said, aporetically.)

RJO

*I wonder if the phrase "pre-narrowed funnel" has ever been used in the English language before.

2/05/2007 4:04 PM  
Blogger Tim said...

Interesting.

Although of course relevancy is in the eye of the beholder, that shouldn't cloud us too much. If you take that attitude, there would be no difference between Google's links 1 and 1,000,000, or between, say, Google and the engines it beat (HotBot anyone?).

That said, your suggestion is very interesting.

Narrowing by ANOTHER subject would be one way. (Presumably you'd go past sorting into "has" and "has not" by looking at how many hops there would be if the book didn't have the subject.)

But you could also narrow by book--using whatever other data you have--book-to-book holding patterns, tags, etc.

Aporetically indeed. Tough problem.

PS: "Yesterday I won a drinking contest at my fraternity. Of course, I saw to it that the others were using a pre-narrowed funnel."

2/05/2007 4:13 PM  
Blogger Carter Clan said...

On a slightly different tack...
Having just started a theology course and acquired a whole load of new books I decided I needed to arrange them on the shelves in some kind of sensible arrangement. So, I looked at LT and have used the LC call numbers, as far as I had them, but there are some books where it's not in the database yet, so I've had to do best fit. For the number of books I've got this isn't really an issue, but it did get me thinking.
Is there any way that the LT data -- tags and the rest -- can be used to get back to a recommended shelf order? Put the books most likely to suggest each other next to each other? Put the books most likely to unsuggest each other as far away from each other as possible? Probably horribly processor-heavy, but I'd love to see what it would throw up.

TCarter

2/08/2007 3:14 PM  
Blogger Robert J. said...

I think my proposal may be most effective in those cases where natural language is especially ambiguous. Someone searching for "Aporetic dialogue" or "Pre-narrowed funnels" may be taken immediately to the desired target. In other cases the user may be searching with a term that is inherently ambiguous* on first use and so may require immediate clarification. Vanilla Google assumes most search terms are close enough, but has recognized that some aren't, and so gives the "More like this?" choice in selected cases; similarly, Wikipedia has those special disambiguation pages that ask if you mean "Cars (automobiles)" or "Cars (musical group)."

The trick, then, may be to somehow flag those terms which are likely to require extra disambiguation. This is really a perfect spot to have access to all the data that's actually contained in the LCSH, because it's full of stuff like that already: "Broader term," "Narrower term," "Used for," "Related term," etc. --- it's all in there.

*It's easy to get lost in a broad semantic field, as a syllogism from systematic biology once demonstrated:

1. Phylogenetic reconstruction is based on the analysis of characters.
2. Sokal and Sneath are characters.
3. Therefore, phylogenetic reconstruction is based on the analysis of Sokal and Sneath.

2/09/2007 2:05 PM  
Blogger Carter Clan said...

I'd reckon I could get somewhere on my library ordering with Yahoo Pipes, if there were a few more RSS feeds around. What I'd really like is to be able to get an RSS feed of the Suggestion by classification for a single work and an RSS feed for my whole library, rather than just the most recently added. Any chance?

TCarter

2/11/2007 2:33 PM  
