Monday, February 05, 2007

Can subjects be relevancy ranked?

I wrote this up on the plane from San Francisco. (I was there on a secret, unbloggable mission!*) It's a bit involved and it doesn't "arrive" anywhere, but, if you're interested in subjects and relevancy ranking, it might be worth thinking about.

There are a couple of differences between user tagging ("free tagging," "social tagging," etc.) and traditional library classification. "Who does it?" is the most obvious difference, followed by whether the labeling takes place within a predefined ontology or is made up on the fly.

It's easy to ignore a third, and very critical, difference. Subject classifications, like the Library of Congress Subject Headings (LCSH), are essentially binary. They're non-overlapping buckets. Something either does or does not belong in a subject. There are no gradations of belonging.

The idea is, as Clay Shirky and David Weinberger have reminded us, rooted in the physical world. Subject classification escapes the physicality of shelf-order classification, in which a book must be shelved in a single place, but is still constrained by the physicality of the catalog card. A catalog card can only reference a certain number of subjects. Nobody wants a book to take up twenty cards. And the subject cards can only reference so many books. About 90% of all literature could fall under the LCSH subject Man-woman relationships. But it would make no sense to slot this 90% under that heading in a physical card catalog--the card catalog would instantly grow by 90%! And there seem to be very real differences in relevancy and "what-the-heck"-ness between real-life members of the "Man-woman relationships" LCSH: High Fidelity, Great Expectations, The Fountainhead, I Kissed Dating Goodbye, and The Official Hottie Hunting Guide.

If you're very selective, you can keep the numbers down. But, apart from the rule that the first subject is generally the primary one, there's no good way to relevancy rank the books belonging to a subject.

Tags can do it, because tens, hundreds or thousands of users applying tags creates a "statistics of meaning." So, 1984 is tagged dystopia 549 times, torture six times and Great Britain twice. The numbers can be turned into a ranking, so 1984 shows up high on a list of books about "dystopia," lower under "torture" and near the end of a list of books about Great Britain.
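
The count-to-ranking step above is simple enough to sketch. Here's a minimal version in Python--the tag counts are invented for illustration, not real LibraryThing data:

```python
# Turn raw (book, tag) counts into a per-tag relevancy ranking.
# The counts below are illustrative, not real LibraryThing numbers.
from collections import defaultdict

# (book, tag) -> number of users who applied that tag
tag_counts = {
    ("1984", "dystopia"): 549,
    ("1984", "torture"): 6,
    ("1984", "great britain"): 2,
    ("Brave New World", "dystopia"): 401,
    ("Tinker Tailor Soldier Spy", "great britain"): 88,
}

# Invert into tag -> [(count, book)] and sort descending, so each
# tag yields a relevancy-ranked list of books.
by_tag = defaultdict(list)
for (book, tag), n in tag_counts.items():
    by_tag[tag].append((n, book))
for tag in by_tag:
    by_tag[tag].sort(reverse=True)
```

With these numbers, 1984 tops the "dystopia" list but falls to the bottom under "great britain"--exactly the gradation of belonging that a binary subject heading can't express.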

This is all well-worn territory. My question is this: Is there any way to relevancy-rank books within subjects?

I was reminded of the question when checking out OCLC's new project, FictionFinder. I'll blog about the whole thing later, but for now know that you can search for an LCSH subject and get back a list of books belonging to it. (I can't link to the results, which are session-based.**) Check out the LCSH "City and Town Life" and the top book is Red Badge of Courage. Lacking a better method, FictionFinder lets popularity (the number of OCLC libraries with a copy) stand in for relevance. LibraryThing does the same, using our popularity numbers instead. The results are not systematically better (in this case Ulysses wins).

I tried two solutions:

The first was to tie into LibraryThing's tags. So, figure out what tags are most characteristic of books with the subject "Man-Woman Relationships," and then use the presence and number of these tags to rank the subject results. So, for example, "Man-Woman Relationships" has a global correlation with "relationships," "dating" and "romance," none of which are very prominent among the tags applied to Great Expectations, so it can fall low on the list.
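
The idea can be sketched roughly like this. The characteristic-tag weights and per-book tag counts here are made up--in practice the weights would come from a global correlation between the subject and tags across the whole catalog:

```python
# Sketch of ranking a subject's members by how strongly their tags
# match the subject's "characteristic" tags. All numbers are invented.

# tags globally correlated with "Man-Woman Relationships" -> weight
characteristic_tags = {"relationships": 1.0, "romance": 0.9, "dating": 0.8}

# book -> {tag: times applied}
book_tags = {
    "High Fidelity":            {"relationships": 120, "music": 300, "romance": 40},
    "Great Expectations":       {"classics": 500, "orphans": 30},
    "I Kissed Dating Goodbye":  {"dating": 210, "relationships": 90},
}

def subject_score(tags):
    """Weighted share of a book's tags that match the characteristic
    tags of the subject. 0.0 means no overlap at all."""
    total = sum(tags.values())
    hit = sum(characteristic_tags.get(t, 0.0) * n for t, n in tags.items())
    return hit / total if total else 0.0

ranked = sorted(book_tags, key=lambda b: subject_score(book_tags[b]),
                reverse=True)
```

On this toy data, Great Expectations scores zero--none of its tags overlap the subject's characteristic tags--so it falls to the bottom of the list, as intended.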

I got far enough down this road to know it wasn't going to help.

The second and more interesting algorithm was to see if books can be ranked within subjects without any other information. This would help OCLC, who are unlikely to pay for LibraryThing data, and any library that employs LCSH, most of which would have no "popularity" data to use either.

I hit upon the idea that subjects "reinforce" each other, and that this must leave a statistical signature. For example, it seems that "Love stories" and "Psychological fiction" are commonly applied to books about "Man-Woman Relationships," but that "Androgynous robot alone on an island -- Stories" is not. (Okay, that's not real, but the point stands.) Can these "related subjects" relevancy rank the subject itself?

I wish so, but I can't get it to work well enough. It works for some topics, but falls down for others, laughably.
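
For what it's worth, the reinforcement idea can be sketched as a subject co-occurrence count. The catalog below is invented; a real run would use millions of records:

```python
# Sketch of the "subjects reinforce each other" idea: score each book
# in a subject by how often its *other* subjects co-occur with that
# subject across the whole catalog. Catalog data is invented.
from collections import Counter
from itertools import combinations

catalog = {  # book -> set of LCSH-style subjects
    "High Fidelity":        {"Man-woman relationships", "Love stories"},
    "Great Expectations":   {"Man-woman relationships", "Orphans",
                             "Bildungsromans"},
    "Pride and Prejudice":  {"Man-woman relationships", "Love stories",
                             "Psychological fiction"},
}

# Count how often each subject pair co-occurs on a record.
cooc = Counter()
for subjects in catalog.values():
    for a, b in combinations(sorted(subjects), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def reinforcement(book, subject):
    """Sum of co-occurrence counts between `subject` and the book's
    other subjects -- higher means the subject is better 'reinforced'."""
    return sum(cooc[(subject, s)] for s in catalog[book] if s != subject)

members = [b for b, s in catalog.items()
           if "Man-woman relationships" in s]
ranked = sorted(members,
                key=lambda b: reinforcement(b, "Man-woman relationships"),
                reverse=True)
```

The failure mode I keep hitting shows up even here: a book whose other subjects are merely *common* (not actually related) gets reinforced just as much as one whose subjects genuinely cluster.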

Some ideas I've considered:
  • Treating subjects as links, and running some sort of "page-rank" style connection algorithm against them. Maybe this would bring out coincidences that simple statistics misses.
  • Using other library data, such as LCC and Dewey. This would be reminiscent of how I made LibraryThing's LCSH/LCC/Dewey recommendations.
  • Doing statistics on other fields, such as the title. So, for example, there's probably a statistical correlation between "Man-woman relationships" and books with "dating," "men and women" and "proposal" in the title.
None strike me as the silver bullet.
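
The first idea--treating subjects as links--would amount to something like power iteration over the subject co-occurrence graph. A bare-bones sketch, with an invented three-subject graph:

```python
# PageRank-style power iteration over a subject graph. The graph and
# parameters are illustrative; this is a sketch, not a tuned algorithm.
def pagerank(neighbors, damping=0.85, iters=50):
    """Plain power iteration on an adjacency dict (every node must
    appear as a key and have at least one neighbor)."""
    nodes = list(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Mass flowing in from each neighbor, split over its degree.
            incoming = sum(rank[m] / len(neighbors[m])
                           for m in neighbors[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

subject_graph = {
    "Man-woman relationships": ["Love stories", "Psychological fiction"],
    "Love stories":            ["Man-woman relationships"],
    "Psychological fiction":   ["Man-woman relationships"],
}
ranks = pagerank(subject_graph)
```

A hub subject like "Man-woman relationships" ends up with the highest rank--which is exactly the worry: this measures a subject's centrality, not how relevant a particular book is to it, so it would still need to be combined with per-book signals.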

Anyway, my plane has landed--allowing me to do real work again--so I end in aporia. Ideas?

*I'm itching to blog it, but I have to hold off for now. I'll throw some pictures up soon, however. I'd never been to San Francisco before. What a wonderful wonderful town.
**One can understand why OPACs made in 1996 are session-based. How frustrating to see a new product built that way.

5 Comments:

Blogger Robert J. said...

Is there any way to relevancy-rank books within subjects?

This presumes that we know, for a particular user, what "relevant" means, no? While we might be able to judge relevance for the population as a whole, we can't be sure how well that captures what an individual user has in mind (and the individual user might not really know, anyway).

So, how about inverting the process and asking the user? When selecting "man-woman relationships" as a subject, offer five additional choices and say: "Do you want more titles closer to (A), to (B), to (C), to (D), or to (E)?" And then let the user interactively narrow the funnel, instead of presenting a pre-narrowed funnel.*

Google has something vaguely like this with certain searches: a kind of "More like this?" choice, IIRC.

But perhaps I misunderstand the task? (He said, aporetically.)

RJO

*I wonder if the phrase "pre-narrowed funnel" has ever been used in the English language before.

2/05/2007 4:04 PM  
Blogger Tim said...

Interesting.

Although of course relevancy is in the eye of the beholder, that shouldn't cloud us too much. If you take that attitude, there would be no difference between Google's links 1 and 1,000,000, or between, say, Google and the engines it beat (HotBot anyone?).

That said, your suggestion is very interesting.

Narrowing by ANOTHER subject would be one way. (Presumably you'd go past sorting into "has" and "has not" by looking at how many hops there would be if the book didn't have the subject.)

But you could also narrow by book--using whatever other data you have--book-to-book holding patterns, tags, etc.

Aporetically indeed. Tough problem.

PS: "Yesterday I won a drinking contest at my fraternity. Of course, I saw to it that the others were using a pre-narrowed funnel."

2/05/2007 4:13 PM  
Blogger Carter Clan said...

On a slightly different tack...
Having just started a theology course and acquired a whole load of new books I decided I needed to arrange them on the shelves in some kind of sensible arrangement. So, I looked at LT and have used the LC call numbers, as far as I had them, but there are some books where it's not in the database yet, so I've had to do best fit. For the number of books I've got this isn't really an issue, but it did get me thinking.
Is there any way that the LT data -- tags and the rest -- can be used to get back to a recommended shelf order? Put the books most likely to suggest each other next to each other? Put the books most likely to unsuggest each other as far away from each other as possible? Probably horribly processor-heavy, but I'd love to see what it would throw up.

TCarter

2/08/2007 3:14 PM  
Blogger Robert J. said...

I think my proposal may be most effective in those cases where natural language is especially ambiguous. Someone searching for "Aporetic dialogue" or "Pre-narrowed funnels" may be taken immediately to the desired target. In other cases the user may be searching with a term that is inherently ambiguous* on first use and so may require immediate clarification. Vanilla Google assumes most search terms are close enough, but has recognized that some aren't, and so gives the "More like this?" choice in selected cases; similarly, Wikipedia has those special disambiguation pages that ask if you mean "Cars (automobiles)" or "Cars (musical group)."

The trick, then, may be to somehow flag those terms which are likely to require extra disambiguation. This is really a perfect spot to have access to all the data that's actually contained in the LCSH, because it's full of stuff like that already: "Broader term," "Narrower term," "Used for," "Related term," etc. --- it's all in there.

*It's easy to get lost in a broad semantic field, as a syllogism from systematic biology once demonstrated:

1. Phylogenetic reconstruction is based on the analysis of characters.
2. Sokal and Sneath are characters.
3. Therefore, phylogenetic reconstruction is based on the analysis of Sokal and Sneath.

2/09/2007 2:05 PM  
Blogger Carter Clan said...

I'd reckon I could get somewhere on my library ordering with Yahoo Pipes, if there were a few more RSS feeds around. What I'd really like is to be able to get an RSS feed of the Suggestion by classification for a single work and an RSS feed for my whole library, rather than just the most recently added. Any chance?

TCarter

2/11/2007 2:33 PM  
