Tuesday, July 24, 2007

Tagmash: Book tagging grows up

Tagmash: alcohol, history gets over the fact that almost nobody tags things history of alcohol

Short version: I've just gone live with a new feature called "tagmash," pages for the intersections of tags. This is a fairly obvious thing to do, but it isn't trivial in context. In getting past words or short phrases, tagmash closes some of the gap between tagging and professional subject classifications.

For example, there is no good tag for "France during WWII." Most people just don't tag that verbosely. Tagmash allows for a page combining the two: France, wwii. If you want to skip the novels, you can do france, wwii, -fiction. The results are remarkably good.

Tagmash pages are created when a user asks for the combination, but unlike a "search" they persist, and show up elsewhere. For example, the tagmash for France, Germany shows France, wwii as a partial overlap, alongside others. Related tagmashes now also show up on select tag and library subject pages, as a third system for browsing the limitless world of books.

Booooring? Go ahead and play a bit:
That's the short version. But stop here and you'll never know what Zombie Listmania is!

Long version. LibraryThing has shown some of the things that book tags are good for, such as plain language, genre fiction, capturing identity and perspective, academic schools, staying current and changing over time. (Details and examples in footnote.*)

It also demonstrates some of the weaknesses, including:
  1. Idiots
  2. Bad actors (spammers, racists, anarchists)
  3. "Personal" tags clouding the tagosphere with junk (eg., "at the beach house")
  4. The lack of a "controlled" vocabulary results in ambiguous terms (eg., classics, leather, magic)
  5. Tags lack the detail and focus available to a hierarchical subject system like the Library of Congress Subject Headings (LCSH), eg.,
    Great Britain -- History -- Elizabeth, 1558-1603 -- Fiction
    , or
    Jews -- Italy -- Bologna -- Conversion to Christianity -- History -- 19th century**
As I've argued elsewhere and in my Library of Congress talk, problems 1, 2 and 3 are mitigated by having LOTS of tags. Idiocy, malice and personal junk fall out statistically. A tag here or there can't be trusted, but a large body of tags in agreement is different.
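
To make "fall out statistically" concrete, here is a minimal sketch, not LibraryThing's actual code. It assumes the taggings for one book arrive as (user, tag) pairs and that a tag counts as trustworthy once some number of different people have applied it. The function name, the data, and the cutoff of three users are all made up for illustration.

```python
from collections import defaultdict


def trusted_tags(taggings, min_users=3):
    """Keep only tags applied to a book by at least `min_users` distinct users.

    `taggings` is an iterable of (user, tag) pairs for a single book.
    Both the input shape and the threshold are assumptions for this sketch.
    """
    users_by_tag = defaultdict(set)
    for user, tag in taggings:
        # Normalize case and whitespace so "WWII" and "wwii" count together.
        users_by_tag[tag.strip().lower()].add(user)
    return {
        tag: len(users)
        for tag, users in users_by_tag.items()
        if len(users) >= min_users
    }


# Hypothetical taggings for one book.
example_taggings = [
    ("ann", "wwii"), ("bob", "WWII"), ("cat", "wwii"), ("dan", "wwii"),
    ("ann", "france"), ("bob", "france"), ("cat", "france"),
    ("dan", "at the beach house"),  # personal tag, used by one person
    ("eve", "asdf"),                # junk, used by one person
]

print(trusted_tags(example_taggings))
```

The one-off tags simply never reach the threshold, so nobody has to judge them individually. The more tags there are, the higher the cutoff can go without losing the real signal.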

Problems 4 and 5 are harder to tackle. Flickr has shown the way with one solution, statistical clustering. The screen shot below shows this--clusters of images related to the tag "bow."



Some day--when I become a better programmer?--I'm going to try this on LibraryThing data. It will help with ambiguity--the secondary tags on the various meanings of "leather" are surely wildly divergent! But I suspect it separates better than it clarifies. Flickr supposes that tags fall into discrete clusters, but subjects interact with books in extremely complex ways. On a more basic level, I am suspicious of the too-quick resort to algorithms against user data.*** After all, if computers are so good at figuring out meaning, why were users necessary in the first place? It smacks of technological revanchism.

So, where Flickr's clusters are automated, tagmash is a semi-automated process. LibraryThing does the statistics, but users decide what the meaningful clusters are. Some mashes are interesting and useful. Some aren't. By and large, uninteresting clusters won't last.****

This certainly helps with ambiguity. Take the problematic tag leather, which divides easily into tagmashes like:
Now let's take the "focusing" power of hierarchy. As mentioned above, there is no good way to get at "france during wwii." The tag Vichy covers some of the ground, but not enough. Tagmash provides an answer.
The book list is good, and a simple union gets around an imposed hierarchy. Looking at the related LCSHs, for example, one is left in doubt whether France is part of World War II, or World War II part of France—or what:
Of course, both trees are equally artificial. David Weinberger writes how, in the real world, a leaf can be on many branches. But it's equally true that what's trunk and what's branch are largely about where you start--dirt or pinecone. Either way, branching happens. The order of the branches isn't necessarily important.

Even as it borrows some of the virtues of subject classification, tagmash keeps the strengths of tagging. Subject systems are pre-built things. Now and then they get larger, but it takes deliberation and effort. What gets "blessed" is often surprising. I would have never predicted the unusually staid LCSH would have embraced:
But tagging has no limits. Think of the tagmash "erotica" and "zombies" and there it is. (Tagmash: erotica, zombies). Want to know what chick lit takes place in Greece? (Tagmash: chick lit, greece.) Young adult books involving horses? (Tagmash: horses, young adult.) Poems from or about San Francisco? (Tagmash: poetry, san francisco). Slavery in Brazil? (Tagmash: brasil, slavery.) Non-fiction books about Narnia? (Tagmash: narnia, -fiction.) The options are endless.

Of course, tagmash only narrows the gap. It doesn't eliminate it. Tagmash: poetry, San Francisco still can't distinguish between poetry about and poetry from San Francisco--it involves whatever is tagged "San Francisco" and that's probably a mixed bag.***** Well-planned and carefully executed subject systems have strengths that no ad hoc, regular-person system can match.

Lastly—let there be no doubt—tagmash needs a very large quantity of tags to work. For tagmash after tagmash, the data is simply insufficient.

You've made it to Zombie Listmania! There are some obvious directions this can go:
  • The syntax can improve, for example to allow alternates (eg., humor, cats/dogs)
  • The syntax can include non-tag factors, such as formal subject headings (Tag: zombies, LCSH: love stories), languages, dates, authors and so forth.
  • The syntax can include weights (eg., Zombies 50%, vampires 50%, love stories 90%). Abby and I experimented with just such a system, creating algorithmic proxies for BISAC (bookstore) headings. It isn't that hard to do.
  • Complex mashes could acquire titles and other metadata.
  • Users could follow a tagmash, and be alerted whenever new material enters the list.
Amazon calls its static, or dead, lists "Listmania." All these tend to create a "Zombie Listmania," lists of books that "won't stay dead." Instead, they change over time, as the underlying social and non-social data change. There's no reason you couldn't create "Zombie" versions of formal subject headings—a series of tags and other markers which approximated the content of a professionally-assigned subject heading.
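
For the curious, the weighted syntax floated above might be sketched like this. It is purely hypothetical: the "tag NN%" format is taken from the example in the list, but the function names, the scoring rule (each weight times the share of a book's tags that carry that tag), and the sample books are assumptions, not the system Abby and I experimented with.

```python
import re

# Matches one term such as "love stories 90%": a tag, whitespace, a percentage.
WEIGHTED_TERM = re.compile(r"^\s*(?P<tag>.+?)\s+(?P<pct>\d+(?:\.\d+)?)%\s*$")


def parse_weighted_mash(query):
    """Turn 'Zombies 50%, vampires 50%' into {'zombies': 0.5, 'vampires': 0.5}."""
    weights = {}
    for raw in query.split(","):
        match = WEIGHTED_TERM.match(raw)
        if match is None:
            raise ValueError(f"expected 'tag NN%', got {raw!r}")
        weights[match.group("tag").lower()] = float(match.group("pct")) / 100.0
    return weights


def weighted_score(book_tags, weights):
    """Sum of weight * salience, where salience is a tag's share of the book's tags.

    `book_tags` maps tag -> number of times it was applied to the book.
    This scoring rule is an assumption made for the sketch.
    """
    total = sum(book_tags.values())
    if total == 0:
        return 0.0
    return sum(w * book_tags.get(tag, 0) / total for tag, w in weights.items())


weights = parse_weighted_mash("Zombies 50%, vampires 50%, love stories 90%")

# Hypothetical books; re-scoring them whenever the tags change is what would
# keep the list from staying dead.
zombie_romance = {"zombies": 2, "love stories": 2}
cookbook = {"cooking": 5}

ranked = sorted(
    [("zombie romance", zombie_romance), ("cookbook", cookbook)],
    key=lambda item: weighted_score(item[1], weights),
    reverse=True,
)
print([title for title, _ in ranked])
```

Because the list is just a stored query plus a scoring rule, nothing about it is fixed: run it again next month and the membership follows the data.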

Pretty cool idea, I think. We'll see what we can do about it.

Details.
  • Tagmashes can be made from any tagmash or tag page. Just search for a tag or two or more tags with a comma between them. The URLs follow the same pattern: /tag/ plus a tag or tags separated by commas.
  • The weighting of tags is wiggly. We're trying to get at both raw numbers of tags on an item and the relative salience (number divided by total number of tags), and then cross this data tag-by-tag. There is no obvious answer. In an ideal world, some tags would be about salience (eg., humor) and others would be thresholds (eg., fiction)--that is, when you're looking for humor, fiction you want the funniest fiction, not the most fictional humor.
  • You can enter the tags in any order, but it will reformat your URL in alphabetical order, with the minuses at the end, such that "wwii, france" is the same as "france, wwii."
  • A single minus (-fiction) "discriminates" against items tagged "fiction." A double minus (--fiction) disqualifies all books with the fiction tag.
  • Tagmashes don't get built until someone builds them. The first time can take a while to generate. There is currently no system to expire older or underused tagmashes.
  • UPDATE: I'm seeing a lot of part/whole tagmashes. These rarely work. When you search for "Einstein, science" or "Manet, art" you're not doing much more than putting a statistical cramp on the smaller of the two tags—a few Manet books won't have an art tag, and that will be the end of them. Tagmashes work with different things, not a thing and its category.
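
A rough sketch of the mechanics in the list above, for anyone who thinks better in code. This is an illustration under stated assumptions, not the code behind the site. The alphabetical ordering with minuses last, the /tag/ prefix, and the single versus double minus come from the details above. The function names, the URL escaping, the placement of double minuses after single ones, and the way raw count and salience are multiplied together are all guesses of mine.

```python
from urllib.parse import quote


def parse_tagmash(query):
    """Split 'wwii, france, -fiction' into (wanted, penalized, excluded) tag lists."""
    wanted, penalized, excluded = set(), set(), set()
    for raw in query.split(","):
        term = raw.strip().lower()
        if term.startswith("--"):
            bucket, tag = excluded, term[2:].strip()
        elif term.startswith("-"):
            bucket, tag = penalized, term[1:].strip()
        else:
            bucket, tag = wanted, term
        if tag:  # ignore empty terms such as a trailing comma
            bucket.add(tag)
    return sorted(wanted), sorted(penalized), sorted(excluded)


def canonical_path(query):
    """Alphabetical order, minuses at the end, so any input order gives one URL."""
    wanted, penalized, excluded = parse_tagmash(query)
    terms = wanted + ["-" + t for t in penalized] + ["--" + t for t in excluded]
    # Percent-encoding of spaces is an assumption about the URL format.
    return "/tag/" + ",".join(quote(t) for t in terms)


def score(book_tags, query):
    """Cross raw count and salience tag-by-tag (one possible reading of 'wiggly').

    `book_tags` maps tag -> number of times it was applied to the book.
    """
    wanted, penalized, excluded = parse_tagmash(query)
    if any(book_tags.get(t, 0) for t in excluded):
        return 0.0  # double minus: the book is disqualified outright
    total = sum(book_tags.values())
    result = 1.0
    for tag in wanted:
        count = book_tags.get(tag, 0)
        if count == 0:
            return 0.0  # the book must carry every wanted tag
        result *= count * (count / total)  # raw count times salience
    for tag in penalized:
        # Single minus: the more salient the unwanted tag, the bigger the penalty.
        result *= 1.0 - book_tags.get(tag, 0) / total
    return result


# Hypothetical tag counts for two books.
novel = {"france": 10, "wwii": 10, "fiction": 20}
history = {"france": 4, "wwii": 4}

print(canonical_path("wwii, -fiction, france"))
print(score(novel, "france, wwii"), score(history, "france, wwii"))
print(score(novel, "france, wwii, -fiction"), score(history, "france, wwii, -fiction"))
```

With these made-up numbers the heavily tagged novel wins the plain mash, drops below the history once -fiction is added, and vanishes entirely with --fiction. That is the difference between discriminating and disqualifying.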

Footnotes!

*What's good about tagging:
  • Tags use everyday terms (the tag cooking vs. the subject cookery)
  • Tags are great for genre fiction that subject systems can't keep up with as fast or as well as their readers (chick lit, cyberpunk, paranormal romance)
  • Tags often encode subtleties that "controlled vocabulary" irons out (lgbt, glbt, queer, gay, homosexuality)
  • Tags capture identity and perspective that subject systems can't or won't (queer, glbt, lgbt, christian living)
  • Tags are good for schools of thought (intelligent design, austrian economics)
  • Tags respond quickly to change (hurricane katrina)
  • Tags "keep happening" in a way that systems like LCSH do not, getting added to books where LCSH misses the "first wave" of anything new (memetics, sociobiology)
**I've left out one problem, not covered at the LC--how "democratic" weighting can put Angela's Ashes at the top of the books for the Ireland tag. I want to write a blog post on the topic sometime. I think there are ways around it, and algorithmic solutions that nobody has really tried.

Aside: Much LIS anti-tagging polemic focuses on the most trivial of problems—spelling mistakes and "incorrect" tags. The former underestimates technology, the latter insults our intelligence. LibraryThing has dealt with the spelling problem, and has seen very few "wrong" tags. In fact, there are some serious problems with tagging. But you have to understand tags before you can see the problems, and many refuse to get past the idea that people will spell "white" wrong, or tag white horses as black.
***This is half formed. I have a problem with the reflexive "turn" from people-centered data to algorithms. I see this pattern again and again in software. Something transformative happens--something human. But it's imperfect, so programmers conclude that programs will fix humans. In a way, it's a reassertion of importance. More often, humans fix humans. To adapt David Weinberger, the answer to user-generated data is MORE user-generated data.
****Probably there's got to be some system to expire unused clusters.
*****UPDATE: After turning the feature loose I watched what new tagmashes would be created. One was children, cooking. Should I call the police?

32 Comments:

Anonymous sunny said...

> "Personal" tags clouding the tagosphere with junk

Don't call them junk, give us a way to distinguish the personal tags from the others..

7/24/2007 4:18 AM  
Blogger Tim said...

I've thought about this. The bottom end leaves a statistical signature. There just aren't enough people tagging things "left out on the deck." And it might be possible to get the top end by just assuming that a widespread tag with no strong secondary signal is personal. But I think you're thinking of either members marking their own tags or of allowing users to mark (or tag) tags as personal.

Some tag clouds are actually doing this now. We went through the top 25k tags for LibraryThing for Libraries, marking the personal ones. These show up lighter.

Anyway, suggestions on how to do it would be encouraged. But I don't think it can be user-by-user. That's a lot of work, and tags shouldn't be that.

7/24/2007 4:26 AM  
Blogger Blue Tyson said...

Damn, that is cool.

When I look at 'superhero, pulp' I get a list of 'top 63 books', does that mean there are more that could be accessed with a 'more' type paging link?

7/24/2007 4:32 AM  
Blogger Tim said...

No, not without an algorithm change. (It might be dropping some of the low-ranking ones off.) At the moment all tagmashes are cut off at 250, the theory being that any list longer than 250 ought to have some further way of paring it down, but the practice being that larger lists take longer to calculate and more space to store...

7/24/2007 4:34 AM  
Blogger Tim said...

LibraryThing is going to end up a Defense Department project. We need the supercomputer.

7/24/2007 4:35 AM  
Anonymous sunny said...

> members marking their own tags ... as personal

Yes. In the sense of "show tag publicly" versus "show tag only to myself". If done cleverly it could also solve the wish for a "lending feature", couldn't it?

But you're probably right: "show publicly" would have to be default, which means it might not change so much about the numbers of the private tags.

Being able to search for a combination of tags, excluding some, is of course a huge improvement!

7/24/2007 4:44 AM  
Anonymous sunny said...

(Is there another way to access tagmash apart from the links in the blog post text?)

7/24/2007 4:49 AM  
Blogger Tim said...

Adding a note. Yes, you can do a tagmash from the tag page (just type in something with a comma and two terms).

7/24/2007 4:52 AM  
Anonymous alasen said...

Yay, finally! I was hoping this was going to be today's announcement. I'll play some more at a later time, but I just wanted to say thanks because this has been up there on my LT wishlist for a while now, and I know I'm not the only one.

7/24/2007 5:03 AM  
Blogger Blue Tyson said...

Does it break if you use common massive tags?

Tagmash: non-fiction, sf
Fatal error: Allowed memory size of 41943040 bytes exhausted (tried to allocate 89 bytes) in /var/www/html/ajax_tagmash.php on line 64

Tagmash: fantasy, non-fiction
Fatal error: Allowed memory size of 41943040 bytes exhausted (tried to allocate 71 bytes) in /var/www/html/ajax_tagmash.php on line 53

Or is this one just odd?

7/24/2007 5:53 AM  
Blogger gabriel said...

LibraryThing is going to end up a Defense Department project.

Allow me to suggest that that isn't quite as outlandish as it seems- LT seems to be on the cutting edge of distributed tagging and database aggregation (two phrases I just made up). Perhaps the intelligence services could make use of the technology {not for monitoring us, but perhaps for coming up with better methods of intercepting suspicious communications}.

7/24/2007 6:50 AM  
Anonymous ryn_books said...

This is really interesting!
The only question I have is can we increase the character limit in the search box?
I wanted to search "fathers and daughters, science fiction" and finally had to use scifi instead.
I take it that any tag can be used so long as it's listed in the subsidiary combined tags?

7/24/2007 7:33 AM  
Anonymous zweiundzwei said...

LibraryThing just became so much better. I've been waiting for a feature like that ever since I signed up.
And now I'll try every combination that I can come up with.

7/24/2007 7:49 AM  
Blogger Jakob said...

Bloody brilliant! Now LibraryThing will consume more of my free time as I get sucked into looking through tagmash pages for new books. Thanks Tim, thanks a lot.

P.S. I really do hope that you make it so we can follow up on tagmashes and new books that are added to the zombie listmanias. I'm assuming that this would and could be done through the new connections news?

7/24/2007 8:40 AM  
Anonymous gemmation said...

First, this is so much fun!

Second... it took me a minute to find where I could normally do this from when you said: "Yes, you can do a tagmash from the tag page"....

I decided this meant you could do this by going to the search tab and searching tags from there. Is there any particular reason that you can't make a tag mashup work here too?

As it is if you go to the search tab you have to search from there for one tag and then use the search box on that tag's description page to do a tag mashup search.

Third: off to do some more and expand my wishlist at the same time!

7/24/2007 8:57 AM  
Anonymous thegreattim said...

Yeah, running into some errors myself...

Tagmash: Zombie, Non-Fiction =

Fatal error: Allowed memory size of 41943040 bytes exhausted (tried to allocate 71 bytes) in /var/www/html/ajax_tagmash.php on line 53

Now, did I get this error because these are mutually exclusive terms? I'm guessing so...

Any thoughts?

7/24/2007 9:03 AM  
Blogger Barbara said...

Okay, I'm dumb - where do you do this kind of search? No results show up when I use the "search" page. When I go to the tags page all I see are my own tags. When I click on one of the tags on the front page then I can do a search.

But ... it seems there must be a less round-about way of doing this - ?

7/24/2007 9:13 AM  
Anonymous Lilithcat said...

Here I am, highly amused at the fact that "Italy, Henry James" shows as a related tagmash, "leather, --sex", on which page one finds two books by John Preston among the top four!!!! Gosh, would he be startled!

I foresee hours of fun with this.

7/24/2007 9:17 AM  
Anonymous ellen.w said...

This looks fantastic. I'm sure there are problems to be ironed out but I can hardly wait to start playing.

I suspect anyone looking for zombie erotica who ends up with Laurell K. Hamilton will be disappointed, though.

7/24/2007 9:37 AM  
Anonymous Anonymous said...

A location field would make me remove all of the "personal" tags I've entered.

7/24/2007 12:19 PM  
Blogger Stephanie M. said...

Tagmash is a brilliant feature, I've been dying to see it on LT. It makes looking for new books to read so much easier. Thank you! Now if we had a series field and a way to separate books owned from not owned, without resorting to tags, I'd be quite perfectly satisfied. ;)

All of my junky personal tags begin with @, which I was under the impression was the way to hide those tags from all the others.

7/24/2007 1:48 PM  
Blogger Nathan said...

Tim,

This is great stuff. Thanks again for your work.

You said:

"Of course, both trees are equally artificial. David Weinberger writes how, in the real world, a leaf can be on many branches. But it's equally true that what's trunk and what's branch are largely about where you start--dirt or pinecone."

and

"Of course, tagmash only narrows the gap. It doesn't eliminate it. Tagmash: poetry, San Francisco still can't distinguish between poetry about and poetry from San Francisco--it involves whatever is tagged "San Francisco" and that's probably a mixed bag.*****"

and

"Well-planned and carefully executed subject systems have strengths that no ad hoc, regular-person system can match."

Sometimes the order of terms matters a lot.

For example,

History -- Philosophy

and

Philosophy -- History

Is order important with tag-mashing? I am not a LT user and so am not sure if I can mess around and find out if it is.

Thanks again!

7/24/2007 2:17 PM  
Blogger Tim said...

>Sometimes the order of terms matters a lot.

Yes. That's a good, simple example of when it does. Order does NOT matter on LT. It's key, however, not to think of tagmash as a truly separate scheme. It's based on regular tags. So, in this case, you have to ask yourself "why would someone tag something 'history' and why 'philosophy'?"

In fact, "philosophy, history" produces three mostly distinct groupings:

*History of Philosophy (A history of western philosophy by Bertrand Russell)
*Philosophy of History (What is history? by Edward Hallett Carr)
*"Old" philosophy, or philosophy that also sheds light on history (The Republic of Plato by Plato)

It's a good example of how tagmash can perpetuate (or even increase) ambiguity, rather than driving it out.

One interesting note: Tagmash doesn't care WHO tagged something with the tags. It doesn't give books more relevance if the SAME person tags it history and philosophy. That might have an interesting effect.

7/24/2007 3:17 PM  
Blogger Tim said...

I want to say one thing: by "junk" I don't mean that personal tags are junk on anything other than a large-scale algorithmic level. I FAVOR personal tags. The goal, therefore, is to figure out how to prevent the "unread" tag from, say, taking over the recommendations for Stephen Hawking's books!

7/24/2007 3:24 PM  
Anonymous SilentInAWay said...

This is really useful! Check out the results for the tagmash "pop-up, -children's"

7/24/2007 8:35 PM  
Anonymous hexmap said...

This is something I'd assumed was there already but hadn't the need to go look for it yet.

As a programmer with a great interest in tagging, I'm more interested in how the "includes" are generated ... "Includes: wwii, 2nd world war, second world war," etc. How do you know that ww11 is world war two and not world war eleven?

One thing that LT has really emphasized to me with each new feature is how people use the feature in ways beyond what I could even imagine possible. Far from "When all you have is a hammer, everything looks like a nail," they're thinking up dozens of new uses and none are wrong! Just like users want to decide what the lumps and splits are in addition to what the ontology says, users want to use the tools in "non-official" ways for what they are interested in doing.

7/25/2007 12:05 AM  
Anonymous sunny said...

(Includes: ... ) shows you which tags have been combined. Tag combination is done by members.

7/25/2007 2:15 AM  
Anonymous paperkingdoms said...

When you can, integration with the search page would be great -- as others have noted, you have to do some clicking to get oneself to a "tag" page to begin with.

It would also be really useful to get the scaled down, easy version -- I still can't search for a list of books I personally have tagged both X and Y. I can't be the only one who'd like to be able to pull up a list of *my* unread zombie erotica. ;^)

7/25/2007 3:29 AM  
Anonymous sunny said...

Search tab -> the field below "Your library":

tag:(zombie erotica unread)
or
tag:zombie tag:erotica tag:unread

See also link to 'advanced tips'.

Tim, how about displaying the advanced tips from the start? Doesn't it look as if way too many people missed the 'new' search options for their libraries?

7/25/2007 6:16 AM  
Anonymous sunny said...

> integration with the search page would be great

Tagmash now works from the 'tags' field under site search on the search page. :-D

7/25/2007 6:18 AM  
Anonymous lorax said...

hexmap, I'm afraid you'll be very disappointed in the answer to the "includes" aspect of tags -- it's all manual combination. Generally the rule of thumb is "don't combine unless it's REALLY OBVIOUS that they're the same thing -- 'world war two' and 'world war 2', for instance", so that, say, "sf" doesn't get combined with "science fiction" because it's possible someone could use the former to mean "san francisco".

7/25/2007 11:54 AM  
Anonymous Anonymous said...

i too reject the term "junk" tags - are we not free to tag how we wish? this is how i manage my library and need these tags - too bad if others don't need them.

9/13/2009 2:30 PM  
