If Librarians Ruled the Web (circa 1998)

The idea of librarians creating a “Reference Extract” for the Web — a “credibility engine” of linked Web pages based on how well they help answer users’ questions — has kinda sorta been tried before by Microsoft, who then proceeded to pretty much bungle it.

When I worked as an editor at MSN Search in the mid-1990s, the main goal of our staff was to organize the content of the World Wide Web around the language used in the most commonly entered search terms on our site. The leaders and many members of this team were librarians, and we used library-like language to describe our work. We created synsets (sets of synonymous search terms) with disambiguators to group and describe these keywords, then assigned Web sites or individual Web pages to those synsets in the ranked order we wanted them returned to the search engine's users.

For example, we might see a new term in our keyword logs one week: Saturn. First we’d need to determine whether that keyword was part of a larger keyphrase already in the database. Maybe we already had a synset for “Sega Saturn,” a sadly defunct video game console. If so, maybe we could add “Saturn” as a new keyword to that synset and call it a day? But does Saturn mean anything else? Of course. There’s Saturn (planet), Saturn (car manufacturer), and Saturn (Roman god). That’s three new synsets to create. Next we’d need to search the keyword logs for other queries that matched each of these concepts. Does a search for “Saturn’s rings” get its own synset, or can that query be added to Saturn (planet)? And where do queries like “Saturn facts” or “Saturn information” go? Once the synset work was done, we had to find the websites the editorial staff thought best answered each of these queries. We would even write the descriptions for these sites, so that a visitor to MSN Search who entered the search term “Saturn” received a lovingly hand-crafted set of the best results in return.
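
To make this concrete, here is a rough sketch in Python of how such a synset index might be modeled. To be clear, the class names, fields, and example picks below are my own inventions for illustration; this is a reconstruction of the idea, not Microsoft’s actual schema or code.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Pick:
    """One hand-chosen result: a URL plus an editor-written description."""
    url: str
    description: str


@dataclass
class Synset:
    """A disambiguated concept, the queries that map to it, and the results
    editors want returned for it, in hand-ranked order."""
    label: str                                    # e.g. "Saturn (planet)"
    keywords: set[str] = field(default_factory=set)
    results: list[Pick] = field(default_factory=list)


# A single query can map to more than one synset: "saturn" alone is ambiguous.
index: defaultdict[str, list[Synset]] = defaultdict(list)


def add_synset(label: str, keywords: set[str]) -> Synset:
    """Create a synset and register each of its keywords in the index."""
    synset = Synset(label=label, keywords=set(keywords))
    for keyword in keywords:
        index[keyword].append(synset)
    return synset


# "Sega Saturn" is already in the database...
sega = add_synset("Sega Saturn", {"sega saturn"})

# ...but the bare keyword "Saturn" means three more things, so it gets three
# new synsets rather than being tacked onto the console's.
planet = add_synset("Saturn (planet)",
                    {"saturn", "saturn's rings", "saturn facts"})
car = add_synset("Saturn (car manufacturer)", {"saturn"})
god = add_synset("Saturn (Roman god)", {"saturn"})

# Finally, editors attach ranked picks with hand-written descriptions.
planet.results.append(Pick(
    url="http://www.example.com/planets/saturn",  # placeholder, not a real pick
    description="An overview of the sixth planet and its rings.",
))

# A search for "saturn" now surfaces every matching concept.
print([s.label for s in index["saturn"]])
# ['Saturn (planet)', 'Saturn (car manufacturer)', 'Saturn (Roman god)']
```

The key design point is that a single keyword can map to several synsets, which is exactly what made a term like “Saturn” so labor-intensive to triage by hand.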

In hindsight, this approach had many flaws (Microsoft didn’t lose the search wars for nuthin’!). A team of 20 human editors can only tackle so much. We did not open the process up to other professionals — other librarians — to expand the number of credible contributors. We also tended to focus on new search terms and the most popular terms, so that once a search term was included in a synset it was rarely revisited. And — crucially, I think — Microsoft did not promote the fact that there were reliable, credible human beings doing this work. In fact, they hid it. PR materials would describe MSN Search as being powered by “SmartSense.” What was SmartSense, you ask? We were SmartSense! The 20 librarians, indexers, and editors in the editorial suite out in Redmond! We even had t-shirts made: “Ask me: I have SmartSense.”

As a technology company, Microsoft was proudest of its technology solutions: its crawler, its results engine, its throughput. And of course, the real goals of the service were more about selling banner advertising and sponsored links, and driving users to other MSN resources. But having started down the path of creating a credible, human-powered system for connecting people with the information they were looking for on the Web, Microsoft did not have the SmartSense to see it through.

–lori

2 comments

  1. Eric Likness says:

    As marketing terms go, I think SmartSense sounds way cooler than either PageRank or PigeonRank: http://www.google.com/technology/pigeonrank.html

  2. Matthew Gunby says:

    I wonder if a model based around a wiki, followed by a crawler, would work. For instance, if you search for Saturn on Wikipedia, the disambiguation page gives you a whole page of options (http://en.wikipedia.org/wiki/Saturn_%28disambiguation%29); if you could then feed those terms to a crawler specific to each meaning, you might not get the 181,000,000 results Google gives you. That said, is it economically in the interest of a company such as Google to get you to the god Saturn by the fastest possible route? Also, you can do some of this yourself by merely entering “Saturn Roman God” into the search engine. Honestly, the greatest value of this model is that Wikipedia already offers a wealth of keywords, along with the multiple meanings of those keywords, letting a researcher or general searcher navigate easily. It also creates an equality of input, instead of leaving a select few to do work that literally thousands would be unable to keep up with. The social dynamic adds further possibilities, because a term such as “crawler” means something very different in computer language than it does to the general public.

    I share your view that people can have better sense than even a sophisticated algorithm much of the time, and I would even say it would be of great value to have some librarians focusing the effort at the head of such a project. I admit I know little about the computational aspect, or how you could program a very sophisticated automated search with sufficient precision. I believe it is likely possible, but I reiterate: is it in the best economic interest of the search engines to allow for this precise search tool free of charge?
