Saturday, May 16, 2009

The Failure of Search (or the Fallacy of Abundance)

So, what's going on with web search? Why does it give such high rank to (at best) marginal material, such as this weblog's posts on certain topics? Or, as I asked earlier, why do we even feel that we get anything relevant when we perform a search on the Web? How much better material are we actually missing when we limit ourselves to the findings of a search engine?

It is in asking those sorts of questions that we can arrive at modest discoveries or at least novel explanations of what we see around us.

To further the investigation I reported earlier, I went back to the chapters on search in Hubert Dreyfus' little book, On the Internet. According to Dreyfus, given the immense size of the Net, it is "estimated that search engines can recall at most 2 per cent of the relevant sites." (The number might have changed in the last three years but I don't believe that the changes, if any, would affect the arguments in any drastic way.)

We need to ask why "content" (or "information") retrieval systems receive so much hype when they are hardly adequate for finding specific content. How could my weblogs, even if they are somewhat useful, be ranked as the third most useful or important content on certain scholars whom I've only occasionally quoted and on whose works I still consider myself a novice?

Surely, this sort of system behavior cannot be good if we hope to find important documents or bits of knowledge through search and information retrieval.

To explain the hype regarding search and information retrieval, Dreyfus quotes computer scientist David Blair, who cites information retrieval (IR) pioneer Don Swanson:

IR pioneer Don Swanson observed this phenomenon decades ago, and calls it the "fallacy of abundance". The fallacy of abundance is the mistake a searcher makes when he uses a large IR system and is able to find some useful documents. Swanson pointed out that on a sufficiently large system . . . almost any query will retrieve some useful documents. The mistake is to think that just because you got some useful documents the IR system is performing well. What you don't know is how many better documents the system missed.
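
To make the fallacy concrete, here is a minimal sketch in Python of the two standard IR measures involved: precision (how much of what you got is useful) and recall (how much of what is useful you got). The numbers are invented purely for illustration and are not taken from Dreyfus, Blair, or Swanson; I only chose them so that recall comes out at the 2 per cent figure quoted above.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved: int, relevant_in_collection: int) -> float:
    """Fraction of all relevant documents that were actually retrieved."""
    return relevant_retrieved / relevant_in_collection


# Hypothetical numbers, assumed for illustration only.
retrieved = 10                # results the searcher actually looks at
relevant_retrieved = 8        # of those, how many are useful
relevant_in_collection = 400  # useful documents that exist (unknown to the searcher)

p = precision(relevant_retrieved, retrieved)
r = recall(relevant_retrieved, relevant_in_collection)
missed = relevant_in_collection - relevant_retrieved

print(f"precision: {p:.0%}")   # what the searcher can see
print(f"recall:    {r:.0%}")   # what the searcher cannot see
print(f"relevant documents missed: {missed}")
```

With these made-up figures the searcher sees eight useful results out of ten and concludes the system works well, while it has in fact surfaced only a tiny fraction of the relevant material. The searcher can estimate precision by looking at the results; recall requires knowing what was never shown.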

And so . . . since my weblogs can be ranked highly by Google for certain subjects, they may be perceived (by some searchers) to be more important than they really are.

