I have something to say about that…

Searching: then and now

Let’s talk about information retrieval algorithms.

I’ve been comparing search engines, looking for something that can handle a large amount of unstructured data spread across a lot of repositories. Now, option 1 says it operates on the probabilistic model of information retrieval (a description of this model is in this paper: part 1 and part 2), though the implementers are extremely vague about exactly how they’re using it.

As far as I can tell, the probabilistic model creates a score for each document based on the probabilities of each of your search terms appearing in it. Each matching term contributes a weight derived from those probabilities (roughly, how much more likely the term is to turn up in a relevant document than in a non-relevant one), the weights are added up, and the total is the matching score you can then use to rank this document against other documents.
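To make that concrete, here’s a minimal sketch in Python of my reading of the model. The probability estimates, the weight formula (a Robertson/Sparck Jones-style log-odds weight) and the example document are all invented for illustration; this is not option 1’s actual code.

```python
import math

def relevance_weight(p_term_given_relevant, p_term_given_nonrelevant):
    """Log-odds weight for one term: high when the term is much more
    likely to appear in relevant documents than in non-relevant ones."""
    p, q = p_term_given_relevant, p_term_given_nonrelevant
    return math.log((p * (1 - q)) / (q * (1 - p)))

def score(query_terms, document_terms, weights):
    """Matching score: sum the weights of the query terms that
    actually occur in the document."""
    return sum(weights[t] for t in query_terms if t in document_terms)

# Made-up probability estimates for two terms (illustration only).
weights = {
    "smith": relevance_weight(0.8, 0.1),
    "litigation": relevance_weight(0.7, 0.5),
}

# A document containing "smith" but not "litigation".
print(score(["smith", "litigation"], {"smith", "contract"}, weights))
```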

In this implementation, they then weight the search terms that are rarest across all the documents (so that if you’re in a law firm and search for “Smith litigation”, “Smith” will be more important than “litigation”: your firm will probably have a lot of documents using the term “litigation”, so it isn’t as useful for picking out the ones you need). It then normalises for document length, balances repeated terms (so that searching for “smith smith litigation” doesn’t mean it looks for documents containing “smith” twice as often) and trims words to their stems using something like the Porter stemming algorithm.
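Put together, that looks a lot like the BM25-style scoring the Okapi/City line of work led to. Here’s a hedged sketch of that kind of function; the toy corpus, the parameter values and the de-duplication of repeated query terms are my own choices for illustration, not anything option 1 documents (and I’ve skipped stemming for brevity).

```python
import math
from collections import Counter

# Toy corpus: term counts per document (after lower-casing and,
# ideally, Porter-style stemming -- omitted here).
corpus = {
    "doc1": Counter({"smith": 3, "litigation": 5, "contract": 2}),
    "doc2": Counter({"litigation": 7, "appeal": 4}),
    "doc3": Counter({"smith": 1, "merger": 6}),
    "doc4": Counter({"litigation": 2, "discovery": 3}),
}

N = len(corpus)
avg_len = sum(sum(c.values()) for c in corpus.values()) / N

def idf(term):
    """Rarer terms across the collection score higher, so 'smith'
    outweighs 'litigation' in a firm full of litigation documents."""
    n = sum(1 for counts in corpus.values() if term in counts)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25_score(query_terms, doc_counts, k1=1.2, b=0.75):
    """Score one document: rarity-weighted terms, with term-frequency
    saturation (k1) and document-length normalisation (b)."""
    doc_len = sum(doc_counts.values())
    total = 0.0
    # De-duplicate query terms so "smith smith litigation" scores
    # the same as "smith litigation".
    for term in set(query_terms):
        tf = doc_counts.get(term, 0)
        if tf == 0:
            continue
        saturation = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        total += idf(term) * saturation
    return total

query = "smith smith litigation".lower().split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, corpus[d]), reverse=True)
for name in ranked:
    print(name, round(bm25_score(query, corpus[name]), 3))
```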

Okay, now I’ll admit I’m still learning here. But this algorithm isn’t new: this ‘City model’ of probabilistic retrieval was initially proposed by Robertson and Sparck Jones in 1976 (‘Relevance weighting of search terms’, Journal of the American Society for Information Science, 27, 129-146). Is this still as good as we can get?

[Tune in next time: we’ll weigh this up against probabilistic latent semantic analysis and I can finally get around to asking my question!]