I have something to say about that…

The pub: Fingerprint first, then your beer

This article from Friday’s Register, Beer Fingerprints to go UK-wide, tells us that South Somerset District Council’s pilot scheme of fingerprinting patrons of local pubs seems to have led to a 48% drop in alcohol-related crime between February and September 2006.

From the article: ‘Offenders can be banned from one pub or all of them for a specified time – usually a period of months – by a committee of landlords and police called Pub Watch. Their offences are recorded against their names in the fingerprint system. Bradburn [principal licensing manager at South Somerset District Council] noted the system had a “psychological effect” on offenders.’

Apparently the Government is so impressed that they’re willing to fund the scheme for ‘councils that want to have their pubs keep a regional black list of known trouble makers’. The Home Office have agreed to fund similar systems in Coventry, Hull and Sheffield, while general funding for the rest of the local authorities is to come from the Department for Communities and Local Government’s Safer, Stronger Communities budget. The article says that the DCLG is distributing the funds through local area agreements (there are descriptions of these from both the central government side and the local government side).

This news article fills me with so many questions I’m not sure where to begin.

  • Why is no one else reporting on this activity? I’ve had a look at South Somerset’s, the DCLG’s and the Home Office’s sites and wasn’t able to find any news on this scheme (though that may be the fault of the search technologies they’re each using: the results I did get weren’t, in general, particularly relevant). I’ve also checked Google News, and there’s nothing there either.
  • Where is the fingerprint data going? What kinds of Data Protection Act considerations have been made? How easy will it be to find out that your no-good lazy husband was in fact having a pint when he said he’d be at work late? And what about that female fingerprint in the database just before or just after him?
  • The previous point of course brings up all kinds of data-sharing questions within the government too. As seen in the Climbié debacle, the Government isn’t fantastic at sharing information when it needs to. Does that help or hurt this scheme?
  • How much has business fallen for these pubs? Is all of South Somerset okay with this?

If anyone knows more about what’s going on here, I’d love to be caught up. How bizarre that this story has slid in under the radar.

Searching: then and now

Let’s talk about information retrieval algorithms.

I’ve been comparing search engines, looking for something suitable for a large amount of unstructured data from a lot of repositories. Now, option 1 says it operates on the probabilistic model of information retrieval (a description of this model is in this paper: part 1 and part 2), though the implementers are extremely vague on exactly how they’re using it.

As far as I can tell, the probabilistic model creates a score for each document based on the probabilities of each of your search terms being in that document. Probability of term 1 being there + probability of term 2 being there (etc) = matching score, which you can then use to rank this document against other documents.
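To make that concrete for myself, here is a minimal sketch of the "add up the per-term probabilities" idea. It assumes the probability of a term "being there" is simply the term's share of the words in the document. That estimate is my own guess, not something the implementers have confirmed, and the function names and toy documents are made up for illustration.

```python
from collections import Counter


def term_probability(term, doc_tokens):
    """Estimate P(term | document) as the term's share of the document's words.

    This estimate is an assumption made for illustration only.
    """
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def matching_score(query_terms, doc_tokens):
    """Add up the per-term estimates to get one score for the document."""
    return sum(term_probability(t, doc_tokens) for t in query_terms)


def rank(query, docs):
    """Return document names ordered from best to worst matching score."""
    query_terms = query.lower().split()
    scores = {
        name: matching_score(query_terms, text.lower().split())
        for name, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus, invented for this example.
docs = {
    "memo": "smith litigation update for the smith file",
    "policy": "litigation policy for all staff",
    "lunch": "menu for the staff lunch",
}
print(rank("smith litigation", docs))
```

In this toy version every search term counts equally and longer documents are at a disadvantage, which is presumably why a real implementation layers more on top.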

In this implementation, they then give more weight to the search terms that are rarest across all the documents. So if you’re in a law firm and search for “Smith litigation”, “Smith” will be more important than “litigation”: your firm will probably have a lot of documents using the term “litigation”, so it won’t be as useful for picking out the docs you need. It then normalises for document length, balances repeated terms (so that searching for “smith smith litigation” doesn’t mean it looks for documents with “smith” twice as often) and trims words to their stems using something like the Porter Stemming algorithm.
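Here is a rough sketch of how those four refinements might fit together. Since the implementers are vague, everything here is an assumption on my part: I've borrowed the well-known BM25-style formula for the rarity weight and the length normalisation, used the conventional `k1` and `b` defaults, collapsed repeated query terms with a set, and swapped in a toy suffix stripper where a real Porter stemmer would go. The corpus is invented.

```python
import math
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case, split on whitespace and stem each word."""
    return [crude_stem(w) for w in text.lower().split()]


def rarity_weight(term, docs_tokens):
    """Smoothed inverse-document-frequency weight: rarer terms score higher."""
    n_docs = len(docs_tokens)
    n_containing = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)


def score(query, doc_tokens, docs_tokens, k1=1.2, b=0.75):
    """BM25-style score; k1 and b are conventional defaults, not the vendor's."""
    avg_len = sum(len(t) for t in docs_tokens) / len(docs_tokens)
    # Longer-than-average documents get a larger divisor, damping their score.
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    counts = Counter(doc_tokens)
    total = 0.0
    # set() collapses repeated query terms, so "smith smith" counts once.
    for term in set(tokenize(query)):
        tf = counts[term]
        total += rarity_weight(term, docs_tokens) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return total


# Hypothetical law-firm corpus: "litigation" is everywhere, "smith" is rare.
corpus = [tokenize(t) for t in (
    "smith litigation bundle",
    "litigation costs",
    "litigation training memo",
    "brown litigations",
)]
for doc in corpus:
    print(doc, round(score("smith litigation", doc, corpus), 3))
```

With this corpus, “smith” gets a much larger weight than “litigation” because only one of the four documents contains it, which is the behaviour the law firm example is after.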

Okay, now I’ll admit I’m learning. But this algorithm isn’t new: this ‘City model’ of the probabilistic model was initially proposed by Robertson and Sparck Jones in 1976 (‘Relevance weighting of search terms’, Journal of the American Society for Information Science, 27, 129-146). Is this still as good as we can get?

[Tune in next time: we’ll weigh this up against probabilistic latent semantic analysis and I can finally get around to asking my question!]