Sunday, March 01, 2009

Preemptive Bloom Filter

Michael Mitzenmacher describes a great trick that combines a Bloom filter with a hash table; it has been used in Longest Prefix Matching techniques to improve performance by reducing the number of dependent memory lookups.

Just in case you are not familiar with Bloom filters, have a look at the great introduction on Wikipedia:
The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
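
To make the idea concrete, here is a minimal Bloom filter sketch in Python (not from the post; the parameters m and k are illustrative, and salted SHA-256 is just one simple way to derive k hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m, k):
        self.m = m                    # number of bits
        self.k = k                    # number of hash functions
        self.bits = bytearray(m)      # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k bit positions by salting SHA-256 with the index;
        # real implementations use cheaper hash families.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False is definitive; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```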

Suppose you are doing lookups of large keys in a slow hash table, and the lookups are primarily negative searches, i.e. in most cases the lookup will fail. You can then front the (slow) hash table with a Bloom filter over its keys, kept in fast memory. Since most of your lookups are likely to fail, you potentially save a lot by avoiding access to the hash table in slower memory. Of course, the Bloom filter can return false positives (remember, it never gives false negatives), in which case you still have to look up the hash table.
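
A sketch of that front-end pattern, reusing the BloomFilter above (slow_table is a stand-in for whatever slow-memory hash table you actually have):

```python
def lookup(bloom, slow_table, key):
    # The filter never yields false negatives, so a miss here is
    # definitive and the slow lookup is skipped entirely.
    if not bloom.might_contain(key):
        return None
    # A filter hit may be a false positive, so confirm against the
    # real (slow) hash table.
    return slow_table.get(key)
```

The one invariant to maintain: every key inserted into slow_table must also be add()-ed to the filter, or the no-false-negatives guarantee breaks.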

I was wondering whether the same technique can be generalized to lookups in database tables. For each set of indexes typically used to search a table, we could keep a separate Bloom filter in fast memory, cutting down accesses to the disk-based tables. Of course, this only pays off for tables where lookup failures outweigh successes, as in Michael's examples of checking URLs against a blacklist or packets against dangerous signatures.
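
As a sketch of what that might look like (entirely hypothetical; query_db stands in for the real database lookup, and keeping the filter in sync on inserts is the application's job):

```python
class BloomFrontedIndex:
    """Hypothetical wrapper: a Bloom filter over one indexed column,
    consulted before issuing a query against the disk-based table."""

    def __init__(self, query_db, m=1 << 24, k=5):
        self.bloom = BloomFilter(m, k)   # size m for the expected key count
        self.query_db = query_db         # stand-in for the real DB lookup

    def record_insert(self, key):
        # Must be called for every inserted row, or the filter would
        # start producing false negatives; deletes need a counting filter.
        self.bloom.add(key)

    def find(self, key):
        if not self.bloom.might_contain(key):
            return None                  # definite miss: no disk access
        return self.query_db(key)        # possible hit (or false positive)
```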

A nice trick to have up your sleeve ..

1 comment:

Timothy Wee said...

Great post!

A comment on the latter part of the article: if I understand Bloom filters correctly, building them is not a trivial operation. Putting them in front of a database where the data isn't static might get expensive.. but I guess they can become another tool, just like indexes, to optimize for certain usage scenarios and data characteristics.