- Each slave process prepares a bloom filter for the join condition that it is processing
- It then passes the bloom filter on to the other slave processes, which can apply it to their own selected sets of records before passing the reduced set to the join coordinator
Remember that each of the above processes may run on a separate node in a distributed environment - hence the technique moves less data across nodes, trading some extra CPU cycles for savings in network bandwidth. This paper describes all of this in full detail with illustrative examples.
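To make the flow concrete, here is a minimal sketch in Java of the bloom join idea. Everything in it is illustrative and assumed, not taken from the paper: the hand-rolled filter, the double-hashing scheme, and names like keysOnA and rowsOnB.

```java
import java.util.BitSet;
import java.util.List;

public class BloomJoinSketch {

    // A simple Bloom filter over an m-bit array with k hash probes.
    static class BloomFilter {
        final BitSet bits;
        final int m, k;

        BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

        // Derive k probe positions from two hashes (the usual double-hashing trick).
        void add(String key) {
            int h1 = key.hashCode();
            int h2 = h1 >>> 16 | 1;  // force odd so the probes spread out
            for (int i = 0; i < k; i++)
                bits.set(Math.floorMod(h1 + i * h2, m));
        }

        boolean mightContain(String key) {
            int h1 = key.hashCode();
            int h2 = h1 >>> 16 | 1;
            for (int i = 0; i < k; i++)
                if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false;
            return true;  // may be a false positive, never a false negative
        }
    }

    public static void main(String[] args) {
        // One slave builds a filter over the join keys it holds ...
        BloomFilter filter = new BloomFilter(1 << 16, 4);
        List<String> keysOnA = List.of("cust-1", "cust-7", "cust-42");
        keysOnA.forEach(filter::add);

        // ... and ships only the filter to the other slave, which drops
        // rows that cannot possibly join before sending anything back.
        List<String> rowsOnB = List.of("cust-7", "cust-9", "cust-42", "cust-99");
        rowsOnB.stream()
               .filter(filter::mightContain)  // survivors may include false positives
               .forEach(k -> System.out.println("ship to coordinator: " + k));
    }
}
```

The point of the trade-off shows up in the last step: only the handful of surviving rows cross the network, while the filter itself is a fixed-size bit array no matter how many keys went into it.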
This idea of serializing bloom filters instead of the data set itself has been used quite extensively in load balancing MapReduce operations, to minimize intermediate results before sending everything across the network for final aggregation. In the case of processing distributed join operations, we may need to compose multiple bloom filters to get the final data set. Bloom joins, as they are called, allow cheap serialization of filters over the wire by employing some clever techniques like linear hash tables and multi-tier bloom filters, as described in this paper in Comonad Reader.
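Setting aside the fancier linear hash table and multi-tier variants from the paper, here is a rough sketch of what the basic composition step can look like: two filters built with the same size and hash functions combine by plain bitwise operations, and the composite serializes to a compact byte array for the wire. This is an assumed, simplified model, not the paper's construction.

```java
import java.util.BitSet;

public class BloomCompose {
    public static void main(String[] args) {
        // Filters from two slaves, built with identical size and hashes
        // (a prerequisite for composing them at all).
        BitSet filterA = new BitSet(1 << 16);
        BitSet filterB = new BitSet(1 << 16);
        filterA.set(42); filterA.set(4711);
        filterB.set(42); filterB.set(99);

        // OR-ing the bit arrays yields exactly the filter of the union ...
        BitSet union = (BitSet) filterA.clone();
        union.or(filterB);

        // ... while AND-ing only approximates the intersection: it admits
        // somewhat more false positives than a filter built from scratch.
        BitSet intersection = (BitSet) filterA.clone();
        intersection.and(filterB);

        // Either composite ships as a compact byte[].
        byte[] wireFormat = union.toByteArray();
        System.out.println("bytes on the wire: " + wireFormat.length);
    }
}
```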
Bloom joins can also be used effectively in MapReduce processing with CouchDB: the map phase can produce the bloom filters, which can then be joined in the reduce phase.

In a recent application, I needed to store a very large list mainly for set operations. Instead of storing the individual elements, I decided to store a bloom filter that fit nicely into a memcached slab. I could pull it out, do all sorts of bit operations on it easily, and it is blindingly fast. Next time you decide to distribute your huge list in Terracotta, think back - there may be a lighter-weight option in distributing a bloom filter instead. There are use cases where you will only be doing membership checks, and a few false positives may not do much harm.
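For the memcached trick, a minimal sketch, assuming the filter is kept as a java.util.BitSet; the key name "biglist:bloom" and the commented-out cache calls are hypothetical placeholders for whatever client you use.

```java
import java.util.BitSet;

public class BloomInCache {
    public static void main(String[] args) {
        // A 1M-bit filter is at most ~128 KB serialized - slab-friendly.
        BitSet bloom = new BitSet(1 << 20);
        bloom.set(12345);   // pretend these bits came from hashing list elements
        bloom.set(67890);

        // Store the raw bits instead of the list itself:
        byte[] payload = bloom.toByteArray();
        // cache.set("biglist:bloom", 0, payload);   // hypothetical memcached call

        // Later, pull it back and run membership checks locally:
        // byte[] fetched = (byte[]) cache.get("biglist:bloom");
        BitSet restored = BitSet.valueOf(payload);
        System.out.println("bit 12345 set? " + restored.get(12345));
    }
}
```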