In my last post I discussed the count-min sketch, a data structure that can process data streams in sub-linear space. In this post I will continue with some thoughts on how count-min sketches can be used in a typical event sourced application architecture. An event sourcing system typically has a query model which provides a read-only view of how all the events are folded into a coherent view of the system. I have seen applications where the query model is rendered from a relational database, and the queries can take a long time to be processed and displayed to the user if the data volume is huge. When we are talking about Big Data, this is not an uncommon use case.
Instead of rendering the query from the RDBMS, quite a few kinds of queries can be rendered from a count-min sketch in sub-linear space. Consider the use case where you need to report the most frequently occurring user-ids in a Twitter stream. The stream is continuous, huge and unending, and you get to see each item only once. From each item you parse out the user-id and update the sketch, so each slot of the sketch holds a count of the items that hash to it. Taking the minimum over all the slots to which a user-id hashes gives an estimate of that user-id's frequency. The details of how this works can be found in my last post.
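To make the update/estimate mechanics concrete, here is a minimal hand-rolled sketch in Scala. This is an illustration, not a production implementation: the width and depth below are arbitrary picks, whereas a real implementation derives them from the error bound and failure probability.

import scala.util.hashing.MurmurHash3

// A minimal, hand-rolled count-min sketch, just to illustrate the
// update/estimate mechanics described above. Width and depth are
// illustrative; a real implementation derives them from (eps, delta).
class CountMinSketch(width: Int, depth: Int) {
  private val counts = Array.ofDim[Long](depth, width)

  // One hash function per row, seeded by the row index.
  private def bucket(item: String, row: Int): Int =
    (MurmurHash3.stringHash(item, row) & Int.MaxValue) % width

  // Seeing an item once: bump one counter in every row.
  def update(item: String): Unit =
    for (row <- 0 until depth) counts(row)(bucket(item, row)) += 1

  // Point query: the minimum across rows never underestimates the
  // true count, and overestimates only on hash collisions.
  def estimate(item: String): Long =
    (0 until depth).map(row => counts(row)(bucket(item, row))).min
}

val cms = new CountMinSketch(width = 1000, depth = 5)
Seq("alice", "bob", "alice", "carol", "alice").foreach(cms.update)
println(cms.estimate("alice")) // 3 (an upper bound, never an underestimate)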
Consider the case where we need to find the heavy hitters - those user-ids whose frequency exceeds a pre-determined threshold. For that, in addition to the sketch we can maintain an auxiliary data structure like a heap or a tree holding the top-k heavy hitters. When a user-id appears, we update the sketch, get its estimated frequency from the sketch and, if it exceeds the threshold, also record it in the auxiliary structure. So at any point in time we can probe this auxiliary data structure to get the current heavy hitters. The Spark examples contain a sample implementation of this heavy-hitters query over a Twitter stream using the CountMinSketchMonoid of Algebird.
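A condensed sketch along the lines of that Spark example. Note this follows the older Algebird API, where CountMinSketchMonoid tracks the heavy hitters inside the sketch itself (newer Algebird releases moved this into TopPctCMS); the parameter values and the stand-in stream are illustrative.

import com.twitter.algebird._

// Condensed from the idea in Spark's TwitterAlgebirdCMS example.
// Constructor and method names follow the older Algebird API; newer
// releases track heavy hitters via TopPctCMS. Values are illustrative.
val eps = 0.001             // additive error as a fraction of the total count
val delta = 1e-3            // probability of exceeding the error bound
val seed = 1
val heavyHittersPct = 0.01  // report ids above 1% of the total count

val cmsMonoid = new CountMinSketchMonoid(eps, delta, seed, heavyHittersPct)

// Fold a stream of user-ids into a single CMS; ++ is the monoid plus.
val userIds: Iterator[Long] = Iterator(42L, 42L, 7L, 42L) // stand-in stream
val cms: CMS = userIds.map(cmsMonoid.create).reduce(_ ++ _)

println(cms.heavyHitters)            // user-ids above the threshold so far
println(cms.frequency(42L).estimate) // estimated count for a single id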
Can this be a viable approach to implementing the query model in an event sourced system, if the use case fits approximate queries? It can be faster, relatively cheap in space, and responsive enough for dashboards in the form of charts or graphs.
3 comments:
Nice post. Consider an analytics system where you're computing real-time reports over fairly large data coming in as streams. You might need to compute heavy hitters for threshold detection using a CMS, discard bad IP addresses using Bloom filters, and so on. At the same time you might want some exact, i.e. non-approximate, results. The common way of doing this these days is a lambda architecture, with Hadoop for batch (exact) views and a real-time layer with Storm or another stream processor. Or just use Shark and Spark Streaming together, or Summingbird, for instance.
This architecture, called the lambda architecture, resembles an event sourcing architecture a lot, to me.
We deployed a similar system for a real-time risk system on an algo trading platform, and sketches were useful as the number of data sources was constantly growing, with data coming in at an uneven rate.
Sketches, and approximate data structures in general, are in the era of big data the equivalent of the Swiss Army knife for real-time analytics.
+1 on your thoughts. Lambda architecture looks quite powerful and we are also in the process of implementing one.
And clever data structures like the Count-Min Sketch, the AMS Sketch and Bloom filters (the counting version of a Bloom filter is also a sketch) provide Swiss Army knife type capabilities for certain use cases.
In fact I see charts and graphs today being generated from an RDBMS where an in-memory succinct data structure like a sketch would be much more powerful. Sketches also provide some benefits over sampling - hence I am really exploring the possibilities of using the power of sketches as part of a lambda architecture.
Another advantage of sketches is their linearity - the fact that merging is associative. Think monoids and you have a great use case for distributed systems: collect summaries from individual nodes and just mappend them.
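For instance, with the same older Algebird API as above, partial sketches built independently on two nodes can be combined with the monoid's plus, and the merged result answers queries as if it had seen both streams. The node names and parameters here are made up for illustration.

import com.twitter.algebird._

// Hypothetical partial sketches built on two nodes, merged at a
// coordinator. Same older Algebird API as above; names illustrative.
val cmsMonoid = new CountMinSketchMonoid(0.001, 1e-3, 1)

val nodeA: CMS = Seq(1L, 2L, 1L).map(cmsMonoid.create).reduce(cmsMonoid.plus)
val nodeB: CMS = Seq(1L, 3L).map(cmsMonoid.create).reduce(cmsMonoid.plus)

// mappend: merging the summaries is equivalent to sketching the
// union of the two streams.
val global: CMS = cmsMonoid.plus(nodeA, nodeB)
println(global.frequency(1L).estimate) // ~3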
This already exists and is being used in production. Take a look at Storehaus (https://github.com/twitter/storehaus) and especially MergeableStore. There are implementations available for most common data stores, and when you insert a value, it uses a user-provided monoid to merge the new value with the existing value. So you can use a CM Sketch, HyperLogLog or any monoid your heart desires.
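The gist of that idea, as a toy synchronous in-memory stand-in rather than Storehaus's actual API (which is asynchronous and backed by real stores):

import com.twitter.algebird.Monoid

// Not Storehaus itself: a toy, synchronous, in-memory stand-in that
// illustrates the MergeableStore idea - writes merge the new value
// into the stored one via a user-provided monoid.
class ToyMergeableStore[K, V](implicit monoid: Monoid[V]) {
  private var data = Map.empty[K, V]

  // Merge the incoming value with whatever is stored under the key.
  def merge(key: K, value: V): Unit = {
    val merged = data.get(key).fold(value)(monoid.plus(_, value))
    data += (key -> merged)
  }

  def get(key: K): Option[V] = data.get(key)
}

// Any monoid works as the value: Long counters here, but a CMS or a
// HyperLogLog monoid drops in the same way.
val store = new ToyMergeableStore[String, Long]
store.merge("clicks", 3L)
store.merge("clicks", 4L)
println(store.get("clicks")) // Some(7)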