In my last post I discussed the count-min sketch data structure that can be used to process data streams using sub-linear space. In this post I will continue with some of my thoughts on how count-min sketches can be used in a typical event sourced application architecture. An event sourcing system typically has a query model which provides a read only view of how all the events are folded to provide a coherent view of the system. I have seen applications where the query model is typically rendered from a relational database. And the queries can take a lot of time to be successfully processed and displayed to the user if the data volume is huge. And when we are talking about Big Data, this is not a very uncommon use case.
Instead of rendering the query from the RDBMS, quite a few types of them can be rendered from a count-min sketch using sub-linear space. Consider the use case where you need to report the highest occuring user-ids in a Twitter stream. The stream is continuous, huge and non ending and you get to see each item once. So you get each item from where you parse out the user-id occurring in it and update the sketch. So each entry of the sketch contains the frequency of the user-id that hashes to that slot. And we can take the minimum of all the slots to which a user-id hashes to, in order to get the frequency of that user-id. The details of how this works can be found in my last post.
Consider the case where we need to find the heavy-hitters - those user-ids whose frequency exceeds a pre-determined threshold. For that, in addition to the sketch we can also maintain a data structure like heap or tree where we update the top-k heavy hitters. When a user-id appears, we update the sketch, get its estimated frequency from the sketch and if it exceeds the threshold, also record it in the data structure. So at any point in time we can probe this accessary data structure to get the current heavy-hitters. Spark examples contain a sample implementation of this heavy hitters query from a Twitter stream using the CountMinSketchMonoid of Algebird.
Can this be a viable approach of implementing the query model in an event sourced system if the use case fits the approximation query approach ? It can be faster, relatively cheap in space and can prove to be responsive enough to be displayed in dashboards in the form of charts or graphs.