In my last post I discussed the count-min sketch data structure, which can be used to process data streams in sub-linear space. In this post I will continue with some of my thoughts on how count-min sketches can be used in a typical event sourced application architecture. An event sourcing system typically has a query model, which provides a read-only view of how all the events are folded into a coherent view of the system. I have seen applications where the query model is rendered from a relational database. When the data volume is huge, such queries can take a long time to be processed and displayed to the user. And when we are talking about Big Data, this is not an uncommon situation.

Instead of rendering the query from the RDBMS, quite a few types of queries can be served from a count-min sketch using sub-linear space. Consider the use case where you need to report the most frequently occurring user-ids in a Twitter stream. The stream is continuous, huge and never-ending, and you get to see each item only once. For each item you parse out the user-id occurring in it and update the sketch. Each entry of the sketch then holds the combined count of all the user-ids that hash to that slot. To estimate the frequency of a user-id, we take the minimum over all the slots to which that user-id hashes. The details of how this works can be found in my last post.

Consider the case where we need to find the heavy-hitters - those user-ids whose frequency exceeds a pre-determined threshold. For that, in addition to the sketch we can also maintain a data structure like a heap or a tree in which we keep the top-k heavy hitters. When a user-id appears, we update the sketch, get its estimated frequency from the sketch and, if it exceeds the threshold, also record it in that data structure. So at any point in time we can probe this auxiliary data structure to get the current heavy-hitters. The Spark examples contain a sample implementation of this heavy hitters query over a Twitter stream using the CountMinSketchMonoid of Algebird.
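The heavy-hitters bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names (`cell`, `observe`, `heavy`), the table size, the threshold and the synthetic stream are all made up, a plain dict stands in for the heap or tree, and a salted BLAKE2b hash stands in for a proper pairwise-independent family. It is not the Spark or Algebird implementation.

```python
import hashlib
import heapq

WIDTH, DEPTH = 256, 4          # arbitrary sketch size for the illustration
THRESHOLD = 50                 # hypothetical heavy-hitter threshold
count = [[0] * WIDTH for _ in range(DEPTH)]
heavy = {}                     # user_id -> estimate when it was last seen


def cell(row, user_id):
    # Stand-in hash: BLAKE2b salted with the row number (an assumption,
    # not the pairwise-independent family the analysis asks for).
    digest = hashlib.blake2b(user_id.encode(), digest_size=8,
                             salt=row.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % WIDTH


def estimate(user_id):
    # Minimum over the one counter per row that this user_id maps to.
    return min(count[row][cell(row, user_id)] for row in range(DEPTH))


def observe(user_id):
    # Update the sketch, then record the id if its estimate is over the threshold.
    for row in range(DEPTH):
        count[row][cell(row, user_id)] += 1
    est = estimate(user_id)
    if est > THRESHOLD:
        heavy[user_id] = est


# Synthetic stream: two frequent users and 500 users seen once each.
stream = ["alice"] * 120 + ["bob"] * 80 + [f"user{i}" for i in range(500)]
for uid in stream:
    observe(uid)

# Top-k over the (small) candidate set rather than over the whole stream.
top = heapq.nlargest(2, heavy.items(), key=lambda kv: kv[1])
print(top)
```

Because the counters only ever overestimate, a true heavy hitter is never missed; the price is that an infrequent id can occasionally be reported if it collides with frequent ones in every row.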

Can this be a viable approach to implementing the query model in an event sourced system, if the use case can tolerate approximate answers? It can be faster and relatively cheap in space, and can prove responsive enough for the results to be displayed in dashboards in the form of charts or graphs.

In today's age of Big Data, streaming is one of the techniques for low latency computing. Besides the batch processing infrastructure of the map/reduce paradigm, we are seeing a plethora of ways in which streaming data is processed in near real time to cater to some specific kinds of applications. Libraries like Storm, Samza and Spark belong to this genre and are starting to get their share of the user base in the industry today.

This post is not about Spark, Storm or Samza. It's about a data structure which is one of the relatively new entrants in the domain of stream processing, which is simple to implement, but has already proved to be of immense use in serving a certain class of queries over huge streams of data. I have been doing some reading about the application of such structures and thought of sharing it with the readers of my blog.

Using Sublinear Space

Besides data processing, these tools also support data mining over streams, which includes serving specialized queries over data using limited space and time. Of course, if we store all the data as it arrives, we can always serve queries with O(n) space. But since we are talking about huge data streams, it may not even be possible to run algorithms on the full set of data - it will simply be too expensive. Even if we have the entire set of data in a data warehouse, processing all of it may take more time and resources than we can afford, considering the fees charged under the evolving platform-as-a-service models of cloud based infrastructure. Also, since these algorithms will be working on data streams, there's a high likelihood that they will get to see the data only in a single pass. The bottom line is that we need algorithms that work in sub-linear space.

Working in sublinear space implies that we don't get to store or see all the data - hence an obvious conclusion is that we also don't get to deliver an exact answer to some queries. We rely on approximation techniques and deliver an answer that is accurate to within a stated error with a reasonably high probability. We don't store all the data; instead we store a lossy compressed representation of it and serve user queries from this summary instead of from the entire set.

One widely used technique for storing a subset of data is through Random Sampling, where the data stored is selected through some stochastic mechanism. There are various ways to determine which data we select for storing and how we build the estimator for querying the data. There are pros and cons with this approach, but it's one of the simplest ways to do approximation based queries on streaming data.
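As one concrete instance of such a stochastic mechanism, here is a minimal Python sketch of reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of k items from a stream of unknown length in a single pass. Reservoir sampling is not named in the text above; it is swapped in here purely as an example, and the function name, seed and stream are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a single pass over stream."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Estimate the fraction of even numbers in a large stream from 100 samples.
sample = reservoir_sample(range(100_000), k=100)
estimated_even_fraction = sum(1 for x in sample if x % 2 == 0) / len(sample)
print(estimated_even_fraction)
```

The estimator here is just the sample proportion; rare items are easily missed by a small sample, which is one of the typical weaknesses of sampling.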

There are a few other options like Histograms and Wavelet based synopses. But one of the most interesting data structures to have been developed in recent times is the Sketch, which uses summary based techniques for serving approximate queries, gets around the typical problems that sampling techniques have and is highly parallelizable in practice.

An important class of sketch is one where the sketch vector (which is the summary information) is a linear transform of the input vector. So if we model the input as a vector, we can multiply it by a sketch matrix to obtain the sketch vector, which contains the synopsis data that we can use for serving approximate queries. Here's a diagrammatic representation of the sketch as a linear transform of the input data.
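To make the linear-transform view concrete, here is a small Python sketch. It assumes a particular (hypothetical) sketch matrix in which every column has a single 1 in a randomly chosen row, which is what hashing each input position to one counter amounts to. The point it illustrates is linearity: the sketch of a sum of two input vectors equals the sum of their sketches.

```python
import random


def sketch_matrix(rows, cols, seed=0):
    # Each column gets a single 1 in a random row: input position `col`
    # is "hashed" to exactly one counter of the sketch vector.
    rng = random.Random(seed)
    matrix = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        matrix[rng.randrange(rows)][col] = 1
    return matrix


def project(matrix, vector):
    # Plain matrix-vector product: sketch = matrix * vector.
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


S = sketch_matrix(rows=4, cols=10)
a = [3, 0, 1, 0, 0, 7, 0, 2, 0, 0]
b = [0, 5, 0, 0, 1, 0, 0, 0, 4, 0]

sketch_a = project(S, a)
sketch_b = project(S, b)
sketch_ab = project(S, [x + y for x, y in zip(a, b)])

# Linearity: sketching the combined input gives the sum of the two sketches.
print(sketch_ab == [x + y for x, y in zip(sketch_a, sketch_b)])
```

This property is what allows a sketch to be maintained incrementally, one update at a time, without ever materializing the input vector.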

Count-Min Sketch

One of the most popular forms of the sketch data structure is the Count-Min Sketch introduced by Cormode and Muthukrishnan in 2003. The idea is quite simple and the data structure is based on probabilistic algorithms to serve various types of queries on streaming data. The data structure is parameterized by two factors - ε and δ, where the error in answering a query is within a factor of ε (of the total count in the stream) with probability at least 1 - δ. So you can tune these parameters based on the space that you can afford, and accordingly trade off the accuracy of the results that the data structure can serve you.

Consider a situation where you have a stream of data (typically modeled as a vector), like updates to stock quotes in a financial processing system, that arrives continuously and on which you need to answer statistical queries in real time.

We model the data stream as a vector a[1 .. n], and the update received at time t is of the form (i_t, c_t), which means that the stock quote a[i_t] has been incremented by c_t. There are various models in which this update can appear, as discussed in Data Streams: Algorithms and Applications by Muthukrishnan, including ones that allow negative updates, and the data structure can be tuned to handle each of these variants.

The core of the data structure is a 2 dimensional array count with d rows and w columns that stores the synopsis of the original vector and is used to report results of queries using approximation techniques. Hence the total space requirement of the data structure is (w * d) counters. We can bound each of w and d in terms of our parameters ε and δ and so control the level of accuracy that we want our data structure to serve.
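For reference, the original Count-Min analysis sets w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. With non-negative updates, a point estimate then exceeds the true value by more than ε times the total count in the stream with probability at most δ. A small Python helper (the function name is made up for this illustration) shows how modest the resulting table is:

```python
import math


def sketch_dimensions(epsilon, delta):
    """Return (w, d) for a Count-Min sketch with error ε and failure probability δ."""
    width = math.ceil(math.e / epsilon)      # w = ceil(e / ε) counters per row
    depth = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / δ)) rows
    return width, depth


width, depth = sketch_dimensions(epsilon=0.01, delta=0.01)
print(width, depth, width * depth)   # 272 5 1360
```

So a 1% error bound that holds 99% of the time needs only 1360 counters, regardless of how many distinct items the stream contains.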

The data structure uses hashing techniques to process these updates and report queries using sublinear space. So assume we have d pairwise-independent hash functions {h₁ .. h_d} that hash each of our inputs to the range (1 .. w). Just for the more curious mind, a pairwise-independent family is a particular kind of universal hash family, a construction that keeps the probability of any two distinct inputs colliding low.
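One standard way to build such a family for integer keys is h(x) = ((a*x + b) mod p) mod w, with p a prime larger than any key and a, b drawn at random. The Python sketch below follows that recipe; the prime, the seed and the helper name `make_hash` are assumptions made for the illustration, the final reduction mod w makes the result only approximately uniform, and the output range is 0 .. w-1 rather than 1 .. w.

```python
import random

PRIME = 2**61 - 1   # a Mersenne prime, assumed larger than any key we hash


def make_hash(width, rng):
    # Drawing (a, b) at random picks one member of the hash family.
    a = rng.randrange(1, PRIME)   # a != 0 avoids the constant function
    b = rng.randrange(PRIME)

    def h(x):
        return ((a * x + b) % PRIME) % width

    return h


rng = random.Random(7)
hashes = [make_hash(width=16, rng=rng) for _ in range(4)]   # d = 4, w = 16

# One cell index per hash function for the same key.
cells = [h(12345) for h in hashes]
print(cells)
```

Each of the d functions is drawn independently, so a pair of keys that collides under one of them is unlikely to collide under all of them.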

When an update (i_t, c_t) comes for the stream, we hash i_t through each of the hash functions h₁ .. h_d and increment each of the d entries in the array (one per row) that it hashes to.

    for j = 1 to d
        v = h_j(i_t)          // v is between 1 and w
        count[j, v] += c_t    // increment the cell count by c_t
    end

At any point in time, if we want to know the approximate value of an element a[i] of the vector a, we can get it by computing the minimum of the values in the d cells of count to which i hashes. This can be proved formally. But the general intuition is that since we are using hash functions, there's always a possibility of multiple i's colliding on the same cell and contributing additively to the value of that cell. With non-negative updates every one of the d cells is therefore an overestimate of a[i], and the minimum among them is the closest candidate for the correct result of the query.
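Putting the update and the point query together, here is a minimal Python sketch of the data structure over integer keys. It is an illustration, not a reference implementation: the class and method names, the seed, the table size and the synthetic stream are all assumptions, rows and columns are 0-based, and updates are assumed to be non-negative.

```python
import random


class CountMinSketch:
    """Minimal Count-Min sketch over integer keys (illustrative only)."""

    PRIME = 2**61 - 1   # a prime assumed larger than any key

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One (a, b) pair per row defines h_j(x) = ((a*x + b) mod p) mod w.
        self.params = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.count = [[0] * width for _ in range(depth)]

    def _cell(self, row, key):
        a, b = self.params[row]
        return ((a * key + b) % self.PRIME) % self.width

    def update(self, key, c=1):
        # Add c to exactly one counter in each of the d rows.
        for row in range(self.depth):
            self.count[row][self._cell(row, key)] += c

    def estimate(self, key):
        # Point query: the smallest of the d counters the key maps to.
        return min(self.count[row][self._cell(row, key)]
                   for row in range(self.depth))


sketch = CountMinSketch(width=272, depth=5)
truth = {}                         # exact counts, kept only for comparison
stream_rng = random.Random(42)
for _ in range(10_000):
    key = stream_rng.randrange(1_000)
    c = stream_rng.randint(1, 3)
    sketch.update(key, c)
    truth[key] = truth.get(key, 0) + c

worst = max(sketch.estimate(k) - truth[k] for k in truth)
print("largest overestimate:", worst)
```

The estimate never falls below the true count, so the printed value is the largest overestimate seen across all keys.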

The figure above shows the processing of the updates in a Count-Min sketch. The query described above is typically called the Point Query, which returns an approximation of a[i]. Similarly, we can use a Count-Min sketch to answer approximate range queries, where a range query is typically a summation over multiple point queries. Another interesting application is to serve inner product queries, where the data structure is used to estimate the inner product of two vectors, a typical application of this being the estimation of join sizes in relational query processing. The paper Statistical Analysis of Sketch Estimators gives all the details of how to use sketching as a technique for this.
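An inner product query can be sketched in Python as follows, assuming both vectors are summarized with the same hash functions and contain only non-negative counts. The estimate is the minimum, over the d rows, of the dot product of the corresponding rows of the two sketches. All names, sizes and data here are made up for the illustration.

```python
import random

PRIME = 2**61 - 1
WIDTH, DEPTH = 64, 4
_rng = random.Random(1)
# Shared hash parameters: both sketches must use the same functions.
PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(DEPTH)]


def cell(row, key):
    a, b = PARAMS[row]
    return ((a * key + b) % PRIME) % WIDTH


def build_sketch(freqs):
    # freqs maps key -> non-negative count.
    count = [[0] * WIDTH for _ in range(DEPTH)]
    for key, c in freqs.items():
        for row in range(DEPTH):
            count[row][cell(row, key)] += c
    return count


def inner_product(sk_a, sk_b):
    # Row-wise dot products each overestimate; take the smallest.
    return min(sum(x * y for x, y in zip(row_a, row_b))
               for row_a, row_b in zip(sk_a, sk_b))


data_rng = random.Random(2)
a = {k: data_rng.randint(1, 5) for k in range(0, 300)}
b = {k: data_rng.randint(1, 5) for k in range(200, 500)}

exact = sum(a[k] * b[k] for k in set(a).intersection(b))   # true join size
approx = inner_product(build_sketch(a), build_sketch(b))
print(exact, approx)
```

If a and b hold the frequencies of the join attribute in two relations, `exact` is the size of their equi-join and `approx` is the sketch-based estimate of it, which never undercounts.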

Count-Min sketches have some great properties which make them a very useful data structure when processing distributed streams. Combining two sketches is an associative operation, so they can be modelled as monoids and hence perform very well in a distributed environment where you can parallelize sketch operations. In a future post I will discuss some implementation techniques and how we can use count-min sketches to serve some useful applications over data streams. Meanwhile Twitter's Algebird and ClearSpring's stream-lib offer implementations of the Count-Min sketch and various other data structures applicable to stream mining applications.
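The monoid structure can be illustrated with a short Python sketch: two sketches built with the same hash functions are combined by adding their counters cell by cell, and the all-zero sketch is the identity. The helper names and data below are hypothetical, and this is not how Algebird or stream-lib expose the operation.

```python
import random

PRIME = 2**61 - 1
WIDTH, DEPTH = 32, 3
_rng = random.Random(3)
PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(DEPTH)]


def empty():
    # Identity element: a sketch that has seen nothing.
    return [[0] * WIDTH for _ in range(DEPTH)]


def sketch_of(keys):
    count = empty()
    for key in keys:
        for row, (a, b) in enumerate(PARAMS):
            count[row][((a * key + b) % PRIME) % WIDTH] += 1
    return count


def merge(x, y):
    # Cell-wise addition; valid only when both sketches share PARAMS.
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


# Three partitions of a stream, e.g. processed on three different nodes.
part1, part2, part3 = [1, 2, 2, 3], [2, 3, 3, 9], [9, 9, 1, 4]
s1, s2, s3 = sketch_of(part1), sketch_of(part2), sketch_of(part3)

merged_left = merge(merge(s1, s2), s3)
merged_right = merge(s1, merge(s2, s3))
whole = sketch_of(part1 + part2 + part3)
print(merged_left == merged_right == whole)
```

Since the grouping does not matter and the merged result equals the sketch of the whole stream, partial sketches can be computed in parallel and combined in any order.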

Ruminations of a Programmer

Thursday, January 23, 2014

A Sketch as the Query Model of an EventSourced System

Sunday, January 19, 2014

Count-Min Sketch - A Data Structure for Stream Mining Applications
