Monday, March 02, 2009

Your data model can speak different languages too

Andrej Koelewijn writes ..

  • REST is about resources. Resource is just another word for object, or record.

  • REST is also about URLs. URLs that identify resources. Just like ids can identify objects or records.

  • And, often overlooked, REST is also about links. Resources use URLs to link to other resources. Just like foreign keys can link records to other records.

and finally concludes ..

REST is a distributed data model

There is no denying the fact that REST is based on the central concept of Resources, which are addressable entities that can be manipulated by all components of the communicating network using a standardized and uniform interface. Resources abstract the underlying data model for the user, based on which the application developer can design nice REST based APIs to the external world. REST abstracts the data model, it is NOT the data model - in fact underlying data representations can change without any impact on the API.

In another related post, talking about the relational data model, he also mentions ..

This is why RDBMSes are so great: it doesn’t bind your data to a single application.

RDBMS does bind your data to the specific application model. What it abstracts from the application, is the underlying physical representation of the data. In fact it is this specific binding and the constraints that it imposes on the data model that makes it so difficult to work with semi-structured data. You design the schema upfront, define relationships and constraints along with the schema, and organize your data based on them. This is one of the reasons why we need to have separate data models for writes (normalized), queries and reports (denormalized).

In an earlier post, I had drawn upon some of my thoughts on Data 2.0, as I see it in an application today. The fact is that, data is no longer viewed as something to be abstracted in a uniform storage and drenched out using a single query language. Gone are the days when we used to think of the ORM as the grand unifier for all database platforms. Both the O and the R are fast losing ubiquity in today's application development context, at least they are not as universal as they used to be, half a decade back.

Things are surely moving towards polyglotism in the data modeling world as well. Not all of your data need to be on the relational database - distribute data to where it belongs. Sometime back Alex Miller had a post that suggested distribution and modeling of data based on lifetimes. Gone are the days when you need to persist your conversational data in the database for scaling out your application. Modern day grid platforms like Terracotta and Gigaspaces offer network attached memory that will store this data for you in the form that is much closer to the application model along with transparent clustering of your application.

Not all data need a relational model. In a typical application, there are data that are document oriented, does not need to have a fixed schema attached to it. They can be stored as key-value pairs and their primary reason for existence is to support fast, real fast inserts, updates and key based lookups. The semantics of the data inside the value is fairly opaque and does not need to have any constraints of relation with the rest of the data model. Why coerce such a simple model into one that forces you to pay the upfront tax of normalization, constraint enforcement, index rebuilding and joins. Think simple, think key value pairs. And we have lots of them being used in the application space today. Rip such data out of your relational model into lightweight transactional stores that scale easily and dynamically. Long lived persistent data can however, happily choose to stay around within the confines of your normalized relational model.

One of the significant advantages that you get out of storing data on the key/value stores is that your persistent data is now more closely mapped to the objects and classes of your application. This leads to less of a cognitive dissonance that the relational data model enforces upon the application.

But what about queries that can fetch relevant records from the key/value stores based on user defined criteria ?

Obviously primary key based fetch is not always that useful in practical applications. All distributed key/value stores provide the capability to index based on custom defined filters and in conjunction with full text search engines like Lucene, return collections of selected entries from the data store. Have a look at this article demonstrating how Sphinx, a full text search engine, can be integrated with MemcacheDB, a distributed key-value store which conforms to the memcached protocol and uses Berkeley DB as its storage back-end.

CouchDB provides an interesting view model that offers the capability to aggregate and query documents stored in the database through the map-reduce paradigm. Users can define the computation to model the query using map functions and subsequent aggregates using the reduce function that make relevant data available to the user.

So, now that your data model is polyglotic, there can be situations where you may need to synchronize data across multiple storage engines. Here is an example ..

In a real life trading application, huge volume of trade messages need to be processed by the back office within a very short time interval. This needs scaling out to throttle at a rate that can be achieved more easily using the light payloads that schemaless, amorphous key value stores offer than traditional relational databases. You just need to validate the message and store it against the trade reference number. Also, these peer based distributed key/value databases offer easy bi-directional replication and updates to shared data in disconnected mode, which can make offline message processing, followed by synchronization later on, quite affordable. Hence it makes every sense to offload this message processing from the database and deploy clusters of key/value stores.

However, there is a second level processing that needs to be done, which updates all books and accounts and customer balances based on each individual trade. And this information needs to be stored as a system of record for various queries, reports, audit trails and other subsequent downstream processing. One way this can be achieved is by pushing the second level of processing to scheduled queue jobs that asynchronously operate on the key value store, do all relevent heavy lifting with the processing of data and finally pushing the balances to the back end relational database. Currently we are doing everything as one synchronous ACID transaction against an RDBMS. Distribute the model based on data lifetime, rely on asynchronous processing, and what you get is eventual consistency with the goodness of better scalability.

A few days back I was listening to this Scaling DIGG episode with Joe Stump. He mentioned repeatedly that it's not the language, the bottleneck is the IO and the latency resulting from the IO. And the database is the single most bottleneck in your architecture. More specifically, it is NOT the database per se, but the synchronous communication between the application tier and the persistent data layer. This is the reason why DIGG is architected around Gearman, MemcacheDB and MogileFS. However, not all sites need the scalability of DIGG. But even with applications that need a fraction of scalability compared to DIGG, architecting it away from strictly synchronous ACID transaction oriented data sources is the way to go.


Anonymous said...

Brilliant, thank you.

Unknown said...

Thank you for highlighting this much forgotten issue about user-defined queries on K/V DBs.

While some headaches with scaling RDBMs can be solved by replacing it with a K/V store, indexing/searching such a database is still cumbersome.

As you said, CouchDB solves this by it's map/reduce views. Not sure about the performance on this, but it really has potential.