Monday, November 02, 2009

NOSQL Movement - Excited with the coexistence of Divergent Thoughts

Today we are witnessing a great deal of excitement around the NoSQL movement. Call it NoSQL (~SQL) or NOSQL (Not Only SQL), the movement has a mission. Not all applications need to store and process data the same way, and the storage should be architected accordingly. Till today we have always been force-fitting a single hammer to drive every nail. Irrespective of how we process data in our application, we have traditionally stored it as rows and columns in a relational database.

When we talk about really big write scaling applications, relational databases suck big time. Normalized data, joins and ACID transactions are definite anti-patterns for write scalability. You may think sharding will solve your problems by splitting data into smaller chunks. But in reality, the biggest problem with sharding is that relational databases were never designed for it. Sharding takes away many of the benefits that relational databases have traditionally been built for. Sharding cannot be an afterthought: it intrudes into the business logic of your application, and joining data from multiple shards is definitely a nontrivial effort. As long as you can scale up vertically by increasing the size of your box, that's possibly the sanest way to go. But Moore .. *cough* .. *cough* .. Even if you are able to scale up vertically, try migrating a really large MySQL database. It will take hours, even days. That's one of the reasons why some companies are moving to schemaless databases when their applications can afford to.
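To make the point about sharding leaking into business logic concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the shard count, the `shard_for` helper, and the in-memory dictionaries standing in for database servers are hypothetical and not any product's API. Users are placed by user id and orders by order id, so what would have been a single join has to be done by hand across every shard.

```python
import zlib

NUM_SHARDS = 4  # assumed and fixed up front; changing it later means moving data

# Each dict stands in for one database server (hypothetical, in-memory only).
user_shards = [dict() for _ in range(NUM_SHARDS)]
order_shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Pick a shard using a stable hash of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SHARDS


def save_user(user_id, name):
    user_shards[shard_for(user_id)][user_id] = {"id": user_id, "name": name}


def save_order(order_id, user_id, amount):
    order_shards[shard_for(order_id)][order_id] = {
        "id": order_id, "user_id": user_id, "amount": amount}


def orders_with_user_name(user_id):
    """A single SQL join becomes a scatter-gather written in the application."""
    user = user_shards[shard_for(user_id)][user_id]
    rows = []
    for shard in order_shards:  # the user's orders may live on any shard
        for order in shard.values():
            if order["user_id"] == user_id:
                rows.append((order["id"], user["name"], order["amount"]))
    return sorted(rows)
```

Sharding orders by `user_id` instead would keep a user's orders on one shard, but then every query keyed by order id alone would need the scatter-gather. Either way the choice shows up in application code.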

If we sacrifice normalization, joins and ACID transactions for the horizontal scalability of an application, why should we use an RDBMS? You don't need to .. Digg is moving to Cassandra from MySQL. It all depends on your application and the kind of write scalability that you need to achieve in processing your data. For read scalability, you can still manage with read-only slaves that replicate everything coming to the master database in real time, and a smart proxy router between your clients and the database.
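The routing idea behind such a proxy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeDB` and `ReadWriteRouter` are hypothetical names, the "connections" just record what they were asked to run, and a statement is treated as a read purely because it starts with `SELECT`.

```python
import itertools


class FakeDB:
    """Hypothetical stand-in for a database connection; records what it ran."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, sql):
        self.log.append(sql)
        return self.name  # report which server handled the statement


class ReadWriteRouter:
    """Send writes to the master and spread reads over the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._next_replica = itertools.cycle(replicas)  # simple round-robin

    def execute(self, sql):
        is_read = sql.lstrip().lower().startswith("select")
        target = next(self._next_replica) if is_read else self.master
        return target.execute(sql)
```

A real router would also have to cope with replication lag, since a read sent to a slave right after a write may not see that write yet.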

The biggest excitement that the NOSQL movement has created today comes from the divergence of thoughts that each of the products is promising. This is very much unlike the RDBMS movement, which started as a single hammer named SQL that's capable of munging rows and columns of data based on the theory of mathematical set operations. And every application adopted the same storage architecture irrespective of how it processed its data. One thing led to another, people thought they could solve this problem with yet another level of indirection .. and the strange thingy called an Object Relational Mapper was born.

At last it took the momentum of Web shaped data processing to make us realize that not all data is processed alike. The storage that works so well for your desktop trading application will fail miserably in a social application where you need to process linked data, more in the shape of a graph. The NOSQL community has responded with Neo4J, a graph database that offers easy storage and traversal of graph structures.
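A small example of what "graph shaped" processing looks like: finding everyone within a given number of hops of a person. The sketch below is plain Python over an adjacency dictionary, not Neo4J's API, and the function name `within_hops` and the sample data are made up for illustration. In a relational schema each extra hop would typically be another self-join on a friendship table.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each person reachable from start
    in at most max_hops hops to their distance from start."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = {}
    while frontier:
        person, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                found[friend] = dist + 1
                frontier.append((friend, dist + 1))
    return found


# Hypothetical sample data: who is linked to whom.
friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dee"],
    "cid": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}
```

Here `within_hops(friends, "ann", 2)` reaches `bob` and `cid` in one hop and `dee` in two, while `eve` stays out of range.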

If you want to go big on write scalability, the only way out is decentralization and eventual consistency. The CAP theorem kicks in, and you need to compromise on at least one of consistency, availability and network partition tolerance. Riak and Cassandra offer decentralized data stores that can potentially scale indefinitely. If your application needs more structure than a key-value database, you can go for Cassandra, the distributed, peer-to-peer, column oriented data store. Have a look at the nice article from Digg which compares their use case between a relational storage and the columnar storage that Cassandra offers. For a document oriented database with all the goodness of REST and JSON, Riak is the option to choose. Also Riak offers linked map/reduce with the option to store linked data items, much in the way the Web works. Riak is truly a Web shaped data store.

CouchDB brings yet another very interesting value proposition to this whole ecosystem of NOSQL databases. Most applications are inherently offline and need seamless and painless replication facilities. CouchDB's B-Tree based storage structure, append-only operations with an MVCC based model of concurrency control, lockless operations, REST APIs and incremental map/reduce operations give it a sweet spot in the space of local browser storage. Chris Anderson, one of the core developers of CouchDB, sums up the value of CouchDB in today's Web based world very nicely ..
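The combination of append-only storage and MVCC can be illustrated with a toy store: updates never overwrite an old revision, and a writer must present the revision it last read or be rejected. This is a sketch under loud assumptions, not CouchDB's implementation: `TinyDocStore` and `Conflict` are hypothetical names, the log is a Python list rather than a B-Tree file, and revisions are plain integers, whereas CouchDB's actual revision identifiers and storage format are more involved.

```python
class Conflict(Exception):
    """Raised when a writer's revision is stale."""


class TinyDocStore:
    """Toy append-only store with optimistic, revision-checked updates."""

    def __init__(self):
        self._log = []     # append-only: old revisions are never overwritten
        self._latest = {}  # doc id -> index in the log of the current revision

    def get(self, doc_id):
        _, rev, body = self._log[self._latest[doc_id]]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        current = self._latest.get(doc_id)
        current_rev = self._log[current][1] if current is not None else None
        if rev != current_rev:
            # Someone else updated the document since this writer read it.
            raise Conflict(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = 1 if current_rev is None else current_rev + 1
        self._log.append((doc_id, new_rev, dict(body)))
        self._latest[doc_id] = len(self._log) - 1
        return new_rev
```

No writer ever holds a lock here. Two clients that read the same revision can both try to update it, and the slower one gets a `Conflict` and has to re-read and retry.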

"CouchApps are the product of an HTML5 browser and a CouchDB instance. Their key advantage is portability, based on the ubiquity of the html5 platform. Features like Web Workers and cross-domain XHR really make a huge difference in the fabric of the web. Their availability on every platform is key to the future of the web."

MongoDB, like CouchDB, is also a document store. It doesn't offer REST out of the box, but it's based on JSON storage. It has map/reduce as well, but also offers a strong suite of query APIs much like SQL. This is the main sweet spot of MongoDB, which plays very well with people coming from a SQL background. MongoDB also offers master-slave replication and has been working towards autosharding based scalability and failover support.

There are quite a few other data stores that offer solutions to problems you face in everyday application design: caching, worker queues requiring atomic push/pop operations, processing activity streams, logging data etc. Redis and Tokyo Cabinet are nice fits for such use cases. You can think of Redis as a memcached with a backup persistent key-value database. It's single threaded, uses non-blocking IO and is blazing fast. Besides offering everyday key/value storage, Redis also offers lists and sets, along with atomic operations on each of them. Pick the one that fits your bill the best.
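Why atomic push/pop matters for a worker queue is easy to show with a stand-in. The sketch below does not talk to Redis at all: `TinyListStore` is a hypothetical in-memory class whose method names merely echo the Redis list commands LPUSH and RPOP, and a single lock plays the role that Redis's single-threaded design plays on the server. Because each pop is atomic, several workers can drain the same list without any job being handed out twice.

```python
import threading
from collections import deque


class TinyListStore:
    """In-memory stand-in for a server offering atomic list push/pop."""

    def __init__(self):
        self._lists = {}
        self._lock = threading.Lock()  # one lock makes every operation atomic

    def lpush(self, key, value):
        """Push onto the head of the list; return the new length."""
        with self._lock:
            items = self._lists.setdefault(key, deque())
            items.appendleft(value)
            return len(items)

    def rpop(self, key):
        """Pop from the tail of the list, or return None if it is empty."""
        with self._lock:
            items = self._lists.get(key)
            return items.pop() if items else None
```

Producers call `lpush("jobs", job)` and each worker loops on `rpop("jobs")` until it gets `None`, which gives first-in, first-out delivery.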

Another interesting aspect is the interoperability between these data stores. Riak, for example, offers pluggable data backends - possibly we can have CouchDB as the data backend for Riak (can we?). Possibly we will also see a Cassandra backend for Neo4J. It's extremely heartening to see that each of these communities has a deep sense of cooperation in making the entire ecosystem more meaningful and thriving.

11 comments:

Cedric said...

"When we talk about really big write scaling applications, relational databases suck big time."

Seriously?

Dhananjay Nene said...

Great post. This is one of the best overviews of the family of NoSQL databases I've seen in some time.

While I share the excitement, and have started using one of the databases, I must confess I approach NoSQL with a bit of trepidation. This is because in my perception, at the moment it is more like the wild west. What does worry me is that a lot of lessons around NoSQL still need to be learnt, and vendor/database lockin is quite high, even as some of the databases are being refined and enhanced substantially. Risk mitigation is one area I believe hasn't begun to be discussed within the NoSQL community (and if it has, I haven't seen much of it yet). Imo, this is one aspect every user should pay attention to before committing to using one of the databases.

Martin Kuhn said...

I'm interested in learning more about "NOSQL" DBs.

But I'm not sure which direction would be worth going in.

One option would be couchdb and the other interesting one would be cassandra.

Which is easier to use from a java / scala program?

For Cassandra I was not able to find good documentation, especially for the scala / java space.

Jamie said...

Thanks. That's a great overview. Just wondering where hadoop fits into all this? Isn't hadoop something like couchdb, but written in java?

Debasish said...

@Martin:
As I emphasized in my post, there is no *one* direction. It all depends on your application needs and how it processes data. The RDBMS has taught us to go for *one* unified storage architecture (rows and columns) and then use intermediaries like ORM to transform data according to your application needs. Here in NOSQL we are talking about divergent storages already. If you want to process strongly linked social graphs, you will need graph storing and processing capabilities as in Neo4J or linked map/reduce in Riak. If you need raw speed with key/value storage, choose Tokyo Cabinet or Redis.

HTH

Debasish said...

@Jamie:
Hadoop is a data processing engine for computation of analytics. OTOH CouchDB is a database designed for low latency and high availability. It does many more things than what Hadoop does. Both use map/reduce, but for very different purposes. Have a look at this thread for a brief comparison between the two.

Martin Kuhn said...

@Debasish

I think I will try couchdb for a toy project. You have already invested some time in couchdb (scouchdb).

Do you have a final opinion on couchdb? Will you work with it in the future?

Fanf said...

There is an ancestor that is regularly forgotten in the "NOSQL" storage movement: it's LDAP.

LDAP has a lot of advantages, among them: it's a standardized protocol with query/update/etc, it's already used in a lot of places, there are several existing LDAP server implementations (some of them (ok, one, OpenLDAP) can handle billions of entries), it has a mixed schema structure, both rigid (the LDAP schema, which allows interoperability) and adjustable (the directory tree structure is free), synchronization between an RDBMS and LDAP is well known (see for example lsc-project.org), and that's just an overview!

So, this was just a reminder that LDAP and all its ecosystem exist, are mature, and are still evolving: they deserve some interest ;)

Debasish said...

@Martin:
I have written scouchdb, the Scala driver and View Server for CouchDB. CouchDB has a sweet spot and IMHO it has the potential to make good use of it even amongst all the NOSQL databases. There are already many instances of CouchDB being used in production. I am excited ..

Anonymous said...

You are confusing 'relational databases' with 'current implementations of the relational model of data'.

The fact that current implementations suck at the things you describe means that those implementations suck, not that the model is inappropriate.

There is no reason why things such as splitting the data into chunks can not be implemented 'under the cover' of the relational model.

Venkatesh Sellappa said...

A good overview, I wonder though why Berkeley DB is not in there?