Today we are witnessing a great deal of excitement around the NoSQL movement. Call it NoSQL (~SQL) or NOSQL (Not Only SQL), the movement has a mission: not all applications need to store and process data the same way, and the storage should be architected accordingly. Until now we have been force-fitting a single hammer to drive every nail. Irrespective of how we process data in our application, we have traditionally stored it as rows and columns in a relational database.
When we talk about applications that need really big write scaling, relational databases suck big time. Normalized data, joins and ACID transactions are definite anti-patterns for write scalability. You may think sharding will solve your problems by splitting data into smaller chunks. But in reality, the biggest problem with sharding is that relational databases were never designed for it. Sharding takes away many of the benefits that relational databases have traditionally been built for. Sharding cannot be an afterthought: it intrudes into the business logic of your application, and joining data from multiple shards is definitely a non-trivial effort. As long as you can scale up vertically by increasing the size of your box, that's possibly the sanest way to go. But Moore .. *cough* .. *cough* .. Even if you are able to scale up vertically, try migrating a really large MySQL database. It will take hours, even days. That's one of the reasons why some companies are moving to schemaless databases when their applications can afford to.
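To make the point about sharding leaking into business logic a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the shards are plain in-memory dictionaries standing in for separate database servers, and the names `shard_for`, `save_user`, `save_order` and `orders_with_user_names` are hypothetical. The thing to notice is that what used to be a single JOIN becomes scatter-and-gather code inside the application.

```python
import zlib

NUM_SHARDS = 4  # assumption: a fixed number of shards

# Each dictionary stands in for a separate database server.
shards = [{"users": {}, "orders": []} for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """Pick a shard with a stable hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def save_user(user_id: str, name: str) -> None:
    shards[shard_for(user_id)]["users"][user_id] = name


def save_order(order_id: str, user_id: str, amount: int) -> None:
    # Orders are sharded by order id, so a user's orders may live on any shard.
    shards[shard_for(order_id)]["orders"].append(
        {"order_id": order_id, "user_id": user_id, "amount": amount}
    )


def orders_with_user_names() -> list:
    """Application-side join: gather users from every shard, then match the orders."""
    names = {}
    for shard in shards:
        names.update(shard["users"])
    joined = []
    for shard in shards:
        for order in shard["orders"]:
            joined.append((order["order_id"], names[order["user_id"]], order["amount"]))
    return sorted(joined)
```

Changing `NUM_SHARDS` in this sketch would remap almost every key, which is one more reason sharding is hard to bolt on later.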
For horizontal scalability of an application, if we sacrifice normalization, joins and ACID transactions, why should we use an RDBMS? You don't need to .. Digg is moving to Cassandra from MySQL. It all depends on your application and the kind of write scalability that you need to achieve in processing your data. For read scalability, you can still manage with read-only slaves that replicate everything coming to the master database in near real time, plus a smart proxy router between your clients and the database.
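The read-scaling setup described above can be sketched roughly as follows. This is not any particular proxy product, just a toy router under stated assumptions: `FakeDB` is a hypothetical stand-in for a real connection, and deciding "read or write" by looking at the first keyword of the statement is a deliberate simplification.

```python
import itertools


class FakeDB:
    """Stand-in for a database connection; records the statements it receives."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, sql):
        self.log.append(sql)
        return self.name


class ReadWriteRouter:
    """Send writes to the master and spread reads over the replicas round-robin."""

    READ_PREFIXES = ("select", "show")

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql):
        # Simplification: classify a statement by its first keyword only.
        is_read = sql.lstrip().lower().startswith(self.READ_PREFIXES)
        target = next(self._replicas) if is_read else self.master
        return target.execute(sql)
```

Note that every write still lands on the single master, so this helps with reads only, and a read that follows a write may briefly see stale data while replication catches up.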
The biggest excitement that the NOSQL movement has created today comes from the diversity of approaches that the products promise. This is very much unlike the RDBMS movement, which started as a single hammer named SQL, capable of munging rows and columns of data based on the theory of mathematical set operations. Every application adopted the same storage architecture irrespective of how it processed the data. One thing led to another, people thought they could solve this problem with yet another level of indirection .. and the strange thingy called an Object Relational Mapper was born.
At last it took the momentum of Web-shaped data processing to make us realize that not all data is processed alike. The storage that works so well for your desktop trading application will fail miserably in a social application where you need to process linked data, more in the shape of a graph. The NOSQL community has responded with Neo4J, a graph database that offers easy storage and traversal of graph structures.
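To see why graph-shaped data wants a traversal rather than a pile of self-joins, here is a small sketch of a "friends within N hops" query. It is plain Python over an adjacency dictionary, not the Neo4J API, and the function name `within_hops` is hypothetical.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {person: hops} for everyone reachable from start in at most max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen[friend] = seen[person] + 1
                queue.append(friend)
    del seen[start]  # the starting person is not their own friend
    return seen


friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
```

With the sample data, `within_hops(friends, "ann", 2)` walks only the neighbourhood of "ann". In a relational schema each extra hop would typically mean one more self-join on the friendship table.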
If you want to go big on write scalability, the only way out is decentralization and eventual consistency. The CAP theorem kicks in, and you need to compromise on at least one of consistency, availability and network partition tolerance. Riak and Cassandra offer decentralized data stores that can potentially scale indefinitely. If your application needs more structure than a key-value database, you can go for Cassandra, the distributed, peer-to-peer, column-oriented data store. Have a look at the nice article from Digg which compares their use case between relational storage and the columnar storage that Cassandra offers. For a document-oriented database with all the goodness of REST and JSON, Riak is the option to choose. Riak also offers linked map/reduce with the option to store linked data items, much in the way the Web works. Riak is truly a Web-shaped data store.
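The consistency trade-off can be illustrated with a toy version of the quorum idea used by Dynamo-style stores. This is a sketch under loud assumptions, not how Riak or Cassandra are implemented: replicas are in-memory objects, versions are plain integers supplied by the caller (real stores use richer schemes), and writes always hit the first `w` replicas while reads always hit the last `r`, which is the worst case for overlap.

```python
class Replica:
    """One copy of the data: key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def put(self, key, version, value):
        current = self.data.get(key, (0, None))
        if version > current[0]:  # keep only the newest version
            self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))


def write(replicas, key, version, value, w):
    """Acknowledge after w replicas have the write; the rest would catch up later."""
    for replica in replicas[:w]:
        replica.put(key, version, value)


def read(replicas, key, r):
    """Ask r replicas and return the value carrying the highest version."""
    answers = [replica.get(key) for replica in replicas[-r:]]
    return max(answers, key=lambda pair: pair[0])[1]
```

With three replicas, writing with `w=2` and reading with `r=2` always overlaps on at least one replica, so the read sees the latest write. Drop to `w=1` and `r=1` and a read can return an older value: faster and more available, but only eventually consistent.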
CouchDB has yet another very interesting value proposition in this whole ecosystem of NOSQL databases. Many applications are inherently offline and need seamless and painless replication. CouchDB's B-Tree based storage structure, append-only operations with an MVCC-based model of concurrency control, lockless operations, REST APIs and incremental map/reduce operations put it in a sweet spot in the space of local browser storage. Chris Anderson, one of the core developers of CouchDB, sums up the value of CouchDB in today's Web-based world very nicely ..
"CouchApps are the product of an HTML5 browser and a CouchDB instance. Their key advantage is portability, based on the ubiquity of the html5 platform. Features like Web Workers and cross-domain XHR really make a huge difference in the fabric of the web. Their availability on every platform is key to the future of the web."
MongoDB, like CouchDB, is also a document store. It doesn't offer REST out of the box, but it's based on JSON-style document storage. It has map/reduce as well, but also offers a rich set of query APIs, much like SQL. This is the main sweet spot of MongoDB, which appeals to people coming from a SQL background. MongoDB also offers master-slave replication and has been working towards autosharding-based scalability and failover support.
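The difference between the two styles of access mentioned above, ad hoc queries versus map/reduce, can be sketched in a few lines. This is plain Python over a list of dictionaries, not the MongoDB driver; `find` and `map_reduce` are hypothetical helpers, and `find` only supports simple field equality.

```python
from collections import defaultdict

docs = [
    {"user": "ann", "lang": "scala", "stars": 5},
    {"user": "bob", "lang": "erlang", "stars": 2},
    {"user": "cat", "lang": "scala", "stars": 3},
]


def find(collection, query):
    """Query-by-example: keep documents whose fields equal the query's fields."""
    return [d for d in collection if all(d.get(k) == v for k, v in query.items())]


def map_reduce(collection, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then fold each group."""
    groups = defaultdict(list)
    for doc in collection:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Ad hoc lookup, the style that feels familiar coming from SQL.
scala_users = [d["user"] for d in find(docs, {"lang": "scala"})]

# Aggregation expressed as a map step and a reduce step.
stars_by_lang = map_reduce(
    docs,
    lambda doc: [(doc["lang"], doc["stars"])],
    lambda key, values: sum(values),
)
```

The query form reads like a WHERE clause, while the map/reduce form takes more ceremony but can be spread across many machines, since each document is mapped independently.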
There are quite a few other data stores that offer solutions to problems you face in everyday application design: caching, worker queues requiring atomic push/pop operations, processing activity streams, logging data and so on. Redis and Tokyo Cabinet are nice fits for such use cases. You can think of Redis as a memcached backed by a persistent key-value database. It's single threaded, uses non-blocking IO and is blazing fast. Besides everyday key/value storage, Redis also offers lists and sets, along with atomic operations on each of them. Pick the one that fits your bill the best.
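Here is a rough sketch of why atomic push/pop matters for a worker queue. To keep it runnable without a server, it uses an in-memory stand-in rather than a real Redis client: `TinyListStore` is hypothetical, its `lpush` and `rpop` methods merely mirror the names of the Redis commands LPUSH and RPOP, and a lock plays the role of the server's atomicity.

```python
import threading
from collections import deque


class TinyListStore:
    """In-memory stand-in for a list-capable key/value store; not a Redis client."""

    def __init__(self):
        self._lists = {}
        self._lock = threading.Lock()  # makes each push/pop atomic

    def lpush(self, key, value):
        """Push onto the head of the list and return the new length."""
        with self._lock:
            self._lists.setdefault(key, deque()).appendleft(value)
            return len(self._lists[key])

    def rpop(self, key):
        """Pop from the tail of the list, or return None when it is empty."""
        with self._lock:
            items = self._lists.get(key)
            return items.pop() if items else None


def drain(store, key, done):
    """Worker loop: keep popping jobs until the queue is empty."""
    while True:
        job = store.rpop(key)
        if job is None:
            return
        done.append(job)
```

Because each pop is atomic, several workers can drain the same list and every job is handed to exactly one of them, with no extra coordination in the application.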
Another interesting aspect is the interoperability between these data stores. Riak, for example, offers pluggable data backends, so possibly we can have CouchDB as the data backend for Riak (can we?). Possibly we will also see a Cassandra backend for Neo4J. It's extremely heartening to see that each of these communities has a deep sense of cooperation in making the entire ecosystem more meaningful and thriving.