Thursday, December 04, 2008

Data 2.0 - more musings

Martin Fowler writes:

"If you switch your integration protocol from SQL to HTTP, it now means you can change databases from being IntegrationDatabases to ApplicationDatabases. This change is profound. In the first step it supports a much simpler approach to object-relational mapping - such as the approach taken by Ruby on Rails. But furthermore it breaks the vice-like grip of the relational data model. If you integrate through HTTP it no longer matters how an application stores its own data, which in turn means an application can choose a data model that makes sense for its own needs."

and echoes similar sentiments that I expressed here.

Today's application stack has started taking different views on using the database, particularly the relational database, as a persistent store. The driving force is, of course, to reduce the semantic distance between the application model and the data model.

For the case Martin mentions above, the data model for the application need not be relational at all. We are seeing more and more cases where applications need not forcibly bolt the data they operate on into the clutches of the relational paradigm. In other words, the data remains much closer to the application domain model. Instead of splitting a domain abstraction into multiple relational tables and using SQL as the glue to join them for queries and reports, we can operate directly on a semantically richer persistent abstraction. Storage options have evolved too: RAM is the new disk, and creating RAM clusters is now easier than creating disk clusters. And technologies like Map/Reduce have enabled easy parallelization of data processing on commodity hardware.
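To make the contrast concrete, here is a minimal sketch in Python. The order, its fields, and the helper functions are all hypothetical names of my own, not taken from any real system. The same domain abstraction is held once as a single nested document and once as two relational-style tables that have to be joined back together.

```python
import json

# Hypothetical domain abstraction: an order that owns its line items.
order_document = {
    "_id": "order-1001",
    "customer": "acme",
    "items": [
        {"sku": "A-1", "qty": 2, "price": 10.0},
        {"sku": "B-7", "qty": 1, "price": 25.0},
    ],
}

# The same data split across two relational-style tables (lists of rows).
orders_table = [{"order_id": "order-1001", "customer": "acme"}]
order_items_table = [
    {"order_id": "order-1001", "sku": "A-1", "qty": 2, "price": 10.0},
    {"order_id": "order-1001", "sku": "B-7", "qty": 1, "price": 25.0},
]


def total_from_document(doc):
    """Work directly on the nested document; no join is needed."""
    return sum(item["qty"] * item["price"] for item in doc["items"])


def total_from_tables(order_id, item_rows):
    """Stand-in for a SQL join plus aggregate over the item rows."""
    return sum(
        row["qty"] * row["price"]
        for row in item_rows
        if row["order_id"] == order_id
    )


# The document survives a JSON round trip unchanged, so it can be
# persisted as is by a store that speaks JSON.
restored = json.loads(json.dumps(order_document))
```

Both functions compute the same total. The difference is that the document version never has to reassemble the order from its parts.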

I have been hacking around with CouchDB for some time now. This is one platform that promises an HTTP based interface to the entire application stack. It's a server and a database, and the best part is that the database driver is HTTP. JSON based storage of application documents, REST APIs, map/reduce based queries in JavaScript, no schema, no SQL, no database constraints: your application data is semantically closer to the domain model. Loosely coupled document storage, multi-version concurrency control, easy replication. Try replacing the relational database with this stack if it fits your application model.
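The map/reduce style of querying schema-free JSON documents can be illustrated with a small in-memory sketch. This is not CouchDB's API: real CouchDB views are JavaScript functions stored in the database and queried over HTTP. The documents and the names `map_trade`, `reduce_sum` and `query_view` below are hypothetical and only show the shape of the idea.

```python
import json
from collections import defaultdict

# Hypothetical JSON documents; no schema, so documents may differ in shape.
raw_documents = [
    '{"_id": "t1", "type": "trade", "symbol": "IBM", "qty": 100}',
    '{"_id": "t2", "type": "trade", "symbol": "ORCL", "qty": 40}',
    '{"_id": "t3", "type": "trade", "symbol": "IBM", "qty": 50}',
    '{"_id": "n1", "type": "note", "text": "not a trade"}',
]
documents = [json.loads(raw) for raw in raw_documents]


def map_trade(doc):
    """Map step: emit (key, value) pairs for the documents we care about."""
    if doc.get("type") == "trade":
        yield doc["symbol"], doc["qty"]


def reduce_sum(values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def query_view(docs, map_fn, reduce_fn):
    """Run the map step over every document, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


totals = query_view(documents, map_trade, reduce_sum)
```

Here `totals` holds the traded quantity per symbol. The note document is simply ignored by the map step, which is how a query copes with documents of different shapes.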

Another train of thought that positions the database in a new light within the application stack comes from the upcoming grid computing platforms on the JVM. Java as a platform is growing fast, and grid vendors like Terracotta and Gigaspaces are catching on to this trend. They offer a coherently clustered in-memory grid that enables a POJO based programming model without any synchronous interaction with the persistent data store. The application layer can program to the POJO based Network Attached Memory of Terracotta, using standard in-memory data structures. Terracotta offers an interface, which, if you implement, will be called to flush your objects to the database asynchronously in a write-behind thread. This is one more way to reduce the impedance mismatch between your domain model and the data model.
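The write-behind idea itself is easy to sketch. The following is a generic, single-process illustration in Python. It is not Terracotta's interface, and the class `WriteBehindStore`, its methods, and the `flush` callback are names I am assuming for the example. Reads and writes hit an in-memory map, while a background thread drains a queue and calls the flush callback to persist each change.

```python
import queue
import threading


class WriteBehindStore:
    """In-memory store that persists changes asynchronously (hypothetical)."""

    def __init__(self, flush):
        self._cache = {}                # what the application reads and writes
        self._pending = queue.Queue()   # changes waiting to be persisted
        self._flush = flush             # callback that writes to the database
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        # The caller returns immediately; persistence happens later.
        self._cache[key] = value
        self._pending.put((key, value))

    def get(self, key):
        return self._cache[key]

    def _drain(self):
        # Write-behind thread: persist queued changes in arrival order.
        while True:
            key, value = self._pending.get()
            try:
                self._flush(key, value)
            finally:
                self._pending.task_done()

    def wait_until_flushed(self):
        # Block until every queued change has been handed to flush().
        self._pending.join()


# A plain dict stands in for the database in this sketch.
database = {}
store = WriteBehindStore(flush=database.__setitem__)
store.put("quote:IBM", 81.5)
store.put("quote:ORCL", 16.2)
store.wait_until_flushed()
```

The application thread never waits on the database inside `put`. A real grid also has to deal with batching, retries and failures between the write and the flush, none of which this sketch attempts.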

7 comments:

Jesper said...

Eh, can someone explain to a non-web application guy why you would use a protocol definitely not designed for data persistence (such as HTTP) to replace a protocol specifically designed for such a task? A solution like Terracotta persistence, maybe in combination with STM, sounds much more sensible.

Harish Mallipeddi said...

@Jesper

HTTP is an open protocol, unlike the binary protocols which DBs typically use. The first advantage is that you don't have to go around implementing database drivers per-database/per-language/per-platform. HTTP is available for free everywhere.

Using HTTP also has other advantages. For instance, load balancing between two replicated databases which expose an HTTP API is as simple as sticking an nginx instance (or any reverse-proxy of your choice) in front of them.

Harish Mallipeddi said...

@Jesper

re: Terracotta + STM

I'm not sure if what Terracotta implements constitutes an STM, but it seems that whenever you modify shared state (basically objects shared across multiple JVMs via the Terracotta Shared Cache), you're forced to do so within a "transaction".

Cedric said...

Martin's post makes no sense to me, especially "If you integrate through HTTP it no longer matters how an application stores its own data".

Whether the application uses SQL or HTTP to store its data, it's really just a simple matter of abstraction to hide that layer from other parts of the system, and if you're going to do that, why go with such a specialized and ill-suited protocol as HTTP?

Also, SQL is a language and HTTP is a protocol...

--
Cedric

Unknown said...

Cedric -

The way I look at it, the moment you change the interface from SQL to HTTP, you make a leap in abstraction levels. SQL is tied to relational databases, while with HTTP/REST APIs you can abstract your persistence layer to whatever feels more natural for your application. For example, with CouchDB you use JSON storage of documents at a layer of abstraction which is much closer to your domain model.

Hence applications are freer to make their data model fit the domain model.

Anonymous said...

"Terracotta offers an interface, which, if you implement, will be called to flush your objects to the database asynchronously in a write-behind thread."

In a realtime paradigm, rather than flushing objects asynchronously to the db (i.e. in a separate thread) in the same process, sometimes an altogether standalone process might prove beneficial.

Consider a realtime stock trading system, which shows traders realtime prices and also needs to persist every tick for historical data analysis.

Process 1: Fetching data from exchange
Process 2: Showing the data to traders
Process 3: Persisting the data

Processes 1, 2 and 3 interact through a message broker, using a unified object model. This approach may improve performance in the case of huge bursts of data, where embedding the process 3 logic in process 1 or 2 might prove too costly.

Anonymous said...

Hi Debasish -

The "coherently clustered in-memory grid that enables a POJO based programming model without any synchronous interaction with the persistent data store" is a concept developed by Tangosol in the Coherence product, now part of Oracle.

The "interface, which, if you implement, will be called to flush your objects to the database asynchronously in a write-behind thread" was also originally a Coherence feature (from 2002), later copied by both of the other vendors you mentioned. See: http://www.oracle.com/technology/products/coherence/coherencedatagrid/writebehind.html

Peace,

Cameron Purdy | Oracle
http://www.oracle.com/technology/products/coherence/index.html