Monday, June 28, 2010

A Phase Shift for the ORM

I came to know about Squealer from one of Dean's tweets over the last weekend. The README in its git repo makes a very succinct point about the role that relational mappers will play in the days to come. It says "... ORMs had it the wrong way around: that the application should be persisting its data in a manner natural to it, and that external systems (like reporting and decision support systems - or even numbskull integration at the persistence layer) should bear the cost of mapping."

I have expressed similar observations in the past, when I talked about the rationale of modeling data close to the way applications will be using them. I talk about this same architecture in an upcoming IEEE Software Multi-Paradigm Programming Special Issue for Sep/Oct 2010.

In most applications that build domain models in an object oriented language and persist data in a relational store, we use the ORM layer as follows:


It sits between the domain model and the relational database and provides an isolation layer between the two, at the cost of an intrusive framework within an otherwise uncomplicated application architecture.

The scenario changes if we allow the application to manipulate and persist data in the same form that it uses for modeling its domain. My email application needs an address book as a composite object instead of having it torn apart into multiple relational tables in the name of normalization. It would be better if I could program the domain model to access and persist all its entities in a document store that gives it the same flexibility. The application layer then does not have to deal with the translation between the two data models, which adds a significant layer of complexity today. The normal online application flow doesn't need an ORM.
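To make the address book example concrete, here is a minimal Python sketch. The `Contact` and `AddressBook` classes and the `InMemoryDocumentStore` are hypothetical stand-ins invented for illustration, not the API of any real document database. The point is only that the composite object is serialized and stored as a single JSON document, with no mapping layer in between.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Contact:
    name: str
    emails: List[str] = field(default_factory=list)


@dataclass
class AddressBook:
    owner: str
    contacts: List[Contact] = field(default_factory=list)


class InMemoryDocumentStore:
    """Hypothetical stand-in for a bucket in a document database."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def put(self, key: str, document: dict) -> None:
        # The whole aggregate is stored as one JSON document.
        self._docs[key] = json.dumps(document)

    def get(self, key: str) -> dict:
        return json.loads(self._docs[key])


# The domain object is persisted in the same shape the application uses.
book = AddressBook("alice", [Contact("bob", ["bob@example.com"])])
store = InMemoryDocumentStore()
store.put("addressbook:alice", asdict(book))
loaded = store.get("addressbook:alice")
```

A normalized relational design would typically spread the same data over separate tables for address books, contacts and email addresses, and the application would need a mapping layer to reassemble the aggregate on every read.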

How does the translation layer get shifted?

Consider an example application that uses Terrastore as the persistent storage for implementing online domain functionalities. Terrastore is a document database that resembles CouchDB, MongoDB and Riak in that data is stored in the form of JSON documents. However, unlike the others, it emphasizes the "C" component of the CAP theorem and provides advanced scalability and elasticity features without sacrificing consistency.

Terrastore is built on top of Terracotta, which is itself an ACID based object database that offers storage of data larger than your RAM size clustered across multiple machines. Terrastore uses the storage and clustering capabilities of Terracotta and adds more advanced features like partitioning, data manipulation and querying through client APIs.

As long as you are using Terrastore as your persistent database for the domain model, your data layer is at the same level of abstraction as your domain layer. You manipulate objects in memory and store them in Terrastore in the same format. Terrastore, like most other document-oriented NoSQL stores, is schemaless and offers a collection based interface for storing JSON documents. There is no impedance mismatch to handle so far between the application and the data model.

Have a look at the following figure where there's no additional framework sitting between the application model and the data storage (Terrastore).



However, there are many reasons why you might want a relational database as an underlying store. Requirements like ad hoc reporting, decision support systems or data warehouses are best supported by relational engines or by the extensions that relational vendors offer. Such applications are not real time and can very well be served from a snapshot of the data that lags the online version.

Many NoSQL stores, and many SQL ones as well, offer commit handlers for publishing async jobs. In Terrastore you can write custom event handlers that publish information from the document store to an underlying relational store. It's as simple as implementing the terrastore.event.EventListener interface. This is well illustrated in the above figure. The translation to the relational model takes place here, one level down the stack in your application architecture. Terrastore events are enqueued synchronously in FIFO order while the handlers execute asynchronously, which is exactly what you need to scale out your application.
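The pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions and not Terrastore's actual API: `EventedDocumentStore`, `add_listener`, `flush` and the `contact` table are all hypothetical names, and SQLite stands in for the relational store. Writes enqueue an event synchronously in FIFO order, and a background thread runs the listener that does the relational translation.

```python
import queue
import sqlite3
import threading


class EventedDocumentStore:
    """Hypothetical document store that notifies listeners after each write."""

    def __init__(self):
        self._docs = {}
        self._events = queue.Queue()  # FIFO: events leave in write order
        self._listeners = []
        threading.Thread(target=self._dispatch, daemon=True).start()

    def add_listener(self, listener):
        self._listeners.append(listener)

    def put(self, key, doc):
        self._docs[key] = doc
        # Enqueued synchronously, so the caller never waits on the RDBMS.
        self._events.put((key, doc))

    def _dispatch(self):
        # Runs on the background thread: listeners execute asynchronously.
        while True:
            key, doc = self._events.get()
            try:
                for listener in self._listeners:
                    listener(key, doc)
            finally:
                self._events.task_done()

    def flush(self):
        """Block until every queued event has been handled."""
        self._events.join()


# SQLite stands in for the underlying relational store (an assumption).
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute(
    "CREATE TABLE contact (owner TEXT, name TEXT, email TEXT, "
    "PRIMARY KEY (owner, name))"
)


def publish_to_relational(key, doc):
    # The document-to-relation translation lives here, downstream of the app.
    with db:  # one transaction per event
        for contact in doc["contacts"]:
            db.execute(
                "INSERT OR REPLACE INTO contact VALUES (?, ?, ?)",
                (doc["owner"], contact["name"], contact["email"]),
            )


store = EventedDocumentStore()
store.add_listener(publish_to_relational)
store.put(
    "addressbook:alice",
    {"owner": "alice", "contacts": [{"name": "bob", "email": "bob@example.com"}]},
)
store.flush()
rows = db.execute("SELECT owner, name, email FROM contact").fetchall()
```

Using `INSERT OR REPLACE` keeps the listener idempotent, so replaying the same event leaves the table unchanged. The sketch keeps the queue in memory only, so pending events are lost if the process dies before they are handled.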

I took up Terrastore just as an example - you can do most of this (some of it differently) with other NoSQL stores serving as the frontal persistent store of your application. In real-life usage, choose the store that best matches the needs of your domain model. It can be a document store, a generic key-value store or a graph store.

The example with which I started the post, Squealer, provides a simple, declarative Ruby DSL for mapping values from document trees into relations. It was built to serve exactly the above use case, with MongoDB as the frontal store of your application and MySQL providing the system of record as an underlying relational store. In this model, too, we see a shift of the translation layer from the main online application functionalities to a more downstream component of the application architecture stack. As the README says, "It can be used both in bulk operations on many documents (e.g. a periodic batch job) or executed for one document asynchronously as part of an after_save method (e.g. via a Resque job). It is possible that more work may done on this event-driven approach, perhaps in the form of a squealerd, to reduce latency". Nice!
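The idea of mapping values from a document tree into relations can be illustrated without the DSL. The following Python sketch is not Squealer's syntax or API. The functions `map_document` and `map_batch`, the `address_book` and `contact` table names, and the document shape are all assumptions made up for this example. It shows the same mapping being usable for a single document (as in an after-save hook) or for many documents (as in a periodic batch job).

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Tuple[Any, ...]


def map_document(doc: Dict[str, Any]) -> Dict[str, List[Row]]:
    """Flatten one address-book document into rows for two hypothetical tables."""
    book_id = doc["_id"]
    rows: Dict[str, List[Row]] = {
        "address_book": [(book_id, doc["owner"])],
        "contact": [],
    }
    # Each nested contact/email pair becomes one row keyed by the parent id.
    for contact in doc.get("contacts", []):
        for email in contact.get("emails", []):
            rows["contact"].append((book_id, contact["name"], email))
    return rows


def map_batch(docs: Iterable[Dict[str, Any]]) -> Dict[str, List[Row]]:
    """Apply the same mapping to many documents, e.g. in a periodic batch job."""
    merged: Dict[str, List[Row]] = {"address_book": [], "contact": []}
    for doc in docs:
        for table, rows in map_document(doc).items():
            merged[table].extend(rows)
    return merged


docs = [
    {
        "_id": 1,
        "owner": "alice",
        "contacts": [{"name": "bob", "emails": ["bob@example.com", "bob@work.example"]}],
    },
    {"_id": 2, "owner": "carol", "contacts": []},
]
single = map_document(docs[0])  # per-document use, e.g. from an after-save hook
batch = map_batch(docs)         # bulk use
```

The resulting rows can then be written to the relational store by whatever downstream job owns that responsibility, so the online application never sees the relational shape.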

All the above examples show that the translation layer, which has long existed between the core application domain logic and the persistent data model, has undergone a phase shift. With many of today's NoSQL data stores allowing you to model your data closer to how the domain needs it, you can push the translation further downstream. But again, you can only do this for applications that fit this paradigm. For applications that need instant access to the relational store, this will not be a good fit. Use it wherever you deem it applicable - besides the simplicity that you will get in your application level programming model, you can also scale your application more easily when you have a scalable frontal database along with non-blocking asynchronous writes shoveling data into the relational model.

13 comments:

brunovernay said...

Interesting view. It reminds me of Complex Event Processing systems or Rules Engines, where the application handles real time events, but can still store relevant bits to a database, generally in an asynchronous way.

But here you have 2 data stores. Depending on the application, you may need to code some checks for consistency.

kimchy said...

Few notes:

1. The assumption that NoSQL provides closer modeling to your domain model is not really true across the board. Column based stores are not really closer (Cassandra as an example); document based ones do come closer, but suffer from things like relationships.

2. Once you use an event listener mechanism to apply changes done to an external data source, you run into your typical 2 resources coordination and consistency problems. For example, if firing the event is done on a different thread, and that node dies, your event might not have fired. In this case, you need to replicate the "need to fire an event" or make it highly available.

In general what you say makes tons of sense, and has been done for quite some time by data grids. It's called write behind, which solves the above problems, and is an order of magnitude more powerful (and more complex to implement ;) ) than post commit hooks... .

-shay.banon

Sergio Bossa said...

Hi Kimchy,

a few observations over your notes.

Regarding #1, I agree column stores make for difficult mappings in the same way as relational stores, but document stores are different: by wisely identifying aggregates and related entities, you can safely serialize the whole aggregate graph as a document unit, and only have to manage aggregate relations.

Regarding #2, it really depends on the event source implementation: if events are synchronously enqueued in FIFO order, you don't have any consistency and coordination problem, just a kind of "time-shifted" consistency. You may only have to deal with idempotent messages in some cases, but it often happens in async distributed systems.

Finally, I understand hand-crafted write-behind may be more powerful than commit hooks: but still, that doesn't mean it's always a better solution ;)

My two euro cents,
Cheers!

Sergio B.

Debasish said...

Hi Kimchy -

Regarding your observation #1, let me take the example of a column store like Cassandra. It's true with column stores you have to think differently. But still I think we ensure that the data is organized in the way the application requires it. Take for example the case for organizing users in a Twitter like application (ref Twissandra use case in http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/). There we make both User and Username as column families just to support scalable writes and heavy queries with usernames. Had we used an RDBMS, we would have designed it differently. Needless to mention the Cassandra design is closer towards how the application uses the data.

Another example is a graph database to store routing information where we are actually manipulating graphs in the domain model.

But I agree that not all designs are very intuitive and you need to play to the strengths of the store that you are using. So choosing the appropriate store is a prerequisite.

Regarding your observation #2, I agree tim-async of Terracotta gives you more flexibility. I just cited an example that Terrastore implements. In this context, it's not unusual to make use of various Event Sourcing patterns that can be processed asynchronously for pushing stuff into a relational database.

Juan Carlos said...

Hi Debasish,

What do you think about applying CQRS to this kind of solution?

Thanks

kimchy said...

Hi,

Sergio, regarding #2. Once you use the event listeners to update another transactional store, you would want to make sure that whatever you do against the "nosql" of choice is also done against the other datastore. FIFO doesn't cut it; you must create a way to have it highly available. So if one node dies after the change applies to the nosql, but the event was not raised yet, you won't lose it. That's basic stuff, and a write behind mechanism does that in a highly available, async, and reliable manner. I am not saying that it can't be implemented in nosql solutions, just that it's not something new, and current solutions that don't behave like it are lacking.

Debasish: Regarding the column based store. Personally, Cassandra, while very powerful, feels very much alien to your typical domain modeling. That's not how things should be... .

-shay.banon

jeppec said...

I completely agree :-)

It's very important to model your domain correctly (identify your aggregates) and have as little translation as possible when persisting it. As an old ORM guy, I nowadays have a hard time justifying ORMs for all but the rarest cases.

Your focus on using Event Sourcing (a cornerstone in CQRS) to separate Core Domain Persistence & Searching/Querying/Reporting is essential to keep your domain model and its persistence model sane, because they serve very different purposes and have very different characteristics.

/Jeppe

Sergio Bossa said...

Hi Shay,

You're absolutely right: durability and replication of queued messages are important, but again, we're just talking about implementation aspects, rather than the general validity of the EventSource/EventListener solution.
For example, talking about Terrastore, it currently doesn't implement message durability and replication, but this will change in the future ... just let it approach version 1.0 ;)
Other write-behind features such as conflation or batching are missing too and they're maybe not that suited for event sourcing ... but again, just use the right tool for the job and switch to full-power write-behind if needed.

Thanks for sharing your thoughts, always interesting stuff coming from you indeed ...
Cheers!

Sergio B.

Joshua Graham said...

Dean Wampler pointed out your ideas to me, and this is a valuable meme to be spreading right now. I put some other rationale into my blog with more to follow soon. I'll be interested to read your article!

Joshua Graham said...

And yes, the handling for event-based updates is intended to be asynchronous (although it doesn't have to be).

Another driver was that this was on Amazon infrastructure (via EngineYard), and MySQL seemed to be all too often in CPU I/O Wait state (shared network I/O to the disks), whereas MongoDB with memory-mapped files was blazing.

kimchy said...

Hi Sergio,

I have no doubt that Terrastore will have something similar to write behind; nothing in its architecture prevents it from having it, except for time (which, god knows, I wish I had more of myself as well ;) ). Agreed regarding the event sourcing aspect, just wanted to bring up the point that there are many aspects to think about when you implement such a design pattern.

-shay.banon

Bart's testblog said...

I'd like to read your post but there seems to be something wrong with the figures/drawings? Could you restore them?
Tnx

bart

Debasish said...

Thanks for noticing .. Now uploaded ..