Friday, September 26, 2008

Lean Data Models and Asynchronous Repositories

In an earlier post, I talked about scaling out the service layer of your application using actors and asynchronous processing. This can buy you some more donuts over and above your current throughput. But even with the extra processing of n actors pounding the cores of your CPU, the database will still be the bottleneck and the single point of failure (SPOF). As long as you have a single database, there will be latency and you will have to accept it.

Sanitizing the data layer ..

I am not talking about scaling up, which implies adding more fuel to your already sizable database server. In order to increase the throughput of your application proportionately with your investment, you need to scale out, add redundancy and process asynchronously. As Michael Stonebraker mentioned once, it boils down to one simple thing - latency. It's a "latency arms race" out there, and the arbitrager with the least latency in their system wins. And when we talk about latency, it's not the latency of any isolated component, it's the latency of the entire architecture.

An instance of an RDBMS is the single most dominant source of latency in any architecture today. Traditionally we have been guilty of upsizing the database payload with logic, constraints and responsibilities that do not belong to the data layer, or at least not in the form that today's relational model espouses. With an ultra-normalized schema we try to fit in a data model that is not relational in nature, resulting in the complexities of big joins and aggregates while doing simple business queries. Now, the problem is not with the query per se .. the problem is with the impedance mismatch between the business model and the data model. The user wants to view his latest portfolio statement, which has been stored in 10 different tables with complex indexed structures that need to be joined on the fly to generate the document.

One way to reduce the intellectual weight of your relational data model is to take out the elements that do not belong there. Technologies like CouchDB offer much lighter-weight solutions, with modeling techniques that suit non-relational, document-oriented storage requirements like a charm.
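To make the contrast concrete, here is a minimal Scala sketch of the portfolio statement modeled two ways: as normalized rows that have to be joined at read time, and as one self-contained document of the kind a document store would hold. Every type and field name here is hypothetical and not taken from any real schema; the sketch only illustrates where the join cost comes from.

```scala
// Hypothetical model, for illustration only.
object PortfolioSketch {
  // Relational-style rows: the statement exists only after a join.
  case class AccountRow(accountId: Int, owner: String)
  case class HoldingRow(accountId: Int, instrumentId: Int, quantity: Int)
  case class InstrumentRow(instrumentId: Int, symbol: String)

  // Document-style: everything needed to render the statement travels together.
  case class Holding(symbol: String, quantity: Int)
  case class PortfolioDocument(accountId: Int, owner: String, holdings: List[Holding])

  // The "join on the fly" that the normalized model forces on every read.
  def assemble(accounts: List[AccountRow],
               holdings: List[HoldingRow],
               instruments: List[InstrumentRow],
               accountId: Int): Option[PortfolioDocument] = {
    val symbols = instruments.map(i => i.instrumentId -> i.symbol).toMap
    accounts.find(_.accountId == accountId).map { acc =>
      val hs = for {
        h      <- holdings if h.accountId == accountId
        symbol <- symbols.get(h.instrumentId).toList
      } yield Holding(symbol, h.quantity)
      PortfolioDocument(acc.accountId, acc.owner, hs)
    }
  }
}
```

With the document form, reading the statement is a single lookup by key; `assemble` is the work that the relational form repeats on every request.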

Dealing with Impedance Mismatch

One of the reasons we need to do complex joins and use referential integrity within the relational data model is to ensure data sanity, prevent data redundancy, and enforce business domain constraints within the data layer. I have seen many applications that use triggers and stored procedures to implement business logic. Instead of trying to decry this practice, I will simply quote DHH and his "single layer of cleverness" theory on this ..

.. I consider stored procedures and constraints vile and reckless destroyers of coherence. No, Mr. Database, you can not have my business logic. Your procedural ambitions will bear no fruit and you'll have to pry that logic from my dead, cold object-oriented hands.

He goes on to say in the same blog post ..

.. I want a single layer of cleverness: My domain model. Object-orientation is all about encapsulating clever. Letting it sieve half ways through to the database is a terrible violation of those fine intentions. And I want no part of it.

My domain model is object oriented - the more logic I keep churning out on the relational model, the more subverted it becomes. The mapping of my domain model to a relational database has already introduced a significant layer of impedance mismatch, which we are still struggling with today - I do not want any of your crappy SQL-ish language to further limit the frontiers of my expressibility.

Some time back, I was looking at Mnesia, the commonly used database system for Erlang applications. Mnesia is lightweight, distributed, fault tolerant etc. etc. like all other Erlang applications out there. The design philosophy is extremely simple - it is meant to be a high-performance database system for Erlang applications only. Its designers never claimed it to be a language-neutral way of accessing data and instead focused on tighter integration with the native language.

Hence you can do this ..

% create a custom data structure
-record(person, {name,        %% atomic, unique key
                 data,        %% compound unspecified structure
                 married_to,  %% name of partner or undefined
                 children}).  %% list of children

% create an instance of it (the variable name Person is illustrative)
Person = #person{name = klacke,
                 data = {male, 36, 971191},
                 married_to = eva,
                 children = [marten, maja, klara]}.

% persist in mnesia (mnesia:write/1 has to run inside a transaction)
mnesia:transaction(fun() -> mnesia:write(Person) end).

and this ..

query [P.name || P <- table(person),
                 length(P.children) > X]
end

It feels so natural when I can persist my complex native Erlang data structure directly into my store and then fetch it using its list comprehension syntax.
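For comparison, the same record and query can be written against an ordinary in-memory Scala collection. This is only a sketch of what native collection semantics look like on the JVM: the `Person` class and the `namesWithMoreChildrenThan` helper are hypothetical, and nothing here is persisted.

```scala
// Hypothetical Scala analog of the Erlang record and query; no persistence involved.
object PersonQuerySketch {
  case class Person(name: String,
                    data: (String, Int, Int),   // compound, otherwise unspecified
                    marriedTo: Option[String],  // name of partner, if any
                    children: List[String])

  val klacke: Person =
    Person("klacke", ("male", 36, 971191), Some("eva"), List("marten", "maja", "klara"))

  // Same shape as the list comprehension: names of persons with more than x children.
  def namesWithMoreChildrenThan(table: Iterable[Person], x: Int): List[String] =
    (for (p <- table if p.children.length > x) yield p.name).toList
}
```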

Mnesia supports full transaction semantics when you need them. But for optimum performance it offers lightweight locking and dirty interfaces that promise the same predictable amount of time regardless of the size of the database. Mnesia is also primarily recommended for use as an in-memory database, where tables and indexes are implemented as linear hash lists. Alternatively, all database structures can be persisted to the file system using named files. In summary, Mnesia gives me the bare essentials that I need to develop my application layer and integrate it with a persistent data store, with a minimum of impedance against my natural Erlang abstraction level.

Let us just assume that we have an Mnesia on the JVM (call it JVMnesia) that gives me access to APIs that enable me to program in the natural collection semantics of the native language. Also I can define abstractions at a level that suits my programming and design paradigm, without having to resort to any specific data manipulation languages. In other words, I can define my Repositories that can transparently interact with a multitude of storage mechanisms asynchronously. My data store can be an in-memory storage that syncs up with a persistent medium using write-behind processes, or it can be the file system with a traditional relational database. All my query modules will bootstrap an application context that warms up with an in-memory snapshot of the required data tables. The snapshot needs to be clustered and kept in sync with the disk-based persistent store at the backend. We can have multiple options here. Terracotta with its Network Attached Memory offers similar capabilities. David Pollak talks about implementing something similar using the wire level protocol of Memcached.
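A rough sketch of the write-behind idea, using only the Scala standard library, might look like the following. JVMnesia does not exist, so every name here (`BackingStore`, `WriteBehindRepository`, `InMemoryStore`) is hypothetical. Clustering and the background flusher process are left out: `flush()` is called explicitly to keep the example deterministic.

```scala
import scala.collection.mutable

// Hypothetical sketch; clustering and background scheduling are omitted.
object WriteBehindSketch {
  // Abstraction over the persistent medium (a file, an RDBMS, ...).
  trait BackingStore[K, V] {
    def load(): Map[K, V]
    def persist(key: K, value: V): Unit
  }

  // Serves reads from an in-memory snapshot and defers writes to the backend.
  class WriteBehindRepository[K, V](store: BackingStore[K, V]) {
    private var snapshot: Map[K, V] = store.load()    // warm up from the backend
    private val pending = mutable.Queue.empty[(K, V)] // writes not yet persisted

    def get(key: K): Option[V] = synchronized { snapshot.get(key) }

    def query(p: V => Boolean): List[V] = synchronized { snapshot.values.filter(p).toList }

    def put(key: K, value: V): Unit = synchronized {
      snapshot = snapshot + (key -> value)
      pending += (key -> value)
    }

    // A real system would run this from a background process; returns the number flushed.
    def flush(): Int = synchronized {
      val n = pending.size
      while (pending.nonEmpty) {
        val (k, v) = pending.dequeue()
        store.persist(k, v)
      }
      n
    }
  }

  // Toy backend that keeps "persisted" data in a map, for demonstration only.
  class InMemoryStore[K, V] extends BackingStore[K, V] {
    private var data = Map.empty[K, V]
    def load(): Map[K, V] = data
    def persist(key: K, value: V): Unit = data = data + (key -> value)
    def persisted: Map[K, V] = data
  }
}
```

Reads and queries never touch the backend after warm-up, which is the point of the design; the price is that a crash before `flush()` loses the pending writes.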

Now that my JVMnesia offers a fast and scalable data store, how can we make the data processing asynchronous? Front it with an actor-based Repository implementation ..

trait Repository extends Actor

class EmployeeRepository extends Repository {

  def init: Map[Int, Employee] = {
    // initialize repository
    // load employees from backend store
  }

  private def fetchEmployeeFromDatabase(id: Int) = //..

  def act = loop(init)

  // the current state is threaded through the recursive call to loop
  def loop(emps: Map[Int, Employee]): Unit = {
    react {
      case GetEmployee(id, client) =>
        client ! emps.getOrElse(id, fetchEmployeeFromDatabase(id))
        loop(emps)  // keep serving requests with unchanged state
      case AddEmployee(emp: Employee, client) =>
        client ! DbSuccess
        loop(emps + (emp.id -> emp))  // continue with the new employee added
    }
  }
}

case class Employee(id: Int, name: String, age: Int)
case class GetEmployee(id: Int, client: Actor)
case class AddEmployee(emp: Employee, client: Actor)

case object DbSuccess
case object DbFailure

Every repository is an actor that serves up requests through asynchronous message passing to its clients.
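The `scala.actors` library used above is no longer part of current Scala releases, so here is a hedged sketch of the same idea with only `scala.concurrent.Future` and a single-threaded executor. This is a swapped-in technique, not the actor implementation itself: the single worker thread stands in for the actor's mailbox, and the `AsyncEmployeeRepository` class with its method names is hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical Future-based stand-in for the actor-based repository.
object AsyncRepositorySketch {
  case class Employee(id: Int, name: String, age: Int)

  class AsyncEmployeeRepository(initial: Map[Int, Employee]) {
    // One worker thread processes requests in submission order,
    // so `emps` is only ever touched by that thread and needs no locking.
    private val executor = Executors.newSingleThreadExecutor()
    private implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
    private var emps = initial

    // Callers get a Future back immediately instead of blocking on the store.
    def get(id: Int): Future[Option[Employee]] = Future { emps.get(id) }

    def add(emp: Employee): Future[Unit] = Future { emps = emps + (emp.id -> emp) }

    // Stops the worker thread once queued requests have been processed.
    def shutdown(): Unit = executor.shutdown()
  }
}
```

Because requests are queued in order, a `get` submitted after an `add` observes that `add`, which mirrors the one-message-at-a-time guarantee of the actor version.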

There is an ongoing effort towards implementing Erlang/OTP-like behavior in Scala. We can think of integrating the repository implementation with the process supervisor hierarchies that Scala-OTP will offer. Then we have seamless process management, fault tolerance, distribution etc., making it a robust data layer that can scale out easily.
