Monday, June 28, 2010

A Phase Shift for the ORM

I came to know about Squealer from one of Dean's tweets over the last weekend. Over there at the git repo README, there's a statement which makes a very succinct point on the role that relational mappers will be playing in the days to come. It says "... ORMs had it the wrong way around: that the application should be persisting its data in a manner natural to it, and that external systems (like reporting and decision support systems - or even numbskull integration at the persistence layer) should bear the cost of mapping."

I have expressed similar observations in the past, when I talked about the rationale of modeling data close to the way applications will be using them. I talk about this same architecture in an upcoming IEEE Software Multi-Paradigm Programming Special Issue for Sep/Oct 2010.

In most of the applications that churn out domain models in an object oriented language and persist data in a relational store, we use the ORM layer as follows :

It sits between the domain model and the relational database, provides an isolation layer between the two at the expense of an intrusive framework invasion within an otherwise non-complicated application architecture.

The scenario changes if we allow the application to manipulate and persist data in the same form that it uses for modeling its domain. My email application needs an address book as a composite object instead of being torn apart into multiple relational tables in the name of normalization. It will be better if I can program the domain model to access and persist all its entities in a document store that gives it the same flexibility. So the application layer does not have to deal with the translation between the two data models that adds a significant layer of complexity today. The normal online application flow doesn't need an ORM.

How does the translation layer get shifted ?

Consider an example application that uses Terrastore as the persistent storage for implementing online domain functionalities. Terrastore is a document database having some similarities with each of CouchDB, MongoDB and Riak in the sense that data is stored in the form of JSON documents. However unlike others it has an emphasis on the "C" component of the CAP theorem and provides advanced scalability and elasticity features without sacrificing consistency.

Terrastore is built on top of Terracotta, which is itself an ACID based object database that offers storage of data larger than your RAM size clustered across multiple machines. Terrastore uses the storage and clustering capabilities of Terracotta and adds more advanced features like partitioning, data manipulation and querying through client APIs.

As long as you are using Terrastore as your persistent database for the domain model, your data layer is at the same level of abstraction as your domain layer. You manipulate objects in memory and store them in Terrastore in the same format. Terrastore, like all other NoSQL stores is schemaless and offers a collection based interface storing JSON documents. No impedance mismatch to handle so far between the application and the data model.

Have a look at the following figure where there's no additional framework sitting between the application model and the data storage (Terrastore).

However there are many reasons why you would like to have a relational database as an underlying store. Requirements like ad hoc reporting, building decision support systems or data warehouses are some of the areas which are best supported with relational engines or any of the extensions that relational vendors offer. Such applications are not real time and can very well be served out of a snapshot of data that has a temporal lag from the online version.

Every NoSQL store and many SQL ones as well offer commit handlers for publishing async jobs. In Terrastore you can write custom event handlers that you can use to publish information from the document store to an underlying relational store. It's as simple as implementing the terrastore.event.EventListener interface. This is well illustrated in the above figure. Translation to the relational model takes place here which is one level down the stack in your application architecture. The Terrastore event handlers are queued up in a synchronous FIFO manner while they execute asynchronously which is exactly what you need to scale out your application.

I took up Terrastore just as an example - you can do most of the stuff (some of them differently) with other NoSQL stores as well front ending as the frontal persistent store of your application. In real life usage choose the store that best maps the need of your domain model. It can be a document store, it can be a generic key value store or a graph store.

The example with which I started the post, Squealer, provides a simple, declarative Ruby DSL for mapping values from document trees into relations. It was built to serve exactly the above use case with MongoDB as the frontal store of your application and MySQL providing the system of record as an underlying relational store. In this model also we see a shift of the translation layer from the main online application functionalities to a more downstream component of the application architecture stack. As the document says "It can be used both in bulk operations on many documents (e.g. a periodic batch job) or executed for one document asynchronously as part of an after_save method (e.g. via a Resque job). It is possible that more work may done on this event-driven approach, perhaps in the form of a squealerd, to reduce latency". Nice!

All the above examples go on to show us that the translation layer which has so long been existing between the core application domain logic and the persistent data model has undergone a phase shift. With many of today's NoSQL data stores allowing you to model your data closer to how the domain needs it, you can push away the translation further downstream. But again, you can only do this for some applications that fit this paradigm. With applications that need instant access to the relational store, this will not be a good fit. Use it whereever you deem it's applicable - besides the simplicity that you will get in your application level programming model, you can also scale your application more easily when you have a scalable frontal database along with non-blocking asynchronous writes shoveling data into the relational model.

Sunday, June 20, 2010

Scala Implicits : Type Classes Here I Come

I was having some discussion on type classes in Scala with Daniel on Twitter the other day when I suddenly discovered one of my unfinished posts on the same topic. Not that there's anything new that you will get to know from this rant. But these are some of my thoughts on the value that type class based thinking adds to your design space. I started this post when I published my thoughts on orthogonality in design some time back.

Let's start with the GOF Adapter pattern .. the object adapter that uses the highly recommended composition technique to bind abstractions. The example is also the same which I used for discussing orthogonality in design ..

case class Address(no: Int, street: String, city: String, 
  state: String, zip: String)

And we want to adapt this to the interface of a LabelMaker .. i.e. we would want to use Address as a LabelMaker ..

trait LabelMaker[T] {
  def toLabel(value: T): String

the adapter that does the interface transformation ..

// adapter class
case class AddressLabelMaker extends LabelMaker[Address] {
  def toLabel(address: Address) = {
    import address._
    "%d %s, %s, %s - %s".format(no, street, city, state, zip)

// the adapter provides the interface of the LabelMaker on an Address
AddressLabelMaker().toLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))

Now what's the incidental complexity that we introduce in the above design ?

As a client our point of focus has shifted to the adapter class AddressLabelMaker, which wraps our original abstraction, the Address instance. This leads to an identity crisis of the Address instance itself, a typical problem with any wrapper based idiom. The identity of the adaptee is now lost within the identity of the adaptor. And if you are brave enough to have the adaptee as an aggregate member of another object, then obviously the entire composition of the adapter idiom breaks down.

Conclusion : Object adapters don't compose. And the class variant of the adapter pattren is worse in the sense that you have a wiring by inheritance right from start, which makes the entire design much more rigid and coupled.

Enter Type Class ..

In the above structure of the adapter pattern, the adaptee was wrapped into the adapter leading to the identity loss of the adaptee that we discussed earlier. Let's take a different approach and see if we can abstract the adapter away from the client usage. The type class approach does exactly offer this way of composing abstractions - the type system of the language selects the appropriate adapter instance during compile time, so that the user can program explicitly to the adaptee.

Consider this function printLabel, that takes an argument and prints the label using the LabelMaker that we supply to it ..

def printLabel[T](t: T)(lm: LabelMaker[T]) = lm.toLabel(t)

In order to make a label out of an Address, we need to define the adapter that does it. In Scala we have first class module support through the object syntax. Let's define a module that defines the semantics to convert an Address to a LabelMaker.

object LabelMaker {
  implicit object AddressLabelMaker extends LabelMaker[Address] {
    def toLabel(address: Address): String = {
      import address._
      "%d %s, %s, %s - %s".format(no, street, city, state, zip)

Note that we make the adapter a Scala object with an implicit qualifier. What this does is that the compiler supplies the implicit argument if it finds one appropriate instance within the lexical scope. But for that we need to have the LabelMaker argument to printLabel implicit as well.

def printLabel[T](t: T)(implicit lm: LabelMaker[T]) = lm.toLabel(t)

or using the Scala 2.8 context bound syntax, we can make the implicit argument unnamed ..

def printLabel[T: LabelMaker](t: T) = implicitly[LabelMaker[T]].toLabel(t)

We don't supply the implicit argument at all, instead use the context bound syntax where the compiler automatically picks up the appropriate instance from the enclosing lexical scope. In the above example the implicit object AddressLabelMaker needs to be in scope of the function where you call printLabel. It will complain if it doesn't find one - hence you are alerted during compile time itself. Look ma - No evil run time failures ..

Now if we want to print a label for an Address, we do ..

printLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))

No incidental complexity in the client code in the form of adapter objects, the abstractions are explicit, we only supply the object for which we want to print the labels. We get rid of indirections as well. Not only that, if you look at the various abstractions that constitute the surface area of the entire design, modularity speaks for itself. The client only programs on the type class definition while the type class instance is nicely abstracted away *only* for the eyes of the compiler.

Let's see how Haskell does it ..

We have come this far discussing type classes without mentioning Haskell once. Let's make amends and see how the above design fits in the pure functional world of Haskell.

LabelMaker is the type class - let's define it in Haskell ..

class LabelMaker a where
    toLabel :: a -> String

This corresponds to our LabelMaker trait definition in Scala.

We want to make Address as a LabelMaker. Here's an instance of the above type class for Address ..

instance LabelMaker Address where
    toLabel (Address h s c st z) = show(h) ++ " " ++ s ++ "," ++ c ++ "," ++ st ++ "-" ++ z

This corresponds to the implicit object AddressLabelMaker definition in Scala that we did above. Scala's first class support for executable modules is an added feature that we don't have in Haskell.

Oh BTW here's the Address definition in Haskell record syntax ..

type HouseNo = Int
type Street = String
type City = String
type State = String
type Zip = String

data Address = Address {
      houseNo        :: HouseNo
    , street         :: Street
    , city           :: City
    , state          :: State
    , zip            :: Zip

And now we can define printLabel that uses the type class and generates the String for the label ..

printLabel :: (LabelMaker a) => a -> String
printLabel a = toLabel(a)

This corresponds to the Scala definition of printLabel that we have above.

A little observation ..

Note that one place where Haskell is much less verbose than Scala in the above implemnetation is the instance definition of the type class. But here the added verbosity in Scala is not without a purpose and offers a definite advantage over the corresponding Haskell definition. In Scala's case we name the instance explicitly as AddressLabelMaker, while the instance is unnamed in case of Haskell. The Haskell compiler looks into the dictionary on the global namespace for a qualifying instance to be picked up. In Scala's case the search is performed locally in the scope of the method call that triggered it. And because it's an explicitly named instance in Scala, you can inject another instance into the scope and it will be picked up for use for the implicit. Consider the above example where we have another instance of the type class for Address that generates the label in a special way ..

object SpecialLabelMaker {
  implicit object AddressLabelMaker extends LabelMaker[Address] {
    def toLabel(address: Address): String = {
      import address._
      "[%d %s, %s, %s - %s]".format(no, street, city, state, zip)

We can choose to bring this special instance in scope instead of the default one and generate a label for our address in a very special way ..

import SpecialLabelMaker._
printLabel(Address(100, "Monroe Street", "Denver", "CO", "80231"))

You don't get this feature in Haskell.

Type classes in Scala and Haskell ..

Type classes define a set of contracts that the adaptee type needs to implement. Many people misinterpret type classes synonymously with interfaces in Java or other programming languages. With interfaces the focus is on subtype polymorphism, with type classes the focus changes to parametric polymorphism. You implement the contracts that the type class publishes across unrelated types.

A type class implementation has two aspects to it :-
  1. defining the contracts that instances need to implement
  2. allow selection of the appropriate instance based on the static type checking that the language offers
Haskell uses the class syntax to implement (1), though this class is a completely different concept than the one we associate with OO programming. For Scala we used traits to implement this feature and an extension of the trait into an object to implement instances of it.

As I mentioned earlier, Haskell uses a global dictionary to implement (2), while Scala does it through a lookup in the enclosing scope of invocation. This gives an additional flexibility in Scala where you can select the instance by bringing it into the local scope.

Monday, June 14, 2010

Functional Data Structures in a Persistent Setting - Part 2: Okasaki's BankersQueue

In the last post we saw how Okasaki's BatchedQueue does not fit in the model of persistence. Okasaki showed that the data structure gives you an amortized O(1) for tail but strictly in an ephemeral way where we have only one logical future for a sequence of operations. Let's consider a hypothetical situation with the scenario that we left with in the last post. The main problem why the data structure failed in the context of persistence was that by the time it reached the realization stage of logical future #2, we were left with no credits to pay for it. But what happens if we can ensure that the credits for logical future #2 has already been paid for before. That is, logical future #2 could reuse the computation that logical future #1 had already done before.

Enter Memoization and call-by-need. Implementing the functional queue in a lazy language that supports call-by-need (like Haskell) will enable us to reuse the computation that one logical future has already performed.

It's called a BankersQueue that gives us amortized O(1) in a persistent setting. Here's the snippet that implements the data model..

module BankersQueue (BankersQueue) where
  import Prelude hiding (head, tail)
  import Queue

  data BankersQueue a = BQ Int [a] Int [a]

The basic stuff is the same as the BatchedQueue. In the implementation of the BatchedQueue in a strict language, we would use lists with strict evaluation semantics. BankersQueue need lazy lists and call-by-need semantics. In addition to the two lists, the BankersQueue also maintains the lengths of the lists as part of the data structure. This information will be used when we try to determine when to reverse the rear and append it to the front.

Lazy functional amortized data structures are not easy to analyze for time complexity. The main difficulty is the fact that operations don't get executed when they seem to. Instead we have suspensions or thunks that get created, only to be forced when actually needed. Okasaki's framework based on debits suggests that suspensions can be forced only if the debts assigned to them (proportional to the shared cost of the operation) has been paid off. I am not going into the details of explaining the banker's method framework. The basic idea is that you can enjoy your pie only if you have completely paid for it. There has to be a linear and sequential ordering between the following three stages in the lifecycle of a suspension.

Create a Suspension (1) => Payoff a Suspension (2) => Force a Suspension (3)

In the implementation for BatchedQueue we were doing the reverse of the rear only when the front becomes empty. In the banker's method based on debits, doing so will not give us sufficient time to pay for the entire reverse operation. Remember the crucial thing is to ensure that the following sequence is honored:
  1. calling reverse creates a suspension and we assign a debt to it proportional to the shared cost of the operation. In our case it's the length of the rear list.
  3. operations must discharge debits enough to pay off the debt assigned to the suspension for reverse.
  5. actually execute reverse since it's debt has been paid off.
In order to ensure that we pay off the debt assigned to reverse, we have to discharge an equivalent number of debits before the suspension for reverse is forced. And since the debt on the suspension is |rear|, this is possible only if we force the reverse after |rear| operations. Every tail discharges one debit off the front and we need to force the reverse after |front| operations. Which means that, after |front| operations we must be ready to execute reverse assuming that we have paid off all debts associated with it. This is possible if we create the suspension when |rear| just exceeds |front|. The following changes to the check function ensures this ..

check lenf f lenr r =
  if lenr <= lenf then BQ lenf f lenr r
  else BQ (lenf + lenr) (++ reverse r) 0 []

Have a look at the following diagram that shows how reverse gets it done in O(1) amortized in a persistent setting.

Note how laziness with call-by-need is the secret sauce behind the numbers. BankersQueue can also be implemented in languages that offer optional laziness. Scala can implement it with Stream, which is a lazy sequence and lazy val, that offers call by need semantics. However an interesting benchmark can be done with the corresponding implementation in Haskell. It remains to be seen how the overhead of creating and managing suspensions can have an impact on the runtime complexity.

In the next post on this series I will discuss yet another data structure that offers O(1) amortized functional queue and deque implementations. It's actually a tree structured model and can be malleable enough to offer a host of other data structure implementations as well.

Monday, June 07, 2010

Functional Data Structures in a Persistent Setting - Part 1

When you have multiple options of implementing a data structure, or you have multiple data structure implementations of the same ADT to choose from, what do you do ? Do you choose the one that has the simplest of implementations, the one that has the best amortized complexity or the one that has the best worst case complexity. Dick Lipton has a great post on how to decide What is the right complexity measure for algorithms ?. In case you haven't read it yet, leave this useless blog post and go read it over at his blog.

Which implementation of a data structure you should use depends on your use case. If you are more concerned about the overall time complexity of a sequence of operations on the data structure, choose the one with the best amortized complexity bounds. When you have an amortized data structure with a time bound of O(n) for a sequence of n operations, there may be some individual operations that take more than O(1). But you are ok so long the overall bound is maintained at O(n) for the whole sequence.

However, there are situations where you cannot afford amortized data structures. Many real time systems need predictability that can only be guaranteed through worst case time bounds. For such cases, you need to use a data structure implementation that gives you the worst case promise.

In my last post I discussed many functional data structures that come with the guarantee of being persistent. Which means that even if you make changes in the data structure, you will still be able to access all the earlier versions at no added performance cost. In that post I briefly touched upon some of the techniques that implementation of persistence employs that ensures there's no brutal copying going on behind the scenes. Clojure does that to great effect in implementing its family of persistent data structures. Scala 2.8 also has an implementation of Bagwell's Hash Mapped Array Trie, the secret sauce behind Clojure's persistent hash map and vectors. But this is not a post for discussing Clojure or Scala data structure implementations  .. so let's move on ..

Amortized data structures, in a sense, walk the middle path between Dick Lipton's Genius Channels and Dumb Channels. You have an adversary that makes some of the operations complicated threatening to pull down your promise. But at the same time you have the friendly operations that act as the counter-balancing force thereby maintaining the amortized bounds over the whole sequence of operations.

Consider Okasaki's BatchedQueue implementation, that uses a pair of lists to implement a queue data structure. The list of elements in the queue is split across the two lists, front and rear. front contains the half of the elements in the correct order while rear contains the other half in the reverse order. Dequeue takes place from the front (see head below), while enqueue takes place at the rear (see snoc below). Here's the Haskell implementation of the BatchedQueue taken from Okasaki's book ..

Clearly head offers worst case O(1). But tail has worst case O(n), since it has to reverse the rear list when front becomes empty. But with an amortized analysis using either the credits of the Banker's method or the potential of the Physicist's method, you can prove that all operations offer amortized O(1). Okasaki's book contains the proof of the amortized bounds for BatchedQueue.

The problem with BatchedQueue is that all bounds break in a persistent setting. Remember we talked about multiple logical futures for a persistent data structure. For one specific instance of the BatchedQueue, when you invoke tail persistently, you are actually doing the reverse multiple times, once for each of the logical futures of the operation that create the new object. Have a look at the following diagram that illustrates how BatchedQueue cannot offer an amortized O(1) in a persistent setting.

In the figure above, we have the original queue containing the numbers from 1 to 10 distributed evenly across front and rear lists. After 4 invocations of tail, we have the second scenario with only a single element left in the front list. If we have 2 invocations of tail in a persistent setting, each will invoke reverse separately despite the fact that the rear list only has 5 credits that can be spent by one of those operations.

In an upcoming post, I will discuss BankersQueue, an alternative implementation of the queue data structure that offers O(1) amortized complexity under persistent setting.