## Monday, September 15, 2008

### Lazy Functional Queues in Scala

Scala has a nice functional Queue implementation as part of scala.collection.immutable. It implements the standard technique of using a pair of lists [L, R], where L is the first portion of the queue and R is the rear portion in reverse order. As a result, the head of L is the first element of the queue and head of R points to the last element of the queue. Hence both can be fetched in O(1) time. Items are enqueued from the rear end, i.e. a O(1) insert in R, while they are dequeued from the front. Again a (supposedly) O(1) removal from L.

A nice trick indeed .. Here is the snippet from the Scala library ..

class Queue[+A](elem: A*) extends Seq[A] {  protected val in: List[A] = Nil  protected val out: List[A] = elem.elements.toList  //..  //..  def enqueue[B >: A](elem: B) = mkQueue(elem :: in, out)  def dequeue: (A, Queue[A]) = {    val (newOut, newIn) =      if (out.isEmpty) (in.reverse, Nil)      else (out, in)    if (newOut.isEmpty) throw new NoSuchElementException("queue empty")    else (newOut.head, mkQueue(newIn, newOut.tail))  }  //..  //..}

Now have a close look at dequeue. When we try to do a dequeue, with out being empty, in needs to be reversed, a definite O(n) operation. But, for a list of length n, this occurs only when there has been n insertions since last reversal. Hence dequeue still remains an amortized O(1) operation. The point is whether you can afford an occasional O(n) ..

In his paper "Simple and Efficient Purely Functional Queues and Deques", Okasaki discusses a technique that uses lazy lists and incremental computation to improve the above worst case complexity of dequeue to O(log n). In fact, later in the paper, he also implements an improvement of this version using full memoization and lazy languages to implement a worst case O(1) dequeue. Okasaki has written a beautiful thesis on functional data structures and their analyses techniques, which later has been published as a book. A highly recommended piece for the interested ones ..

Being Lazy

The main trick of Okasaki's O(log n) dequeue algorithm is to make L a lazy list. Instead of doing the above reverse in a single step as a strict operation, we can have [L, R] incrementally changed to [L ++ reverse(R), []], so that when the actual need for the reverse comes, we already have R reversed and appended to L. The meat lies in the reverse operation being done incrementally and the ++ operation being done lazily. Okasaki's paper describes the algorithm and it's analysis in great detail.

Using Scala Streams

Scala offers lazy lists in the form of Stream[A], where Stream.cons is defined as ..

object Stream {  object cons {    def apply[A](hd: A, tl: => Stream[A]) = new Stream[A] {      //..    }    //..  }  //..}

and offers a lazy concatenation ..

How about using Stream[A] for the list L and try out Okasaki's algorithm to get a faster functional persistent Queue in Scala ? Name it OQueue[+A] ..

class OQueue[+A] {  protected val front: Stream[A] = Stream.empty  protected val sizef: Int = 0  protected val sizer: Int = 0  protected val rear: List[A] = Nil  //..  //..}

Okasaki calls the incremental adjustment of [L, R] operation as a rotate. Here is the implementation of rotate in OQueue, which gets called during construction of the OQueue. Note that OQueue (and Scala's Queue), being functional, is also persistent - hence every mutating operation returns a new version of the data structure. This is unlike imperative data structures that implement destructive updates and maintain only the latest version. Hence rotate gets called as part of OQueue construction. However, rotate is not called for every call of the constructor - it is called only when |R| = |L| + 1 and we need to start the process of incremental reversal. And the trick is to start the process of reversing R just at the moment R gets longer than L and interleave the append to L with reverse of R. The details are too gory for a blog post and can be found with all it's complexities in the original paper.

class OQueue[+A] {  //..  //..  protected def rotate[A](xs: Stream[A], ys: List[A], rys: Stream[A]): Stream[A] = ys match {    case y :: ys1 =>      if (xs isEmpty) Stream.cons(y, rys)      else        Stream.cons(xs.head, rotate(xs.tail, ys.tail, Stream.cons(y, rys)))    case Nil =>      //..  }  //..}

and now the constructor / factory method for constructing an OQueue[A] ..

class OQueue[+A] {  //..  //..  protected def makeQ[A](f: Stream[A], sf: Int, r: List[A], sr: Int): OQueue[A] = {    if (sr <= sf) {      new OQueue[A]() {        override protected val front = f        override protected val sizef = sf        override protected val rear = r        override protected val sizer = sr      }    } else {      new OQueue[A]() {        override protected val front = rotate(f, r, Stream.empty)        override protected val sizef = sf + sr        override protected val rear = Nil        override protected val sizer = 0      }    }  }  //..}

The other methods are fairly trivial and nicely fall in place ..

class OQueue[+A] {  //..  //..  def isEmpty: Boolean = sizef == 0  def enqueue[B >: A](elem: B) = {    makeQ(front, sizef, elem :: rear, sizer + 1)  }  def dequeue: (A, OQueue[A]) = {    (front.head, makeQ(front.tail, sizef - 1, rear, sizer))  }  def elements: Iterator[A] = (front.force ::: rear.reverse).elements  def length = (sizef + sizer)  //..  //..}

Fast ?

Analyzing lazy, persistent functional data structures is hard and so is benchmarking them. I did not do an elaborate benchmarking of OQueue with scala.collection.immutable.Queue. But from whatever I did, I could not get any speedup over the Scala library implementation. Scala, unlike Haskell, is not a lazy language. And lazy languages offer memoization of function arguments i.e. arguments are evaluated only when they are needed and once evaluated, they are cached for future reuse. In the above implementation of OQueue, we need to evaluate front.tail a number of times - an effectively lazy language can make this operation O(1) through memoization. Also, Scala, being implemented on top of the JVM, is not the best platform for heavy recursion and the above implementation of rotate is also not tail recursive, that Scala could optimize. Okasaki implemented the algorithm in SML, a lazy functional language and possibly got effective speedups over the traditional paired list implementation of queues.

I am not a Scala expert, any suggestions or critiques regarding the above implementation will be most welcome. Of course, I plan to do some more micro-benchmarking. But I think there may be dominant constants that offset the asymptotic complexity figures of the above analysis. Besides, the above implementation does not take any advantage of memoization, which is yet another secret sauce behind speedups of functional implementations. Anyway, it was, I feel, a worthwhile exercise trying my hands on implementing a functional data structure.

Postscript

I just discovered that Rich Hickey addresses this problem in a different way in Clojure. He has used the traditional paired list implementation, but without R storing the rear end in a reversed manner. Instead he uses a PersistentVector as the rear - and his PersistentVector is based upon a trie based implementation that gives near constant time access to elements.

red1 said...

in feb this year i started this http://www.wscoop.com its a digg stylr site but deicated to web desing and development

I need links to web design and development, site , articles links...... anything you think of....

Please are you abel to help me make this into a great resource for developers and designers.

You are more than welcome to prompt your own sites, projects, comments and all... at the moment the posts go straight to the font page to it is good exposer and at least some sweet backlinks

Thanks

red1 said...

in feb this year i started this http://www.wscoop.com its a digg stylr site but deicated to web desing and development

I need links to web design and development, site , articles links...... anything you think of....

Please are you abel to help me make this into a great resource for developers and designers.

You are more than welcome to prompt your own sites, projects, comments and all... at the moment the posts go straight to the font page to it is good exposer and at least some sweet backlinks

Thanks

walterc said...

FYI, scala.collection.immutable.{IntMap, LongMap} are implemented using the same algorithm Rich uses for Clojure's persistent vector.

James Iry said...

Stream is memoized, so the first call to front.tail will have to be calculated but each call after that will reuse the cached result.

Similarly, rotate doesn't need to be tail recursive. The recursive call to rotate is the lazy tail argument for a stream so its invoked outside the original call to rotate.

One way to clean the code up significantly is to make all your fields constructor parameters for a private constructor. Add a public no-arg constructor that calls it with the default empty stuff.

class OQueue[+A] private (front: Stream[A], sizef: Int, rear: List[A], sizer: Int) {
def this() = this(Stream.empty, 0, Nil, 0)

makeQ gets fixed up accordingly
private def makeQ[A](f: Stream[A], sf: Int, r: List[A], sr: Int): OQueue[A] =
if (sr <= sf) new OQueue[A](f, sf, r, sr)
else new OQueue[A](rotate(f, r, Stream.empty), sf + sr, Nil, 0)

rotate can be made private once you do that, too.

I'll have to give some thought to performance. My guess is that the lazy version has a smoother performance curve than Scala's queue since it doesn't have those O(n) reversing hiccups, but that overall it's likely to be slightly slower due to the overhead of lazy thunk creation.

Debasish said...

James -
Yeah .. I need to clean up the code. But your observation is correct - the lazy implementation has a smoother performance curve than the Scala library implementation. And it is somewhat slow as well, I mentioned about dominant constants in the post, which as u mention, is due to the overhead of lazy thunk creation.

But what, according to you is the reason for this algorithm to run faster in languages like SML than the stricter version ?

James Iry said...

Is it really faster in SML? I don't recall Okasaki's paper having any benchmark numbers - just very smart big-O analysis. I'm a bit surprised. As you pointed out, the operations are O(1) (amortized) in both the lazy and eager versions but the constant factors should be higher in the lazy version.

To put it another way, both are doing the work of prepending to sequences and reversing sequences. The lazy one just does the reverse a little bit at a time instead of periodic big bangs. Along the way the lazy one also has to do "extra" work of creating thunks.

Still, it's possible there's some hidden performance boost related to locality of reference or something. I'm just not seeing it.

Debasish said...

I also could not find any benchmark results from Okasaki. I just assumed that it ran faster in SML - but maybe I was wrong.
Actually the strict version is O(1) amortized but O(n) worst case. While the lazy version is O(1) amortized and O(log n) worst case.
He also presents another algorithm that used 3 lists and which is O(1) worst case. I plan to try that as well and see how it fares in Scala.

Skalecki said...

I want to point out a couple things. First, Okasaki implemented his data-structures in a lazy variant of Standard ML. By design SML is a strict (mostly) functional language, so it's unlikely that significant optimization was put into the lazy extensions, as they see little general use. That said, the backend of the SML of NJ compiler doesn't target the JVM and handles tail calls very well, so it was likely faster.

The choice of a lazy implementation over a strict implementation really boils down to the fact that amortized analysis is not generally valid when dealing with eager persistent data structures. Okasaki recognized that lazy data-structures with memoization can be used overcome this.

To see this for yourself, construct an example where the expensive O(n) operation is performed on the same value repeatedly. This kills the performance of the eager data-structure, but this isn't seen in the lazy one.

If, on the other hand, you only use the data-structure in a linear fashion, the amortized performance of the eager implementation is probably very good, but that kills the motivation for having a persistent data-structure in the first place.