Monday, May 31, 2010

Grokking Functional Data Structures

With so many languages espousing functional programming paradigms, it's no wonder that functional implementations of common data structures are becoming more and more popular. As programmers, it's important that we learn to adopt them into our repertoire of tools and techniques.

Imperative data structures operate through in-place mutation of data. In our everyday use of data structures in Java we mutate lists, arrays and other types to realize the algorithms we implement. Most books on data structures and algorithms present their complexity analyses using imperative data structures as case studies. When we create an object in Java (an instance of an abstract data type) and invoke methods on it (some of which may mutate its state), we get a result as the observable outcome of that sequence of operations. Such a sequence of operations on a value of the abstract data type constitutes a logical future of that type.

With functional data structures, immutability is the buzz of the day, though not all of them have to be immutable. Many functional languages allow mutable data structures, which have, more often than not, proved more pragmatic than the purist school of thought. I wrote a blog post on this same topic some time back.

Purists, on the other hand, argue that the main point is not whether you need a mutable array or a hash table. The more important question is whether you can access or update a data structure within certain bounds of time and space complexity. Okasaki's book Purely Functional Data Structures contains a rich collection of functional implementations of many popular imperative data structures. And quite contrary to what we tend to believe, given proper language support, these implementations are often as efficient as their imperative counterparts.

When we talk about functional programming, we talk about immutability of data structures and purity of functions. In the absence of mutable state, these are the two cornerstones that make reasoning about concurrent programs much easier in the functional paradigm. Since data structures are immutable, there's no in-place update of values, which means that every operation that changes the value of a data structure creates a new instance of it. So at the end of n operations on an instance of an ADT we still have all the previous versions of the data structure available for access. This is called persistence, unlike the imperative structures we discussed earlier, where an update destroys the previous state (we call those ephemeral data structures). Unlike ephemeral data structures, which have a single logical future (as we saw earlier), persistent data structures can have multiple logical futures.

Consider the following scenario, where the list l can have 2 logical futures based on two possible operations that we perform on it. Neither of them mutates the original list, and at the end we still have l as '(1 2 3). The two logical futures simply point to two different instances of the ADT.
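
Here's a minimal Clojure sketch of the scenario (the two operations chosen are just illustrative):

```clojure
;; l is never mutated; each operation yields a new version
(def l '(1 2 3))

(def future-1 (cons 0 l))  ; one logical future:     (0 1 2 3)
(def future-2 (rest l))    ; another logical future: (2 3)

l ;=> (1 2 3), still intact; both futures share its cells
```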


Since Clojure sequences are persistent, all instances of the ADT remain accessible even after the entire sequence of operations. Had persistence meant brute-force copying of instances, we obviously would not have the same performance guarantees as the imperative versions. Persistent data structures are implemented through careful sharing of structure across instances, which makes them affordable in real-life applications. Consider a tree structure that copies the path from the root down to the newly inserted leaf while sharing the rest with the original ADT.


As with the earlier scenario, both versions of the tree are available even after adding the new node 8 - xs points to the original tree, while ys points to the new one, as the sketch below illustrates.
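
Here's a minimal Clojure sketch of such path copying (the tree representation is illustrative, not the one Clojure uses internally):

```clojure
;; a tiny persistent binary search tree: insert copies only the path
;; from the root down to the new leaf and shares everything else
(defn insert [t x]
  (if (nil? t)
    {:val x}
    (cond
      (< x (:val t)) (assoc t :left  (insert (:left t) x))   ; copy node, share right subtree
      (> x (:val t)) (assoc t :right (insert (:right t) x))  ; copy node, share left subtree
      :else t)))

(def xs (reduce insert nil [4 2 6 1 3 5 7]))
(def ys (insert xs 8))

;; xs is untouched; ys shares xs's entire left subtree
(identical? (:left xs) (:left ys)) ;=> true
```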

Even though functional data structures are persistent, their implementations may well use mutable storage internally. This is mostly done for efficiency, as we have seen with transients in Clojure. For all practical purposes, however, the data structure is published to its clients as an immutable one.
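
As an example, here's the typical way transients are used (a small sketch; build-vec is just an illustrative name):

```clojure
;; builds the vector with transient (mutable) storage internally,
;; then freezes it; callers only ever see the immutable result
(defn build-vec [n]
  (persistent!
    (reduce conj! (transient []) (range n))))

(build-vec 5) ;=> [0 1 2 3 4]
```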

It's also possible to implement a persistent data structure in an imperative language like Java. In fact, FunctionalJava does exactly that and offers a rich suite of persistent data structures developed in Java. But of course there's a trade-off. First, the usage looks intimidating, Java being quite verbose and lacking much in the way of type inference. And, as Okasaki points out in his book, you need an implementation of call-by-need in order to get the improved amortized complexity of many functional data structures. That's the subject of another post, some other day.

What would life be like without mutable arrays and hash tables, the bread-and-butter data structures on which imperative programming thrives?

Well, in functional languages lists, trees and tuples take the place of arrays and hash tables. The basic idea behind a functional data structure is to find a general representation of a sequence and then apply recursive transformations on it to implement specific operations. In this post I give you a very brief look at a couple of such building blocks that are used extensively to implement persistent data structures in many languages.
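
To make "recursive transformation" concrete, here is the canonical shape of the idea: a from-scratch map in Clojure (clojure.core/map is lazy and far more robust; this only shows the pattern):

```clojure
;; decompose the sequence into first/rest, transform the head,
;; and rebuild the result recursively (not stack-safe for long lists)
(defn my-map [f xs]
  (if (seq xs)
    (cons (f (first xs)) (my-map f (rest xs)))
    '()))

(my-map inc '(1 2 3)) ;=> (2 3 4)
```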

Consider the use case for a data structure where you would like to reach a specific item (often called a focus point) within the structure and do manipulations around it. With mutable structures you can reach a specific node within a list, an array or a tree and make changes to it. Indeed, the sole reason doubly linked lists exist is to let you reach a node within the list and move in either direction, or to splice in a new node using a minimum number of pointer manipulations.

In the functional world we have zippers, which can be adapted to this very use case. The zipper, invented by Gerard Huet, is a generic data structure that allows you to identify a focal point within an ADT and make all manipulations around it as cheap as in-place updates. And it does all this in a completely immutable and persistent way. Every focal point within the tree is identified by a pair of a Tree and a Context, each of which is again a recursive data structure. The root of the Tree is the focal point around which we would like to do our "mutable" operations. Huet's functional pearl paper has all the details of this data structure. Clojure has a nice implementation of the zipper as part of its core distribution. The zipper is a general-purpose data structure with lots of exciting use cases, some of which I hope to cover in upcoming posts.
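
Here's a small taste of Clojure's clojure.zip in action:

```clojure
(require '[clojure.zip :as zip])

;; navigate to a focal point inside a nested vector and "edit" it;
;; the original structure remains untouched
(def z (zip/vector-zip [1 [2 3] 4]))

(-> z
    zip/down        ; focus on 1
    zip/right       ; focus on [2 3]
    zip/down        ; focus on 2
    (zip/edit inc)  ; cheap update around the focus
    zip/root)       ;=> [1 [3 3] 4]
```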

Another interesting functional data structure is the finger tree, invented by Hinze and Paterson. A finger tree is a 2-3 tree representation of a sequence that gives you cheap access to the "fingers", the points where you would like to do your manipulations. You can then define transformation functions using reductions and monoids to implement specific operations. For example, you can annotate every node of a finger tree with a characteristic function (modeled as a monoid) and use it to look up elements having that property. If you want fast positional operations (like accessing the nth element of the sequence), annotate the finger tree with a "size" function, so that every node stores the size of its subtree as its measure. Given size annotations, we can find the nth element in O(log n) time. Annotating the tree with a "priority" function instead turns the finger tree into a priority queue. In the same way you can implement deques, interval trees and a host of other variants on top of the single underlying representation.
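
To see why size annotations give O(log n) positional access, here is a stripped-down sketch. This is not a real finger tree (which has a cleverer 2-3 spine and lazy structure); it only illustrates the measurement idea, and all the names are mine:

```clojure
;; every internal node caches the size of its subtree, i.e. its
;; "measure" under the (+, 0) monoid
(defn leaf [x] {:size 1 :val x})
(defn node [& children]
  {:size (reduce + (map :size children)) :children children})

;; descend using the cached sizes, skipping whole subtrees, so the
;; lookup takes time proportional to the tree's depth: O(log n)
(defn nth-elem [t n]
  (if (contains? t :val)
    (:val t)
    (loop [[c & cs] (:children t), i n]
      (if (< i (:size c))
        (nth-elem c i)
        (recur cs (- i (:size c)))))))

(def t (node (node (leaf :a) (leaf :b))
             (node (leaf :c) (leaf :d) (leaf :e))))
(nth-elem t 3) ;=> :d
```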

This post is just an introduction to functional data structures; I hope to cover some specific ones in future posts. The key implementation difference from mutable data structures is that functional ones are based on recursive transformations, which are easier to model with pure functions. They are persistent and, as we discussed above, can have multiple logical futures, which makes their runtime analysis a big challenge. In order to obtain amortized complexity bounds that match their mutable counterparts, you need support for laziness and call-by-need from the underlying implementation language: a computation, once forced, needs to be memoized so that the next logical future can use it without paying its cost of execution again.
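
Clojure's delay captures this call-by-need behavior in miniature (the println is only there to show that the body runs once):

```clojure
;; delay defers the computation and memoizes the result once forced
(def expensive (delay (do (println "computing...") 42)))

@expensive ; prints "computing...", returns 42
@expensive ; returns 42 immediately from the memoized value
```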

7 comments:

Viksit said...

Hehe, I've got a few posts coming up on this very topic :) Good stuff.

maddy said...

Yeah, I totally agree with you, buddy. It is really nice stuff.

Jon Harrop said...

Another great post but the following statement is wildly inaccurate:

"And quite contrary to what we tend to believe, given proper language support, these implementations are often as efficient as their imperative counterparts."

Purely functional data structures are not only extremely slow on one core but they destroy the scalability of parallelism across multicores.

Tony Piazza said...

I appreciate your post. Your timing is perfect as I am currently learning Clojure after spending many years programming in Java. I'm just starting to wrap my head around the whole concept of functional programming. Some recent work I did in graduate school could definitely benefit from a language more suited to concurrent programming. I look forward to your future posts on this topic. Thanks!

Ben Hutchison said...

For a broad perspective, Ben Lippmeier's thesis arguing the need for mutable data structure support in Haskell makes for interesting reading. He's on the "return-leg" of a journey into immutability.

From [http://www.cse.unsw.edu.au/~benl/papers/thesis/lippmeier-impure-world.pdf]:

"I am thinking of a data structure for managing collections of objects. It provides O(1) insert and update operations. It has native hardware support on all modern platforms. It has a long history of use. It’s proven, and it’s lightning fast.

Unfortunately, support for it in my favourite language, Haskell [PJ03a], appears to be somewhat lacking. There are people that would tell me that it’s not needed [PJW92], that there are other options [Oka98b], that it’s bad for parallelism [Can91] and bad for computer science in general [Bac78]. They say that without it, programs are easier to understand and reason about. Yet, it seems that every time I start to write a program I find myself wanting for it."

Anonymous said...

I looked at Ben Lippmeier's thesis referenced in the previous comment, since this is something I've been trying to understand.

He states:
"The first tool is a set of efficient array-like data structures for managing collections of objects, and the second is the ability to broadcast a new value to all parts of a program with minimal burden on the programmer."

In the first item, he's talking about arrays. Arrays are just plain fast on the metal, and especially in performance-critical areas like scientific computing they have their place. But they can, and should, be handled in a localized way.

I was very surprised to see his second point, since that is a global variable. I don't think I'm the only one who generally considers global variables undesirable. He also talks about the difficulty of updating things deep down in a tree, but I think that is handled with things like functional references.

Lisak said...

Debasish, this is one of those blogs that makes you wanna take a day off and read it all :-) Please keep writing