Haskell: Queues without pointers
Haskell has been tying my brain in knots. Sure, it keeps teaching me all sorts of amazing things, but it's also forcing me to relearn the basics.
Right now, I'm trying to implement simple data structures in Haskell. It's challenging, because most typical data structures are based on updating pointers, and Haskell doesn't allow that. (Well, I could cheat, and use the IO monad, but where's the fun in that?)
Fortunately, Chris Okasaki has done some amazing research into Haskell (and ML) data structures. He wrote an excellent thesis, which was later published as Purely Functional Data Structures. It's a comprehensive book, well-written, and highly recommended for anyone programming in a functional language.
Let's take a look at a data structure from Okasaki's book: A purely functional queue. It's competitive with a traditional queue (on average, at least), but it doesn't use any pointers.
The queue API
A queue is a list of values to process. We can add values to the back of the queue, and retrieve values from the front.
In Haskell, we often begin by writing down the types of our functions. If we're lucky, the code will fill itself in almost automatically. The first part of our API is easy:
-- A queue holding values of type 'a'.
data Queue a = ???
-- Create a new queue.
newQueue :: Queue a
-- Test whether a queue is empty.
empty :: Queue a -> Bool
The next part is a bit trickier. Since we can't update our actual
Queue values, our data structure is read-only! We can work
around this by making enq and deq return an
updated version of the Queue:
-- Add an item to the back of the queue,
-- returning the updated queue.
enq :: Queue a -> a -> Queue a
-- Remove an item from the front of the
-- queue, returning the item and the
-- updated queue.
deq :: Queue a -> (a, Queue a)
We could use this API as follows:
q = newQueue `enq` 1 `enq` 2 `enq` 3
(x,q') = deq q -- x is 1
(y,q'') = deq q' -- y is 2
Now, the tricky bit: Making it run quickly.
Haskell queues, in O(1) amortized time
We could represent our queues as lists, but they'd be slow. Appending a value to the end of a Haskell list requires walking over all the existing values.
The trick is simple: We represent a queue as two lists. The first list contains the front of the queue, and the second list contains the back. We store the back part in reverse order, so we can add a new value in constant time:
data Queue a = Queue [a] [a]
The newQueue and empty functions are trivial:
newQueue = Queue [] []
empty (Queue [] []) = True
empty _ = False
Adding a new item to the queue is easy. We just cons it onto the front of the second (reversed) list:
enq (Queue xs ys) y = Queue xs (y:ys)
The first two parts of deq are similarly straightforward:
-- If the queue is empty, raise an error.
deq (Queue [] []) =
error "Can't deq from an empty queue"
-- If there's at least one item in the front
-- part of the queue, return it.
deq (Queue (x:xs) ys) = (x, Queue xs ys)
But what if the front part of the queue is empty? We can just reverse the back part and move it to the front:
-- If the front part is empty, reverse the
-- back part, move it to the front, and try
-- again.
deq (Queue [] ys) =
deq (Queue (reverse ys) [])
Our implementation of enq runs in constant time. But
deq occasionally needs to reverse a list, which requires
walking the entire length of the list. Fortunately, each item only gets
moved once, so deq averages out to constant time.
And note that our queues are "persistent"--we can hold onto earlier versions of the queue, in case we need to implement "Undo" or some sort of backtracking. (Update: As mauke points out, using this queue persistently can break the amortized O(1) guarantees. See Okasaki for a discussion and simple fix.)
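To make the persistence concrete, here's the whole queue assembled into one runnable sketch. Both the old queue and the updated one remain valid, which is exactly the "free undo" property described above:

```haskell
-- The two-list queue from the article, plus a small
-- demonstration that old versions stay usable.
data Queue a = Queue [a] [a]

newQueue :: Queue a
newQueue = Queue [] []

enq :: Queue a -> a -> Queue a
enq (Queue xs ys) y = Queue xs (y : ys)

deq :: Queue a -> (a, Queue a)
deq (Queue [] [])     = error "Can't deq from an empty queue"
deq (Queue (x:xs) ys) = (x, Queue xs ys)
deq (Queue [] ys)     = deq (Queue (reverse ys) [])

main :: IO ()
main = do
  let q        = newQueue `enq` 1 `enq` 2 `enq` (3 :: Int)
      (x,  q') = deq q
      (y,  _ ) = deq q'
      (x2, _ ) = deq q   -- the original q is untouched: "undo" for free
  print (x, y, x2)       -- (1,2,1)
```

Note that `deq q` is evaluated twice here, against two different "versions" of history, and both give the same answer.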
There's a lot more good stuff in Okasaki's book: True constant-time queues, an explanation of how laziness and strictness affect running time, and a wide variety of standard data structures. Highly recommended.
Generally when you implement this structure, you also want to provide a front operation, so that you can peek at the front item without removing it (that also permits you to have deq just yield a new queue, rather than a pair, which is a bit cleaner). To make that work, never permit the heads list to go empty unless the whole queue is empty; it’s a fairly trivial modification.
With a bit more playing, you can actually get the worst-case down to O(1), provided you are willing to do a little more juggling. Hint: Don’t wait till the heads list gets empty to move the tails, and think about interleaving append with reverse. The constant factors are worse, but not by a lot.
I second your recommendation of Okasaki’s book. It is well worth a close reading.
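The front operation suggested above can be sketched like this. The `queue` function below is a "smart constructor" of my own invention that maintains the invariant (front list empty only when the whole queue is empty):

```haskell
-- Two-list queue with the invariant: if the front list is
-- empty, the whole queue is empty.
data Queue a = Queue [a] [a]

-- Smart constructor: restore the invariant whenever the
-- front list would become empty.
queue :: [a] -> [a] -> Queue a
queue [] ys = Queue (reverse ys) []
queue xs ys = Queue xs ys

newQueue :: Queue a
newQueue = Queue [] []

enq :: Queue a -> a -> Queue a
enq (Queue xs ys) y = queue xs (y : ys)

-- Peek at the front item without removing it.
front :: Queue a -> a
front (Queue (x:_) _) = x
front _               = error "front: empty queue"

-- With the invariant in place, deq can return just the new queue.
deq :: Queue a -> Queue a
deq (Queue (_:xs) ys) = queue xs ys
deq _                 = error "deq: empty queue"
```

Because every constructor call goes through `queue`, the `front` and `deq` functions never need a case for `Queue [] (y:ys)`.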
Interesting posting! Thanks.
Your link to Okasaki’s thesis is incorrect. The correct link is http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
According to Okasaki, you actually need to do a bit more work to get non-amortized O(1) queues, at least in a lazy language. He keeps a list of values which need to be forced, and forces one during every operation.
I’d have to work through his example by hand to truly understand the issues involved. I find it hard to reason about when work actually occurs in a lazy language.
But, yeah—awesome book.
Michael Chermside: Thanks! I fixed it.
[Y]ou actually need to do a bit more work to get non-amortized O(1) queues, at least in a lazy language.
That’s the point I was trying to make, though apparently not clearly. In order to get rid of the one-time O(n) cost of a rev, you need to make sure you do them lazily, and only at need, but also sufficiently often that you have enough operations over which to amortize their cost.

One way to interpret Okasaki’s algorithm is to ensure that, each time you deq, you take a step of appending the reverse of the tails to the heads, and of course you memoize the results so that you retain the persistence (if you only care about ephemeral queues, this is somewhat easier). The trick is to make sure you don’t wait “too long,” which you can do by starting a new rev/append as soon as the tails list is one position longer than the heads. That’s not the only solution, but it makes the analysis work out nicely.

Ah, Haskell… Making easy things hard and (some) hard things easy since 1996™.
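For the curious, here's one way the incremental rev/append idea discussed above works out in code. This is a sketch of the real-time queue from Okasaki's book (the function names are mine); the third list is a "schedule" whose length tracks how much of a pending rotation remains, and each operation forces one step of it:

```haskell
-- Real-time queue (after Okasaki): front list, rear list,
-- and a "schedule" forcing one rotation step per operation.
data RTQueue a = RTQ [a] [a] [a]

emptyRTQ :: RTQueue a
emptyRTQ = RTQ [] [] []

-- Incrementally compute front ++ reverse rear, one cons at a
-- time. Only called when rear is exactly one longer than front.
rotate :: [a] -> [a] -> [a] -> [a]
rotate []     (y:_)  zs = y : zs
rotate (x:xs) (y:ys) zs = x : rotate xs ys (y : zs)
rotate _      _      _  = error "rotate: broken invariant"

-- Force one step of the schedule; when it runs out, the rear
-- has outgrown the front, so start a new rotation.
exec :: [a] -> [a] -> [a] -> RTQueue a
exec xs ys (_:ss) = RTQ xs ys ss
exec xs ys []     = let xs' = rotate xs ys [] in RTQ xs' [] xs'

enqRT :: RTQueue a -> a -> RTQueue a
enqRT (RTQ xs ys ss) y = exec xs (y : ys) ss

deqRT :: RTQueue a -> (a, RTQueue a)
deqRT (RTQ (x:xs) ys ss) = (x, exec xs ys ss)
deqRT _                  = error "deqRT: empty queue"
```

Every `enqRT` and `deqRT` does O(1) work plus one forced step of any in-flight rotation, which is how the amortized bound becomes a worst-case one.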
If Haskell requires such convoluted acrobatics for a basic data type, how could anyone take it seriously?
Well, as with all things, it’s a tradeoff: More hassle for some problems, less hassle for others.
I could easily implement a conventional queue in Haskell, using the IO monad. But purely functional data structures offer a potentially valuable feature: Cheap undo.
I once wrote a set of C++ classes which supported a “undo” function—you could revert to any earlier version of the data structure. All the application’s data was stored in these structures, so implementing application-level “undo” was a snap. But as you can imagine, the C++ classes in question weren’t pretty—they had to support the ability to back out any modification.
If I had used a purely functional data structure, saving previous states would have been trivial—just hold onto a reference.
Cheap “undo” has important consequences in other real-world problems, too. Consider a Sudoku solver: Every time we’re forced to make a guess, we need to save the state of our program. If we encounter a dead-end, we can backtrack to the saved state.
Another good example is a compiler’s symbol table. In a language with lambdas, we may have functions several layers deep, each with its own lexical scope. Every time we leave a nested function, we need to revert to an earlier version of the symbol table.
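As an illustration of the symbol-table case: Data.Map is persistent in exactly this sense, so "reverting" a scope is just holding onto the old map. (The names here are made up for the example.)

```haskell
import qualified Data.Map as Map

type SymTab = Map.Map String Int

-- Entering a nested scope extends the current table; the
-- outer table is never modified.
inner :: SymTab -> SymTab
inner = Map.insert "x" 2

main :: IO ()
main = do
  let outer  = Map.fromList [("x", 1), ("y", 10)]
      nested = inner outer
  print (Map.lookup "x" nested)  -- Just 2: inner binding shadows
  print (Map.lookup "x" outer)   -- Just 1: outer scope untouched
```

Leaving the nested function means simply going back to using `outer`; no explicit undo log is needed.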
So these are a few examples of when a purely functional data structure might simplify an application considerably. And in many cases, the price of using a purely functional data structure isn’t that high: This example requires only 10 lines of Haskell, which compares quite nicely to a conventional (templated) queue in C++.
Are tail and head too complex for use or what?
kumma:
head and tail are easy enough (and fast enough, too). But to implement a queue, you also need some sort of “enqueue” operation. And that can get really slow, if you’re not careful.

Specifically, consider the following implementation:
data Queue a = Queue [a]
newQueue = Queue []
deq (Queue []) = error "Empty queue"
deq (Queue (x:xs)) = (x, Queue xs)
enq (Queue xs) x = Queue (xs ++ [x])
Here, deq uses head and tail (in the form of the (x:xs) pattern match), and it runs in constant time, no matter how big the queue.

But enq uses ++, which is seriously slow, because it makes a complete copy of xs every time we add an item to the queue. If we keep (say) an average of 1,000 items in the queue, that means we need to copy a 1,000-item list every time we call enq.

The alternate implementation you see above doesn’t have this problem. In our example, it can enq an item in a single operation 999 times out of 1,000, and only reallocate the entire list 1 time in 1,000, instead of every single time. On average, that works out to O(1) time. For 10 lines of Haskell, it’s actually a pretty decent queue.

What I don’t get about the implementation is, why use singly linked lists in the first place?
Wouldn’t it be considerably easier to use a doubly linked list and have constant enq and deq times from the beginning?
Mike: Remember, we’re trying to build a purely functional queue, so we can’t modify any links once we’ve created them. So unless you do something fairly clever with lazy evaluation, you can’t set up the double links.