Three things I don't understand about monads

Posted by Eric Kidd Mon, 05 Mar 2007 09:32:00 GMT

Monads are a remarkably powerful tool for building specialized programming languages, such as the probability language developed in the posts below.

But there’s a bunch of things I don’t understand about monads. In each case, my confusion involves some aspect of the underlying math that “bubbles up” to affect the design of specialized languages.

(Warning: Obscure monad geeking ahead.)

Commutative monads

A “commutative monad” is any monad where we can replace the expression:

do a <- ma
   b <- mb
   f a b

…with:

do b <- mb
   a <- ma
   f a b

…without changing the meaning. Examples of commutative monads include Reader and Rand. This is an important property, because it might allow us to parallelize the commonly-used sequence function across huge numbers of processors:

sequence :: (Monad m) => [m a] -> m [a]
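To make the property concrete, here’s a tiny sketch (my example, not from the original post) using the Reader monad from the mtl library: two reads of the environment can be swapped without changing the answer.

import Control.Monad.Reader

-- Two computations that read independent parts of the environment.
-- Because Reader is a commutative monad, swapping the two binds
-- cannot change the result.
sumFirst :: Reader (Int, Int) Int
sumFirst = do
  a <- asks fst
  b <- asks snd
  return (a + b)

sumSecond :: Reader (Int, Int) Int
sumSecond = do
  b <- asks snd
  a <- asks fst
  return (a + b)

-- runReader sumFirst (3, 4) == runReader sumSecond (3, 4)  -- both are 7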

Simon Peyton Jones lists this problem as Open Challenge #2, saying:

Commutative monads are very common. (Environment, unique supply, random number generation.) For these, monads over-sequentialise.

Wanted: theory and notation for some cool compromise.

Commutative monad morphisms

Monad morphisms are the category-theoretic counterpart of Haskell’s monad transformers: a monad transformer layers one monad on top of another, and that layering corresponds to a monad morphism in category theory.

Many complicated monads break down into a handful of monad transformers, often in surprising ways.

But composing monad transformers is a mess, because they interact in poorly-understood ways. In general, the following two types have very different semantics:

FooT (BarT m)
BarT (FooT m)

If FooT and BarT commute with each other, however, the two types are equivalent. This is helpful when building large stacks of monad transformers.
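To make the non-commuting case concrete, here’s a sketch in modern transformer vocabulary (my own illustration; FooT and BarT above are placeholders): stacking State and Except in the two possible orders yields observably different semantics.

import Control.Monad.Trans.Except (ExceptT)
import Control.Monad.Trans.State (StateT)
import Data.Functor.Identity (Identity)

-- The same two transformers, stacked in opposite orders.
type StateOutside s e a = StateT s (ExceptT e Identity) a
  -- unwraps to roughly: s -> Either e (a, s)   (an error discards the state)

type ExceptOutside s e a = ExceptT e (StateT s Identity) a
  -- unwraps to roughly: s -> (Either e a, s)   (the state survives an error)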

Chung-chieh Shan encountered a related problem when applying monad morphisms to build a theory of natural language semantics:

It remains to be seen whether monads would provide the appropriate conceptual encapsulation for a semantic theory with broader coverage. In particular, for both natural and programming language semantics, combining monads—or perhaps monad-like objects—remains an open issue that promises additional insight.

Monad morphisms and abstract algebra

Dan Piponi has been drawing some fascinating connections between monad morphisms and abstract algebra in a pair of recent posts.

This approach seems to throw a lot of light on monad morphisms—but at least in my case, the light only highlights my confusion.

Of the three problems listed here, this is the one most likely to be discussed in a textbook somewhere. And a solution to this problem would likely help significantly with the other two.

So, my question: Does anybody have any books, papers or ideas that might help untangle this mess?

Update: Be sure to see the comment thread on the second Dan Piponi post above and Chung-chieh Shan’s excellent bibliography on monad transformers.


Smart classification using Bayesian monads in Haskell

Posted by Eric Kidd Sat, 03 Mar 2007 09:02:00 GMT

(Refactoring Probability Distributions: part 1, part 2, part 3, part 4)

The world is full of messy classification problems:

  • “Is this order fraudulent?”
  • “Is this e-mail a spam?”
  • “What blog posts would Rachel find interesting?”
  • “Which intranet documents is Sam looking for?”

In each case, we want to classify something: Orders are either valid or fraudulent, messages are either spam or non-spam, blog posts are either interesting or boring. Unfortunately, most software is terrible at making these distinctions. For example, why can’t my RSS reader go out and track down the 10 most interesting blog posts every day?

Some software, however, can make these distinctions. Google figures out when I want to watch a movie, and shows me specialized search results. And most e-mail clients can identify spam with over 99% accuracy. But the vast majority of software is dumb, incapable of dealing with the messy dilemmas posed by the real world.

So where can we learn to improve our software?

Outside of Google’s shroud of secrecy, the most successful classifiers are spam filters. And most modern spam filters are inspired by Paul Graham’s essay A Plan for Spam.

So let’s go back to the source, and see what we can learn. As it turns out, we can formulate a lot of the ideas in A Plan for Spam in a straightforward fashion using a Bayesian monad.

Functions from distributions to distributions

Let’s begin with spam filtering. By convention, we divide messages into “spam” and “ham”, where “ham” is the stuff we want to read.

data MsgType = Spam | Ham
  deriving (Show, Eq, Enum, Bounded)

Let’s assume that we’ve just received a new e-mail. Without even looking at it, we know there’s a certain chance that it’s a spam. This gives us something called a “prior distribution” over MsgType.

> bayes msgTypePrior
[Perhaps Spam 64.2%, Perhaps Ham 35.8%]
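The definition of msgTypePrior isn’t shown in this excerpt, but a prior like this can be built directly from training counts. Here’s one plausible sketch (the counts are invented, and it assumes FDist' has an instance of the Dist class from part 2):

-- Hypothetical sketch: weight each message type by how many
-- training messages of that kind we've seen.
msgTypePrior :: FDist' MsgType
msgTypePrior = weighted [(Spam, spamCount), (Ham, hamCount)]
  where spamCount = 103  -- made-up number of spams in the training corpus
        hamCount  = 57   -- made-up number of hams in the training corpus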

But what if we know that the first word of the message is “free”? We can use that information to calculate a new distribution.

> bayes (hasWord "free" msgTypePrior)
[Perhaps Spam 90.5%, Perhaps Ham 9.5%]

The function hasWord takes a string and a probability distribution, and uses them to calculate a new probability distribution:

hasWord :: String -> FDist' MsgType ->
           FDist' MsgType
hasWord word prior = do
  msgType <- prior
  wordPresent <-
    wordPresentDist msgType word
  condition wordPresent
  return msgType

This code is based on the Bayesian monad from part 3. As before, the “<-” operator selects a single item from a probability distribution, and “condition” asserts that an expression is true. The actual Bayesian inference happens behind the scenes (handy, that).
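This excerpt doesn’t show how condition is implemented; assuming FDist' carries a MaybeT-style failure layer (in the spirit of the PerhapsT transformer from part 1), a minimal hypothetical sketch might look like this:

-- Hypothetical sketch, assuming FDist' has a MonadPlus instance:
-- conditioning keeps the branches of the computation where the
-- evidence holds and discards the rest; bayes then renormalizes
-- over the branches that survive.
condition :: Bool -> FDist' ()
condition True  = return ()
condition False = mzero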

If we have multiple pieces of evidence, we can apply them one at a time. Each piece of evidence will update the probability distribution produced by the previous step:

hasWords :: [String] -> FDist' MsgType -> FDist' MsgType
hasWords []     prior = prior
hasWords (w:ws) prior =
  hasWord w (hasWords ws prior)
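Since each step simply feeds the previous distribution into the next, the same function can also be written as a plain fold (an equivalent formulation, shown only for illustration):

hasWords' :: [String] -> FDist' MsgType -> FDist' MsgType
hasWords' ws prior = foldr hasWord prior ws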

The final distribution will combine everything we know:

> bayes (hasWords ["free","bayes"] msgTypePrior)
[Perhaps Spam 34.7%, Perhaps Ham 65.3%]

This technique is known as the naive Bayes classifier. Looked at from the right angle, it’s surprisingly simple.

(Of course, the naive Bayes classifier assumes that all of our evidence is independent. In theory, this is a pretty big assumption. In practice, it works better than you might think.)

But this still leaves us with a lot of questions: How do we keep track of our different classifiers? How do we decide which ones to apply? And do we need to fudge the numbers to get reasonable results?

In the following sections, I’ll walk through various aspects of Paul Graham’s A Plan for Spam, and show how to generalize it. If you want to follow along, you can download the code using Darcs:

darcs get http://www.randomhacks.net/darcs/probability


Bayes' rule in Haskell, or why drug tests don't work

Posted by Eric Kidd Thu, 22 Feb 2007 18:11:00 GMT

Part 3 of Refactoring Probability Distributions.
(Part 1: PerhapsT, Part 2: Sampling functions)

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said. -Joel Spolsky

I really love this quote, because it’s insanely provocative to any language designer. What would a programming language look like if Bayes’ rule were as simple as an if statement?

Let’s start with a toy problem, and refactor it until Bayes’ rule is baked right into our programming language.

Imagine, for a moment, that we’re in charge of administering drug tests for a small business. We’ll represent each employee’s test results (and drug use) as follows:

data Test = Pos | Neg
  deriving (Show, Eq)

data HeroinStatus = User | Clean
  deriving (Show, Eq)

Assuming that 0.1% of our employees have used heroin recently, and that our test is 99% accurate, we can model the testing process as follows:

drugTest1 :: Dist d => d (HeroinStatus, Test)
drugTest1 = do
  heroinStatus <- percentUser 0.1
  testResult <-
    if heroinStatus == User
      then percentPos 99
      else percentPos 1
  return (heroinStatus, testResult)

-- Some handy distributions.
percentUser p = percent p User Clean
percentPos p = percent p Pos Neg

-- A weighted distribution with two elements.
percent p x1 x2 =
  weighted [(x1, p), (x2, 100-p)]

This code is based on our FDist monad, which is in turn based on PFP. Don’t worry if it seems slightly mysterious; you can think of the “<-” operator as choosing an element from a probability distribution.

Running our drug test shows every possible combination of the two variables:

> exact drugTest1
[Perhaps (User,Pos) 0.1%,
 Perhaps (User,Neg) 0.0%,
 Perhaps (Clean,Pos) 1.0%,
 Perhaps (Clean,Neg) 98.9%]

If you look carefully, we have a problem. Most of the employees who test positive are actually clean! Let’s tweak our code a bit, and try to zoom in on the positive test results.
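Before tweaking the code, it’s worth checking the arithmetic by hand (a quick sanity check, separate from the library code): Bayes’ rule confirms that a positive result is usually a false alarm.

-- Of all employees, 0.099% are users who test positive, while 0.999%
-- are clean employees who test positive. So a positive result means
-- "user" only about 9% of the time.
pUserGivenPos :: Double
pUserGivenPos = pPosUser / (pPosUser + pPosClean)
  where pPosUser  = 0.001 * 0.99   -- P(User)  * P(Pos | User)
        pPosClean = 0.999 * 0.01   -- P(Clean) * P(Pos | Clean)
-- pUserGivenPos is roughly 0.09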



Refactoring probability distributions, part 2: Random sampling

Posted by Eric Kidd Wed, 21 Feb 2007 23:53:00 GMT

In Part 1, we cloned PFP, a library for computing with probability distributions. PFP represents a distribution as a list of possible values, each with an associated probability.

But in the real world, things aren’t always so easy. What if we wanted to pick a random number between 0 and 1? Our previous implementation would break, because there’s an infinite number of values between 0 and 1—they don’t exactly fit in a list.

As it turns out, Sungwoo Park and colleagues found an elegant solution to this problem. They represented probability distributions as sampling functions, resulting in something called the λ⊙ calculus. (I have no idea how to pronounce this!)

With a little bit of hacking, we can use their sampling functions as a drop-in replacement for PFP.

A common interface

Since we will soon have two ways to represent probability distributions, we need to define a common interface.

type Weight = Float

class (Functor d, Monad d) => Dist d where
  weighted :: [(a, Weight)] -> d a

uniform :: Dist d => [a] -> d a
uniform = weighted . map (\x -> (x, 1))

The function uniform will create an equally-weighted distribution from a list of values. Using this API, we can represent a two-child family as follows:

data Child = Girl | Boy
  deriving (Show, Eq, Ord)

child :: Dist d => d Child
child = uniform [Girl, Boy]

family :: Dist d => d [Child]
family = do
  child1 <- child
  child2 <- child
  return [child1, child2]

Now, we need to implement this API two different ways: Once with lists, and a second time with sampling functions.
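To give a flavor of the second implementation, here’s a rough sketch of the sampling-function idea (my own approximation, not necessarily the code from the paper or from the rest of this post): a distribution becomes a function from a random-number generator to one sample plus an updated generator.

import System.Random (StdGen, randomR)

-- A distribution is a sampling function: feed it a generator and get
-- back a single sample, along with the generator to use next time.
newtype Sampler a = Sampler { sample :: StdGen -> (a, StdGen) }

-- Draw from a weighted list by picking a point in [0, total weight)
-- and walking the list until the running weight passes it.
weightedSample :: [(a, Weight)] -> Sampler a
weightedSample xws = Sampler pick
  where
    total = sum (map snd xws)
    pick g = let (r, g') = randomR (0, total) g
             in (walk r xws, g')
    walk _ [(x, _)] = x
    walk r ((x, w) : rest)
      | r <= w    = x
      | otherwise = walk (r - w) rest
    walk _ [] = error "weightedSample: empty list"

Making Sampler a real instance of Dist would additionally require Functor, Applicative, and Monad instances that thread the generator from one step to the next.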



Refactoring probability distributions, part 1: PerhapsT

Posted by Eric Kidd Wed, 21 Feb 2007 08:06:00 GMT

(Warning: This article is a bit more technical than most of my stuff. It assumes prior knowledge of monads and monad transformers.)

Martin Erwig and Steve Kollmansberger wrote PFP, a really sweet Haskell library for computing with probabilities. To borrow their example:

die :: Dist Int
die = uniform [1..6]

If we roll a die, we get the expected distribution of results:

> die
1  16.7%
2  16.7%
3  16.7%
4  16.7%
5  16.7%
6  16.7%

If you haven’t seen PFP before, I strongly encourage you to check it out. You can use it to solve all sorts of probability puzzles.

Anyway, I discovered an interesting way to implement PFP using monad transformers. Here’s what it looks like:

type Dist = PerhapsT ([])

uniform = weighted . map (\x -> (x, 1))

In other words, Dist can be written by adding some semantics to the standard list monad.

Perhaps: A less specific version of Maybe

First, let’s define a simple probability type:

newtype Prob = P Float
  deriving (Eq, Ord, Num)

instance Show Prob where
  show (P p) = show intPart ++ "." ++ show fracPart ++ "%"
    where digits = round (1000 * p)
          intPart = digits `div` 10
          fracPart = digits `mod` 10

Thanks to the derived Num instance (which requires GHC’s GeneralizedNewtypeDeriving extension), we can treat Prob like any other numeric type.

We can now define Perhaps, which represents a value with an associated probability:

data Perhaps a = Perhaps a Prob
  deriving (Show)

Now, this is just a generalization of Haskell’s built-in Maybe type, which treats a value as either present (probability 1) or absent (probability 0). All we’ve added is a range of possibilities in between: Perhaps x 0.5 represents a 50% chance of having a value.

Note that there’s one small trick here: when the probability of a value is 0, we may not actually know what that value is! But because Haskell is a lazy language, we can write:

Perhaps undefined 0

We’ll need a convenient way to test for this case, to make sure we don’t try to use any undefined values:

neverHappens (Perhaps _ 0) = True
neverHappens _             = False

So Perhaps is just like Maybe. As it turns out, they’re both monads, and they both have an associated monad transformer.
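The transformer itself appears later in the post, but by analogy with MaybeT, a sketch of PerhapsT might look roughly like this (my approximation; the post’s own definition may differ in details):

import Control.Monad (ap, liftM)

newtype PerhapsT m a = PerhapsT { runPerhapsT :: m (Perhaps a) }

instance Monad m => Functor (PerhapsT m) where
  fmap = liftM

instance Monad m => Applicative (PerhapsT m) where
  pure x = PerhapsT (return (Perhaps x 1))
  (<*>)  = ap

instance Monad m => Monad (PerhapsT m) where
  return = pure
  m >>= f = PerhapsT $ do
    ph@(Perhaps x p) <- runPerhapsT m
    if neverHappens ph
      then return (Perhaps undefined 0)      -- dead branch: never run f
      else do
        Perhaps y q <- runPerhapsT (f x)
        return (Perhaps y (p * q))           -- probabilities multiply

The key point is the last line: as a computation runs, the probabilities of the individual steps multiply together, which is exactly the extra semantics we layer on top of the underlying list monad.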



Probabilistic Functional Programming is cool

Posted by Eric Kidd Sat, 10 Feb 2007 16:14:00 GMT

Syntaxfree is hacking on Martin Erwig’s probability monad. This is one of the coolest monads out there—it allows you to trivially solve all kinds of probability problems.

Mikael Johansson has a good example.

I hope to write a bit more about probability monads soon. There’s already a long post sitting on my hard drive, and some more ideas that I’m still trying to puzzle out.

In the meantime, I’d like to recommend The Haskell Road to Logic, Maths and Programming. There’s an excellent review available.


Map fusion: Making Haskell 225% faster

Posted by Eric Kidd Sat, 10 Feb 2007 09:55:00 GMT

Or, how to optimize MapReduce, and when folds are faster than loops

Purely functional programming might actually be worth the pain, if you care about large-scale optimization.

Lately, I’ve been studying how to speed up parallel algorithms. Many parallel algorithms, such as Google’s MapReduce, have two parts:

  1. First, you transform the data by mapping one or more functions over each value.
  2. Next, you repeatedly merge the transformed data, “reducing” it down to a final result.

Unfortunately, there are a couple of nasty performance problems lurking here. We really want to combine all those steps into a single pass, so that we can eliminate temporary working data. But we don’t always want to do this optimization by hand—it would be better if the compiler could do it for us.

As it turns out, Haskell is an amazing testbed for this kind of optimization. Let’s build a simple model, show where it breaks, and then crank the performance way up.

Trees, and the performance problems they cause

We’ll use single-threaded trees for our testbed. They’re simple enough to demonstrate the basic idea, and they can be generalized to parallel systems. (If you want to know how, check out the papers at the end of this article.)

A tree is either empty, or it is a node with a left child, a value and a right child:

data Tree a = Empty
            | Node (Tree a) a (Tree a)
  deriving (Show)

Here’s a sample tree containing three values:

tree = (Node left 2 right)
  where left  = (Node Empty 1 Empty)
        right = (Node Empty 3 Empty)

We can use treeMap to apply a function to every value in a tree, creating a new tree:

treeMap :: (a -> b) -> Tree a -> Tree b

treeMap f Empty = Empty
treeMap f (Node l x r) =
  Node (treeMap f l) (f x) (treeMap f r)

Using treeMap, we can build various functions that manipulate trees:

-- Double each value in a tree.
treeDouble tree = treeMap (*2) tree

-- Add one to each value in a tree.
treeIncr tree   = treeMap (+1) tree

What if we want to add up all the values in a tree? Well, we could write a simple recursive sum function:

treeSum Empty = 0
treeSum (Node l x r) =
  treeSum l + x + treeSum r

But for reasons that will soon become clear, it’s much better to refactor the recursive part of treeSum into a reusable treeFold function (“fold” is Haskell’s name for “reduce”):

treeFold f b Empty = b
treeFold f b (Node l x r) =
  f (treeFold f b l) x (treeFold f b r)

treeSum t = treeFold (\l x r -> l+x+r) 0 t
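As a quick illustration of how reusable treeFold is, treeMap itself can be re-expressed as a fold (an equivalent definition, shown only for illustration):

-- Rebuild the tree node by node, applying f to each value.
treeMap' :: (a -> b) -> Tree a -> Tree b
treeMap' f = treeFold (\l x r -> Node l (f x) r) Empty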

Now we can double all the values in a tree, add 1 to each, and sum up the result:

treeSum (treeIncr (treeDouble tree))

But there’s a very serious problem with this code. Imagine that we’re working with a million-node tree. The two calls to treeMap (buried inside treeIncr and treeDouble) will each create a new million-node tree. Obviously, this will kill our performance, and it will make our garbage collector cry.

Fortunately, we can do a lot better than this, thanks to some funky GHC extensions.
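The details are in the rest of the post, but to give the flavor: GHC’s RULES pragma lets us tell the compiler that two consecutive traversals may be fused into one. Here is a minimal sketch of such a fusion rule (the post’s actual rules may differ):

{-# RULES
"treeMap/treeMap" forall f g t. treeMap f (treeMap g t) = treeMap (f . g) t
  #-}

-- With this rule in scope (and optimization turned on), GHC can rewrite
-- treeIncr (treeDouble tree) into a single treeMap pass, so only one
-- intermediate tree is built instead of two.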



The first Carnival of Mathematics

Posted by Eric Kidd Fri, 09 Feb 2007 09:25:00 GMT

A blog carnival is a summary of recent posts on an interesting topic. A typical carnival appears weekly or monthly, and moves from blog to blog with each edition.

If you’re into math, go check out the first Carnival of Mathematics. There’s something for everyone:

I’ve gotten plenty of submissions that span the entire gamut of math-blogging: education, pure math, applied math, debunking bad math - it’s all there. Only the gender distribution could be made slightly more equal (and that’s an understatement). I’m linking to the posters in roughly increasing order of mathematical difficulty, but don’t let my opinions deter you from reading the posts closer to the bottom.

Why study math? Even the most abstract math has an uncanny tendency to solve unexpected problems:

[T]he enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and [there] is no rational explanation for it.

And since I’ve started hacking on Haskell, I’ve been equally surprised at how often deep math can solve real-world programming problems.


Haskell: Queues without pointers

Posted by Eric Kidd Thu, 08 Feb 2007 20:01:00 GMT

Haskell has been tying my brain in knots. Sure, it keeps teaching me all sorts of amazing things, but it’s also forcing me to relearn the basics.

Right now, I’m trying to implement simple data structures in Haskell. It’s challenging, because most typical data structures are based on updating pointers, and Haskell doesn’t allow that. (Well, I could cheat, and use the IO monad, but where’s the fun in that?)

Fortunately, Chris Okasaki has done some amazing research into Haskell (and ML) data structures. He wrote an excellent thesis, which was later published as Purely Functional Data Structures. It’s a comprehensive book, well-written, and highly recommended for anyone programming in a functional language.

Let’s take a look at a data structure from Okasaki’s book: A purely functional queue. It’s competitive with a traditional queue (on average, at least), but it doesn’t use any pointers.

The queue API

A queue is a list of values to process. We can add values to the back of the queue, and retrieve values from the front.

In Haskell, we often begin by writing down the types of our functions. If we’re lucky, the code will fill itself in almost automatically. The first part of our API is easy:

-- A queue holding values of type 'a'.
data Queue a = ???

-- Create a new queue.
newQueue :: Queue a

-- Test whether a queue is empty.
empty :: Queue a -> Bool

The next part is a bit trickier. Since we can’t update our actual Queue values, our data structure is read-only! We can work around this by making enq and deq return an updated version of the Queue:

-- Add an item to the back of the queue,
-- returning the updated queue.
enq :: Queue a -> a -> Queue a

-- Remove an item from the front of the
-- queue, returning the item and the
-- updated queue.
deq :: Queue a -> (a, Queue a)

We could use this API as follows:

q = newQueue `enq` 1 `enq` 2 `enq` 3
(x,q')  = deq q  -- x is 1
(y,q'') = deq q' -- y is 2

Now, the tricky bit: Making it run quickly.
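The trick, from Okasaki’s book, is to represent the queue as two lists: a front list that we deq from, and a reversed back list that we enq onto. Here’s a minimal sketch filling in the API above (my own fill-in; Okasaki’s refined versions add invariants that this simple one skips):

-- Front of the queue, plus the back of the queue stored in reverse.
data Queue a = Queue [a] [a]

newQueue :: Queue a
newQueue = Queue [] []

empty :: Queue a -> Bool
empty (Queue [] []) = True
empty _             = False

-- Adding to the back is a cheap cons onto the reversed back list.
enq :: Queue a -> a -> Queue a
enq (Queue front back) x = Queue front (x : back)

-- Removing from the front is cheap until the front list runs dry;
-- then we pay once to reverse the back list, which keeps the
-- average cost per operation constant.
deq :: Queue a -> (a, Queue a)
deq (Queue (x : front) back) = (x, Queue front back)
deq (Queue [] back) =
  case reverse back of
    (x : front) -> (x, Queue front [])
    []          -> error "deq: empty queue"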



Do early adopters use IE?

Posted by Eric Kidd Wed, 07 Feb 2007 19:16:00 GMT

I saw some interesting statistics in my web server logs last night:

Windows: 51%
MacOS X: 23%
Linux: 19%

Over 40% of Random Hacks readers use MacOS X or Linux! This is a bit surprising, because Windows has roughly a 95% market share, and desktop Linux has well under 1%.

The browser statistics are even more skewed:

Firefox: 62% (4/5ths Firefox 2.0!)
IE: 10%
Safari: 8%
Mozilla: 5%
Opera: 4%

IE’s market share is only slightly higher than Safari’s!

What’s going on here, and how it relates to selling new technologies

This website contains articles about unusual programming languages, including Ruby, Lisp and Haskell. And anybody who uses one of those languages is likely to be an “innovator” or an “early adopter,” as described by Geoffrey A. Moore:

Innovators pursue new technology products aggressively… At root, they are intrigued with any fundamental advance…
Early adopters… are people who find it easy to imagine, understand, and appreciate the benefits of a new technology, and to relate those benefits to their other concerns… Because early adopters do not rely on well-established references in making these buying decisions, preferring instead to rely on their own intuition and vision, they are key to opening up any high-tech market segment.

Nearly every technology startup hopes to sell its products to early adopters. After all, early adopters are actually willing to try new things.

But if you’re targeting early adopters, you can’t necessarily assume that IE is the dominant browser, or that only a few people run MacOS X. Because early adopters make unusual technology choices, they live in a technologically diverse world.

And if you want them to use your product, you’ll need to take that into account.

(Rewritten to quote Geoffrey Moore.)

