## Smart classification using Bayesian monads in Haskell

Posted by Eric Kidd Sat, 03 Mar 2007 09:02:00 GMT

(Refactoring Probability Distributions: part 1, part 2,
part 3, **part 4**)

The world is full of messy classification problems:

- “Is this order fraudulent?”
- “Is this e-mail a spam?”
- “What blog posts would Rachel find interesting?”
- “Which intranet documents is Sam looking for?”

In each case, we want to classify something: Orders are either valid or
fraudulent, messages are either spam or non-spam, blog posts are either
interesting or boring. Unfortunately, most software is *terrible* at
making these distinctions. For example, why can’t my RSS reader go out and
track down the 10 most interesting blog posts every day?

Some software, however, *can* make these distinctions.
Google figures out when I want to watch a movie, and shows me specialized
search results. And most e-mail clients can identify spam with over
99% accuracy. But the vast majority of software is dumb, incapable of
dealing with the messy dilemmas posed by the real world.

So where can we learn to improve our software?

Outside of Google’s shroud of secrecy, the most successful classifiers are spam filters. And most modern spam filters are inspired by Paul Graham’s essay A Plan for Spam.

So let’s go back to the source, and see what we can learn. As it turns out, we can formulate a lot of the ideas in A Plan for Spam in a straightforward fashion using a Bayesian monad.

### Functions from distributions to distributions

Let’s begin with spam filtering. By convention, we divide messages into “spam” and “ham”, where “ham” is the stuff we want to read.

```
data MsgType = Spam | Ham
    deriving (Show, Eq, Enum, Bounded)
```

Let’s assume that we’ve just received a new e-mail. Without even looking
at it, we know there’s a certain chance that it’s a spam. This gives us
something called a “prior distribution” over `MsgType`.

```
> bayes msgTypePrior
[Perhaps Spam 64.2%, Perhaps Ham 35.8%]
```
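If you don’t have the part 3 code handy, here’s a minimal, self-contained sketch of the machinery these examples assume. The real code (available via Darcs at the end of this post) uses a `MaybeT` transformer and a `Perhaps` type; a plain weighted list is enough to follow along. The prior’s weights here are chosen to match the sample output above, not taken from real training data, and `MsgType` is repeated so the sketch compiles on its own.

```haskell
-- A stripped-down stand-in for the FDist' monad from part 3:
-- a distribution is a list of (value, weight) pairs.
newtype FDist' a = FDist' { runFDist' :: [(a, Double)] }

instance Functor FDist' where
  fmap f (FDist' xs) = FDist' [(f x, p) | (x, p) <- xs]

instance Applicative FDist' where
  pure x = FDist' [(x, 1)]
  FDist' fs <*> FDist' xs =
    FDist' [(f x, p * q) | (f, p) <- fs, (x, q) <- xs]

instance Monad FDist' where
  FDist' xs >>= f =
    FDist' [(y, p * q) | (x, p) <- xs, (y, q) <- runFDist' (f x)]

-- Keep only the branches where the evidence holds.
condition :: Bool -> FDist' ()
condition True  = FDist' [((), 1)]
condition False = FDist' []

-- Renormalize the surviving weights into a posterior.
bayes :: FDist' a -> [(a, Double)]
bayes (FDist' xs) = [(x, p / total) | (x, p) <- xs]
  where total = sum (map snd xs)

data MsgType = Spam | Ham
    deriving (Show, Eq, Enum, Bounded)

-- Hypothetical weights, matching the sample output above.
msgTypePrior :: FDist' MsgType
msgTypePrior = FDist' [(Spam, 0.642), (Ham, 0.358)]
```

The pretty-printed `Perhaps Spam 64.2%` output comes from the real library’s `Perhaps` type; this sketch just returns raw pairs.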

But what if we know that the first word of the message is “free”? We can use that information to calculate a new distribution.

```
> bayes (hasWord "free" msgTypePrior)
[Perhaps Spam 90.5%, Perhaps Ham 9.5%]
```

The function `hasWord` takes a string and a probability distribution, and
uses them to calculate a new probability distribution:

```
hasWord :: String -> FDist' MsgType -> FDist' MsgType
hasWord word prior = do
    msgType <- prior
    wordPresent <- wordPresentDist msgType word
    condition wordPresent
    return msgType
```

This code is based on the Bayesian monad from part 3. As before, the
“`<-`” operator selects a single item from a probability distribution, and
“`condition`” asserts that an expression is true. The actual Bayesian
inference happens behind the scenes (handy, that).
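The definition of `wordPresentDist` isn’t shown at this point in the post. A plausible sketch, assuming per-word training counts in the style of A Plan for Spam and a Bernoulli helper `boolDist` (True with probability `p`) like the one from earlier in this series, might look like this. All of the counts here are invented for illustration:

```haskell
-- Invented training data: how many messages of each type we have
-- seen, and how many of those contained a given word.
spamMsgs, hamMsgs :: Double
spamMsgs = 100
hamMsgs  = 100

spamCount, hamCount :: String -> Double
spamCount "free" = 90  -- "free" appeared in 90 of 100 spams...
spamCount _      = 10
hamCount "free"  = 5   -- ...but in only 5 of 100 hams
hamCount _       = 50

-- Whether `word` appears in a message of the given type,
-- estimated from the counts above.
wordPresentDist :: MsgType -> String -> FDist' Bool
wordPresentDist Spam word = boolDist (spamCount word / spamMsgs)
wordPresentDist Ham  word = boolDist (hamCount word / hamMsgs)
```

In a real filter these counts would come from a corpus of classified messages, not from hard-coded tables.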

If we have multiple pieces of evidence, we can apply them one at a time. Each piece of evidence will update the probability distribution produced by the previous step:

```
hasWords :: [String] -> FDist' MsgType -> FDist' MsgType
hasWords []     prior = prior
hasWords (w:ws) prior = hasWord w (hasWords ws prior)
```
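Incidentally, this recursion is just a right fold over the word list, so an equivalent definition using the standard `foldr` would be:

```haskell
hasWords :: [String] -> FDist' MsgType -> FDist' MsgType
hasWords ws prior = foldr hasWord prior ws
```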

The final distribution will combine everything we know:

```
> bayes (hasWords ["free","bayes"] msgTypePrior)
[Perhaps Spam 34.7%, Perhaps Ham 65.3%]
```

This technique is known as the naive Bayes classifier. Looked at from the right angle, it’s surprisingly simple.

(Of course, the naive Bayes classifier assumes that all of our evidence is independent. In theory, this is a pretty big assumption. In practice, it works better than you might think.)
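Concretely, each `hasWord` step multiplies every hypothesis’s weight by the likelihood of the evidence under that hypothesis, and `bayes` renormalizes at the end. Here’s the arithmetic for a single word, with invented numbers:

```haskell
-- Bayes' rule by hand, with invented numbers:
--   P(Spam) = 0.5,  P("free" | Spam) = 0.8
--   P(Ham)  = 0.5,  P("free" | Ham)  = 0.1
posteriorSpam :: Double
posteriorSpam = spamWeight / (spamWeight + hamWeight)
  where
    spamWeight = 0.5 * 0.8  -- prior * likelihood = 0.40
    hamWeight  = 0.5 * 0.1  -- prior * likelihood = 0.05
-- posteriorSpam = 0.40 / 0.45, roughly 88.9%
```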

But this still leaves us with a lot of questions: How do we keep track of our different classifiers? How do we decide which ones to apply? And do we need to fudge the numbers to get reasonable results?

In the following sections, I’ll walk through various aspects of Paul Graham’s A Plan for Spam, and show how to generalize it. If you want to follow along, you can download the code using Darcs:

`darcs get http://www.randomhacks.net/darcs/probability`