Bayes' rule in Haskell, or why drug tests don't work

Posted by Eric Kidd Thu, 22 Feb 2007 18:11:00 GMT

Part 3 of Refactoring Probability Distributions.
(Part 1: PerhapsT, Part 2: Sampling functions)

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said. -Joel Spolsky

I really love this quote, because it’s insanely provocative to any language designer. What would a programming language look like if Bayes’ rule were as simple as an `if` statement?

Let’s start with a toy problem, and refactor it until Bayes’ rule is baked right into our programming language.

Imagine, for a moment, that we’re in charge of administering drug tests for a small business. We’ll represent each employee’s test results (and drug use) as follows:

``````data Test = Pos | Neg
deriving (Show, Eq)

data HeroinStatus = User | Clean
deriving (Show, Eq)
``````

Assuming that 0.1% of our employees have used heroin recently, and that our test is 99% accurate, we can model the testing process as follows:

``````drugTest1 :: Dist d => d (HeroinStatus, Test)
drugTest1 = do
heroinStatus <- percentUser 0.1
testResult <-
if heroinStatus == User
then percentPos 99
else percentPos 1
return (heroinStatus, testResult)

-- Some handy distributions.
percentUser p = percent p User Clean
percentPos p = percent p Pos Neg

-- A weighted distribution with two elements.
percent p x1 x2 =
weighted [(x1, p), (x2, 100-p)]
``````

This code is based our FDist monad, which is in turn based on PFP. Don’t worry if it seems slightly mysterious; you can think of the “`<-`” operator as choosing an element from a probability distribution.

Running our drug test shows every possible combination of the two variables:

``````> exact drugTest1
[Perhaps (User,Pos) 0.1%,
Perhaps (User,Neg) 0.0%,
Perhaps (Clean,Pos) 1.0%,
Perhaps (Clean,Neg) 98.9%]
``````

If you look carefully, we have a problem. Most of the employees who test positive are actually clean! Let’s tweak our code a bit, and try to zoom in on the positive test results.

Ignoring negative test results

We don’t care about employees who test negative for heroin use. We can throw away those results using Haskell’s `Maybe` type:

``````drugTest2 :: Dist d => d (Maybe HeroinStatus)
drugTest2 = do
(heroinStatus, testResult) <- drugTest1
return (if testResult == Pos
then Just heroinStatus
else Nothing)
``````

This shows us just the variables we’re interested in, but the percentages are still a mess:

``````> exact drugTest2
[Perhaps (Just User) 0.1%,
Perhaps Nothing 0.0%,
Perhaps (Just Clean) 1.0%,
Perhaps Nothing 98.9%]
``````

Ideally, we want to reach into that distribution, discard all the `Nothing` values, and then normalize the remaining percentages so that they add up to 100%. We can do that with a bit of Haskell code:

``````value (Perhaps x _) = x
prob (Perhaps _ p) = p

catMaybes' :: [Perhaps (Maybe a)] -> [Perhaps a]
catMaybes' [] = []
catMaybes' (Perhaps Nothing _ : xs) =
catMaybes' xs
catMaybes' (Perhaps (Just x) p : xs) =
Perhaps x p : catMaybes' xs

onlyJust :: FDist (Maybe a) -> FDist a
onlyJust dist
| total > 0 = PerhapsT (map adjust filtered)
| otherwise = PerhapsT []
where filtered = catMaybes' (runPerhapsT dist)
total = sum (map prob filtered)
Perhaps x (p / total)
``````

And sure enough, that lets us zoom right in on the interesting values:

``````> exact (onlyJust drugTest2)
[Perhaps User 9.0%,
Perhaps Clean 91.0%]
``````

OK, that’s definitely not good news. Even though our test is 99% accurate, 91% of the people we accuse will be innocent!

(If this seems counter-intuitive, imagine what happens if we have no employees who use heroin. Out of 1000 employees, 10 will have a positive test result, and 100% of them will be innocent.)

The above code gets the right answer, but it’s still pretty awkward. We have `Just` and `Nothing` all over the place, stinking up our application code. Why don’t we hide them inside our monad?

Fortunately, we can do just that, using the `MaybeT` monad transformer. Don’t worry if you don’t understand the details:

``````type FDist' = MaybeT FDist

-- Monads are Functors, no matter what
instance Functor FDist' where
fmap = liftM

instance Dist FDist' where
weighted xws = lift (weighted xws)
``````

As Russel and Norvig point out (chapter 13), cancelling out the impossible worlds and normalizing the remaining probabilities is equivalent to Bayes’ rule. So in homage, we can write:

``````bayes :: FDist' a -> [Perhaps a]
bayes = exact . onlyJust . runMaybeT
``````

We’re missing just one piece, a statement to prune out impossible worlds:

``````condition :: Bool -> FDist' ()
condition = MaybeT . return . toMaybe
where toMaybe True  = Just ()
toMaybe False = Nothing
``````

And now, here’s our final drug test.

``````drugTest3 :: FDist' HeroinStatus ->
FDist' HeroinStatus
drugTest3 prior = do
heroinStatus <- prior
testResult <-
if heroinStatus == User
then percentPos 99
else percentPos 1
-- As easy as an 'if' statement:
condition (testResult == Pos)
return heroinStatus
``````

This gives us the same results as before:

``````> bayes (drugTest3 (percentUser 0.1))
[Perhaps User 9.0%,
Perhaps Clean 91.0%]
``````

So testing all of our employees is still hopeless. But what if we only tested employees with clear signs of heroin abuse? In that case, there’s probably a 50/50 chance of drug use.

And that gives us remarkably better results. Out of the people who test positive, 99% will be using drugs:

``````> bayes (drugTest3 (percentUser 50))
[Perhaps User 99.0%,
Perhaps Clean 1.0%]
``````

The moral of this story: No matter how accurate our drug test, we shouldn’t bother to run it unless we have probable cause.

Similar constraints apply to any population-wide surveillance: If you’re searching for something sufficiently rare (criminals, terrorists, strange diseases), it doesn’t matter how good your tests are. If you test everyone, you’ll drown under thousands of false positives.

If we go back and look at part 1, this gives us:

``````type FDist' = MaybeT (PerhapsT [])
``````

This has some interesting consequences:

1. If we collapse `MaybeT` into `PerhapsT`, we can work with probability distributions that don’t sum to 1, where the “missing” probability represents an impossible world.
2. We can add `condition` to `Rand` (part 2) using `MaybeT Rand`. Bayes’ rule is basically the combination of `MaybeT` and a suitable `catMaybes` function applied to any probability distribution monad.

Also worth noting: Popular theories of natural language semantics are based on the λ-calculus. Chung-chieh Shan has a fascinating paper showing how to incorporate monads and monad transformers into this model. If we replaced Chung-chieh Shan’s Set monad with one of our Bayesian monads, what would we get? (Currently, I have no idea.)

Part 4: The naive Bayes classifier
Part 5: What happens if we replace MaybeT with PerhapsT?

1. Michael said about 2 hours later:

Ah, thanks for the link to the Shan paper—I had not seen it before, and it’s a very interesting read.

As to what would come of using a Bayesian monad in place of Set, I cannot say, though it sounds to me like it might lead to a good model for a semantics including fuzzy categories (in the natural language sense, rather than the category theoretic, even assuming CT has a sense of “fuzzy category”).

2. Eric said about 14 hours later:

Interesting! Is there a good introduction to fuzzy categories for non-linguists?

IIRC, Shan uses the Set monad to represent ambiguous referents. The idea is that if the pronoun “he” might represent one of two people, you can do the calculation either way. (You can see the connection to logic programming here.)

Using a probability distribution monad, you could say, “We’re talking about Frank with 90% probability, and Mike with 10% probability.”

Of course, it’s not clear (to me, at least) how this relates to probabilistic parsing, or what the ability to use Bayes’ rule actually buys us.

And as for fuzzy categories, well, I really shouldn’t have looked, but here you go:

Chapter 15 introduces toposes. A topos is a kind of generalized set theory in which the logic is intuitionistic instead of classical… Categories of fuzzy sets are recognized as almost toposes, and modest sets, which are thought by many to be the best semantic model of polymorphic lambda calculus, live in a specific topos.

Anyway, that’s category theory for you.

3. alpheccar said about 15 hours later:

I really like this serie of posts about probabilities and Haskell !

Is there a good reason why MaybeT isn’t in the standard libraries? A MaybeT (esp with liftIO) is usually something I end up needing.

5. Eric said about 19 hours later:

Judging from what dons told me last night, including MaybeT in the standard Haskell libraries would be uncontroversial—it’s just a matter of somebody making the proposal and going through the process.

A note for anyone trying to follow along at home:

You may want to grab a MaybeT implementation from the New monads page, and take a look at part 1 and part 2.

My apologies if this stuff is still a bit hard to get running. I hope to put up a Darcs repository soon. In the meantime, please feel free to post questions here!

6. Danno said 1 day later:

Sir, you just blew my mind.

Really learning Haskell is continually moving up on my priority scale.

7. David House said 6 days later:

A few comments before I play with this a little more:

1. You need a `Fractional` instance for `Prob` as you’re using division in `onlyJust`. This is fine, you can just derive it.
2. The `Functor` instance for `FDist'` isn’t needed; there’s already an `instance (Functor m) => Functor (MaybeT m)` in the `MaybeT` module.
3. `condition` is just `guard`.

For anyone that’s interested in playing with this themselves, I’ve pastebinned a file which contains all the code you’ll need. Fire up GHCi on it!

8. David House said 6 days later:

Huh. Sorry about those appearing in boxes; they didn’t when I previewed the comment. Methinks someone’s comment CSS is leaking into the comments themselves :)

9. Eric said 10 days later:

David: Thanks for helping people get started!

I’ve now set up a Darcs repository with all the necessary bits:

``darcs get http://www.randomhacks.net/darcs/probability``
10. George Martkov said 17 days later:

Also there’re simple reason why drug tests don’t work )))

11. Chris said 18 days later:

I recently read Shan’s paper. Mind-blowingly awesome. But the Set and Pointed Set monads aren’t used there for fuzzy categories or for ambiguous referents. (He actually uses the reader/environment monad to deal with different variable assignments, like with “he” having multiple possible referents.)

In the paper, Sets and Pointed Sets are used for the semantics of questions and focus, respectively. Consider a sentence like “Who ordered a tuna sandwich?” The idea is that the semantic interpretation of a question like this would be a set of interpretations something like ordered(x,tuna sandwich) for every x in some contextually given set of alternatives. It might be broad – the “who” could be any person or even any animate – but more typically it would be more restricted – the people in a restaurant, the friends you picked up lunch for, etc.

Shan then uses pointed sets to deal with what could be answers to such questions: “John was the one that had the tuna sandwich.” This is like picking one of the alternatives out of that context set. But you still need to care about the rest of the set of alternatives. Consider “Only John ordered a tuna sandwich”. The truth of such a sentence depends on the set of options: it’s more likely to be true if only your friends are under consideration than if every living human being is.

So, in this context, I don’t think Bayesianifying the Set/Pointed Set monads buys you anything. (Not to say there might not be other linguistic uses of Bayesian monads.)

12. Eric said 18 days later:

Yeah, it looks as though I had generalized accidentally from Shan’s treatment of interrogative pronouns (“who”, etc.) to pronouns in general.

And I don’t pretend to understand the linguistic implications of focus, so I should probably refrain from commenting on Pointed Set monads until I read more papers. :-)

But my larger question involved the semantics of ambiguous sentences. Specifically, I was interested in the relationship between natural language parsing and the resulting semantics in such sentences as:

Fruit flies like a banana.

This classic example can be parsed in two fairly plausible ways:

(Fruit flies)-NP (like-VP a banana).
Fruit-NP (flies-VP (like a banana)).

(There’s also a bunch of horribly bad parses which treat “fruit” as a transitive verb. Hey, it’s in the dictionary.)

But these sentences aren’t that different from:

Frank called Mark, and he got pretty upset.

...where “he” could refer to either Frank or Mark. This sentence would become much less ambiguous if we could estimate the following probabilities from the surrounding context:

P(Frank got upset|context)
P(Mark got upset|context)

So, my question: Given Chung-chieh Shan’s framework, and the various probability monads (with or without Bayesian conditioning), can we assign reasonable semantics to ambiguous sentences?

As I said earlier, I don’t have the foggiest idea of how to answer this question. :-)

13. Max Lybbert said 73 days later:

Unfortunately I just barely ran across this blog today. I like it, and will be coming back. So, although this is very late, you may consider looking at the CRM114 Discriminator (http://crm114.sourceforge.net/ ), which is supposed to be a language with Bayesian filtering (and Markov chaining, and …) built in. But its design looks a little more like Perl than Lisp or Haskell.

14. Allan E said 111 days later:

Regarding David’s comment “condition is just guard”, this doesn’t work unless PerhapsT is a MonadPlus instance… but in the darcs implementation there’s a note on how this leads to ambiguous semantics.

Can we remove the ambiguity by picking the one for which

condition = guard

works?

15. John Beattie said 123 days later:

“The moral of this story: No matter how accurate our drug test, we shouldn’t bother to run it unless we have probable cause.”

This can be related to the current security strategy at airports. Look for ‘carnival booth algorithm’ for a description of the strategy and for criticism of it.

At first sight, I thought that your remark above leads to the conclusion that the airport security strategy is right: which is to say, select people for extra screening based on their ethnic background.

But in fact, it does the opposite: one should only select people for extra screening based on whether there is ‘probable cause’, i.e. on careful, human surveillance.