Posted by Eric Kidd
Thu, 22 Feb 2007 23:11:00 GMT
Part 3 of Refactoring Probability Distributions.
(Part 1: PerhapsT,
Part 2: Sampling functions)
A very senior Microsoft developer who moved to Google told
me that Google works and thinks at a higher level of abstraction than
Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if
statement,” he said. -Joel Spolsky
I really love this quote, because it’s insanely provocative
to any language designer. What would a programming language look
like if Bayes’ rule were as simple as an if statement?
Let’s start with a toy problem, and refactor it until Bayes’ rule is baked
right into our programming language.
Imagine, for a moment, that we’re in charge of administering drug tests for
a small business. We’ll represent each employee’s test results (and drug use) as follows:
data Test = Pos | Neg
deriving (Show, Eq)
data HeroinStatus = User | Clean
deriving (Show, Eq)
Assuming that 0.1% of our employees have used heroin recently, and that our test is 99%
accurate, we can model the testing process as follows:
drugTest1 :: Dist d => d (HeroinStatus, Test)
drugTest1 = do
heroinStatus <- percentUser 0.1
testResult <-
if heroinStatus == User
then percentPos 99
else percentPos 1
return (heroinStatus, testResult)
percentUser p = percent p User Clean
percentPos p = percent p Pos Neg
percent p x1 x2 =
weighted [(x1, p), (x2, 100p)]
This code is based our FDist monad, which is in turn based on
PFP. Don’t worry if it seems slightly mysterious; you can think of the
“<-” operator as choosing an element from a probability
distribution.
Running our drug test shows every possible combination of the two
variables:
> exact drugTest1
[Perhaps (User,Pos) 0.1%,
Perhaps (User,Neg) 0.0%,
Perhaps (Clean,Pos) 1.0%,
Perhaps (Clean,Neg) 98.9%]
If you look carefully, we have a problem. Most of the employees who test
positive are actually clean! Let’s tweak our code a bit, and try to zoom
in on the positive test results.
Read more...
Tags Haskell, Math, Monads, Probability, Recommended | 15 comments
Posted by Eric Kidd
Sat, 10 Feb 2007 14:55:00 GMT
Or, how to optimize MapReduce, and when folds are faster than loops
Purely functional programming might actually be worth the pain, if you care about large-scale optimization.
Lately, I’ve been studying how to speed up parallel algorithms. Many
parallel algorithms, such as Google’s MapReduce, have two parts:
- First, you transform the data by mapping one or more functions over each value.
- Next, you repeatedly merge the transformed data, “reducing” it down to a
final result.
Unfortunately, there’s a couple of nasty performance problems lurking here. We really want to combine all those steps into a single pass, so that we can eliminate temporary working data. But we don’t always want to do this optimization by hand—it would be better if the compiler could do it for us.
As it turns out, Haskell is an amazing testbed for this kind of
optimization. Let’s build a simple model, show where it breaks, and then
crank the performance way up.
Trees, and the performance problems they cause
We’ll use single-threaded trees for our testbed. They’re simple enough to demonstrate the basic idea, and they can be generalized to parallel systems. (If you want know how, check out the papers at the end of this article.)
A tree is either empty, or it is
a node with a left child, a value and a right child:
data Tree a = Empty
| Node (Tree a) a (Tree a)
deriving (Show)
Here’s a sample tree containing three values:
tree = (Node left 2 right)
where left = (Node Empty 1 Empty)
right = (Node Empty 3 Empty)
We can use treeMap to apply a function to every value in a
tree, creating a new tree:
treeMap :: (a -> b) -> Tree a -> Tree b
treeMap f Empty = Empty
treeMap f (Node l x r) =
Node (treeMap f l) (f x) (treeMap f r)
Using treeMap, we can build various functions that manipulate
trees:
treeDouble tree = treeMap (*2) tree
treeIncr tree = treeMap (+1) tree
What if we want to add up all the values in a tree? Well, we could write a
simple recursive sum function:
treeSum Empty = 0
treeSum (Node l x r) =
treeSum l + x + treeSum r
But for reasons that will soon become clear, it’s much better to refactor
the recursive part of treeSum into a reusable
treeFold function (“fold” is Haskell’s name for “reduce”):
treeFold f b Empty = b
treeFold f b (Node l x r) =
f (treeFold f b l) x (treeFold f b r)
treeSum t = treeFold (\l x r -> l+x+r) 0 t
Now we can double all the values in a tree, add 1 to each, and sum up the
result:
treeSum (treeIncr (treeDouble tree))
But there’s a very serious problem with this code. Imagine that we’re
working with a million-node tree. The two calls to treeMap
(buried inside treeIncr and treeDouble) will each
create a new million-node tree. Obviously, this will kill our performance,
and it will make our garbage collector cry.
Fortunately, we can do a lot better than this, thanks to some funky GHC
extensions.
Read more...
Tags Haskell, Performance, Recommended | 7 comments
Posted by Eric Kidd
Sat, 03 Dec 2005 16:30:00 GMT
Years ago, I looked at Ruby and decided to ignore it. Ruby wasn’t as
popular as Python, and it wasn’t as powerful as LISP. So why should I
bother?
Of course, we could turn those criteria around. What if Ruby were more
popular than LISP, and more powerful than Python? Would that be enough
to make Ruby interesting?
Before answering this question, we should decide what makes LISP so
powerful. Paul Graham has written eloquently about LISP’s virtues. But, for the sake of argument, I’d like to boil them down to two things:
- LISP is a dense functional language.
- LISP has programmatic macros.
As it turns out, Ruby compares well as a functional language, and it fakes
macros better than I’d thought.
Read more...
Tags LISP, Macros, Recommended, Ruby
Posted by Eric
Tue, 11 Oct 2005 04:00:00 GMT
Back in 1961, John McCarthy (the inventor of LISP) described
an interesting mathematical operator called amb. Essentially,
amb hates to be called with no arguments, and can look
into the future to keep that from happening. Here's how it might look
in Ruby.
Read more...
Tags Continuations, Hacks, Recommended, Ruby | 7 comments
Posted by Eric
Sun, 22 Jun 2003 04:00:00 GMT
I'm a 27-year-old programmer. When I'm 55--in 2031--I want to
still be a programmer. And in 2031, I want to love my job as much as I
do today. What will 2031 look like? Right now, two groups are
offering their visions for the future: Microsoft and the open source
movement. A third group is conspicuously silent: small, independent
developers. What do the Microsoft and open source futures look like?
Will the independent developers speak up? Which future should I fight
for? My choices, and the choices of hundreds of thousands of people
like me, will help determine which future we get. So let's take a look
and start talking.
Read more...
Tags Recommended
Posted by Eric
Sun, 10 Nov 2002 05:00:00 GMT
This Saturday, I attended the LL2 conference at MIT. LL2 is
dedicated to "lightweight" programming languages, a delibrately loose
category including (1) any pleasant, easy-to-use scripting language and
(2) any academic language which makes it easier to prototype and write
software quickly. LL2 is a small, informal workshop with audience
participation. The attendees are a diverse bunch, and enjoy goring
each other's sacred cows. You have been warned.
Read more...
Tags Conference, Recommended | 6 comments
Posted by Eric
Sun, 29 Sep 2002 04:00:00 GMT
The biggest challenge with spam filtering is reducing false
positives--that is, finding the good mail among the spam. Even the
best spam filters occasionally mistake legitimate e-mail for spam. For
example, in some recent
tests, bogofilter
processed 18,000 e-mails with only 34 false positives. Unfortunately,
several of these false positives were urgent e-mails from former
clients. This unpleasant mistake wasn't necessary--the most important
of these false positives could have been avoided with an automatic
whitelisting system.
Read more...
Tags Hacks, Probability, Python, Recommended, Spam | no comments
Posted by Eric
Sun, 22 Sep 2002 04:00:00 GMT
Ever since Paul Graham published A Plan for Spam,
"trainable" spam filters have become the latest fashion. These filters
train themselves to know the characteristics of your personal e-mail.
Supposedly, this extra knowledge allows them to make fewer mistakes,
and makes them harder to fool. But do these filters actually work? In
this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter,
and describe the steps required to evaluate such a filter
accurately.
Read more...
Tags Hacks, Recommended, Spam
Posted by Eric
Fri, 13 Sep 2002 04:00:00 GMT
I've recently been reading a lot of excellent essays on
programming language design by Paul Graham. Paul and I agree about
a number of things: (1) LISP is beautiful and powerful family of
languages, even by modern standards, (2) all existing dialects of LISP
are lacking a certain something, and (3) programmatic macros are a Good
Idea.
Read more...
Tags Hacks, LISP, Macros, Recommended | 4 comments