Posted by Eric Kidd Sat, 03 Mar 2007 09:02:00 GMT

(Refactoring Probability Distributions: part 1, part 2, part 3, part 4)

The world is full of messy classification problems:

• “Is this order fraudulent?”
• “It this e-mail a spam?”
• “What blog posts would Rachel find interesting?”
• “Which intranet documents is Sam looking for?”

In each case, we want to classify something: Orders are either valid or fraudulent, messages are either spam or non-spam, blog posts are either interesting or boring. Unfortunately, most software is terrible at making these distinctions. For example, why can’t my RSS reader go out and track down the 10 most interesting blog posts every day?

Some software, however, can make these distinctions. Google figures out when I want to watch a movie, and shows me specialized search results. And most e-mail clients can identify spam with over 99% accuracy. But the vast majority of software is dumb, incapable of dealing with the messy dilemmas posed by the real world.

So where can we learn to improve our software?

Outside of Google’s shroud of secrecy, the most successful classifiers are spam filters. And most modern spam filters are inspired by Paul Graham’s essay A Plan for Spam.

So let’s go back to the source, and see what we can learn. As it turns out, we can formulate a lot of the ideas in A Plan for Spam in a straightforward fashion using a Bayesian monad.

### Functions from distributions to distributions

Let’s begin with spam filtering. By convention, we divide messages into “spam” and “ham”, where “ham” is the stuff we want to read.

``````data MsgType = Spam | Ham
deriving (Show, Eq, Enum, Bounded)
``````

Let’s assume that we’ve just received a new e-mail. Without even looking at it, we know there’s a certain chance that it’s a spam. This gives us something called a “prior distribution” over `MsgType`.

``````> bayes msgTypePrior
[Perhaps Spam 64.2%, Perhaps Ham 35.8%]
``````

But what if we know that the first word of the message is “free”? We can use that information to calculate a new distribution.

``````> bayes (hasWord "free" msgTypePrior)
[Perhaps Spam 90.5%, Perhaps Ham 9.5%]
``````

The function `hasWord` takes a string and a probability distribution, and uses them to calculate a new probability distribution:

``````hasWord :: String -> FDist' MsgType ->
FDist' MsgType
hasWord word prior = do
msgType <- prior
wordPresent <-
wordPresentDist msgType word
condition wordPresent
return msgType
``````

This code is based on the Bayesian monad from part 3. As before, the “`<-`” operator selects a single item from a probability distribution, and “condition” asserts that an expression is true. The actual Bayesian inference happens behind the scenes (handy, that).

If we have multiple pieces of evidence, we can apply them one at a time. Each piece of evidence will update the probability distribution produced by the previous step:

``````hasWords []     prior = prior
hasWords (w:ws) prior = do
hasWord w (hasWords ws prior)
``````

The final distribution will combine everything we know:

``````> bayes (hasWords ["free","bayes"] msgTypePrior)
[Perhaps Spam 34.7%, Perhaps Ham 65.3%]
``````

This technique is known as the naive Bayes classifier. Looked at from the right angle, it’s surprisingly simple.

(Of course, the naive Bayes classifier assumes that all of our evidence is independent. In theory, this is a pretty big assumption. In practice, it works better than you might think.)

But this still leaves us with a lot of questions: How do we keep track of our different classifiers? How do we decide which ones to apply? And do we need to fudge the numbers to get reasonable results?

In the following sections, I’ll walk through various aspects of Paul Graham’s A Plan for Spam, and show how to generalize it. If you want to follow along, you can download the code using Darcs:

``darcs get http://www.randomhacks.net/darcs/probability``

## Fromberger spam filtering paper

Posted by Eric Mon, 30 Sep 2002 00:00:00 GMT

Michael Fromberger has written a nice formal analysis (PDF) of Paul Graham's Plan for Spam.

Tags ,

## Bayesian Whitelisting: Finding the Good Mail Among the Spam

Posted by Eric Sun, 29 Sep 2002 00:00:00 GMT

The biggest challenge with spam filtering is reducing false positives--that is, finding the good mail among the spam. Even the best spam filters occasionally mistake legitimate e-mail for spam. For example, in some recent tests, `bogofilter` processed 18,000 e-mails with only 34 false positives. Unfortunately, several of these false positives were urgent e-mails from former clients. This unpleasant mistake wasn't necessary--the most important of these false positives could have been avoided with an automatic whitelisting system.

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

Useful sites about machine-learning algorithms, for developers of spam filters: Machine Learning Network, the Bow toolkit, Latent Semantic Analysis (used by Apple's mail client), Bayesian Latent Semantic Analysis, text clustering, more text clustering, Using Clustering to Boost Text Classification (PDF) and TFIDF notes.

I wouldn't be entirely surpised if neural networks worked well here, either--the problem has that "figure out where to draw the boundaries between clusters" aspect that maps nicely onto the math of neural networks.

Tags , ,

## FTC Spam Archive

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

The FTC appears to have a huge spam database.

Tags

## Using Bogofilter with Spam Assassin

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

For deadly-accurate spam filtering, combine a well-trained bogofilter with SpamAssassin. Here's how.

Add the following lines to your procmailrc file, before you run SpamAssassin:

``````:0HB
* ? bogofilter
{
:0fw
| formail -I "X-Spam-Bogofilter: yes"
}``````

Add the following lines to your `/etc/spamassassin/local.cf` file:

``````header    BOGOFILTER  X-Spam-Bogofilter =~ /yes/
describe  BOGOFILTER  Message has too many bogons.
score     BOGOFILTER  5.0``````

Presto! This plugs almost all the holes in SpamAssassin's defense, and uses SpamAssassin's auto-whitelist (you've got it turned on, right?) to protect against false positives.

Tags

## How To Test a Trainable Spam Filter

Posted by Eric Sun, 22 Sep 2002 00:00:00 GMT

Ever since Paul Graham published A Plan for Spam, "trainable" spam filters have become the latest fashion. These filters train themselves to know the characteristics of your personal e-mail. Supposedly, this extra knowledge allows them to make fewer mistakes, and makes them harder to fool. But do these filters actually work? In this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter, and describe the steps required to evaluate such a filter accurately.

Tags , ,

## Weekend Spam Update

Posted by Eric Mon, 16 Sep 2002 00:00:00 GMT

Between midnight Friday and 11:00am Monday, I received over 160 spams. SpamAssassin stopped all but five of them. SpamAssassin misidentified 2 legitimate messages as spam; both were unimportant mailing list messages from a user whose site is frequently used to send spam. (If I needed to correspond with this user on a regular basis, I'd add his name to my whitelist--or help educate him.)

If you don't need a public e-mail address, let me suggest a new rule: Never give your real e-mail address to anybody you don't know. This includes online vendors. If necessary, use a throwaway webmail address instead.

I also devoted quite a bit of work to repacking libJudy, HP's ultra-optimized associative array library. This library is used by bogofilter, Eric Raymond's promising new spam filter.

Content-based spam filtering is extremely good, and is improving rapidly. Just don't send me any e-mail about hot stock picks involving real estate companies in Nigeria that specialize in toner cartridge factories, and there shouldn't be any problem.

Tags

## Bogofilter: A New Spam Filter

Posted by Eric Fri, 13 Sep 2002 00:00:00 GMT

According to Linux Weekly News, Eric Raymond is writing a new spam filter called `bogofilter` based on Bayesian analysis, as suggested by Paul Graham. Unlike the excellent SpamAssasin, which merely requires whitelisting a small number of addresses, `bogofilter` requires training with around 1,000 e-mail messages. But `bogofilter` may ultimately offer more hope for defeating spam.

Once trained, `bogofilter` recognizes most incoming spam (allegedly as much as SpamAssassin, but we'll have to wait and see). More importantly, however, `bogofilter` is very good at not recognizing legitimate e-mail as spam (in other words, it has a very low false positive rate).

The secret strength of `bogofilter`, however, is the training process. Because bogofilter is trained by the user, each user gets a personalized spam filter. This means that (1) information of professional interest to the reader will generally be recognized as non-spam (however incriminating it might otherwise look), and (2) there won't be a centralized list of rules for the spammer to read.

I suspect that the new MacOS X 10.2 mail client may be using a similar technique.

Tags

## SpamAssassin: An Decent Spam Filter

Posted by Eric Tue, 06 Aug 2002 00:00:00 GMT

SpamAssassin is a highly accurate open source spam filter.

There are two major components to the SpamAssassin filtering system: a set of rules which match various properties of an e-mail (e.g., whether it mentiones stock alerts or Nigerian banks), and a set of weights for each rule. The weights are assigned automatically, by analyzing various real-world mail spools. So SpamAssassin is essentially an adaptive system--the rules are periodically recalibrated, and whether a given property is good or bad may change over time.

SpamAssassin also includes an "auto whitelist", which supposedly learns to recognize your most frequent correspondents.

There're probably some chewy ideas in here for an evolutionary biologist--spam filtering involves an arms race between the spammers and the mail administrators of the world, and the most advanced spam filters are beginning to resemble immune systems.

(If you're a Debian user, type ```apt-get install spamassassin spamc libnet-dns-perl razor``` and take a look at the setup instructions. If you want to use `spamd`, try using the `--max-children 10` argument; it will save you a lot of grief.)

Tags