Ever since Paul Graham published A Plan for Spam, "trainable" spam filters have become the latest fashion. These filters train themselves to know the characteristics of your personal e-mail. Supposedly, this extra knowledge allows them to make fewer mistakes, and makes them harder to fool. But do these filters actually work? In this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter, and describe the steps required to evaluate such a filter accurately.
False Positives are Much Worse Than False Negatives
As Paul Graham points out, we don't want spam filters to block as much spam as possible--we want them to block as much spam as possible without blocking any significant amount of legitimate e-mail. Nobody, after all, wants to miss an important message from a business client.
Accidentally blocking a legitimate e-mail is called a "false positive". Failing to block a spam is called a "false negative". A good spam filter should have a very low number of false positives.
Training vs. Control Groups
When evaluating a spam filter, it's easy to make a serious mistake: testing the filter with the same e-mail you used to train it. In my experience, this makes the filter perform unrealistically well. Because the filter has already seen every message, good and bad, it will have little problem sorting those messages correctly in the future.
Instead, you should divide your messages into training and control groups. You can do this in one of two ways: place every other message in your control group, or place the most recent half of your messages in your control group. The latter approach mimics the way the filter behaves in the real world: users train the filter on the early messages, and run it on the later ones.
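The two splitting strategies can be sketched in a few lines of Python. This is an illustrative sketch, not part of my actual test harness; `messages` is assumed to be a list already in chronological order.

```python
# Two ways to split a chronologically ordered mailbox into
# training and control groups.

def split_alternating(messages):
    """Place every other message in the control group."""
    train = messages[0::2]
    control = messages[1::2]
    return train, control

def split_by_time(messages):
    """Train on the older half; control on the most recent half,
    mimicking real-world use."""
    mid = len(messages) // 2
    return messages[:mid], messages[mid:]
```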
In this experiment, I used an alternating approach to divide my messages into training and control groups. However, my message counts aren't exactly equal because Perl's Mail::Util and the mutt e-mail client disagree about message boundaries.
Cleaning Your Data Sets
Bogofilter is extremely sensitive to any artifacts in your data set. In particular, if you received your spam messages on one e-mail account, and your non-spam messages on a second account, bogofilter will generally filter using the names of your mail servers. So your spam and non-spam messages should come from the same account, at roughly the same time, and should have any MUA or spam filtering headers carefully removed. Skipping this step will make the filter look unrealistically good.
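The header-cleaning step can be sketched with Python's standard email package. The particular header names below are illustrative assumptions; the right list depends on your MUA and on any filters upstream of your mailbox.

```python
# A minimal sketch of stripping MUA and spam-filter headers
# before building a data set, so the filter can't key on them.

from email import message_from_string

# Assumed examples of headers that leak MUA or filtering artifacts.
SUSPECT_HEADERS = ("X-Spam-Status", "X-Spam-Level", "X-Mailer", "User-Agent")

def clean_message(raw):
    msg = message_from_string(raw)
    for header in SUSPECT_HEADERS:
        del msg[header]  # deleting an absent header is a no-op
    return msg.as_string()
```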
Your data sets should also be very clean. This means painstakingly wading through several thousand messages and sorting the good from the bad. I cleaned my data sets with a variety of tools--and extensive manual inspection--but approximately 1 message in 1000 or 2000 appears to have been miscategorized. This is about as clean a data set as you can get in the real world, especially if you're relying on end users to do the sorting.
Bogofilter Input Data
I tested bogofilter with a variety of data sets:
- NONSPAM.TRAIN: A mix of 1311 inbox and mailing list messages, with a strong bias towards inbox messages.
- NONSPAM.CONTROL: A mix of 1409 inbox and mailing list messages, with a strong bias towards inbox messages.
- SPAM.TRAIN: 1401 spam messages.
- SPAM.CONTROL: 1400 spam messages.
- CNBC.TRAIN: 92 "spammy-sounding" (but legitimate) messages removed from NONSPAM.TRAIN in the first round of testing. Adding these messages to the training set tends to confuse most filters (see below).
- INBOX: 18,000+ inbox messages. Most filtering programs claim to find between 200 and 500 spams in this data.
- ALL-SPAM: An extremely clean (but not perfect) set of 9,000+ spams from a period of several years.
Known biases: Many of my friends and relatives are college educated, and few of them use Exchange, Yahoo or Hotmail. This means that my correspondents use a large vocabulary, and my e-mail tends to be plain text, not HTML. There aren't many Word attachments, marketing proposals or e-mails about "teen cuties". This probably makes my e-mail easier to filter than the average user's.
I trained bogofilter with NONSPAM.TRAIN and SPAM.TRAIN. Bogofilter correctly identified every message in NONSPAM.CONTROL, and had 139 false negatives (10%, or 1 in 10) in SPAM.CONTROL. Bogofilter found 229 spams in INBOX with 34 false positives (<0.2%, or 1 in 500--nearly all of which were non-spam bulk mailings), and failed to filter 1253 spams in ALL-SPAM (14%).
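The quoted rates follow from simple division. The sketch below assumes round figures of 18,000 messages for INBOX and 9,000 for ALL-SPAM, since the exact counts are only given approximately.

```python
# Arithmetic behind the quoted rates (message counts are approximate).
fn_rate = 139 / 1400      # false negatives in SPAM.CONTROL: ~10%
fp_rate = 34 / 18000      # false positives filtering INBOX: <0.2%
miss_rate = 1253 / 9000   # spams missed in ALL-SPAM: ~14%
```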
Out of the 139 false negatives in SPAM.CONTROL, approximately 42 were Base64-encoded, which hides the text of the messages from bogofilter. Another 29 such spams were stopped. In the majority of cases, header information was not sufficient to filter Base64-encoded spams. This strongly suggests that Bayesian spam filters should decode Base64 (and possibly quoted-printable) sections with MIME types of text/html before processing the text.
For the second trial, I trained bogofilter with the 92 e-mails in CNBC.TRAIN. These messages include daily information about the stock market, promising companies and hot stock picks. In general, these messages are exceptionally hard to tell from spam. I then re-ran my earlier tests with the "polluted" training set.
Bogofilter correctly identified every message in NONSPAM.CONTROL, and had 174 false negatives (12%) in SPAM.CONTROL. Bogofilter found 155 spams in INBOX with 10 false positives (<0.06%), and failed to filter 1610 spams in ALL-SPAM (18%).
Characteristics of False Positives
Nearly all of the false positives filtering INBOX fall into one of two categories: (1) spammy-sounding bulk mail from such organizations as CNBC and the ACLU, and (2) e-mail from family and clients. I can accept (1) as "casualties of war", but (2) is a serious problem (a few of these messages were actually urgent e-mails from paying clients).
SpamAssassin reduces false positives using an "auto-whitelist" (AWL). After computing a score for each message, it adds that score to a running average of that sender's previous scores and divides the result by two. This technique is mathematically elegant, extremely reliable, and conceptually justifiable--your friends don't magically turn into spammers overnight, and the spammers don't know your friends' e-mail addresses. Furthermore, the 'From:' address on a piece of mail is much more important than any particular word in the body of the message, and may therefore be special-cased without accusations of hackery.
(SpamAssassin also provides an explicit whitelist, which allows organizations like the ACLU to e-mail me without getting caught in the filters. An AWL wouldn't work here, because the ACLU almost always sounds like a spammer.)
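The AWL's averaging scheme can be sketched as follows. How SpamAssassin actually stores and keys its history is an implementation detail I'm glossing over; this sketch just shows the score adjustment, where one spammy-looking message from a known-good sender is pulled back toward that sender's historical average.

```python
# A sketch of the auto-whitelist (AWL) score adjustment: average
# each new score with the sender's historical mean.

class AutoWhitelist:
    def __init__(self):
        self.history = {}  # sender -> (total of adjusted scores, count)

    def adjust(self, sender, score):
        total, count = self.history.get(sender, (0.0, 0))
        if count:
            # Add the running average to the new score and halve.
            score = (score + total / count) / 2.0
        self.history[sender] = (total + score, count + 1)
        return score
```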
Open Research Issues
I devoted quite a lot of time to training bogofilter. Most of this time was spent manually sorting mail into spam and non-spam categories. Apple's new mail client provides an elegant training interface which might be sufficient for Bayesian spam filtering. But with the current bogofilter UI, we can't realistically expect end-users to train bogofilter quickly.
A related problem is the increase in false negatives caused by training bogofilter with "spammy sounding" e-mail. The CNBC.TRAIN data increased false negatives on SPAM.CONTROL from 10% to 12%, and on ALL-SPAM from 14% to 18%. It might be possible to improve filter performance by manually whitelisting these "spammy sounding" senders and not retraining the filter to accept them.
Bogofilter catches between 82% and 90% of spam in my collection, with an extremely low false positive rate. There are some readily-identifiable trends among the false positives and false negatives, which could be addressed by adding two features:
- Base64 and quoted-printable decoding. This would cure the single most common type of false negative: A carefully-encoded spam from a previously unknown spammer.
- An auto-whitelist (AWL). This would permit messages from frequent correspondents to pass through the spam filter, and would eliminate nearly all of the serious false positives.
Some open research issues exist. It's not clear how well an ordinary user would be able to train bogofilter, and bogofilter's accuracy is noticeably impaired by training it with "spammy sounding" (but legitimate) e-mails. I encourage further experimentation in these areas.
Update: There appears to be a bug in bogofilter 0.7 which causes it to incorrectly calculate which 15 words are the most significant. Fixing this bug does not appear to change the results of filtering SPAM.CONTROL and ALL-SPAM by more than a few percentage points.