Posted by Eric Kidd
Sat, 03 Mar 2007 09:02:00 GMT
(Refactoring Probability Distributions: part 1, part 2,
part 3, part 4)
The world is full of messy classification problems:
- “Is this order fraudulent?”
- “It this e-mail a spam?”
- “What blog posts would Rachel find interesting?”
- “Which intranet documents is Sam looking for?”
In each case, we want to classify something: Orders are either valid or
fraudulent, messages are either spam or non-spam, blog posts are either
interesting or boring. Unfortunately, most software is terrible at
making these distinctions. For example, why can’t my RSS reader go out and
track down the 10 most interesting blog posts every day?
Some software, however, can make these distinctions.
Google figures out when I want to watch a movie, and shows me specialized
search results. And most e-mail clients can identify spam with over
99% accuracy. But the vast majority of software is dumb, incapable of
dealing with the messy dilemmas posed by the real world.
So where can we learn to improve our software?
Outside of Google’s shroud
of secrecy, the most successful classifiers are spam filters. And most modern
spam filters are inspired by Paul Graham’s essay A Plan for Spam.
So let’s go back to the source, and see what we can learn. As it turns out, we can formulate a lot of the ideas in A Plan
for Spam in a straightforward fashion using a Bayesian
monad.
Functions from distributions to distributions
Let’s begin with spam filtering. By convention, we divide messages into
“spam” and “ham”, where “ham” is the stuff we want to read.
data MsgType = Spam | Ham
deriving (Show, Eq, Enum, Bounded)
Let’s assume that we’ve just received a new e-mail. Without even looking
at it, we know there’s a certain chance that it’s a spam. This gives us
something called a “prior distribution” over MsgType.
> bayes msgTypePrior
[Perhaps Spam 64.2%, Perhaps Ham 35.8%]
But what if we know that the first word of the message is “free”? We can
use that information to calculate a new distribution.
> bayes (hasWord "free" msgTypePrior)
[Perhaps Spam 90.5%, Perhaps Ham 9.5%]
The function hasWord takes a string and a probability
distribution, and uses them to calculate a new probability distribution:
hasWord :: String -> FDist' MsgType ->
FDist' MsgType
hasWord word prior = do
msgType <- prior
wordPresent <-
wordPresentDist msgType word
condition wordPresent
return msgType
This code is based on the Bayesian monad from part 3. As before,
the “<-” operator selects a single item from a probability
distribution, and “condition” asserts that an expression is true. The
actual Bayesian inference happens behind the scenes (handy, that).
If we have multiple pieces of evidence, we can apply them one at a time.
Each piece of evidence will update the probability distribution produced by
the previous step:
hasWords [] prior = prior
hasWords (w:ws) prior = do
hasWord w (hasWords ws prior)
The final distribution will combine everything we know:
> bayes (hasWords ["free","bayes"] msgTypePrior)
[Perhaps Spam 34.7%, Perhaps Ham 65.3%]
This technique is known as the naive Bayes classifier. Looked at from the right angle, it’s surprisingly simple.
(Of course, the naive Bayes classifier assumes that all of our evidence is independent. In theory, this is a pretty big assumption. In practice, it works better than you might think.)
But this still leaves us with a lot of questions: How do we keep track of
our different classifiers? How do we decide which ones to apply? And do
we need to fudge the numbers to get reasonable results?
In the following sections, I’ll walk through various aspects of Paul
Graham’s A Plan for Spam, and show how to generalize it. If you
want to follow along, you can download the code using Darcs:
darcs get http://www.randomhacks.net/darcs/probability
Read more...
Tags Haskell, Math, Monads, Probability, Spam
Posted by Eric
Mon, 30 Sep 2002 00:00:00 GMT
Michael Fromberger has
written a nice
formal analysis (PDF) of Paul Graham's Plan for Spam.
Tags Probability, Spam
Posted by Eric
Sun, 29 Sep 2002 00:00:00 GMT
The biggest challenge with spam filtering is reducing false
positives--that is, finding the good mail among the spam. Even the
best spam filters occasionally mistake legitimate e-mail for spam. For
example, in some recent
tests, bogofilter
processed 18,000 e-mails with only 34 false positives. Unfortunately,
several of these false positives were urgent e-mails from former
clients. This unpleasant mistake wasn't necessary--the most important
of these false positives could have been avoided with an automatic
whitelisting system.
Read more...
Tags Hacks, Probability, Python, Recommended, Spam
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
Useful sites about machine-learning algorithms, for developers of
spam filters: Machine Learning
Network, the Bow
toolkit, Latent Semantic
Analysis (used by Apple's mail client), Bayesian
Latent Semantic Analysis, text
clustering, more
text clustering, Using
Clustering to Boost Text Classification (PDF) and TFIDF
notes.
I wouldn't be
entirely surpised if neural networks worked well here, either--the
problem has that "figure out where to draw the boundaries between
clusters" aspect that maps nicely onto the math of neural
networks.
Tags AI, Probability, Spam
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
The FTC appears to have a huge spam
database.
Tags Spam
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
For deadly-accurate spam filtering, combine a well-trained bogofilter with
SpamAssassin. Here's how.
Add the following lines to your procmailrc file, before you run
SpamAssassin:
:0HB
* ? bogofilter
{
:0fw
| formail -I "X-Spam-Bogofilter: yes"
}
Add the following lines to your /etc/spamassassin/local.cf
file:
header BOGOFILTER X-Spam-Bogofilter =~ /yes/
describe BOGOFILTER Message has too many bogons.
score BOGOFILTER 5.0
Presto! This plugs almost all the holes in SpamAssassin's defense,
and uses SpamAssassin's auto-whitelist (you've got it turned on,
right?) to protect against false positives.
Tags Spam
Posted by Eric
Sun, 22 Sep 2002 00:00:00 GMT
Ever since Paul Graham published A Plan for Spam,
"trainable" spam filters have become the latest fashion. These filters
train themselves to know the characteristics of your personal e-mail.
Supposedly, this extra knowledge allows them to make fewer mistakes,
and makes them harder to fool. But do these filters actually work? In
this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter,
and describe the steps required to evaluate such a filter
accurately.
Read more...
Tags Hacks, Recommended, Spam
Posted by Eric
Mon, 16 Sep 2002 00:00:00 GMT
Between midnight Friday and 11:00am Monday, I received over 160
spams. SpamAssassin stopped all but five of them. SpamAssassin
misidentified 2 legitimate messages as spam; both were unimportant
mailing list messages from a user whose site is frequently used to send
spam. (If I needed to correspond with this user on a regular basis, I'd
add his name to my whitelist--or help educate him.)
If you don't need a public e-mail address, let me suggest a new
rule: Never give your real e-mail address to anybody you don't
know. This includes online vendors. If necessary, use a throwaway
webmail address instead.
I also devoted quite a bit of work to repacking libJudy, HP's
ultra-optimized associative array library. This library is used by
bogofilter, Eric Raymond's promising new
spam filter.
Content-based spam filtering is extremely good, and is improving
rapidly. Just don't send me any e-mail about hot stock picks
involving real estate companies in Nigeria that
specialize in toner cartridge factories, and there shouldn't be
any problem.
Tags Spam
Posted by Eric
Fri, 13 Sep 2002 00:00:00 GMT
According to Linux Weekly
News, Eric Raymond is writing a new spam filter called bogofilter
based on Bayesian
analysis, as suggested by Paul
Graham. Unlike the excellent SpamAssasin, which merely requires
whitelisting a small number of addresses, bogofilter requires
training with around 1,000 e-mail messages. But bogofilter may
ultimately offer more hope for defeating spam.
Once trained, bogofilter recognizes most incoming spam
(allegedly as much as SpamAssassin, but we'll have to wait and see).
More importantly, however, bogofilter is very good at not
recognizing legitimate e-mail as spam (in other words, it has a very
low false positive rate).
The secret strength of bogofilter, however, is the training
process. Because bogofilter is trained by the user, each user gets a
personalized spam filter. This means that (1) information of
professional interest to the reader will generally be recognized as
non-spam (however incriminating it might otherwise look), and (2) there
won't be a centralized
list of rules for the spammer to read.
I suspect that the new MacOS X 10.2 mail
client may be using a similar technique.
Tags Spam
Posted by Eric
Tue, 06 Aug 2002 00:00:00 GMT
SpamAssassin is a
highly accurate open source spam filter.
There are two major components to the SpamAssassin filtering system:
a set of rules which match various properties of an e-mail (e.g.,
whether it mentiones stock alerts or Nigerian banks), and a set of
weights for each rule. The weights are assigned automatically, by
analyzing various real-world mail spools. So SpamAssassin is
essentially an adaptive system--the rules are periodically
recalibrated, and whether a given property is good or bad may change
over time.
SpamAssassin also includes an "auto whitelist", which supposedly
learns to recognize your most frequent correspondents.
There're probably some chewy ideas in here for an evolutionary
biologist--spam filtering involves an arms race between the spammers
and the mail administrators of the world, and the most advanced spam
filters are beginning to resemble immune systems.
(If you're a Debian user, type apt-get install spamassassin spamc
libnet-dns-perl razor and take a look at the setup instructions. If
you want to use spamd, try using the --max-children 10
argument; it will save you a lot of grief.)
Tags Spam