<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Random Hacks: Tag Spam</title>
    <link>http://www.randomhacks.net/articles/tag/Spam?tag=Spam</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Technology and Other Fun Stuff</description>
    <item>
      <title>Smart classification using Bayesian monads in Haskell</title>
      <description>&lt;p&gt;&lt;small&gt;(Refactoring Probability Distributions: &lt;a href="http://www.randomhacks.net/articles/2007/02/21/refactoring-probability-distributions"&gt;part 1&lt;/a&gt;, &lt;a href="http://www.randomhacks.net/articles/2007/02/21/randomly-sampled-distributions"&gt;part 2&lt;/a&gt;,
&lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;part 3&lt;/a&gt;, &lt;b&gt;part 4&lt;/b&gt;)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The world is full of messy classification problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;Is this order fraudulent?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;Is this e-mail spam?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;What blog posts would Rachel find interesting?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;Which intranet documents is Sam looking for?&amp;#8221;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each case, we want to classify something: Orders are either valid or
fraudulent, messages are either spam or non-spam, blog posts are either
interesting or boring.  Unfortunately, most software is &lt;i&gt;terrible&lt;/i&gt; at
making these distinctions.  For example, why can&amp;#8217;t my RSS reader go out and
track down the 10 most interesting blog posts every day?&lt;/p&gt;

&lt;p&gt;Some software, however, &lt;i&gt;can&lt;/i&gt; make these distinctions.
Google figures out when I want to watch a movie, and shows me &lt;a href="http://www.google.com/search?q=cinema+boston"&gt;specialized
search results&lt;/a&gt;.  And most e-mail clients can identify spam with over
99% accuracy.  But the vast majority of software is dumb, incapable of
dealing with the messy dilemmas posed by the real world.&lt;/p&gt;

&lt;p&gt;So where can we learn to improve our software?&lt;/p&gt;

&lt;p&gt;Outside of Google&amp;#8217;s shroud
of secrecy, the most successful classifiers are spam filters.  And most modern
spam filters are inspired by Paul Graham&amp;#8217;s essay &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan for Spam&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So let&amp;#8217;s go back to the source, and see what we can learn.  As it turns out, we can formulate a lot of the ideas in &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan
for Spam&lt;/a&gt; in a straightforward fashion using a &lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;Bayesian
monad&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Functions from distributions to distributions&lt;/h3&gt;

&lt;p&gt;Let&amp;#8217;s begin with spam filtering.  By convention, we divide messages into
&amp;#8220;spam&amp;#8221; and &amp;#8220;ham&amp;#8221;, where &amp;#8220;ham&amp;#8221; is the stuff we want to read.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='keyword'&gt;data&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='keyglyph'&gt;|&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt;
  &lt;span class='keyword'&gt;deriving&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='conid'&gt;Show&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Eq&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Enum&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Bounded&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let&amp;#8217;s assume that we&amp;#8217;ve just received a new e-mail.  Without even looking
at it, we know there&amp;#8217;s a certain chance that it&amp;#8217;s spam.  This gives us
something called a &amp;#8220;prior distribution&amp;#8221; over &lt;code&gt;MsgType&lt;/code&gt;.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;64.2&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;35.8&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But what if we know that the first word of the message is &amp;#8220;free&amp;#8221;?  We can
use that information to calculate a new distribution.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='str'&gt;"free"&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;90.5&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;9.5&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The function &lt;code&gt;hasWord&lt;/code&gt; takes a string and a probability
distribution, and uses them to calculate a new probability distribution:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='keyglyph'&gt;::&lt;/span&gt; &lt;span class='conid'&gt;String&lt;/span&gt; &lt;span class='keyglyph'&gt;-&amp;gt;&lt;/span&gt; &lt;span class='conid'&gt;FDist'&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt; &lt;span class='keyglyph'&gt;-&amp;gt;&lt;/span&gt;
           &lt;span class='conid'&gt;FDist'&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt;
&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='varid'&gt;word&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='keyword'&gt;do&lt;/span&gt;
  &lt;span class='varid'&gt;msgType&lt;/span&gt; &lt;span class='keyglyph'&gt;&amp;lt;-&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;
  &lt;span class='varid'&gt;wordPresent&lt;/span&gt; &lt;span class='keyglyph'&gt;&amp;lt;-&lt;/span&gt;
    &lt;span class='varid'&gt;wordPresentDist&lt;/span&gt; &lt;span class='varid'&gt;msgType&lt;/span&gt; &lt;span class='varid'&gt;word&lt;/span&gt;
  &lt;span class='varid'&gt;condition&lt;/span&gt; &lt;span class='varid'&gt;wordPresent&lt;/span&gt;
  &lt;span class='varid'&gt;return&lt;/span&gt; &lt;span class='varid'&gt;msgType&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This code is based on the Bayesian monad from &lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;part 3&lt;/a&gt;.  As before,
the &amp;#8220;&lt;code&gt;&amp;lt;-&lt;/code&gt;&amp;#8221; operator selects a single item from a probability
distribution, and &lt;code&gt;condition&lt;/code&gt; asserts that an expression is true.  The
actual Bayesian inference happens behind the scenes (handy, that).&lt;/p&gt;
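
&lt;p&gt;To make &amp;#8220;behind the scenes&amp;#8221; a little more concrete, here is a minimal, self-contained sketch of how such a monad might work.  It is not the code from part 3: &lt;code&gt;Dist&lt;/code&gt; and &lt;code&gt;weighted&lt;/code&gt; are hypothetical stand-ins for the real &lt;code&gt;FDist'&lt;/code&gt; machinery, &lt;code&gt;bayes&lt;/code&gt; returns plain pairs instead of &lt;code&gt;Perhaps&lt;/code&gt; values, and the word frequencies are invented for illustration.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;-- Hypothetical stand-in for FDist': weighted outcomes, where Nothing marks
-- a branch that a failed condition has ruled out.
newtype Dist a = Dist { runDist :: [(Maybe a, Double)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [(fmap f x, p) | (x, p) &amp;lt;- xs]

instance Applicative Dist where
  pure x = Dist [(Just x, 1)]
  df &amp;lt;*&amp;gt; dx = df &amp;gt;&amp;gt;= \f -&amp;gt; fmap f dx

instance Monad Dist where
  Dist xs &amp;gt;&amp;gt;= f = Dist (concatMap step xs)
    where
      -- A ruled-out branch stays ruled out and keeps its weight.
      step (Nothing, p) = [(Nothing, p)]
      -- A live branch multiplies its weight into each continuation.
      step (Just x, p)  = [(y, p * q) | (y, q) &amp;lt;- runDist (f x)]

-- Build a distribution from (outcome, probability) pairs.
weighted :: [(a, Double)] -&amp;gt; Dist a
weighted xs = Dist [(Just x, p) | (x, p) &amp;lt;- xs]

-- Keep the current branch only if the observation holds.
condition :: Bool -&amp;gt; Dist ()
condition ok = Dist [(if ok then Just () else Nothing, 1)]

-- Discard ruled-out branches, merge equal outcomes, and renormalize.
bayes :: Eq a =&amp;gt; Dist a -&amp;gt; [(a, Double)]
bayes (Dist xs) = [(x, p / total) | (x, p) &amp;lt;- foldr merge [] kept]
  where
    kept  = [(x, p) | (Just x, p) &amp;lt;- xs]
    total = sum (map snd kept)
    merge (x, p) acc = case lookup x acc of
      Just q  -&amp;gt; (x, p + q) : filter ((/= x) . fst) acc
      Nothing -&amp;gt; (x, p) : acc

data MsgType = Spam | Ham deriving (Show, Eq)

-- The prior is the one shown above; the word frequencies are invented.
msgTypePrior :: Dist MsgType
msgTypePrior = weighted [(Spam, 0.642), (Ham, 0.358)]

wordPresentDist :: MsgType -&amp;gt; String -&amp;gt; Dist Bool
wordPresentDist Spam "free" = weighted [(True, 0.25), (False, 0.75)]
wordPresentDist Ham  "free" = weighted [(True, 0.05), (False, 0.95)]
wordPresentDist _    _      = weighted [(True, 0.5), (False, 0.5)]

hasWord :: String -&amp;gt; Dist MsgType -&amp;gt; Dist MsgType
hasWord word prior = do
  msgType     &amp;lt;- prior
  wordPresent &amp;lt;- wordPresentDist msgType word
  condition wordPresent
  return msgType

main :: IO ()
main = print (bayes (hasWord "free" msgTypePrior))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this sketch, &lt;code&gt;condition&lt;/code&gt; only marks branches as impossible.  The inference itself is the renormalization in &lt;code&gt;bayes&lt;/code&gt;, which divides each surviving weight by the total surviving weight.  Because the word frequencies are made up, the printed numbers will not match the output shown above.&lt;/p&gt;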

&lt;p&gt;If we have multiple pieces of evidence, we can apply them one at a time.
Each piece of evidence will update the probability distribution produced by
the previous step:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;     &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;
&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;w&lt;/span&gt;&lt;span class='conop'&gt;:&lt;/span&gt;&lt;span class='varid'&gt;ws&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='keyword'&gt;do&lt;/span&gt;
  &lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='varid'&gt;w&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='varid'&gt;ws&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The final distribution will combine everything we know:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='str'&gt;"free"&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt;&lt;span class='str'&gt;"bayes"&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;34.7&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;65.3&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This technique is known as the &lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;naive Bayes classifier&lt;/a&gt;.  Looked at from the right angle, it&amp;#8217;s surprisingly simple.&lt;/p&gt;

&lt;p&gt;(Of course, the naive Bayes classifier assumes that all of our evidence is independent.  In theory, this is a pretty big assumption. In practice, it &lt;a href="http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf"&gt;works better than you might think&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;But this still leaves us with a lot of questions: How do we keep track of
our different classifiers?  How do we decide which ones to apply?  And do
we need to fudge the numbers to get reasonable results?&lt;/p&gt;

&lt;p&gt;In the following sections, I&amp;#8217;ll walk through various aspects of Paul
Graham&amp;#8217;s &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan for Spam&lt;/a&gt;, and show how to generalize it.  If you
want to follow along, you can download the code using Darcs:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_sh "&gt;darcs get http://www.randomhacks.net/darcs/probability&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;a href="http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell"&gt;Read More&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 03 Mar 2007 09:02:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:46da7ba0-d82a-40a5-bdac-d730e4f7073b</guid>
      <author>Eric Kidd</author>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell</link>
      <category>Haskell</category>
      <category>Math</category>
      <category>Monads</category>
      <category>Probability</category>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/322</trackback:ping>
    </item>
    <item>
      <title>Fromberger spam filtering paper</title>
      <description>    &lt;p&gt;Michael Fromberger has
  written a &lt;a href='http://thayer.dartmouth.edu/~sting/sw/perl/bayes-spam.pdf'&gt;nice
  formal analysis&lt;/a&gt; (PDF) of Paul Graham's &lt;a href='http://www.paulgraham.com/spam.html'&gt;Plan for Spam&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Mon, 30 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:d75e1da2-a676-4b8d-89d3-4754b254c925</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/30/fromberger-spam-filtering-paper</link>
      <category>Spam</category>
      <category>Probability</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/40</trackback:ping>
    </item>
    <item>
      <title>Bayesian Whitelisting: Finding the Good Mail Among the Spam</title>
      <description>    &lt;p&gt;The biggest challenge with spam filtering is reducing false
    positives--that is, finding the good mail among the spam.  Even the
    best spam filters occasionally mistake legitimate e-mail for spam.  For
    example, in some &lt;a href='/stories/2002/09/22/trainable-spam-filter-testing' title='How To Test a Trainable Spam Filter'&gt;recent
    tests&lt;/a&gt;, &lt;a href='http://bogofilter.sourceforge.net/'&gt;&lt;code&gt;bogofilter&lt;/code&gt;&lt;/a&gt;
    processed 18,000 e-mails with only 34 false positives.  Unfortunately,
    several of these false positives were urgent e-mails from former
    clients.  This unpleasant mistake wasn't necessary--the most important
    of these false positives could have been avoided with an automatic
    whitelisting system.&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.randomhacks.net/articles/2002/09/29/bayesian-whitelisting"&gt;Read More&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Sun, 29 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:4e3b83e6-1f2a-48d4-9f65-5e691ab45838</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/29/bayesian-whitelisting</link>
      <category>Spam</category>
      <category>Hacks</category>
      <category>Python</category>
      <category>Recommended</category>
      <category>Probability</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/38</trackback:ping>
    </item>
    <item>
      <title>FTC Spam Archive</title>
      <description>    &lt;p&gt;The FTC appears to have a &lt;i&gt;huge&lt;/i&gt; &lt;a href='http://slashdot.org/comments.pl?sid=39453&amp;amp;cid=4212588'&gt;spam
  database&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Mon, 23 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:454a3d5f-235a-4b05-979b-1ff768fffda3</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/23/ftc-spam-archive</link>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/33</trackback:ping>
    </item>
    <item>
      <title>Using Bogofilter with Spam Assassin</title>
      <description>    &lt;p&gt;For deadly-accurate spam filtering, combine a &lt;a href='/stories/2002/09/22/trainable-spam-filter-testing' title='How To Test a Trainable Spam Filter'&gt;well-trained bogofilter&lt;/a&gt; with
    &lt;a href='/stories/2002/08/06/spam-assassin-intro' title='SpamAssassin: A Decent Spam Filter'&gt;SpamAssassin&lt;/a&gt;.  Here's how.&lt;/p&gt;

    &lt;p&gt;Add the following lines to your procmailrc file, before you run
    SpamAssassin:&lt;/p&gt;

    &lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;# If bogofilter classifies the message as spam (exit code 0) ...
:0HB
* ? bogofilter
{
    # ... add a header that SpamAssassin can score.
    :0fw
    | formail -I &amp;quot;X-Spam-Bogofilter: yes&amp;quot;
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

    &lt;p&gt;Add the following lines to your &lt;code&gt;/etc/spamassassin/local.cf&lt;/code&gt;
    file:&lt;/p&gt;

    &lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;# Add 5 points to any message that bogofilter has tagged as spam.
header    BOGOFILTER  X-Spam-Bogofilter =~ /yes/
describe  BOGOFILTER  Message has too many bogons.
score     BOGOFILTER  5.0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

    &lt;p&gt;Presto!  This plugs almost all the holes in SpamAssassin's defense,
    and uses SpamAssassin's auto-whitelist (you've got it turned on,
    right?) to protect against false positives.&lt;/p&gt;</description>
      <pubDate>Mon, 23 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:30034674-d766-41c5-a628-576181cbe661</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/23/using-bogofilter-with-spam-assassin</link>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/35</trackback:ping>
    </item>
    <item>
      <title>Machine Learning Links</title>
      <description>    &lt;p&gt;Useful sites about machine-learning algorithms, for developers of
    spam filters: &lt;a href='http://kiew.cs.uni-dortmund.de:8001/mlnet/'&gt;Machine Learning
    Network&lt;/a&gt;, the &lt;a href='http://www.cs.cmu.edu/~mccallum/bow/'&gt;Bow&lt;/a&gt;
    toolkit, &lt;a href='http://lsa.colorado.edu/whatis.html'&gt;Latent Semantic
    Analysis&lt;/a&gt; (used by Apple's mail client), &lt;a href='http://elib.cs.berkeley.edu/papers/clustering/bayesian/'&gt;Bayesian
    Latent Semantic Analysis&lt;/a&gt;, &lt;a href='http://www2.parc.com/istl/projects/ia/sg-clustering.html'&gt;text
    clustering&lt;/a&gt;, &lt;a href='http://dewey.yonsei.ac.kr/memexlee/links/clustering.htm'&gt;more
    text clustering&lt;/a&gt;, &lt;a href='http://www.cis.ohio-state.edu/~srini/papers/TM01.pdf'&gt;Using
    Clustering to Boost Text Classification&lt;/a&gt; (PDF) and &lt;a href='http://www.cs.helsinki.fi/group/dime/lado/s01/exerc/samples.txt'&gt;TFIDF
    notes&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;I wouldn't be
    entirely surprised if neural networks worked well here, either--the
    problem has that "figure out where to draw the boundaries between
    clusters" aspect that maps nicely onto the math of neural
    networks.&lt;/p&gt;</description>
      <pubDate>Mon, 23 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:0d04c6ce-a4b4-4b42-b400-b23f2c196777</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/23/machine-learning-links</link>
      <category>Spam</category>
      <category>AI</category>
      <category>Probability</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/34</trackback:ping>
    </item>
    <item>
      <title>How To Test a Trainable Spam Filter</title>
      <description>    &lt;p&gt;Ever since Paul Graham published &lt;a href='http://www.paulgraham.com/spam.html'&gt;A Plan for Spam&lt;/a&gt;,
    "trainable" spam filters have become the latest fashion.  These filters
    train themselves to know the characteristics of your personal e-mail.
    Supposedly, this extra knowledge allows them to make fewer mistakes,
    and makes them harder to fool.  But do these filters actually work?  In
    this article, I try out Eric Raymond's &lt;a href='/stories/2002/09/13/bogofilter' title='Bogofilter: A New Spam Filter'&gt;bogofilter&lt;/a&gt;, a trainable Bayesian spam filter,
    and describe the steps required to evaluate such a filter
    accurately.&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.randomhacks.net/articles/2002/09/22/trainable-spam-filter-testing"&gt;Read More&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Sun, 22 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:c227afc7-6562-4fc5-8b77-7624df1aed2e</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/22/trainable-spam-filter-testing</link>
      <category>Spam</category>
      <category>Hacks</category>
      <category>Recommended</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/31</trackback:ping>
    </item>
    <item>
      <title>Weekend Spam Update</title>
      <description>    &lt;p&gt;Between midnight Friday and 11:00am Monday, I received over 160
    spams.  SpamAssassin stopped all but five of them.  SpamAssassin
    misidentified 2 legitimate messages as spam; both were unimportant
    mailing list messages from a user whose site is frequently used to send
    spam.  (If I needed to correspond with this user on a regular basis, I'd
    add his name to my whitelist--or help educate him.)&lt;/p&gt;

    &lt;p&gt;If you don't need a public e-mail address, let me suggest a new
    rule: &lt;i&gt;Never&lt;/i&gt; give your real e-mail address to anybody you don't
    know.  This includes online vendors.  If necessary, use a throwaway
    webmail address instead.&lt;/p&gt;

    &lt;p&gt;I also devoted quite a bit of work to repacking &lt;a href='http://sourceforge.net/projects/judy/'&gt;libJudy&lt;/a&gt;, HP's
    ultra-optimized associative array library.  This library is used by
    &lt;a href='/stories/2002/09/13/bogofilter' title='Bogofilter: A New Spam Filter'&gt;bogofilter&lt;/a&gt;, Eric Raymond's promising new
    spam filter.&lt;/p&gt;

    &lt;p&gt;Content-based spam filtering is extremely good, and is improving
    rapidly.  Just don't send me any e-mail about &lt;b&gt;hot stock picks&lt;/b&gt;
    involving &lt;b&gt;real estate&lt;/b&gt; companies in &lt;b&gt;Nigeria&lt;/b&gt; that
    specialize in &lt;b&gt;toner cartridge&lt;/b&gt; factories, and there shouldn't be
    any problem.&lt;/p&gt;</description>
      <pubDate>Mon, 16 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:20f5a224-7170-4fbd-96c0-34da961ff051</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/16/weekend-spam-update</link>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/27</trackback:ping>
    </item>
    <item>
      <title>Bogofilter: A New Spam Filter</title>
      <description>    &lt;p&gt;According to &lt;a href='http://lwn.net/Articles/9185/'&gt;Linux Weekly
    News&lt;/a&gt;, Eric Raymond is writing a new spam filter called &lt;a href='http://www.tuxedo.org/~esr/bogofilter/'&gt;&lt;code&gt;bogofilter&lt;/code&gt;&lt;/a&gt;
    based on &lt;a href='http://mathworld.wolfram.com/BayesianAnalysis.html'&gt;Bayesian
    analysis&lt;/a&gt;, as &lt;a href='http://www.paulgraham.com/spam.html'&gt;suggested&lt;/a&gt; by Paul
    Graham.  Unlike the excellent &lt;a href='/stories/2002/08/06/spam-assassin-intro' title='SpamAssassin: A Decent Spam Filter'&gt;SpamAssassin&lt;/a&gt;, which merely requires
    whitelisting a small number of addresses, &lt;code&gt;bogofilter&lt;/code&gt; requires
    training with around 1,000 e-mail messages.  But &lt;code&gt;bogofilter&lt;/code&gt; may
    ultimately offer more hope for defeating spam.&lt;/p&gt;

    &lt;p&gt;Once trained, &lt;code&gt;bogofilter&lt;/code&gt; recognizes most incoming spam
    (allegedly as much as SpamAssassin, but we'll have to wait and see).
    More importantly, however, &lt;code&gt;bogofilter&lt;/code&gt; is very good at &lt;i&gt;not&lt;/i&gt;
    recognizing legitimate e-mail as spam (in other words, it has a very
    low false positive rate).&lt;/p&gt;

    &lt;p&gt;The secret strength of &lt;code&gt;bogofilter&lt;/code&gt;, however, is the training
    process.  Because bogofilter is trained by the user, each user gets a
    personalized spam filter.  This means that (1) information of
    professional interest to the reader will generally be recognized as
    non-spam (however incriminating it might otherwise look), and (2) there
    won't be a &lt;a href='http://spamassassin.org/tests.html'&gt;centralized
    list of rules&lt;/a&gt; for the spammer to read.&lt;/p&gt;

    &lt;p&gt;I suspect that the new &lt;a href='http://www.apple.com/macosx/jaguar/mail.html'&gt;MacOS X 10.2 mail
    client&lt;/a&gt; may be using a similar technique.&lt;/p&gt;</description>
      <pubDate>Fri, 13 Sep 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:826ec767-f2d8-4171-a27e-cc245718888d</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/09/13/bogofilter</link>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/26</trackback:ping>
    </item>
    <item>
      <title>SpamAssassin: A Decent Spam Filter</title>
      <description>    &lt;p&gt;&lt;a href='http://spamassassin.taint.org/'&gt;SpamAssassin&lt;/a&gt; is a
    highly accurate open source spam filter.&lt;/p&gt;

    &lt;p&gt;There are two major components to the SpamAssassin filtering system:
    a set of rules which match various properties of an e-mail (e.g.,
    whether it mentions stock alerts or Nigerian banks), and a set of
    weights for each rule.  The weights are assigned automatically, by
    analyzing various real-world mail spools.  So SpamAssassin is
    essentially an adaptive system--the rules are periodically
    recalibrated, and whether a given property is good or bad may change
    over time.&lt;/p&gt;
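
    &lt;p&gt;The rules-plus-weights idea can be sketched in a few lines of Haskell.
    This is only an illustration, not SpamAssassin's actual code: the rule
    names, the hand-picked weights, and the threshold passed to
    &lt;code&gt;isSpam&lt;/code&gt; are all invented, whereas the real weights are
    assigned automatically.&lt;/p&gt;

    &lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;import Data.Char (toLower)
import Data.List (isInfixOf)

-- A rule: a name, a weight, and a test on the message body.
data Rule = Rule
  { ruleName :: String
  , weight   :: Double
  , matches  :: String -&amp;gt; Bool
  }

-- Case-insensitive substring test (the phrase must be lowercase).
mentions :: String -&amp;gt; String -&amp;gt; Bool
mentions phrase body = phrase `isInfixOf` map toLower body

-- Invented rules and weights, for illustration only.
rules :: [Rule]
rules =
  [ Rule "STOCK_ALERT"   2.5 (mentions "stock alert")
  , Rule "NIGERIAN_BANK" 3.0 (mentions "nigerian bank")
  ]

-- A message's score is the sum of the weights of every rule it matches.
score :: String -&amp;gt; Double
score body = sum [weight r | r &amp;lt;- rules, matches r body]

-- Flag the message once its score reaches the threshold.
isSpam :: Double -&amp;gt; String -&amp;gt; Bool
isSpam threshold body = score body &amp;gt;= threshold

main :: IO ()
main = do
  let body = "URGENT stock alert from a Nigerian bank"
  print (score body)
  print (isSpam 5.0 body)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;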

    &lt;p&gt;SpamAssassin also includes an "auto whitelist", which supposedly
    learns to recognize your most frequent correspondents.&lt;/p&gt;

    &lt;p&gt;There are probably some chewy ideas in here for an evolutionary
    biologist--spam filtering involves an arms race between the spammers
    and the mail administrators of the world, and the most advanced spam
    filters are beginning to resemble immune systems.&lt;/p&gt;

    &lt;p&gt;(If you're a Debian user, type &lt;code&gt;apt-get install spamassassin spamc
    libnet-dns-perl razor&lt;/code&gt; and take a look at the &lt;a href='http://spamassassin.org/dist/README'&gt;setup instructions&lt;/a&gt;.  If
    you want to use &lt;code&gt;spamd&lt;/code&gt;, try using the &lt;code&gt;--max-children 10&lt;/code&gt;
    argument; it will save you a lot of grief.)&lt;/p&gt;</description>
      <pubDate>Tue, 06 Aug 2002 00:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:5bb07b36-c839-404b-8be0-5fc19c3c1513</guid>
      <author>Eric</author>
      <link>http://www.randomhacks.net/articles/2002/08/06/spam-assassin-intro</link>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/19</trackback:ping>
    </item>
  </channel>
</rss>
