<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Random Hacks: Smart classification using Bayesian monads in Haskell</title>
    <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Technology and Other Fun Stuff</description>
    <item>
      <title>"Smart classification using Bayesian monads in Haskell" by Robin Debreuil</title>
      <description>&lt;p&gt;Great article(s)!&lt;/p&gt;


	&lt;p&gt;On the issue of unseen words being given weight zero (plus a fudge): shouldn&amp;#8217;t all words start off with some probability based on the probability for all known words, and then be adjusted as evidence is gathered? That seems more realistic than a fudge, especially given the Bayesian context.&lt;/p&gt;


	&lt;p&gt;Thanks for taking the time to post all this, and the code : )&lt;/p&gt;</description>
      <pubDate>Thu, 12 Apr 2007 23:01:35 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:249e9ba2-97a2-4ca8-9dbc-4e24c344b8bc</guid>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell#comment-389</link>
    </item>
    <item>
      <title>"Smart classification using Bayesian monads in Haskell" by Eric</title>
      <description>&lt;p&gt;Aaron: Good point!&lt;/p&gt;


	&lt;p&gt;But if the evidence is sparse enough, you might be looking at values of &lt;i&gt;x&lt;/i&gt; and &lt;i&gt;y&lt;/i&gt; &amp;lt; 5. In that case, you might want to add a smaller value: 0.1 or something like that.&lt;/p&gt;


	&lt;p&gt;Of course, this is closely related to the problem of &amp;#8220;overfitting&amp;#8221;.  For an elegant approach in Haskell, see &lt;a href="http://www.csse.monash.edu.au/~lloyd/tildeFP/2003ACSC/" rel="nofollow"&gt;Types and Classes of Machine Learning and Data Mining&lt;/a&gt;.&lt;/p&gt;</description>
      <pubDate>Fri, 09 Mar 2007 11:37:08 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:d1b1dc1e-8a6d-424c-adcb-851e820f25ff</guid>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell#comment-346</link>
    </item>
    <item>
      <title>"Smart classification using Bayesian monads in Haskell" by Aaron Denney</title>
      <description>&lt;p&gt;The usual Bayes-like method of inferring probabilities where all the evidence is one way is to treat (x,y) as (x+1, y+1)&amp;#8212;this essentially builds in the knowledge that either x or y can happen, by pretending that it already did.&lt;/p&gt;</description>
      <pubDate>Fri, 09 Mar 2007 02:47:22 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:9c530427-6a97-487f-a19a-742e4ba13030</guid>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell#comment-343</link>
    </item>
    <item>
      <title>"Smart classification using Bayesian monads in Haskell" by Eric</title>
      <description>&lt;p&gt;Thank you for the kind words!&lt;/p&gt;


	&lt;p&gt;The code above uses the namespace &lt;code&gt;M&lt;/code&gt;, which is declared in the full Darcs version as:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='keyword'&gt;import&lt;/span&gt; &lt;span class='varid'&gt;qualified&lt;/span&gt; &lt;span class='conid'&gt;Data&lt;/span&gt;&lt;span class='varop'&gt;.&lt;/span&gt;&lt;span class='conid'&gt;Map&lt;/span&gt; &lt;span class='keyword'&gt;as&lt;/span&gt; &lt;span class='conid'&gt;M&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;As far as I know, this is the preferred replacement for the old Data.FiniteMap.&lt;/p&gt;


	&lt;p&gt;Is this the Map type you were referring to, or is there another I should be using instead?&lt;/p&gt;


	&lt;p&gt;Thank you for the feedback!&lt;/p&gt;</description>
      <pubDate>Sun, 04 Mar 2007 23:05:26 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:d43e1471-5ba8-4820-bafc-f885998e46ae</guid>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell#comment-324</link>
    </item>
    <item>
      <title>"Smart classification using Bayesian monads in Haskell" by Fred Ross</title>
      <description>&lt;p&gt;Lovely!  I&amp;#8217;ve been really enjoying the probability stuff you and Dan have been churning out.&lt;/p&gt;


	&lt;p&gt;One coding quibble: why don&amp;#8217;t you use Map from the Haskell standard library to hold your pairs of words and their associated factors?&lt;/p&gt;</description>
      <pubDate>Sun, 04 Mar 2007 22:16:36 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:d6d3289a-9935-4162-9e84-e8397179791c</guid>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell#comment-323</link>
    </item>
    <item>
      <title>Smart classification using Bayesian monads in Haskell</title>
      <description>&lt;p&gt;&lt;small&gt;(Refactoring Probability Distributions: &lt;a href="http://www.randomhacks.net/articles/2007/02/21/refactoring-probability-distributions"&gt;part 1&lt;/a&gt;, &lt;a href="http://www.randomhacks.net/articles/2007/02/21/randomly-sampled-distributions"&gt;part 2&lt;/a&gt;,
&lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;part 3&lt;/a&gt;, &lt;b&gt;part 4&lt;/b&gt;)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The world is full of messy classification problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;Is this order fraudulent?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;Is this e-mail spam?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;What blog posts would Rachel find interesting?&amp;#8221;&lt;/li&gt;
&lt;li&gt;&amp;#8220;Which intranet documents is Sam looking for?&amp;#8221;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each case, we want to classify something: Orders are either valid or
fraudulent, messages are either spam or non-spam, blog posts are either
interesting or boring.  Unfortunately, most software is &lt;i&gt;terrible&lt;/i&gt; at
making these distinctions.  For example, why can&amp;#8217;t my RSS reader go out and
track down the 10 most interesting blog posts every day?&lt;/p&gt;

&lt;p&gt;Some software, however, &lt;i&gt;can&lt;/i&gt; make these distinctions.
Google figures out when I want to watch a movie, and shows me &lt;a href="http://www.google.com/search?q=cinema+boston"&gt;specialized
search results&lt;/a&gt;.  And most e-mail clients can identify spam with over
99% accuracy.  But the vast majority of software is dumb, incapable of
dealing with the messy dilemmas posed by the real world.&lt;/p&gt;

&lt;p&gt;So where can we learn to improve our software?&lt;/p&gt;

&lt;p&gt;Outside of Google&amp;#8217;s shroud
of secrecy, the most successful classifiers are spam filters.  And most modern
spam filters are inspired by Paul Graham&amp;#8217;s essay &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan for Spam&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So let&amp;#8217;s go back to the source, and see what we can learn.  As it turns out, we can formulate a lot of the ideas in &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan
for Spam&lt;/a&gt; in a straightforward fashion using a &lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;Bayesian
monad&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Functions from distributions to distributions&lt;/h3&gt;

&lt;p&gt;Let&amp;#8217;s begin with spam filtering.  By convention, we divide messages into
&amp;#8220;spam&amp;#8221; and &amp;#8220;ham&amp;#8221;, where &amp;#8220;ham&amp;#8221; is the stuff we want to read.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='keyword'&gt;data&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='keyglyph'&gt;|&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt;
  &lt;span class='keyword'&gt;deriving&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='conid'&gt;Show&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Eq&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Enum&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Bounded&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let&amp;#8217;s assume that we&amp;#8217;ve just received a new e-mail.  Without even looking
at it, we know there&amp;#8217;s a certain chance that it&amp;#8217;s spam.  This gives us
something called a &amp;#8220;prior distribution&amp;#8221; over &lt;code&gt;MsgType&lt;/code&gt;.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;64.2&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;35.8&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But what if we know that the first word of the message is &amp;#8220;free&amp;#8221;?  We can
use that information to calculate a new distribution.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='str'&gt;"free"&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;90.5&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;9.5&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The function &lt;code&gt;hasWord&lt;/code&gt; takes a string and a probability
distribution, and uses them to calculate a new probability distribution:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='keyglyph'&gt;::&lt;/span&gt; &lt;span class='conid'&gt;String&lt;/span&gt; &lt;span class='keyglyph'&gt;-&amp;gt;&lt;/span&gt; &lt;span class='conid'&gt;FDist'&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt; &lt;span class='keyglyph'&gt;-&amp;gt;&lt;/span&gt;
           &lt;span class='conid'&gt;FDist'&lt;/span&gt; &lt;span class='conid'&gt;MsgType&lt;/span&gt;
&lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='varid'&gt;word&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='keyword'&gt;do&lt;/span&gt;
  &lt;span class='varid'&gt;msgType&lt;/span&gt; &lt;span class='keyglyph'&gt;&amp;lt;-&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;
  &lt;span class='varid'&gt;wordPresent&lt;/span&gt; &lt;span class='keyglyph'&gt;&amp;lt;-&lt;/span&gt;
    &lt;span class='varid'&gt;wordPresentDist&lt;/span&gt; &lt;span class='varid'&gt;msgType&lt;/span&gt; &lt;span class='varid'&gt;word&lt;/span&gt;
  &lt;span class='varid'&gt;condition&lt;/span&gt; &lt;span class='varid'&gt;wordPresent&lt;/span&gt;
  &lt;span class='varid'&gt;return&lt;/span&gt; &lt;span class='varid'&gt;msgType&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This code is based on the Bayesian monad from &lt;a href="http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests"&gt;part 3&lt;/a&gt;.  As before,
the &amp;#8220;&lt;code&gt;&amp;lt;-&lt;/code&gt;&amp;#8221; operator selects a single item from a probability
distribution, and &lt;code&gt;condition&lt;/code&gt; asserts that an expression is true.  The
actual Bayesian inference happens behind the scenes (handy, that).&lt;/p&gt;
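
&lt;p&gt;To make the behind-the-scenes part less mysterious, here is a minimal sketch of one way such a monad could be built.  It is not the code from part 3: the names &lt;code&gt;Dist&lt;/code&gt;, &lt;code&gt;weighted&lt;/code&gt;, &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;wordDist&lt;/code&gt; are assumptions chosen for this illustration, the &lt;code&gt;condition&lt;/code&gt; below is only a stand-in with the same role, and the word probabilities are invented.  Each outcome carries a weight, &lt;code&gt;Nothing&lt;/code&gt; marks a world ruled out by the evidence, and the last step discards those worlds and rescales the rest.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;import Control.Monad (ap, liftM)

-- Hypothetical stand-in for the Bayesian monad, not the code from part 3.
-- A distribution is a list of weighted outcomes; Nothing marks a world
-- that the evidence has ruled out.
newtype Dist a = Dist { runDist :: [(Maybe a, Double)] }

instance Functor Dist where
  fmap = liftM

instance Applicative Dist where
  pure x = Dist [(Just x, 1)]
  (&amp;lt;*&amp;gt;)  = ap

instance Monad Dist where
  Dist xs &amp;gt;&amp;gt;= f = Dist (concatMap step xs)
    where
      -- A ruled-out world stays ruled out and keeps its weight.
      step (Nothing, p) = [(Nothing, p)]
      -- Otherwise run the next step and multiply the weights.
      step (Just x,  p) = [ (y, p * q) | (y, q) &amp;lt;- runDist (f x) ]

-- Build a distribution from explicit weights.
weighted :: [(a, Double)] -&amp;gt; Dist a
weighted xs = Dist [ (Just x, p) | (x, p) &amp;lt;- xs ]

-- Rule out every world in which the observation is false.
condition :: Bool -&amp;gt; Dist ()
condition True  = Dist [(Just (), 1)]
condition False = Dist [(Nothing, 1)]

-- Discard the ruled-out worlds and rescale the rest to sum to 1.
normalize :: Dist a -&amp;gt; [(a, Double)]
normalize (Dist xs) = [ (x, p / total) | (Just x, p) &amp;lt;- xs ]
  where total = sum [ p | (Just _, p) &amp;lt;- xs ]

data MsgType = Spam | Ham deriving (Show, Eq)

-- Invented chances of seeing the word in each class of message.
wordDist :: MsgType -&amp;gt; [(Bool, Double)]
wordDist Spam = [(True, 0.20), (False, 0.80)]
wordDist Ham  = [(True, 0.04), (False, 0.96)]

-- The prior is the 64.2% / 35.8% printed earlier; the rest is made up.
example :: [(MsgType, Double)]
example = normalize $ do
  msgType &amp;lt;- weighted [(Spam, 0.642), (Ham, 0.358)]
  present &amp;lt;- weighted (wordDist msgType)
  condition present
  return msgType
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this sketch, Bayes&amp;#8217; rule is nothing more than the division in &lt;code&gt;normalize&lt;/code&gt;: the weight of each surviving outcome is divided by the total weight of all surviving outcomes.&lt;/p&gt;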

&lt;p&gt;If we have multiple pieces of evidence, we can apply them one at a time.
Each piece of evidence will update the probability distribution produced by
the previous step:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;     &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;
&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;w&lt;/span&gt;&lt;span class='conop'&gt;:&lt;/span&gt;&lt;span class='varid'&gt;ws&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='keyword'&gt;do&lt;/span&gt;
  &lt;span class='varid'&gt;hasWord&lt;/span&gt; &lt;span class='varid'&gt;w&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='varid'&gt;ws&lt;/span&gt; &lt;span class='varid'&gt;prior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
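
&lt;p&gt;The recursion above has the shape of a right fold, so the same function can also be written with &lt;code&gt;foldr&lt;/code&gt;.  The sketch below checks this on a deliberately simplified model: &lt;code&gt;Weights&lt;/code&gt;, &lt;code&gt;likelihood&lt;/code&gt; and every number in it are assumptions made up for the illustration, and this &lt;code&gt;hasWord&lt;/code&gt; merely rescales weights instead of working on the real &lt;code&gt;FDist' MsgType&lt;/code&gt;.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;-- Hypothetical stand-ins so that the sketch compiles on its own.
data MsgType = Spam | Ham deriving (Show, Eq)
type Weights = [(MsgType, Double)]

-- Made-up chances of seeing a word in each class of message.
likelihood :: MsgType -&amp;gt; String -&amp;gt; Double
likelihood Spam "free" = 0.20
likelihood Ham  "free" = 0.04
likelihood _    _      = 0.50

-- Assumed update step: scale each weight by the likelihood of the word.
hasWord :: String -&amp;gt; Weights -&amp;gt; Weights
hasWord word prior = [ (t, p * likelihood t word) | (t, p) &amp;lt;- prior ]

-- Explicit recursion, in the same shape as the definition above.
hasWords :: [String] -&amp;gt; Weights -&amp;gt; Weights
hasWords []     prior = prior
hasWords (w:ws) prior = hasWord w (hasWords ws prior)

-- The same function written as a right fold.
hasWords' :: [String] -&amp;gt; Weights -&amp;gt; Weights
hasWords' ws prior = foldr hasWord prior ws
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;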

&lt;p&gt;The final distribution will combine everything we know:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;bayes&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varid'&gt;hasWords&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='str'&gt;"free"&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt;&lt;span class='str'&gt;"bayes"&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt; &lt;span class='varid'&gt;msgTypePrior&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='keyglyph'&gt;[&lt;/span&gt;&lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Spam&lt;/span&gt; &lt;span class='num'&gt;34.7&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='conid'&gt;Perhaps&lt;/span&gt; &lt;span class='conid'&gt;Ham&lt;/span&gt; &lt;span class='num'&gt;65.3&lt;/span&gt;&lt;span class='varop'&gt;%&lt;/span&gt;&lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This technique is known as the &lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;naive Bayes classifier&lt;/a&gt;.  Looked at from the right angle, it&amp;#8217;s surprisingly simple.&lt;/p&gt;

&lt;p&gt;(Of course, the naive Bayes classifier assumes that all of our evidence is independent.  In theory, this is a pretty big assumption. In practice, it &lt;a href="http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf"&gt;works better than you might think&lt;/a&gt;.)&lt;/p&gt;
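
&lt;p&gt;Written out, the independence assumption says that the score for each message type is &lt;code&gt;P(type) * P(w1 | type) * ... * P(wn | type)&lt;/code&gt;, and the posterior is that score divided by the sum of the scores for all types.  A minimal sketch of the calculation follows.  The prior matches the output printed earlier, but &lt;code&gt;wordGiven&lt;/code&gt; and its numbers are hypothetical, so the results will not reproduce the percentages above.&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;data MsgType = Spam | Ham
  deriving (Show, Eq, Enum, Bounded)

-- Prior taken from the output printed earlier in the article.
prior :: MsgType -&amp;gt; Double
prior Spam = 0.642
prior Ham  = 0.358

-- Hypothetical P(word | type); these numbers are invented.
wordGiven :: MsgType -&amp;gt; String -&amp;gt; Double
wordGiven Spam "free"  = 0.20
wordGiven Ham  "free"  = 0.04
wordGiven Spam "bayes" = 0.01
wordGiven Ham  "bayes" = 0.10
wordGiven _    _       = 0.50

-- Unnormalized score: P(type) times the product of P(word | type).
score :: [String] -&amp;gt; MsgType -&amp;gt; Double
score ws t = prior t * product [ wordGiven t w | w &amp;lt;- ws ]

-- Divide each score by the total so that the results sum to 1.
posterior :: [String] -&amp;gt; [(MsgType, Double)]
posterior ws = [ (t, score ws t / total) | t &amp;lt;- types ]
  where
    types = [minBound .. maxBound]
    total = sum (map (score ws) types)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the score is a plain product, the order in which the words are applied does not change the result, which is why updating one word at a time arrives at the same posterior.&lt;/p&gt;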

&lt;p&gt;But this still leaves us with a lot of questions: How do we keep track of
our different classifiers?  How do we decide which ones to apply?  And do
we need to fudge the numbers to get reasonable results?&lt;/p&gt;

&lt;p&gt;In the following sections, I&amp;#8217;ll walk through various aspects of Paul
Graham&amp;#8217;s &lt;a href="http://www.paulgraham.com/spam.html"&gt;A Plan for Spam&lt;/a&gt;, and show how to generalize it.  If you
want to follow along, you can download the code using Darcs:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_sh "&gt;darcs get http://www.randomhacks.net/darcs/probability&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;a href="http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell"&gt;Read More&lt;/a&gt;&lt;/p&gt;</description>
      <pubDate>Sat, 03 Mar 2007 09:02:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:46da7ba0-d82a-40a5-bdac-d730e4f7073b</guid>
      <author>Eric Kidd</author>
      <link>http://www.randomhacks.net/articles/2007/03/03/smart-classification-with-haskell</link>
      <category>Haskell</category>
      <category>Math</category>
      <category>Monads</category>
      <category>Probability</category>
      <category>Spam</category>
      <trackback:ping>http://www.randomhacks.net/articles/trackback/322</trackback:ping>
    </item>
  </channel>
</rss>
