Bayesian Whitelisting: Finding the Good Mail Among the Spam

Posted by Eric on Sun, 29 Sep 2002 00:00:00 GMT

The biggest challenge with spam filtering is reducing false positives--that is, finding the good mail among the spam. Even the best spam filters occasionally mistake legitimate e-mail for spam. For example, in some recent tests, bogofilter processed 18,000 e-mails with only 34 false positives (about 0.19%). Unfortunately, several of these false positives were urgent e-mails from former clients. These mistakes weren't necessary--the most important of them could have been avoided with an automatic whitelisting system.
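
To make the idea concrete, here is a minimal sketch of one way an automatic whitelist could sit in front of a spam filter. It is not bogofilter's mechanism and not necessarily the approach this post goes on to describe: the function names, the 0.9 threshold, and the policy of trusting anyone whose mail was previously marked good are all assumptions made for illustration.

```python
from email.utils import parseaddr


def build_whitelist(known_good_from_headers):
    """Collect lowercased sender addresses from mail already known to be good.

    Assumption: the caller supplies the From headers of messages the user
    has previously accepted as legitimate.
    """
    whitelist = set()
    for header in known_good_from_headers:
        _, addr = parseaddr(header)
        if addr:
            whitelist.add(addr.lower())
    return whitelist


def classify(from_header, spam_score, whitelist, threshold=0.9):
    """Return "ham" or "spam".

    spam_score is assumed to be a probability-like value in [0, 1] produced
    by some statistical filter. A whitelisted sender bypasses the score.
    """
    _, addr = parseaddr(from_header)
    if addr.lower() in whitelist:
        return "ham"
    return "spam" if spam_score >= threshold else "ham"


# Hypothetical data: one former client whose mail was accepted before.
whitelist = build_whitelist(["Former Client <Client@Example.com>"])
```

With this in place, a message from client@example.com is kept even if the filter scores it at 0.99, while an unknown sender with the same score is still treated as spam. Since From headers are easy to forge, a real system would probably want stronger evidence than the address alone.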


Machine Learning Links

Posted by Eric on Mon, 23 Sep 2002 00:00:00 GMT

Useful sites about machine-learning algorithms, for developers of spam filters: Machine Learning Network, the Bow toolkit, Latent Semantic Analysis (used by Apple's mail client), Bayesian Latent Semantic Analysis, text clustering, more text clustering, Using Clustering to Boost Text Classification (PDF) and TFIDF notes.

I wouldn't be entirely surprised if neural networks worked well here, either--the problem has that "figure out where to draw the boundaries between clusters" aspect that maps nicely onto the math of neural networks.
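
As a rough illustration of that boundary-drawing idea, here is a sketch of a single-layer perceptron, about the simplest neural network there is, learning a line between two clusters of points. The two features (a count of "spammy" words and a count of "personal" words) and the toy data are invented for the example; nothing here comes from a real filter or from the sites listed above.

```python
def score(weights, bias, x):
    """Weighted sum of the features plus the bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Return 1 (spam) if x falls on the positive side of the boundary, else 0."""
    return 1 if score(weights, bias, x) > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Classic perceptron rule: nudge the boundary after every mistake."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical features: (spammy word count, personal word count).
samples = [(3, 0), (4, 1), (5, 0), (0, 3), (1, 4), (0, 5)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = good mail

weights, bias = train_perceptron(samples, labels)
```

Because the two toy clusters can be separated by a straight line, the perceptron rule settles on a boundary that classifies every training point correctly. Real mail is far messier, which is where multi-layer networks and richer features would come in.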

