Bayesian Whitelisting: Finding the Good Mail Among the Spam

Posted by Eric Sun, 29 Sep 2002 00:00:00 GMT

The biggest challenge with spam filtering is reducing false positives--that is, finding the good mail among the spam. Even the best spam filters occasionally mistake legitimate e-mail for spam. For example, in some recent tests, bogofilter processed 18,000 e-mails with only 34 false positives. Unfortunately, several of these false positives were urgent e-mails from former clients. This unpleasant mistake wasn't necessary--the most important of these false positives could have been avoided with an automatic whitelisting system.
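
To make the idea concrete, here is a minimal sketch of sender-based automatic whitelisting in Python. It is not bogofilter's code, and not necessarily the scheme the full article describes. The function names `build_whitelist` and `final_verdict` are hypothetical, and the sketch assumes you have a pile of mail already known to be good (say, a saved inbox).

```python
from email import message_from_string
from email.utils import parseaddr


def sender_of(raw_message):
    """Return the lowercased address from the From: header, or ''."""
    header = message_from_string(raw_message).get("From", "")
    _, address = parseaddr(header)
    return address.lower()


def build_whitelist(known_good_messages):
    """Collect the senders of mail already known to be legitimate."""
    return {sender_of(raw) for raw in known_good_messages} - {""}


def final_verdict(raw_message, filter_says_spam, whitelist):
    """Let a whitelisted sender override the filter's spam verdict."""
    if filter_says_spam and sender_of(raw_message) in whitelist:
        return "ham"
    return "spam" if filter_says_spam else "ham"


# Hypothetical sample data: one saved message from a former client.
saved_mail = ["From: Former Client <client@example.com>\nSubject: Invoice\n\nHi"]
whitelist = build_whitelist(saved_mail)
urgent = "From: client@example.com\nSubject: URGENT\n\nPlease call me"
print(final_verdict(urgent, True, whitelist))
```

Matching on the From: address alone is easy to forge, so treat this as an illustration of the override step rather than a finished design.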



Censorship of the Press

Posted by Eric Fri, 27 Sep 2002 00:00:00 GMT

The ACLU on new censorship restrictions: This part of the [Patriot] Act overrides existing state and federal privacy laws, allowing the FBI to investigate which books have been bought or borrowed by anyone it suspects of being a terrorist--an extremely broad and vague determination. Further, it prevents librarians and booksellers from revealing that such a search has taken place, and it bars the press from reporting on such searches...

Thus the press and the public have no way of knowing when, where, or how often such searches have been conducted, or what books and readers are being investigated. Normally, when a court imposes a gag rule on pretrial or trial participants, including the press, it may be fought and, in many cases, overturned. The Patriot Act makes such challenges impossible.

Macintosh Developer Pain

Posted by Eric Tue, 24 Sep 2002 00:00:00 GMT

I'm in a maze of twisty little library interfaces, all different. I'm dealing with three C libraries (MPW StdCLib, CarbonStdCLib.o and MSL), two MacOS platforms (PPC and Carbon), two build systems (MPW and CodeWarrior) and a growing sense of desperation. Of course, no piece of this cruft wants to talk to any other piece.


Sincere Choice

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

Sincere Choice: A lobbying effort arguing that free software and proprietary software should compete on an equal footing.

FTC Spam Archive

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

The FTC appears to have a huge spam database.


Machine Learning Links

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

Useful sites about machine-learning algorithms, for developers of spam filters: Machine Learning Network, the Bow toolkit, Latent Semantic Analysis (used by Apple's mail client), Bayesian Latent Semantic Analysis, text clustering, more text clustering, Using Clustering to Boost Text Classification (PDF) and TFIDF notes.

I wouldn't be entirely surprised if neural networks worked well here, either--the problem has that "figure out where to draw the boundaries between clusters" aspect that maps nicely onto the math of neural networks.


Using Bogofilter with SpamAssassin

Posted by Eric Mon, 23 Sep 2002 00:00:00 GMT

For deadly-accurate spam filtering, combine a well-trained bogofilter with SpamAssassin. Here's how.

Add the following lines to your procmailrc file, before you run SpamAssassin:

# Tag the message when bogofilter exits 0 (its "spam" exit code).
:0fw
* ? bogofilter
    | formail -I "X-Spam-Bogofilter: yes"

Add the following lines to your /etc/spamassassin/ file:

header    BOGOFILTER  X-Spam-Bogofilter =~ /yes/
describe  BOGOFILTER  Message has too many bogons.
score     BOGOFILTER  5.0

Presto! This plugs almost all the holes in SpamAssassin's defense, and uses SpamAssassin's auto-whitelist (you've got it turned on, right?) to protect against false positives.
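
Why the auto-whitelist helps can be sketched with a little arithmetic. The model below is a simplification for illustration, not SpamAssassin's actual code. It assumes a spam threshold of 5.0, and it assumes the auto-whitelist pulls each message's score halfway toward the sender's historical average. The function name, the `factor` value and the sample history are all made up.

```python
REQUIRED_SCORE = 5.0    # assumed spam threshold
BOGOFILTER_SCORE = 5.0  # the score given to the BOGOFILTER rule above


def adjusted_score(raw_score, sender_history, factor=0.5):
    """Pull raw_score toward the sender's mean past score (simplified model)."""
    if not sender_history:
        return raw_score  # unknown sender: nothing to average against
    mean = sum(sender_history) / len(sender_history)
    return raw_score + (mean - raw_score) * factor


# A former client with a clean history, misjudged by bogofilter:
client = adjusted_score(BOGOFILTER_SCORE + 0.3, [-1.0, 0.0, 0.5])
# A first-time sender flagged by bogofilter:
stranger = adjusted_score(BOGOFILTER_SCORE + 0.3, [])

print(client < REQUIRED_SCORE)     # is the client's mail delivered?
print(stranger >= REQUIRED_SCORE)  # is the stranger's mail marked as spam?
```

Under these assumptions, a single bogofilter hit is enough to flag a stranger, while a known correspondent's good track record drags the same score back under the threshold.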


How To Test a Trainable Spam Filter

Posted by Eric Sun, 22 Sep 2002 00:00:00 GMT

Ever since Paul Graham published A Plan for Spam, "trainable" spam filters have been the latest fashion. These filters train themselves to know the characteristics of your personal e-mail. Supposedly, this extra knowledge allows them to make fewer mistakes, and makes them harder to fool. But do these filters actually work? In this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter, and describe the steps required to evaluate such a filter accurately.
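
The article's own steps are not reproduced in this excerpt, but the basic bookkeeping of any such evaluation can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function name `evaluate` is hypothetical, and it assumes you have hand-labelled test messages that the filter was not trained on, each paired with the filter's verdict.

```python
def evaluate(results):
    """Compute error rates from (is_actually_spam, flagged_as_spam) pairs."""
    ham = spam = false_positives = false_negatives = 0
    for is_actually_spam, flagged_as_spam in results:
        if is_actually_spam:
            spam += 1
            if not flagged_as_spam:
                false_negatives += 1  # spam that slipped through
        else:
            ham += 1
            if flagged_as_spam:
                false_positives += 1  # good mail wrongly flagged
    return {
        "false_positive_rate": false_positives / ham if ham else 0.0,
        "false_negative_rate": false_negatives / spam if spam else 0.0,
    }


# Hypothetical results: 2 spam messages and 4 good ones.
sample = [(True, True), (True, False),
          (False, False), (False, True), (False, False), (False, False)]
print(evaluate(sample))
```

Keeping the two rates separate matters because the costs differ: a missed spam is an annoyance, while a flagged piece of good mail may be lost.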



Things I Hate About CodeWarrior, Part I

Posted by Eric Fri, 20 Sep 2002 00:00:00 GMT

Metrowerks CodeWarrior is a fairly nice compiler suite and IDE for the Macintosh. Unfortunately, it suffers from several severe flaws. Most of these flaws involve CodeWarrior's binary project files.

A short list of problems with this design:

  1. The project files are completely opaque. As Unix users like to complain, binary files are just an opaque blob of bytes. This breaks such vital utilities as diff and merge.
  2. The project files change every time you compile your program. For some unknown reason, CodeWarrior stores object code in the project files. This means the files get changed every time you compile. This makes CVS grumpy.
  3. The project file format is always changing. I've never upgraded CodeWarrior without having to re-import all my project files.
  4. CodeWarrior can't read very old project files at all. Just today, CodeWarrior told me it couldn't open one of my old project files. I wonder what was in there.

Now, don't get me wrong, CodeWarrior was a really sweet product back in early 1995. But by modern standards, it's pretty painful.


EU Free Software Study

Posted by Eric Thu, 19 Sep 2002 00:00:00 GMT

Andy Oram reports on a European Union study of free software developers: Furthermore, 38 percent of all developers agreed with the statement that software should not be proprietary. Or to turn it around: The majority of free software developers don't have major philosophical objections to proprietary software. Interesting.
