Posted by Eric
Sun, 29 Sep 2002 00:00:00 GMT
The biggest challenge with spam filtering is reducing false
positives--that is, finding the good mail among the spam. Even the
best spam filters occasionally mistake legitimate e-mail for spam. For
example, in some recent
tests, bogofilter
processed 18,000 e-mails with only 34 false positives. Unfortunately,
several of these false positives were urgent e-mails from former
clients. This unpleasant mistake wasn't necessary--the most important
of these false positives could have been avoided with an automatic
whitelisting system.
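The whitelisting idea can be sketched in a few lines of Python. This is only a minimal illustration, not the system described in the full article: it assumes that "automatic" means "trust anyone I have already written to", and the names build_whitelist, is_whitelisted, classify and looks_like_spam are hypothetical. Note that From headers are easy to forge, so a real system would want more than this check.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def build_whitelist(sent_messages):
    """Collect every address we have written to (To and Cc headers)."""
    whitelist = set()
    for raw in sent_messages:
        msg = message_from_string(raw)
        recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, addr in getaddresses(recipients):
            if addr:
                whitelist.add(addr.lower())
    return whitelist


def is_whitelisted(raw_message, whitelist):
    """True if the From address is one we have previously mailed."""
    msg = message_from_string(raw_message)
    _name, addr = parseaddr(msg.get("From", ""))
    return addr.lower() in whitelist


def classify(raw_message, whitelist, looks_like_spam):
    """Consult the whitelist first; only then ask the spam filter.

    looks_like_spam is a stand-in for whatever filter you use
    (for example, a wrapper around bogofilter).
    """
    if is_whitelisted(raw_message, whitelist):
        return "ham"
    return "spam" if looks_like_spam(raw_message) else "ham"
```

Because the whitelist is consulted before the filter runs, mail from a known correspondent is never scored at all, so the filter gets no chance to misjudge it.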
Tags Hacks, Probability, Python, Recommended, Spam
Posted by Eric
Fri, 27 Sep 2002 00:00:00 GMT
The ACLU on new censorship
restrictions: This part of the [Patriot] Act overrides existing
state and federal privacy laws, allowing the FBI to investigate which
books have been bought or borrowed by anyone it suspects of being a
terrorist--an extremely broad and vague determination. Further, it
prevents librarians and booksellers from revealing that such a search
has taken place, and it bars the press from reporting on such
searches...
Thus the press and the public have no way of knowing when, where,
or how often such searches have been conducted, or what books and
readers are being investigated. Normally, when a court imposes a gag
rule on pretrial or trial participants, including the press, it may be
fought and, in many cases, overturned. The Patriot Act makes such
challenges impossible.
Posted by Eric
Tue, 24 Sep 2002 00:00:00 GMT
I'm in a maze of twisty little library interfaces, all different. I'm
dealing with three C libraries (MPW StdCLib, CarbonStdCLib.o and MSL),
two MacOS platforms (PPC and Carbon), two build systems (MPW and
CodeWarrior) and a growing sense of desperation. Of course, no piece of
this cruft wants to talk to any other piece.
Tags Mac
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
Sincere Choice: A lobbying
effort arguing that free software and proprietary software should compete
on an equal footing.
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
The FTC appears to have a huge spam
database.
Tags Spam
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
Useful sites about machine-learning algorithms, for developers of
spam filters: Machine Learning
Network, the Bow
toolkit, Latent Semantic
Analysis (used by Apple's mail client), Bayesian
Latent Semantic Analysis, text
clustering, more
text clustering, Using
Clustering to Boost Text Classification (PDF) and TFIDF
notes.
I wouldn't be
entirely surprised if neural networks worked well here, either--the
problem has that "figure out where to draw the boundaries between
clusters" aspect that maps nicely onto the math of neural
networks.
Tags AI, Probability, Spam
Posted by Eric
Mon, 23 Sep 2002 00:00:00 GMT
For deadly-accurate spam filtering, combine a well-trained bogofilter with
SpamAssassin. Here's how.
Add the following lines to your procmailrc file, before you run
SpamAssassin:
:0HB
* ? bogofilter
{
    :0fw
    | formail -I "X-Spam-Bogofilter: yes"
}
Add the following lines to your /etc/spamassassin/local.cf
file:
header BOGOFILTER X-Spam-Bogofilter =~ /yes/
describe BOGOFILTER Message has too many bogons.
score BOGOFILTER 5.0
Presto! This plugs almost all the holes in SpamAssassin's defense,
and uses SpamAssassin's auto-whitelist (you've got it turned on,
right?) to protect against false positives.
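To see why a score of 5.0 works here, the following Python sketch models the arithmetic. It is a simplified model written for illustration, not SpamAssassin's code: it assumes the default threshold of 5.0, assumes rule scores are simply summed, and approximates the auto-whitelist as pulling each score part of the way toward the sender's historical mean. The rule name OTHER_RULE and all function names are hypothetical.

```python
REQUIRED_HITS = 5.0  # assumed threshold (SpamAssassin's documented default)


def total_score(rule_scores, matched_rules):
    """Sum the scores of every rule that matched the message."""
    return sum(rule_scores[name] for name in matched_rules)


def awl_adjust(score, sender_mean, factor=0.5):
    """Pull the score toward the sender's historical mean (simplified)."""
    return score + (sender_mean - score) * factor


def is_spam(score, required=REQUIRED_HITS):
    """A message is flagged once its score reaches the threshold."""
    return score >= required


rules = {"BOGOFILTER": 5.0, "OTHER_RULE": 0.5}  # OTHER_RULE is made up

# A bogofilter hit alone reaches the threshold.
raw = total_score(rules, ["BOGOFILTER"])

# A sender whose past mail averaged 0.0 gets pulled back under it.
adjusted = awl_adjust(raw, sender_mean=0.0)
```

Under these assumptions, a bogofilter hit from an unknown sender is flagged on its own, while the same hit from a long-standing correspondent is halved and passes.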
Tags Spam
Posted by Eric
Sun, 22 Sep 2002 00:00:00 GMT
Ever since Paul Graham published A Plan for Spam,
"trainable" spam filters have become the latest fashion. These filters
train themselves to know the characteristics of your personal e-mail.
Supposedly, this extra knowledge allows them to make fewer mistakes,
and makes them harder to fool. But do these filters actually work? In
this article, I try out Eric Raymond's bogofilter, a trainable Bayesian spam filter,
and describe the steps required to evaluate such a filter
accurately.
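As a rough illustration of what an accurate evaluation involves, here is a Python sketch that holds back part of a hand-labelled corpus and measures error rates only on mail the filter never saw during training. It is not the procedure from the full article; split_corpus, evaluate and predict are hypothetical names, and predict stands in for a call to the trained filter.

```python
import random


def split_corpus(messages, test_fraction=0.2, seed=0):
    """Shuffle labelled messages and hold some back for testing."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def evaluate(test_set, predict):
    """Count errors on (text, is_spam) pairs the filter never trained on."""
    false_pos = false_neg = ham = spam = 0
    for text, is_spam in test_set:
        guess = bool(predict(text))
        if is_spam:
            spam += 1
            if not guess:
                false_neg += 1  # spam that slipped through
        else:
            ham += 1
            if guess:
                false_pos += 1  # good mail wrongly flagged
    return {
        "false_positive_rate": false_pos / ham if ham else 0.0,
        "false_negative_rate": false_neg / spam if spam else 0.0,
    }
```

Reporting the two rates separately matters because the errors are not equally costly: a missed spam is an annoyance, while a false positive can bury mail you needed to read.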
Tags Hacks, Recommended, Spam
Posted by Eric
Fri, 20 Sep 2002 00:00:00 GMT
Metrowerks CodeWarrior is a
fairly nice compiler suite and IDE for the Macintosh. Unfortunately,
it suffers from several severe flaws. Most of these flaws involve
CodeWarrior's binary project files.
A short list of problems with this design:
- The project files are completely opaque. As Unix users
like to complain, a binary file is just an opaque blob of bytes.
This breaks such vital utilities as diff and merge.
- The project files change every time you compile your
program. For some unknown reason, CodeWarrior stores object code
in the project files, so every build modifies them. This makes
CVS grumpy.
- The project file format is always changing. I've never
upgraded CodeWarrior without having to re-import all my project
files.
- CodeWarrior can't read very old project files.
Just today, CodeWarrior told me it couldn't open an old project file
at all. I wonder what was in there.
Now, don't get me wrong, CodeWarrior was a really sweet product back
in early 1995. But by modern standards, it's pretty painful.
Tags Mac
Posted by Eric
Thu, 19 Sep 2002 00:00:00 GMT
Andy Oram reports
on a European Union study of free software developers: Furthermore, 38
percent of all developers agreed with the statement that software should
not be proprietary. Or to turn it around: The majority of free
software developers don't have major philosophical objections to
proprietary software. Interesting.