EE 131A MATLAB Project
Fall 2006
PART A: Bayesian Spam Filter Design (60 points)
In this project you will design a simple Bayesian spam filter. The purpose of a spam filter is
to separate legitimate emails (known as “ham” in antispam jargon) from unsolicited
emails (“spam” or “junk”). The two main categories of spam filters are rule-based and
probabilistic. Bayesian spam filters belong to the second category and are simple applications of
Bayes’ formula taught in class. The underlying assumption is that the words (or other features, such as
email domains) used in spam emails have different statistical properties than those used in
legitimate emails. The idea, then, is to train the filter to recognize and distinguish the patterns in a user’s
email. Because a Bayesian spam filter keeps learning and adjusting itself to the properties of
a particular user’s email, it has a better chance of fighting spam.
A Bayesian spam filter is a classifier with three outputs: “spam”, “ham” and “undecided”. It
works by computing a probability $P[S \mid W_s] = P[\text{spam} \mid \text{words}]$ that an email
containing certain words is spam and then making the corresponding decision based on whether
$P[S \mid W_s]$ is greater than a threshold $t_{\text{spam}}$, less than another threshold
$t_{\text{ham}}$, or between them.
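The three-way decision described above can be sketched in a few lines, shown here in Python. This is only an illustrative sketch: the function name `classify` and the default threshold values 0.9 and 0.1 are assumptions made for the example, not values specified by the project, and the probability $P[S \mid W_s]$ is assumed to have been computed already.

```python
def classify(p_spam_given_words, t_spam=0.9, t_ham=0.1):
    """Return "spam", "ham" or "undecided" for a given P[S | W_s].

    The default thresholds are illustrative assumptions, not values
    taken from the project statement.
    """
    if not 0.0 <= p_spam_given_words <= 1.0:
        raise ValueError("a probability must lie in [0, 1]")
    if t_ham > t_spam:
        raise ValueError("t_ham must not exceed t_spam")

    if p_spam_given_words > t_spam:   # greater than t_spam
        return "spam"
    if p_spam_given_words < t_ham:    # less than t_ham
        return "ham"
    return "undecided"                # between the two thresholds
```

With these assumed thresholds, a probability of 0.95 is labeled “spam”, 0.05 is labeled “ham”, and 0.5 is left “undecided”. Widening the gap between the two thresholds trades fewer misclassifications for more undecided emails.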
In most filters it is assumed that the probabilities of an incoming email being ham or spam
are equal (i.e., $P[\text{spam}] = P[\text{ham}]$). Another common assumption in the so-called “naive Bayes’