l message is spam.
We look at a particular word w and count the number of messages in B (spam) and in G (non-spam) that contain it: nB(w) and nG(w).

Estimated probability that a spam message contains w: p(w) = nB(w)/|B|
Estimated probability that a non-spam message contains w: q(w) = nG(w)/|G|

Bayesian Spam Filters
Let S be the event that the message is spam, and E be the event that the message contains the word w. Using Bayes’ Rule,

$$p(S \mid E) = \frac{p(E \mid S)\,p(S)}{p(E \mid S)\,p(S) + p(E \mid \bar{S})\,p(\bar{S})}$$

Assuming that an arbitrary message is equally likely to be spam or not spam, i.e., p(S) = p(S̄) = ½, this simplifies to

$$p(S \mid E) = \frac{p(E \mid S)}{p(E \mid S) + p(E \mid \bar{S})}$$

Using our empirical estimates p(w) of p(E | S) and q(w) of p(E | S̄),

$$r(w) = \frac{p(w)}{p(w) + q(w)}$$

Note: If we have data on the frequency of spam messages, we can obtain a better estimate for p(S).

r(w) estimates the probability that a message containing w is spam. We can class the message as spam if r(w) is above a threshold.

Example: We find that the word “Rolex” occurs in 250 out of 2000 spam messages and occurs in 5 out of 1000 non-spam messages. Estimate the probability that an incoming message containing “Rolex” is spam. Suppose our threshold for rejecting the email is 0.9.
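Solution: p(Rolex) = 250/2000 = 0.125 and q(Rolex) = 5/1000 = 0.005, so

$$r(\text{Rolex}) = \frac{p(\text{Rolex})}{p(\text{Rolex}) + q(\text{Rolex})} = \frac{0.125}{0.125 + 0.005} = \frac{0.125}{0.130} \approx 0.962$$

Since 0.962 > 0.9, we class the message as spam and reject the email!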
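The Rolex calculation can be checked with a short Python sketch. The function names (`word_probabilities`, `spam_score`) and the `prior_spam` parameter are illustrative choices, not part of the slides. The prior defaults to ½ as assumed above, and passing a different value corresponds to the note about using a better estimate for p(S).

```python
def word_probabilities(n_spam_with_word, n_spam, n_ham_with_word, n_ham):
    """Return (p, q): estimated probability that a spam / non-spam
    message contains the word, from message counts."""
    p = n_spam_with_word / n_spam
    q = n_ham_with_word / n_ham
    return p, q


def spam_score(p, q, prior_spam=0.5):
    """Bayes' rule for p(S | E). With prior_spam = 0.5 this reduces
    to r(w) = p / (p + q)."""
    numerator = p * prior_spam
    denominator = numerator + q * (1.0 - prior_spam)
    if denominator == 0:
        raise ValueError("word never observed in either set of messages")
    return numerator / denominator


# Counts from the Rolex example: 250 of 2000 spam, 5 of 1000 non-spam.
p, q = word_probabilities(250, 2000, 5, 1000)
r = spam_score(p, q)

THRESHOLD = 0.9  # rejection threshold from the example
is_spam = r > THRESHOLD

print(f"p = {p}, q = {q}, r = {r:.3f}, spam = {is_spam}")
# prints: p = 0.125, q = 0.005, r = 0.962, spam = True
```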
Bayesian Spam Filters using Multiple Words
Accuracy can be improved by considering more than one word as evidence. Consider the case where E1 and E2 denote the events that the message contains the words w1 and w2 respectively.
We make the simplifying assumption that the events are independent. And again...
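One way the independence assumption could be used is sketched below in Python. The text is cut off at this point, so the sketch additionally assumes equal priors, p(S) = ½, as in the single-word case. Under those two assumptions, the per-word estimates multiply. The function name `combined_spam_score` and the numbers for the second word are hypothetical and only serve to illustrate the arithmetic.

```python
from math import prod


def combined_spam_score(word_estimates):
    """Combine per-word estimates (p, q) into one spam score.

    Assumes the word events are independent given the message class
    and that spam and non-spam are equally likely a priori.
    """
    spam_likelihood = prod(p for p, _ in word_estimates)
    ham_likelihood = prod(q for _, q in word_estimates)
    total = spam_likelihood + ham_likelihood
    if total == 0:
        raise ValueError("no evidence: every word has zero probability")
    return spam_likelihood / total


# First pair: the Rolex estimates from the earlier example.
# Second pair: made-up values for a hypothetical second word.
estimates = [(0.125, 0.005), (0.2, 0.05)]
score = combined_spam_score(estimates)
print(f"combined score = {score:.3f}")
```

With a single pair, the function gives the same value as r(w), so the one-word filter is the special case of a one-element list.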