we assume that p(S) = ½.
Bayesian Spam Filters using Multiple Words
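Example: We have 2000 spam messages and 1000 non‐spam messages. The word “stock” occurs in 400 of the spam messages and 60 of the non‐spam messages. The word “undervalued” occurs in 200 spam messages and 25 non‐spam messages. Estimate the probability that a message containing both words is spam.
Solution: p(stock) = 400/2000 = .2, q(stock) = 60/1000 = .06, p(undervalued) = 200/2000 = .1, q(undervalued) = 25/1000 = .025. Assuming that the two words occur independently and that p(S) = ½,
$$r(\text{stock}, \text{undervalued}) = \frac{p(\text{stock})\,p(\text{undervalued})}{p(\text{stock})\,p(\text{undervalued}) + q(\text{stock})\,q(\text{undervalued})} = \frac{(.2)(.1)}{(.2)(.1) + (.06)(.025)} = \frac{.02}{.0215} \approx .930$$
If our threshold is .9, we classify the message as spam and reject it, since .930 > .9.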
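The arithmetic in this example can be checked with a short Python sketch. It assumes p(S) = ½ and that the words occur independently, as in the example; the function name `spam_probability` and the variable names are illustrative choices, not part of the original slides.

```python
from math import prod


def spam_probability(p_values, q_values):
    """Estimate the probability that a message is spam.

    p_values[i] is the fraction of spam messages containing word i,
    q_values[i] is the fraction of non-spam messages containing word i.
    Assumes p(S) = 1/2 and independent word occurrences.
    """
    spam_term = prod(p_values)
    non_spam_term = prod(q_values)
    return spam_term / (spam_term + non_spam_term)


# Counts taken from the example: 2000 spam and 1000 non-spam messages.
p_stock = 400 / 2000
q_stock = 60 / 1000
p_undervalued = 200 / 2000
q_undervalued = 25 / 1000

THRESHOLD = 0.9  # threshold used in the example

r = spam_probability([p_stock, p_undervalued], [q_stock, q_undervalued])
is_spam = r > THRESHOLD
print(f"r = {r:.3f}, classified as spam: {is_spam}")
```

Because the function takes lists, the same sketch applies to any number of words under the same independence assumption.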
In general, the more words we consider, the more accurate the spam filter. With the independence assumption, if we consider k words w_1, ..., w_k:
$$r(w_1, \ldots, w_k) = \frac{\prod_{i=1}^{k} p(w_i)}{\prod_{i=1}^{k} p(w_i) + \prod_{i=1}^{k} q(w_i)}$$
We can further improve the filter by considering pairs of words as a single block or certain types of strings.
Section 6.4
Section Summary
Linearity of Expectations
Average‐Case Computational Complexity
Independent Random Variables
Chebyshev’s Inequality
Expected Value
Definition: The expected value (or expectation or mean) of the random variable X(s) on the sample space S is equal to
$$E(X) = \sum_{s \in S} p(s)\,X(s)$$
Example‐Expected Value of a Die: Let X be the number that comes up when a fair die is rolled. What is the expected value of X?
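Solution: The random variable X takes the values 1, 2, 3, 4, 5, or 6. Each has probability 1/6. It follows that
$$E(X) = \frac{1}{6}\cdot 1 + \frac{1}{6}\cdot 2 + \frac{1}{6}\cdot 3 + \frac{1}{6}\cdot 4 + \frac{1}{6}\cdot 5 + \frac{1}{6}\cdot 6 = \frac{21}{6} = \frac{7}{2}$$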
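The die computation can be reproduced directly from the definition of expected value as a sum of p(s)·X(s) over the sample space. The following is a minimal sketch using exact fractions; the helper name `expected_value` and the dictionary representation of the probability distribution are assumptions made for illustration.

```python
from fractions import Fraction


def expected_value(p, X):
    """Sum p(s) * X(s) over every outcome s in the sample space.

    p maps each outcome to its probability; X is the random variable,
    given as a function from outcomes to numbers.
    """
    return sum(p[s] * X(s) for s in p)


# Sample space of a fair die: outcomes 1..6, each with probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# X is the number that comes up, so X(s) = s.
e_die = expected_value(fair_die, lambda s: s)
print(e_die)  # 7/2
```

Using `Fraction` keeps the result exact, so it matches the hand computation rather than a rounded decimal.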
Theorem 1: If X is a random variable and p(X = r) is the probability that X = r, so that
$$p(X = r) = \sum_{s \in S,\; X(s) = r} p(s),$$
then
$$E(X) = \sum_{r \in X(S)} p(X = r)\, r.$$
Proof: Suppose that X is a random variable with range X(S) and let p(X = r) be the probability that X takes...
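Theorem 1 says that summing p(s)·X(s) over outcomes gives the same result as summing p(X = r)·r over the values r that X takes. A small numerical check, offered as an illustration rather than a proof, uses the sum of two fair dice; this example and all names in it are assumptions chosen for the sketch, not taken from the slides.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs of faces, equally likely.
outcomes = list(product(range(1, 7), repeat=2))
p = {s: Fraction(1, 36) for s in outcomes}


def X(s):
    """Random variable: the sum of the two faces."""
    return s[0] + s[1]


# Expected value from the definition: sum over outcomes.
by_outcome = sum(p[s] * X(s) for s in outcomes)

# Build p(X = r) by grouping outcomes that share the same value r.
dist = defaultdict(Fraction)
for s in outcomes:
    dist[X(s)] += p[s]

# Expected value in the form given by Theorem 1: sum over values.
by_value = sum(prob * r for r, prob in dist.items())

print(by_outcome, by_value)  # 7 7
```

The second form only needs the 11 possible sums instead of all 36 outcomes, which is why it is often the more convenient one to compute with.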
- Spring '08