Winter 2014/15 STAT 200
Chapters 14-17: Probability and Random Variables
Eugenia Yu, UBC Department of Statistics

Do you ever wonder how email spam filters work? In simple terms, algorithms are developed to screen incoming emails. Each email message is assigned a spam score, based on whether certain spammy words are present and on their positions and occurrences in relation to other words in the message. You can think of the score as a measure of how likely the message is to be spam.

A threshold is chosen such that messages whose spam score exceeds the threshold are classified as spam. Otherwise, the messages are classified as non-spam (or "ham"). Effective spam filters have a low false positive rate (non-spam misclassified as spam) and a low false negative rate (spam misclassified as non-spam).

Spam filtering methods are based on probability and statistical theory.
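The threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the scores, the true labels, and the thresholds are made-up values, the names `classify` and `error_rates` are hypothetical, and each error rate is assumed to be computed relative to the number of messages of that true type.

```python
def classify(score, threshold):
    """Label a message 'spam' if its score exceeds the threshold, else 'ham'."""
    return "spam" if score > threshold else "ham"


def error_rates(scores, truth, threshold):
    """Return (false positive rate, false negative rate) for one threshold.

    Assumption: the false positive rate is taken over the true non-spam
    messages, and the false negative rate over the true spam messages.
    """
    predicted = [classify(s, threshold) for s in scores]
    false_pos = sum(1 for p, t in zip(predicted, truth)
                    if p == "spam" and t == "ham")
    false_neg = sum(1 for p, t in zip(predicted, truth)
                    if p == "ham" and t == "spam")
    return false_pos / truth.count("ham"), false_neg / truth.count("spam")


# Hypothetical spam scores and true labels for six messages.
scores = [0.95, 0.80, 0.40, 0.10, 0.65, 0.30]
truth = ["spam", "spam", "spam", "ham", "ham", "ham"]

for threshold in (0.5, 0.7):
    fp_rate, fn_rate = error_rates(scores, truth, threshold)
    print(threshold, fp_rate, fn_rate)
```

With these made-up values, raising the threshold from 0.5 to 0.7 removes the one false positive without changing the false negatives. In general, moving the threshold trades one kind of error against the other.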

Any incoming email message is either true spam or non-spam, but we cannot predict its kind with certainty until the message arrives. This is an example of a random phenomenon.

What is the chance that the next email message is spam? There is a probability associated with each possible outcome (spam or non-spam).
Probability concepts (Chapter 14)

A sample space S is the set of all possible outcomes of a random phenomenon.
e.g., For tossing a coin, the sample space is the set {Head, Tail}. For rolling a die, the sample space is the set {1, 2, 3, 4, 5, 6}.

An event is an outcome or a set of outcomes of a random phenomenon. We denote an event by an uppercase letter, e.g., A, B, C.
e.g., Tossing a head is an event. Tossing a tail is another event. Tossing two heads in two tosses is also an event.
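As a rough illustration of these definitions, sample spaces and events can be represented as Python sets, with an event being a subset of the sample space. The variable names below are hypothetical, and the sample space for two tosses is assumed to record the order of the tosses.

```python
from itertools import product

# Sample spaces from the examples above.
coin = {"Head", "Tail"}
die = {1, 2, 3, 4, 5, 6}

# Sample space for two coin tosses (assumption: order of tosses is recorded).
two_tosses = set(product(["Head", "Tail"], repeat=2))

# Events are sets of outcomes drawn from the sample space.
two_heads = {("Head", "Head")}
at_least_one_head = {outcome for outcome in two_tosses if "Head" in outcome}

print(sorted(two_tosses))
print(two_heads <= two_tosses)          # an event is a subset of the sample space
print(len(at_least_one_head))
```

An event can contain a single outcome (two heads) or several outcomes (at least one head).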

The notation P(A) denotes the probability that an event A will occur.

Properties of P(A):
1. 0 ≤ P(A) ≤ 1
   P(A) = 0 implies event A is impossible.
   P(A) = 1 implies event A is certain.
   The larger P(A) is, the more likely it is that event A will occur.
2. The probabilities of non-overlapping events that together make up the whole sample space sum to 1.
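The two properties can be checked numerically on a small example. The sketch below assumes a fair six-sided die, so each face is given probability 1/6; that assumption and the helper name `prob` are illustrative only.

```python
from fractions import Fraction

# Assumption: a fair die, so each outcome has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def prob(event, dist):
    """P(event): add up the probabilities of the outcomes in the event."""
    return sum((dist[outcome] for outcome in event), Fraction(0))


# Property 1: every probability lies between 0 and 1.
assert all(0 <= p <= 1 for p in die.values())

# Property 2: the non-overlapping outcomes of the sample space sum to 1.
assert sum(die.values()) == 1

even = {2, 4, 6}
odd = {1, 3, 5}
print(prob(even, die))                     # 1/2
print(prob(even, die) + prob(odd, die))    # 1, since even and odd cover the sample space
print(prob(set(), die))                    # 0, the impossible event
```

Exact fractions are used instead of floating-point numbers so that the sum is exactly 1.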
Example 1

Consider a spam filter used to screen 1000 incoming email messages:

                          True spam   True non-spam   Total
Classified as spam           570             5          575
Classified as non-spam        30           395          425
Total                        600           400         1000

An email message is randomly chosen from the 1000 messages. What is the probability that

