ClassNotesComplete - STA 3032 (7661) Engineering Statistics...
STA 3032 (7661) Engineering Statistics
Rob Gordon, University of Florida, Fall 2011

Introduction: Sampling & Descriptive Statistics

Definition: A population is the entire collection of objects or outcomes about which information is sought. Examples: the entire United States, the entire state of Florida, all UF students.

Definition: A sample is a subset of the population, containing the objects or outcomes that are actually observed.

A common question might be: "How do I know if a sample is truly representative of its population?" Ideally, we select the members of the sample in the most unbiased way possible. Throughout the rest of this course we will assume our samples follow the definition of a simple random sample:

Definition: A simple random sample of size n is a sample chosen by a method in which every collection of n population items is equally likely to comprise the sample (as in a lottery).

Summary statistics help the important features of a sample stand out.

Definition: Let x_1, x_2, ..., x_n denote the numbers in a sample. The sample mean is

    x̄ = (1/n) Σ_{i=1}^n x_i.

Definition: The sample variance is

    s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)² = (1/(n−1)) (Σ_{i=1}^n x_i² − n x̄²).

Definition: The sample standard deviation is s = √(s²).

Definition: Outliers are points in the sample that are much smaller or larger than the rest. Outliers often result from data-entry errors (e.g. a misplaced decimal point) and can present many problems for statisticians (more on this later).
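The summary statistics defined above are easy to compute by hand or in code. Below is an illustrative sketch in Python (the course itself uses R, so this is only a convenience; the sample data and all variable names are made up for the example):

```python
# Hypothetical sample, chosen only to illustrate the formulas above.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(sample)

xbar = sum(sample) / n                               # sample mean x̄
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # sample variance s²
# Equivalent "shortcut" form from the second equality in the definition:
s2_shortcut = (sum(x ** 2 for x in sample) - n * xbar ** 2) / (n - 1)
s = s2 ** 0.5                                        # sample standard deviation
```

Both forms of the variance give the same value, which is why the shortcut form is often used for hand computation.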
Caution: only delete an outlier if it exists due to error!

Definition: The sample median is the numerical value separating the higher half of a sample from the lower half:

    x̃ = x_{(n+1)/2}                  if n is odd,
    x̃ = (1/2)(x_{n/2} + x_{n/2+1})   if n is even.

The median divides the sample into halves, while quartiles divide the data into quarters. Let

    Q1 = 1st quartile = the value greater than 25% of all data points,
    Q2 = 2nd quartile = the value greater than 50% of all data points,
    Q3 = 3rd quartile = the value greater than 75% of all data points.

Note: sometimes quartiles are not numbers in the sample.

Definition:

    Q1 = x_{0.25(n+1)}                           if 0.25(n+1) is an integer,
    Q1 = average of the values above and below   otherwise.

Example 1: Sample = {1, 2, 3, 4, 5, 6, 7}.
    Q1 = x_{0.25(7+1)} = x_2 = 2.

Example 2: Sample = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
    Q1 = x_{0.25(10)} = x_{2.5} = (1/2)(x_2 + x_3) = (1/2)(2 + 3) = 2.5.

Similarly,

Definition: Q2 = median, and

    Q3 = x_{0.75(n+1)}                           if 0.75(n+1) is an integer,
    Q3 = average of the values above and below   otherwise.

We are not restricted to 25, 50 and 75%.

Definition: The pth percentile of a sample, for a number p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are less than the pth percentile. It is calculated as x_{(p/100)(n+1)}.

Sometimes it's nice to get a visual picture of the data.

Stem & Leaf Plot: each item is divided into 2 parts:
  1. Stem: the leftmost 1 or 2 (usually 1) digits.
  2. Leaf: the next digit.

Example: Sample = {400, 410, 411, 550, 600, 612, 613}.

    Stem (hundreds) | Leaf (tens)
    4               | 0 1 1
    5               | 5
    6               | 0 1 1

Boxplots: graphs presenting the median, Q1, Q3 and outliers. We previously described outliers as really big or really small. What do we mean by really big or really small?

Definition: Interquartile range (IQR) = Q3 − Q1.

Definition: An outlier is any point in the sample that is more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1.

[Figure: a sample boxplot titled "Sample Boxplot", with a vertical axis running from 21 to 27.]

The previous graph was made using R (http://www.r-project.org/). For your convenience, the code used to generate the plot is below. (The original slide passed a file argument to boxplot(), which boxplot() does not accept; in R a plot is written to a file by opening a graphics device first.)

    > pdf("boxplot.pdf")
    > x = rnorm(100, 24)
    > boxplot(x, main="Sample Boxplot")
    > dev.off()

You will never be tested on the specifics of R code. Slides like these are provided only as a convenience.

There are other ways to visually represent data. Some examples are dot plots and histograms. Please read chapter one of your textbook.

Chapter 4: Probability

Pierre-Simon Laplace: "The most important questions of life are indeed, for the most part, really only problems of probability."

Definition: An experiment is a process whose outcomes cannot be predicted in advance with absolute certainty.

Definition: The set of all possible outcomes of an experiment is called the sample space of the experiment. The sample space is often denoted by S.
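Returning briefly to the descriptive statistics above: the x_{p(n+1)} quartile rule and the 1.5 × IQR outlier rule can be sketched in a few lines of Python (an illustrative helper, not from the notes; positions are 1-based as on the slides, and the rule is valid when p(n+1) lands between 1 and n):

```python
def quantile(xs, p):
    # x_{p(n+1)} rule: exact order statistic if p(n+1) is an integer,
    # otherwise the average of the neighbouring order statistics.
    xs = sorted(xs)
    pos = p * (len(xs) + 1)            # 1-based position
    if pos == int(pos):
        return xs[int(pos) - 1]
    return (xs[int(pos) - 1] + xs[int(pos)]) / 2

def outliers(xs):
    # 1.5 * IQR rule from the boxplot discussion above.
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

q1_ex1 = quantile([1, 2, 3, 4, 5, 6, 7], 0.25)        # Example 1 above
q1_ex2 = quantile([1, 2, 3, 4, 5, 6, 7, 8, 9], 0.25)  # Example 2 above
flagged = outliers([1, 2, 3, 4, 5, 6, 100])           # a made-up sample
```

The two quartile examples reproduce the values worked out on the slides, and the obviously wild value 100 is the only point flagged.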
Definition: The empty set or null set is a set containing zero elements. You may see it denoted {} or ∅.

Definition: A subset of a sample space is called an event.

Example: Roll one 6-sided die. S = {1, 2, 3, 4, 5, 6}. Let E be the event "I roll an even number." Then E = {2, 4, 6}.

Is the empty set an event? Yes, since the empty set is a subset of every set.

Set Theory

To understand probability on a basic level, some discussion of set theory is needed. Some definitions:

Definition: A set is a list or collection of objects.

Definition: Let A and B be two arbitrary events defined on the same sample space. The union of A and B, denoted A ∪ B, is the event containing all elements of both A and B.

Example: S = {1, 2, 3, 4, 5, 6}, E = {2, 4, 6}, A = {1, 2}.
    E ∪ A = {1, 2, 4, 6}

Definition: Let A and B be two arbitrary events defined on the same sample space.
The intersection of A and B, denoted A ∩ B (or AB), is the set of outcomes that belong to both A and B.

Example: S = {1, 2, 3, 4, 5, 6}, E = {2, 4, 6}, A = {1, 2}.
    E ∩ A = {2}

Definition: Let A be an event defined on some sample space. The complement of an event A, denoted Ā (also A^c and A′), is the set of outcomes in the sample space not belonging to A.

Example: S = {1, 2, 3, 4, 5, 6}, E = {2, 4, 6}.
    Ē = {1, 3, 5}

The previous definitions are examples of set operations. Venn diagrams help illustrate these.

Definition: Let A and B be two events defined on a sample space.
A and B are said to be mutually exclusive if they have no outcomes in common, i.e. AB = ∅.

Probability

Definition: The probability of an event is a quantitative measure of how likely the event is to occur. Given an experiment and some event A defined on a sample space:
    P(A) denotes the probability that event A occurs.
    P(A) is the proportion of times event A would occur in the long run, if the experiment were repeated over and over again.

Consider a regular two-sided coin. If I flip it 10 times, how many heads will I get? What if I flip it 100 times? 1000?

Let S denote the sample space.

Axioms of Probability
  1. P(S) = 1.
  2. For any event A, 0 ≤ P(A) ≤ 1.
  3. If A and B are mutually exclusive events, P(A ∪ B) = P(A) + P(B).

From these axioms we can say

    P(A^c) = 1 − P(A),
    P(∅) = 0.

Why?

What if A and B are not mutually exclusive?

Theorem: Given two events A and B defined on some sample space S,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).        (1)

The theorem above can be proven using set theory, but it is quicker to think it through with a Venn diagram.

Example: Let E be the event that a new car requires engine work, and T be the event that it requires transmission work. Assume that P(E) = 0.10, P(T) = 0.02, P(E ∩ T) = 0.01. Find the probability that the car needs:
  1. either E or T or both.
  2. neither E nor T.
  3. E but not T.

Answers:

  1. P(either E or T or both) = P(E ∪ T) = P(E) + P(T) − P(E ∩ T) = 0.10 + 0.02 − 0.01 = 0.11.

  2. P(neither E nor T) = P((E ∪ T)^c) = 1 − P(E ∪ T) = 1 − 0.11 = 0.89.

  3. P(E but not T) = P(E ∩ T^c) = P(E) + P(T^c) − P(E ∪ T^c) = 0.10 + 0.98 − P(E ∪ T^c).

How do we calculate P(E ∪ T^c)? First think about what we mean by the union of E and T^c. Remember that they are just two symbols representing sets; the union is the set containing all elements found in either E or T^c. If we draw this in a Venn diagram, the region we are concerned with contains all that is not T in addition to all of E, so

    P(E ∪ T^c) = 1 − P(T) + P(E ∩ T).

Now put it all together. We get:

    P(E ∩ T^c) = P(E) + P(T^c) − P(E ∪ T^c)
               = 0.10 + 0.98 − (1 − P(T) + P(E ∩ T))
               = 0.10 + 0.98 − (1 − 0.02 + 0.01)
               = 0.09.

Calculating Probabilities

Let A be an event defined on some sample space S. The classical definition of probability is the following:

    P(A) = (# of ways A occurs) / (# of possible outcomes) = (# of elements in A) / (# of elements in S).

Sometimes it is difficult (and often tedious) to list the items in an event. It is easier instead to count the number of ways the event occurs.

Homework: Sections 4.1 and 4.2, all odd-numbered problems (not to be handed in).
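The arithmetic in the car example is easy to check numerically. A short illustrative sketch in Python (the probabilities are the ones given in the example; the variable names are made up):

```python
# Given probabilities from the engine/transmission example.
P_E, P_T, P_ET = 0.10, 0.02, 0.01

p_union = P_E + P_T - P_ET             # 1. P(E ∪ T), inclusion-exclusion
p_neither = 1 - p_union                # 2. P((E ∪ T)^c), complement rule
p_E_union_Tc = 1 - P_T + P_ET          # P(E ∪ T^c), from the Venn argument
p_E_not_T = P_E + (1 - P_T) - p_E_union_Tc   # 3. P(E ∩ T^c)
```

The three answers come out to 0.11, 0.89 and 0.09, matching the worked solution above.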
Counting Methods

The Fundamental Principle of Counting

Theorem: Assume k operations are performed. If there are n_1 ways to perform the 1st operation, n_2 ways to perform the 2nd, ..., and n_k ways to perform the kth operation, then the total number of ways to perform the sequence of k operations is

    Π_{i=1}^k n_i = n_1 · n_2 · · · n_k.

Example: How many ways can I flip a coin and roll a six-sided die? How many ways can I flip a head and roll an even number? What is the probability that I flip a head and roll an even number?

Answer:
    (2 ways to flip a coin)(6 ways to roll a die) = 12 ways to do both.
    (1 way to flip a head)(3 ways to roll an even number) = 3 ways to do both.
    P(flip head & roll even number) = 3/12 = 1/4.

Sometimes an event is described as the number of ways a collection of objects is arranged.

Definition: A permutation is an ordering of a collection of objects.

Example: There are 6 permutations of the letters ABC:

    ABC  ACB  BAC  BCA  CAB  CBA

What if we have more than 3 objects to arrange? What if we have 1000? It is significantly more difficult to list all the possibilities. We can derive a formula using the Fundamental Principle of Counting.
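The coin-and-die example can be verified by brute force, enumerating the whole sample space. An illustrative Python sketch (the course uses R; this is only a convenience):

```python
from itertools import product

# All (coin, die) pairs: 2 * 6 = 12 outcomes, per the counting principle.
outcomes = list(product(["H", "T"], range(1, 7)))

# Favourable outcomes: a head together with an even roll.
favourable = [(c, d) for c, d in outcomes if c == "H" and d % 2 == 0]

prob = len(favourable) / len(outcomes)   # classical definition of probability
```

The enumeration confirms 12 total outcomes, 3 favourable ones, and a probability of 3/12 = 1/4.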
How many permutations exist for a collection of n objects? Think about the problem as placing n objects in n places.
    How many ways can you place an object in the first place? n.
    How many ways can you place an object in the second place? n − 1.
    ...
    How many ways can you place an object in the last place? 1.

Using the Fundamental Principle of Counting, the number of permutations is n!.

How many ways can we order k objects from n total?

Example: A basketball coach needs to choose 5 players from a roster of 12 to be starters. The coach is awful and wants to randomly choose a starting lineup by picking 5 numbers out of a hat. Also assume that order matters, so that the 1st player chosen plays point guard, the 2nd player is shooting guard, etc. How many starting lineup permutations are there?

Consider a method similar to how we answered the last question:
    How many ways can the coach assign a player from his roster to the 1st starting spot? 12.
    How many ways can the coach assign a player to the 2nd starting spot? 11.
    ...
    How many ways can the coach assign a player to the 5th starting spot? 8.

Then the number of starting lineup permutations is (12)(11)(10)(9)(8). Based on this reasoning, can we come to some conclusion for general n and k?

Definition: The number of ordered arrangements (permutations) of k objects selected from n distinct objects (k ≤ n) is

    nPk = n! / (n − k)!        (2)

Notice that

    (12)(11)(10)(9)(8) = (12)(11)(10)(9)(8)(7)(6)(5)(4)(3)(2)(1) / ((7)(6)(5)(4)(3)(2)(1)) = 12!/7! = 12!/(12 − 5)!.

Remember, permutations require that the order of objects is of particular importance. What if we want to pick objects with no regard to order?
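The lineup count can be checked both with the formula and by brute-force enumeration. An illustrative Python sketch (again only a convenience; the helper name n_p_k is made up):

```python
from itertools import permutations
from math import factorial

def n_p_k(n, k):
    """Number of ordered arrangements of k objects chosen from n: n!/(n-k)!."""
    return factorial(n) // factorial(n - k)

lineups = n_p_k(12, 5)                              # (12)(11)(10)(9)(8)
brute = sum(1 for _ in permutations(range(12), 5))  # enumerate every lineup
abc = len(list(permutations("ABC")))                # the 6 orderings of ABC
```

Both approaches give 95040 possible starting lineups, and the ABC example from earlier comes out to 3! = 6.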
Definition: Each distinct group of objects that can be selected, without regard to order, is called a combination.

Back to our basketball coach example: how many combinations of 5 from 12 can we choose? For each group of 5, we count it only once, regardless of how many permutations of it exist. So

    # of combinations of starting 5 = (# of permutations of 5 from 12) / (# of permutations of 5 from 5).

Based on this reasoning, can we come to some conclusion for general n and k?

Definition: The number of distinct subsets, or combinations, of size k that can be selected from n distinct objects (k ≤ n) is

    C(n, k) = (n choose k) = n! / (k!(n − k)!)        (3)

What if we need to partition into 3 or more groups instead of 2?

Definition: The number of ways of partitioning n distinct objects into k groups containing n_1, n_2, ..., n_k objects, respectively, is

    n! / (n_1! n_2! · · · n_k!),   where Σ_{i=1}^k n_i = n.        (4)

To Review

If we need to count the number of ways to arrange objects from a larger set of objects:
    Use permutations if order matters.
    Use combinations if order doesn't matter.
Use partitioning if we want combinations for groups of 3 or more. Rob Gordon (University of Florida) STA 3032 (7661) Fall 2011 34 / 251 Counting Methods Example: A friend owns 10 Jazz CDs, 5 rap CDs, 6 Metal CDs. He picks 3 CDs to bring on a car ride in a completely random way. 1 What is the probability that he brings 1 of each? 2 What is the probability he brings 2 Rap and 1 Metal? Answer: Rob Gordon (University of Florida) STA 3032 (7661) Fall 2011 35 / 251 Counting Methods Example: A friend owns 10 Jazz CDs, 5 rap CDs, 6 Metal CDs. He picks 3 CDs to bring on a car ride in a completely random way. 1 What is the probability that he brings 1 of each? 2 What is the probability he brings 2 Rap and 1 Metal? Answer: P (1 each) = Rob Gordon (University of Florida) # ways to pick 1 each # ways to pick any 3 STA 3032 (7661) Fall 2011 35 / 251 Counting Methods Example: A friend owns 10 Jazz CDs, 5 rap CDs, 6 Metal CDs. He picks 3 CDs to bring on a car ride in a completely random way. 1 What is the probability that he brings 1 of each? 2 What is the probability he brings 2 Rap and 1 Metal? Answer: P (1 each) = Rob Gordon (University of Florida) # ways to pick 1 each (# 1J)(# 1R)(# 1M) = # ways to pick any 3 # any 3 STA 3032 (7661) Fall 2011 35 / 251 Counting Methods Example: A friend owns 10 Jazz CDs, 5 rap CDs, 6 Metal CDs. He picks 3 CDs to bring on a car ride in a completely random way. 1 What is the probability that he brings 1 of each? 2 What is the probability he brings 2 Rap and 1 Metal? Answer: P (1 each) = = Rob Gordon (University of Florida) # ways to pick 1 each (# 1J)(# 1R)(# 1M) = # ways to pick any 3 # any 3 10 1 5 1 21 3 6 1 STA 3032 (7661) Fall 2011 35 / 251 Counting Methods Example: A friend owns 10 Jazz CDs, 5 rap CDs, 6 Metal CDs. He picks 3 CDs to bring on a car ride in a completely random way. 1 What is the probability that he brings 1 of each? 2 What is the probability he brings 2 Rap and 1 Metal? 
Answer:

  P(1 each) = (# ways to pick 1 each) / (# ways to pick any 3)
            = (# 1J)(# 1R)(# 1M) / (# any 3)
            = C(10,1) C(5,1) C(6,1) / C(21,3)
            = (10)(5)(6) / [(21)(20)(19) / ((3)(2))]
            = 300/1330 = 30/133 ≈ 0.23

  P(2R + 1M) = C(10,0) C(5,2) C(6,1) / C(21,3)
             = [(5)(4)/2](6) / [(21)(20)(19) / ((3)(2))]
             = 60/1330 = 6/133 ≈ 0.05

Homework: Section 4.3 - all (not to be handed in).

Conditional Probability & Independence

Recall that when we find a probability, we do it in terms of all possible outcomes (i.e. in reference to the entire sample space). If we know some additional information, we effectively reduce the sample space:

  P(A) = (Area of A) / (Area of S), but what if we are told that B has already occurred?

If B occurs, we may want to modify our calculation of A's probability.
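As an aside, the CD probabilities worked out above are easy to double-check numerically. A minimal sketch using Python's standard library (the variable names are my own, not the course's):

```python
from math import comb

total = comb(21, 3)  # ways to pick any 3 of the 21 CDs

# P(1 Jazz, 1 Rap, 1 Metal)
p_one_each = comb(10, 1) * comb(5, 1) * comb(6, 1) / total

# P(2 Rap, 1 Metal); the comb(10, 0) factor for zero Jazz is just 1
p_two_rap_one_metal = comb(10, 0) * comb(5, 2) * comb(6, 1) / total
```

Both values round to the 0.23 and 0.05 reported above.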
In this case we are only concerned with the instances when A occurs at the same time that B occurs. We keep track of the occurrence of B when we write:

Definition: Conditional Probability: P(A|B) = P(A ∩ B) / P(B).
Note: Read P(A|B) as "P(A given B)."

From the definition of conditional probability we define the notion of independence:

Definition: Two events A and B are independent if the probability of each event remains the same whether or not the other occurs, i.e. A and B are independent if P(B|A) = P(B) and P(A|B) = P(A).

Caution: Independence does not mean mutually exclusive. Why?

As a consequence of A and B being independent, we can say P(A ∩ B) = P(A)P(B). Why?
  P(A ∩ B) / P(B) = P(A|B) = P(A)  ⇒  P(A ∩ B) = P(A)P(B)

Back to a previous example: What is the probability of flipping a coin and seeing a Head, and rolling a 6-sided die and getting an even number?

  P(flip head & roll even) = P(flip head) P(roll even)   (since the two outcomes are independent)
                           = (1/2)(1/2) = 1/4

Homework: Section 4.4 - all odd-numbered problems (do not hand in).

More consequences of Conditional Probability

Slight problem: it is often difficult to know P(A ∩ B). If this is the case and A and B are dependent, how do we calculate conditional probabilities?
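As a quick aside, the coin-and-die claim above can be verified by brute force, enumerating the 12 equally likely joint outcomes (a sketch; the variable names are mine):

```python
from itertools import product

# Joint sample space of one coin flip and one die roll: 12 equally likely pairs
space = list(product("HT", range(1, 7)))

p_head = sum(1 for c, d in space if c == "H") / len(space)
p_even = sum(1 for c, d in space if d % 2 == 0) / len(space)
p_head_and_even = sum(1 for c, d in space if c == "H" and d % 2 == 0) / len(space)

# Independence: the joint probability factors into the product of the marginals
factors = abs(p_head_and_even - p_head * p_even) < 1e-12
```

Here `p_head_and_even` comes out to 3/12 = 1/4, exactly `p_head * p_even`.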
Solution: Notice that P(A|B) = P(A ∩ B)/P(B) and P(B|A) = P(A ∩ B)/P(A). We can manipulate these formulas: replace P(A ∩ B) with P(B|A)P(A) and say

  P(A|B) = P(B|A)P(A) / P(B).

We can take this even further. What if we don't know P(B)?

The Law of Total Probability and Bayes' Rule

Consider the following example: In a factory, many machines make lightbulbs. Each machine makes the same lightbulb, and each machine makes a certain percentage of defective lightbulbs. Let Ai be the event that a lightbulb is from machine i, and let B be the event that a defective item is produced. The entire sample space is partitioned according to the number of machines (the Ai's) in the factory (10 in this case); in the Venn diagram, the blue oval represents the event B that a lightbulb is defective.

Intuitively we see that

  P(B) = Σ_{j=1}^{10} P(pieces of B) = Σ_{j=1}^{10} P(Aj ∩ B)       (Law of Total Probability)
       = Σ_{j=1}^{10} P(B|Aj) P(Aj)

Now consider what we've done so far.
We have shown

Definition: Bayes' Rule:

  P(Ai|B) = P(B|Ai) P(Ai) / Σ_{j=1}^{n} P(B|Aj) P(Aj)       (5)

Proof: Given the past few slides, we can rewrite our previous formula:

  P(Ai|B) = P(Ai ∩ B) / P(B)
          = P(B|Ai) P(Ai) / P(B)                            (Def. of Conditional Probability)
          = P(B|Ai) P(Ai) / Σ_j P(B|Aj) P(Aj)               (Law of Total Probability)

Bayes' Rule Example: Suppose a factory has 3 machines. Machine 1 (M1) makes 40% of the lightbulbs and has a 5% defect rate. M2 makes 35% at a 6% defect rate, and M3 makes 25% at an 8% defect rate. A quality assurance manager selects 1 lightbulb from a pile of defects. What is the probability it is from M1?

Answer: Another way to ask the question is: "What is the probability the lightbulb is from M1, given it is defective?" Let D denote the event that a lightbulb is defective.
  P(M1|D) = P(M1 ∩ D) / P(D)       (but we aren't given these explicitly)
          = P(D|M1) P(M1) / P(D)   (closer, but we still don't know P(D))
          = P(D|M1) P(M1) / [P(D|M1) P(M1) + P(D|M2) P(M2) + P(D|M3) P(M3)]
Answer: plugging in the given rates,

  P(M1|D) = P(D|M1) P(M1) / [P(D|M1) P(M1) + P(D|M2) P(M2) + P(D|M3) P(M3)]
          = (0.05)(0.4) / [(0.05)(0.4) + (0.06)(0.35) + (0.08)(0.25)]
          = 0.020 / 0.061 ≈ 0.33

The hardest part is always determining what you know versus what you need to find. The only way to get good at this is to practice!

End of Chapter 4

Homework: read Example 4.39 and do all problems from Section 4.5 (at the very least, read the example and do 4.37, 4.39, 4.40, 4.41, 4.43, 4.47). We will skip Section 4.6 (Odds and Odds Ratios) for now; we may come back to it later in the semester if time allows.

Quiz Announcement: Friday, September 2nd. The quiz will take place during the first 20 minutes (or so) of class; a lecture will follow once the quiz is complete. Covers Chapters 1 and 4.

Chapter 5: Discrete Probability Distributions

Dr. Hani Doss: "A Random Variable is neither random nor a variable."

Definition: A random variable (RV) assigns a numerical value to each outcome in a sample space, and is denoted by a capital letter (usually from near the end of the alphabet, e.g. X, Y, Z, U, V, etc.).

You can think of a random variable as a function (or map, if you prefer) from a sample space to a number, e.g. X : S → R. It is easier to think of an RV as a "regular" variable that assumes certain values with some element of chance involved.

Simple Examples:
- Rolling a 6-sided die: S = {1, 2, 3, 4, 5, 6} → {1, 2, 3, 4, 5, 6}
- Flipping a coin: S = {H, T} → {0, 1}
- Rolling two 6-sided dice: S = {(1,1), (1,2), ..., (6,6)} → {2, 3, ..., 12}

Methods for working with random variables are more or less the same if you use some advanced mathematics, but we will make a distinction between two types of random variables: discrete and continuous.
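Before moving on: Bayes' rule calculations like the lightbulb example above are mechanical enough to script. A minimal sketch (the dictionary labels M1-M3 are my own):

```python
# Machine shares (priors) and per-machine defect rates (likelihoods),
# taken from the example above.
prior = {"M1": 0.40, "M2": 0.35, "M3": 0.25}
defect_rate = {"M1": 0.05, "M2": 0.06, "M3": 0.08}

# Law of Total Probability: P(D)
p_defect = sum(defect_rate[m] * prior[m] for m in prior)

# Bayes' Rule: P(machine | defective) for every machine at once
posterior = {m: defect_rate[m] * prior[m] / p_defect for m in prior}
```

`posterior["M1"]` comes out near 0.328, matching the 0.33 above, and the three posteriors sum to 1, as they must.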
Definition: A random variable is discrete if its possible values form a discrete set. By "discrete" set we mean there are gaps between items in the set.

Examples: {1.5, 2.4, 35, 50.3}; the set of all integers; the 3 examples above.

Remember: we are still talking about computing probabilities.

Definition: The probability mass function (pmf) of a discrete random variable X is the function p(x) = P(X = x). You might also see a pmf denoted as f(x).

Definition: The cumulative distribution function (cdf) of X is the function

  F(x) = P(X ≤ x) = Σ_{t ≤ x} p(t) = Σ_{t ≤ x} P(X = t)       (6)

Furthermore, the above are defined such that

  Σ_x p(x) = Σ_x P(X = x) = 1.       (7)

Oh man, what? This is a radical shift in the way we think about calculating probabilities. Once you get your head wrapped around everything, you'll see that redefining probability this way actually makes things much easier. Think of an RV as representing a population, and specific observations of an RV as numbers in our sample. Recall that on the first day of class we talked about things like the mean and variance of a sample. We can talk about the means and variances of random variables as well.

Expectation Example: Think about the 6-sided die. The mean of the numbers is

  x̄ = (1/6) Σ_i x_i = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5

Think about it another way: each face of the die has a 1/6 chance of coming up.
  x̄ = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6)
    = (1)P(X = 1) + (2)P(X = 2) + (3)P(X = 3) + (4)P(X = 4) + (5)P(X = 5) + (6)P(X = 6)
    = Σ_x (possibilities)(probabilities)

Expectation and Variance

Definition: Let X be an RV with pmf P(X = x). The mean or expectation of X is denoted by µ (or µX) and is given by

  µ = Σ_x x P(X = x)       (8)

You may also see the expectation referred to as the expected value; other symbols denoting it are EX, E(X), and E[X].

Definition: Let X be a discrete RV.
The variance is denoted by σ² (or σ²_X) and is given by

  σ² = Var(X) = Σ_x x² P(X = x) − µ² = E[X²] − (E[X])²       (9)

Homework: Show E[(X − µ)²] = E[X²] − (E[X])².

Linear Operators

Definition: We say L is a linear operator if, for any functions f and g and any constants a and b,

  L[af + bg] = aL[f] + bL[g]

Examples from previous courses: ∂/∂x, ..., ... From this course: E[·].

What about Var(·)? Is that a linear operator too? Nope — prove it for homework as an exercise.

Also for homework: read Example 5.3 from Section 5.1, and do #5.13 and 5.15.

Types of Discrete Distributions

We will model our experiment with a random variable depending on the type of experiment.

Section 5.3: Bernoulli Distribution

Imagine an experiment that results in 1 of 2 outcomes, one labelled "Success" and the other "Failure." This is called a Bernoulli trial. We define a random variable X as

  X = 1 on success, 0 on failure.

X is a discrete RV with pmf defined by P(X = 1) = p, P(X = 0) = 1 − p.
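Definitions (8) and (9) translate directly into code. A sketch that computes µ and σ² from any finite pmf, applied to the fair die and to the Bernoulli pmf just defined (the function names are my own):

```python
def mean(pmf):
    """mu = sum over x of x * P(X = x), formula (8)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """sigma^2 = E[X^2] - (E[X])^2, formula (9)."""
    return sum(x**2 * p for x, p in pmf.items()) - mean(pmf) ** 2

die = {x: 1 / 6 for x in range(1, 7)}  # fair 6-sided die
bern = {1: 1 / 6, 0: 5 / 6}            # Bernoulli: success = rolling a 6
```

`mean(die)` returns 3.5, as in the die example above, and `variance(bern)` agrees with p(1 − p) for p = 1/6.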
Examples:
- Fair coin: P(X = 1) = P(X = 0) = 1/2.
- Let a success be rolling a 6 on a 6-sided die and any other number a failure, i.e.
  X = 1 if we roll a 6, 0 if we roll a number in {1, 2, 3, 4, 5}
  p = P(X = 1) = 1/6,  1 − p = P(X = 0) = 5/6

  EX = Σ_x x P(X = x) = 1·P(X = 1) + 0·P(X = 0) = P(X = 1) = p

  Var(X) = E[X²] − (EX)² = P(X = 1) − (P(X = 1))² = p − p² = p(1 − p)

Discrete Uniform Distribution

Suppose we have an experiment where each outcome is equally likely, and that there are k possible outcomes. Then the pmf is defined the following way:

  P(X = x) = 1/k,  where x ∈ {1, 2, ..., k}

Then the expectation and variance are given by:

  EX = Σ_{x=1}^{k} x P(X = x) = (1/k) Σ_{x=1}^{k} x       (10)
     = (1/k) · k(k + 1)/2 = (k + 1)/2       (11)

  Var(X) = Σ_{x=1}^{k} x² P(X = x) − ((k + 1)/2)² = ··· = (k² − 1)/12       (12)

Examples: 6-sided die, fair coin, etc.

More Discrete Distributions

Suppose we have a situation where we are flipping a coin more than once. I might ask: what is the probability that I get k heads in n flips? We have a distribution for that.

Suppose a total of n Bernoulli trials are conducted and
- the trials are independent,
- each trial has the same success probability p,
- the RV X represents the # of successes in n trials,
- the number n is fixed and known when the experiment starts.

Then X has the binomial distribution with parameters n and p, denoted X ∼ Bin(n, p).

Binomial Distribution

What is the pmf of the binomial distribution? Say we go back to coin tossing. We toss the coin (n =) 3 times. Let's say we want P(X = 2), i.e.
the probability of seeing (k =) 2 successes (we define heads as success, where P(H) = p for this example). We could get 2 heads a few ways: HHT, HTH, THH. So

  P(X = 2) = P(HHT or HTH or THH)
           = P(HHT) + P(HTH) + P(THH)       (why?)
           = P(H)P(H)P(T) + P(H)P(T)P(H) + P(T)P(H)P(H)
           = p²(1 − p) + p(1 − p)p + (1 − p)p² = 3p²(1 − p)

More generally speaking, what if we have n trials and we want P(X = k)?
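Before generalizing, the n = 3, k = 2 result can be confirmed by enumerating all 8 toss sequences (a sketch; the value of p here is arbitrary):

```python
from itertools import product

p = 0.3  # an arbitrary head probability, just for the check

p_two_heads = 0.0
for seq in product("HT", repeat=3):   # all 8 sequences of 3 tosses
    if seq.count("H") == 2:           # exactly 2 successes: HHT, HTH, THH
        prob = 1.0
        for toss in seq:
            prob *= p if toss == "H" else 1 - p
        p_two_heads += prob
```

`p_two_heads` agrees with 3p²(1 − p) no matter which p you pick.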
From our previous example of n = 3, k = 2 we saw that

  P(X = 2) = 3p²(1 − p) = (constant) p^k (1 − p)^(n−k)

It's tempting to say the constant = n, but that's not always the case. Let's figure out what the constant is. For a binomial distribution we are not asking whether the successes and failures come in a certain order; we only care about the total number of successes. So the number of arrangements of successes and failures is given by C(n, k).

Definition: The pmf of the binomial distribution is given by the following:

  f(x; n, p) = P(X = x) = C(n, x) p^x (1 − p)^(n−x),  where x ∈ {0, 1, ..., n}       (13)

Special case of the binomial distribution: n = 1 gives P(X = x) = p^x (1 − p)^(1−x), the Bernoulli distribution. It turns out that a binomial RV is a sum of independent, identically distributed Bernoulli random variables (we'll talk more about this in detail in Chapter 7). Also notice that, by the Binomial Theorem,

  Σ_{x=0}^{n} P(X = x) = (p + 1 − p)^n = 1

Finding the expectation and variance is tricky.
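Formula (13) is straightforward to compute. A sketch (`binom_pmf` is my own name; `scipy.stats.binom.pmf` offers the same thing ready-made):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p), formula (13)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Sanity check from the Binomial Theorem: the pmf sums to 1
n, p = 6, 1 / 6
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
```

`total` comes out to 1 (up to rounding), and `binom_pmf(2, 3, p)` reproduces the 3p²(1 − p) worked out above.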
  EX = Σ_{x=0}^{n} x P(X = x) = Σ_{x=0}^{n} x C(n, x) p^x (1 − p)^(n−x)
     = Σ_{x=1}^{n} x [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)
     = Σ_{x=1}^{n} [n! / ((x − 1)!(n − x)!)] p^x (1 − p)^(n−x)
Let y = x − 1:

     = Σ_{y=0}^{n−1} [n! / (y!(n − y − 1)!)] p^(y+1) (1 − p)^(n−y−1)
     = np Σ_{y=0}^{n−1} C(n − 1, y) p^y (1 − p)^(n−1−y) = np(1) = np       (14)

  Var(X) = np(1 − p)       (15)   (prove for HW)

Binomial Distribution: Examples

A die is rolled 6 times. Let a success be the event that I roll a 6. Find the probability I roll (A) 2 sixes, (B) less than 5 sixes, and (C) at least 2 and at most 4 sixes.
    = 1 − \left[ \binom{6}{5} (1/6)^5 (5/6)^1 + \binom{6}{6} (1/6)^6 (5/6)^0 \right] = 0.999

Binomial Distribution: Examples Continued

X ∼ Bin(n = 6, p = 1/6)

(C) P(2 ≤ X ≤ 4) = P(X ≤ 4) − P(X ≤ 1) = F(4) − F(1), where F is the cdf of X
    = 0.9993 − 0.7368 = 0.2625

Homework: Section 5.4 (all odds)

Poisson Distribution

For just a second we skip Section 5.5. We'll do 5.6, then immediately go back to 5.5.

To review: use the binomial distribution when we have a sum of independent Bernoulli trials (each trial is a success or failure, each trial independent, etc.). What if we have a situation where a binomial RV is appropriate, but the parameters are extreme? If n → ∞ and p → 0 but np → λ (a constant), we have what's called a Poisson random variable.

Definition. We say X ∼ Poisson(λ), where λ > 0, if the pmf is given by

    f(x; λ) = P(X = x) = \frac{e^{-λ} λ^x}{x!},   x ∈ {0, 1, ...}    (16)

Remember that this is just a very extreme case of the binomial distribution. It's possible to use calculus/magic/voodoo/kung-fu to show

    \lim_{n → ∞, p → 0} \binom{n}{x} p^x (1-p)^{n-x} = \frac{e^{-λ} λ^x}{x!}.
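The limit above is easy to see numerically. The sketch below (my own, not from the notes; plain Python, standard library only) holds λ fixed, sets p = λ/n, and watches the binomial pmf approach the Poisson pmf as n grows.

```python
import math

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Bin(n, p)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam)
    return math.exp(-lam) * lam**x / math.factorial(x)

lam, x = 0.6, 2
for n in (10, 100, 10000):
    print(n, binom_pmf(x, n, lam / n))   # approaches the Poisson value as n grows
print("Poisson:", poisson_pmf(x, lam))   # about 0.0988
```

The same two helpers cover every binomial and Poisson computation in this chapter.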
A derivation of this can be found on page 247 of the book. Take a look at it if you're curious, but you don't need to memorize it.

Poisson Distribution

What does n → ∞, p → 0, np → λ mean in the real world, though? Use the Poisson distribution to model rare events, or events that happen eventually over a long enough time span. Examples:

Car crashes
A pitcher throwing a no-hitter
A stenographer making a typographic error

    E(X) = \sum_{x=0}^{∞} x \frac{e^{-λ} λ^x}{x!} = \sum_{x=1}^{∞} x \frac{e^{-λ} λ^x}{x!}
         = \sum_{x=1}^{∞} \frac{e^{-λ} λ^x}{(x-1)!} = e^{-λ} \sum_{y=0}^{∞} \frac{λ^{y+1}}{y!}
         = λ e^{-λ} \sum_{y=0}^{∞} \frac{λ^y}{y!} = λ e^{-λ} e^{λ} = λ    (17)

    Var(X) = λ    (prove for HW)    (18)

Examples

A circuit board has 300 diodes, each with probability 0.002 of failing. Find...
(1) the probability exactly 2 diodes fail.
(2) the mean number of diodes that fail.
(3) the standard deviation.
(4) the probability that the board works.
(5) the probability that out of 5 boards that are shipped to a customer, 4 or more work.

We have a large number of "trials" and a very small probability of success. Use X ∼ Poisson(λ = (300)(0.002) = 0.6).
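The five parts are worked by hand in what follows; as a cross-check, here is a short numeric sketch (again my own helper names, plain Python, standard library only).

```python
import math

lam = 300 * 0.002          # Poisson approximation to Bin(300, 0.002)

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam)
    return math.exp(-lam) * lam**x / math.factorial(x)

p_two_fail = poisson_pmf(2, lam)     # (1) about 0.0988
mean = lam                           # (2) 0.6
sd = math.sqrt(lam)                  # (3) about 0.775
p_works = poisson_pmf(0, lam)        # (4) P(X = 0) = e^{-0.6}, about 0.549

# (5) the number of working boards out of 5 is Y ~ Bin(5, p_works)
p_4_or_more = (math.comb(5, 4) * p_works**4 * (1 - p_works)
               + p_works**5)         # about 0.25
print(p_two_fail, mean, sd, p_works, p_4_or_more)
```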
(1) P(exactly 2 fail) = P(X = 2) = \frac{e^{-(300)(0.002)} ((300)(0.002))^2}{2!} = \frac{e^{-0.6} (0.6)^2}{2!} = 0.099

(2) µ = (300)(0.002) = 0.6

(3) σ = \sqrt{Var(X)} = \sqrt{λ} = \sqrt{0.6} = 0.77

(4) P(board works) = P(X = 0) = \frac{e^{-0.6} (0.6)^0}{0!} = e^{-0.6} = 0.55

Examples

A circuit board has 300 diodes with probability 0.002 of failing. (5) Find the probability that out of 5 boards that are shipped to a customer, 4 or more work.

This is actually a pretty tricky problem, because we have to "combine" two distributions to solve it. We need to define success on 2 levels: success in that a diode works, and success in that a whole board (all of its diodes) works.
Let X ∼ Poisson(λ = 0.6), and let Y ∼ Binom(n = 5, p = P(X = 0)).

    P(4 or more work) = P(Y ≥ 4) = P(Y = 4) + P(Y = 5)
    = \binom{5}{4} \left(e^{-0.6}\right)^4 \left(1 − e^{-0.6}\right)^{5-4} + \binom{5}{5} \left(e^{-0.6}\right)^5 \left(1 − e^{-0.6}\right)^{5-5}
    = 0.25.

Homework: 5.6 odds

Back to Section 5.5

Now let's do another modification of the binomial experiment. Recall one of the assumptions for using a binomial RV: fixed n (number of trials). Now let's do the opposite: instead of fixing the number of trials (n) and counting the number of successes (x), let's repeat an experiment a variable number of times until we get a desired number of successes. Essentially we are reversing the roles of x and n.

Example: Roll a die until you get 2 sixes. The sequence might look like this: (F)(F)(S)(F)(F)(F)(S). By independence of the die rolls, the probability most likely looks something like

    P[(F)(F)(S)(F)(F)(F)(S)] = c · P(F)P(F)P(S)P(F)P(F)P(F)P(S),

where c is some constant.

Why do we have a constant?

Remember, we only care about rolling the die until we get 2 sixes. If we see (F)(F)(S)(F)(F)(F)(S), we notice that the second success happens on our 7th roll, but this isn't the only way we can write out this sequence of letters such that the 7th letter is S.
The calculation of the probability that we get that 2nd six on the 7th roll depends on how many ways we can rearrange all the other letters in the sequence. That is, we're free to arrange those first 6 letters (which contain exactly one S) in any way we like. How many ways can we arrange those letters?

    \binom{6}{1} = \binom{\#\text{ of trials} − 1}{\#\text{ successes} − 1}

Negative Binomial

More generally, we don't know the number of trials (= x), but we do know the number of successes (= k). How many ways can we arrange those letters?

    \binom{6}{1} = \binom{\#\text{ of trials} − 1}{\#\text{ successes} − 1} = \binom{x − 1}{k − 1}

Definition. We say X ∼ NegBin(k, p) if it has pmf given by

    f(x; k, p) = P(X = x) = \binom{x − 1}{k − 1} p^k (1 − p)^{x − k},   x ∈ {k, k + 1, ...}    (19)

where k is the number of successes and p is the probability of a single success.

    E[X] = \frac{k}{p},   Var(X) = \frac{k(1 − p)}{p^2}    (20)

Special Case of NegBin: Geometric Distribution

If we perform an experiment until we obtain 1 success (k = 1), then

    f(x; p, k = 1) = P(X = x) = \binom{x − 1}{1 − 1} p (1 − p)^{x − 1} = p (1 − p)^{x − 1}.

Stated formally:

Definition. We say X has a geometric distribution, denoted X ∼ Geo(p), if its pmf is given by

    f(x; p) = P(X = x) = p (1 − p)^{x − 1},   x ∈ {1, 2, ...}    (21)

    E[X] = \frac{1}{p},   Var(X) = \frac{1 − p}{p^2}    (22)

Note: The negative binomial is the sum of (independent) geometric random variables.

Examples

(1) A coin is flipped until 3 heads are obtained. What is the probability of this happening on the 5th flip?
(2) A 6-sided die is rolled until a 6 is rolled. What is the probability of this happening on the 4th roll?

Answers:

(1) Let X ∼ NegBin(k = 3, p = 1/2).

    P(X = 5) = \binom{5 − 1}{3 − 1} \left(\frac{1}{2}\right)^3 \left(1 − \frac{1}{2}\right)^{5 − 3} = ⋯ = \frac{3}{16}

(2) Let Y ∼ Geo(p = 1/6).
    P(Y = 4) = \frac{1}{6} \left(1 − \frac{1}{6}\right)^{4 − 1} = 0.096

Homework: 5.5 odds

To summarize: you will not pass this course if you do not recognize how and where to apply the discrete distributions we have discussed. To review:

Distribution       | Situation
-------------------|--------------------------------------------------------------
Discrete Uniform   | Each element of the support has equal probability
Bernoulli          | 1 trial resulting in Success or Failure
Binomial           | (fixed) n Bernoulli trials, results independent, # successes unknown
Poisson            | large n, small p, np → λ; # successes unknown
Negative Binomial  | unknown # trials, # successes known
Geometric          | Neg. Binomial with k = 1

Chapter 6: Continuous Random Variables

Definition. A random variable is continuous if its probabilities are given as areas under a curve. The curve is called a probability density function (pdf) for the random variable.

The pdf has most of the same properties as the pmf, with some slight changes.

If f(x) is a pdf, then f(x) ≥ 0 for all x.
If X is continuous, P(X = x) = 0. Why?

For continuous RVs, probabilities are given by areas under the curve. In other words, you'll see things like

    P(a ≤ X ≤ b) = \int_a^b f(x)\,dx
    P(X ≤ b) = \int_{−∞}^{b} f(x)\,dx
    P(X ≥ a) = \int_a^{∞} f(x)\,dx

So why is P(X = x) = 0 for continuous RVs? Remember, probabilities are defined as areas under a curve, i.e. integrals:

    P(X = x) = P(x ≤ X ≤ x) = \int_x^x f(t)\,dt = 0.

This leads to some weird things with the notation. Unlike with pmfs, we make no distinction between P(a ≤ X ≤ b) and P(a < X < b).

If X is discrete, P(X ≥ 0) = P(X = 0) + P(X = 1) + ⋯
If X is continuous, P(X ≥ 0) = P(X = 0) + P(X > 0) = 0 + P(X > 0).

One more property of continuous RVs:

    \int_{−∞}^{∞} f(x)\,dx = 1.

This is just the continuous analog of the property for discrete RVs where \sum_x P(X = x) = 1.

Chapter 6: Continuous RVs

Definition. Let X be a continuous random variable with pdf f(x). The cumulative distribution function (cdf) of X is

    F(x) = P(X ≤ x) = \int_{−∞}^{x} f(t)\,dt.    (23)

Examples: Let

    f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

(1) Is f a valid pdf? (2) Find the cdf of X. (3) Find P(0.25 ≤ X ≤ 0.75).

(1) Is f a valid pdf?

    \int_{−∞}^{∞} f(x)\,dx = \int_{−∞}^{0} f(x)\,dx + \int_{0}^{1} f(x)\,dx + \int_{1}^{∞} f(x)\,dx
                           = 0 + \int_0^1 2x\,dx + 0 = \left.\frac{2x^2}{2}\right|_0^1 = 1.
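Part (1) can also be sanity-checked numerically; a simple midpoint Riemann sum is enough (my own sketch; the step count is an arbitrary choice).

```python
def f(x):
    # the pdf from the example: f(x) = 2x on [0, 1], 0 elsewhere
    return 2 * x if 0 <= x <= 1 else 0.0

n = 100_000
h = 1.0 / n
# midpoint rule on [0, 1]; the area under the pdf should be 1
total = sum(f((i + 0.5) * h) * h for i in range(n))
print(total)
```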
(2) Find the cdf of X.

    F(x) = 0 for −∞ < x < 0;   F(x) = x^2 for 0 ≤ x ≤ 1;   F(x) = 1 for 1 < x < ∞.

(3) Find P(0.25 ≤ X ≤ 0.75).

    P(0.25 ≤ X ≤ 0.75) = \int_{0.25}^{0.75} 2x\,dx = x^2 \Big|_{0.25}^{0.75} = \frac{1}{2}

or, equivalently,

    P(0.25 ≤ X ≤ 0.75) = P(X ≤ 0.75) − P(X ≤ 0.25) = F(0.75) − F(0.25) = \frac{9}{16} − \frac{1}{16} = \frac{1}{2}.

More fun stuff

The mean and variance of continuous RVs are similar to those of their discrete counterparts. Just replace \sum with \int.

Definition. The mean of a continuous random variable X is given by

    µ = E[X] = \int_{−∞}^{∞} x f(x)\,dx.    (24)

The variance is given by

    σ^2 = E[(X − µ)^2] = \int_{−∞}^{∞} (x − µ)^2 f(x)\,dx.    (25)

An alternate (and equivalent) calculation is

    σ^2 = E[X^2] − (E[X])^2 = \int_{−∞}^{∞} x^2 f(x)\,dx − µ^2.    (26)

Example

Find the standard deviation of X with pdf f(x) = 2x for 0 ≤ x ≤ 1 (0 otherwise).

    E[X] = \int_{−∞}^{∞} x f(x)\,dx = \int_0^1 2x^2\,dx = \frac{2x^3}{3}\Big|_0^1 = \frac{2}{3}

    E[X^2] = \int_{−∞}^{∞} x^2 f(x)\,dx = \int_0^1 2x^3\,dx = \frac{2x^4}{4}\Big|_0^1 = \frac{1}{2}

    σ^2 = E[X^2] − (E[X])^2 = \frac{1}{2} − \left(\frac{2}{3}\right)^2 = \frac{1}{18}

    σ = \sqrt{1/18} = \frac{1}{3\sqrt{2}} ≈ 0.236.

Homework: Section 6.1 #2(a, b, d), 3, 5(a, b, c); Section 6.2 #9, 11

Frequently Used pdfs: Continuous Uniform Distribution

Similar to the discrete uniform. Recall that for the discrete uniform distribution, every point in the support has an equal chance of occurring. For the continuous uniform distribution, every interval of equal width has an equal chance of occurring.
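Returning to the f(x) = 2x example for a moment: the mean and standard deviation just computed can be double-checked with the same midpoint Riemann sum idea (my own sketch, standard library only).

```python
import math

def f(x):
    # pdf from the standard-deviation example: f(x) = 2x on [0, 1]
    return 2 * x if 0 <= x <= 1 else 0.0

n = 200_000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]
mean = sum(x * f(x) * h for x in xs)      # E[X], should approach 2/3
ex2 = sum(x * x * f(x) * h for x in xs)   # E[X^2], should approach 1/2
sd = math.sqrt(ex2 - mean**2)             # sigma, should approach 1/(3*sqrt(2))
print(mean, sd)
```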
Definition. X is said to have a continuous uniform pdf if

    f(x) = \frac{1}{b − a} for a ≤ x ≤ b, and f(x) = 0 otherwise.    (27)

[Figure: the continuous uniform pdf with a = 2, b = 6; the density is flat at height 1/4 between x = 2 and x = 6, and 0 elsewhere.]

Continuous Uniform Distribution

    E[X] = \int_a^b x \frac{1}{b − a}\,dx = \frac{1}{b − a} \cdot \frac{x^2}{2}\Big|_a^b = \frac{b + a}{2}

    Var(X) = \frac{(b − a)^2}{12}    (Verify for HW)

Example: The RTS 9-Route is supposed to come to a certain stop every 12 minutes. Waiting times at bus stops are often modeled with the continuous uniform distribution. Suppose X is a random variable representing waiting time, i.e. X ∼ Unif(0, 12).

(1) What is the probability of waiting 7 or more minutes?
(2) What is the average waiting time?

    P(X ≥ 7) = \int_7^{∞} f(x)\,dx = \int_7^{12} \frac{1}{12 − 0}\,dx = \frac{x}{12}\Big|_7^{12} = \frac{5}{12}

    E[X] = \frac{12 + 0}{2} = 6 minutes

What is the probability of waiting less than 5 minutes?

The Normal Distribution

The normal distribution is the most commonly used distribution.

Definition. X has a normal distribution if its pdf is given by

    f(x; µ, σ^2) = \frac{1}{\sqrt{2πσ^2}} \exp\left( \frac{−(x − µ)^2}{2σ^2} \right),   −∞ < x < ∞,    (28)

where µ_X = µ and σ^2_X = σ^2; this is denoted X ∼ N(µ, σ^2).

The normal distribution is the bell-shaped curve with the following (very important) properties:

Symmetric about x = µ.
x = µ ± σ are inflection points (where concavity changes).
For any normal distribution:
  About 68% of the population is in µ ± σ.
  About 95% of the population is in µ ± 2σ.
  About 99.7% of the population is in µ ± 3σ.
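The three percentages can be recovered from the standard normal cdf, which the standard library exposes through the error function: Φ(z) = (1 + erf(z/√2))/2. A quick check (my own sketch):

```python
import math

def phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma) = Phi(k) - Phi(-k)
    print(k, round(phi(k) - phi(-k), 4))
# k=1 -> 0.6827, k=2 -> 0.9545, k=3 -> 0.9973
```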
The above is usually referred to as the empirical rule. It's easier to understand with a picture:

[Figure: a normal curve with the regions µ ± σ, µ ± 2σ, and µ ± 3σ shaded.]

The Normal Distribution

Problem: Probabilities are found by calculating areas under the curve, but integrating the pdf for the normal distribution is tedious (and not possible with the "usual" methods).

Solution: Computers! It's really easy to do this in R. For example, if X ∼ N(1, 4) and I want P(X ≤ 0.5), I can just type

> pnorm(q = 0.5, mean=1, sd = sqrt(4), lower.tail = TRUE)

Problem: So... how do I calculate probabilities for HW/Quizzes/Tests?

Solution: We'll use a table. Depending on what the problem wants, we'll manipulate the problem so that a number on a table gives us our probabilities.

Problem: There are an infinite number of (µ, σ^2) pairs. Are there an infinite number of normal probability tables?

Solution: No. We'll translate every normal distribution into the normal distribution with µ = 0, σ^2 = 1. Note: This is called the standard normal distribution, and it is usually expressed as Z.
The Normal Distribution

Translate every normal distribution into a normal distribution with µ = 0, σ² = 1. How does that work?

Theorem
Let X ∼ N(µ, σ²). If Z = (X − µ)/σ, then Z ∼ N(0, 1).

Proof: Transformations of random variables are covered in higher-level stat classes. Just go with it.

Here are some examples:
(1) P(Z ≤ 0.5) = 0.6915
(2) P(Z ≥ −0.5) = 1 − P(Z ≤ −0.5) = 1 − 0.3085 = 0.6915, or... use symmetry
(3) P(−1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) − P(Z ≤ −1.96) = 0.975 − 0.025 = 0.95

Examples: Normal Distribution

(4) Let X ∼ N(5, 100). Find P(X ≤ 0).
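The slides use R's `pnorm`; as an alternative sketch, the same table lookups can be reproduced with Python's standard library (`statistics.NormalDist`, Python 3.8+ — an assumption of this example, not something the slides use):

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

p1 = Z.cdf(0.5)                   # (1) P(Z <= 0.5)          ~ 0.6915
p2 = 1 - Z.cdf(-0.5)              # (2) P(Z >= -0.5), by complement
p3 = Z.cdf(1.96) - Z.cdf(-1.96)   # (3) P(-1.96 <= Z <= 1.96) ~ 0.95
```

These reproduce the four-decimal table values used in examples (1)-(3) above.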
P(X ≤ 0) = P((X − 5)/10 ≤ (0 − 5)/10) = P(Z ≤ −0.5) = 0.3085

(5) P(−5 ≤ X ≤ 15) = P((−5 − 5)/10 ≤ (X − 5)/10 ≤ (15 − 5)/10) = P(−1 ≤ Z ≤ 1)
    = P(Z ≤ 1) − P(Z ≤ −1) = 0.8413 − 0.1587 = 0.6826, or..... just use the empirical rule!

(6) P(Z ≤ k) = 0.1762. Find k. Answer: k = −0.93

(7) Let X ∼ N(5, 100). Suppose P(X ≥ k) = 0.8531. Find k.

P(X ≥ k) = P((X − 5)/10 ≥ (k − 5)/10) = P(Z ≥ (k − 5)/10) = 0.8531
⇒ 1 − P(Z ≤ (k − 5)/10) = 0.8531
⇒ P(Z ≤ (k − 5)/10) = 0.1469
⇒ (k − 5)/10 = −1.05 ⇒ k = −5.5.

Don't we only care about calculating probabilities?
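Going "backwards" from a probability to a cutoff is an inverse-cdf (quantile) lookup. A hedged sketch of examples (6) and (7) using Python's standard library rather than the table (an assumption of this example; note `sigma` is the standard deviation 10, not the variance 100):

```python
from statistics import NormalDist

# Example (6): find k with P(Z <= k) = 0.1762, reading the table "backwards".
z = NormalDist().inv_cdf(0.1762)                 # ~ -0.93

# Example (7): X ~ N(5, 100). P(X >= k) = 0.8531 means P(X <= k) = 0.1469,
# so k is the 0.1469 quantile of N(5, 100).
k = NormalDist(mu=5, sigma=10).inv_cdf(0.1469)   # ~ -5.5
```

The table-based answers (−0.93 and −5.5) agree with these to the table's resolution.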
Why would we ever care about going backwards? This will become clear once we move from talking about probability to talking about statistics.

The Normal Table

On a test/quiz I will only provide the positive part of a normal table. How does this change your problem-solving techniques?
If your question is of the form P(Z ≤ z) where z ≥ 0, then it changes nothing.
If your question is of the form P(Z ≤ z) where z < 0, then you have to do an extra couple of steps. Note that for z < 0:

    P(Z ≤ z) = P(Z ≥ −z) = 1 − P(Z ≤ −z)

Homework:
Cont. Uniform Distribution: Section 6.3 # 15(a,b), 19, 21(a,b), 25, 27, 29
Normal Distribution: Section 6.6 # 55, 59, 61a, 63, 65, 67

Gamma Function

Definition
For r > 0 the gamma function is defined by

    Γ(r) = ∫_0^∞ t^(r−1) e^(−t) dt

The properties of the Gamma Function are as follows:
If r is a positive integer, Γ(r) = (r − 1)!.
For every r, Γ(r + 1) = r Γ(r).
Γ(1/2) = √π.

Gamma Distribution

Definition
The pdf of the gamma distribution with parameters α > 0 and β > 0 is

    f(x; α, β) = [1/(Γ(α) β^α)] x^(α−1) e^(−x/β)    (29)

where x > 0, and f(x; α, β) = 0 otherwise.
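The three gamma-function properties can be checked numerically with Python's built-in `math.gamma` (a sketch; the test value r = 3.7 is an arbitrary choice, not from the slides):

```python
import math

# Numerical checks of the gamma-function properties listed above.
fact_check  = math.gamma(5)                      # Γ(5) = 4! = 24 (positive-integer case)
recur_check = math.gamma(4.7) / math.gamma(3.7)  # Γ(r+1)/Γ(r) = r, here r = 3.7
half_check  = math.gamma(0.5)                    # Γ(1/2) = √π
```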
EX = ∫_0^∞ x · [1/(Γ(α) β^α)] x^(α−1) e^(−x/β) dx = [1/(Γ(α) β^α)] ∫_0^∞ x^α e^(−x/β) dx
   = Γ(α + 1) β^(α+1) / [Γ(α) β^α] = α Γ(α) β^α β / [Γ(α) β^α] = αβ

Var(X) = αβ² (verify for HW)

Gamma Distribution

Special Cases of the Gamma Distribution
β = 2: Chi-Square Distribution (we'll talk more about this later)
α = 1: Exponential Distribution (Section 6.4)

Applications
Exponential Distribution: models the waiting time between one Poisson event and the next.
When α is an integer, the Gamma Distribution models the waiting time until the α-th Poisson event.

Theorem
If X1, X2, . . . , Xn are independent Exp(β), then Σ_{i=1}^n Xi ∼ Gamma(n, β).

Gamma Distribution: Examples

(1) In a certain city the daily consumption of electric power, in millions of kW-hours, is X ∼ Gamma with mean = 6 and variance = 12.

(a) What are α and β?
Recall µ = αβ and σ² = αβ². Then we have two equations and two unknowns:
αβ = 6 ⇒ 6β = 12 ⇒ β = 2, α = 3.

(b) Find the probability that on any given day the daily power consumption will exceed 12 million kW-hours.
P(X > 12) = ∫_12^∞ [1/(Γ(α) β^α)] x^(α−1) e^(−x/β) dx = ∫_12^∞ [1/(Γ(3) 2³)] x² e^(−x/2) dx
          = (1/16) ∫_12^∞ x² e^(−x/2) dx
          = (1/16) [ −2x² e^(−x/2) |_12^∞ + 2 ∫_12^∞ e^(−x/2) (2x) dx ]   (integration by parts)
          = · · · = 0.06.

Gamma Distribution: More Examples

(2) The length of time for one individual to be served at a cafeteria is a RV having the exponential distribution with a mean of 4 minutes. What is the probability that a person is served in less than 3 minutes on at least 4 of the next 6 days?
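Repeating the integration by parts α times telescopes the tail probability into a finite sum, which avoids redoing the calculus each time. A hedged sketch for integer α (the function name is chosen here; the closed form is the standard Gamma-Poisson tail identity, not derived on the slides):

```python
import math

def gamma_sf_integer_alpha(t, alpha, beta):
    """P(X > t) for X ~ Gamma(alpha, beta) with alpha a positive integer:
    e^(-t/beta) * sum_{j=0}^{alpha-1} (t/beta)^j / j!  (repeated parts)."""
    lam = t / beta
    return math.exp(-lam) * sum(lam**j / math.factorial(j) for j in range(alpha))

p = gamma_sf_integer_alpha(12, alpha=3, beta=2)   # 25 e^(-6), about 0.06
```

With α = 3, β = 2, t = 12 this gives e^(−6)(1 + 6 + 18) = 25e^(−6) ≈ 0.06, matching the worked answer.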
Let X ∼ Bin(6, p) and Y ∼ Exp(β = 4).

p = P(Y ≤ 3) = ∫_0^3 (1/4) e^(−x/4) dx = · · · = 1 − e^(−3/4) ≈ 0.53

P(X ≥ 4) = P(X = 4) + P(X = 5) + P(X = 6)
         = C(6,4) (1 − e^(−3/4))⁴ (e^(−3/4))² + C(6,5) (1 − e^(−3/4))⁵ (e^(−3/4))¹ + C(6,6) (1 − e^(−3/4))⁶ (e^(−3/4))⁰
         ≈ 0.40.

Homework for Gamma & Exponential

Homework: Section 6.5: # 45, 46, 48, 50, 51; Section 6.4: # 31, 33, 37, 41

Beta Distribution

The Beta Distribution models Random Variables that take on values in the interval [0, 1].
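The cafeteria example layers a binomial over an exponential success probability; the arithmetic can be sketched directly (standard-library only, an assumption of this example):

```python
import math

p = 1 - math.exp(-3/4)   # P(served in < 3 min) for service time Y ~ Exp(beta = 4)

# P(X >= 4) for X ~ Bin(6, p): at least 4 "fast service" days out of the next 6
prob = sum(math.comb(6, x) * p**x * (1 - p)**(6 - x) for x in range(4, 7))
```

This reproduces p ≈ 0.53 and P(X ≥ 4) ≈ 0.40 from the slides.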
Definition
We say that a Random Variable X has a Beta Distribution if its pdf is of the form:

    f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1),  0 < x < 1,  0 elsewhere    (30)

EX = ∫_0^1 x [Γ(α + β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1) dx = [Γ(α + β)/(Γ(α)Γ(β))] ∫_0^1 x^α (1 − x)^(β−1) dx
   = [Γ(α + β)/(Γ(α)Γ(β))] · [Γ(α + 1)Γ(β)/Γ(α + β + 1)] = α/(α + β)

Var(X) = αβ / [(α + β)²(α + β + 1)] (prove for HW)

Beta Example (#85b)

The proportion of pure iron in certain ore samples has a beta distribution with α = 3 and β = 1. Find the probability that two out of three randomly selected samples will have less than 30% pure iron.

Let X ∼ Beta(α = 3, β = 1) and Y ∼ Bin(n = 3, p = P(X < 0.3)).

p = P(X < 0.3) = ∫_0^0.3 [Γ(3 + 1)/(Γ(3)Γ(1))] x^(3−1) (1 − x)^(1−1) dx = 3 ∫_0^0.3 x² dx = x³ |_0^0.3 = 27/1000

P(Y = 2) = C(3,2) (27/1000)² (1 − 27/1000)^(3−2) ≈ 0.002128

More Continuous Distributions

Homework: Section 6.8: at least # 78, 79, 83

Recall from previous slides that the Gamma and Exponential Distributions are used to model lifetimes. Two other distributions mentioned in your book are used to model lifetime distributions:
Lognormal Distribution (Section 6.7)
Weibull Distribution (Section 6.9)
They work similarly to the previous distributions covered in class and aren't very interesting. You may find yourself using these distributions in your further studies/job, but for the sake of time we'll skip them.

This marks the end of the material for Exam 1. Decide: Take Exam 1 on Monday Sept 26 or Wednesday Sept 28.
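Because Beta(3, 1) has pdf 3x² on (0, 1), its cdf is simply x³, so the iron-ore example reduces to a couple of lines (a sketch assuming that closed-form cdf, which follows from the pdf above):

```python
import math

# X ~ Beta(3, 1) has pdf 3x^2 on (0, 1), so its cdf is x^3.
p = 0.3 ** 3                                       # P(X < 0.3) = 27/1000
prob = math.comb(3, 2) * p**2 * (1 - p)**(3 - 2)   # P(Y = 2) for Y ~ Bin(3, p)
```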
Chapter 7

So far we've only dealt with one Random Variable at a time. In order to accomplish anything in statistics we'll need to learn how to handle multiple random variables at a time. We'll start by taking the notion of multiple events and generalizing it to multiple Random Variables.

Chapter 4 vs. Chapter 7:
(1) P(A ∩ B)  →  fX,Y(x, y) = P(X = x, Y = y)
(2) A, B independent if P(A ∩ B) = P(A)P(B)  →  X, Y are independent RVs if P(X = x, Y = y) = P(X = x)P(Y = y)
(3) P(A|B) = P(A ∩ B)/P(B)  →  P(X = x|Y = y) = fX,Y(x, y)/fY(y)

Let's describe each one of these 3 features in greater depth.

(1) If X, Y are discrete, the joint pmf of X and Y is

    fX,Y(x, y) = P(X = x, Y = y)

The marginal pmfs can be calculated from the joint pmf:

    P(X = x) = Σ_y fX,Y(x, y),    P(Y = y) = Σ_x fX,Y(x, y)

For the continuous case, we just say

    fX(x) = ∫_y fX,Y(x, y) dy  and  fY(y) = ∫_x fX,Y(x, y) dx

We say that fX,Y(x, y) is a valid joint pmf if fX,Y(x, y) ∈ [0, 1] and Σ_x Σ_y fX,Y(x, y) = 1. Similarly, it is a valid joint pdf if fX,Y(x, y) > 0 and ∫_x ∫_y fX,Y(x, y) dy dx = 1.

Example
Let fX,Y(x, y) = 4xy for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and 0 otherwise.

(a) Is f a valid joint pdf?

∫_x ∫_y 4xy dy dx = 4 ∫_0^1 x (∫_0^1 y dy) dx = 4 ∫_0^1 x (y²/2 |_0^1) dx = 2 ∫_0^1 x dx = 1.

(b) Find P(X ≤ 1/2, Y ≤ 1/3).

P(X ≤ 1/2, Y ≤ 1/3) = ∫_0^(1/2) ∫_0^(1/3) 4xy dy dx = · · · = 1/36

(c) Find fX(x) and fY(y), the marginal distributions of X and Y.

fX(x) = ∫_0^1 fX,Y(x, y) dy = ∫_0^1 4xy dy = 4x (y²/2 |_0^1) = 2x

fY(y) = ∫_0^1 fX,Y(x, y) dx = · · · = 2y.

(2) Now let us describe in detail what we mean by the notion of independence of Random Variables.

Theorem
X and Y are independent if and only if P(X ∈ I1, Y ∈ I2) = P(X ∈ I1)P(Y ∈ I2), where I1 and I2 are subsets of the supports of X and Y respectively.
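Because ∫∫ 4xy dy dx over a rectangle [0, a] × [0, b] separates into (∫ 4x dx)(∫ y dy)... in fact the antiderivative is just x²y², both parts (a) and (b) above reduce to one evaluation. A small sketch (the function name is chosen here):

```python
# For f(x, y) = 4xy on the unit square, the double integral over [0, a] x [0, b]
# has closed form a^2 * b^2 (integrate y first, then x: 4xy -> 2x y^2/... -> x^2 y^2).
def rect_prob(a, b):
    return a**2 * b**2

total = rect_prob(1, 1)       # part (a): the pdf integrates to 1
p     = rect_prob(1/2, 1/3)   # part (b): P(X <= 1/2, Y <= 1/3) = 1/36
```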
In other words, X and Y are independent if their joint distribution factors into the product of their marginals, i.e. fX,Y(x, y) = fX(x)fY(y).

Independence

Caution: Pay attention to where the Random Variables "live" when determining if they are independent. We saw earlier that
    fX,Y(x, y) = 4xy, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1; 0 otherwise
can be factored into a product of 2 marginals. However, we can't do that with
    gX,Y(x, y) = 8xy, 0 < x < y < 1; 0 otherwise
In this case we would say that X and Y are dependent.

Independence & Dependence

We can describe the relationship between variables the following way:

Definition
The covariance, sometimes denoted σXY, is given by the following:

    Cov(X, Y) = EXY − EX·EY (= E[(X − EX)(Y − EY)])
              = ∫_x ∫_y xy fX,Y(x, y) dy dx − (∫_x x fX(x) dx)(∫_y y fY(y) dy)

To get a unit-less measure of the dependence of two variables we use:

Definition
The correlation (or correlation coefficient), denoted ρ, between two random variables X and Y is given by

    ρ = Cov(X, Y)/√(Var(X)Var(Y)) = σXY/(σX σY)    (31)

What happens to the covariance if X and Y are independent?

Recall that if X and Y are independent, then the joint pmf (pdf) is just the product of the marginals. That means we can say:

    Cov(X, Y) = ∫_x ∫_y xy fX(x)fY(y) dy dx − (∫_x x fX(x) dx)(∫_y y fY(y) dy)
              = (∫_x x fX(x) dx)(∫_y y fY(y) dy) − (∫_x x fX(x) dx)(∫_y y fY(y) dy) = 0

So there we have it: X, Y independent ⇒ Cov(X, Y) = 0.

If Cov(X, Y) = 0 can we say that X and Y are independent? No! The fact that the covariance is 0 has no bearing on whether or not the joint pmf (pdf) can be factored into separate marginals (see Table 7.4 on page 348 of the textbook for details).
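A tiny concrete illustration of "Cov = 0 does not imply independence" (this is a classic textbook counterexample chosen here, not the table cited in the text):

```python
# X uniform on {-1, 0, 1} and Y = X^2. Then Cov(X, Y) = 0, yet Y is a
# deterministic function of X, so X and Y are certainly not independent.
pmf = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}

EX  = sum(x * p for (x, y), p in pmf.items())
EY  = sum(y * p for (x, y), p in pmf.items())
EXY = sum(x * y * p for (x, y), p in pmf.items())
cov = EXY - EX * EY   # 0, despite the dependence
```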
How does Independence affect Expectations and Variances of Sums of RVs?

Let X and Y be arbitrary Random Variables.

E(X + Y) = E(X) + E(Y)

Var(X + Y) = E(X + Y)² − [E(X + Y)]²
           = E[X² + Y² + 2XY] − [(EX)² + (EY)² + 2·EX·EY]
           = (E[X²] − [EX]²) + (E[Y²] − [EY]²) + 2(EXY − EX·EY)
           = Var(X) + Var(Y) + 2 Cov(X, Y)

More generally, Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y). If X and Y are independent, the covariance term disappears and we get

    Var(aX + bY) = a² Var(X) + b² Var(Y)

More on Independence

Independence is also an assumption for the following facts:
Let X1, X2, . . . , Xn be iid Bernoulli(p). Then Y = Σ_{i=1}^n Xi ∼ Bin(n, p).
Let X1, X2, . . . , Xk be iid Geo(p). Then Y = Σ_{i=1}^k Xi ∼ NegBin(k, p).
Let X1, X2, . . . , Xn be iid Exp(β). Then Y = Σ_{i=1}^n Xi ∼ Gamma(n, β).
Recall: if α is an integer and X ∼ Gamma(α/2, β = 2), then X is chi-squared, also denoted X ∼ χ²(α).
If Z1, Z2, . . . , Zn are iid N(0, 1), then Σ_{i=1}^n Zi² ∼ χ²(n), where n is called the "degrees of freedom."
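The variance identity Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y) can be verified exactly on any finite joint pmf. A sketch with a small hypothetical table (the pmf values and a, b are arbitrary choices, not from the slides):

```python
# Verify Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y) on a small
# hypothetical joint pmf (any table whose probabilities sum to 1 works).
pmf = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}
a, b = 2.0, -3.0

def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

var_x = E(lambda x, y: x * x) - E(lambda x, y: x) ** 2
var_y = E(lambda x, y: y * y) - E(lambda x, y: y) ** 2
cov   = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

lhs = E(lambda x, y: (a*x + b*y) ** 2) - E(lambda x, y: a*x + b*y) ** 2
rhs = a*a*var_x + b*b*var_y + 2*a*b*cov   # the two sides agree
```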
Conditional pmfs & pdfs

Let X and Y be two arbitrary Random Variables and let I1 and I2 be two subsets of the supports of X and Y respectively.

Definition
The conditional probability of X given Y is the following:

P(X ∈ I1 | Y ∈ I2) = P(X ∈ I1, Y ∈ I2) / P(Y ∈ I2)    (32)

In other words we can say

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).    (33)

If X and Y are independent, then f_{X|Y}(x|y) = f_X(x) (prove for homework).

Conditional Probability (revisited)

f_{X|Y} is a legitimate distribution, so we can talk about its probabilities and expectations.

Example: Suppose X and Y have a joint pdf given by

f_{X,Y}(x, y) = 8xy for 0 ≤ x ≤ y ≤ 1, and 0 otherwise.

Find P(Y < 0.5 | X = 0.25).

P(Y < 0.5 | X = 0.25) = ∫_{0.25}^{0.5} f_{X,Y}(0.25, y) dy / f_X(0.25)
  = ∫_{0.25}^{0.5} 8(0.25)y dy / ∫_{0.25}^{1} 8(0.25)y dy    [since f_X(x) = ∫_x^1 8xy dy]
  = 0.1875 / 0.9375 = 0.2.

Homework: Section 7.3: 7.3, 7.5, 7.7, 7.9, 7.11. Section 7.4: 7.17.

Conditional Probability

How do expectations work for conditional distributions? Some basic principles are the following:

Definition
If X and Y are two arbitrary Random Variables, the conditional expectation of X given that Y = y is defined to be

E(X | Y = y) = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx.

Theorem
Let X and Y denote two arbitrary random variables. Then E(X) = E(E(X|Y)).

The deeper properties of conditional expectation are well above the scope of this class, so we won’t focus too much on it.
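The conditional probability P(Y < 0.5 | X = 0.25) worked above is just a ratio of two one-dimensional integrals, so it is easy to check numerically. A small Python sketch (illustrative, not from the notes), using a midpoint rule:

```python
import numpy as np

x = 0.25

def integrate(f, a, b, n=100_000):
    # Midpoint rule; exact for the linear integrand used here, up to float error.
    ys = np.linspace(a, b, n + 1)
    mid = (ys[:-1] + ys[1:]) / 2
    return (f(mid) * (b - a) / n).sum()

joint_at_x = lambda y: 8 * x * y  # f_{X,Y}(0.25, y) on its support y in [0.25, 1]

numerator = integrate(joint_at_x, 0.25, 0.5)    # integral over {y < 0.5}
denominator = integrate(joint_at_x, 0.25, 1.0)  # f_X(0.25)

p = numerator / denominator
print(round(p, 4))  # → 0.2
```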
Just know that it is a thing and that it exists.

The Multinomial Distribution

We’ve gone over some common pmfs and pdfs for some common Random Variables. Are there any common joint distributions? Consider an experiment similar to that for the binomial random variable (fixed number of trials, independent outcomes, etc.), but let there be k possible outcomes instead of just 2.

Definition
The multinomial distribution is represented by the following pmf:

P(Y1 = y1, . . . , Yk = yk) = [n! / (y1! y2! · · · yk!)] p1^{y1} p2^{y2} · · · pk^{yk}    (34)

The Multinomial Distribution has a few interesting qualities:

Marginally, Yi ∼ Bin(n, pi).
While trials in the experiment are independent, the Random Variables are not. This makes sense on an intuitive level since there’s no way to factor the pmf into a product of other pmfs. We can also prove this by showing that Cov(Yi, Yj) ≠ 0 (in fact Cov(Yi, Yj) = −n pi pj). The proof is given in Example 7.17 on page 357 of the textbook.

Homework: 7.25, 7.27, 7.29, 7.31

Statistics

Now that some of the major foundations of probability have been established, we can talk more about what we mean by statistics. Many of the assumptions we made previously (even those that were not explicitly labelled as such) will be explored from here on out. For example...

Is the probability of flipping a coin and getting a head really 1/2?
Is the mean of a normal distribution really what you say it is?
How can we be sure that trials are independent?
Is the variance really constant throughout the entire experiment?

Statistics

First let’s get some definitions out of the way:

Definition
Parameters are numerical descriptive measures of a population.
Definition
Statistics are numerical descriptive measures of a sample.

The only difference between these two definitions is the scope of what they describe. For example, suppose we have a population that is best described by a normal distribution with parameters µ and σ². The mean µ is a parameter describing the mean of the population, while x̄ would be the mean of any sample that we take. Since we can’t survey every person in a population, usually we take a sample instead and use the mean of that sample (x̄) to estimate the mean of the population (µ).

Statistics

Let’s define some terms with a little more rigor than we saw on the last slide.

Definition
A statistic is a function of Random Variables.

Definition
A point estimate of some population parameter θ is a single value θ̂ of a statistic Θ̂.

Statistic (Θ̂)                         Value of Statistic (θ̂)                 Parameter Estimated (θ)
X̄ = (1/n) Σ Xi                        x̄ = (1/n) Σ xi                         µ
S² = (1/(n−1)) Σ (Xi − X̄)²            s² = (1/(n−1)) Σ (xi − x̄)²             σ²
P̂ = X/n, with X ∼ Bin(n, p)           p̂ = x/n = (# successes)/n              p

Estimators

Definition
A statistic Θ̂ is said to be an unbiased estimator of the parameter θ if E(Θ̂) = θ.

Examples: Let Xi ∼ iid N(µ, σ²), i = 1, . . . , n, and let Y ∼ Bin(n, p).

E(X̄) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n) · n · µ = µ
E(P̂) = E(Y/n) = (1/n) E(Y) = (1/n)(np) = p.

Estimators

Again consider Xi ∼ iid N(µ, σ²), i = 1, . . . , n.

E(S²) = (1/(n−1)) Σ_{i=1}^n E[(Xi − X̄)²]
      = (1/(n−1)) Σ_{i=1}^n E[(Xi − µ + µ − X̄)²]
      = (1/(n−1)) Σ_{i=1}^n E[((Xi − µ) + (µ − X̄))²]
      = (1/(n−1)) Σ_{i=1}^n E[(Xi − µ)² + (µ − X̄)² + 2(Xi − µ)(µ − X̄)]
      = · · · = σ²  (continue the proof for HW)

So S² is an unbiased estimator of σ². This is why the estimate for variance commonly includes the 1/(n−1): to force the estimate to be unbiased.
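The effect of the 1/(n−1) divisor is easy to see by simulation (a Python/NumPy sketch, illustrative rather than part of the notes): averaging S² over many samples from N(µ, σ²) lands near σ², while the divide-by-n version is biased low by the factor (n−1)/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 8, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1
s2_biased = samples.var(axis=1, ddof=0)    # divides by n

print(s2_unbiased.mean())  # expect about sigma^2 = 4
print(s2_biased.mean())    # expect about ((n-1)/n) * sigma^2 = 3.5
```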
Biased vs. Unbiased Estimators

Clearly if an estimator is not unbiased, it is “biased.” Why use a biased estimator? It certainly wouldn’t make sense to estimate µ with (1/(n+1)) Σ Xi instead of (1/n) Σ Xi. In many other contexts, though, it does make sense to use a biased estimator.

One way to choose an estimator is by picking the statistic with the smaller MSE, where

MSE(Θ̂) = E[(Θ̂ − θ)²] = [E(Θ̂) − θ]² + Var(Θ̂).

The first term is the squared “bias.” Unbiased estimators can be good choices because the bias term disappears. Sometimes biased estimators can lead to a smaller MSE, though. Usually if you can find an unbiased estimator with minimal variance then you’re in good shape.

Biased vs. Unbiased Estimators

Are there any other reasons to use Biased Estimators? Actually there are plenty. Here are two examples:

Maximum Likelihood Estimator (MLE): This is literally the most likely value of an unknown parameter given the values from the sample.
  X̄ is both unbiased and the MLE of µ.
  ((n−1)/n) S² = (1/n) Σ (Xi − X̄)² is the MLE for σ².

Bayes Estimator: If you have some sort of prior knowledge with respect to a sample, then you can apply Bayes’ Theorem to our idea of Random Variables and come up with an estimator that way. These estimators are almost always biased.

We may end up talking about MLEs later on if we have time, but we will almost certainly not talk about Bayes estimators. Understanding the properties of both requires upper-level undergraduate & graduate level statistics courses, and Bayes estimators can spill into discussions about Decision Theory, which is well beyond the scope of this course.

The Sampling Distribution

Have some more definitions:

Definition
A sampling distribution is the probability distribution of a sample statistic. The standard deviation of a statistic is known as the standard error of the statistic.
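Returning to the MSE comparison above for a moment: for normal data, the divide-by-n MLE of σ² is biased yet has a smaller MSE than the unbiased S² (theory: 2σ⁴/(n−1) for S² versus (2n−1)σ⁴/n² for the MLE). A simulation sketch (Python/NumPy; illustrative values, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 1.0, 10, 200_000

samples = rng.standard_normal((reps, n))
s2 = samples.var(axis=1, ddof=1)   # unbiased estimator of sigma^2
mle = samples.var(axis=1, ddof=0)  # biased MLE, divides by n

mse_s2 = ((s2 - sigma2) ** 2).mean()
mse_mle = ((mle - sigma2) ** 2).mean()
print(mse_s2, mse_mle)  # theory: 2/(n-1) ≈ 0.222 vs (2n-1)/n^2 = 0.19
```

A little bias can buy a bigger reduction in variance, which is exactly the trade the MSE criterion measures.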
Remember that a statistic is literally a function of Random Variables. If a single Random Variable has its own distribution, then a combination of a bunch of Random Variables should have a certain distribution as well. Our first example will be a discussion of the distribution of the statistic X̄.

Central Limit Theorem (CLT)

This is the most important result in statistics. Simply stated: take a sample from a population. If we take a large enough sample, the distribution of the sample mean is normal, regardless of the distribution it was sampled from. Formally stated:

Theorem
Let X1, X2, . . . , Xn be random variables from a random sample from a population with mean µ and variance σ², and let Sn = X1 + X2 + · · · + Xn and X̄ = Sn/n. Then for large enough n (> 30),

Sn ∼ N(nµ, nσ²)
X̄ ∼ N(µ, σ²/n)

First some examples:

The amount of warpage in a type of wafer used in the manufacture of integrated circuits has mean 1.3 mm and standard deviation 0.1 mm. A random sample of 200 wafers is drawn. What is the probability that the sample mean warpage exceeds 1.305 mm?

First let’s write down the facts. Remember from the previous slide that X̄ ∼ N(µ, σ²/n). Why?

X̄ ∼ N(E(X̄), Var(X̄))
E(X̄) = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E(Xi) = (1/n) Σ_{i=1}^n µ = µ
Var(X̄) = ???
Example Continued

Var(X̄) = Var[(1/n) Σ_{i=1}^n Xi]
       = (1/n²) Var[Σ_{i=1}^n Xi]      (why? See Slide 112)
       = (1/n²) Σ_{i=1}^n Var(Xi)      (why? See Slide 112 — the Xi are independent, so the covariance terms vanish)
       = (1/n²) Σ_{i=1}^n σ² = nσ²/n² = σ²/n.

Example Continued

What is the probability that the sample mean warpage exceeds 1.305 mm?

P(X̄ > 1.305) = P( (X̄ − E(X̄))/√Var(X̄) > (1.305 − E(X̄))/√Var(X̄) )
             = P( Z > (1.305 − µ)/√(σ²/n) )
             = P( Z > (1.305 − 1.3)/(0.1/√200) )
             = P(Z > 0.707) ≈ 1 − P(Z < 0.71)
             = 1 − 0.7611 = 0.2389.

What’s really going on here?
First let’s review what’s happening when I talk about probabilities associated with Statistics (functions of random variables). The sample x1, x2, . . . , xn are realizations of the Random Variables (X1, X2, . . . , Xn) taking on specific values. If I add them, it creates another Random Variable, Sn, with its own distribution. How do we find the distribution of Sn? The answer is complicated and we can’t discuss it completely. We care more about situations and conclusions anyway, so we’ll just talk about those: if n is large enough, Sn “becomes” normally distributed. See slide 127 for the exact statement of the theorem.

To be more specific:

It’s not exactly correct to say Sn “becomes” normal. It’s better to say that Sn converges to a normal Random Variable in the sense that the cdfs converge. Here is roughly how it works: let Fn be the cdf of Sn, and let Φ be the cdf of some normal Random Variable. Then as n → ∞, Fn → Φ. This is called convergence in distribution, since the cumulative distribution functions are converging to another distribution function. If we take

Zn = (X̄ − E(X̄))/√Var(X̄) = (X̄ − µ)/(σ/√n)

and n > 30, then we would write something like Zn →d Z ∼ N(0, 1).

More Examples

Let X ∼ Bin(n, p). Recall that if Yi ∼ iid Bernoulli(p) then X = Σ_{i=1}^n Yi. We can apply the CLT in this case as well. If we have a “large enough” n we can say the following:

Theorem
Let X ∼ Bin(n, p). If np > 15 and n(1 − p) > 15, then the following hold by the CLT:

X ∼ N(np, np(1 − p))
P̂ = X/n ∼ N(p, p(1 − p)/n)

In this case a large sample size isn’t the only thing we need to have.
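Convergence in distribution can be watched numerically. The sketch below (Python/NumPy; an illustration with an arbitrarily chosen Exp(1) population and n = 50, not part of the notes) standardizes the sample mean of a clearly non-normal population and compares an empirical probability against the standard normal value Φ(1) ≈ 0.8413.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 200_000

# Exp(1) population: mu = 1, sigma = 1, and visibly skewed (non-normal).
samples = rng.exponential(1.0, size=(reps, n))
z_n = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

p_hat = (z_n <= 1.0).mean()
print(p_hat)  # should be close to Phi(1) = 0.8413 by the CLT
```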
Let’s see this in action: www.stat.tamu.edu/~west/applets/binomialdemo1.html

Slight Problem

The binomial distribution is represented by a discrete pmf, yet the last slide said we can approximate it with a continuous distribution. This leads to some information loss when we approximate binomials with the CLT.

General Rule: Let X ∼ Bin(n, p). If I have to find P(a ≤ X ≤ b), pretend it’s P(a − 1/2 ≤ X ≤ b + 1/2).

Example
A machine makes 1000 steel O-rings per day. Each ring has 0.9 probability of meeting a thickness specification. What is the probability that fewer than 890 O-rings meet the specification?

CLT for Binomial

Let X ∼ Bin(1000, 0.9). Note that np = 900 and n(1 − p) = 100, so we can apply the CLT. Since “fewer than 890” means X ≤ 889,

P(X < 890) = P(X ≤ 889) ≈ P( (X − np)/√(np(1 − p)) ≤ (889 + 1/2 − 1000(0.9))/√(1000(0.9)(0.1)) )
           → P(Z ≤ −1.11) = · · · ≈ 0.134.

CLT

Some things to consider:

The book uses 10 instead of 15 as a cutoff. The book does not use the 1/2 trick in any examples. On tests and quizzes I will ask you to check that np and n(1 − p) are greater than 15. I will not require the 1/2 trick.

Similarly we can say something about the Poisson distribution:

Theorem
If X ∼ Poisson(λ) where λ > 15, then by the CLT X ∼ N(λ, λ).

What if n < 30?

If n < 30 then we can’t make inference about µ if we have no idea what the distribution of our sample is. Also, what if we don’t know σ? It turns out that if n is large, then s → σ, so we don’t have to worry about that unless n is small.

Theorem
If X1, X2, . . .
, Xn are normally distributed and n < 30, then we say that T = (X̄ − µ)/(S/√n) follows a t-distribution with (n − 1) degrees of freedom.

What the heck is the t-distribution?

Consider the form of the statistic T:

T = (X̄ − µ)/(S/√n)
  = [(X̄ − µ)/(σ/√n)] / [(S/√n)/(σ/√n)]
  = Z / √(S²/σ²)
  = Z / √( [(n−1)S²/σ²] / (n−1) )
  = Z / √(Q/(n−1))

where Z ∼ N(0, 1) and Q ∼ chi-squared(n − 1). It turns out that T = Z/√(Q/(n−1)) has a t-distribution. In other words, T is a function of 2 random variables which itself has a pdf that looks like the following:

f(t; ν) = [Γ((ν+1)/2) / (√(νπ) Γ(ν/2))] (1 + t²/ν)^{−(ν+1)/2},  −∞ < t < ∞

More about the t distribution

Like Z, T has a bell-shaped curve, but with “longer” tails. How do we find probabilities? We could integrate the function from the last slide, but it’s very nasty. Instead we’ll use a table (Table 4 in your text), just like we did for the Z case. We’ll see how to do this shortly.

Remember that the use of the t-distribution depends on whether we think the sample comes from a normal distribution. How can we assume normality if we are not clearly told? Recall the empirical (68-95-99.7) rule and our definition of outliers (points more than 1.5 IQR beyond the quartiles of the sample). If we have potential outliers in our data then we shouldn’t use the t-distribution to estimate µ!

Using the t-distribution

Example 8.22) It is known from past samples that the pH of water in Bolton Creek tends to be approximately normally distributed. The average pH of water in the creek is estimated regularly by taking 12 samples from different parts of the creek. Assuming they represent random samples from the creek, find the approximate probability that the sample mean of the 12 pH measurements will be within 0.2 units of the true average pH for the field.
The most recent sample measurements were as follows:

6.63, 6.59, 6.65, 6.67, 6.54, 6.13, 6.62, 7.13, 6.68, 6.82, 7.62, 6.56

A quick calculation tells us that s ≈ 0.362. Since n = 12 < 30 and we are told the data come from a normal distribution, we can say

P(|X̄ − µ| ≤ 0.2) = P(−0.2 ≤ X̄ − µ ≤ 0.2)
  = P( −0.2/(s/√n) ≤ (X̄ − µ)/(S/√n) ≤ 0.2/(s/√n) )
  = P( −0.2/(0.362/√12) ≤ T(12−1) ≤ 0.2/(0.362/√12) )
  ≈ P(−1.916 ≤ T11 ≤ 1.916) = ?

Note that P(T11 > 1.916) ∈ [0.025, 0.05]. That means that P(T11 > 1.916) + P(T11 < −1.916) ∈ [0.05, 0.10]. So we can say that P(−1.916 ≤ T11 ≤ 1.916) ∈ [0.90, 0.95].

Example Continued

If we use software we can actually say that P(−1.916 ≤ T11 ≤ 1.916) = 0.9183019, using the following code in R:

> v = c(6.63, 6.59, 6.65, 6.67, 6.54, 6.13, 6.62, 7.13, 6.68, 6.82, 7.62, 6.56)
> s = sqrt(var(v))
> t = 0.2/(s/sqrt(12))
> pt(t, 11) - pt(-1*t, 11)

What if we weren’t explicitly told that the data were normally distributed? What would we do? We could look for outliers. If we have outliers, then the data probably doesn’t come from a normal population.
Let’s use R to get some summary statistics:

> summary(v)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.130   6.582   6.640   6.720   6.715   7.620

Then IQR = Q3 − Q1 = 6.715 − 6.582 = 0.133 and 1.5 × IQR = 0.1995. So

(Q1 − 1.5 IQR, Q3 + 1.5 IQR) = (6.582 − 0.1995, 6.715 + 0.1995) = (6.3825, 6.9145)

Clearly we have some points in our data set that might qualify as outliers. This is reflected in the boxplot as well. We need only type boxplot(v) in R to get one:

[Boxplot of v, with the vertical axis running from about 6.5 to 7.5 and the outlying points plotted individually]

Example Concluded

So what can we say about this example? The homework problem on its own is good practice to see how the t-table works. If I were an actual scientist in the field, I might look at this information and question whether we could actually assume that the pH values have a normal distribution. Remember that we never want to take a shortcut and just delete “bad” numbers. While they may be outliers by our definition, we need to keep them in our study as long as we can determine that there were no measurement/recording errors.

Homework: 8.1 - 8.4 odds

More on the chi-squared distribution

Suppose that X1, X2, . . . , Xn are independent random variables sampled from a normal distribution with mean µ and variance σ². Recall that S² = (1/(n−1)) Σ (Xi − X̄)² is itself a random variable, and thus has a distribution function associated with it.

Theorem
Let U = (n−1)S²/σ², with the assumptions from above. Then U ∼ chi-squared (χ²) with (n − 1) degrees of freedom. The probability density function of this χ² random variable is given by:

f(u) = [1 / (Γ((n−1)/2) 2^{(n−1)/2})] u^{(n−1)/2 − 1} e^{−u/2},  u > 0.

Another awkward integral!

Remember that we didn’t do too much with the gamma distribution.
Depending on the parameters, the integral can get messy really quickly. We’ll be using the χ² distribution a lot, though. Since we won’t be integrating that nasty function, we’ll instead just use a table (Table 5 in your book).

Before we get into an example, let’s state one quick fact about U. Recall that the χ² distribution has a mean equal to its degrees of freedom. In other words...

E(U) = E[(n−1)S²/σ²] = [(n−1)/σ²] E(S²), and E(U) := n − 1,
⇒ E(S²) = [σ²/(n−1)] (n − 1) = σ².

How does the table work?

Example 8.42) Ammeters produced by a certain company are marketed under the specification that the standard deviation of gauge readings be no larger than 0.2 amp. Ten independent readings on a test circuit of constant current, using one of these ammeters, gave a sample variance of 0.065. Does this suggest that the ammeter used does not meet the company’s specification? [Hint: Find the approximate probability of a sample variance exceeding 0.065 if the true population variance is 0.04.]
P(S² > 0.065) = P( (n−1)S²/σ² > (n−1)(0.065)/σ² )
             = P( U > (10−1)(0.065)/0.04 ) = P(U > 14.625)
             > P(U > 14.684) = 0.1

Example Continued

The inequality doesn’t say much on its own, but since 14.625 is so close to 14.684 we can be sure that P(U > 14.625) ≈ 0.1. If we use R we can find the exact value:

> pchisq(14.625, df = 9, lower.tail=FALSE)
[1] 0.1017651

which agrees with our intuition from above.

Homework: 8.5 odds

What if I want to compare the means of 2 Different Populations?

If X̄ is an estimate of µX, what do you think a good estimate of µX − µY would be? There are 3 different cases that we need to consider:

Large samples
Small samples, equal variances
Small samples, unequal variances

Let’s go through the cases one-by-one and do a few examples. In each case suppose we sample randomly from 2 populations, i.e.

X1, X2, . . . , X_{n1},  Y1, Y2, . . . , Y_{n2}

Cases 1 and 2:

Theorem
Suppose the sizes of the samples, n1 and n2, are both ≥ 25. Then

[X̄ − Ȳ − (µX − µY)] / √(σX²/n1 + σY²/n2) ∼ N(0, 1)    (35)

Theorem
Suppose n1 and n2 are small and the variances are unknown but assumed equal. Then if the populations are normal,

[X̄ − Ȳ − (µX − µY)] / [Sp √(1/n1 + 1/n2)] ∼ t_{n1+n2−2}    (36)

where

Sp² = [(n1 − 1)SX² + (n2 − 1)SY²] / [(n1 − 1) + (n2 − 1)]    (37)

Case 3:

Sp² from the last slide is called the pooled variance.

Theorem
Suppose n1 and n2 are small and the variances are unknown and assumed unequal. Then if the populations are normal,

[X̄ − Ȳ − (µX − µY)] / √(SX²/n1 + SY²/n2) ∼ t_ν    (38)

where

ν = (SX²/n1 + SY²/n2)² / [ (SX²/n1)²/(n1−1) + (SY²/n2)²/(n2−1) ]

Note that ν is rarely a whole number. In these cases round ν up to get a conservative estimate when using the t-table.
Examples from the Exercises:

Example 8.48) Soil acidity is measured by a quantity called pH. A scientist wants to estimate the difference in the average pH for two large fields using pH measurements from randomly selected core samples. If the scientist selects 20 core samples from field 1 and 15 core samples from field 2, independently of each other, find the approximate probability that the sample mean of the pH measurements for field 1 will be larger than that for field 2 by at least 0.5. The sample variances for the pH measurements for fields 1 and 2 are 1 and 0.8 respectively. In the past, both fields have had approximately the same mean soil acidity levels.
Let’s write down what we know so far:

n1 = 20, s1² = 1.0
n2 = 15, s2² = 0.8

P(X̄1 − X̄2 ≥ 0.5)
  = P( [X̄1 − X̄2 − (µ1 − µ2)] / √(s1²/n1 + s2²/n2) ≥ [0.5 − (µ1 − µ2)] / √(s1²/n1 + s2²/n2) )

What is the distribution of the thing on the left? What is the value of the term on the right?

  = P( T_ν ≥ (0.5 − 0)/√(1.0/20 + 0.8/15) ) = P(T_ν ≥ 1.555428)

where

ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1−1) + (s2²/n2)²/(n2−1) ]
  = (1/20 + 0.8/15)² / [ (1/20)²/(20−1) + (0.8/15)²/(15−1) ] = 31.897

Example 48 continued

P(X̄1 − X̄2 ≥ 0.5) = P(T_ν ≥ 1.555428), where ν = 31.897,
                 ≈ P(T32 ≥ 1.555428) ∈ [0.05, 0.10]

One more example

Example 8.52) The service times for customers coming through a checkout counter in a retail store are independent random variables with a mean of 15 minutes and a variance of 4. At the end of the work day, the manager independently selects a random sample of 100 customers served by each of checkout counters A and B.
Approximate the probability that the sample mean service time for counter A is lower than that for counter B by 5 minutes.

So far we know the following:

µ = 15, σ² = 4
nA = nB = 100

Exercise 52 continued

P(X̄B − X̄A ≥ 5)
  = P( [X̄B − X̄A − (µB − µA)] / √(σB²/nB + σA²/nA) ≥ [5 − (µB − µA)] / √(σB²/nB + σA²/nA) )
  = P( Z ≥ (5 − 0)/√(4/100 + 4/100) )
  = P(Z ≥ 17.67767) ≈ 0.

Homework: 8.47, 49, 51, 53

Dependent Samples

In the last section we talked about differences between means for 3 cases of independent samples. In certain situations we can also say something about the difference of two means between 2 dependent samples. (See the examples in the textbook.)

Theorem
Consider the Random Variables Xi, Yi, Di, where i = 1, . . . , n and Di = Xi − Yi. If the Xi and Yi are drawn from normal distributions, then

(D̄ − µD) / (SD/√n) ∼ T_{n−1}    (39)

where D̄ = Σ Di/n, E(Di) = µD = µX − µY, and SD² = (1/(n−1)) Σ (Di − D̄)².

Example
Six bean plants had their carbohydrate concentrations (in percent by weight) measured both in the shoot and in the root. The following results were obtained:

Plant:  1     2     3     4     5     6
Shoot:  4.42  5.81  4.65  4.77  5.25  4.75
Root:   3.66  5.51  3.91  4.47  4.69  3.93

Previous experience indicates that the shoot concentration should be about 0.5% more than the root concentration. Find the probability that the shoot measurements are greater than the root measurements on average by more than 0.55.
Example Continued

The differences are given by: 0.76, 0.30, 0.74, 0.30, 0.56, 0.82.

Using a calculator we find that d̄ = 0.58 and sd ≈ 0.23.

P(D̄ > 0.55) = P( (D̄ − µD)/(Sd/√n) > (0.55 − µD)/(Sd/√n) )
            = P( T5 > (0.55 − 0.5)/(0.23/√6) )
            = P(T5 > 0.524) > P(T5 > 0.727) = 0.25.

In fact, P(T5 > 0.524) = 0.31.

Difference of Two Proportions

Previously we applied the Central Limit Theorem to find the limiting distribution of P̂. Using the properties of the normal distribution, we can also find the distribution of the difference P̂1 − P̂2, i.e. the difference in the estimates of two proportions from independent samples.

Theorem
Suppose X1 ∼ Bin(n1, p1) and X2 ∼ Bin(n2, p2). If the assumptions for applying the CLT to X1 and X2 both hold (see slide 133), then

[P̂1 − P̂2 − (p1 − p2)] / √( p1(1−p1)/n1 + p2(1−p2)/n2 ) ∼ N(0, 1).    (40)

Example
The specification for the pull strength of a wire that connects an integrated circuit to its frame is 10 g or more. In a sample of 85 units made with gold wire, 68 met the specification, and in a sample of 120 units made with aluminum wire, 105 met the specification.
Scientists at the facility believe the true difference of proportions is about 0.5. Find the probability that the difference in the proportions is less than 0.4 in absolute value.

Note that p̂1 = 68/85 = 0.8 and p̂2 = 105/120 = 0.875.

    P( |P̂1 − P̂2| < 0.4 ) = P( −0.4 < P̂1 − P̂2 < 0.4 )
      = P( −17.03 < ( P̂1 − P̂2 − 0.5 ) / √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 ) < −1.89 )
      = P( −17.03 < Z < −1.89 ) ≈ 0.0292.

Comparing Population Variances

So far we've studied distributions of the statistics that estimate the means and proportions of 1 or more populations, and the statistic that estimates the variance of one population. All that's left to do for chapter 8 is to compare two variances.

Recall previous sections where we discussed the difference in population parameters: we estimated their difference, i.e. we estimated θ1 − θ2 with Θ̂1 − Θ̂2. We can't take the same approach when comparing 2 variances, though. When we talked about probabilities in terms of S² we used U = (n−1)S²/σ², which has a χ² distribution: a distribution defined only on the positive half of the real number line. Any time we subtract two random variables, we risk getting a negative value with positive probability. We'll have to change our approach in order to compare two variances.
A slightly different approach

Suppose I have two unknown numbers, say θ1 and θ2. I could test if θ1 > θ2, but that's the same thing as checking if θ1 − θ2 > 0. We did things like this for comparing means or proportions since it's easy to find the distribution of the difference of normal random variables (they are almost always normal). For comparing variances it doesn't work, since a difference of chi-square random variables doesn't give us anything useful. Equivalently we can use

    θ1 > θ2  ⇒  θ1/θ2 > 1.

The point of all this:

It turns out that we can throw in some constants and get the distribution of S1²/S2². We'll just use that to compare two population variances.

Theorem
Suppose two independent random samples from normal distributions with respective sizes n1 and n2 yield sample variances S1² and S2². Let Ui = (ni − 1)Si²/σi², i = 1, 2, where σ1² and σ2² are the variances of population 1 and 2 respectively. Then

    F = ( U1/(n1 − 1) ) / ( U2/(n2 − 1) )    (41)

has a known sampling distribution, called an F-distribution with ν1 = n1 − 1 and ν2 = n2 − 1 degrees of freedom.

Note: the book goes ahead and cancels out the degrees of freedom from the Ui and just gives the statistic as F = (S1²/σ1²)/(S2²/σ2²). It's the same thing.

More about the F-distribution

So why use the F-distribution? Why write it in terms of Ui and not like how the book does it? Notice that each Ui has a chi-square distribution, so F is just a ratio of two non-negative quantities... meaning F is non-negative as well. Now we don't have to worry about dealing with negative numbers when talking about variances. In the statement of our theorem, F is written as a ratio of two chi-square random variables, each divided by its degrees of freedom.
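The construction in the theorem can be illustrated by simulation: build each Ui as a sum of squared standard normals (a chi-square draw), divide by its degrees of freedom, and take the ratio. A minimal sketch; the sample count and seed are arbitrary choices of ours:

```python
import random

def chi_square(df, rng):
    # Sum of df squared standard normals ~ chi-square with df degrees of freedom
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

d1, d2 = 9, 7
rng = random.Random(1)
draws = [(chi_square(d1, rng) / d1) / (chi_square(d2, rng) / d2)
         for _ in range(200_000)]

# An F(d1, d2) variable is never negative, and its mean is d2/(d2 - 2) = 1.4
print(min(draws) >= 0, round(sum(draws) / len(draws), 2))
```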
This is how the F-distribution is constructed in all cases, not just in the specific case of comparing population variances. "My" definition is just more in line with the statistics literature. Also... we'll be using it in future chapters.

More about the F-distribution

What does the pdf of the F-distribution look like? Let d1 and d2 represent the numerator and denominator degrees of freedom respectively.

    f(x; d1, d2) = [ 1 / ( x B(d1/2, d2/2) ) ] √( (d1 x)^{d1} d2^{d2} / (d1 x + d2)^{d1+d2} )
                 = [ 1 / B(d1/2, d2/2) ] (d1/d2)^{d1/2} x^{d1/2 − 1} ( 1 + d1 x/d2 )^{−(d1+d2)/2}

where B(α, β) = Γ(α)Γ(β) / Γ(α + β).

This is yet another pdf that we don't want to integrate directly. We'll be using yet another table (Tables 6 and 7 in your appendix) for this calculation. We'll see how to use the table in an example.

Using the F-distribution

Example
Pull-strength tests on 10 soldered leads for a semiconductor device yield the following results in pounds of force required to rupture the bond:

    19.8 18.8 12.7 11.1 13.2 14.3 16.9 17.0 10.6 12.5

Another set of 8 leads was tested after encapsulation to determine whether the pull strength has been increased by encapsulation of the device, with the following results:

    24.9 22.8 23.6 22.1 20.4 21.6 21.8 22.5

Comment on the evidence available concerning equality of the two population variances.

With a calculator we can see: ν1 = 10 − 1 = 9, ν2 = 8 − 1 = 7 and s1² = 10.441, s2² = 1.846.
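The sample variances above, and a tail probability of the F(9, 7) distribution, can be reproduced by coding the pdf just given and integrating it numerically (a sketch; the midpoint rule and grid size are our choices, not something the slides use):

```python
from math import gamma
from statistics import variance

before = [19.8, 18.8, 12.7, 11.1, 13.2, 14.3, 16.9, 17.0, 10.6, 12.5]
after  = [24.9, 22.8, 23.6, 22.1, 20.4, 21.6, 21.8, 22.5]
s1_sq, s2_sq = variance(before), variance(after)

def f_pdf(x, d1, d2):
    # pdf of the F-distribution, with B(a, b) = gamma(a)gamma(b)/gamma(a+b)
    beta = gamma(d1 / 2) * gamma(d2 / 2) / gamma((d1 + d2) / 2)
    return (d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1) \
        * (1 + d1 * x / d2) ** (-(d1 + d2) / 2) / beta

# P(F(9,7) <= 1) by midpoint-rule integration on [0, 1]
n = 20_000
p_le_1 = sum(f_pdf((k + 0.5) / n, 9, 7) for k in range(n)) / n
print(round(s1_sq, 3), round(s2_sq, 3), round(1 - p_le_1, 2))
```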
Example Continued

    P( S1²/S2² > 1 ) = P( (S1²/σ1²)/(S2²/σ2²) > (1/σ1²)/(1/σ2²) ) = P( F9,7 > 1 ),

since we test under the initial assumption that the population variances are equal,

    ≈ 0.51.

It's difficult to get an accurate idea of the probability without using a computer, since now we have 3 dimensions (numerator df, denominator df and probability), whereas we had 2 dimensions for the t-table (df and probability) and only one dimension for the z-table (probability).

Another example of using the F-table

Of course we can go forwards and backwards using the F-table.

Example
Find the value F0 such that P(F9,7 > F0) = 0.05.

We see from the α = 0.05 F-table that F0 = 3.6767.

Chapter 9: Some review

To be honest, we already covered the material from 9.1, but it doesn't hurt to do a little review.
Table 9.1 on page 428 is a good resource:

    Parameter                                     Estimator
    µ = population mean                           µ̂ = X̄ = sample mean
    σ² = population variance                      σ̂² = S² = sample variance
    p = population proportion                     p̂ = X/n = sample proportion
    µ1 − µ2 = diff in population means            X̄1 − X̄2 = diff in sample means
    p1 − p2 = diff in population proportions      p̂1 − p̂2 = X1/n1 − X2/n2 = diff in sample proportions
    σ1²/σ2² = ratio of two population variances   S1²/S2² = ratio of sample variances

Topics from 9.1 we've discussed already:

Estimators (biased and unbiased).

Mean Squared Error (MSE):

    MSE(Θ̂) = [ E(Θ̂) − θ ]² + Var(Θ̂)

Concepts of "better" estimators:

Theorem
If Θ̂1 and Θ̂2 are two estimators of θ, then the estimator Θ̂1 is considered a better estimator than Θ̂2 if

    MSE(Θ̂1) ≤ MSE(Θ̂2)

Just read 9.1 to remind yourself of these concepts or just re-read those slides.

Limitations of Point Estimators

Consider the sample mean, X̄. It only takes one outlier to throw off your estimator completely. It is often convenient to estimate parameters with an interval estimate, where the interval contains all the reasonable values that a parameter could take on. We call this a Confidence Interval. We derive the confidence interval using the sampling distribution of the point estimate (e.g. Z and T for X̄, etc.)

Confidence Interval Defined

Generally speaking:

Definition
Suppose Θ̂ is an estimator of θ with a known sampling distribution, and we can find two quantities that depend on Θ̂, say g1(Θ̂) and g2(Θ̂), such that

    P( g1(Θ̂) ≤ θ ≤ g2(Θ̂) ) = 1 − α

where α ∈ (0, 1). Then we can say that ( g1(Θ̂), g2(Θ̂) ) forms an interval that has probability (1 − α) of capturing the true θ.
Example

Let Z ∼ N(0, 1) and α ∈ (0, 1). Let z_{α/2} be the value such that P(Z ≥ z_{α/2}) = α/2. Since the pdf of Z is symmetric about 0, we can also say P(Z ≤ −z_{α/2}) = α/2. Then

    1 − α = P( −z_{α/2} ≤ Z ≤ z_{α/2} ) = P( −z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2} ).

Now if we just take the whole argument within P() and solve for µ, we get an interval for µ. Namely

    x̄ − z_{α/2} σ/√n ≤ µ ≤ x̄ + z_{α/2} σ/√n.

The above is the confidence interval for µ when we have a large sample size.

Caution:

We need to be very careful about how we discuss confidence intervals. For example, we cannot say "The probability that µ falls into its interval is 95%." Why is that? µ is a parameter (a fixed unknown real number) and not a random variable.

Well, why not just get a 100% confidence interval? The only way to do this would be to say that our confidence interval is (−∞, ∞), and that's just useless. Traditionally we find 90%, 95%, 99% intervals (α = 0.10, 0.05, 0.01 respectively).
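What "95% confidence" does refer to is the procedure: across repeated samples, about 95% of the intervals built this way capture the true mean. A small simulation sketch (the number of samples, sample size, and seed are arbitrary choices of ours):

```python
import random
from math import sqrt
from statistics import mean

rng = random.Random(7)
z = 1.96               # z_{0.025}, from the standard normal table
n, trials = 10, 100

covered = 0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean 0, sigma = 1
    half = z * 1.0 / sqrt(n)                          # sigma is known here
    lo, hi = mean(sample) - half, mean(sample) + half
    covered += (lo <= 0.0 <= hi)

print(covered)   # typically close to 95 of the 100 intervals
```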
So what's really going on here?

[Figure: 95% confidence bands for the mean of 100 samples of size 10 from N(0, 1); the simulation index runs from 1 to 100, and each band is marked according to whether it includes 0 or does not include 0.]

Some True/False questions:

Example
Suppose a random sample of 114 students was chosen, and each student was asked how many hours he or she studies each week. The resulting 95% confidence interval for µ was (8.9, 11.8). Determine if each one of the following statements is true or false:

95% of all students study between 8.9 and 11.8 hours per week: FALSE
95% of all sample means will be between 8.9 and 11.8: FALSE
95% of samples will have averages between 8.9 and 11.8: FALSE
For 95% of all samples, µ will be between 8.9 and 11.8: FALSE
For 95% of all samples, µ will be included in the resulting 95% confidence interval: TRUE
The formula produces intervals that capture the sample mean for 95% of all samples: FALSE
The formula produces intervals that capture the population mean for 95% of all samples: TRUE

Confidence Intervals

Each confidence interval can be written in the following form:

    ( Point Estimate − Margin of Error , Point Estimate + Margin of Error )

Theorem
A large random sample confidence interval for µ with confidence coefficient approximately (1 − α) is given by

    ( X̄ − z_{α/2} σ/√n , X̄ + z_{α/2} σ/√n )    (42)

If σ is unknown, replace it with s, the sample standard deviation, with no serious loss of accuracy.

Definition
The (1 − α)100% margin of error to estimate µ from a large sample is

    B = z_{α/2} σ/√n

Confidence Intervals

Sometimes confidence intervals can be very wide, and as such do not always give very valuable information. Consider what happens if we let α vary. How does increasing/decreasing α affect the width of the confidence interval?

Suppose we have a fixed level of α and a fixed margin of error in mind. We can guarantee that size of the margin of error if we take a large enough sample.

Theorem
The sample size for establishing a confidence interval of the form X̄ ± B with confidence coefficient (1 − α) is given by

    n ≥ ( z_{α/2} σ / B )²    (43)

Example
How many samples will it take so that a 95% Confidence Interval specifies the mean to within ±25? Suppose σ = 221.
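This sample-size requirement can be checked directly from formula (43); a minimal sketch with the example's values:

```python
from math import ceil

z = 1.96       # z_{0.025} for 95% confidence
sigma = 221.0
B = 25.0       # desired margin of error

n_exact = (z * sigma / B) ** 2
n = ceil(n_exact)   # always round UP to keep at least 95% confidence
print(round(n_exact, 4), n)
```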
    n ≥ ( (1.96)(221) / 25 )² = 300.2041  ⇒  n = 301.

We always round up for our final answer since we need whole numbers for sample sizes. We don't round down because that would leave us with less than 95% confidence.

One more thing...

One last thing before we write out all the confidence intervals we'll be using: sometimes we are interested in only upper or lower bounds. For example:

    1 − α = P( Z ≤ z_α ) = P( (X̄ − µ)/(σ/√n) ≤ z_α )  ⇒  µ ≥ X̄ − z_α σ/√n

Theorem
A one-sided (upper) confidence interval for µ is given by

    ( X̄ − z_α σ/√n , ∞ )

Similarly, a one-sided (lower) confidence interval is given by

    ( −∞ , X̄ + z_α σ/√n )

What's left?

Essentially we've covered all the theory about confidence intervals in chapter 8; all the derivations more or less look the same, and most (if not all) are in the book. Let's just list the confidence intervals and do some examples.

The following are Confidence Intervals for single parameters:

    Parameter   Details                      CI
    µ           n > 30                       x̄ ± z_{α/2} σ/√n
    µ           n < 30 & σ unknown,          x̄ ± t_{α/2,ν} s/√n,  where ν = n − 1
                normal population
    p           CLT                          p̂ ± z_{α/2} √( p̂(1−p̂)/n )
    σ²          normal population            ( (n−1)s²/χ²_{α/2,ν} , (n−1)s²/χ²_{1−α/2,ν} ),  where ν = n − 1

The following are Confidence Intervals for comparing 2 parameters:

    Parameter   Details                      CI
    µ1 − µ2     large ind. samples           X̄1 − X̄2 ± z_{α/2} √( σ1²/n1 + σ2²/n2 )
    µ1 − µ2     small samples,               X̄1 − X̄2 ± t_{α/2,ν} √( s1²/n1 + s2²/n2 ),
                unknown σi,                  where ν = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1−1) + (s2²/n2)²/(n2−1) ]
                normal ind. pops
    µ1 − µ2     pooled variances             X̄1 − X̄2 ± t_{α/2,ν} s_p √( 1/n1 + 1/n2 ),
                                             where ν = n1 + n2 − 2 and s_p² = [ (n1−1)s1² + (n2−1)s2² ] / (n1 + n2 − 2)
    p1 − p2     CLT                          p̂1 − p̂2 ± z_{α/2} √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )
    σ1²/σ2²     normal ind. pops             ( (s2²/s1²) F_{1−α/2,ν1,ν2} , (s2²/s1²) F_{α/2,ν1,ν2} ),
                                             where ν1 = n1 − 1, ν2 = n2 − 1
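The first row of the table can be wrapped in a small helper; this is a sketch of ours (the function name is not from the slides), using `NormalDist` in place of the z table:

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, alpha):
    """Large-sample (1 - alpha) confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

# Sanity check: standard normal data, n = 100, 95% confidence
lo, hi = mean_ci(0.0, 1.0, 100, 0.05)
print(round(lo, 3), round(hi, 3))
```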
Example (9.12)
An important property of plastic clays is the percent of shrinkage on drying. For a certain type of plastic clay, 45 test specimens showed an average shrinkage percentage of 18.4 and a standard deviation of 1.2. Estimate the true average percent of shrinkage for specimens of this type in a 98% confidence interval.

Which CI do we use? The large sample µ CI.

n = 45, x̄ = 18.4, s = 1.2
98% confidence ⇒ 0.98 = 1 − α ⇒ α = 0.02 ⇒ α/2 = 0.01

    x̄ ± z_{α/2} s/√n = 18.4 ± 2.326 (1.2/√45) = (17.984, 18.816)

This stuff is really easy! The hard part is just figuring out which CI to use, and from there you just fill in the blanks.

Example (9.18)
Careful inspection of 70 precast concrete supports to be used in a construction project revealed 28 with hairline cracks.
Estimate the true proportion of supports of this type with cracks in a 98% confidence interval.

Which CI do we use? The p CI. Wait, can we say the CLT holds? np̂ = 28 and n(1 − p̂) = 70 − 28 = 42, so yes.

p̂ = 28/70, n = 70, α = 0.02

    p̂ ± z_{α/2} √( p̂(1−p̂)/n ) = 28/70 ± 2.326 √( (28/70)(42/70)/70 ) = (0.264, 0.536)

Example (9.22)
The warpwise breaking strength measured on five specimens of a certain cloth gave a sample mean of 180 psi and a standard deviation of 5 psi. Estimate the true mean warpwise breaking strength for cloth of this type in a 95% confidence interval. What assumption is necessary for your answer to be valid?

What CI do we use?
The small sample µ CI.

What assumption do we need? The data come from a normal population.

x̄ = 180, s = 5, n = 5, α = 0.05

    x̄ ± t_{α/2,ν} s/√n = 180 ± t_{0.025,4} (5/√5) = 180 ± 2.776(2.236) = (173.79, 186.21)

Example (9.42c)
For a certain species of fish, the LC50 measurements (in parts per million) for DDT in 12 experiments were as follows, according to the EPA:

    16, 5, 21, 19, 10, 5, 8, 2, 7, 2, 4, 9

Another common insecticide, Diazinon, gave LC50 measurements of 7.8, 1.6, and 1.3 in three independent experiments. Estimate the true variance ratio in a 90% confidence interval.

What CI do we use?
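For this variance-ratio example, the sample variances and the table-based interval can be sketched as below; the F-table values F_{0.95,11,2} ≈ 0.2511113 and F_{0.05,11,2} ≈ 19.40496 are taken as given rather than computed:

```python
from statistics import variance

ddt = [16, 5, 21, 19, 10, 5, 8, 2, 7, 2, 4, 9]
diazinon = [7.8, 1.6, 1.3]

s1_sq = variance(ddt)        # 454/11
s2_sq = variance(diazinon)

# 90% CI for the variance ratio, using F-table values for (11, 2) df
f_lo, f_hi = 0.2511113, 19.40496
ratio = s2_sq / s1_sq
ci = (ratio * f_lo, ratio * f_hi)
print(round(s1_sq, 5), round(s2_sq, 5), round(ci[0], 4), round(ci[1], 4))
```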
The σ1²/σ2² CI.

s1² = 41.27273, s2² = 13.46333, n1 = 12, n2 = 3, α = 0.10

    ( (s2²/s1²) F_{1−α/2,ν1,ν2} , (s2²/s1²) F_{α/2,ν1,ν2} )
      = ( (13.46333/41.27273) F_{0.95,11,2} , (13.46333/41.27273) F_{0.05,11,2} )
      = ( (13.46333/41.27273)(0.2511113) , (13.46333/41.27273)(19.40496) )
      = (0.08191351, 6.329975)

End of Chapter 9

For now we're done with chapter 9. We might go back to 9.4 later on if time allows.

Homework: 9.2 odds, 9.3 odds.

The summary in section 9.7 is very good, but covers some situations not covered in slides (linear functions of means). You are only responsible for the situations covered in the slides.

Chapter 10: Hypothesis Testing

The first page of the text summarizes the ideas behind Hypothesis Testing. For homework, please read it. Generally speaking, Hypothesis Testing is the idea of checking claims in such a way that after looking at a sample, a claim about a population (usually a population parameter like µ, σ, p, etc.) is either rejected or not rejected. Usually instead of the word "claim" we'll use the word hypothesis.

Definition
A hypothesis is a statement about the population parameter or process characteristic.

Types of hypotheses

There are two types: the null and the alternative.

Definition
A null hypothesis is a statement that specifies a particular value (or values) for the parameter being studied. It is denoted by H0. The null hypothesis represents ideas that are currently accepted as the norm.

Definition
An alternative hypothesis is a statement of the change from the null hypothesis that the investigation is designed to check. It is denoted by Ha (also sometimes H1). Think of Ha as a new idea that contradicts the null hypothesis.
The process

Every hypothesis test is started by clearly stating H0 and H1.

Example
The following are generic examples of how to state hypotheses:

    H0: µ = 5       H1: µ ≠ 5
    H0: µ ≤ 5       H1: µ > 5
    H0: p ≥ 0.25    H1: p < 0.25

Note that the hypotheses never "overlap." After a hypothesis test is over, we should come to some concrete decision; it doesn't make sense to choose one hypothesis over the other if they both share something in common.

Steps of the Hypothesis Test

1. State H0, H1.
2. Assume H0 is true.
3. Compute a relevant test statistic.

Definition
The test statistic (TS) is the sample quantity on which the decision to reject or not reject H0 is based.

We've seen some test statistics already (i.e. estimate µ with X̄, estimate p with X/n where X ∼ Bin(n, p), etc.).

4. Come to some conclusion based upon the value of the test statistic.

Definition
The rejection region (or critical region) is the set of values of the test statistic that leads to the rejection of the null hypothesis in favor of an alternative hypothesis.

More about conclusions...

The rejection region is just an extension of our idea regarding confidence intervals: if the confidence interval gives us the place where a parameter is likely to "live," then the rejection region is everywhere else on the real line. In the real world, though, we might only care about making a decision one way or another rather than writing out the entire interval of where a parameter could be. Instead of reporting confidence intervals/rejection regions, we instead talk about something called a p-value. Before we formally define the p-value, we need to discuss the basis by which we come to conclusions: everything we do is based on minimizing the probability that we make an error.
We first formally define the types of errors we can make and their associated probabilities.

Types of Errors

Think about the ways we can be right or wrong about anything:

                 Do not reject H0    Reject H0
    H0 true      correct             Type I error
    H0 false     Type II error       correct

Denote

    α = P(Type I error) = P(reject H0 | H0 true)
    β = P(Type II error) = P(don't reject H0 | H0 false)

Remember that our goal is to make decisions in a way that minimizes the probabilities of being wrong.

Minimizing Errors

Often we can't minimize both α and β at the same time (more on this later), so we minimize the more serious error. Consider the following example:

                              Do not reject H0    Reject H0
    H0: Parachute Broken      correct             Type I error
    Ha: Parachute Works       Type II error       correct

One of these mistakes has a worse consequence! The more serious error probability is given to α. When we start the experiment, we fix α to be a small value, and let β be whatever it ends up being. Our decision to reject H0 depends on whether or not our p-value is less than α.

Definition
The p-value is the probability of observing a test statistic value at least as extreme as the one computed from the sample data if H0 is true.

Example
Suppose H0: µ ≥ 5, Ha: µ < 5, x̄ = 3, s = 1, n = 30.

    p-value = P( X̄ < 3 | H0 true ) = P( X̄ < 3 | µ = 5 )
            = P( (X̄ − µ)/(σ/√n) < (3 − µ)/(σ/√n) | µ = 5 )
            = P( Z < (3 − 5)/(1/√30) )
            = P( Z < −10.95 ) ≈ 3.163 × 10⁻²⁸

Note: The inequality in the p-value calculation is the same one given in Ha.

How do we interpret the p-value? The probability of getting x̄ = 3 or less when µ = 5 is very small. This leads us to believe that the true value of µ may actually be a number less than 5. Therefore, we reject H0: µ ≥ 5 in favor of the alternative hypothesis.
Hypothesis Tests

Remember one very important fact when conducting a hypothesis test (quoted from the textbook):

"Note that not rejecting the hypothesis that µ = 2 is not the same as accepting the hypothesis that µ = 2. When we do not reject the hypothesis µ = 2, we are saying that 2 is a plausible value of µ, but there are other equally plausible values for µ. We cannot conclude that µ is equal to 2 and 2 alone."

So what's the cutoff? How small does the p-value have to be so that we are comfortable with rejecting H0?

Hypothesis Tests

Procedure: Fix α to be small (usually 0.05, sometimes 0.01 or 0.10). If p-value < α, reject H0.

A quick word on notation:
H0: µ = µ0 vs. Ha: µ ≠ µ0 is called a 2-sided test.
H0: µ ≤ µ0 vs. Ha: µ > µ0 is called a 1-sided test.

Earlier we saw an example of a 1-sided test. Let's see an example of a two-sided test.

Example
10.26) Yield stress measurements on 51 steel rods with 10 mm diameters gave a mean of 485 N/mm² and a standard deviation of 17.2. Suppose the manufacturer claims that the mean yield stress for these bars is 490. Does the sample information suggest rejecting the manufacturer's claim, at the 5% significance level?
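Before working through the by-hand steps, the numbers in Example 10.26 can be verified quickly. A sketch in Python with SciPy (an assumption on my part — the course uses normal tables, R, and MINITAB):

```python
from math import sqrt
from scipy.stats import norm

# Two-sided z test for Example 10.26: H0: mu = 490 vs Ha: mu != 490.
n, xbar, s, mu0, alpha = 51, 485.0, 17.2, 490.0, 0.05
z = (xbar - mu0) / (s / sqrt(n))
p_value = 2 * norm.cdf(-abs(z))    # double one tail for a two-sided test

print(z)                # about -2.076
print(p_value)          # about 0.0379
print(p_value < alpha)  # True, so H0 is rejected
```

The doubling of the tail area is what distinguishes the two-sided test from the one-sided example earlier.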
Step 0: Write down what we know: n = 51, x̄ = 485, s = 17.2, α = 0.05.

Step 1: State the null and alternative hypotheses:
H0: µ = 490    Ha: µ ≠ 490

Step 2: Assume H0 is true.
Step 3: Find the value of the test statistic:

z = (x̄ − µ)/(σ/√n) = (485 − 490)/(17.2/√51) = −2.075997

Step 4: Find the p-value:

p-value = 2 P(Z < −2.075997) = 2 × 0.01894713 = 0.0379

Step 5: Come to a conclusion: Since p-value < α = 0.05, we reject H0.

Some more terms...

Before we see a few more examples, let's define a few more terms.

Definition
The power of a statistical test is the probability of rejecting the null hypothesis when an alternative hypothesis is true.
Power = 1 − β = 1 − P(Type II error)    (44)

The power tells us how good our test is. If we want to compare our test to some other test with the same α level, then the power lets us know which test is better.

Earlier we mentioned that we usually can't decrease α and β at the same time: if we decrease one, the other increases. We can decrease both by increasing the sample size, though in many realistic situations (high cost of generating new samples, time to complete trials, etc.) this isn't possible.

End of Chapter 10 (for now)

Make sure you are comfortable with the examples we did in class. Your homework for chapter ten is the odd-numbered problems from sections 10.1, 10.2 and 10.3. Everything after this slide will not be on exam 2. We may come back to other sections of chapter 10 later on in the course if we have time.

Chapter 2

It is often the case that a scientist is interested in the relationship between two variables. Examples of questions include:
Is exposure to the sun related to skin cancer?
Is a person's height related to his/her weight?
Is the number of beers I drink the day before a test related to the score I receive?

We can apply some of the principles of probability and statistics to answer questions like these. For homework read sections 2.1 and 2.2.

Scatterplots

Before digging deeply into the statistics of determining relationships between variables, let's first look at a quick way to see how strongly variables are related.
"The simplest graphical tool used for detecting association between two variables is the scatterplot, which simply plots the ordered pairs of data points on a rectangular coordinate system."

Here are some more facts from section 2.3 about scatterplots:
If the plot has a roughly elliptical cloud shape, then it is reasonable to say a linear relationship exists.
If the ellipse tilts up and to the right, the association is positive. If the tilt is down and to the right, the association is negative.
If the ellipse is thin and long, the relationship is strong. Fat and round imply a weak relationship.

Examples

[Figure: four example scatterplots of (x1, y1) through (x4, y4).]
Top Left: No clear linear relationship.
Top Right: Positive linear relationship.
Bottom Left: Strong negative linear relationship.
Bottom Right: Negative linear relationship.

Terms explained

Positive and negative relationships determine the direction of the relationship and not the strength of the relationship. A positive relationship means that as one value increases (decreases), the other variable increases (decreases). A negative relationship means that as one value increases (decreases), the other variable decreases (increases). For homework read the rest of section 2.3.

Measuring Linear Relationships

Once we decide that it is reasonable to assume that a relationship is linear, it is usually a good idea to measure that relationship somehow. We talked briefly about the relationship between Random Variables in a previous slide, and on slide 110 we defined the notion of the correlation between two Random Variables:
ρ = Cov(X, Y) / √(Var(X) Var(Y)) = σ_XY / (σ_X σ_Y) = E[(X − EX)(Y − EY)] / (σ_X σ_Y)

Then a reasonable estimate of ρ is given by

ρ̂ ≡ r = (1/(n − 1)) Σᵢ ((xᵢ − x̄)/s_x) ((yᵢ − ȳ)/s_y)    (45)

More about r

r is called Pearson's correlation coefficient. It has the following properties:
−1 ≤ r ≤ 1.
A value of r near 0 implies little to no linear relationship between y and x. In contrast, the closer r is to 1 or −1, the stronger the linear relationship between y and x. If r = ±1, all points fall exactly on the line.
A positive value of r implies that y increases as x increases. A negative value of r implies that y decreases as x increases.

These situations are illustrated in section 2.4 of the book. Please read that section and do a few of the odd-numbered questions.

Modeling Linear Relationships

It's one thing to measure a linear relationship using Pearson's correlation coefficient; it's another thing entirely to accurately model the relationship. If two variables have a linear relationship then we should be able to model that relationship with the equation of a line. In general, a linear relation between two variables is given by the following equation (called a simple linear regression model):

y = β0 + β1 x    (46)

where
y is the response variable,
x is the explanatory variable or predictor variable,
β0 is the y-intercept: the value of y for x = 0,
β1 is the slope of the line: the amount of change in y for a unit change in the value of x.

Fitting the model

Fitting this "regression" line to a data set involves estimating the slope and intercept to produce a line that is denoted by ŷ = β̂0 + β̂1 x.
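Before turning to fitting, the sample correlation in (45) can be computed directly from its definition. A sketch in Python with NumPy (an assumption on my part — the course uses R and MINITAB), on made-up numbers:

```python
import numpy as np

# Sample correlation computed directly from definition (45), then checked
# against numpy's built-in corrcoef. The data here are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.1, 5.8])

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

print(r)                                        # close to 1: strong positive trend
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))   # True: definition matches built-in
```

The agreement with `np.corrcoef` is an identity, not a coincidence: both compute the same standardized cross-product.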
The question then becomes, "How do we go about constructing a line that best fits the data?" The real question is, "What do we mean by best?"

Fitting the model

There are many conceivable ways to define the "best" line. The one presented in this class is not the only one, but it is considered one of the most basic and straightforward. The line that we create to model the linear relationship is the one that minimizes what is known as the sum of squared errors (SSE).

Definition
SSE = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β̂0 − β̂1 xᵢ)²    (47)

We can think of SSE as a function of two variables: β̂0 and β̂1. The minimization problem is easily solved with calculus and linear algebra.

Fitting the model

I'll spare the gory details and just give the formula for the regression coefficients:

Definition
The least-squares regression line is ŷ = β̂0 + β̂1 x with

slope β̂1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²  and  y-intercept β̂0 = ȳ − β̂1 x̄.    (48)

Note: r = β̂1 s_x / s_y (proof on page 85).

By now it should be clear that these formulas can get tedious really quickly. Rest assured I won't ask you to do anything too taxing on a quiz or test.

Fitting the model

For homework you will definitely have to do some kind of model fitting, however. I suggest investing in a nice calculator or acquiring decent software, including one of the following:
Excel/Open Office (limited in functionality, low learning curve)
MATLAB (very good. I'm not sure if they offer cheap student licenses; slightly high learning curve.)
MiniTab (very user friendly, free 30-day license, very low learning curve, can only do basic analyses)
R (completely free, open source, available on all operating systems, can do everything from basic to complicated analyses, slightly high learning curve)

In class I'll be presenting solutions to problems in MiniTab (the book presents results with this) and R. I highly recommend you acquire one or both.

Fitting the model

Example (2.32)
Chapman and Demeritt reported diameters (in inches) and ages (in years) of oak trees. The data are shown (in the book).
a. Make a scatterplot. Is there any association between the age of the oak tree and the diameter? If yes, discuss the nature of the relation.
b. Can the diameter of an oak tree be useful for predicting the age of the tree? If yes, construct a model for the relationship. If no, discuss why not.
c. If the diameter of one oak tree is 5.5 inches, what do you predict the age of this tree to be?

Make a scatterplot. Is there any association between the age of the oak tree and the diameter?

[Figure: scatterplot of Age (4 to 42 years) versus Diameter (1 to 8 inches), showing a positive linear trend.]

Can the diameter of an oak tree be useful for predicting the age of the tree? If yes, construct a model for the relationship.

Let y represent the tree age and x represent the diameter. We'll use R to compute β̂0 and β̂1.

> age = c(4, 5, 8, 8, 8, 10, 10, 12, 13, 14, 16, 18, 20, 22, 23, 25, 28, 29, 30, 30, 33, 34, 35, 38, 38, 40, 42)
> diam = c(0.8, 0.8, 1, 2, 3, 2, 3.5, 4.9, 3.5, 2.5, 2.5, 4.6, 5.5, 5.8, 4.7, 6.5, 6, 4.5, 6, 7, 8, 6.5, 7, 5, 7, 7.5, 7.5)
> plot(diam, age, pch = 19, xlab = "Diameter", ylab = "Age")
> fit = lm(age ~ diam)
> fit$coeff
(Intercept)        diam
 -0.1882781   4.7618114

This output tells us β̂0 = −0.1882781 and β̂1 = 4.7618114.
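The same estimates can be reproduced outside R — a sketch in Python with NumPy (an assumption on my part; the course uses R and MINITAB), using the data from the R session above:

```python
import numpy as np

# Reproducing the oak-tree least-squares fit from the R session with numpy.
age = np.array([4, 5, 8, 8, 8, 10, 10, 12, 13, 14, 16, 18, 20, 22, 23, 25, 28,
                29, 30, 30, 33, 34, 35, 38, 38, 40, 42], dtype=float)
diam = np.array([0.8, 0.8, 1, 2, 3, 2, 3.5, 4.9, 3.5, 2.5, 2.5, 4.6, 5.5, 5.8,
                 4.7, 6.5, 6, 4.5, 6, 7, 8, 6.5, 7, 5, 7, 7.5, 7.5])

b1, b0 = np.polyfit(diam, age, 1)   # slope and intercept of the LS line
print(b0, b1)                       # about -0.1883 and 4.7618, matching lm()
print(b0 + b1 * 5.5)                # predicted age at diameter 5.5: about 26.0
```

The last line answers part (c): a tree with a 5.5-inch diameter is predicted to be about 26 years old.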
If the diameter of one oak tree is 5.5 inches, what do you predict the age of this tree to be?

The predicted value of y is ŷ. This is easily found using our fitted regression formula:

ŷ = β̂0 + β̂1 x = −0.1882781 + 4.7618114(5.5) = 26.00168

It's easy to find this in R using the following code:

> fit$coeff[[1]] + fit$coeff[[2]]*5.5
[1] 26.00168

We can easily plot the fitted line on top of the scatterplot using the following code:

plot(diam, age, pch = 19, xlab = "Diameter", ylab = "Age")
abline(fit$coeff[[1]], fit$coeff[[2]])

[Figure: scatterplot of Age versus Diameter with the fitted least-squares line overlaid.]

"How do we get the correlation in R?" To get the correlation in R we just type

> cor(age, diam, method="pearson")
[1] 0.8891367

Another way to determine how well x predicts Y

How do we know if our line "fits" our data? Do a "goodness-of-fit" test. Recall that r measures the strength of the linear relationship between X and Y; r² is also a good statistic for goodness of fit. Let's say that we wish to estimate some variable Y. We can do this by minimizing

total sum of squares = SStotal = SSyy = Σᵢ (yᵢ − ȳ)².    (49)

Similarly we could do this by minimizing

SSE = Σᵢ (yᵢ − ŷᵢ)².    (50)

We can say that our line is a good estimate if SSE is much smaller than SStotal. Then SStotal − SSE is a goodness-of-fit statistic. Problem: this difference has units of y². It would be nice if we had some kind of unitless measurement like we have with r.
Instead we'll use:

coefficient of determination = r² = (SStotal − SSE) / SStotal    (51)

Furthermore it can be shown that

Σᵢ (yᵢ − ȳ)² = Σᵢ (yᵢ − ŷᵢ)² + Σᵢ (ŷᵢ − ȳ)²    (52)

where Σᵢ (ŷᵢ − ȳ)² is referred to as the Sum of Squares for Regression (denoted by SSR).

To summarize:

Definition
The square of the coefficient of correlation (r) is called the coefficient of determination. It represents the proportion of the sum of squares of deviations of the y values about their mean that can be attributed to a linear relation between y and x.

r² = (SStotal − SSE)/SStotal = 1 − SSE/SStotal = SSR/SStotal    (53)

Note that while r ∈ [−1, 1], r² ∈ [0, 1].

Last time we saw an example of fitting the line with R. Let's see how it's done with MiniTab. Pay close attention... this part isn't in the slides!

Example
12 motors are operated under high-temperature conditions until engine failure.

Temp:  40  45  50  55  60  65  70  75  80  85  90
Hours: 851 635 764 708 469 661 586 371 337 245 129

Make a scatterplot of hours (y) vs. temp (x) and verify that a linear model is appropriate.
Compute the LS line.
Compute fitted values and residuals for each point.
If temp increased by 5 degrees, how much would you predict lifetime to increase or decrease?
Predict the lifetime for a temp of 73 degrees.
Should we estimate the LS line for temp = 120?
For what temp would you predict a lifetime of 500 hours?

Welcome back

Since the last time we saw slides, we saw lectures related to simple and multiple linear regression (chapters 2 and 11). Quiz 9 will be Wednesday and will cover only simple linear regression (sections 2.3-2.7 and 11.1-11.3). It is recommended that you all do the homework problems for those sections. We'll give a few more examples of multiple linear regression.
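Returning to the motor-lifetime example above: it can be checked numerically as well. A sketch in Python with NumPy (an assumption on my part — class solutions used MiniTab and R); the last two lines answer the 73-degree and 500-hour questions:

```python
import numpy as np

# Least-squares line for motor lifetime (hours) versus temperature (degrees),
# using the data from the table above.
temp = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90], dtype=float)
hours = np.array([851, 635, 764, 708, 469, 661, 586, 371, 337, 245, 129],
                 dtype=float)

b1, b0 = np.polyfit(temp, hours, 1)   # slope and intercept of the LS line
print(b1)               # about -12.74 hours of lifetime lost per degree
print(5 * b1)           # predicted change for a 5-degree increase: about -63.7
print(b0 + b1 * 73)     # predicted lifetime at 73 degrees: about 421 hours
print((500 - b0) / b1)  # temperature giving a predicted 500 hours: about 66.8
```

Note the extrapolation question (temp = 120) is deliberately not answered here: 120 lies well outside the observed range, so the fitted line should not be trusted there.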
Example 1

The article "Application of Analysis of Variance to Wet Clutch Engagement" (Mansouri, Khonsari, et al., Proceedings of the Institution of Mechanical Engineers, 2002:117-125) presents the following fitted model for predicting clutch engagement time in seconds (y) from engagement starting speed in m/s (x1), maximum drive torque in N·m (x2), system inertia in kg·m² (x3), and applied force rate in kN/s (x4):

ŷ = −0.83 + 0.017x1 + 0.0895x2 + 42.771x3 + 0.027x4 − 0.0043x2x4

The sum of squares for regression was SSR = 1.08613 and the sum of squares for error was SSE = 0.036310. There were 44 degrees of freedom for error.

Example 1 continued...

a. Predict the clutch engagement time when the starting speed is 20 m/s, the maximum drive torque is 17 N·m, the system inertia is 0.006 kg·m², and the applied force rate is 10 kN/s.

Solution:

ŷ = −0.83 + 0.017(20) + 0.0895(17) + 42.771(0.006) + 0.027(10) − 0.0043(17)(10) = 0.827126

Example 1 continued...

b. Is it possible to predict the change in engagement time associated with an increase of 2 m/s in starting speed?
If so, find the predicted change. If not, explain why not.

Solution: Just look at the coefficient next to x1. The coefficient represents the change in y per 1-unit change in x1, so we can predict the change in y to be 2 × 0.017 = 0.034 seconds.

c. Is it possible to predict the change in engagement time associated with an increase of 2 N·m in maximum drive torque? Why or why not?

Solution: No. Note that maximum drive torque (x2) is present in the model as both a main effect and an interaction term. Since we don't know the value of x4, we can't predict the change in y.

Example 1 continued...

d.
Compute the coefficient of determination R².

Solution: Recall the last sentence of the problem: the sum of squares for regression was SSR = 1.08613 and the sum of squares for error was SSE = 0.036310, with 44 degrees of freedom for error. Then

R² = SSR/SST = SSR/(SSR + SSE) = 1.08613/(1.08613 + 0.036310) = 0.9676508

e. Compute the F statistic for testing the null hypothesis that all the coefficients are equal to 0. Can this hypothesis be rejected?

Solution:

F = (SSR/p) / (SSE/(n − p − 1)) = (1.08613/5) / (0.036310/44) = 263.2317

The F table gives a critical value of around 3.4, so the p-value is much smaller than 0.05 and we reject the hypothesis.

Example 2

The following MINITAB output is for a multiple regression. Some of the numbers got smudged and are illegible.
Fill in the missing numbers.

Predictor   Coef     SE Coef   T       P
Constant    (a)      1.4553    5.91    0.000
X1          1.2127   (b)       1.71    0.118
X2          7.8369   3.2109    (c)     0.035
X3          (d)      0.8943    -3.56   0.050

S = 0.82936   R-Sq = 78.0%   R-Sq(adj) = 71.4%

Source           DF    SS       MS       F        P
Regression       (e)   (f)      8.1292   11.818   0.01
Residual Error   10    6.8784   (g)
Total            13    (h)

Example 3

A research article describes an experiment involving a chemical process designed to separate enantiomers. A model was fit to estimate the cycle time (y) in terms of the flow rate (x1), sample concentration (x2), and mobile-phase composition (x3). The results of a least-squares fit are presented in the following table.

Predictor   Coefficient   T         P
Constant    1.603
x1          -0.619        -22.289   0.000
x2          0.086         3.084     0.018
x3          0.306         11.011    0.000
x1²         0.272         8.542     0.000
x2²         0.057         1.802     0.115
x3²         0.105         3.300     0.013
x1x2        -0.022        -0.630    0.549
x1x3        -0.036        -1.004    0.349
x2x3        0.036         1.018     0.343

Example 3

Of the following, which is the best next step in the analysis?
i. Nothing needs to be done. This model is fine.
ii. Drop x1², x2², and x3² from the model, and then perform an F test.
iii. Drop x1x2, x1x3, and x2x3 from the model, and then perform an F test.
iv. Drop x1 and x1² from the model, and then perform an F test.
v. Add cubic terms x1³, x2³, and x3³ to the model to try to improve the fit.

Example 4

The following MINITAB output is for a best subsets regression involving five independent variables X1, ..., X5.

Vars   R-Sq   R-Sq(adj)   S
1      77.3   77.1        1.40510
1      10.2   9.3         2.79400
2      89.3   89.0        0.97126
2      77.8   77.3        1.39660
3      90.5   90.2        0.91630
3      89.4   89.1        0.96763
4      90.7   90.3        0.91446
4      90.6   90.2        0.91942
5      90.7   90.2        0.91895

(The output also includes columns of X marks showing which of X1-X5 enter each model; those marks are omitted here.)

Example 4

a.
Which variables are in the model selected by the adjusted R² criterion?
b. Are there any other good models?

Example 5

Suppose we try to fit the model equation

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x3x4 + β6x6 + β7x7 + ε

and we get the following fit from some statistical software:

ŷ = −0.257 + 0.778x1 − 0.105x2 + 1.213x3 − 0.00624x4 + 0.00386x3x4 − 0.00740x6 − 0.00148x7

Furthermore, we get the following output:

Predictor   Coef         SE Coef     T       P
Constant    0.2565       0.7602      0.34    0.736
x1          0.77818      0.05270     14.77   0.000
x2          -0.10479     0.03647     -2.87   0.005
x3          1.2128       0.4270      2.84    0.005
x4          -0.0062446   0.01351     -0.46   0.645
x3*x4       0.0038642    0.008414    0.46    0.647
x6          -0.007404    0.009313    -0.79   0.428
x7          -0.0014773   0.0005170   -2.86   0.005

S = 0.22189   R-Sq = 93.5%   R-Sq(adj) = 93.2%

(continued on next slide)

Example 5 continued...

Analysis of Variance

Source           DF    SS       MS         F        P
Regression       7     111.35   15.907     323.06   0.000
Residual Error   157   7.7302   0.049237
Total            164   119.08

Notice that x4, x3*x4, and x6 have large p-values. After dropping those terms from the model, the fit for the reduced model is

ŷ = −0.219 + 0.779x1 − 0.108x2 + 1.354x3 − 0.00134x7

and the MINITAB output is given as

Example 5 continued...

Predictor   Coef         SE Coef     T       P
Constant    -0.21947     0.4503      -0.49   0.627
x1          0.779        0.04909     15.87   0.000
x2          -0.10827     0.0352      -3.08   0.002
x3          1.3536       0.2880      4.70    0.000
x7          -0.0013431   0.0004722   -2.84   0.005

S = 0.22039   R-Sq = 93.5%   R-Sq(adj) = 93.3%

Analysis of Variance

Source           DF    SS       MS         F        P
Regression       4     111.31   27.827     572.90   0.000
Residual Error   160   7.7716   0.048572
Total            164   119.08

Example 5 continued...

a.
Compute the f statistic for testing the plausibility of the reduced model.
b. How many degrees of freedom does the F statistic have?
c. Find the P-value for the F statistic. Is the reduced model plausible?
d. Someone claims that since each of the variables being dropped had large P-values, the reduced model must be plausible, and it was not necessary to perform an F test. Is this correct? Explain why or why not.
e. The total sum of squares is the same in both models, even though the independent variables are different. Is there a mistake? Explain.

Model Selection

One systematic way to perform model selection is Best Subsets Regression. There are a few ways we can go about this, but the general strategy is the following:
Create the full model.
Generate all possible reduced models.
Choose a model based on goodness-of-fit statistics:
R² (always tells us to choose the full model)
Adjusted R² (penalizes for too many variables)
others (AIC, BIC, Mallow's Cp, etc.)

Specific Types of Variable Selection

Forward selection:
Start with the intercept-only model.
Add a single predictor variable. If the t test has p-value < α, continue adding variables.

Backward selection:
Start with the full model.
If one variable has p-value > α, delete it and fit the reduced model. Continue until all variables are significant.

Neither method is perfect. Some potential problems:
Sometimes the order in which variables are added in a forward selection process can affect the model obtained.
Sometimes adding a variable causes p-values of previous variables to become insignificant.

A compromise

Stepwise Regression:
(1) Choose "threshold p-values" αin and αout, usually with αin ≤ αout.
(2) Do a forward selection: choose the predictor variable with the smallest p-value, and add it to the model provided its p-value < αin.
(3) Do another forward selection: if a second variable has p-value < αin, add it to the model; otherwise stop.
(4) Backward selection: sometimes adding the 2nd variable increases the p-value of the 1st variable. If the p-value of the 1st variable > αout, remove it from the model.
(5) Return to (3). Continue until no variable outside the model would enter with p-value < αin and no variable in the model has p-value > αout.

Warning

Model selection procedures sometimes produce models that don't make sense. Example: the annual birthrate in Great Britain was almost perfectly correlated with annual production of pig iron* in the United States from 1875-1920.

*Pig iron is a byproduct of smelting iron ore with a type of coal.

Don't blatantly throw predictors into a model when they make no sense. If you aren't sure whether a relationship makes sense, redo the experiment to verify the results.

Example (Navidi, 2nd ed. pg 616)

In mobile ad hoc computer networks, messages must be forwarded from computer to computer until they reach their destinations. The data overhead is the number of bytes of information that must be transmitted along with the messages to get them to the right places. A successful protocol will generally have a low data overhead. A study is conducted on 25 simulated computer networks. The overhead, average speed, pause time, and link change rate (LCR) are recorded. The LCR for a given computer is the rate at which other computers in the network enter and leave the transmission range of the given computer.
To start, let's say we fit the following model (raw data not shown; see course website):

Overhead = β0 + β1 LCR + β2 Speed + β3 Pause + β4 Speed·Pause + β5 LCR² + β6 Speed² + β7 Pause² + ε

Fit the Model

We can fit the model in R with the following statements:

> mydata = read.table('/Users/robertgordon/Documents/table8-4.txt', sep=',', header=TRUE)
> attach(mydata)
> fit1 = lm(Overhead ~ LCR + Speed + Pause + I(Speed*Pause) + I(LCR^2) + I(Speed^2) + I(Pause^2))

We can see the results of fitting the model by typing

> summary(fit1)
We can find the p-value with R:

> pf(q=2.78, df1=3, df2=17, lower.tail=FALSE)
[1] 0.072727

Since this p-value is not small, the reduced model is plausible.

Can we do a Best Subsets procedure in R? Yes. Consider how many ways we can do this:

C(7,1) + C(7,2) + C(7,3) + C(7,4) + C(7,5) + C(7,6) + C(7,7) = 127. (why?)

That's a lot of models to go through. Let's automate this process. To do this in R we first need to install the 'leaps' package.

> install.packages('leaps')
> library(leaps)

Then use the regsubsets function (instead of lm) to fit the model:

> leaps <- regsubsets(Overhead ~ LCR + Speed + Pause + I(Speed*Pause) + I(LCR^2) + I(Speed^2) + I(Pause^2), data=mydata, nbest=2)

The nbest=2 option tells R to report only the 2 best models of each size. summary(leaps) prints which predictors appear in each of those models:

         LCR Speed Pause I(S*P) I(LCR^2) I(S^2) I(P^2)
1 ( 1 )  " " " "   " "   "*"    " "      " "    " "
1 ( 2 )  " " " "   "*"   " "    " "      " "    " "
2 ( 1 )  " " " "   "*"   "*"    " "      " "    " "
2 ( 2 )  " " "*"   "*"   " "    " "      " "    " "
3 ( 1 )  "*" " "   "*"   " "    "*"      " "    " "
3 ( 2 )  "*" "*"   " "   " "    " "      "*"    " "
4 ( 1 )  "*" "*"   "*"   " "    "*"      " "    " "
4 ( 2 )  "*" " "   "*"   " "    "*"      "*"    " "
5 ( 1 )  "*" "*"   "*"   " "    "*"      " "    "*"
5 ( 2 )  " " "*"   "*"   "*"    "*"      "*"    " "
6 ( 1 )  "*" "*"   "*"   " "    "*"      "*"    "*"
6 ( 2 )  "*" "*"   "*"   "*"    "*"      "*"    " "
7 ( 1 )  "*" "*"   "*"   "*"    "*"      "*"    "*"

The list of best subgroups is nice, but we can make a more informed decision if we know the R^2 value associated with each row. We can see this graphically with the following:

> par(mfrow=c(1,2))
> plot(leaps, scale="r2")
> library(car)
> subsets(leaps, statistic="rsq")

Note: par tells R to partition a graph into 1 row and 2 columns; library tells R to access the car package.
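The count of 127 candidate models quoted above follows because each of the 7 candidate terms is either in or out of the model; excluding the empty model leaves 2^7 − 1 subsets. A one-line check in R:

```r
# Number of non-empty subsets of 7 candidate predictors:
# sum of binomial coefficients C(7,1) + ... + C(7,7), i.e. 2^7 - 1.
sum(choose(7, 1:7))   # 127
2^7 - 1               # 127
```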
[Figure: two-panel plot. Left: plot(leaps, scale="r2"), a shaded grid showing which of the terms (Intercept, LCR, Speed, Pause, I(Speed * Pause), I(LCR^2), I(Speed^2), I(Pause^2)) appear in each best model, ordered by R^2 from 0.55 (Pause alone) up to 0.97 (full model). Right: subsets(leaps, statistic="rsq"), R^2 plotted against subset size 1-7; the best four-term model already reaches R^2 of about 0.96-0.97.]

Rob Gordon (University of Florida) STA 3032 (7661) Fall 2011 251 / 251
This note was uploaded on 12/13/2011 for the course STA 3032 taught by Professor Kyung during the Fall '08 term at University of Florida.
