1/4/11
PADP 8120: Data Analysis and Statistical Modeling
Probability Distributions
Spring 2011
Angela Fertig, Ph.D.

Plan
Last time we covered the mean, median, and standard deviation.
Today we will learn about probability distributions, and the normal distribution in particular.
Then we'll do some simulations.

Why do we need to know about probability distributions?
We are interested in how likely it is that our sample is similar to the population.
To make a judgment, we need to think about probability distributions.

Probability
The probability of an outcome is the proportion of times that the outcome would occur in a long run of repeated observations.
Example: Imagine tossing a coin with heads and tails.
If we flip the coin lots of times, then the number of heads is likely to be similar to the number of tails.
Thus, the probability of a coin landing heads on any one flip is 0.5.

Probability Distribution
A probability distribution lists the possible outcomes and their probabilities.
The mean of a probability distribution is:
$\mu = \sum y P(y)$ if y is discrete
$\mu = \int y P(y)\,dy$ if y is continuous.
Also called the expected value: E(y) = the sum, over all outcomes, of probability * payoff.
The standard deviation of a probability distribution is $\sigma$ (larger $\sigma$ = more spread-out distribution).

Probability Distribution Example
Variable: hours per week spent working by students
Mean = 20, Std Dev = 5
Distribution: Assume that the probability of a student working between 0 and 10 hours is 2.5%.
The area between 0 and 10 is 2.5 percent of the total area beneath the curve.
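As a rough check on the 2.5% figure, the sketch below assumes the work-hours variable is normally distributed with the stated mean and standard deviation. That normal shape is an assumption made for the calculation, and the names `hours`, `p_below_10`, and `p_0_to_10` are illustrative only. It uses Python's standard-library `statistics.NormalDist` to compute the area under the curve.

```python
from statistics import NormalDist

# Assumption: weekly work hours follow a normal distribution with the
# mean (20) and standard deviation (5) given in the example.
hours = NormalDist(mu=20, sigma=5)

# Area under the curve to the left of 10 hours.
p_below_10 = hours.cdf(10)

# Area between 0 and 10 hours; the area below 0 is negligible here.
p_0_to_10 = hours.cdf(10) - hours.cdf(0)

print(f"P(Y < 10)      = {p_below_10:.4f}")
print(f"P(0 <= Y < 10) = {p_0_to_10:.4f}")
```

Under that assumption the area comes out near 2.3%, in line with the rounded 2.5% used in the example (10 hours is 2 standard deviations below the mean).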
There is a 2.5% probability that a randomly picked student worked fewer than 10 hours in a week.

Normal distribution
Normal distributions:
  Have a bell shape (the student work hours distribution is normal)
  Are symmetrical
  Follow the empirical rule: the probability of falling within z standard deviations of the mean is:
    68% if z = 1
    95% if z = 2 (actually 1.96)
    Almost 100% if z = 3

What is Z?
The Z-score for a value Y on a variable is the number of standard deviations that Y falls from the mean $\mu$:
$z = \frac{Y - \mu}{\sigma}$
We can use Z-scores to determine the probability in the tail of a normal distribution that is beyond a number Y.
For any value of z, there is a corresponding probability.

Why is the normal distribution so important?
The normal distribution is relevant to us because the sampling distribution of the sample mean (and of many other statistics) is approximately normal when the sample is large.
If we took lots of samples from a population, we would have a distribution of sample means, or the sampling distribution.
Because of averaging, the sample mean does not vary as widely as the individual observations.
Moreover, if we took lots of samples, then the distribution of the sample means would be centered around the population mean.
[Figure: population distribution and sampling distribution, with the mean of the population and the mean of all sample means marked.]

2 Very Important Things
If we have lots of sample means, then their average will be the same as the population mean. The sample mean is an unbiased estimator of the population mean.
As the sample size increases, the sampling distribution looks more and more like a normal distribution. This is called the central limit theorem. This is true regardless of the shape of the population distribution.

Sampling distribution is approximately normal even if the population distribution is crazy
[Figure: an irregular population distribution and its sampling distribution, with the mean of the population and the mean of all sample means marked.]

Dispersion of the sampling distribution
Sampling distributions that are tightly clustered will give us a more accurate estimate on average than those that are more dispersed.
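The claims above about the sampling distribution can be explored with a small simulation. This is a minimal sketch, not taken from the slides: the skewed exponential population, the seed, the sample sizes, and the helper names `sample_means` and `skewness` are all assumptions chosen for illustration. It draws many samples of a given size, records each sample mean, and summarizes the center, spread, and asymmetry of those means.

```python
import random
import statistics

random.seed(8120)  # arbitrary seed so the run is repeatable

# Hypothetical, strongly right-skewed population (exponential, mean about 10).
population = [random.expovariate(1 / 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_means(n, draws=2_000):
    """Return the means of `draws` random samples (with replacement) of size n."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(draws)]


def skewness(values):
    """Rough skewness measure: near 0 for a symmetric shape such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


results = {}
for n in (5, 30, 200):
    means = sample_means(n)
    results[n] = (statistics.fmean(means), statistics.stdev(means), skewness(means))
    centre, spread, skew = results[n]
    print(f"n={n:>3}  mean of means={centre:6.2f}  spread={spread:5.2f}  skewness={skew:5.2f}")

print(f"population mean={pop_mean:.2f}  sd={pop_sd:.2f}  skewness={skewness(population):.2f}")
```

If the claims hold, the mean of the sample means should stay close to the population mean at every n, while the spread and the skewness should both shrink as n grows.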
We need to estimate the dispersion of our sampling distribution so that we know how good our statistic is.
The standard error is the standard deviation of the sampling distribution.

Standard error
Standard deviation ($\bar{X}$) = $\frac{\sigma}{\sqrt{n}}$
We don't know what $\sigma$ is, but we do know s.
Standard error ($\bar{X}$) = $\frac{s}{\sqrt{n}}$
where
  $\sigma$ = population standard deviation
  s = sample standard deviation
  $\bar{X}$ = sample mean
  n = sample size
The standard error is an estimate of how far any sample mean "typically" deviates from the population mean.

Example
I want to know the average SAT score among all high school students in Georgia (the population).
I can't get everyone's scores, so I have to take a sample.
I randomly choose 5 high schools in GA and get the average SAT score for each school: 1400, 1550, 1500, 1450, and 1490.
Sample mean ($\bar{X}$) = ___
Sample standard deviation (s) = $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$ = ___
Sample size (n) = ___
Standard error ($\bar{X}$) = $\frac{s}{\sqrt{n}}$ = ___

2 More Very Important Things
The formula for the standard error means that:
  As the sample size (n) increases, the sampling distribution becomes tighter.
  As the distribution of the population becomes tighter, the sampling distribution also becomes tighter.

Binary variables (see Ch. 5)
Here are the formulas for binary variables, where the mean is just the proportion:
[Formulas not preserved in this text.]

Recap
Population distribution: We don't know this, but we want to know about it (e.g. the mean).
Sample distribution: We know this, and calculate statistics such as the sample mean and the sample standard deviation from it.
Sampling distribution: This describes the variability in the value of the sample means among all of the possible samples of a certain size. We can work this out from information about the sample distribution and the fact that the sampling distribution is normal if the sample size is large.
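Returning to the SAT example, the quantities it asks for can be worked out in a few lines. This is a hedged sketch rather than the course's own solution: the formula for s in the example appears to divide by n, while many textbooks divide by n - 1, so both versions are computed and labeled. The variable names are illustrative.

```python
import math
import statistics

# School-average SAT scores from the example (5 randomly chosen schools).
scores = [1400, 1550, 1500, 1450, 1490]
n = len(scores)

sample_mean = statistics.fmean(scores)

# Assumption: the example's formula divides the sum of squares by n,
# which is what statistics.pstdev computes.
s_div_n = statistics.pstdev(scores)
# Common alternative: divide by n - 1, which is what statistics.stdev computes.
s_div_n_minus_1 = statistics.stdev(scores)

# Standard error of the sample mean: s / sqrt(n).
se_div_n = s_div_n / math.sqrt(n)
se_div_n_minus_1 = s_div_n_minus_1 / math.sqrt(n)

print(f"n = {n}, sample mean = {sample_mean:.0f}")
print(f"divide by n:     s = {s_div_n:.2f}, standard error = {se_div_n:.2f}")
print(f"divide by n - 1: s = {s_div_n_minus_1:.2f}, standard error = {se_div_n_minus_1:.2f}")
```

With the n divisor this gives a sample mean of 1478, s of about 50.36, and a standard error of about 22.52. With the n - 1 divisor, s is about 56.30 and the standard error about 25.18.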