3/8/11

PADP 8120: Data Analysis and Statistical Modeling
Ordinary Least Squares
Spring 2011
Angela Fertig, Ph.D.

Plan

Previously: We've been discussing descriptive inferences: comparing means, determining whether variables were associated with each other, etc.

This time: We are moving toward causal inferences.

Introduction to Linear Regression

Example for today

Say we could run this experiment:
We have 8 low-income families, each with one daughter, age 10, who scored poorly on a standardized school test.
We move these 8 families to different neighborhoods with different poverty rates, and after a year, have the girls take the test again.

We have 2 variables:
Test score 1 year after the move. This is the dependent variable. This is what we are interested in predicting.
Neighborhood poverty rate. This is the independent variable. This is what we think predicts the dependent variable.

Note that we think we have a clear causal "story". We change the neighborhood poverty rate and the girls' school performance changes. Generally, causality can be more difficult to ascertain.

Here's our data

Girl        Poverty rate   Test score
Ava          4%            85
Bella        6%            80
Clara        8%            83
Dolores     10%            75
Evie        12%            60
Fern        14%            70
Gabbie      16%            55
Hermione    18%            50

It appears that higher poverty rates result in lower test scores.

Scatterplot

[Figure: scatterplot of test score (vertical axis, 50 to 90) against poverty rate (horizontal axis, 0 to 20); the point for Evie is labeled.]

I fit a line "by eye" for now by trying to minimize the differences between each point and the line.

Equation of a line

There are 2 components of a line:

$Y = \alpha + \beta X$

The slope ($\beta$) describes how many units y changes when x increases by one unit.
The intercept ($\alpha$) describes how many units of y there are when x is 0.

[Figure: the by-eye line, test score (50 to 110) against poverty rate (0 to 20), with the rise and run marked.]

Rise: $70 - 80 = -10$. Run: $10 - 6 = 4$.
Slope = rise/run = $-10/4 = -2.5$
The intercept is 98.
$\text{Test score} = 98 - 2.5 \times \text{poverty rate}$

How to rigorously fit a line (not by eye): Least Squares

We want to minimize deviations between the data points and the line, but some points are higher and some are lower, so we minimize the sum of the squared deviations from the line.
The line that best manages this is called the Ordinary Least Squares (OLS) line.
Here are the formulas for calculating the slope and intercept of this line.

$\beta = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

$\alpha = \bar{Y} - \beta \bar{X}$

Let's calculate it!

Y        X       X-mean(X)   Y-mean(Y)   (X-mean(X))(Y-mean(Y))   (X-mean(X))^2
85        4      -7           15.25      -106.75                   49
80        6      -5           10.25       -51.25                   25
83        8      -3           13.25       -39.75                    9
75       10      -1            5.25        -5.25                    1
60       12       1           -9.75        -9.75                    1
70       14       3            0.25         0.75                    9
55       16       5          -14.75       -73.75                   25
50       18       7          -19.75      -138.25                   49
(mean)   (mean)  (mean)      (mean)      (sum)                     (sum)
69.75    11       0            0         -424                      168

$\beta = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{-424}{168} = -2.52$

$\alpha = \bar{Y} - \beta \bar{X} = 69.75 + 2.52 \times 11 = 97.5$

Our eyeballing was close!

Is this relationship statistically significant?

The line we generated is based on a sample. Does it represent a true relationship for the population? If we moved low-income families to low-poverty neighborhoods, would girls' school performance improve?
We want to test the null hypothesis that $\beta$ is zero in the population using our sample. If $\beta = 0$, then the line is horizontal (increases in X do not affect Y).

Some notation

Let's call the population coefficients $\beta$ and $\alpha$.
Let's call the coefficients from the sample that we have b and a.
We want to estimate the true regression coefficients $\beta$ and $\alpha$, but all we have is a sample of the population giving us a line with coefficients b and a.

An error term

We need to add an error term to our equation for the line because:
Our measures aren't perfect (test scores aren't a great measure of school performance).
There is variability across people (some girls will be better test takers than others, some will be sick on the test day, etc.).

Population: $Y = \alpha + \beta X + \varepsilon$
Sample: $Y = a + bX + e$

Best way to think about error term

There's a distribution of Y values for each value of X. There is a range of test scores among girls from low-income households in a neighborhood with poverty rate = x%.
The true regression line will go through the mean of each of these distributions for all of the X values (what we call the conditional mean).

Graphically

[Figure: distributions of Y given X=4, X=8, and X=12, drawn on axes of test score against poverty rate, with the mean of Y given X=4 marked. The regression line for the population goes through the mean of Y for each value of X.]
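The hand calculation in "Let's calculate it!" can be reproduced in a few lines of Python. This is only a sketch: the variable names (poverty, score, sxy, sxx) are our own and do not come from the lecture, and the data are the eight rows of the example table.

```python
# Data from the example table: poverty rate (%) and test score one year after the move.
poverty = [4, 6, 8, 10, 12, 14, 16, 18]
score = [85, 80, 83, 75, 60, 70, 55, 50]

n = len(poverty)
x_bar = sum(poverty) / n  # mean of X
y_bar = sum(score) / n    # mean of Y

# Numerator and denominator of the least squares slope formula.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(poverty, score))
sxx = sum((x - x_bar) ** 2 for x in poverty)

slope = sxy / sxx                  # sum of cross-products / sum of squared X deviations
intercept = y_bar - slope * x_bar  # mean(Y) - slope * mean(X)

print(f"sum of cross-products = {sxy:.2f}, sum of squares = {sxx:.2f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The two sums should match the "sum" entries of the table, and the slope and intercept should match the rounded values computed by hand.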
Graphically

[Figure: the same distributions of Y given X=4, X=8, and X=12, on axes of test score against poverty rate, now showing both the estimated regression line and the true regression line.]

e: Difference between observed and predicted (for estimated regression).
$\varepsilon$: Difference between observed and predicted (for true regression).

Assumptions

There are some important assumptions underlying all of this.
The distribution of values of Y at each value of X is normal. (Satisfied if the sample is large.)
The spread of the distribution of values of Y at each value of X is the same.
The true relationship between the variables in the population is linear.
The sample is random.

Now back to significance testing

If we took lots of separate samples and then calculated lots of separate regression lines, we would get a distribution of slope coefficients b.
The sampling distribution of b is normal if the sample size is large, and the mean of all the possible b's is $\beta$.
The formula for the standard error of b is:

$\text{Standard error of } b = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}$, where $s = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$ and $\hat{Y} = a + bX$.

The SE depends on the variability of the Y observations around our estimated line (s) and the spread of the X observations (denominator). When the X's are spread out, the SE is smaller. When the observations are bunched together, it is hard to get a precise estimate of the slope based on the small bit of line.

Confidence Interval for b
With 95% confidence:

$\beta = b \pm 1.96 \times \text{SE of } b$

Example: If b = 0.5 and SE of b = 0.2, then

$\beta = 0.5 \pm 1.96 \times 0.20 = 0.5 \pm 0.39$

So, with 95% confidence, the population slope $\beta$ lies between 0.11 and 0.89.

Hypothesis testing

We usually want to test whether $\beta = 0$ in the population.
So, we calculate the z-statistic (how many SEs from zero is b?) and then get the p-value.
In this example, b is 2.5 SEs more than 0 (0.5/0.2 = 2.5), and the probability of an estimate at least this far from zero when $\beta = 0$ is 0.012 (or 1.2%).
So, we can reject the null hypothesis that $\beta = 0$. That is, the effect is significantly different from zero.

Graphically

[Figure: distributions of Y given X=25 and X=45 under H0, showing our regression line and the null hypothesis line.]

Correlation

b tells us whether the relationship between the dependent and independent variables is statistically significantly different from zero, and the direction of the relationship. But because its value depends on the units of measurement used, it is not great at telling us the strength of the relationship.
For this, we often compute the correlation (r), which is unit-free (doesn't depend on the units of measurement for X and Y).

$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$

Unlike b, the correlation doesn't make a distinction between the dependent and the independent variables.

Interpreting r

If X tends to be above mean(X) when Y is above mean(Y), r is positive. If X decreases when Y increases, r is negative.
The denominator standardizes r so it runs from $-1$ to $+1$.
$r = 1$ tells us the variables are perfectly positively correlated.
$r = -1$ tells us the variables are perfectly negatively correlated.
$r = 0$ tells us there is no (linear) relationship between the variables.

...
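As a rough check on the standard error, confidence interval, z-statistic, and correlation formulas, the sketch below applies them to the eight-girl example. The lecture does not report these quantities for the example data, so treat the computed values as an illustration rather than as results from the slides. All function and variable names (ols_summary, two_sided_p, and so on) are hypothetical, and the p-value uses the large-sample normal approximation from the lecture even though n = 8 is small.

```python
import math

# Data from the example table: poverty rate (%) and test score.
poverty = [4, 6, 8, 10, 12, 14, 16, 18]
score = [85, 80, 83, 75, 60, 70, 55, 50]


def ols_summary(x, y):
    """Return slope b, intercept a, SE of b, and correlation r (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx
    a = y_bar - b * x_bar

    # s: spread of the observations around the estimated line, with n - 2 in the denominator.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    se_b = s / math.sqrt(sxx)
    r = sxy / math.sqrt(sxx * syy)
    return {"a": a, "b": b, "se_b": se_b, "r": r}


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


fit = ols_summary(poverty, score)
z = fit["b"] / fit["se_b"]  # how many SEs b is from zero
ci_low = fit["b"] - 1.96 * fit["se_b"]
ci_high = fit["b"] + 1.96 * fit["se_b"]

print(f"b = {fit['b']:.3f}, SE of b = {fit['se_b']:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.2g}")
print(f"r = {fit['r']:.3f}")

# The lecture's own example: b = 0.5 and SE of b = 0.2 give z = 2.5.
print(f"p for z = 2.5: {two_sided_p(0.5 / 0.2):.3f}")
```

The last line reproduces the slide example (z = 2.5), which is a convenient way to confirm that the p-value function behaves as the lecture describes before trusting it on other inputs.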