day10 - 4/6/11 PADP 8120: Data Analysis and Statistical Modeling

PADP 8120: Data Analysis and Statistical Modeling
Regression Diagnostics
Spring 2011
Angela Fertig, Ph.D.

Plan

Last time: We extended the OLS model to include:
- Multiple independent variables
- Categorical independent variables
- Interactions between independent variables

This time: We will look at what happens when some of the underlying assumptions behind linear regression are not met:
- Non-linear relationships
- Different amounts of variation at different levels of the independent variable (heteroskedasticity)
- Outliers
- Independent variables that are highly correlated (multicollinearity)

Non-linear relationships

OLS assumes a linear relationship. The model won't do a good job of predicting the relationships if there are non-linearities.
1. How do we tell if there are non-linearities?
2. What do we do if there are?

How do we tell if there are non-linearities?

Graph the residuals to see if there is any structure to the variation that you can't explain. The deviation of an observation from our prediction (the e in our equation) is called the residual:

    y = a + b1 X1 + b2 X2 + e

Residuals should be randomly distributed around zero (they are the inherent variation that we can't predict).

Graph of residuals (y-axis) on a key independent variable (x-axis)

A random scatter of residuals, with no apparent pattern, is generally what we want to see.

Example

Say I was interested in what predicts adult health status. My hypotheses are that:
- People with higher birth weights have better health as adults.
- Women have worse self-reported health than men.

[Graph omitted: self-reported health status, averaged over 3 time periods.]

Multiple Regression

      Source |       SS       df       MS              Number of obs =    7856
-------------+------------------------------           F(  2,  7853) =   38.86
       Model |  20.9664036     2  10.4832018           Prob > F      =  0.0000
    Residual |  2118.51576  7853  .269771521           R-squared     =  0.0098
-------------+------------------------------           Adj R-squared =  0.0095
       Total |  2139.48216  7855  .272372013           Root MSE      =   .5194

------------------------------------------------------------------------------
    hstatavg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bwt |  -.0000319   .0000115    -2.78   0.005    -.0000545   -9.44e-06
      female |   .0933943   .0118612     7.87   0.000     .0701432    .1166454
       _cons |   1.721228   .0401889    42.83   0.000     1.642447    1.800009
------------------------------------------------------------------------------

Before we send these results to a journal, we should examine plots of the residuals against each X variable to see if there is any structure. Each residual is the actual value of Y minus the predicted value of Y (Y-hat).
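As a rough illustration of that check, the residuals can be computed and plotted in Stata along these lines (a minimal sketch; the residual variable name ehat is ours, and it assumes the model above has just been fit):

    . reg hstatavg bwt female
    . predict ehat, residuals        // ehat = actual Y minus predicted Y (Y-hat)
    . scatter ehat bwt, yline(0)     // look for structure around the zero line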
Graph of the residuals

[Residual plot omitted.]

What do we do?

3 main options:
1. Include X (birth weight in this case) and X^2 in our specification.
2. "Transform" X (normally by taking the log of X).
3. Use a different kind of regression that lets you fit a squiggly line to the data (not covered in this class).

Squared terms

The most common way to take account of non-linearity is to use polynomial regression functions:

    y = a + b1 X1 + b2 X1^2 + b3 X2 + e

Depending on the value of b2, the function will be convex (if b2 > 0) or concave (if b2 < 0). Using squared terms is particularly useful when the relationship "goes up and down" or "goes down and up".

Back to our example

. gen bwt2=bwt*bwt
. reg hstatavg bwt bwt2 female

      Source |       SS       df       MS              Number of obs =    7856
-------------+------------------------------           F(  3,  7852) =   28.28
       Model |  22.8722879     3  7.62409597           Prob > F      =  0.0000
    Residual |  2116.60987  7852  .269563152           R-squared     =  0.0107
-------------+------------------------------           Adj R-squared =  0.0103
       Total |  2139.48216  7855  .272372013           Root MSE      =  .51919

------------------------------------------------------------------------------
    hstatavg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bwt |  -.0002721    .000091    -2.99   0.003    -.0004505   -.0000936
        bwt2 |   3.61e-08   1.36e-08     2.66   0.008     9.49e-09    6.27e-08
      female |   .0939249   .0118583     7.92   0.000     .0706795    .1171704
       _cons |   2.110489   .1518059    13.90   0.000     1.812909    2.408069
------------------------------------------------------------------------------

Better fit than before.

Log transformation

Using the logged value of X instead of X as our independent variable is an alternate way of dealing with non-linearities. This is especially useful when we think that the relationship looks like an exponential function.

[Figures omitted: two curved, exponential-looking relationships between X and Y.]

It doesn't work as well as the polynomial in this example:

. gen lbwt=log(bwt)
. reg hstatavg lbwt female

      Source |       SS       df       MS              Number of obs =    7856
-------------+------------------------------           F(  2,  7853) =   39.77
       Model |  21.4551965     2  10.7275983           Prob > F      =  0.0000
    Residual |  2118.02696  7853  .269709278           R-squared     =  0.0100
-------------+------------------------------           Adj R-squared =  0.0098
       Total |  2139.48216  7855  .272372013           Root MSE      =  .51934

------------------------------------------------------------------------------
    hstatavg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lbwt |  -.1129419   .0365409    -3.09   0.002    -.1845718    -.041312
      female |    .093054   .0118511     7.85   0.000     .0698227    .1162852
       _cons |   2.529657   .2970308     8.52   0.000     1.947398    3.111916
------------------------------------------------------------------------------

Worse fit than the polynomial.

More about logs

How the log transformation works: on a log scale, the distance between 1 and 2 is the same as the distance between 2 and 4, and the distance between 4 and 8, etc. Thus, taking the log means that high values will be more bunched together and low values will be more spread apart. This means that in some cases, if the relationship between X and Y is curved, then the relationship between log(X) and Y will be close to linear. Note that you need to be careful when interpreting coefficients on logged variables.

Next problem: Heteroskedasticity

This just means that the values of Y are more variable at some levels of X compared to others. If these differences are big, they violate one of the assumptions we made behind OLS regression.

Remember this graph:

[Figure omitted: test score (y-axis, 50 to 110) plotted against poverty rate (x-axis, 0 to 20), with the distribution of Y given X shown at X = 4, 8, and 12. The regression line for the population goes through the mean of Y for each value of X.]

How can we tell if we have this problem?

Graph the residuals.

[Figure omitted: two residual plots, one where the residuals have 'low' variation and one where they have 'high' variation.]

What do we do?

If the residuals are more spread out for some values of X than others, then the standard errors that we calculated will not be correct. We need to calculate robust standard errors, which compensate for heteroskedasticity. They are generally larger than normal standard errors. If you use normal standard errors, then you may have results that look like they are statistically significant, but really aren't.
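In Stata, robust standard errors are requested with the robust option. A minimal sketch using the running example:

    . reg hstatavg bwt female, robust   // heteroskedasticity-robust standard errors

The point estimates are unchanged; only the standard errors (and hence the t-statistics and confidence intervals) are recalculated.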
Next problem: Outliers

Non-linearities and heteroskedasticity are two examples of structure that remains after we have fitted our model and that we want to explain. But often we're also interested in particular individual observations that don't fit our predictions. These "outlying" observations can sometimes radically change our regression results. They can also help us to think about other independent variables that may be important.

Example

Say we are interested in voter turnout around the world (what percentage of people vote). A reasonable hypothesis is that a more competitive party system leads to more people bothering to vote. So let's model turnout using competitiveness as an independent variable in a regression.

Outliers

[Figure omitted: scatterplot of turnout against competitiveness with the fitted regression line; Belgium and Australia sit far from the line.]

But Belgium and Australia have compulsory voting, so we need to include that in the model.

What if you can't explain the outliers?

Check the data to make sure it isn't an error. Then assess whether it matters:
1. How big is its residual? Standardize the residuals (called studentized residuals) by dividing each residual by the standard deviation we would expect from normal sampling variability. This is like a z-statistic, so only about 5% of values should be above 1.96 or below -1.96, and you can work out how outlying the outliers are.
2. Does it affect the estimated coefficients? What is the leverage of the observation?

Leverage

Outlying observations far from the mean make more difference to the regression line. To detect influential observations, we calculate a diagnostic called DFBETA, which tells us the effect of removing the observation on each parameter estimate in the model. The DFBETA for a parameter is high (more than 1) when the observation has a big residual and a lot of leverage.
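Both diagnostics are available after regress in Stata. A minimal sketch using the health example (the variable names rstu and dfb_bwt are ours):

    . reg hstatavg bwt female
    . predict rstu, rstudent                 // studentized residuals
    . list bwt female rstu if abs(rstu) > 1.96 & !missing(rstu)
    . predict dfb_bwt, dfbeta(bwt)           // DFBETA for the bwt coefficient
    . list bwt dfb_bwt if abs(dfb_bwt) > 1   // flag influential observations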
Should we delete the outlier?

Generally not a good idea unless you have some reason (the data point is a typo, there is a missing variable, etc.); it is a real observation after all. If the "interesting" relationships are dependent on the outlier, then you need to be cautious when interpreting them.

Finally: Multicollinearity

This just means that the independent variables are closely related to each other. When one of them increases, the others increase as well, making it difficult to work out the separate effect of each predictor. This is common in social science because our variables often "overlap" a lot.

Example

When you ask people about their various attitudes, many attitudes are highly correlated with each other (e.g. defense and foreign policy). The correlation between our two independent variables is 0.98: when the foreign policy attitude goes up, the defense attitude goes up. We can't work out what happens when foreign policy goes up and defense stays the same.

How to tell?

- If SEs are huge, check the correlation between the independent variables.
- Calculate the variance inflation factor; a variable with VIF > 10 may be a linear combination of other independent variables (see the sketch below).

What to do?

- Drop one of the highly correlated independent variables (this makes sense when the causal relationships are clear).
- Make a scale out of the highly correlated independent variables (this makes sense when there is an underlying variable that we haven't/can't measure).
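Both checks are short in Stata. A minimal sketch, with hypothetical variable names (support, defense, foreign) standing in for the attitude measures:

    . corr defense foreign         // pairwise correlation of the predictors
    . reg support defense foreign
    . estat vif                    // variance inflation factors; VIF > 10 is a warning sign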
