### Lecture 15 Part 2_Leverage etc

Course: STAT 102, Spring 2012
School: UPenn

#### Lecture 15, Part 2: Leverage and Influence (STAT 102)

- Outliers, leverage and influence in simple linear regression (review)
- Outliers and influential observations in multiple linear regression; leverage plots

Notes re the text: This topic is covered in Section 6.7. The emphasis there is on the use of quantitative measures like DFITS and Cook's D to identify leverage. We do not recommend these methods. Instead, these overheads suggest a more graphical perspective based on leverage plots.

#### Outliers and influential points in simple regression

Does the age at which a child begins to talk predict a score on a test of mental ability at a later age?

gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive score (y), an ability test taken at a later age.

- Child 18 is an outlier in the x direction, so it is a leverage point and potentially influential.
- Child 19 is a regression outlier.

[Figure: scatterplot of Score vs. Age, with children 18 and 19 labeled.]

#### Outliers in Simple Linear Regression

Three types of outliers in scatterplots:

- Outlier in the x direction
- Outlier in the y direction
- Outlier from the regression line of the scatterplot (residual has large magnitude)

Several possibilities need to be investigated when an outlier is observed:

- There was an error in recording the value.
- The point is not representative of the population of interest.
- The observation is valid.

Identify regression outliers from the scatterplot and the residual plot.

#### Leverage and Influential Points

- An observation has high leverage if it is an outlier in the x direction.
- An observation is influential if removing it would markedly change the slope of the least squares line.
- Observations that have high leverage and moderate to large residuals tend to be influential.
- Observations with little or no leverage ($x \approx \bar{x}$) cannot be influential.

#### Outliers and influential points in simple linear regression

To assess whether a point is influential, fit the least squares line with and without the point and see how much of a difference it makes. [In JMP, exclude the corresponding row and re-fit the line.]

Child 18 has high leverage and turns out to be influential; Child 19 has low leverage and hence turns out not to be influential.

[Figure: scatterplot of Score vs. Age with three fitted lines (full data, without #18, without #19), and a residual plot of Score Residual vs. Score Predicted for the full data. The residual plot shows that #19 may be a regression outlier.]

| Data | RSquare | Fitted line | Comment |
|---|---|---|---|
| Full data | 0.41 | Score = 109.87 - 1.127 Age | |
| Without #18 | 0.11 | Score = 105.63 - 0.779 Age | #18: high leverage and influential |
| Without #19 | 0.57 | Score = 109.30 - 1.193 Age | #19: possible outlier, but not influential |

#### Bivariate Fit of Score By Age

[Figure: two scatterplots of Score vs. Age with fitted lines, one without influential point #18 and one for all data.]

Without influential point #18

Linear Fit: Score = 105.63 - 0.78 Age. Summary of Fit: RSquare = 0.1121.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | 280.5195 | 280.519 | 2.2740 | 0.1489 |
| Error | 18 | 2220.4805 | 123.360 | | |
| C. Total | 19 | 2501.0000 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 105.62987 | 7.161928 | 14.75 | <.0001 |
| Age | -0.779221 | 0.516733 | -1.51 | 0.15 |

All data

Linear Fit: Score = 109.87 - 1.13 Age. Summary of Fit: RSquare = 0.41.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | 1604.0809 | 1604.08 | 13.2018 | 0.0018 |
| Error | 19 | 2308.5858 | 121.50 | | |
| C. Total | 20 | 3912.6667 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 109.87384 | 5.067802 | 21.68 | <.0001 |
| Age | -1.126989 | 0.310172 | -3.63 | 0.0018 |

Conclusion: It is not at all clear that scores and ages are related for normal children.

#### How to identify leverage points in multiple regression?

Use the leverage plot!

- Use the plot for the j-th factor to find leverage points with the potential to influence $\beta_j$.
- Use their residuals on this same plot to help determine whether they are influential.

#### Outliers, leverage and influential points in multiple regression: Pollution Example

Data set pollution.JMP provides information about the relationship between pollution and mortality for 60 cities over 1959-1961. The variables are:

- y (MORT) = total age-adjusted mortality, in deaths per 100,000 population
- PRECIP = mean annual precipitation (in inches)
- EDUC = median number of school years completed for persons 25 and older
- NONWHITE = percentage of the 1960 population that is nonwhite
- LogNOX = log of NOX pollution
- LogSO2 = log of SO2 pollution

Based on the previous model-building study, we will use only PRECIP, EDUC, NONWHITE, and Log(SO2) to predict MORT. (Previously we used SqRt(SO2); now we use Log(SO2). The results are similar but not identical.)

RSquare = 0.684; Root Mean Square Error = 36.24; Observations = 60.

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 944.8 | 93.79 | 10.07 | <.0001 |
| PRECIP | 1.641 | 0.6134 | 2.68 | 0.0098 |
| NONWHITE | 3.321 | 0.5850 | 5.68 | <.0001 |
| EDUC | -13.96 | 6.884 | -2.03 | 0.0474 |
| Log(SO2) | 15.01 | 3.433 | 4.37 | <.0001 |

#### Outliers in Multiple Regression

Outliers in terms of multiple regression:

- Observations with large residuals.
- Identify them from the residual plot.
- If the residuals come from a normal distribution, then a residual with absolute value larger than about $2.6\,s_e$ is expected only 1% of the time.
- Investigate observations with residuals of large magnitude.

#### Residual plot of MORT on PRECIP, NONWHITE, EDUC, and Log(SO2)

[Figure: MORT Residual vs. MORT Predicted, with New Orleans, LA and Lancaster, PA labeled.]

Two cities on the plot show somewhat large regression residuals and hence can be classified as possible regression outliers.

Note that residual plots for multiple regression plot the residuals against the predicted values.

#### Leverage in Multiple Regression

- In a simple regression, a point has high leverage if it is an outlier in X.
- In a multiple regression, we identify leverage points for each predictor.
- We use leverage plots to identify high-leverage and influential points for each regression coefficient.
- High-leverage observations for a certain x-variable may affect the estimated value of that coefficient.

#### Leverage Plots

A simple regression view of a multiple regression coefficient.

- For $x_j$: the residual of y (fitted without $x_j$) is plotted against the residual of $x_j$ (fitted on the rest of the x's). Both axes are recentered at their means.
- Slope = the coefficient of that variable in the multiple regression.
- The p-value is the same as the effect test p-value.
- Distances from the points to the LS line are the multiple regression residuals.
- Useful for identifying, relative to $x_j$: outliers, leverage, and influential points.
- Use them the same way as in a simple regression.

#### Pollution Data: Final Model Results

Summary of Fit: RSquare = 0.6835; Root Mean Square Error = 36.24; Observations = 60.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 4 | 156030 | 39007 | 29.70 | <.0001 |
| Error | 55 | 72243 | 1313 | | |
| C. Total | 59 | 228273 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 944.8 | 93.79 | 10.07 | <.0001 |
| NONWHITE | 3.321 | 0.5850 | 5.68 | <.0001 |
| EDUC | -13.96 | 6.884 | -2.03 | 0.0474 |
| PRECIP | 1.641 | 0.6134 | 2.68 | 0.0098 |
| Log(SO2) | 15.01 | 3.433 | 4.37 | <.0001 |

#### Leverage plots

[Figure: four leverage plots of MORT Leverage Residuals against PRECIP (P=0.0138), NONWHITE (P<.0001), EDUC (P=0.0281) and Log(SO2), with New Orleans, LA labeled in each panel.]

#### Interpretation of Leverage Plots

The labeled observation, New Orleans, is a moderate outlier, and it is somewhat leveraged for estimating the coefficients of both Log(SO2) and NONWHITE, and possibly of EDUC.

Since New Orleans is both moderately highly leveraged and has large |residuals|, we suspect that it may be influential.
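The leverage-plot construction described in these overheads can be sketched in a few lines. This is a minimal illustration on synthetic data, not the pollution data and not JMP's own implementation; the helper names `ols` and `leverage_plot_coords` are hypothetical. It checks the two properties listed for leverage plots: the slope of the line through the plotted points equals the multiple regression coefficient, and the distances to that line are the multiple regression residuals.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already holds an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def leverage_plot_coords(X, y, j):
    """Coordinates for the leverage plot of column j (hypothetical helper).

    Horizontal axis: residual of x_j regressed on the other columns.
    Vertical axis:   residual of y regressed on the other columns.
    Both are recentered at the means of x_j and y, as the slides describe.
    """
    others = np.delete(X, j, axis=1)          # every column except x_j
    xj = X[:, j]
    ex = xj - others @ ols(others, xj)        # part of x_j not explained by the rest
    ey = y - others @ ols(others, y)          # part of y not explained by the rest
    return ex + xj.mean(), ey + y.mean()

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # correlated predictors
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])

beta = ols(X, y)                              # full multiple regression
px, py = leverage_plot_coords(X, y, j=2)      # leverage plot for x2
slope, intercept = np.polyfit(px, py, 1)      # simple regression on the plot

full_resid = y - X @ beta
plot_resid = py - (intercept + slope * px)
print("plot slope:", slope, " multiple-regression coefficient:", beta[2])
```

A point far to the left or right on the horizontal axis of such a plot is leveraged for that coefficient; if it also sits far from the line, it is a candidate influential point.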
Parameter Estimates with New Orleans

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 944.8 | 93.79 | 10.07 | <.0001 |
| NONWHITE | 3.321 | 0.5850 | 5.68 | <.0001 |
| EDUC | -13.96 | 6.884 | -2.03 | 0.0474 |
| PRECIP | 1.641 | 0.6134 | 2.68 | 0.0098 |
| Log(SO2) | 15.01 | 3.433 | 4.37 | <.0001 |

Parameter Estimates without New Orleans

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 862.9 | 86.04 | 10.03 | <.0001 |
| NONWHITE | 2.730 | 0.5419 | 5.04 | <.0001 |
| EDUC | -7.934 | 6.316 | -1.26 | 0.2145 |
| PRECIP | 1.785 | 0.5472 | 3.26 | 0.0019 |
| Log(SO2) | 19.69 | 3.279 | 6.00 | <.0001 |

- The first output is for the data with New Orleans; the second is for the data without it.
- Note the large change in the coefficient of EDUC.
- There are also noticeable changes in the coefficients for Log(SO2) and NONWHITE.
- The coefficient for EDUC isn't even statistically significant in the analysis without New Orleans.

#### We might have used an alternative model

Influential points can have an extreme impact on the analysis.

Because of the importance of NOX and SO2, we might have chosen the final model to be MORT vs. PRECIP, NONWHITE, EDUC, log NOX and log SO2.

Notice that log NOX is not significant. One could still leave it in the model so that we can better see whether it has an effect.

Whole Model, Summary of Fit: RSquare = 0.688278.

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 5 | 157115.28 | 31423.1 | 23.8462 | <.0001 |
| Error | 54 | 71157.80 | 1317.7 | | |
| C. Total | 59 | 228273.08 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 940.6541 | 94.05424 | 10.00 | <.0001 |
| PRECIP | 1.9467286 | 0.700696 | 2.78 | 0.0075 |
| EDUC | -14.66406 | 6.937846 | -2.11 | 0.0392 |
| NONWHITE | 3.028953 | 0.668519 | 4.53 | <.0001 |
| Log(NOX) | 6.7159712 | 7.39895 | 0.91 | 0.3681 |
| Log(SO2) | 11.35814 | 5.295487 | 2.14 | 0.0365 |

Effect Tests

| Source | Sum of Squares | F Ratio | Prob > F |
|---|---|---|---|
| PRECIP | 10171.388 | 7.7188 | 0.0075 |
| EDUC | 5886.913 | 4.4674 | 0.0392 |
| NONWHITE | 27051.227 | 20.5285 | <.0001 |
| Log(NOX) | 1085.691 | 0.8239 | 0.3681 |
| Log(SO2) | 6062.217 | 4.6005 | 0.0365 |

[Figure: Residual by Predicted Plot, MORT Residual vs. MORT Predicted, with New Orleans, LA labeled.]

[Figure: five leverage plots of MORT Leverage Residuals against PRECIP (P=0.0075), Log NOX (P=0.3681), EDUC (P=0.0392), Log SO2 (P=0.0365) and NONWHITE (P<.0001).]

The observation New Orleans is an outlier for estimating each coefficient and is highly leveraged for estimating the coefficients of interest on log NOX and log SO2.

Since New Orleans is both highly leveraged and an outlier, we expect it to be influential.
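The natural follow-up to a suspicion of influence is the check used throughout these overheads: fit the model with and without the point and compare. The sketch below does this on synthetic data with one planted point that is both leveraged and far from the fitted plane. The data, the planted point and the helper names `fit` and `coef_change_without` are all assumptions made for illustration; none of it comes from pollution.JMP.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients for y ~ X (X already holds an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def coef_change_without(X, y, i):
    """Return (coefficients on all rows, coefficients without row i, difference)."""
    keep = np.arange(len(y)) != i             # mask that excludes row i
    full = fit(X, y)
    dropped = fit(X[keep], y[keep])
    return full, dropped, dropped - full

# Synthetic data (illustration only): true plane is 1 + 2*x1 + 0.5*x2.
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Plant one observation that is extreme in x1 and lies well below the plane.
x1[0] = 6.0
y[0] = 1.0 + 2.0 * 6.0 + 0.5 * x2[0] - 8.0

X = np.column_stack([np.ones(n), x1, x2])

full, dropped, delta = coef_change_without(X, y, 0)      # drop the planted point
_, _, delta_ordinary = coef_change_without(X, y, 1)      # drop an ordinary point

print("x1 coefficient, all rows:        ", full[1])
print("x1 coefficient, without row 0:   ", dropped[1])
print("change from dropping row 0:      ", delta[1])
print("change from dropping row 1:      ", delta_ordinary[1])
```

In JMP the same comparison is made by excluding the row and re-fitting; the code only mirrors that workflow.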
#### Multiple Regression with and without New Orleans

Multiple Regression with New Orleans

Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.688278 |
| RSquare Adj | 0.659415 |
| Root Mean Square Error | 36.30065 |
| Mean of Response | 940.3568 |
| Observations (or Sum Wgts) | 60 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 5 | 157115.28 | 31423.1 | 23.8462 | <.0001 |
| Error | 54 | 71157.80 | 1317.7 | | |
| C. Total | 59 | 228273.08 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 940.6541 | 94.05424 | 10.00 | <.0001 |
| PRECIP | 1.9467286 | 0.700696 | 2.78 | 0.0075 |
| EDUC | -14.66406 | 6.937846 | -2.11 | 0.0392 |
| NONWHITE | 3.028953 | 0.668519 | 4.53 | <.0001 |
| Log NOX | 6.7159712 | 7.39895 | 0.91 | 0.3681 |
| Log SO2 | 11.35814 | 5.295487 | 2.14 | 0.0365 |

Multiple Regression without New Orleans

Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.724661 |
| RSquare Adj | 0.698686 |
| Root Mean Square Error | 32.06752 |
| Mean of Response | 937.4297 |
| Observations (or Sum Wgts) | 59 |

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 5 | 143441.28 | 28688.3 | 27.8980 | <.0001 |
| Error | 53 | 54501.26 | 1028.3 | | |
| C. Total | 58 | 197942.54 | | | |

Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 852.3761 | 85.9328 | 9.92 | <.0001 |
| PRECIP | 1.3633298 | 0.635732 | 2.14 | 0.0366 |
| EDUC | -5.666948 | 6.52378 | -0.87 | 0.3889 |
| NONWHITE | 3.0396794 | 0.590566 | 5.15 | <.0001 |
| Log NOX | -9.898442 | 7.730645 | -1.28 | 0.2060 |
| Log SO2 | 26.032584 | 5.931083 | 4.39 | <.0001 |

Removing New Orleans has a large impact on the coefficients of log NOX, EDUC and log SO2. In particular, it reverses the sign of the log NOX coefficient.
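The comparison above can be made concrete with a little arithmetic on the reported estimates. The sketch below copies the estimates and standard errors from the two Parameter Estimates outputs and, for each term, reports the change, the change divided by the with-New-Orleans standard error, and whether the sign flips. Scaling by the standard error is only a rough yardstick chosen here for illustration; the slides themselves rely on the graphical comparison, and the helper name `compare` is hypothetical.

```python
# (estimate, std error) pairs copied from the reported Parameter Estimates outputs.
with_new_orleans = {
    "Intercept": (940.6541, 94.05424),
    "PRECIP": (1.9467286, 0.700696),
    "EDUC": (-14.66406, 6.937846),
    "NONWHITE": (3.028953, 0.668519),
    "Log NOX": (6.7159712, 7.39895),
    "Log SO2": (11.35814, 5.295487),
}
without_new_orleans = {
    "Intercept": (852.3761, 85.9328),
    "PRECIP": (1.3633298, 0.635732),
    "EDUC": (-5.666948, 6.52378),
    "NONWHITE": (3.0396794, 0.590566),
    "Log NOX": (-9.898442, 7.730645),
    "Log SO2": (26.032584, 5.931083),
}

def compare(with_fit, without_fit):
    """Per-term change in the estimate when the observation is removed."""
    rows = {}
    for term, (b_with, se_with) in with_fit.items():
        b_without, _ = without_fit[term]
        change = b_without - b_with
        rows[term] = {
            "change": change,
            # Assumption: the with-observation std error is used as the yardstick.
            "change_in_se": change / se_with,
            "sign_flip": b_with * b_without < 0,
        }
    return rows

summary = compare(with_new_orleans, without_new_orleans)
for term, r in summary.items():
    print(f"{term:10s} change={r['change']:9.3f}  "
          f"in SE units={r['change_in_se']:6.2f}  sign flip={r['sign_flip']}")
```

On these numbers the largest shifts are in Log SO2, Log NOX and EDUC (roughly 2.8, 2.2 and 1.3 standard errors in magnitude), and Log NOX is the only term whose sign changes, which matches the conclusion stated above.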
