Lecture T: Regression Diagnostics

In this lecture, we will examine some methods for testing the important assumptions of regression. Then we will discuss some remedial measures that can be used to correct violations of these assumptions.

Key Assumptions of Simple Linear Regression

Model: y = α + βx + e
• The underlying relationship between x and y is linear: y = α + βx
• e has constant variance (σ²) for all values of x.
• e is normally distributed.

Because we do not know the "true" line (y = α + βx), only our estimate of it (ŷ = a + bx), we will use the residuals as a proxy for e when testing the assumptions about e.

Is the relationship between x and y linear?
• Residual plot - for each observation, plot the residual (on the vertical axis) against the x value.
• Scatter plot (y vs. x)

Do the residuals have constant variance?
• Residual plot
• Scatter plot (y vs. x)

Are the residuals normally distributed?
• QQ-plot of residuals
• Histogram
• Tests of normality (e.g., Shapiro-Wilk, Kolmogorov-Smirnov)

Ex. Toluca Company

"The Toluca Company manufactures refrigeration equipment as well as many replacement parts. In the past, one of the replacement parts has been produced periodically in lots of varying sizes. When a cost improvement program was undertaken, company officials wished to determine the optimum lot size for producing this part. The production of this part involves setting up the production process (which must be done no matter what the lot size is) and machining and assembly operations. One key input for the model... was the relationship between lot size and labor hours to produce the lot."
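SAS will compute all of these diagnostics for us below. Purely as an illustrative sketch (made-up data, not from the lecture), the residuals and the coordinates behind a normal QQ-plot can be computed in a few lines of Python (statistics.NormalDist needs Python 3.8+; Blom plotting positions are one common choice):

```python
from statistics import NormalDist, mean, stdev

def fit_line(x, y):
    """Least-squares estimates (a, b) for y-hat = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return ybar - b * xbar, b

def qq_points(resid):
    """(normal quantile, sorted residual) pairs; a roughly straight line
    suggests the residuals are consistent with normality.
    Uses Blom plotting positions p_i = (i - 0.375)/(n + 0.25)."""
    n = len(resid)
    nd = NormalDist(mean(resid), stdev(resid))
    return [(nd.inv_cdf((i - 0.375) / (n + 0.25)), r)
            for i, r in enumerate(sorted(resid), start=1)]

# Made-up illustrative data.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y = [25.0, 41.0, 62.0, 79.0, 102.0, 123.0]
a, b = fit_line(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # proxy for e
print([round(e, 2) for e in resid])
print([(round(q, 2), round(r, 2)) for q, r in qq_points(resid)])
```

Plotting the residuals against x, and the QQ pairs against each other, gives exactly the pictures the SAS plots below show.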
from Kutner, Nachtsheim, Neter & Li, Applied Linear Statistical Models, 5th Edition

data toluca;
  input lotsize workhours @@;
  cards;
 80 399   30 121   50 221   90 376   70 361
 60 224  120 546   80 352  100 353   50 157
 40 160   70 252   90 389   20 113  110 435
100 420   30 212   50 268   90 377  110 421
 30 273   90 468   40 244   80 342   70 323
;
proc reg data=toluca;
  model workhours = lotsize;
  plot workhours*lotsize;
  plot r.*lotsize;
  output out=tolucaout r=resid;
run;

Knapp Stat 350 Spring 2009

                        The REG Procedure
                          Model: MODEL1
                  Dependent Variable: workhours

          Number of Observations Read          25
          Number of Observations Used          25

                      Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1      252378      252378     105.88    <.0001
Error             23       54825  2383.71562
Corrected Total   24      307203

Root MSE           48.82331    R-Square    0.8215
Dependent Mean    312.28000    Adj R-Sq    0.8138
Coeff Var          15.63447

                      Parameter Estimates

              Parameter    Standard
Variable   DF  Estimate       Error    t Value    Pr > |t|
Intercept   1  62.36586    26.17743       2.38      0.0259
lotsize     1   3.57020     0.34697      10.29      <.0001

[Scatter plot of workhours vs. lotsize with fitted line
 workhours = 62.366 + 3.5702*lotsize; N = 25, Rsq = 0.8215,
 Adj Rsq = 0.8138, RMSE = 48.823]

Residual Plot

[Residual plot: resid vs. lotsize for the fitted line
 workhours = 62.366 + 3.5702*lotsize; N = 25, Rsq = 0.8215,
 Adj Rsq = 0.8138, RMSE = 48.823]
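As a quick cross-check of the SAS estimates (a sketch, not part of the original lecture), the slope, intercept, R-square, and first residual can be reproduced in plain Python from the 25 (lotsize, workhours) pairs:

```python
# The 25 Toluca (lotsize, workhours) pairs listed in the proc print output.
lotsize = [80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
           20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70]
workhours = [399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160, 252, 389,
             113, 435, 420, 212, 268, 377, 421, 273, 468, 244, 342, 323]

n = len(lotsize)
xbar = sum(lotsize) / n
ybar = sum(workhours) / n
sxx = sum((x - xbar) ** 2 for x in lotsize)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(lotsize, workhours))
b = sxy / sxx                       # slope
a = ybar - b * xbar                 # intercept
resid = [y - (a + b * x) for x, y in zip(lotsize, workhours)]
sse = sum(e * e for e in resid)     # error sum of squares
sst = sum((y - ybar) ** 2 for y in workhours)
r_sq = 1 - sse / sst
print(round(a, 3), round(b, 4), round(r_sq, 4))
# should agree with the SAS output: 62.366, 3.5702, 0.8215
print(round(resid[0], 3))  # first residual, cf. Obs 1 below: 51.018
```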
proc print data=tolucaout;
run;

Obs    workhours    lotsize      resid
  1       399          80       51.018
  2       121          30      -48.472
  3       221          50      -19.876
  4       376          90       -7.684
  5       361          70       48.720
  6       224          60      -52.578
  7       546         120       55.210
  8       352          80        4.018
  9       353         100      -66.386
 10       157          50      -83.876
 11       160          40      -45.174
 12       252          70      -60.280
 13       389          90        5.316
 14       113          20      -20.770
 15       435         110      -20.088
 16       420         100        0.614
 17       212          30       42.528
 18       268          50       27.124
 19       377          90       -6.684
 20       421         110      -34.088
 21       273          30      103.528
 22       468          90       84.316
 23       244          40       38.826
 24       342          80       -5.982
 25       323          70       10.720

proc univariate data=tolucaout normal;
  var resid;
  histogram resid / normal kernel(L=2 color=red);
  qqplot resid / normal (L=1 mu=est sigma=est);
run;

                     Tests for Normality

Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W      0.978904    Pr < W      0.8626
Kolmogorov-Smirnov    D      0.095720    Pr > D     >0.1500
Cramer-von Mises      W-Sq   0.033263    Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq   0.207142    Pr > A-Sq  >0.2500

[Histogram of residuals with normal and kernel density overlays]

Example 1: Underlying Model is NOT linear

proc reg data=example1;
  model y=x;
  plot y*x;
  plot r.*x;
run;

                        The REG Procedure
                          Model: MODEL1
                     Dependent Variable: y

          Number of Observations Read          30
          Number of Observations Used          30

                      Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1     1042563     1042563     162.47    <.0001
Error             28      179671  6416.83692
Corrected Total   29     1222234

Root MSE           80.10516    R-Square    0.8530
Dependent Mean    186.11915    Adj R-Sq    0.8477
Coeff Var          43.03972

                      Parameter Estimates

              Parameter     Standard
Variable   DF   Estimate       Error    t Value    Pr > |t|
Intercept   1 -147.71672    29.99720      -4.92      <.0001
x           1   21.53780     1.68970      12.75      <.0001

[Scatter plot of y vs. x with fitted line y = -147.72 + 21.538*x;
 N = 30, Rsq = 0.8530, Adj Rsq = 0.8477, RMSE = 80.105]
[Residual plot: residuals vs. x for the fitted line y = -147.72 + 21.538*x;
 N = 30, Rsq = 0.8530, Adj Rsq = 0.8477, RMSE = 80.105]

Example 2: Variance is NOT constant

proc reg data=example2;
  model y=x;
  plot y*x;
  plot r.*x;
run;

                        The REG Procedure
                          Model: MODEL1
                     Dependent Variable: y

          Number of Observations Read         100
          Number of Observations Used         100

                      Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1   843391571   843391571    2590.89    <.0001
Error             98    31901153      325522
Corrected Total   99   875292724

Root MSE          570.54533    R-Square    0.9636
Dependent Mean   5100.08890    Adj R-Sq    0.9632
Coeff Var          11.18697

                      Parameter Estimates

              Parameter    Standard
Variable   DF  Estimate       Error    t Value    Pr > |t|
Intercept   1  19.44981   114.97028       0.17      0.8660
x           1 100.60671     1.97653      50.90      <.0001

[Scatter plot of y vs. x with fitted line y = 19.45 + 100.61*x;
 N = 100, Rsq = 0.9636, Adj Rsq = 0.9632, RMSE = 570.55]

[Residual plot: residuals vs. x for the same model;
 N = 100, Rsq = 0.9636, Adj Rsq = 0.9632, RMSE = 570.55]
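The fan-shaped residual pattern of Example 2 can be mimicked with simulated data (an illustrative sketch, NOT the lecture's example2 dataset): when the error standard deviation grows with x, the residual spread for large x dwarfs the spread for small x.

```python
import random

# Illustrative sketch only: simulate y = 100*x + e where the error
# standard deviation grows with x, the situation that produces a
# fan-shaped residual plot.
random.seed(1)
x = [random.uniform(1, 100) for _ in range(200)]
y = [100 * xi + random.gauss(0, 5 * xi) for xi in x]  # sd proportional to x

# Least-squares fit and residuals.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

def rms(es):
    """Root-mean-square of a list of residuals (a spread measure)."""
    return (sum(e * e for e in es) / len(es)) ** 0.5

# Compare the residual spread for small x vs. large x.
lo = [e for xi, e in zip(x, resid) if xi < 50]
hi = [e for xi, e in zip(x, resid) if xi >= 50]
print(round(rms(lo), 1), round(rms(hi), 1))  # second value should be clearly larger
```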
Example 3: Errors/Residuals are NOT normally distributed

proc reg data=example3;
  model y=x;
  plot y*x;
  plot r.*x;
  output out=ex3out r=resid;
run;
proc univariate data=ex3out normal;
  var resid;
  histogram resid / normal kernel(L=2 color=red);
  qqplot resid / normal (L=1 mu=est sigma=est);
run;

                        The REG Procedure
                          Model: MODEL1
                     Dependent Variable: y

          Number of Observations Read         100
          Number of Observations Used         100

                      Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square    F Value    Pr > F
Model              1   834009927   834009927    1.157E7    <.0001
Error             98  7062.42942    72.06561
Corrected Total   99   834016989

Root MSE            8.48915    R-Square    1.0000
Dependent Mean   5090.06139    Adj R-Sq    1.0000
Coeff Var           0.16678

                      Parameter Estimates

              Parameter    Standard
Variable   DF  Estimate       Error    t Value    Pr > |t|
Intercept   1  37.75911     1.71064      22.07      <.0001
x           1 100.04559     0.02941    3401.90      <.0001

[Scatter plot of y vs. x with fitted line y = 37.759 + 100.05*x;
 N = 100, Rsq = 1.0000, Adj Rsq = 1.0000, RMSE = 8.4891]

[Residual plot: residuals vs. x for the same model]

                     Tests for Normality

Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W      0.885197    Pr < W     <0.0001
Kolmogorov-Smirnov    D      0.140610    Pr > D     <0.0100
Cramer-von Mises      W-Sq   0.550032    Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq   3.470824    Pr > A-Sq  <0.0050

[Histogram of residuals with normal and kernel density overlays]

[Normal QQ-plot of residuals vs. normal quantiles]

What to do if assumptions are violated?
Remedial Measures

Often transformations of the x and/or y variable can alleviate many of the problems with a dataset, including non-linear relationships, non-constant variance, and non-normal error terms, allowing us to use simple linear regression on the transformed data.

General Guidelines

Transform x
• if the relationship is non-linear but the residuals appear to be normal with constant variance.

Transform y
• if the variances are unequal
• if the residuals are non-normal
• (the relationship between x and y may be linear or non-linear)

Note: often a transformation of y will fix the residuals (variance/normality) but wreck the straight-line relationship, so you may have to transform both x and y.

How to transform variables in SAS

data simple;
  input varx vary;
  sqrtx    = sqrt(varx);     /* square root   */
  lnx      = log(varx);      /* natural log   */
  log10x   = log10(varx);    /* log-base-10   */
  xsquared = varx**2;        /* square        */
  invx     = 1/varx;         /* inverse: 1/x  */
  expx     = exp(varx);      /* exp(x) = e^x  */
  cards;
1 25
2 20
3 10
4 5
5 10
;
run;

proc print data=simple;
run;

The SAS System

Obs  varx  vary    sqrtx      lnx    log10x  xsquared     invx     expx
  1    1    25   1.00000  0.00000  0.00000        1   1.00000    2.718
  2    2    20   1.41421  0.69315  0.30103        4   0.50000    7.389
  3    3    10   1.73205  1.09861  0.47712        9   0.33333   20.086
  4    4     5   2.00000  1.38629  0.60206       16   0.25000   54.598
  5    5    10   2.23607  1.60944  0.69897       25   0.20000  148.413
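The same transformations of x can be sketched in Python with the math module (values match the SAS listing above; only varx is transformed, as in the SAS step):

```python
import math

# Apply the same six transformations to the five x values from the SAS data.
varx = [1, 2, 3, 4, 5]
rows = [
    {
        "varx": x,
        "sqrtx": math.sqrt(x),      # square root
        "lnx": math.log(x),         # natural log
        "log10x": math.log10(x),    # log-base-10
        "xsquared": x ** 2,         # square
        "invx": 1 / x,              # inverse: 1/x
        "expx": math.exp(x),        # exp(x) = e^x
    }
    for x in varx
]
for r in rows:
    print({k: round(v, 5) for k, v in r.items()})
```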