EXST 7015 Fall 2011 Lecture 13 - Statistical Techniques II

Observation Diagnostics

The first columns are the value of Yi and the predicted value of Yi. You are responsible for understanding these, along with the residual (the difference between these two values). These have not changed from SLR.

You are not responsible for the Std Err Predict or the Std Err Residual. These are estimates of standard deviations and have been adjusted by the hii values.

You are responsible for the confidence intervals, the Upper and Lower 95% MEAN and the Upper and Lower 95% PREDICT. These are confidence intervals for the regression line (Yhat_i) and for individual points (Yi), respectively.

Recall that for simple linear regression

    Y_i = b_0 + b_1 X_i + e_i, \quad Y_i = \hat{Y}_i + e_i

The variance for \hat{Y}_i is

    S^2_{\hat{Y}|X} = MSE\left[\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]

The variance for an individual observation, Y_i = \hat{Y}_i + e_i, is

    S^2_{Y|X} = MSE + MSE\left[\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]
              = MSE\left[1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]

You are responsible for:
    the Studentized residual and, perhaps more important, the deleted studentized residual (RSTUDENT);
    the hat diag values (hii);
    the remaining 3 diagnostics of interest, the influence diagnostics (DFFITS, DFBetas and Cook's D).
You are NOT responsible for the column titled Cov Ratio.

Partial Residual Plots

These are "scatter plots" of the Y variable adjusted for all Xi except one, plotted on that Xi adjusted for all other Xi. I used these to get across the concept that not only are the Yi adjusted for each Xi, but the Xi are also adjusted for each other. Beyond this, these are used more like "scatter plots" than "residual plots". We can look for curvature, nonhomogeneous variance, etc. If they appear to represent random scatter about zero, it is because the variable does not contribute anything to the model, not because it is a "residual plot".

Ordinary Residual Plots (see SAS output)

Full model with diagnostics: plot of the residuals (e) against YHat. Legend: A = 1 obs, B = 2 obs, etc.

[SAS line-printer plot of the residuals against the Predicted Value of InfRisk (roughly 2.0 to 7.0); the residuals scatter about zero with no obvious pattern.]

The plot above was printed with the default FORMCHAR string, which appears as box-drawing characters in plain text; to get printable characters use OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";

Residual Analysis with PROC UNIVARIATE

This is an important procedure for evaluating residuals, especially for the assumption of normality. The Shapiro-Wilk test values are W = 0.985827 with Pr < W = 0.8069. These results would lead you to FAIL to reject the hypothesis of normality. We conclude the observed results are consistent with a normal distribution. The plots lead to the same conclusion.
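As a rough sketch of how this output can be produced, the residuals might be written to a data set by PROC REG and then passed to PROC UNIVARIATE. The output data set and variable names (resout, resid, etc.) are illustrative and are not part of the original handout.

* Fit the full model and save the residuals and diagnostics (a sketch; names are assumed);
proc reg data=SENIC;
  model InfRisk = LtofStay Age CulRatio XRay NoBeds Census Nurses Services
                  / r influence partial;
  output out=resout p=yhat r=resid rstudent=rstud h=hatdiag;
run;

* Shapiro-Wilk test plus stem-and-leaf, box and normal probability plots of the residuals;
proc univariate data=resout normal plot;
  var resid;
run;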
[PROC UNIVARIATE also produces a stem-and-leaf plot, a box plot and a normal probability plot of the residuals; all three are consistent with a normal distribution.]

We would also check for outliers, and again see no great problems. Obs # 53 is too large, but it is only one out of 113, so not entirely unexpected. This is consistent with our observations from the RStudent values.

PROC GLM and PROC MIXED

For regression there is not much that is new with these procedures; PROC GLM and PROC MIXED can do the same analysis as PROC REG. These procedures can provide both Type I SS and Type III SS with tests. PROC GLM provides Type I sums of squares by default; PROC MIXED provides them on request. The tests of the Type III SS (or Type II SS) are identical to the t-tests of the regression coefficients.

Observation diagnostics (see computer output handout)

First we will discuss "observation diagnostics" and tests of the assumptions. The "ALL" option produces a host of output, but not everything; the INFLUENCE, COLLIN and PARTIAL options are also needed for some additional output. One difference from SLR is that where previously we used Xi, we now use Yhat. For example, residual plots are usually plotted on Yhat. The variance calculations for multiple regression use matrix algebra and include all variances and covariances for the regression coefficients.

Using Studentized residuals: the Bonferroni adjustment

Doing more tests increases your chance of error. It is possible to do 20, 100, even 1000 tests and have no Type I errors (at α = 0.05), but the chance of an error goes up. The rate of increase is not linear, so twice as many tests does not double your chance of error. However, as an approximation, Bonferroni noted that the probability of error would be NO GREATER than the sum of the α values of the individual tests. For example, do one test at α and the probability of error is α; do two tests and you have no more than a 2α chance of error; do 10 tests and the error rate is < 10α.

This Bonferroni concept suggests a simple fix. If we were to do 2 tests at α/2, then the two tests together would have no more than a 2*(α/2) error rate, giving us α overall. If we were to do 10 tests at α/10, then the ten tests together would have no more than a 10*(α/10) error rate (= α). Two-tailed tests are already at α/2, so we actually want α/4 for two tests and α/20 for 10 tests.

To make this correction simply choose the t value to reflect the smaller α value. For studentized residuals use t with probability α/2n and n-p d.f. For deleted residuals use t with probability α/2n and n-p-1 d.f., where there is an extra "-1" because of the deleted value. For our numerical analysis the critical values would be

    t_{α/2, n-p} = 2.144788596 (unadjusted)
    t_{α/2n, n-p-1} = 3.621389624 (Bonferroni adjusted)

The RSTUDENT value for one observation (#17) exceeds this value and is a probable outlier.
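As a small sketch (not from the original handout), the adjusted critical values could be computed in a DATA step; the values of alpha, n and p below are placeholders to be replaced by the values for the model at hand.

* Sketch: Bonferroni-adjusted critical t values for residual outlier tests;
* alpha, n (observations) and p (parameters, including intercept) are placeholder values;
data bonferroni;
  alpha = 0.05;  n = 113;  p = 9;
  t_unadjusted = quantile('T', 1 - alpha/2,     n - p);      * ordinary two-tailed critical value;
  t_student    = quantile('T', 1 - alpha/(2*n), n - p);      * adjusted, studentized residuals;
  t_rstudent   = quantile('T', 1 - alpha/(2*n), n - p - 1);  * adjusted, deleted (RSTUDENT) residuals;
run;

proc print data=bonferroni; run;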
Variable Selection

We have previously discussed the concept of partial sums of squares and partial regression coefficients. As you know, the addition or removal of any variable will change the partial coefficients and tests of all other variables in the model. Therefore, if you decide to add or remove variables from a model, this should be done one variable at a time.

Stepwise Variable Selection

The procedure has been formalized in several options. We will discuss a few of these: Forward Selection, Backward Selection and Stepwise Selection. One additional reason for reducing the model in Example 2 is that we had multicollinearity. Stepwise regression is not specifically designed to avoid multicollinearity, but it will tend not to pick up two variables that are collinear.

Backward selection is the simplest. It starts with the full model, a model with all variables of interest already present. A selection criterion is established; perhaps we want no non-significant variables in the model (α = 0.05). See the SAS output.

Forward selection

Forward selection works by calculating all possible simple linear regressions and picking the best one to start with. Again, the F-test of the Type II SS, or the t-test of the slopes, is used as the criterion for selection. The "best" variable is the most significant one, as long as it meets a minimum criterion. Once chosen, this best variable will remain in the model for the whole analysis. After picking the one best variable, the analysis checks all possible 2-factor models, trying each of the remaining variables together with the first one chosen. If there are additional variables that meet the criterion, the analysis chooses the best of these. The step is repeated until no remaining variables meet the criterion.

Stepwise selection

There is a variation of FORWARD selection called the "Stepwise" option, requested by "selection=stepwise". This is like forward selection, except that at each step the analysis checks to make sure that each variable already in the model still meets the criterion. If a variable falls below the criterion it will be removed. Think of it as forward selection with a backward glance.

There is one additional option that can be useful among the selection options. You can specify INCLUDE=#. This will force SAS to keep the first # variables in the model; they will be in to start with and will not be removed. This is good if you have a base model you want to keep intact and want to check for additional variables.

I ran the following program to force a larger model,

proc reg data=SENIC;
  model InfRisk = LtofStay Age CulRatio XRay NoBeds Census Nurses Services
                  / selection=stepwise sle=0.5 sls=0.5;
run;

and got these results (no variables were removed at any step):

Step  Variable Entered  Number Vars In  Partial R-Square  Model R-Square     C(p)  F Value  Pr > F
   1  CulRatio                       1            0.3127          0.3127  41.5161    50.49  <.0001
   2  LtofStay                       2            0.1377          0.4504  13.3525    27.57  <.0001
   3  Services                       3            0.0430          0.4934   5.9368     9.25  0.0029
   4  XRay                           4            0.0227          0.5161   2.9592     5.07  0.0263
   5  Nurses                         5            0.0036          0.5197   4.1729     0.80  0.3731
   6  NoBeds                         6            0.0029          0.5226   5.5431     0.64  0.4260
   7  Age                            7            0.0023          0.5249   7.0440     0.50  0.4795

Other criteria can be used to determine the number of variables to retain. As you know, when a variable is added to a model the usual R2 always gets larger. The adjusted R2 is "adjusted" such that the value will not get larger unless the variation accounted for by the variable is equal to at least one MSE; otherwise this value can actually decrease. Here the ordinary R2 always increases, while the adjusted R2 value starts decreasing after the 4th variable is added.
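For reference (this formula is not in the original handout, but it is the standard definition), the adjustment compares the MSE of the fitted model to the total mean square, so adding a variable only helps when it removes at least one MSE worth of variation:

    R^2_{adj} = 1 - \frac{MSE}{SST/(n-1)} = 1 - \frac{(n-1)(1 - R^2)}{n - p}

where p is the number of parameters in the model, including the intercept.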
[Plot of R2 and adjusted R2 against the number of variables in the model: R2 rises from about 0.31 with one variable to about 0.52 with all eight, while the adjusted R2 levels off and declines after the 4th variable.]

Plotting the MSE suggests a similar result.

[Plot of MSE against the number of variables: MSE decreases to a minimum at about four variables.]

Mallows' C(p) statistic is supposed to indicate the "best" model when C(p) = p. The C(p) statistic depends on the full model being a pretty good model with no multicollinearity. This is not always true, and as we know it is probably not true for this example.

[Plot of C(p) against the number of variables.]

Other selection criteria

The information indices (AIC, BIC and SBC) will be discussed later.

[Plot of SBC, Cp, BIC and AIC against the number of variables.]

Multicollinearity of the Reduced Model

Question! Did the stepwise selection and the resulting reduced model cure our "multicollinearity" problems? I reran the reduced model with options to get the collinearity diagnostics.

Variable    Variance Inflation
Intercept            0
LtofStay       1.35777
CulRatio       1.28045
XRay           1.33266
Services       1.15566

The mean VIF is less than two, and no individual value even reaches two, much less the criterion of 10. Also, the highest condition number was only 16, well below the criterion of 30. We conclude the model selected with stepwise regression clearly has no multicollinearity problems. Stepwise selection does not ALWAYS fix problems with multicollinearity.

The reduced model has 4 significant variables. The full model had only 3 significant variables, and they were not the same variables that were significant. The reduced model also has an R2 = 51.61%, while the full model had R2 = 52.51%. The simpler model with nearly the same R2 value is most likely a superior model.

The R2 selection option

There is one other "variable selection" option that is very interesting, and it is quite different from the stepwise selection methods. Suppose that you are going to fit a model with a number of variables, let's call them a, b, c, d, e and f. What happens if stepwise selection chooses one set of variables, but for some reason you prefer a different set? For example, suppose you feel that variables a, b and d should be the best variables, and stepwise selects a, b and e. How much better is this model than the one that you feel is best? Or suppose that variable d is inexpensive and easy to measure while c is expensive and difficult. If you use d instead of c, how much do you lose?

We will examine the RSQUARE selection option. This procedure will show you the best models, not just one but several. It will also show you how good larger (more variables) and smaller (fewer variables) models might be. The major criterion here is the value of R2, which is something of a limitation.

To request the procedure, ask for the model option "selection=rsquare". I also included the options "start=3 stop=6 best=8". This instructs SAS to start with 3-variable models, go up to 6 variables, and show the best 8 models for each number of variables.
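A sketch of the corresponding call, assuming the same SENIC data set and variable list as the stepwise run above:

* R-square selection: best 8 models for each size from 3 to 6 variables;
proc reg data=SENIC;
  model InfRisk = LtofStay Age CulRatio XRay NoBeds Census Nurses Services
                  / selection=rsquare start=3 stop=6 best=8;
run;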
As requested, the RSQUARE selection option first produces the best 8 three-factor models (plus intercept).

Number in  R-Square    Variables in Model
        3  0.49340010  LTOFSTAY CULRATIO SERVICES
        3  0.48523075  LTOFSTAY CULRATIO NURSES
        3  0.47356336  LTOFSTAY CULRATIO NOBEDS
        3  0.47347050  LTOFSTAY CULRATIO CENSUS
        3  0.46955655  LTOFSTAY CULRATIO XRAY
        3  0.46300398  CULRATIO XRAY SERVICES
        3  0.46191250  CULRATIO XRAY CENSUS
        3  0.45384242  CULRATIO XRAY NURSES

And then the best four-factor, five-factor, etc.

Number in  R-Square    Variables in Model
        4  0.51613081  LTOFSTAY CULRATIO XRAY SERVICES
        4  0.51023237  LTOFSTAY CULRATIO XRAY NURSES
        4  0.50002851  LTOFSTAY CULRATIO XRAY CENSUS
        4  0.49971593  LTOFSTAY CULRATIO XRAY NOBEDS
        4  0.49556642  LTOFSTAY CULRATIO NURSES SERVICES
        4  0.49556459  LTOFSTAY AGE CULRATIO SERVICES
        4  0.49348607  LTOFSTAY CULRATIO NOBEDS SERVICES
        4  0.49341314  LTOFSTAY CULRATIO CENSUS SERVICES

The best model we found was a 4-factor model, so here we can check for alternative 4-factor models. Note that frequently very little is lost by replacing one or two variables with different variables, often less than a few percentage points on the R2 value. The other variables may be more interpretable, more reliably measured, cheaper and easier to measure, or have some other advantage.

Other Regression Topics

As mentioned earlier, the intercept for our last problem was not very meaningful (when all Xi equal zero we have no beds, no nurses, a length of stay of zero days, etc.). This is not an uncommon problem. In evaluating the abundance of marine organisms with salinity, temperature and depth, for example, a salinity of zero is not a marine environment, a temperature of zero is not liquid and a depth of zero is not wet, so the intercept is meaningless.

So, if you want to plot your data on one of the Xi values, what can you do? If you just extract the intercept and the slope of interest, you are essentially setting all other Xi equal to zero. This can lead to unreasonable values of Yhat even if you do not show the intercept.

    \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i}
    \hat{Y}_i = b_0 + b_1 X_{1i} + b_2(0) + b_3(0) + b_4(0)
    \hat{Y}_i = b_0 + b_1 X_{1i}
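As a sketch of what this naive extraction amounts to, the reduced four-variable model from these notes is assumed below; the data set names est and naive and the renamed coefficient b_LOS are illustrative.

* The OUTEST= data set stores the fitted coefficients as variables named after the regressors;
proc reg data=SENIC outest=est noprint;
  model InfRisk = LtofStay CulRatio XRay Services;
run;

* "Extracting" only the intercept and one slope implicitly sets the other X's to zero;
data naive;
  if _n_ = 1 then set est(keep=Intercept LtofStay rename=(LtofStay=b_LOS));
  set SENIC;
  yhat_naive = Intercept + b_LOS*LtofStay;
run;

Plotting yhat_naive against LtofStay reproduces the last equation above, which is why the resulting Yhat values can be unreasonable.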
James P. Geaghan - Copyright 2011