EXST7015 Fall2011 Appendix 01



Statistical Techniques II
Regression Diagnostic Criteria
Appendix 1 Supplement, Page 143

Criteria for the interpretation of selected regression statistics from the SAS output

Reference was primarily Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W., Applied Linear Statistical Models, 4th Edition, Richard D. Irwin, Inc., Burr Ridge, Illinois, 1996.

General regression diagnostics

Adjusted R2:

$$R^2_{adj} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSError}{SSTotal} = 1 - \left(\frac{n-1}{n-p}\right)\left(1 - R^2\right)$$

This is intended to be an adjustment to R2 for additional variables in the model. Unlike the usual R2, this value can decrease as more variables are entered in the model if the variables do not account for sufficient additional variation (an amount equal to the MSE).

Standardized regression coefficient bj': bj' = bj (Sxj / Sy)
Unlike the usual regression coefficient, the magnitude of the standardized coefficient provides a meaningful comparison among the regression coefficients. Regressors with larger standardized regression coefficients (in absolute value) have more impact on the calculation of the predicted value and are more “important”.

Partial correlations
Squared semi-partial correlation TYPE I = SCORR1 = SeqSSXj / SSTotal
Squared partial correlation TYPE I = PCORR1 = SeqSSXj / (SeqSSXj + SSError)
Squared semi-partial correlation TYPE II = SCORR2 = PartialSSXj / SSTotal
Squared partial correlation TYPE II = PCORR2 = PartialSSXj / (PartialSSXj + SSError)
Note that for regression, TYPE II SS and TYPE III SS are the same.

Residual Diagnostics

The hat matrix main diagonal elements, hii (“Hat Diag H” values in SAS), are called “leverage values”. They are used to detect unusual observations in the X space, and can also identify substantial extrapolation for new values. As a general rule, hii values greater than 0.5 are “large”, while those between 0.2 and 0.5 are moderately large. Also look for a leverage value that is noticeably larger than the next largest.
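As an illustration of the leverage rule of thumb, here is a minimal Python sketch for the special case of a simple linear regression (intercept plus one regressor), where hii has the closed form 1/n + (xi - xbar)^2 / Sxx. The data values, the function names and the label "unremarkable" are assumptions made for this example; they do not come from the handout or from SAS.

```python
def leverages(x):
    """Hat-matrix diagonal h_ii for a simple linear regression (intercept + one x)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Closed form for one regressor: h_ii = 1/n + (x_i - x_bar)^2 / Sxx
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def classify(h):
    """Label one leverage value with the 0.2 / 0.5 rule of thumb."""
    if h > 0.5:
        return "large"
    if h >= 0.2:
        return "moderately large"
    return "unremarkable"  # hypothetical label for values below 0.2


# Hypothetical data: nine clustered x values and one far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = leverages(x)
labels = [classify(hi) for hi in h]

for xi, hi, label in zip(x, h, labels):
    print(f"x = {xi:>2}  h_ii = {hi:.3f}  {label}")
```

With more than one regressor the hii come from the diagonal of X (X'X)^-1 X' instead, which needs a matrix routine; the classification step would stay the same.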
The hii values sum to p, so the mean hii = p/n (note that this is < 1). An observation may be an “outlier” in the X space if its hii is more than twice the mean (i.e. hii > 2p/n).

Studentized residuals (“Student Residual” in SAS). Also called internally studentized residuals. There are two versions:
Simpler calculation = ei / root(MSE)
More common application = ei / root(MSE * (1 - hii)) [SAS produces these]
We already assume the residuals are normally distributed, so these values would approximately follow a t distribution, where for large samples
about 68% are between -1 and +1
about 95% are between -2 and +2
about 99% are between -2.6 and +2.6

Deleted Studentized residuals (“RStudent” in SAS). Also called externally studentized residuals. There are also two versions, as with the studentized residuals above:
Deleted Studentized = ei(i) / root(MSE(i))
Deleted Internally Studentized = ei(i) / root(MSE(i) * (1 - hii)) [SAS produces these values]
As with the studentized residuals above, these values would approximately follow a t distribution.

James P. Geaghan - Copyright 2011

Influence Diagnostics

DFFITS: an influence statistic that measures the difference in fits, as judged by the change in the predicted value when the point is omitted.
This is a standardized value and can be interpreted as a number of standard deviation units.
For small to medium size databases, the absolute value of DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).

DFBETAS: an influence statistic that measures the difference in betas, as judged by the change in the values of the regression coefficients when the point is omitted.
Note that this is also a standardized value.
For small to medium size databases, the absolute value of DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n).

Cook's D: an influence statistic (D is for distance).
It is related to the boundary of a simultaneous confidence region for all regression coefficients.
It does not follow an F distribution, but it is useful to compare it to the percentiles of the F distribution [F(1-alpha; p, n-p)], where a value below the 10th or 20th percentile shows little effect, while the 50th percentile is considered large.

Multicollinearity Diagnostics

VIF is related to the severity of multicollinearity. It relates to the variance of the standardized regression coefficient estimates, and would be expected to have a value of 1 if the regressors are uncorrelated.
If the mean of the VIF values is much greater than 2, serious problems are indicated. No single VIF should exceed 10.
Tolerance is the inverse of VIF, where Tolerancek = 1 - Rk2 = 1 / VIFk.

The Condition number (a multivariate evaluation)
Eigenvalues are extracted from the regressors. These are variances of linear combinations of the regressors, and go from larger to smaller. If one or more are zero (at the end), then the matrix is not full rank. These sum to p, and if the Xk are independent, each would equal 1.
The condition number is the square root of the ratio of the largest eigenvalue (always the first) to each of the others.
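The multicollinearity measures can be worked by hand when there are only two regressors, because Rk2 is then just the squared correlation between them. The Python sketch below does this under stated assumptions: the data are made up, the function names are hypothetical, and the eigenvalues are taken from the 2x2 correlation matrix of the regressors (which is what makes them sum to the number of regressors). SAS output may use a different scaling, so treat this as an illustration of the definitions rather than a reproduction of SAS results.

```python
import math


def pearson_r(x1, x2):
    """Sample correlation between two regressors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    return s12 / math.sqrt(s11 * s22)


def collinearity_two_regressors(x1, x2):
    """Tolerance, VIF and condition number for a model with two regressors."""
    r = pearson_r(x1, x2)
    # With two regressors, R_k^2 from regressing one on the other is r^2.
    tolerance = 1.0 - r ** 2
    vif = 1.0 / tolerance  # undefined if |r| = 1 (matrix not full rank)
    # Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first.
    eigenvalues = [1.0 + abs(r), 1.0 - abs(r)]
    condition_number = math.sqrt(eigenvalues[0] / eigenvalues[1])
    return {
        "r": r,
        "tolerance": tolerance,
        "vif": vif,
        "eigenvalues": eigenvalues,
        "condition_number": condition_number,
    }


# Hypothetical, moderately correlated regressors.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
stats = collinearity_two_regressors(x1, x2)

for name, value in stats.items():
    print(name, value)
```

Both regressors share the same VIF in the two-regressor case; with three or more, each Rk2 comes from its own regression of Xk on the remaining regressors.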
If the condition number exceeds 30, then multicollinearity may be a problem.

Model Evaluation and Validation

R2p, AdjR2p and MSEp can be used to graphically compare and evaluate models. The subscript p refers to the number of parameters in the model.

Mallows' Cp criterion
Use of this statistic presumes no bias in the full model MSE, so the full model should be carefully chosen to have little or no multicollinearity.
Cp criterion = (SSEp / TrueMSE) - (n - 2p)
The Cp statistic will be approximately equal to p if there is no bias in the regression model.

PRESSp criterion (PRESS = Prediction SS)
This criterion is based on deleted residuals. There are n deleted residuals in each regression, and PRESSp is the SS of deleted residuals.
This value should approximately equal the SSE if predictions are good; it will get larger as predictions are poorer.
PRESS values may be plotted, and the models with smaller PRESS statistics represent better predictive models.
This statistic can also be used for model validation.
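To make the Cp and PRESS definitions concrete, the following Python sketch computes both for a simple linear regression. PRESS is obtained the slow but transparent way, by refitting the line n times with one observation left out each time. The data and all function names are assumptions made for this example, and the full-model MSE is simply taken from the same one-regressor fit so that the Cp calculation can be checked by hand.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def sse(x, y):
    """Error sum of squares from the fit to all observations."""
    b0, b1 = fit_line(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def press(x, y):
    """Sum of squared deleted residuals (one refit per omitted observation)."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        deleted_residual = y[i] - (b0 + b1 * x[i])
        total += deleted_residual ** 2
    return total


def mallows_cp(sse_p, mse_full, n, p):
    """Cp = SSEp / MSE(full model) - (n - 2p)."""
    return sse_p / mse_full - (n - 2 * p)


# Hypothetical data, close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n, p = len(x), 2  # p counts the intercept and the slope

sse_value = sse(x, y)
press_value = press(x, y)
# Treating this model as the full model, Cp reduces to p.
cp_value = mallows_cp(sse_value, sse_value / (n - p), n, p)

print("SSE  =", sse_value)
print("PRESS =", press_value)
print("Cp   =", cp_value)
```

PRESS can never be smaller than SSE, since each deleted residual is at least as large in magnitude as the ordinary residual; how far PRESS sits above SSE is what carries the information about predictive ability.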

This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang,j during the Fall '08 term at LSU.
