EXST 7015 Fall 2011 Lecture 12 - Statistical Techniques II
James P. Geaghan - Copyright 2011

On the other hand, you may not be interested in all of the variables. You may not know which variables are important, and your objective may be to determine which ones are. In this case you may want to keep and discuss all of the variables, or you may want to select the important variables and present them in a smaller (reduced) model with only significant variables.

Our second model is more likely to be of this type. There are 8 variables, and most likely not all are important. Our objective here is probably to determine which ones are correlated with the dependent variable. We may start with a full model, but will then probably reduce the model to some subset of significant variables.

A larger multiple variable example (Appendix 8): from Neter, Kutner, Nachtsheim and Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co. It is a sample of various hospitals. Data were taken for the "Study of the Efficacy of Nosocomial Infection Control" (SENIC Project). Each observation is for one hospital. There were 113 observations. The variables are as follows:

- Identification number
- Average length of stay of all patients (in days)
- Average age of all patients (years)
- Average infection risk for the hospital (in percent); this was taken as the dependent variable
- Routine culture ratio, for patients without symptoms relative to patients with symptoms
- Routine chest X-ray ratio, for patients with and without symptoms of pneumonia
- Average number of beds (during the study)
- Med school affiliation (1 = Yes, 2 = No)
- Region (NE = 1, NC = 2, S = 3, W = 4)
- Average number of patients during the study
- Average number of nurses
- Percent of 35 potential service facilities that were actually provided by the hospital

Infection risk was used as the dependent variable. Eight other variables were considered independent variables. The two class (categorical, group) variables, med school affiliation and region, were omitted from the analysis. Identification number was not used.

See the computer output. Selected output was omitted as trivial or not of interest.

Interpretation and evaluation

The PROC REG statement is given on the second page. Only one new option is included here, the COLLIN option on the MODEL statement. We have seen all of the other output down through PROC UNIVARIATE; PROC GLM is excluded. The stepwise regressions at the end of the output are new.

So, what are we looking for in this regression?

- Which variables are important, if any?
- Are there problems with multicollinearity, and what do you do about it anyway?
- Are there any problems with the observations? Yi variable outliers? Xi variable "outliers"? Influential observations?
- Are the assumptions for linear regression met? Normality? Independence? Homogeneity of variance? Are the Xi variables measured without error?
- And what are our objectives? If we want to find the best model for predicting infection risk, do we need all 8 variables? How do we go about removing the ones we don't want?

See the computer output. Some output has been deleted to save space. This is information that I considered very simple, or information that we will not cover this semester:

- Descriptive Statistics
- Uncorrected SS and Crossproducts
- Correlation section
- Model Crossproducts X'X, X'Y, Y'Y
- X'X Inverse, Parameter Estimates, SSE (matrix solution)

We did not cover matrix algebra in detail, but I expect you to know something about these.
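A minimal sketch of what that PROC REG call might look like is given below. The predictor names are those that appear in the VIF table later in these notes; the data set name (senic) and the dependent variable name (infrisk) are assumptions made here for illustration, since the actual program appears only in the course output.

   /* Sketch of the regression run described above. The data set name and   */
   /* the dependent variable name are assumed; the predictors are those     */
   /* listed in the VIF table later in these notes.                         */
   proc reg data=senic;
     model infrisk = ltofstay age culratio xray nobeds census nurses services
           / collin;   /* COLLIN requests the collinearity diagnostics */
   run;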
As with the SLR, we need all SS and CP from the variables to do a multiple regression. I expect you to know where these are. The Variance-Covariance matrix is key to most variance calculations. Know where this is and where it comes from.

Standardized regression coefficients (used extensively in some disciplines)

In your discipline, take note of the statistics presented in the literature. Standardization is sometimes used to put variables on the same scale. For example, if our slope (Y units per X unit) is meaningful in terms of the original scale (e.g. mg phosphorus available per mg in the soil), we may want to keep the original scale for interpretive purposes. However, in other cases our scales may be arbitrary. For example, if we are trying to predict a freshman's first-semester college performance (scale 0 to 4) from SAT verbal (scale 200 to 800), ACT (scale 0 to 36) and high school GPA (scale 0 to 4), then the arbitrary scales may confuse and complicate the study. We could "standardize" the 4 variables so that all have mean = 0 and variance = 1. Since the original scales are arbitrary, we lose little by doing this. The resulting regression would have regression coefficients that are without scale, and whose relative size would give an indication of the relative importance or relative impact of each variable in determining the predicted value. So, the standardized regression coefficients are relative measurements and have no units.

We might be interested in the standardized regression coefficients. We saw earlier that the LTOFSTAY variable had the largest regression coefficient, but not the largest t value. Note that the standardized regression coefficients match the t-value pattern for the 3 most significant variables, while the raw regression coefficients do not. However, the standardized regression coefficients do not match the t-value pattern for all variables! XRAY was the third significant variable, with the third largest t value, but three variables that were not significant (smaller t values) had larger absolute values for the standardized regression coefficients (NOBEDS, NURSES and SERVICES). What's up? More later.
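As a side note, a minimal sketch of how this kind of standardization could be done in SAS is shown below. The data set and variable names (grades, gpa, sat_v, act, hs_gpa) are hypothetical, invented only for this illustration. PROC STANDARD rescales each listed variable to the requested mean and standard deviation; alternatively, the STB option on the PROC REG MODEL statement reports standardized regression coefficients without rescaling the data.

   /* Hypothetical example: put the arbitrarily scaled variables on a       */
   /* common scale (mean = 0, standard deviation = 1, hence variance = 1)   */
   /* before fitting the regression.                                        */
   proc standard data=grades mean=0 std=1 out=zgrades;
     var gpa sat_v act hs_gpa;
   run;

   proc reg data=zgrades;
     model gpa = sat_v act hs_gpa;   /* coefficients are now unitless */
   run;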
Partial R2

Recall that the R2 is the SSModel / SSTotal (both corrected). However, since we are not very interested in the overall model, could we get an R2-type statistic for each variable? Of course we could; in fact there are four. What would you imagine the individual R2 values to be? You might guess the Type I or Type II SS divided by the total. Congratulations, you just invented the "Squared semi-partial correlation Type I" and the "Squared semi-partial correlation Type II".

Squared semi-partial correlation Type I:  SCORR1 = SeqSSXj / SSTotal
Squared semi-partial correlation Type II: SCORR2 = PartialSSXj / SSTotal

But think about the Extra SS we talked about earlier. When variables go into a model they account for some of the SSTotal, and that fraction of the SS is not available to later variables, or it may even enhance the SS accounted for by later variables. Doesn't it seem that we should look at the SS available to a variable when we consider how much SS it accounts for? This is more in keeping with the concept of "Partial" SS we talked about earlier. So, when a variable (say X1) enters the model after the other variables, what SS are available to it? Obviously, the SS it accounted for was available (SSX1|X2,X3). And the SSError was also available, though not accounted for.

So, the SS available to each variable is the part it accounts for (SSXj | all other variables), plus the part no variable accounts for (SSError). If we use this as the available SS to be accounted for, instead of the SSTotal, we have the "Squared partial correlations". These are calculated as

Squared partial correlation Type I:  PCORR1 = SeqSSXj / (SeqSSXj + SSError*)
  * Note that for sequential SS the error changes as each variable enters. This must be taken into account.
Squared partial correlation Type II: PCORR2 = PartialSSXj / (PartialSSXj + SSError)

So how are these used? The interpretation is similar to that of the R2, except that it is a fraction for each variable. The ones that make the most sense to me are, for models using the Type I SS, the squared semi-partial correlation Type I, and for models using the Type II SS, the squared partial correlation Type II. Since the Type I SS sum to the SSReg, the semi-partial R2 Type I values will sum to the overall R2. Since the partial SS may sum to more or less than the SSReg, and the denominator is not the SSTotal, the sum of the partial R2 values is unpredictable.

Since we use the Type II SS for this problem, we would probably be interested in the squared partial correlation Type II. Note that these values follow a pattern more similar to the t values than the standardized regression coefficients do. The largest tend to match the significant variables, and in the same order. However, the match is not perfect across all variables.

So we have 3 statistics we can use to interpret the contribution or importance of the variables. I consider the t-test to be the best test; the standardized regression coefficients and partial R2-type values are ancillary statistics.

Variance Inflation Factor (VIF)

Now let's consider multicollinearity. Recall variance inflation. Is it possible that some other variable is important, but does not show up because it is in competition with another variable for the SS? Or that two multicollinear variables have inflated variances and do not have significant t-tests because of this? Let's see.

VIF values

Variable   DF  Tolerance    Variance Inflation
INTERCEP    1  .             0.00000000
LTOFSTAY    1  0.47122355    2.12213501
AGE         1  0.83280592    1.20075995
CULRATIO    1  0.67841753    1.47401851
XRAY        1  0.72771534    1.37416369
NOBEDS      1  0.03005773   33.26931229
CENSUS      1  0.02881692   34.70183289
NURSES      1  0.13774001    7.26005447
SERVICES    1  0.34228836    2.92151333

Two VIF values (NOBEDS and CENSUS) are greater than 10, a clear indication of problems. A third variable (NURSES) has a VIF greater than 7. I would consider this a possible problem as well. We have talked about this problem, but have not discussed what to do about it. So, what do we do?

Statistics quote: Figures don't lie, but liars figure. - Samuel Clemens (alias Mark Twain)
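Before turning to remedies, note that all of these statistics can be requested as options on the MODEL statement in PROC REG. A sketch is given below; as before, the data set name and the dependent variable name are assumptions, while the option keywords are the same ones used in the definitions above.

   /* Sketch: requesting the statistics discussed above.                    */
   /* STB           - standardized regression coefficients                  */
   /* SCORR1/SCORR2 - squared semi-partial correlations (Type I / Type II)  */
   /* PCORR1/PCORR2 - squared partial correlations (Type I / Type II)       */
   /* TOL, VIF      - tolerance and variance inflation factor               */
   proc reg data=senic;
     model infrisk = ltofstay age culratio xray nobeds census nurses services
           / stb scorr1 scorr2 pcorr1 pcorr2 tol vif;
   run;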
Addressing Multicollinearity issues

First, we know that with multicollinearity problems we have potentially wide fluctuations in the regression coefficients, and that we have inflated variances. Since the problems are caused by two or more correlated variables, one obvious solution is to leave off some of the variables that are correlated. This is often the easiest and best solution. Fit a reduced model that has some subset of the original variables and in which the correlated variables are reduced.

A second solution is to get more data. Sometimes additional data will bring out the differences in the variables and reduce the correlation.

A third solution is "Ridge Regression". This is an interesting regression, an example of where a statistician may seek biased estimates in order to reduce variance. However, it can be hard to interpret the results or to decide exactly how much bias is needed. I consider it a last resort, but one I have used. We will not discuss this option in detail.

We will apply the first and best option to this example (reducing the number of variables). When we reduce a model like this, we do it one variable at a time. We will use "Stepwise regression" to do this. This section will come when we finish with the rest of Example 2 in PROC REG.

Variance-Covariance Matrix

This section was deleted, but it is simply the X'X inverse matrix multiplied by the MSE. It is available in the SAS program output. You are responsible for knowing what is in this matrix and how it relates to the parameter estimate variances.

Correlation of Estimates (deleted): you are not responsible for these.

Sequential Parameter Estimates

Recall that when multicollinearity exists, regression coefficients can fluctuate greatly. In this section we look to see if there are large fluctuations, or if the parameter estimates are stable. The Sequential Parameter Estimates provide an additional indication of problems, and some indication of which variables are affected by which others.

Collinearity Diagnostics

These are produced by the COLLIN option on the MODEL statement. If the condition index (ratio) is greater than about 30, multicollinearity is indicated.

Multicollinearity summary

Collinearity Diagnostics are a very good tool to evaluate multicollinearity, perhaps the best. VIF is also a very good tool, and is the most popular. Simple correlations are good, but may miss higher order correlations ("multi"). Sequential bi values are useful, but somewhat subjective.

You are not responsible for the sections titled Consistent Covariance of Estimates (deleted) and Test of First and Second Moment Specification (deleted).

Observation Diagnostics

The first columns are the value of Yi and the predicted value of Yi. You are responsible for understanding these, along with the residual (the difference between these two values). These have not changed from SLR. You are not responsible for the Std Err Predict or the Std Err Residual; these are estimates of standard deviations and have been adjusted by the hii values.

You are responsible for the confidence intervals, the Upper and Lower 95% MEAN and the Upper and Lower 95% PREDICT limits. These are confidence intervals for the regression line (Ŷi) and for individual points (Yi), respectively. Recall that for simple linear regression

$$Y_i = b_0 + b_1 X_i + e_i, \qquad Y_i = \hat{Y}_i + e_i .$$

The variance for $\hat{Y}_i$ (the regression line) is

$$S^2_{\hat{Y}|X} = \mathrm{MSE}\left(\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right),$$

and the variance for an individual observation, $Y_i = \hat{Y}_i + e_i$, is

$$S^2_{Y|X} = \mathrm{MSE} + \mathrm{MSE}\left(\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right) = \mathrm{MSE}\left(1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right).$$

You are responsible for:

- the studentized residual, and perhaps more importantly the deleted studentized residual (RSTUDENT);
- the hat diag values (hii);
- the remaining 3 diagnostics of interest, the influence diagnostics (DFFITS, DFBETAS and Cook's D).

You are NOT responsible for the column titled Cov Ratio.
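A hedged sketch of how these diagnostics and intervals are typically requested in PROC REG is shown below. The exact options used for the course output are not shown in this excerpt, and the data set and variable names repeat the assumptions made earlier.

   /* Sketch: requesting the observation diagnostics discussed above.        */
   /* R         - residual analysis (studentized residuals, Cook's D)        */
   /* INFLUENCE - hat diagonal (hii), RSTUDENT, DFFITS, DFBETAS, COVRATIO    */
   /* CLM / CLI - 95% limits for the mean (regression line) and for          */
   /*             individual predicted values, respectively                  */
   proc reg data=senic;
     model infrisk = ltofstay age culratio xray nobeds census nurses services
           / r influence clm cli;
     output out=diag p=yhat r=resid rstudent=rstud h=hii cookd=cooksd
            dffits=dffits lclm=low_mean uclm=up_mean lcl=low_pred ucl=up_pred;
   run;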
Partial Residual Plots

These are "scatter plots" of the Y variable adjusted for all Xi except one, plotted against that Xi adjusted for all other Xi. I used these to get across the concept that not only are the Yi adjusted for each Xi, but the Xi are also adjusted for each other. Beyond this, these are used more like "scatter plots" than "residual plots". We can look for curvature, nonhomogeneous variance, etc. If they appear to represent random scatter about zero, it is because the variable does not contribute anything to the model, not because it is a "residual plot".
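Plots of this kind can be requested with the PARTIAL option on the MODEL statement in PROC REG (SAS documents these as partial regression leverage plots, which match the description above: both Y and each Xi are adjusted for the remaining regressors). Whether this is exactly how the plots in the handout were produced is an assumption, and the data set and variable names are the same assumptions used earlier.

   /* Sketch (assumed): request one adjusted-variable plot per regressor. */
   proc reg data=senic;
     model infrisk = ltofstay age culratio xray nobeds census nurses services
           / partial;
   run;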