EXST7015 Fall 2011 Lecture 11 - Statistical Techniques II
Statistical Techniques II Page 44

Regression in GLM

PROC GLM and PROC MIXED do regression, but they do not have all of the regression diagnostics available in PROC REG. However, they do have a few advantages:
- They facilitate the inclusion of class variables (something we will be interested in later), and
- They provide tests of both Type I and Type III SS (as well as Types II and IV).

The formatting is different, but most of the same information is available. Tests of both SS1 and SS3 are given by default. Note that the Type II and Type III SS are the same as in PROC REG (recall the extra SS), but here tests are provided. These F values are calculated by dividing each SS (Sequential or Partial) by the MSE. Also note that the t-tests of the parameter estimates are the same as the tests of the Partial SS.

More material and a summarization of multiple regression will be done with the second example.

Multicollinearity

An important consideration in multiple regression is the effect of correlation among the independent variables. A problem exists when two independent variables are very highly correlated; this problem is called multicollinearity. At one extreme of this phenomenon, two independent variables are perfectly correlated. This results in "singularity", and an X'X matrix that cannot be inverted. To illustrate the problem, take the following data set.

Y    X1    X2
1     1     2
2     2     3
3     3     4

If entered in PROC REG, SAS will report problems and will fit only the first variable, since the second one is perfectly correlated with it. Suppose we did want to fit parameters for both X1 and X2; what bi values could we get? The table below shows some acceptable values of b0, b1 and b2 in the model Yi = b0 + b1*X1i + b2*X2i + ei.
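The singularity can be verified numerically. The sketch below is a Python/NumPy illustration (not SAS output) using the toy data above: X'X has rank 2 rather than 3, and the least-squares solver returns just one member of the infinite family of solutions satisfying b0 + b2 = 0 and b1 + b2 = 1.

```python
import numpy as np

# Toy data from the text: X2 = X1 + 1, so the columns are perfectly
# correlated and X'X (with an intercept column) is singular.
y = np.array([1.0, 2.0, 3.0])
X = np.column_stack([np.ones(3), [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3: X'X cannot be inverted

# lstsq still returns *a* solution (the minimum-norm one), but it is only
# one member of the infinite family with b0 + b2 = 0 and b1 + b2 = 1.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b[0] + b[2], b[1] + b[2])  # 0 and 1 (up to rounding)
```

Any other (b0, b1, b2) satisfying those two constraints fits the data exactly as well, which is why no program can, or should, pick one.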
b0          b1          b2
0           1           0
–1          0           1
99          100         –99
999         1000        –999
–101        –100        101
–1001       –1000       1001
–1000001    –1000000    1000001

There are an infinite number of solutions when singularity exists, and that is why no program can, or should, fit the parameter estimates. But suppose that I added the value 0.0000000001 to one of the Xi observations.

James P. Geaghan - Copyright 2011

Now the two independent variables are not PERFECTLY correlated! SAS will report no error and will give a solution. How good is that solution? Remember how the bi values could go way up or way down as long as they were balanced by the others?

b0       b1       b2
0        1        0
–1       0        1
99       100      –99
999      1000     –999
–101     –100     101
–1001    –1000    1001

Typically, when very high correlations exist (but NOT perfect correlations), small changes in the data result in large fluctuations of the regression coefficients. Basically, under these conditions, the regression coefficient estimates are useless. The variance estimates are also inflated.

So how do we detect these problems? First, look at the simple correlations among the Xi variables produced by PROC REG in the summary statistics section. For the Phosphorus example:

CORR    X1       X2       X3
X1      1.0000   0.4616   0.1520
X2      0.4616   1.0000   0.3175
X3      0.1520   0.3175   1.0000
Y       0.6934   0.3545   0.3617

Large correlations (usually > 0.9) can indicate potential multicollinearity problems. However, these statistics alone are not enough to detect multicollinearity. It is possible that there is no pairwise correlation, but that some combination of Xi variables correlates with some other combination. So we need another statistic to address this. The Variance Inflation Factor (VIF) is the statistic most commonly used to detect this problem. For the Phosphorus example:
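The VIF can be computed from its definition, VIFj = 1 / (1 - Rj2), where Rj2 comes from regressing Xj on all the other independent variables. Here is a minimal NumPy sketch of that definition; the data are made up (the phosphorus values are not reproduced here), and the function name vif is mine, not a SAS option.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        yj = X[:, j]
        Xo = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        bj, *_ = np.linalg.lstsq(Xo, yj, rcond=None)
        resid = yj - Xo @ bj
        r2 = 1 - (resid @ resid) / ((yj - yj.mean()) @ (yj - yj.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical predictors: x3 is nearly a linear combination of x1 and
# x2, so all three VIF values come out large even though no single
# pairwise correlation need be extreme.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
x3 = x1 + x2 + rng.normal(scale=0.05, size=50)
X = np.column_stack([x1, x2, x3])
print(vif(X))
```

This is exactly the "combination of Xi variables" situation described above: each pairwise correlation can look unremarkable while the VIFs reveal the near-dependency.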
Variable    Tolerance     Variance Inflation
INTERCEP    .             0.00000000
X1          0.78692352    1.27077152
X2          0.72432171    1.38060199
X3          0.89915421    1.11215627

VIF values over 5 or 10, or a mean of the VIF values much over 2, indicate potential problems with multicollinearity. Tolerance is just the inverse of the VIF, so as the VIF goes up, Tolerance goes down. Both can be used to detect multicollinearity; we will ignore Tolerance.

Another criterion for the evaluation of multicollinearity is the set of "Collinearity Diagnostics". The value examined is the "condition number" or "condition index". This criterion does not provide a P-value. The last (largest) condition index is examined, and values of 30 to 40 are considered to indicate probable multicollinearity.

Collinearity Diagnostics

Number   Eigenvalue   Condition Index
1        7.92221       1.00000
2        0.70667       3.34824
3        0.23449       5.81248
4        0.04727      12.94588
5        0.03554      14.93049
6        0.02878      16.59008
7        0.01657      21.86586
8        0.00546      38.09407
9        0.00301      51.29072

Proportion of Variation (remaining columns not shown):

Number   Intercept     LtofStay      Age           CulRatio      XRay
1        0.00008203    0.00026608    0.00008384    0.00241       0.00055647
2        0.00062369    0.00113       0.00066774    0.02105       0.00504
3        0.00150       0.00158       0.00222       0.67694       0.00076682
4        0.00082735    0.04556       0.00039099    0.00635       0.00030739
5        0.00123       0.00104       0.00130       0.09639       0.43090
6        0.01898       0.01720       0.02926       0.06172       0.48518
7        0.03592       0.60385       0.02075       0.05279       0.05646
8        0.03300       0.27836       0.00195       0.01730       0.00245
9        0.90784       0.05102       0.94338       0.06505       0.01834

Multiple Regression Variable Diagnostics (See Appendix 1)

So, multiple regression differs from the SLR in that it has several variables. We need new statistics to examine the parameter estimates of these variables, and to determine whether there are problems among the variables. I will collectively refer to these as the "variable diagnostics"; these will be covered in the next section.
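The condition indexes come from the eigenvalues: the k-th index is sqrt(lambda_max / lambda_k), so the first index is always 1 and the last is the largest. The sketch below is a Python illustration with made-up data; as I understand SAS's COLLIN option, each column of X (including the intercept column) is scaled to unit length before the eigenvalues of X'X are taken, and the scaling here follows that convention.

```python
import numpy as np

def condition_indexes(X):
    """Eigenvalues and condition indexes of X'X after scaling each
    column of X (intercept included) to unit length."""
    Xs = X / np.sqrt((X ** 2).sum(axis=0))
    lam = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]  # descending order
    return lam, np.sqrt(lam[0] / lam)

# Hypothetical data: two nearly collinear predictors produce one tiny
# eigenvalue and hence a very large final condition index.
rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.01, size=40)
X = np.column_stack([np.ones(40), x1, x2])
lam, ci = condition_indexes(X)
print(ci)  # last index far above the 30-40 warning range
```

As in the SENIC output above, it is the last index (here well past 40) that signals probable multicollinearity.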
Recall that multicollinearity can cause the regression coefficients to fluctuate greatly. Examining the Sequential Parameter Estimates for large fluctuations as variables enter is another indicator of multicollinearity.

Sequential Parameter Estimates
INTERCEP       X1             X2             X3
81.277777778   0              0              0
59.258958792   1.8434360081   0              0
56.251024085   1.7897741162   0.08664925     0
43.652197791   1.7847796802   -0.083397057   0.161132691

Before looking at output, let's consider our objectives. Objectives can vary. You are probably interested in testing for relationships (actually "partial" correlations) between the various independent variables and the dependent variable.

On the one hand, you may be interested in the effect of each and every variable included in the analysis. If this is the case, you are probably particularly interested in the parameter estimates, and you will probably want confidence intervals on these parameter estimates, or tests of them against certain hypothesized values. Since you are interested in all of the variables, you are probably not interested in removing any variables from the model. Our first example may have been this type of analysis: we wanted to examine the effect of 3 soil phosphorus components, and would present the results for all 3, even though only one was significant, because all 3 are of interest.

On the other hand, you may not be interested in all of the variables. You may not know which variables are important, and your objective may be to determine which ones are. In this case you may want to keep and discuss all of the variables, or you may want to select the important variables and present them in a smaller (reduced) model with only significant variables. Our second model is more likely to be of this type: there are 8 variables, and most likely not all are important.
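A Sequential Parameter Estimates table like the one above can be reproduced conceptually by refitting the model as each variable enters. This sketch uses made-up collinear data (not the phosphorus data); with two highly correlated predictors, the coefficient of the first one swings from positive to negative the moment the second one enters.

```python
import numpy as np

def sequential_estimates(X, y):
    """Refit the model as each predictor enters, in order. Large swings
    in earlier coefficients as later variables enter suggest
    multicollinearity. Rows: fits; columns: intercept then X1..Xp."""
    n, p = X.shape
    rows = []
    for k in range(1, p + 1):
        Xk = np.column_stack([np.ones(n), X[:, :k]])
        bk, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rows.append(np.concatenate([bk, np.zeros(p - k)]))
    return np.array(rows)

# Hypothetical collinear pair: alone, x1 picks up x2's effect and gets a
# positive slope; once x2 enters, x1's slope drops to its (negative)
# partial value.
rng = np.random.default_rng(3)
x1 = rng.normal(size=60)
x2 = x1 + rng.normal(scale=0.1, size=60)
y = 2 - 3 * x1 + 4 * x2 + rng.normal(scale=0.5, size=60)
tbl = sequential_estimates(np.column_stack([x1, x2]), y)
print(tbl)
```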
Our objective here is probably to determine which ones are correlated with the dependent variable. We may start with a full model, but will then probably reduce the model to some subset of significant variables.

A larger multiple variable example (Appendix 8): from Neter, Kutner, Nachtsheim and Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co. It is a sample of various hospitals; the data were taken for the "Study of the Efficacy of Nosocomial Infection Control" (SENIC Project). Each observation is for one hospital, and there were 113 observations. The variables are as follows:

- Identification number
- Average length of stay of all patients (in days)
- Average age of all patients (years)
- Average infection risk for the hospital (in percent); this was taken as the dependent variable
- Routine culture ratio: ratio of cultures performed on patients without symptoms to patients with symptoms
- Routine chest X-ray ratio for patients with and without symptoms of pneumonia
- Average number of beds (during the study)
- Med school affiliation (1 = Yes, 2 = No)
- Region (NE = 1, NC = 2, S = 3, W = 4)
- Average number of patients during the study
- Average number of nurses
- Percent of 35 potential service facilities that were actually provided by the hospital

Infection risk was used as the dependent variable. Eight other variables were considered independent variables. The two class (categorical, group) variables (Med school affiliation and Region) were omitted from the analysis, and the identification number was not used.

See the computer output. Selected output was omitted as trivial or not of interest.

Interpretation and evaluation

The PROC REG statement is given on the second page. Only one new option is included here: the COLLIN option on the MODEL statement. We have seen all of the other output down through the PROC UNIVARIATE; PROC GLM is excluded. The Stepwise regressions at the end of the output are new.

So, what are we looking for in this regression?
Which variables are important, if any? Are there problems with multicollinearity, and what do you do about it anyway? Are there any problems with the observations?
- Yi variable outliers?
- Xi variable "outliers"?
- Influential observations?
Are the assumptions for linear regression met?
- Normality?
- Independence?
- Homogeneity of variance?
- Are the Xi variables measured without error?
And what are our objectives? If we want to find the best model for predicting infection risk, do we need all 8 variables? How do we go about removing the ones we don't want?

See the computer output. Some output has been deleted to save space. This is information that I considered very simple, or information that we will not cover this semester:
- Descriptive Statistics
- Uncorrected SS and Crossproducts
- Correlation section
- Model Crossproducts X'X X'Y Y'Y
- X'X Inverse, Parameter Estimates, SSE
- Matrix solution

We did not cover matrix algebra in detail, but I expect you to know something about these. As with the SLR, we need all SS and CP from the variables to do a multiple regression; I expect you to know where these are. The Variance-Covariance matrix is key to most variance calculations. Know where this is and where it comes from.

Standardized regression coefficients (used extensively in some disciplines). In your discipline, take note of the statistics presented in the literature. Standardization is sometimes used to put variables on the same scale. For example, if our slope (Y units per X unit) is meaningful in terms of the original scale (e.g. mg phosphorus available per mg in the soil), we may want to keep the original scale for interpretive purposes. However, in other cases our scales may be arbitrary. For example, if we are trying to predict a freshman's first semester college performance (scale 0 to 4) from the SAT verbal (scale 200 to 800), the ACT (scale 0 to 36) and high school GPA (scale 0 to 4), then the arbitrary scales may confuse and complicate the study.
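Standardizing can be sketched as follows: fit the regression after z-scoring every variable (equivalently, multiply each raw slope by s_x / s_y). The data here are hypothetical stand-ins on deliberately different scales, not the SENIC or GPA data.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Fit on z-scored variables (mean 0, variance 1); the slopes are
    then unitless and comparable across predictors. Equivalently,
    b_std_j = b_j * s_xj / s_y."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    z = (y - y.mean()) / y.std(ddof=1)
    b, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(z)), Z]), z,
                            rcond=None)
    return b[1:]  # the intercept of the standardized fit is ~0

# Hypothetical variables on very different scales: the raw slopes differ
# by two orders of magnitude, yet the two variables contribute equally,
# and the standardized slopes come out nearly equal.
rng = np.random.default_rng(4)
x1 = rng.normal(scale=1.0, size=80)    # e.g. a GPA-like scale
x2 = rng.normal(scale=100.0, size=80)  # e.g. an SAT-like scale
y = 3.0 * x1 + 0.03 * x2 + rng.normal(size=80)
bs = standardized_coefficients(np.column_stack([x1, x2]), y)
print(bs)
```

Comparing raw slopes here (3.0 vs 0.03) would wrongly suggest x1 matters 100 times more; the standardized slopes show the two effects are the same size.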
We could "standardize" the 4 variables so that all have mean = 0 and variance = 1. Since the original scales are arbitrary, we lose little by doing this. The resulting regression would have regression coefficients that are without scale, and whose "relative size" would give an indication of the "relative importance" or "relative impact" of each variable in determining the predicted value. So, the standardized regression coefficients are relative measurements and have no units.

We might be interested in the standardized regression coefficients. We saw earlier that the LtofStay variable had the largest regression coefficient, but not the largest t value. Note that the standardized regression coefficients match the t-value pattern for the 3 most significant variables, while the raw regression coefficients do not. However, the standardized regression coefficients do not match the t-value pattern for all variables! XRay was the third significant variable, with the third largest t value, but three variables that were not significant (smaller t values) had larger absolute values for the standardized regression coefficients (NOBEDS, NURSES and SERVICES). What's up? More later.

Partial R2

Recall that R2 is SSModel / SSTotal (both corrected). However, since we may be less interested in the overall model than in the individual variables, could we get an R2-type statistic for each variable? Of course we could; in fact there are four. What would you imagine the individual R2 values to be? You might guess the Type I or Type II SS divided by the total. Congratulations, you just invented the "Squared semi-partial correlation Type I" and the "Squared semi-partial correlation Type II".

Squared semi-partial correlation, Type I:  SCORR1 = SeqSSXj / SSTotal
Squared semi-partial correlation, Type II: SCORR2 = PartialSSXj / SSTotal

But think about the Extra SS we talked about earlier.
When variables enter a model they account for some of the SSTotal, and that fraction of the SS is not available to later variables (or it may even enhance the SS accounted for by later variables). Doesn't it seem that we should look at the SS available to a variable when we consider how much SS it accounts for? This is more in keeping with the concept of "Partial" SS we talked about earlier.

So, when a variable (say X1) enters the model after the other variables, what SS was available to it? Obviously, the SS it accounted for was available (SSX1|X2,X3). The SSError was also available, though not accounted for. So, the SS available to each variable is the part it accounts for (SSXj|all other variables) plus the part no variable accounts for (SSError). If we use this as the available SS to be accounted for, instead of the SSTotal, we have the "Squared partial correlations". These are calculated as:

Squared partial correlation, Type I:  PCORR1 = SeqSSXj / (SeqSSXj + SSError*)
  * Note that for sequential SS the error changes as each variable enters. This must be taken into account.
Squared partial correlation, Type II: PCORR2 = PartialSSXj / (PartialSSXj + SSError)

So how are these used? The interpretation is similar to that of the R2, except that it is a fraction for each variable.
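All four per-variable statistics can be computed directly from full and reduced model fits, which is what the formulas above amount to. This Python sketch uses simulated data, and the helper name variable_r2 is mine (in SAS these would come from the SCORR1, SCORR2, PCORR1 and PCORR2 options on the MODEL statement).

```python
import numpy as np

def variable_r2(X, y):
    """The four R2-type statistics from the text:
    SCORR1_j = SeqSS_j / SSTotal        SCORR2_j = PartialSS_j / SSTotal
    PCORR1_j = SeqSS_j / (SeqSS_j + SSE after X_j enters)
    PCORR2_j = PartialSS_j / (PartialSS_j + SSE of the full model).
    All SS are obtained by fitting full and reduced models."""
    n, p = X.shape

    def sse(cols):
        Xc = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        r = y - Xc @ b
        return r @ r

    sstot = ((y - y.mean()) ** 2).sum()
    full = sse(range(p))
    seq = np.array([sse(range(j)) - sse(range(j + 1)) for j in range(p)])
    part = np.array([sse([c for c in range(p) if c != j]) - full
                     for j in range(p)])
    sse_after = np.array([sse(range(j + 1)) for j in range(p)])
    return {"SCORR1": seq / sstot,
            "SCORR2": part / sstot,
            "PCORR1": seq / (seq + sse_after),
            "PCORR2": part / (part + full)}

# Hypothetical data: X1 drives y strongly, X2 weakly, X3 is pure noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = 1 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=60)
stats = variable_r2(X, y)
for k, v in stats.items():
    print(k, np.round(v, 3))
```

Note that the Type I semi-partials telescope, so they sum exactly to the overall model R2, and the noise variable X3 gets a value near zero under every definition.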
