EXST7015 Fall2011 Lect10

EXST7015 Fall2011 Lect10 - Statistical Techniques II

In summary:

    SSX1 = 2381.31494
    SSX2 = 1446.14793
    SSX1|X2 = 3077.286793
    SSX2|X1 = 2142.119793

In this case the variables actually enhanced each other, performing better together than alone. Although not the rule, this is not unusual.

3 factor model

The calculation of extra SS is exactly the same for larger models. The following example is a 3 factor multiple regression. In SAS this model would be

    PROC REG; MODEL Y = X1 X2 X3;

The raw data are given below.

    Obs    Y   X1   X2   X3
     1     1    2    9    2
     2     3    4    6    5
     3     5    7    7    9
     4     3    3    5    5
     5     6    5    8    9
     6     4    3    4    2
     7     2    2    3    6
     8     8    6    2    1
     9     9    7    5    3
    10     3    8    2    4
    11     5    7    3    7
    12     6    9    1    4

The results of the regressions follow; SSTotal is 62.91667 for all models.

For the 1 factor models:

    Regression of Y on X1: SSError = 38.939, SSModel = 23.978
    Regression of Y on X2: SSError = 58.801, SSModel = 4.115
    Regression of Y on X3: SSError = 62.680, SSModel = 0.237

The extra SS are equal to the model SS for 1 factor models.

    SSX1 = 23.978  (or SSX1|X0)
    SSX2 = 4.115   (or SSX2|X0)
    SSX3 = 0.237   (or SSX3|X0)

These SS are adjusted for the intercept (correction factor). This will always be the case for our examples, so the X0 is often omitted.

Fitting X1, X2 and X3 together TWO AT A TIME we get the following results.

    Regression of Y on X1 and X2: SSError = 38.842, SSModel = 24.074
    Regression of Y on X1 and X3: SSError = 37.546, SSModel = 25.371
    Regression of Y on X2 and X3: SSError = 58.779, SSModel = 4.137

Now let's get the extra SS for variables fitted together.

    SSX1 alone = 23.978
    SSX2 alone = 4.115
    SSX1 and X2 together = 24.074

Again, look for the improvement in the model due to the second variable. Calculate how much each variable adds to a model with the other variable already in the model. For the model with X1 and X2, subtract the amount accounted for by each variable alone from the amount together.

James P. Geaghan - Copyright 2011
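These model SS can be reproduced with any least-squares routine. The sketch below (Python/NumPy, not part of the course's SAS code; the function name is my own) refits the one and two factor models from the raw data and recovers the model SS, corrected for the intercept, as SSTotal minus SSError.

```python
import numpy as np

# Raw data from the 3-factor example (12 observations)
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

def model_ss(*xs):
    """SSModel (corrected for the intercept) for a regression of Y on the given X's."""
    X = np.column_stack([np.ones_like(Y)] + list(xs))
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sse = float(np.sum((Y - X @ b) ** 2))
    sstotal = float(np.sum((Y - Y.mean()) ** 2))  # 62.91667 for this data
    return sstotal - sse

ss_x1 = model_ss(X1)        # 23.978
ss_x2 = model_ss(X2)        #  4.115
ss_x3 = model_ss(X3)        #  0.237
ss_x1x2 = model_ss(X1, X2)  # 24.074; ss_x1x2 - ss_x1 is SSX2|X1, about 0.096
```

The same function with other variable combinations reproduces the remaining SS in the tables that follow.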
Start with X1 (SSX1 = 23.978), add X2 (SSX1,X2 = 24.074). The improvement with X2 is 24.074 - 23.978 = 0.096 = SSX2|X1.
Start with X2 (SSX2 = 4.115), add X1 (SSX1,X2 = 24.074). The improvement is 24.074 - 4.115 = 19.959 = SSX1|X2.
So SSX1|X2 = 19.959, and SSX2|X1 = 0.096.

Likewise for the model with X1 and X3.
Start with X1 (SSX1 = 23.978), add X3 (SSX1,X3 = 25.371). The improvement is 25.371 - 23.978 = 1.393 = SSX3|X1.
Start with X3 (SSX3 = 0.237), add X1 (SSX1,X3 = 25.371). The improvement is 25.371 - 0.237 = 25.134 = SSX1|X3.
So SSX1|X3 = 25.134, and SSX3|X1 = 1.393.

Likewise for the model with X2 and X3.
Start with X2 (SSX2 = 4.115), add X3 (SSX2,X3 = 4.137). The improvement is 4.137 - 4.115 = 0.022 = SSX3|X2.
Start with X3 (SSX3 = 0.237), add X2 (SSX2,X3 = 4.137). The improvement is 4.137 - 0.237 = 3.900 = SSX2|X3.
So SSX2|X3 = 3.900, and SSX3|X2 = 0.022.

Finally, and most important, for all 3 variables in the model: how much does each variable improve the model over a model with the other two variables present in the model?

Start with the full model, SSX1,X2,X3 = 26.190.

    SSX1,X2 = 24.074;  SSX3|X1,X2 = 2.116
    SSX1,X3 = 25.371;  SSX2|X1,X3 = 0.819
    SSX2,X3 = 4.137;   SSX1|X2,X3 = 22.053

Summarizing the extra SS calculations:

    Extra SS       SS       d.f. Error   Error SS
    SSX1           23.978   10           38.939
    SSX2            4.115   10           58.802
    SSX3            0.237   10           62.680
    SSX1|X2        19.959    9           38.843
    SSX2|X1         0.096    9           38.843
    SSX1|X3        25.134    9           37.546
    SSX3|X1         1.393    9           37.546
    SSX2|X3         3.900    9           58.780
    SSX3|X2         0.022    9           58.780
    SSX1|X2,X3     22.053    8           36.727
    SSX2|X1,X3      0.819    8           36.727
    SSX3|X1,X2      2.116    8           36.727

Note that all extra SS are also corrected for the intercept.

More extra SS

A final note on extra SS. It is also useful to be able to express SS for two or more variables with two or more degrees of freedom as extra SS. For example, the SS due to X1 and X2 together (adjusted only for the intercept) is SSX1,X2. These extra SS can be obtained directly from the two factor models fitted in SAS.
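The bookkeeping above is nothing more than subtraction of model SS. A small illustrative Python helper (hypothetical, using the model SS reported in the text) makes the pattern explicit:

```python
# Model SS from the fitted regressions (values copied from the text above)
ss = {
    ("X1",): 23.978, ("X2",): 4.115, ("X3",): 0.237,
    ("X1", "X2"): 24.074, ("X1", "X3"): 25.371, ("X2", "X3"): 4.137,
    ("X1", "X2", "X3"): 26.190,
}

def extra_ss(var, given):
    """SS(var | given): the improvement when `var` joins a model already holding `given`."""
    with_var = tuple(sorted(given + (var,)))
    return round(ss[with_var] - ss[tuple(sorted(given))], 3)

ss_x2_given_x1 = extra_ss("X2", ("X1",))           # 0.096
ss_x1_given_x2 = extra_ss("X1", ("X2",))           # 19.959
ss_x1_given_x2x3 = extra_ss("X1", ("X2", "X3"))    # 22.053
```

Every entry in the summary table can be generated this way from the seven fitted models.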
Another possibility is the two variable SS fitted after one or more other variables. For example, the SS for X1 and X2 adjusted for X3 (and of course the intercept) is SSX1,X2|X3. To calculate this we start with the full model (SSX1,X2,X3 = 26.190). We know X3 alone fits SSX3 = 0.237. So

    SSX1,X2|X3 = 26.190 - 0.237 = 25.953

    Extra SS       SS       d.f. Error   Error SS
    SSX1,X2        24.074    9           38.843
    SSX1,X3        25.371    9           37.546
    SSX2,X3         4.137    9           58.780
    SSX1,X2|X3     25.953    8           36.727
    SSX1,X3|X2     22.075    8           36.727
    SSX2,X3|X1      2.212    8           36.727
    SSX1,X2,X3     26.190    8           36.727

Type I SS

So, what is important here? Why do we need extra SS? SAS will provide us with two types of sums of squares. We need to understand both, and extra SS is one key to this understanding. The first one is the SAS Type I SS; the second is SAS Type II or III or IV (which are the same for regression analysis).

The SAS Type I SS are called the sequentially adjusted SS. They have a number of potential problems. The Type I SS are adjusted in a sequential or serial fashion. Each SS is adjusted for the variables previously entered in the model, but not for variables entered later. For the model [Y = X1 X2 X3], X1 would be first and adjusted for nothing else (except X0). X2 would enter second, be adjusted for X1, but not for X3. X3 enters last and is adjusted for both X1 and X2. The result: SSX1, SSX2|X1, SSX3|X1,X2.

Unfortunately, these SS are different depending on the order of the variables, so different researchers could get different results for the same data. Use of this SS type is rare; it is only used where there is a mathematical reason to place the variables in a particular order. Use is restricted mostly to polynomial regressions (which we will see later) and a few other applications we will discuss.
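A quick way to see the order dependence: the sequential SS in two different orders give different individual SS, but each sequence always adds up to the same overall model SS. A short Python check using the values from the text above:

```python
# Sequential (SAS Type I) SS for two different variable orders,
# values taken from the extra SS tables above.
seq_123 = [23.978,  # SSX1
            0.096,  # SSX2|X1
            2.116]  # SSX3|X1,X2

seq_321 = [ 0.237,  # SSX3
            3.900,  # SSX2|X3
           22.053]  # SSX1|X2,X3

# The individual Type I SS clearly depend on the order of entry,
# but each sequence sums to the same full-model SS, 26.190.
total_123 = round(sum(seq_123), 3)
total_321 = round(sum(seq_321), 3)
```

This is exactly why two researchers entering the variables in different orders can reach different conclusions from the same data, even though both decompositions are arithmetically valid.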
For the model Y = X1 X2 X3 in SAS the Type I SS are SSX1, SSX2|X1, SSX3|X1,X2. A different order would give different SS and different results. So, we will not usually use Type I. However, they are provided by default by SAS PROC GLM (not PROC REG or PROC MIXED).

Type II SS

The Type II SS (or Type III or IV for regression) are called PARTIAL SS, or fully adjusted SS, or uniquely attributable SS. These are the ones most often used. From the "fully adjusted" terminology you might guess that we are talking about each variable fitted after the other variables. This is correct.

Note that in SAS, for regression, Type II, Type III and Type IV are the same. SAS provides Type II in PROC REG, and it provides Type I and Type III by default in PROC GLM. Testing and evaluation of variables is usually done with the Type II or Type III SS.

ANOVA table for our example, using the Type III SS (partial SS). Note: tabular F0.05,1,8 = 5.32.

    Source        d.f.   SS       MS       F value
    SSX1|X2,X3     1     22.053   22.053   4.804
    SSX2|X1,X3     1      0.819    0.819   0.178
    SSX3|X1,X2     1      2.116    2.116   0.461
    ERROR          8     36.727    4.591

The best variable appears to be X1, though SSX1|X2,X3 is not quite significant. It might achieve significance if we removed the variables that account for less variation, SSX2|X1,X3 and SSX3|X1,X2. However, since they are fully adjusted for each other, we don't know how the SS might change when we remove one variable. So when we remove variables we remove ONE AT A TIME and check the remaining variables.

ANOVA table for analysis of the variables X1 and X3 alone (F0.05,1,9 = 5.117). Note that X1 is now significant, but X3 is not and may be removed.

    Source     d.f.   SS       MS       F value
    SSX1|X3     1     25.134   25.134   6.024
    SSX3|X1     1      1.393    1.393   0.334
    ERROR       9     37.546    4.172

The variable X1 is still significant (F0.05,1,10 = 4.965). This one at a time variable removal process is called "backward stepwise regression".

    Source   d.f.   SS       MS       F value
    SSX1      1     23.977   23.977   6.158
    ERROR    10     38.939    3.894

General Linear Hypothesis Test

This test is relatively easy and intuitive given what we know about extra SS. In this test we can examine the addition of any variable or group of variables to a model. The model without the variables is called the Reduced model. The model with the additional variables we want to test is called the Full model.

For example, suppose we had previously seen a study such as the SENIC hospital study that had two variables only, length of stay and average age of the patient (X1 and X2). We want to test the other 6 variables to see if they add anything JOINTLY to the model. To do this we would calculate the extra SS = SSX3,X4,X5,X6,X7,X8 | X1,X2.

We fit the Reduced model and the Full model.

    Full model results:    dfError = 104, SSE = 95.63982
    Reduced model results: dfError = 110, SSE = 141.99965

Then we set up the table below to test the difference.

    Source          d.f.   SSE        MSE      F        P>F
    Reduced model   110    141.9997
    Full model      104     95.6398
    Difference        6     46.3598   7.7266   8.4020   0.00000020
    Full model      104     95.6398   0.9196

In this case we see that the difference is highly significant, indicating that the amount of variation described by the omitted variables is significantly different from zero. At least one of these variables would be a useful addition to the model. All extra SS tests can be viewed as simple cases of the GLHT, having a model with a term (full) and a model with that term removed (reduced).

Statistics quote: Statistics means never having to say you're certain. (Anon.)

Summary

The primary new aspect of multiple regression (compared to SLR) is the need to evaluate and interpret several independent variables. A major tool for this is the SS produced for each variable. There are two types of SS for regression, sequential and partial. Extra SS are needed to understand the difference between these two types of SS.
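The GLHT mechanics fit in one small function. The sketch below (Python, function name my own) shows that the partial test of X1 from the earlier ANOVA table and the SENIC joint test are the same computation: compare the reduced-model and full-model error SS, and test the difference against the full-model MSE.

```python
def glht_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for a general linear hypothesis test:
    (extra SS / extra d.f.) divided by the full-model MSE."""
    ms_diff = (sse_reduced - sse_full) / (df_reduced - df_full)
    mse_full = sse_full / df_full
    return ms_diff / mse_full

# Partial (Type III) test of X1 in the 3-factor example: the reduced model
# drops X1, so its SSE is the full-model SSE plus SSX1|X2,X3.
f_x1 = glht_f(36.727 + 22.053, 9, 36.727, 8)      # about 4.804

# SENIC example: testing the 6 extra variables jointly.
f_senic = glht_f(141.99965, 110, 95.63982, 104)   # about 8.402
```

Both values match the tables above, illustrating the closing point: every extra SS test is a special case of the GLHT.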
Extra SS are simply the SS that each variable accounts for, and causes to be removed from the SSError and placed in the SSModel or SSReg. It is necessary, however, to know which variables, if any, have been entered in the model in advance of the variable being examined.

Also note a curious behavior of the variables when they occur together. When one Xi is adjusted for another independent (X) variable, sometimes its SS is larger, and sometimes smaller. This is unpredictable and can go either way. For example, SSX1 was 23.978, but dropped to 19.959 when adjusted for X2 and increased to 25.134 when adjusted for X3. It dropped to 22.053 when adjusted for both (see the SSX1 entries in the table below).

    Extra SS       SS
    SSX1           23.978
    SSX2            4.115
    SSX3            0.237
    SSX1|X2        19.959
    SSX2|X1         0.096
    SSX1|X3        25.134
    SSX3|X1         1.393
    SSX2|X3         3.900
    SSX3|X2         0.022
    SSX1|X2,X3     22.053
    SSX2|X1,X3      0.819
    SSX3|X1,X2      2.116

Not only will the SS of one variable increase or decrease as other variables are added to the model, but the regression coefficient values will also change. They may even change sign, and hence interpretation. So variables in combination do not necessarily have the same interpretation as they might have alone, though the interpretation does not usually change.

Final notes

Multiple regression shares a lot in interpretation and diagnostics with SLR. The coefficients should be adjusted for each other. This is the Type III SS in SAS. This is the big and important difference from SLR. See the extra SS for the phosphorus example in SAS.

Multiple regression with SAS output

The population equation is

    Yi = β0 + β1X1i + β2X2i + β3X3i + εi

The sample equation is

    Yi = b0 + b1X1i + b2X2i + b3X3i + ei

Always remember that our estimates of the bi are sample estimates of the true population values. The objectives in multiple regression are generally the same as in SLR.
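The sign change mentioned above actually occurs in the 3-factor data used earlier: the slope for X2 is negative when X2 is fit alone, but turns positive once X1 enters the model. A Python/NumPy sketch (illustrative, not course code):

```python
import numpy as np

# Data from the 3-factor example
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)

def coefs(*xs):
    """Least-squares coefficients (intercept first) for Y on the given X's."""
    X = np.column_stack([np.ones_like(Y)] + list(xs))
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b

b2_alone = coefs(X2)[1]         # slope for X2 by itself: negative
b2_adjusted = coefs(X1, X2)[2]  # partial slope for X2, adjusted for X1: positive
```

So the "same" variable can carry a different interpretation depending on which other variables share the model, exactly as the paragraph above warns.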
Testing hypotheses (about bi values, predicted values, correlations), quantifying relationships (but NOT proving that there is a relationship), and estimating parameters with confidence intervals.

The assumptions for the regression are the same as for simple linear regression:

    Normality
    Independence
    Homogeneity of variance
    Xi measured without error

In short: ei ~ NID r.v. (0, σ²). Do not use this expression in an exam unless you can explain how it relates to the assumptions.

The interpretation of the parameter estimates is the same as in simple linear regression. For the slope, the units are Y units per X unit, and the slope measures the change in Y for a 1 unit change in X. For the intercept, the units are Y units.

The diagnostics used in simple linear regression are mostly the same for multiple regression. Residuals can still be examined for outliers, homogeneity, normality, curvature, influence, etc., as with SLR. The only difference is that, since we have several X's, we would usually plot the residuals on Yhat instead of a single X variable.

Interpretation

From our discussion of extra SS you may recall that SAS will provide several types of SS. The first is called SS Type I, or the sequential SS: SSX1, SSX2|X1, SSX3|X1,X2. There will be some specific instances where these are desirable. However, we will usually want the variables adjusted for each other: all variables adjusted for all other variables in the model. This is desirable because when we adjust for other variables we account for the effect of the other variables, or we remove the effect of the other variables, or we hold the other variables constant.

So we know that in multiple regression each variable may have an effect on the dependent variable, and we want to isolate the effect of each variable while adjusting for the effect of other variables on the dependent variable (Yi).
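As a sketch of the residual-versus-Yhat diagnostic described above, using the 3-factor data from earlier (Python/NumPy, illustrative only; in the course this would come from PROC REG output):

```python
import numpy as np

# Data from the 3-factor example
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

# Full model fit: intercept plus all three X's
X = np.column_stack([np.ones_like(Y), X1, X2, X3])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

yhat = X @ b
resid = Y - yhat                 # these are the values to plot against yhat
sse = float(np.sum(resid ** 2))  # about 36.727, the full-model error SS from earlier
```

With a single X we would plot `resid` against that X; with several X's, plotting against `yhat` gives one picture that reflects all of them at once.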
The Type III SS do this, while the Type I SS adjust in a particular order (WHICH IS NOT UNIQUE!!). Note that the Type II or Type III SS (these are the same for regression) are also called the PARTIAL SS. They may also be referred to as the fully adjusted SS or the uniquely attributable SS: SSX1|X2,X3, SSX2|X1,X3, SSX3|X1,X2. So we will generally use the partial SS. Remember the word PARTIAL.

What about other things in regression, are they sequentially adjusted or fully adjusted? The regression coefficients, for example, or correlations that may be calculated between the Yi and the various Xi. The regression coefficients in a multiple regression are called the partial regression coefficients, and we will see partial correlation coefficients. The word partial suggests that these too are FULLY ADJUSTED. Everything "partial" is fully adjusted.

Numerical examples

We will look at two examples. The primary example is a database called "SENIC" (Study of the Efficacy of Nosocomial Infection Control) from Neter, Kutner, Nachtsheim and Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co. The three factor example (plant available phosphorus) used for the matrix example and extra SS example will also be used.

SAS Example 1: Snedecor and Cochran (1967) - see Appendix 6. Three types of soil phosphorus levels were determined, and the amount of phosphorus available to plants was determined. We want to do a regression that determines which of the soil measurements relate (correlate) to the plant available phosphorus. The 4 variables in the data set are:

    Plant available phosphorus, the dependent variable.
    Inorganic phosphorus, the first independent variable (order is not important if Type II SS are used).
    Organic phosphorus hydrolyzed in hypobromite, another independent variable.
    Organic phosphorus NOT hydrolyzed in hypobromite, also an independent variable.

See the SAS program. PROC REG is used for this problem. There were a number of new options used.
We will use this example primarily to examine the fitting of regression with matrices.

Three matrices are contained in a small array, with X'X a 4 by 4 matrix in the upper left, X'Y a 4 by 1 matrix on the upper right, and Y'Y the scalar value (a 1 by 1 matrix) in the lower right corner. The remaining 1 by 4 matrix in the lower left is (X'Y)'.

    X'X (4x4)      X'Y (4x1)
    (X'Y)' (1x4)   Y'Y (1x1)

Calculated values in the resulting 5 by 5 matrix are given below.

                Intercept   X1        X2        X3        Y
    Intercept   n           ∑X1       ∑X2       ∑X3      ∑Y
    X1          ∑X1         ∑X1²      ∑X1X2     ∑X1X3    ∑X1Y
    X2          ∑X2         ∑X1X2     ∑X2²      ∑X2X3    ∑X2Y
    X3          ∑X3         ∑X1X3     ∑X2X3     ∑X3²     ∑X3Y
    Y           ∑Y          ∑X1Y      ∑X2Y      ∑X3Y     ∑Y²

We will occasionally refer to the X matrix and the X'X matrix as the semester progresses. The X matrix is the matrix of the various independent values (Xi). The X'X contains all of the SS and cross products of those variables. The X matrix has p columns and the X'X is a p by p square matrix. Note the symmetry of the off diagonal elements.

The positions of the elements of the "X'X Inverse, Parameter Estimates, and SSE" output are as follows. See computer output.

    (X'X)⁻¹ (4x4)   B (4x1)
    B' (1x4)        SSE (1x1)

Degrees of freedom (d.f.) in multiple regression: the model will have p-1 d.f., where p is the number of parameters including the intercept. The corrected total has n-1 d.f., where n is the number of observations. The error has n-p d.f.

As a reminder, the Type I SS are SSX1, SSX2|X1, SSX3|X1,X2, and the Type II or III SS (the same for regression) are SSX1|X2,X3, SSX2|X1,X3, SSX3|X1,X2. Note that the SS for the last variable is the same for both types. This is always true.

To do an F test of these, first calculate the mean square (all have one d.f.), and then divide the MS by the MSError, which in this example has 14 d.f. The result would then be compared to tabular values from the F table with 1, 14 d.f. This can also be used to test each parameter estimate against zero.

Regression in GLM

PROC GLM and PROC MIXED do regression, but do not have all of the regression diagnostics available that we find in PROC REG. However, they do have a few advantages. They facilitate the inclusion of class variables (something we will be interested in later), and they provide tests of both Type I and Type II SS (as well as Types III and IV).

The formatting is different, but most of the same information is available. Tests of both Type I and Type III SS are given by default. Note that the Type II and Type III are the same as in PROC REG (recall extra SS), but tests are provided. These F test values are calculated by dividing each SS (sequential or partial) by the MSE. Also note that the t-tests of the parameter estimates are the same as the tests of the partial SS.

More material and summarization of multiple regression will be done with the second example.

Multicollinearity

An important consideration in multiple regression is the effect of correlation among independent variables. There is a problem that exists when two independent variables are very highly correlated; the problem is called multicollinearity. At one extreme of this phenomenon is the case where two independent variables are perfectly correlated. This results in "singularity", and an X'X matrix that cannot be inverted. To illustrate the problem, take the following data set.

    Y    X1   X2
    1    1    2
    2    2    3
    3    3    4

If entered in PROC REG, SAS will report problems and will fit only the first variable, since the second one is perfectly correlated. Suppose we did want to fit both parameters for X1 and X2; what bi values could we get? The table below shows some possible values for b0, b1 and b2.

Acceptable values of b0, b1 and b2 in the model Yi = b0 + b1X1i + b2X2i + ei.
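The 5 by 5 array above can be built directly as M'M, where M holds the intercept column, the X's, and Y. An illustrative Python/NumPy sketch using the 3-factor data from earlier (not the phosphorus data, which is in the SAS appendix):

```python
import numpy as np

# Data from the 3-factor example
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

M = np.column_stack([np.ones_like(Y), X1, X2, X3, Y])  # intercept, X's, Y
A = M.T @ M  # the 5x5 array: n, sums, SS and cross products

xtx = A[:4, :4]  # X'X (4x4): n, ∑Xi, ∑Xi², ∑XiXj
xty = A[:4, 4]   # X'Y (4x1)
yty = A[4, 4]    # Y'Y = ∑Y² (here 315)

# Parameter estimates from the normal equations, b = (X'X)^(-1) X'Y
b = np.linalg.solve(xtx, xty)
```

The first row of `A` is (n, ∑X1, ∑X2, ∑X3, ∑Y) = (12, 63, 55, 57, 55) for this data, and `xtx` is symmetric, matching the layout shown above.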
    b0          b1          b2
    0           1           0
    -1          0           1
    99          100         -99
    999         1000        -999
    -101        -100        101
    -1001       -1000       1001
    -1000001    -1000000    1000001

There are an infinite number of solutions when singularity exists, and that is why no program can, or should, fit the parameter estimates. But suppose that I took and added to one of the Xi observations the value 0.0000000001.
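It is easy to verify that every row of the table satisfies the data exactly, which is why no unique estimate exists. A short Python check (illustrative):

```python
# Singularity demo: with X2 = X1 + 1, every (b0, b1, b2) row from the table
# above reproduces the data perfectly -- there is no unique solution.
data = [(1, 1, 2), (2, 2, 3), (3, 3, 4)]  # (Y, X1, X2)

solutions = [
    (0, 1, 0), (-1, 0, 1), (99, 100, -99), (999, 1000, -999),
    (-101, -100, 101), (-1001, -1000, 1001), (-1000001, -1000000, 1000001),
]

def fits_perfectly(b0, b1, b2):
    """True if Y = b0 + b1*X1 + b2*X2 holds exactly for every observation."""
    return all(y == b0 + b1 * x1 + b2 * x2 for y, x1, x2 in data)

all_fit = all(fits_perfectly(*b) for b in solutions)  # True for every row
```

The pattern behind the table: because X2 = X1 + 1, any (b0, b1, b2) with b0 + b2 = 0 and b1 + b2 = 1 fits the data exactly, so there are infinitely many solutions.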

This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang,j during the Fall '08 term at LSU.
