Statistical Techniques II
James P. Geaghan, Copyright 2011

In summary:
SSX1 = 2381.31494
SSX2 = 1446.14793
SSX1|X2 = 3077.286793
SSX2|X1 = 2142.119793
In this case the variables actually enhanced each other, performing better together than alone. Although not the rule, this is not unusual.
3 factor model

The calculation of extra SS is exactly the same for larger models. The following example is a 3 factor multiple regression. In SAS this model would be

PROC REG; MODEL Y = X1 X2 X3;
The raw data are given below.

Obs    Y   X1   X2   X3
  1    1    2    9    2
  2    3    4    6    5
  3    5    7    7    9
  4    3    3    5    5
  5    6    5    8    9
  6    4    3    4    2
  7    2    2    3    6
  8    8    6    2    1
  9    9    7    5    3
 10    3    8    2    4
 11    5    7    3    7
 12    6    9    1    4

The results of the regressions are:
SSTotal is 62.91667 for all models.
For the 1 factor models the results are:
Regression of Y on X1: SSError = 38.939, SSModel = 23.978
Regression of Y on X2: SSError = 58.802, SSModel = 4.115
Regression of Y on X3: SSError = 62.680, SSModel = 0.237
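As a cross-check, these one-factor results can be reproduced with ordinary least squares. The sketch below uses Python/NumPy rather than the SAS run in the notes; the `ss_model` helper is ours, introduced only for this verification.

```python
import numpy as np

# Data from the 3 factor example above
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

def ss_model(*xs):
    """Model SS (corrected for the intercept) from an OLS fit of Y on xs."""
    X = np.column_stack([np.ones(len(Y)), *xs])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sse = np.sum((Y - X @ b) ** 2)
    return np.sum((Y - Y.mean()) ** 2) - sse

sstotal = np.sum((Y - Y.mean()) ** 2)
print(round(sstotal, 5))       # 62.91667
print(round(ss_model(X1), 3))  # 23.978
print(round(ss_model(X2), 3))  # 4.115
print(round(ss_model(X3), 3))  # 0.237
```

The same helper reproduces every model SS used on the following pages.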
The extra SS are equal to the model SS for 1 factor models.
SSX1 = 23.978 (or SSX1|X0)
SSX2 = 4.115 (or SSX2|X0)
SSX3 = 0.237 (or SSX3|X0)
These SS are adjusted for the intercept (correction factor). This will always be the case for our examples, so the X0 is often omitted.
Fitting X1, X2 and X3 together TWO AT A TIME we get the following results.
Regression of Y on X1 and X2: SSError = 38.843, SSModel = 24.074
Regression of Y on X1 and X3: SSError = 37.546, SSModel = 25.371
Regression of Y on X2 and X3: SSError = 58.780, SSModel = 4.137
Now let's get the Extra SS for variables fitted together.
SSX1 alone = 23.978
SSX2 alone = 4.115
SSX1 and X2 together = 24.074
Again, look for the improvement in the model due to the second variable. Calculate how much each variable adds to a model with the other variable already in the model.
For the model with X1 and X2, subtract the amount accounted for by each variable alone from the amount together.
Start with X1 (SSX1 = 23.978), add X2 (SSX1,X2 = 24.074).
The improvement with X2 is 24.074 – 23.978 = 0.096 = SSX2|X1
Start with X2 (SSX2 = 4.115), add X1 (SSX1,X2 = 24.074).
The improvement is 24.074 – 4.115 = 19.959 = SSX1|X2
So, SSX1|X2 = 19.959, and SSX2|X1 = 0.096
Likewise for the model with X1 and X3.
Start with X1 (SSX1 = 23.978), add X3 (SSX1,X3 = 25.371).
The improvement is 25.371 – 23.978 = 1.393 = SSX3|X1
Start with X3 (SSX3 = 0.237), add X1 (SSX1,X3 = 25.371).
The improvement is 25.371 – 0.237 = 25.134 = SSX1|X3
So, SSX1|X3 = 25.134, and SSX3|X1 = 1.393
Likewise for the model with X2 and X3.
Start with X2 (SSX2 = 4.115), add X3 (SSX2,X3 = 4.137).
The improvement is 4.137 – 4.115 = 0.022 = SSX3|X2
Start with X3 (SSX3 = 0.237), add X2 (SSX2,X3 = 4.137).
The improvement is 4.137 – 0.237 = 3.900 = SSX2|X3
So, SSX2|X3 = 3.900, and SSX3|X2 = 0.022
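These pairwise extra SS can be verified the same way. The sketch below is our own Python/NumPy cross-check, not the SAS output; note that the unrounded SSX2|X1 is about 0.097, while the notes obtain 0.096 by subtracting already-rounded values.

```python
import numpy as np

Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

def ss_model(*xs):
    """Model SS (corrected for the intercept) from an OLS fit of Y on xs."""
    X = np.column_stack([np.ones(len(Y)), *xs])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - Y.mean()) ** 2) - np.sum((Y - X @ b) ** 2)

ss1, ss2, ss3 = ss_model(X1), ss_model(X2), ss_model(X3)
ss12 = ss_model(X1, X2)   # 24.074
ss13 = ss_model(X1, X3)   # 25.371
ss23 = ss_model(X2, X3)   # 4.137

# Extra SS: improvement from adding the second variable
extra_21 = ss12 - ss1     # SSX2|X1, ~0.097 (0.096 in the notes, from rounded inputs)
extra_12 = ss12 - ss2     # SSX1|X2 = 19.959
extra_31 = ss13 - ss1     # SSX3|X1 = 1.393
extra_13 = ss13 - ss3     # SSX1|X3 = 25.134
extra_32 = ss23 - ss2     # SSX3|X2 = 0.022
extra_23 = ss23 - ss3     # SSX2|X3 = 3.900
```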
Finally, and most important, for all 3 variables in the model: how much does each variable improve the model over a model with the other two variables already present?
Start with the full model, SSX1,X2,X3 = 26.190
SSX1,X2 = 24.074;  SSX3|X1,X2 = 2.116
SSX1,X3 = 25.371;  SSX2|X1,X3 = 0.819
SSX2,X3 = 4.137;   SSX1|X2,X3 = 22.053

Summarizing the Extra SS calculations.
Extra SS       SS       d.f. Error   Error SS
SSX1           23.978      10        38.939
SSX2            4.115      10        58.802
SSX3            0.237      10        62.680
SSX1|X2        19.959       9        38.843
SSX2|X1         0.096       9        38.843
SSX1|X3        25.134       9        37.546
SSX3|X1         1.393       9        37.546
SSX2|X3         3.900       9        58.780
SSX3|X2         0.022       9        58.780
SSX1|X2,X3     22.053       8        36.727
SSX2|X1,X3      0.819       8        36.727
SSX3|X1,X2      2.116       8        36.727

Note that all Extra SS are also corrected for the intercept.
More extra SS

A final note on extra SS. It is also useful to be able to express SS for two or more variables, with two or more degrees of freedom, as extra SS. For example, the SS due to X1 and X2 together (adjusted only for the intercept) is SSX1,X2. These extra SS can be obtained directly from the two factor models fitted in SAS.
Another possibility is the two variable SS fitted after one or more other variables. For example, the SS for X1 and X2 adjusted for X3 (and of course the intercept) is SSX1,X2|X3.
To calculate this we start with the full model (SSX1,X2,X3 = 26.190). We know X3 alone fits SSX3 = 0.237.
So SSX1,X2|X3 = 26.190 – 0.237 = 25.953
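The same subtraction works for any multi-degree-of-freedom extra SS. A Python/NumPy check (our helper, not SAS):

```python
import numpy as np

Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

def ss_model(*xs):
    """Model SS (corrected for the intercept) from an OLS fit of Y on xs."""
    X = np.column_stack([np.ones(len(Y)), *xs])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - Y.mean()) ** 2) - np.sum((Y - X @ b) ** 2)

full = ss_model(X1, X2, X3)
ss_12_given_3 = full - ss_model(X3)   # SSX1,X2|X3 = 25.953 (2 d.f.)
ss_13_given_2 = full - ss_model(X2)   # SSX1,X3|X2 = 22.075 (2 d.f.)
ss_23_given_1 = full - ss_model(X1)   # SSX2,X3|X1 = 2.212  (2 d.f.)
```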
Extra SS       SS       d.f. Error   Error SS
SSX1,X2        24.074       9        38.843
SSX1,X3        25.371       9        37.546
SSX2,X3         4.137       9        58.780
SSX1,X2|X3     25.953       8        36.727
SSX1,X3|X2     22.075       8        36.727
SSX2,X3|X1      2.212       8        36.727
SSX1,X2,X3     26.190       8        36.727

Type I SS

So, what is important here? Or why do we need extra SS?
SAS will provide us with two types of sum of squares. We need to understand both, and extra
SS is one key to this understanding.
The first one is the SAS type 1 SS, the second is SAS type 2 or 3 or 4 (which are the same for
regression analysis).
The SAS Type 1 SS are called the sequentially adjusted SS. They have a number of potential
problems.
The Type I SS are adjusted in a sequential or serial fashion. Each SS is adjusted for the
variables previously entered in the model, but not for variables entered later. For the model [Y
= X1 X2 X3], X1 would be first and adjusted for nothing else (except X0). X2 would enter
second, be adjusted for X1, but not for X3. X3 enters last and is adjusted for both X1 and X2.
The result:
SSX1
SSX2|X1
SSX3|X1,X2
Unfortunately, these SS are different depending on the order of the variables, so different researchers could get different results for the same data. Use of this SS type is rare; it is only used where there is a mathematical reason to place the variables in a particular order. Use is restricted mostly to polynomial regressions (which we will see later) and a few other applications we will discuss.
For the model Y = X1 X2 X3 in SAS the Type I SS are: SSX1, SSX2|X1, SSX3|X1,X2
A different order would give different SS and different results.
So, we will not usually use Type I. However, they are provided by default by SAS PROC
GLM (not PROC REG or PROC MIXED).
Type II SS

The Type II SS (or Type III or IV for regression) are called PARTIAL SS, or fully adjusted SS, or uniquely attributable SS. These are the ones most often used.
From the “fully adjusted” terminology you might guess that we are talking about each variable
fitted after the other variables.
This is correct.
Note that in SAS, for regression, Type II and TYPE III and TYPE IV are the same. SAS provides
TYPE II in PROC REG and it provides TYPE I and TYPE III by default in PROC GLM.
Testing and evaluation of variables is usually done with the TYPE II or TYPE III SS.
ANOVA table for our example, using the TYPE III SS (Partial SS). Note: tabular F0.05,1,8=5.32.
Source         d.f.    SS       MS       F value
SSX1|X2,X3       1    22.053   22.053    4.804
SSX2|X1,X3       1     0.819    0.819    0.178
SSX3|X1,X2       1     2.116    2.116    0.461
ERROR            8    36.727    4.591
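Each F value above is that row's MS divided by the MSError on 8 d.f., and the tabular F comes from the F distribution. A sketch of that arithmetic, assuming SciPy is available (our cross-check, not the SAS output):

```python
import numpy as np
from scipy.stats import f

Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

def ss_model(*xs):
    """Model SS (corrected for the intercept) from an OLS fit of Y on xs."""
    X = np.column_stack([np.ones(len(Y)), *xs])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - Y.mean()) ** 2) - np.sum((Y - X @ b) ** 2)

sstot = np.sum((Y - Y.mean()) ** 2)
full  = ss_model(X1, X2, X3)
mse   = (sstot - full) / 8              # error d.f. = n - p = 12 - 4

F1 = (full - ss_model(X2, X3)) / mse    # 4.804, each extra SS has 1 d.f.
F2 = (full - ss_model(X1, X3)) / mse    # 0.178
F3 = (full - ss_model(X1, X2)) / mse    # 0.461
fcrit = f.ppf(0.95, 1, 8)               # tabular F(0.05; 1, 8), about 5.32
```

Since F1 is below fcrit, none of the fully adjusted terms is significant at the 0.05 level, matching the discussion below.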
The best variable appears to be X1, though SSX1|X2,X3 is not quite significant. It might achieve significance if we removed the variables that account for less variation, SSX2|X1,X3 and SSX3|X1,X2. However, since they are fully adjusted for each other we don't know how the SS might change when we remove one variable. So when we remove variables we remove ONE AT A TIME and check the remaining variables.

ANOVA table for analysis of the variables X1 and X3 alone. (F0.05,1,9 = 5.117).
Note that X1 is now significant, but X3 is not and may be removed.
Source     d.f.    SS       MS       F value
SSX1|X3      1    25.134   25.134    6.024
SSX3|X1      1     1.393    1.393    0.334
ERROR        9    37.546    4.172
0.334 The variable X1 is still significant. (F0.05,1,10=4.965)
This one at a time variable removal process is called “backward stepwise regression”.
Source   d.f.    SS       MS       F value
SSX1       1    23.977   23.977    6.158
ERROR     10    38.939    3.894

General Linear Hypothesis Test

This test is relatively easy and intuitive given what we know about extra SS. In this test we can
examine the addition of any variable or group of variables to a model. The model without the
variables is called the Reduced model. The model with the additional variables we want to test is
called the Full model.
For example, suppose we had previously seen a study such as the SENIC hospital study that
had two variables only, Length of stay and average age of the patient (X1 and X2). We want to
jointly test the other 6 variables to see if they add anything JOINTLY to the model.
To do this we would calculate the extra SS = SSX3, X4, X5, X6, X7, X8 | X1, X2
We fit the Reduced model and the Full model.
Full model results: dfError = 104, SSE=95.63982
Reduced model results: dfError=110, SSE=141.99965
Then we set up the table below to test the difference.
Source          d.f.    SSE        MSE      F        P>F
Reduced model   110    141.9997
Full model      104     95.6398   0.9196
Difference        6     46.3598   7.7266   8.4020   0.00000020

In this case we see that the difference is highly significant, indicating that the amount of variation described by the omitted variables is significantly different from zero. At least one of these variables would be a useful addition to the model.
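The GLHT arithmetic can be sketched directly from the two error SS. This Python/SciPy fragment uses only the error SS and d.f. quoted above (a cross-check of the table, not the original SAS run):

```python
from scipy.stats import f

# Error SS and d.f. from the SENIC full and reduced models above
df_r, sse_r = 110, 141.99965    # reduced model: X1, X2 only
df_f, sse_f = 104, 95.63982     # full model: all 8 variables

q = df_r - df_f                               # 6 variables tested jointly
F = ((sse_r - sse_f) / q) / (sse_f / df_f)    # (46.3598/6) / 0.9196 = 8.402
p = f.sf(F, q, df_f)                          # very small (about 2e-7)
```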
All extra SS tests can be viewed as simple cases of the GLHT, having a model with a term (full) and a model with that term removed (reduced).

Statistics quote: Statistics means never having to say you're certain. (Anon.)

Summary
The primary new aspect of multiple regression (compared to SLR) is the need to evaluate and interpret several independent variables. A major tool for this is the SS produced for each variable.
There are two types of SS for regression, Sequential and Partial. Extra SS are needed to
understand the difference between these two types of SS.
Extra SS are simply the SS that each variable accounts for, and causes to be removed from the
SSError and placed in the SSModel or SSReg.
It is necessary, however, to know which variables, if any, have been entered in the model in
advance of the variable being examined.
Also note a curious behavior of the variables when they occur together.
When one Xi is adjusted for another independent (X) variable, sometimes its SS is larger and sometimes smaller. This is unpredictable and can go either way.
For example, SSX1 was 23.978, but dropped to 19.959 when adjusted for X2 and increased to 25.134 when adjusted for X3. It dropped to 22.053 when adjusted for both (see the summary table of extra SS above).
Not only will the SS of one variable increase or decrease as other variables are added to the model, but the regression coefficient values will also change. They may even change sign, and hence interpretation. So variables in combination do not necessarily have the same interpretation as they might have alone, though the interpretation does not usually change.

Final notes
Multiple regression shares a lot in interpretation and diagnostics with SLR.
The coefficients should be adjusted for each other. This is the Type III SS in SAS. This is the big
and important difference from SLR.
See Extra SS for the phosphorus example in SAS.

Multiple regression with SAS output
The population equation is Yi = β0 + β1X1i + β2X2i + β3X3i + εi
The sample equation is Yi = b0 + b1X1i + b2X2i + b3X3i + ei
Always remember that our estimates of the bi are sample estimates of the true population values.
The objectives in multiple regression are generally the same as SLR:
Testing hypotheses (about bi values, predicted values, correlations),
quantifying relationships (but NOT proving that there is a relationship),
estimating parameters with confidence intervals.
The assumptions for the regression are the same as for Simple Linear Regression:
Normality
Independence
Homogeneity of variance
Xi measured without error
in short: ei ~ NID r.v. (0, σ²). Do not use this expression in an exam unless you can explain how it relates to the assumptions.
The interpretation of the parameter estimates is the same as simple linear regression.
For the slope, the units are Y units per X unit, and measure the change in Y for a 1 unit change in X.
For the intercept the units are Y units.
The diagnostics used in simple linear regression are mostly the same for multiple regression.
Residuals can still be examined for outliers, homogeneity, normality, curvature, influence, etc.,
as with SLR.
The only difference is that, since we have several X's, we would usually plot the residuals against Yhat instead of against a single X variable.
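As an illustration of working with residuals and Yhat, a Python/NumPy sketch using the 3 factor example data from the earlier pages (our illustration, not SAS output). With an intercept in the model, the residuals sum to zero and are orthogonal to the fitted values, which is why plotting residuals against Yhat is the natural diagnostic.

```python
import numpy as np

Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

X = np.column_stack([np.ones(len(Y)), X1, X2, X3])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

yhat  = X @ b               # fitted values for the residual-vs-Yhat plot
resid = Y - yhat            # residuals, examined for outliers, curvature, etc.
sse   = np.sum(resid ** 2)  # matches the full model SSError, 36.727
```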
Interpretation

From our discussion of Extra SS you may recall that SAS will provide several types of SS. The first is called SS Type I, or the Sequential SS:
SSX1
SSX2|X1
SSX3|X1,X2
There will be some specific instances where these are desirable. However, we will usually want
the variables adjusted for each other. All variables adjusted for all other variables in the model.
This is desirable because when we adjust for other variables we
account for the effect of the other variables, or we
remove the effect of the other variables, or we
hold the other variables constant.

So we know that in multiple regression each variable may have an effect on the dependent
variable, and we want to isolate the effect of each variable while adjusting for the effect of other
variables on the dependent variable (Yi).
The Type III SS do this, while the Type I SS adjust in a particular order (WHICH IS NOT
UNIQUE!!)
Note that the Type II or Type III SS (these are the same for regression) are also called the PARTIAL SS. They may also be referred to as the fully adjusted SS or the uniquely attributable SS:
SSX1|X2,X3
SSX2|X1,X3
SSX3|X1,X2
So we will generally use the Partial SS. Remember the word PARTIAL.
What about other things in regression? Are they sequentially adjusted or fully adjusted? The regression coefficients, for example. Or correlations that may be calculated between the Yi and the various Xi.
The regression coefficients in a multiple regression are called the partial regression coefficients, and we will see partial correlation coefficients. The word partial suggests that these too are FULLY ADJUSTED.
Everything "partial" is fully adjusted.
Numerical examples

We will look at two examples. The primary example is a database called "SENIC" (Study of the Efficacy of Nosocomial Infection Control) from Neter, Kutner, Nachtsheim and Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co.
The three factor example (plant available phosphorus) used for the matrix example and extra SS example will also be used.
SAS Example 1: Snedecor and Cochran (1967) – see Appendix 6

Three types of soil phosphorus levels were determined, and the amount of phosphorus available to plants was determined. We want to do a regression that determines which of the soil measurements relate (correlate) to the plant available phosphorus.
The 4 variables in the data set are:
Plant available phosphorus, the dependent variable.
Inorganic phosphorus, the first independent variable (order is not important if Type II SS are used).
Organic phosphorus hydrolyzed in hypobromite, another independent variable.
Organic phosphorus NOT hydrolyzed in hypobromite, also an independent variable.
See the SAS program.
PROC REG is used for this problem. There were a number of new options used.
We will use this example primarily to examine the fitting of regression with matrices.

Three matrices are contained in a small array with X'X a 4 by 4 matrix in the upper left, X'Y a 4
by 1 matrix on the upper right, and Y'Y is the scalar value (a 1 by 1 matrix) in the lower right
corner. The remaining 1 by 4 matrix in the lower left is (X'Y)'.

    [ X'X (4x4)      X'Y (4x1) ]
    [ (X'Y)' (1x4)   Y'Y (1x1) ]

Calculated values in the resulting 5 by 5 matrix are given below.

            Intercept   X1       X2       X3       Y
Intercept   n           ∑X1      ∑X2      ∑X3      ∑Y
X1          ∑X1         ∑X1²     ∑X1X2    ∑X1X3    ∑X1Y
X2          ∑X2         ∑X1X2    ∑X2²     ∑X2X3    ∑X2Y
X3          ∑X3         ∑X1X3    ∑X2X3    ∑X3²     ∑X3Y
Y           ∑Y          ∑X1Y     ∑X2Y     ∑X3Y     ∑Y²

We will occasionally refer to the X matrix and the X'X matrix as the semester progresses. The
X matrix is the matrix of the various independent values (Xi). The X'X contains all of the SS
and cross products of those variables.
The X matrix has p columns and the X'X is a pxp square matrix.
Note the symmetry of the off diagonal elements.
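The 5 by 5 array above can be formed as A'A, where A has columns X0 (the intercept column of ones), the Xi, and Y. A NumPy illustration using the 3 factor example data from the earlier pages (our construction for illustration, not SAS output):

```python
import numpy as np

Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], dtype=float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], dtype=float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], dtype=float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], dtype=float)

# Columns: X0 (intercept), X1, X2, X3, Y
A = np.column_stack([np.ones(len(Y)), X1, X2, X3, Y])
M = A.T @ A   # 5x5 array: X'X upper left, X'Y upper right, (X'Y)' lower left, Y'Y lower right

print(M[0, 0])   # n = 12.0
print(M[0, 1])   # sum of X1 = 63.0
print(M[4, 4])   # sum of Y squared = 315.0
```

The symmetry noted above corresponds to M being equal to its own transpose.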
The position of the elements of the “X'X Inverse, Parameter Estimates, and SSE” are as follows.
See computer output.
    [ (X'X)⁻¹ (4x4)   B (4x1)   ]
    [ B' (1x4)        SSE (1x1) ]

Degrees of freedom (d.f.) in multiple regression.
The model will have p–1 d. f., where p is the number of parameters including the intercept.
The corrected total has n–1 d.f., where n is the number of observations.
The error has n–p d.f.
As a reminder, the Type I SS are
SSX1
SSX2|X1
SSX3|X1,X2
and the Type II or III SS (same for regression) are
SSX1|X2,X3
SSX2|X1,X3
SSX3|X1,X2
Note that the SS for the last variable is the same for both types. This is always true.
To do an F test of these, first calculate the Mean Square (all have one d.f.), and then divide the MS by the MSError, which in this example has 14 d.f. The result would then be compared to tabular values from the F table with 1, 14 d.f.
This can also be used to test each parameter estimate against zero.
Regression in GLM
PROC GLM and PROC MIXED do regression, but do not have all of the regression diagnostics
available that we find in PROC REG.
However, they do have a few advantages.
They facilitate the inclusion of class variables (something we will be interested in later), and
they provide tests of both Type I and Type II SS (as well as Types III and IV).
The formatting is different, but most of the same information is available.
Tests of both SS 1 and SS 3 are given by default.
Note that the Type II and Type III are the same as in PROC REG (recall extra SS), but tests are
provided. These F test values are calculated by dividing each SS (Sequential or Partial) by the
MSE.
Also note that the t tests of the parameter estimates are the same as the tests of the Partial SS.
More material and summarization of multiple regression will be done with the second example.

Multicollinearity
An important consideration in multiple regression is the effect of correlation among independent
variables. There is a problem that exists when two independent variables are very highly
correlated. The problem is called multicollinearity.
At one extreme of this phenomenon is the case where two independent variables are perfectly correlated. This results in "singularity", and an X'X matrix that cannot be inverted.
To illustrate the problem, take the following data set.
Y    X1   X2
1     1    2
2     2    3
3     3    4

If entered in PROC REG, SAS will report problems and will fit only the first variable, since the second one is perfectly correlated. Suppose we did want to fit both parameters for X1 and X2, what bi values could we get? The table below shows some possible values for b0, b1 and b2.
Acceptable values of b0, b1 and b2 in the model Yi = b0 + b1X1i + b2X2i:

        b0         b1         b2
         0          1          0
        –1          0          1
        99        100        –99
       999       1000       –999
      –101       –100        101
     –1001      –1000       1001
  –1000001   –1000000    1000001

There are an infinite number of solutions when singularity exists, and that is why no program
can, or should, fit the parameter estimates.
But suppose that I added the value 0.0000000001 to one of the Xi observations.
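The singularity can be demonstrated numerically. The sketch below (Python/NumPy, our illustration) checks the rank of X'X and confirms that several of the coefficient vectors from the table fit the data perfectly:

```python
import numpy as np

Y = np.array([1.0, 2.0, 3.0])
# Intercept column, X1, and X2 = X1 + 1 (perfectly correlated with X1)
X = np.column_stack([np.ones(3), [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # 2, not 3: X'X is singular and cannot be inverted

# Several (b0, b1, b2) rows from the table, all fitting the data exactly
for b in ([0, 1, 0], [-1, 0, 1], [99, 100, -99], [999, 1000, -999]):
    fitted = X @ np.array(b, dtype=float)
    print(np.allclose(fitted, Y))   # True for every row
```

Any vector satisfying b1 + b2 = 1 and b0 + b2 = 0 fits perfectly, which is the "infinite number of solutions" described above.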