Statistical Techniques II Page 47
On the other hand, you may not be interested in all of the variables. You may not know which
variables are important, and your objective may be to determine which ones are important.
In this case you may want to keep and discuss all of the variables, or you may want to select
the important variables and present them in a smaller (reduced) model with only the significant variables.
Our second model is more likely to be of this type. There are 8 variables, and most likely
not all are important. Our objective here is probably to determine which ones are correlated
with the dependent variable. We may start with a full model, but will then probably reduce the
model to some subset of significant variables.
A larger multiple variable example (Appendix 8): from Neter, Kutner, Nachtsheim and
Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co. It is a sample of
various hospitals. Data were taken for the “Study of the Efficacy of Nosocomial Infection
Control” (SENIC Project).
Each observation is for one hospital. There were 113 observations.
The variables are as follows:
Identification number
Average length of stay of all patients (in days)
Average age of all patients (years)
Average Infection risk for the hospital (in percent)
This was taken as dependent variable
Routine culture ratio of cultures performed to patients without symptoms
Routine chest x-ray ratio for patients with and without symptoms of pneumonia
Average number of beds (during study)
Med School Affiliation (1=Yes, 2=No)
Region NE=1, NC=2, S=3, W=4
Average no. of patients during study
Average no. of nurses
Percent of 35 potential service facilities that were actually provided by the hospital
Infection risk was used as the dependent variable.
Eight other variables were considered independent variables.
The two class (categorical, group) variables were omitted from the analysis (Med school
affiliation and Region).
Identification number was not used.
See the computer output. Selected output was omitted as trivial or not of interest.
Interpretation and evaluation
The PROC REG statement is given on the second page.
Only one new option is included here, the COLLIN option on the model statement.
We have seen all of the other output down through the PROC UNIVARIATE;
PROC GLM is excluded.
The Stepwise regressions at the end of the output are new.
James P. Geaghan - Copyright 2011
So, what are we looking for in this regression?
Which variables are important, if any?
Are there problems with Multicollinearity, and what do you do about it anyway?
Are there any problems with the observations?
Yi variable outliers?
Xi variable “outliers”? Influential observations?
Are the assumptions for linear regression met? Normality? Independence? Homogeneity of
variance? Are the Xi variables measured without error?
And what are our objectives? If we want to find the best model for predicting infection risk, do
we need all 8 variables? How do we go about removing the ones we don't want?
See the computer output. Some output has been deleted to save space. This is information
that I considered very simple, or information that we will not cover this semester:
Descriptive Statistics
Uncorrected SS and Crossproducts
Model Crossproducts X'X X'Y Y'Y
X'X Inverse, Parameter Estimates, SSE
Matrix solution
We did not cover matrix algebra in detail, but I expect you to know something about these.
As with the SLR, we need all SS and CP from the variables to do a multiple regression. I
expect you to know where these are.
The Variance-Covariance matrix is key to most variance calculations. Know where this is and
where it comes from.
Standardized regression coefficients (used extensively in some disciplines).
In your discipline, take note of the statistics presented in the literature.
Standardization is sometimes used to put variables on the same scale.
For example, if our slope (Y units per X unit) is meaningful in terms of the original scale
(e.g. mg phosphorus available per mg in the soil) we may want to keep the original scale for interpretation.
However, in other cases our scales may be arbitrary. For example, if we are trying to
predict a Freshman's first semester college performance (scale 0 to 4) from SAT verbal
(scale 200 to 800), ACT (scale 0 to 36) and High School GPA (scale 0 to 4), then the arbitrary
scales may confuse and complicate the study. We could “standardize” the 4 variables so
that all have a mean = 0 and variance = 1.
Since the original scales are arbitrary we lose little by doing this. The resulting regression
would have regression coefficients that are without scale, and whose “relative size” would
give an indication of the “relative importance” or “relative impact” of the variable in
determining the predicted value. So, the standardized regression coefficients
are relative measurements and have no units.
We might be interested in the standardized regression coefficients. We saw earlier that the
Ltofstay variable had the largest regression coefficient, but not the largest t value. Note that
the Std. Reg. Coeff. match the t-value pattern for the 3 most significant variables, not the
raw reg. coeff.
However, the standardized regression coefficients do not match the t value pattern for all
variables. XRay was the third most significant variable, with the third largest t value, but three variables
that were not significant (smaller t-values) had larger absolute values for the std. reg. coeff.
(NOBEDS, NURSES and SERVICES). What's up? More later.
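The identity behind standardized coefficients can be checked numerically. A minimal numpy sketch with made-up data (hypothetical numbers, not the SENIC values): after standardizing every variable to mean = 0 and variance = 1, each fitted coefficient equals the raw coefficient times s_x / s_y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: predict college GPA from three predictors
# on arbitrary scales (these are NOT the SENIC data).
n = 100
sat = rng.normal(500, 100, n)      # roughly the 200-800 scale
act = rng.normal(21, 5, n)         # roughly the 0-36 scale
hs_gpa = rng.normal(3.0, 0.5, n)   # 0-4 scale
gpa = 0.002 * sat + 0.03 * act + 0.5 * hs_gpa + rng.normal(0, 0.3, n)

X = np.column_stack([sat, act, hs_gpa])

# Raw least-squares fit with an intercept; b[0] is the intercept
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(Xd, gpa, rcond=None)[0]

# Standardize all variables (mean 0, variance 1) and refit; no intercept
# is needed because everything is centered
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
zy = (gpa - gpa.mean()) / gpa.std(ddof=1)
b_std = np.linalg.lstsq(Z, zy, rcond=None)[0]

# Equivalent shortcut: b_std_j = b_j * s_xj / s_y
shortcut = b[1:] * X.std(axis=0, ddof=1) / gpa.std(ddof=1)
print(np.allclose(b_std, shortcut))  # True
```

Because the standardized coefficients are unit-free, their relative sizes can be compared across predictors, which is exactly the use described above.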
Partial R2
Recall that R2 is SSModel / SSTotal (both corrected). However, since we are not very
interested in the overall model, could we get an R2-type statistic for each variable? Of course
we could; in fact there are four.
What would you imagine the individual R2 values to be? You might guess the Type I or Type
II SS divided by the total.
Congratulations, you just invented the “Squared semi-partial correlation Type I” and the
“Squared semi-partial correlation Type II”.
Squared semi-partial correlation: TYPE I = SCORR1 = SeqSSXj / SSTotal
Squared semi-partial correlation: TYPE II = SCORR2 = PartialSSXj / SSTotal
But think about the Extra SS we talked about earlier. When variables go into a model they
account for some of the SSTotal, and that fraction of the SS is not available to later variables,
or it may even enhance the SS accounted for by later variables. Doesn't it seem that we should
look at the SS available to a variable when we consider how much SS it accounts for?
This is more in keeping with the concept of “Partial” SS we talked about earlier.
So, when a variable (say X1) enters the model after the other variables, what SS are available to
it? Obviously, the SS it accounted for was available (SSX1|X2,X3). And the SSError was
also available, though not accounted for.
So, the SS available to each variable is the part it accounts for (SSXj|all other variables), plus
the part no variable accounts for (SSError). If we use this as the available SS to be accounted
for, instead of the SSTotal, we have the “Squared Partial correlations”.
These are calculated as
Squared partial correlation TYPE I = PCORR1 = SeqSSXj / (SeqSSXj + SSError*)
* Note that for sequential SS the error changes as each variable enters. This must be
taken into account.
Squared partial correlation TYPE II = PCORR2 = PartialSSXj / (PartialSSXj + SSError)
So how are these used?
The interpretation is similar to that of the R2, except that it is a fraction for each variable.
The ones that make the most sense to me are:
For models using the Type I SS, the Squared semi-partial correlation TYPE I
For models using the Type II SS, the Squared partial correlation TYPE II
Since the Type I SS sum to the SSReg, the Semi-partial R2 Type I will sum to the overall R2.
Since the Partial SS may sum to more or less than the SSReg, and the denominator is not the
SSTotal, the sum of these partial R2 values is unpredictable.
Since we use the Type II SS for this problem we would probably be interested in the Squared
partial correlation Type II.
Note that these values follow a pattern more similar to the t values than the standardized
regression coefficients. The largest tend to match the significant variables, and in the same
order. However, the match is not perfect across all variables.
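As a sketch of how these Type II quantities could be computed by hand (hypothetical data, not the SENIC output): the partial (Type II) SS for each Xj is the drop in SSE when Xj joins a model already containing all the other variables.

```python
import numpy as np

def sse(X, y):
    """SSE from an OLS fit of y on X (X already contains the intercept column)."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

rng = np.random.default_rng(1)
n = 80
X = rng.normal(size=(n, 3))                        # hypothetical predictors
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

ones = np.ones((n, 1))
sse_full = sse(np.hstack([ones, X]), y)
ss_total = float(((y - y.mean()) ** 2).sum())

scorr2, pcorr2 = [], []
for j in range(X.shape[1]):
    reduced = np.hstack([ones, np.delete(X, j, axis=1)])
    partial_ss = sse(reduced, y) - sse_full              # Type II (partial) SS for X_j
    scorr2.append(partial_ss / ss_total)                 # squared semi-partial, Type II
    pcorr2.append(partial_ss / (partial_ss + sse_full))  # squared partial, Type II

print([round(v, 3) for v in scorr2])
print([round(v, 3) for v in pcorr2])
```

Note that each squared partial correlation is at least as large as the corresponding semi-partial, because its denominator (PartialSS + SSE) can never exceed SSTotal.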
So we have 3 statistics we can use to interpret the contribution or importance of the variables. I
consider the t-test to be the best test, and the standardized regression coefficients and partial
R2-type values to be ancillary statistics.
Variance Inflation Factor (VIF)
Now let's consider multicollinearity. Recall variance inflation. Is it possible that some other
variable is important, but does not show up because it is in competition with another variable
for the SS? Or that two multicollinear variables have inflated variance and do not have
significant t-tests because of this?
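A sketch of what the VIF measures, using made-up data (not the SENIC values): each VIF_j is 1 / (1 - R2_j), where R2_j comes from regressing Xj on all of the other independent variables.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing X_j on all
    the other columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        others = np.hstack([np.ones((n, 1)), np.delete(X, j, axis=1)])
        resid = xj - others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2 = 1 - float(resid @ resid) / float(((xj - xj.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical example: x2 nearly duplicates x0, mimicking the kind of
# relationship seen between NOBEDS and CENSUS
rng = np.random.default_rng(2)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + rng.normal(scale=0.1, size=200)   # highly collinear with x0
v = vif(np.column_stack([x0, x1, x2]))
print(v)                                    # first and last entries exceed 10
```

The uncorrelated variable x1 has a VIF near 1, while the two collinear variables blow past the rule-of-thumb threshold of 10.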
Two VIF values (NOBEDS and CENSUS) are greater than 10, a clear indication of problems.
A third variable (NURSES) is greater than 7. I would consider this a possible problem as well.
We have talked about this problem, but have not discussed what to do about it.
So, what do we do?
Statistics quote: Figures don't lie, but liars figure. - Samuel Clemens (alias Mark Twain)
Addressing Multicollinearity issues
First, we know that with multicollinearity problems we have potentially wide fluctuations in the
regression coefficients, and that we have inflated variances.
Since the problems are caused by two or more correlated variables, one obvious solution is to
leave off some of the variables that are correlated. This is often the easiest and best solution.
Fit a reduced model that has some subset of the original variables and where correlated
variables are reduced.
A second solution is to get more data. Sometimes additional data will bring out the differences
in the variables and reduce the correlation.
A third solution is “Ridge Regression”. This is an interesting regression, an example of where
a statistician may seek biased estimates in order to reduce variance. However, it can be hard to
interpret the results or to decide exactly how much bias is needed. I consider it a last resort, but
one I have used. We will not discuss this option in detail.
We will apply the first and best option to this example (reducing the number of variables).
When we reduce a model like this, we do it one variable at a time. We will use “Stepwise
regression” to do this. This section will come when we finish with the rest of Example 2.
Variance-Covariance Matrix
This section was deleted, but it is simply the (X'X)–1 matrix multiplied by the MSE. It is
available in the SAS program output. You are responsible for knowing what is in this matrix
and how it relates to the parameter estimate variances.
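A minimal numpy sketch of that calculation, on synthetic data (not the SAS output): the variance-covariance matrix of the estimates is MSE * (X'X)^-1, and its diagonal gives the squared standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 60, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])   # intercept + 2 X's
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
mse = float(resid @ resid) / (n - k - 1)    # error mean square

cov_b = mse * np.linalg.inv(X.T @ X)        # variance-covariance matrix of b
se_b = np.sqrt(np.diag(cov_b))              # standard errors of b0, b1, b2
print(se_b)
```

The diagonal entries are the parameter estimate variances; the off-diagonal entries are the covariances between estimates, which is what makes this matrix "key to most variance calculations."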
Correlation of Estimates (deleted)
You are not responsible for these.
Sequential Parameter Estimates
Recall that when multicollinearity exists, reg. coeff. can fluctuate greatly. In this section we
look to see if there are large fluctuations, or if the parameter estimates are stable.
The Sequential Parameter Estimates provide an additional indication of problems, and some
indication of which variables are affected by which others.
Collinearity Diagnostics
These are produced by the “collin” option on the model statement. If the condition index is greater than
about 30, multicollinearity is indicated.
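A sketch of the eigenvalue computation behind these diagnostics (the actual SAS COLLIN output also includes variance-decomposition proportions, omitted here): scale each column of X to unit length, take the eigenvalues of the scaled X'X, and report the condition indices sqrt(lambda_max / lambda_j).

```python
import numpy as np

def condition_indices(X):
    """Condition indices: scale each column of X (intercept included) to unit
    length, take eigenvalues of the scaled X'X, report sqrt(max/each)."""
    Xs = X / np.sqrt((X ** 2).sum(axis=0))
    lam = np.linalg.eigvalsh(Xs.T @ Xs)
    lam = lam[lam > 0]
    return np.sqrt(lam.max() / lam)

# Hypothetical near-duplicate column to force a large condition index
rng = np.random.default_rng(4)
n = 150
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.01, size=n)
X = np.column_stack([np.ones(n), x0, x1])
ci = condition_indices(X)
print(ci.max())                 # well above the rule-of-thumb 30
```

The smallest condition index is always 1 (the largest eigenvalue against itself); values above about 30 signal the kind of near-singularity that inflates coefficient variances.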
Multicollinearity summary
Collinearity Diagnostics are a very good tool to evaluate multicollinearity, perhaps the best.
VIF is also a very good tool, and is the most popular.
Simple correlations are good, but may miss higher order correlations (“multi”).
Sequential bi values are useful, but somewhat subjective.
You are not responsible for the sections titled:
Consistent Covariance of Estimates (deleted)
Test of First and Second Moment Specification (deleted)
Observation Diagnostics
The first columns are the value of Yi and the predicted value of Yi. You are responsible for
understanding these, along with the residual (the difference between these two values). These
have not changed from SLR.
You are not responsible for the Std Err Predict or the Std Err Residual. These are estimates of
standard deviations and have been adjusted by hii values.
You are responsible for the confidence intervals, Upper and Lower 95% MEAN and Upper and
Lower 95% PREDICT. These are confidence intervals for the regression line (Yhati) and for
individual points (Yi) respectively.
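A numerical sketch of these two intervals for simple linear regression (synthetic data; the 95% t critical value for 28 df, about 2.048, is hard-coded to keep the example self-contained):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Simple linear regression by least squares
sxx = float(((x - x.mean()) ** 2).sum())
b1 = float(((x - x.mean()) * (y - y.mean())).sum()) / sxx
b0 = y.mean() - b1 * x.mean()
mse = float(((y - b0 - b1 * x) ** 2).sum()) / (n - 2)

x0 = 5.0                     # point where we want the intervals
yhat = b0 + b1 * x0
t = 2.048                    # approx. two-sided 95% t critical value, n - 2 = 28 df

# Standard error for the regression line (mean response) at x0
se_mean = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / sxx))
# Standard error for an individual observation at x0: one extra MSE under the root
se_pred = np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))

print("95% CI for the mean:    ", (yhat - t * se_mean, yhat + t * se_mean))
print("95% prediction interval:", (yhat - t * se_pred, yhat + t * se_pred))
```

The prediction interval is always wider because of the extra MSE term, which is exactly the difference between the two variance formulas for the regression line and for an individual Yi.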
Recall that for simple linear regression Yi = b0 + b1Xi + ei.
The variance for Yhati (the regression line) is
    S^2 Yhat|X = MSE [ 1/n + (Xi - Xbar)^2 / Sum(Xi - Xbar)^2 ]
The variance for an individual observation, Yi = Yhati + ei, is
    S^2 Y|X = MSE + MSE [ 1/n + (Xi - Xbar)^2 / Sum(Xi - Xbar)^2 ]
            = MSE [ 1 + 1/n + (Xi - Xbar)^2 / Sum(Xi - Xbar)^2 ]
You are responsible for:
The Studentized residual, and perhaps more importantly the deleted studentized residual
The hat diag values (hii)
The remaining 3 diagnostics of interest are the influence diagnostics (DFFITS, DFFBetas
and Cook's D). You are NOT responsible for the column titled Cov Ratio.
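These diagnostics can be sketched directly from the hat matrix, on synthetic data (note that DFFITS and the deleted studentized residual properly use the deleted MSE, so the versions below based on the internally studentized residual are only approximations):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)
p = X.shape[1]                              # number of parameters (here 3)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix; h_ii is the leverage
h = np.diag(H)
resid = y - H @ y
mse = float(resid @ resid) / (n - p)

student = resid / np.sqrt(mse * (1 - h))    # internally studentized residual
cooks_d = student ** 2 * h / ((1 - h) * p)  # Cook's D
dffits = student * np.sqrt(h / (1 - h))     # approx. DFFITS (exact form uses deleted MSE)
print(h.sum())                              # leverages sum to p (= 3 here)
```

Large h_ii flags an Xi-space "outlier," while large Cook's D or DFFITS flags an observation that actually moves the fitted model, which is the distinction the questions above are driving at.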
Partial Residual Plots
These are “scatter plots” of the Y variable adjusted for all Xi except one, plotted on that Xi
adjusted for all other Xi.
I used these to get across the concept that not only are the Yi adjusted for each Xi, but the Xi are
also adjusted for each other.
Beyond this, these are used more like “scatter plots” than “residual plots”.
We can look for curvature, nonhomogeneous variance, etc.
If they appear to represent random scatter about zero it is because the variable does not
contribute anything to the model, not because it is a “residual plot”.
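A sketch of how the coordinates of such a plot could be computed (synthetic data; this is the added-variable construction, in which the slope through the plotted points reproduces the multiple-regression coefficient for that variable):

```python
import numpy as np

def added_variable_points(X, y, j):
    """Points for a partial (added-variable) plot for column j: residuals of y
    regressed on all X except X_j, against residuals of X_j on that same set."""
    n = X.shape[0]
    others = np.hstack([np.ones((n, 1)), np.delete(X, j, axis=1)])
    ry = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    rx = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    return rx, ry

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 3))                     # hypothetical predictors
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)

rx, ry = added_variable_points(X, y, 2)
slope = float((rx * ry).sum() / (rx * rx).sum())

# The slope through these points matches the full-model coefficient for X_j
b_full = np.linalg.lstsq(np.hstack([np.ones((n, 1)), X]), y, rcond=None)[0]
print(round(slope, 6), round(float(b_full[3]), 6))
```

This is the sense in which "not only are the Yi adjusted for each Xi, but the Xi are also adjusted for each other": both axes are residuals after adjusting for the remaining variables.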