Statistical Techniques II Page 44
Regression in GLM
PROC GLM and PROC MIXED do regression, but do not have all of the regression diagnostics
available that we find in PROC REG.
However, they do have a few advantages.
They facilitate the inclusion of class variables (something we will be interested in later), and
They provide tests of both Type I and Type II SS (as well as Types III and IV).
The formatting is different, but most of the same information is available.
Tests of both Type I and Type III SS are given by default.
Note that the Type II and Type III SS are the same as the extra SS in PROC REG, but here tests are provided. These F test values are calculated by dividing each SS (sequential or partial) by the MSE.
Also note that the t-tests of the parameter estimates are the same as the tests of the Partial SS.
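This equivalence can be checked numerically. Below is a minimal sketch in Python with numpy (made-up data, not the course example): the F test formed by dividing a variable's partial SS by the MSE equals the square of the t test on that variable's coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1, x2 = rng.normal(size=(2, n))
y = 2.0 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=n)

def sse(y, X):
    """Residual sum of squares from a least-squares fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

X_full = np.column_stack([np.ones(n), x1, x2])
X_red  = np.column_stack([np.ones(n), x1])      # model with x2 dropped

mse = sse(y, X_full) / (n - 3)                  # 3 parameters in the full model
partial_ss_x2 = sse(y, X_red) - sse(y, X_full)  # partial (extra) SS for x2
F = partial_ss_x2 / mse                         # F test of the partial SS (1 df)

# t statistic for b2 from the usual formula b2 / se(b2)
b = np.linalg.lstsq(X_full, y, rcond=None)[0]
cov = mse * np.linalg.inv(X_full.T @ X_full)    # variance-covariance matrix
t = b[2] / np.sqrt(cov[2, 2])

print(f"F = {F:.6f}, t^2 = {t**2:.6f}")         # the two agree
```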
More material and summarization of multiple regression will be done with the second example.

Multicollinearity
An important consideration in multiple regression is the effect of correlation among independent
variables. There is a problem that exists when two independent variables are very highly
correlated. The problem is called multicollinearity.
At one extreme of this phenomenon is the case where two independent variables are perfectly correlated. This results in “singularity”: an X'X matrix that cannot be inverted.
To illustrate the problem, take the following data set.
    Y    X1    X2
    1     1     2
    2     2     3
    3     3     4

If entered in PROC REG, SAS will report problems and will fit only the first variable, since the second one is perfectly correlated. Suppose we did want to fit both parameters for X1 and X2; what bi values could we get? The table below shows some possible values for b1 and b2.
Acceptable values of b0, b1 and b2 in the model Yi = b0 + b1*X1i + b2*X2i:

    b0          b1          b2
    0           1           0
    -1          0           1
    99          100         -99
    999         1000        -999
    -101        -100        101
    -1001       -1000       1001
    -1000001    -1000000    1000001

There are an infinite number of solutions when singularity exists, and that is why no program can, or should, fit the parameter estimates.
But suppose that I added the value 0.0000000001 to one of the Xi observations.
Now the two independent variables are not PERFECTLY correlated!!! SAS will report no
error and will give a solution.
How good is that solution? Remember how the bi values could go way up or way down as long
as they were balanced by the other?
    b0       b1       b2
    0        1        0
    -1       0        1
    99       100      -99
    999      1000     -999
    -101     -100     101
    -1001    -1000    1001

Typically, when very high correlations exist (but NOT perfect correlations), small changes in the data result in large fluctuations of the regression coefficients.
Basically, under these conditions, the regression coefficient estimates are useless.
Also, the variance estimates are inflated.
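To see this numerically, here is a small sketch in Python (numpy) using the three-observation example above, with one X2 value slightly perturbed so the correlation is no longer perfect. Changing a single Y value by 0.000001 flips the fitted coefficients from roughly (0, 1, 0) to roughly (-1, 0, 1), two of the rows in the table above.

```python
import numpy as np

# The toy data from the notes: X2 = X1 + 1, except for a tiny perturbation
# of one observation, so the columns are almost (not quite) perfectly correlated.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 3.0, 4.0])
x2[2] += 1e-6
X = np.column_stack([np.ones(3), x1, x2])

def fit(y):
    """Solve the normal equations (X'X) b = (X'y), as a regression program would."""
    return np.linalg.solve(X.T @ X, X.T @ y)

b_a = fit(np.array([1.0, 2.0, 3.0]))
b_b = fit(np.array([1.0, 2.0, 3.0 + 1e-6]))   # change one Y value by 0.000001

print("fit A:", b_a)   # roughly ( 0, 1, 0)
print("fit B:", b_b)   # roughly (-1, 0, 1)
```

The condition number of X'X here is on the order of 10^12, which is why such a microscopic change in the data swings the coefficients this violently.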
So how do we detect these problems?
First, look at the simple correlations among the Xi variables, produced by PROC REG in the summary statistics section.
For the Phosphorus example
Correlation

    CORR    X1       X2       X3       Y
    X1      1.0000   0.4616   0.1520   0.6934
    X2      0.4616   1.0000   0.3175   0.3545
    X3      0.1520   0.3175   1.0000   0.3617
0.3617 Large correlations (usually > 0.9) can indicate potential multicollinearity problems.
However, these statistics alone are not enough to detect multicollinearity. It is possible that there is no pairwise correlation, but that some combination of Xi variables correlates with some other combination. So we need another statistic to address this.
The Variance Inflation Factor (VIF) is the statistic most commonly used to detect this problem.
For the Phosphorus example
    Variable    Tolerance     Variance Inflation
    INTERCEP    .             0.00000000
    X1          0.78692352    1.27077152
    X2          0.72432171    1.38060199
    X3          0.89915421    1.11215627
1.11215627 VIF values over 5 or 10, or a mean of the VIF values much over 2 indicate potential problems
with multicollinearity.
Tolerance is just the inverse of the VIF, so as the VIF goes up, Tolerance goes down. Both can be used to detect multicollinearity. We will ignore Tolerance.
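The VIFs can be reproduced from the correlation matrix of the independent variables alone: VIFj = 1/(1 - Rj²), where Rj² comes from regressing Xj on the other X variables, and this is also the j-th diagonal element of the inverse of that correlation matrix. A quick check in Python (numpy) against the Phosphorus output above:

```python
import numpy as np

# Correlation matrix of X1, X2, X3 from the Phosphorus output above.
R = np.array([[1.0000, 0.4616, 0.1520],
              [0.4616, 1.0000, 0.3175],
              [0.1520, 0.3175, 1.0000]])

# VIFs are the diagonal of the inverse correlation matrix;
# Tolerance is the reciprocal of each VIF.
vif = np.diag(np.linalg.inv(R))
tol = 1.0 / vif

for name, v, t in zip(["X1", "X2", "X3"], vif, tol):
    print(f"{name}: VIF = {v:.5f}, Tolerance = {t:.5f}")
```

The diagonals recover the printed values (1.27077, 1.38060, 1.11216) to within the rounding of the correlations.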
Another criterion for the evaluation of multicollinearity is called the “Collinearity Diagnostics”. The value examined is the “condition number” or “condition index”.
This criterion does not provide a P-value.
The last condition number is examined, and values of 30 to 40 are considered to indicate probable multicollinearity.
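Each condition index is the square root of the largest eigenvalue divided by that row's eigenvalue; the last (largest) index is the condition number examined above. As a quick sketch in Python (numpy), the indices in the output below can be recovered from the printed eigenvalues:

```python
import numpy as np

# Eigenvalues of the scaled X'X matrix, from the SAS output below.
eig = np.array([7.92221, 0.70667, 0.23449, 0.04727, 0.03554,
                0.02878, 0.01657, 0.00546, 0.00301])

# Condition index j = sqrt(largest eigenvalue / eigenvalue j);
# the last one (about 51.3 here) is the condition number.
cond_index = np.sqrt(eig[0] / eig)
print(np.round(cond_index, 4))
```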
Collinearity Diagnostics

                   Condition   ------------- Proportion of Variation -------------
Number  Eigenvalue   Index     Intercept   LtofStay    Age         CulRatio  XRay
1       7.92221     1.00000    0.00008203  0.00026608  0.00008384  0.00241   0.00055647
2       0.70667     3.34824    0.00062369  0.00113     0.00066774  0.02105   0.00504
3       0.23449     5.81248    0.00150     0.00158     0.00222     0.67694   0.00076682
4       0.04727    12.94588    0.00082735  0.04556     0.00039099  0.00635   0.00030739
5       0.03554    14.93049    0.00123     0.00104     0.00130     0.09639   0.43090
6       0.02878    16.59008    0.01898     0.01720     0.02926     0.06172   0.48518
7       0.01657    21.86586    0.03592     0.60385     0.02075     0.05279   0.05646
8       0.00546    38.09407    0.03300     0.27836     0.00195     0.01730   0.00245
9       0.00301    51.29072    0.90784     0.05102     0.94338     0.06505   0.01834

Multiple Regression Variable Diagnostics (See Appendix 1)

So, multiple regression differs from the SLR in that it has several variables.
We need new statistics to examine parameter estimates from these variables, and to determine if
there are problems among the variables.
I will collectively refer to these as the “variable diagnostics” and these will be covered in the next
section.
Recall that multicollinearity can cause the regression coefficients to fluctuate greatly. Examining
the Sequential Parameter Estimates for large fluctuations as variables enter is another indicator of
multicollinearity.
Sequential Parameter Estimates

    INTERCEP        X1              X2             X3
    81.277777778    0               0              0
    59.258958792    1.8434360081    0              0
    56.251024085    1.7897741162    0.08664925     0
    43.652197791    1.7847796802    0.083397057    0.161132691

Before looking at output, let's consider our objectives.
Objectives can vary.
You are probably interested in testing for relationships (actually “partial” correlations) between
the various independent variables and the dependent variable.
On the one hand, you may be interested in the effect of each and every variable included in
the analysis. If this is the case, you are probably particularly interested in the parameter
estimates.
And, you will probably be interested in confidence intervals on these parameter estimates, or
on testing them against certain hypothesized values.
Since you are interested in all of the variables, you are probably not interested in removing
any variables from the model.
Our first example may have been this type of analysis. We wanted to examine the effect of
3 soil phosphorus components, and would present the results for all 3, even though only one
was significant, because all 3 are of interest.
On the other hand, you may not be interested in all of the variables. You may not know which
variables are important, and your objective may be to determine which ones are important.
In this case you may want to keep and discuss all of the variables, or you may want to select
the important variables and present them in a smaller (reduced) model with only significant
variables.
Our second model is more likely to be of this type. There are 8 variables, and most likely
not all are important. Our objective here is probably to determine which ones are correlated
to the dependent variable. We may start with a full model, but will then probably reduce the
model to some subset of significant variables.
A larger multiple variable example (Appendix 8): from Neter, Kutner, Nachtsheim and Wasserman. 1996. Applied Linear Statistical Models. Irwin Publishing Co.

It is a sample of various hospitals. Data were taken for the “Study of the Efficacy of Nosocomial Infection Control” (SENIC Project).
Each observation is for one hospital. There were 113 observations.
The variables are as follows:
    Identification number
    Average length of stay of all patients (in days)
    Average age of all patients (years)
    Average infection risk for the hospital (in percent); this was taken as the dependent variable
    Routine culture ratio of patients without symptoms to patients with symptoms
    Routine chest X-ray ratio for patients with and without symptoms of pneumonia
    Average number of beds (during study)
    Med school affiliation (1 = Yes, 2 = No)
    Region (NE = 1, NC = 2, S = 3, W = 4)
    Average no. of patients during study
    Average no. of nurses
    Percent of 35 potential service facilities that were actually provided by the hospital

Infection risk was used as the dependent variable.
Eight other variables were considered independent variables.
The two class (categorical, group) variables were omitted from the analysis (Med school
affiliation and Region).
Identification number was not used.
See the computer output. Selected output was omitted as trivial or not of interest.
Interpretation and evaluation

The PROC REG statement is given on the second page.
Only one new option is included here, the COLLIN option on the model statement.
We have seen all of the other output down through the PROC UNIVARIATE; the PROC GLM output is excluded.
The Stepwise regressions at the end of the output are new.
So, what are we looking for in this regression?
Which variables are important, if any?
Are there problems with Multicollinearity, and what do you do about it anyway?
Are there any problems with the observations?
Yi variable outliers?
Xi variable “outliers”? Influential observations?
Are the assumptions for linear regression met? Normality? Independence? Homogeneity of
variance? Are the Xi variables measured without error?
And what are our objectives? If we want to find the best model for predicting infection risk, do we need all 8 variables? How do we go about removing the ones we don't want?
See the computer output.

Some output has been deleted to save space. This is information that I considered very simple, or information that we will not cover this semester:
    Descriptive Statistics
    Uncorrected SS and Crossproducts
    Correlation section
    Model Crossproducts X'X X'Y Y'Y
    X'X Inverse, Parameter Estimates, SSE

Matrix solution
We did not cover matrix algebra in detail, but I expect you to know something about these.
As with the SLR, we need all SS and CP from the variables to do a multiple regression. I
expect you to know where these are.
The Variance-Covariance matrix is key to most variance calculations. Know where this is and where it comes from.
Standardized regression coefficients (used extensively in some disciplines).
In your discipline, take note of the statistics presented in the literature.
Standardization is sometimes used to put variables on the same scale.
For example, if our slope (Y units per X unit) is meaningful in terms of the original scale
(e.g. mg phosphorus available per mg in the soil) we may want to keep the original scale for
interpretative purposes.
However, in other cases our scales may be arbitrary. For example, if we are trying to predict a Freshman's first semester college performance (scale 0 to 4) from ACT (scale 0 to 36), SAT verbal (scale 200 to 800) and High School GPA (scale 0 to 4), then the arbitrary scales may confuse and complicate the study. We could “standardize” the 4 variables so that all have a mean = 0 and variance = 1.
Since the original scales are arbitrary we lose little by doing this. The resulting regression
would have regression coefficients that are without scale, and whose “relative size” would
give an indication of the “relative importance” or “relative impact” of the variable in
determining the predicted value. So, the standardized regression coefficients are relative measurements and have no units.
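A minimal sketch of this in Python (numpy), with made-up data on deliberately different scales: the standardized coefficients can be obtained either by standardizing every variable before fitting, or directly from the raw coefficients as b_std_j = b_j * s_Xj / s_Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two predictors on very different scales.
x1 = rng.normal(500.0, 100.0, n)    # e.g. a 200-800 style test score
x2 = rng.normal(3.0, 0.5, n)        # e.g. a 0-4 GPA
y = 0.002 * x1 + 0.8 * x2 + rng.normal(0.0, 0.3, n)

def ols(y, *xs):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(y, x1, x2)                  # raw coefficients, on incomparable scales

# Standardize every variable to mean 0, variance 1, then refit.
z = lambda v: (v - v.mean()) / v.std(ddof=1)
b_std = ols(z(y), z(x1), z(x2))

# Equivalent shortcut from the raw fit: b_std_j = b_j * s_Xj / s_Y
shortcut = b[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)
print(np.round(b_std[1:], 6), np.round(shortcut, 6))
```

The two routes agree, and the intercept of the standardized fit is zero, which is why standardized coefficients are reported without one.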
We might be interested in the standardized regression coefficients. We saw earlier that the Ltofstay variable had the largest regression coefficient, but not the largest t value. Note that the Std. Reg. Coeff., not the raw reg. coeff., match the t-value pattern for the 3 most significant variables.
However, the standardized regression coefficients do not match the t value pattern for all
variables!
XRay was the third significant variable, with the third largest t value, but three variables that were not significant (smaller t-values) had larger absolute values for the std. reg. coeff. (NOBEDS, NURSES and SERVICES). What's up? More later.
Partial R2

Recall that R2 is SSModel / SSTotal (both corrected). However, since we are not very interested in the overall model, could we get an R2-type statistic for each variable? Of course we could; in fact there are four.
What would you imagine the individual R2 values to be? You might guess the Type I or Type
II SS divided by the total.
Congratulations, you just invented the “Squared semipartial correlation Type I” and the
“Squared semipartial correlation Type II”.
Squared semipartial correlation: TYPE I  = SCORR1 = SeqSS(Xj) / SSTotal
Squared semipartial correlation: TYPE II = SCORR2 = PartialSS(Xj) / SSTotal

But think about the Extra SS we talked about earlier. When variables go into a model they
account for some of the SSTotal, and that fraction of the SS is not available to later variables,
or it may even enhance the SS accounted for by later variables. Doesn't it seem that we should
look at the SS available to a variable when we consider how much SS it accounts for?
This is more in keeping with the concept of “Partial” SS we talked about earlier.
So, when a variable (say X1) enters the model after the other variables, what SS are available to it? Obviously, the SS it accounted for was available (SS(X1 | X2, X3)). And the SSError was also available, though not accounted for.
So, the SS available to each variable is the part it accounts for (SS(Xj | all other variables)), plus the part no variable accounts for (SSError). If we use this as the available SS to be accounted for, instead of the SSTotal, we have the “Squared Partial correlations”.
These are calculated as
Squared partial correlation TYPE I  = PCORR1 = SeqSS(Xj) / (SeqSS(Xj) + SSError*)
    * Note that for sequential SS the error changes as each variable enters. This must be taken into account.
Squared partial correlation TYPE II = PCORR2 = PartialSS(Xj) / (PartialSS(Xj) + SSError)
So how are these used?
The interpretation is similar to that of the R2, except that it is a fraction for each
variable.
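A minimal sketch of the two Type II formulas in Python; the SS values here are hypothetical, chosen only to illustrate the calculations, not taken from the course output.

```python
# Hypothetical Type II (partial) SS for each variable, plus hypothetical
# SSError and corrected SSTotal -- made-up numbers for illustration only.
partial_ss = {"X1": 400.0, "X2": 30.0, "X3": 25.0}
ss_error = 1200.0
ss_total = 1800.0

# SCORR2: the variable's share of the TOTAL SS.
scorr2 = {v: ss / ss_total for v, ss in partial_ss.items()}

# PCORR2: the variable's share of the SS actually AVAILABLE to it
# (its partial SS plus the SS no variable accounts for).
pcorr2 = {v: ss / (ss + ss_error) for v, ss in partial_ss.items()}

for v in partial_ss:
    print(f"{v}: SCORR2 = {scorr2[v]:.4f}, PCORR2 = {pcorr2[v]:.4f}")
```

Because the available SS is never larger than the total SS, each squared partial correlation is at least as large as the corresponding squared semipartial correlation.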
James P. Geaghan  Copyright 2011