Statistical Techniques II Page 52

Observation Diagnostics
The first columns are the value of Yi and the predicted value of Yi. You are responsible for understanding these, along with the residual (the difference between these two values). These have not changed from SLR.
You are not responsible for the Std Err Predict or the Std Err Residual. These are estimates of
standard deviations and have been adjusted by hii values.
You are responsible for the confidence intervals, Upper and Lower 95% MEAN and Upper and Lower 95% PREDICT. These are confidence intervals for the regression line (Yhati) and for individual points (Yi, often called prediction intervals), respectively.
Recall that for simple linear regression Yi = b0 + b1Xi + ei, where Yhati estimates the regression line and Yi = Yhati + ei is an individual observation.

The variance for Yhati (the regression line) is

  S2(Yhat) = MSE [ 1/n + (Xi - Xbar)2 / Sum(Xi - Xbar)2 ]

The variance for an individual observation, Yi = Yhati + ei, is

  S2(Y) = MSE [ 1 + 1/n + (Xi - Xbar)2 / Sum(Xi - Xbar)2 ]
        = MSE + MSE [ 1/n + (Xi - Xbar)2 / Sum(Xi - Xbar)2 ]

You are responsible for:
The Studentized residual, and perhaps more importantly the deleted studentized residual
(RSTUDENT).
The hat diag values (hii).
The remaining three diagnostics of interest are the influence diagnostics (DFFITS, DFBETAS and Cook's D). You are NOT responsible for the column titled Cov Ratio.
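These diagnostics are easy to verify by hand for a small example. The sketch below is Python (illustrative only; the course output comes from SAS), with made-up data, and computes the hat values hii, the two variances above, and both the studentized and deleted studentized (RSTUDENT) residuals using the standard SLR formulas.

```python
import math

# Made-up illustrative data (not the SENIC example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 2                              # p = parameters fit (b0, b1)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(ei ** 2 for ei in e) / (n - p)

for xi, ei in zip(x, e):
    h = 1 / n + (xi - xbar) ** 2 / sxx        # hat value h_ii
    var_line = mse * h                        # variance of Yhat_i (the line)
    var_indiv = mse * (1 + h)                 # variance of an individual Y_i
    student = ei / math.sqrt(mse * (1 - h))   # studentized residual
    # MSE recomputed with observation i deleted, then RSTUDENT
    mse_del = ((n - p) * mse - ei ** 2 / (1 - h)) / (n - p - 1)
    rstudent = ei / math.sqrt(mse_del * (1 - h))
    print(f"x={xi}: h={h:.3f} var(line)={var_line:.4f} "
          f"var(indiv)={var_indiv:.4f} r={student:.2f} t={rstudent:.2f}")
```

Note that the variance for an individual observation is always the variance of the line plus one extra MSE, so prediction intervals are always wider than confidence intervals for the mean.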
Partial Residual Plots
These are "scatter plots" of the Y variable adjusted for all Xi except one, plotted on that Xi adjusted for all other Xi.
I used these to get across the concept that not only are the Yi adjusted for each Xi, but the Xi are also adjusted for each other.
Beyond this, these are used more like "scatter plots" than "residual plots".
We can look for curvature, nonhomogeneous variance, etc.
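The double adjustment can be demonstrated numerically. The Python sketch below (made-up data and variable names; the course itself uses SAS) handles a two-predictor model: it adjusts Y for X2 and X1 for X2 by taking residuals from simple regressions, and the partial plot would graph one set of residuals against the other. A useful property of this plot is that the slope of the scatter equals the coefficient of X1 from the full two-predictor fit.

```python
# Made-up two-predictor data (hypothetical values and names).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
y = [1 + 2 * a + 0.5 * b + c for a, b, c in zip(x1, x2, noise)]

def resid(x, z):
    """Residuals from a simple linear regression of z on x (with intercept)."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    slope = (sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
             / sum((a - xbar) ** 2 for a in x))
    return [b - (zbar + slope * (a - xbar)) for a, b in zip(x, z)]

ey = resid(x2, y)    # Y adjusted for X2: vertical axis of the partial plot
ex1 = resid(x2, x1)  # X1 adjusted for X2: horizontal axis

# Slope of the partial plot (no-intercept fit of ey on ex1).
slope = sum(a * b for a, b in zip(ex1, ey)) / sum(a * a for a in ex1)
print(f"partial-plot slope = {slope:.4f}")   # equals b1 from the full model
```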
If they appear to represent random scatter about zero, it is because the variable does not contribute anything to the model, not because it is a "residual plot".

Ordinary Residual Plots (see SAS output)
Full Model with diagnostics
Plot of e*YHat. Legend: A = 1 obs, B = 2 obs, etc.

[SAS line-printer plot: Residual (roughly -2 to +2) plotted against the Predicted Value of InfRisk (2.0 to 7.0). The points scatter randomly about zero with no obvious pattern.]

The plot is drawn with SAS's default FORMCHAR line-drawing characters, which print as garbage on many devices; plain ASCII characters can be requested with
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";

Residual Analysis with PROC UNIVARIATE
This is an important procedure for evaluating residuals, especially for the assumption of normality.
The Shapiro-Wilk test values are: W = 0.985827, Pr < W = 0.8069. These results would lead you to FAIL to reject the hypothesis of normality. We conclude the observed results are consistent with a normal distribution.
The plots lead to the same conclusion.
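The logic of the normal probability plot can be sketched numerically: order the residuals and pair them with normal quantiles; near-normal data fall on a straight line. The Python fragment below is a rough illustration only (it computes a quantile correlation, not the Shapiro-Wilk W), using artificial values constructed to be near-normal.

```python
from statistics import NormalDist, fmean

# Artificial "residuals": exact normal quantiles, so the plot is a straight line.
nd = NormalDist()
resid = sorted(nd.inv_cdf((i + 0.5) / 20) for i in range(20))

# Expected normal order statistics (Blom-style plotting positions).
n = len(resid)
q = [nd.inv_cdf((i + 1 - 0.375) / (n + 0.25)) for i in range(n)]

# Correlation between ordered residuals and normal quantiles:
# close to 1 indicates the points hug the straight reference line.
rb, qb = fmean(resid), fmean(q)
num = sum((r - rb) * (t - qb) for r, t in zip(resid, q))
den = (sum((r - rb) ** 2 for r in resid) * sum((t - qb) ** 2 for t in q)) ** 0.5
corr = num / den
print(f"normal prob. plot correlation = {corr:.4f}")
```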
[Stem-and-leaf plot and boxplot of the residuals (multiply Stem.Leaf by 10**-1): the distribution is roughly symmetric about zero, with one observation flagged in the boxplot.]

[Normal probability plot of the residuals, ranging from about -2.3 to +2.5: the points follow the reference line closely, consistent with normality.]

We would also check for outliers, and again see no great problems. Obs #53 is too large, but only
one out of 113, so not entirely unexpected. This is consistent with our observations from the
RStudent values.

PROC GLM and PROC MIXED
For regression there is not much that is new with these procedures; PROC GLM and PROC MIXED can do the same analysis as PROC REG.
These procedures can provide both Type I SS and Type III SS with tests. PROC GLM provides Type I sums of squares by default; PROC MIXED provides them on request.
The tests of the Type III SS (or Type II SS) are identical to the t-tests of the regression coefficients.

Observation diagnostics (see computer output handout)
First we will discuss “observation diagnostics” and tests of the assumptions.
The "ALL" option produces a host of output, but not everything. The INFLUENCE, COLLIN and PARTIAL options are also needed for some additional output.
One difference from SLR is that where previously we used Xi we now use Yhat. For example,
residual plots are usually plotted on Yhat.
The variance calculations for multiple regression use matrix algebra and include all variances and covariances for the regression coefficients.
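For simple linear regression the matrix calculation can be checked against the scalar formula: the leverage x0'(X'X)^-1 x0 must equal 1/n + (x0 - Xbar)^2 / Sum(Xi - Xbar)^2. A Python sketch (made-up x values; the course itself uses SAS) inverting the 2x2 X'X by hand:

```python
# Made-up data; for SLR the design matrix is X = [1, x].
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
sx, sxsq = sum(x), sum(xi * xi for xi in x)

# X'X for the SLR design matrix, and its 2x2 inverse.
det = n * sxsq - sx * sx
inv = [[sxsq / det, -sx / det],
       [-sx / det,  n / det]]

def h(x0):
    """x0'(X'X)^-1 x0 for x0 = (1, x0): the leverage of a point at x0."""
    v = [inv[0][0] + inv[0][1] * x0, inv[1][0] + inv[1][1] * x0]
    return v[0] + v[1] * x0

# Scalar version of the same quantity, from the SLR formula.
xbar = sx / n
sxx = sum((xi - xbar) ** 2 for xi in x)

for x0 in (1.0, 3.0, 7.0):
    scalar = 1 / n + (x0 - xbar) ** 2 / sxx
    print(f"x0={x0}: matrix={h(x0):.6f} scalar={scalar:.6f}")
```

The two columns agree at every x0, including x0 = 7.0 outside the observed range, where the leverage (and therefore the variance of Yhat) grows rapidly.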
Using Studentized residuals: Bonferroni adjustment
Doing more tests increases your chance of error. It is possible to do 20, 100, even 1000 tests and have no Type I errors (at α = 0.05), but the chance of an error goes up. The rate of increase is not linear, so twice as many tests does not double your chance of error.
However, as an approximation, Bonferroni noted that the probability of error would be NO GREATER than the sum of the α values of the individual tests.
For example, do one test at α and have probability of error α. Two tests, and you have no more than a 2α chance of error. Ten tests, and the error rate is < 10α.
This Bonferroni concept suggests a simple fix. If we were to do 2 tests at α/2, then the two tests together would have no more than a 2*(α/2) error rate, giving us α overall.
If we were to do 10 tests at α/10, then the ten tests together would have no more than a 10*(α/10) error rate (= α).
Two-tailed tests already use α/2, so we actually want α/4 for two tests and α/20 for 10 tests.
To make this correction, simply choose the t value to reflect the smaller α value. For studentized residuals use t(α/2n) with n-p d.f.
For deleted residuals use t(α/2n) with n-p-1 d.f., where there is an extra "-1" because of the deleted value.
For our numerical analysis the critical values would be
t(α/2), n-p d.f. = 2.144788596 (unadjusted)
t(α/2n), n-p-1 d.f. = 3.621389624 (Bonferroni adjusted)
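The arithmetic behind these statements can be checked directly. A short Python sketch (it assumes independent tests in order to compute the exact familywise rate; the Bonferroni bound itself needs no such assumption):

```python
alpha, m = 0.05, 10

# Chance of at least one Type I error in m independent tests at level alpha.
fwer_naive = 1 - (1 - alpha) ** m
# Bonferroni correction: run each test at alpha/m instead.
fwer_bonf = 1 - (1 - alpha / m) ** m

print(f"{m} tests at alpha:   familywise error = {fwer_naive:.4f}")
print(f"{m} tests at alpha/m: familywise error = {fwer_bonf:.4f}")
print(f"Bonferroni bound: m * (alpha/m) = {m * (alpha / m):.4f}")
```

With 10 tests at α = 0.05 the familywise error rate comes out near 0.40, less than the bound 10α = 0.5 as promised, while testing at α/10 keeps the familywise rate just under 0.05.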
The RSTUDENT value for one observation (#17) exceeds this value and is a probable outlier.
Variable Selection
We have previously discussed the concept of partial sums of squares and partial regression
coefficients.
As you know, the addition or removal of any variable will change the coefficients of all the other variables in the model. Therefore, if you decide to add or remove variables from a model, this should be done one variable at a time.
Stepwise Variable Selection
The procedure has been formalized in several options. We will discuss a few of these: Forward Selection, Backward Selection and Stepwise Selection.
One additional reason for reducing the model in Example 2 is that we had multicollinearity.
Stepwise regression is not specifically designed to avoid multicollinearity, but it will tend not to pick up two variables that are collinear.
Backward selection is the simplest. It starts with the full model, with all variables of interest already present.
A selection criterion is established; perhaps we want no nonsignificant variables in the model (α = 0.05). Variables failing the criterion are then removed, one at a time, until all remaining variables meet it.
See the SAS output.
Forward selection
Forward selection works by calculating all possible simple linear regressions and picking the best one to start with.
Again, the F-test of the Type II SS, or the t-test of the slopes, is used as the criterion for selection.
The "best" variable is the most significant one, as long as it meets a minimum criterion.
Once chosen, this best variable will remain in the model for the whole analysis.
After picking the one best variable, the analysis checks all possible 2-factor models, trying each of the remaining variables together with the first one chosen.
If there are additional variables that meet the criterion, the analysis chooses the best of these.
The step is repeated until no remaining variables meet the criterion.
Stepwise selection
There is a variation of FORWARD selection called the "Stepwise" option, requested by "selection=stepwise".
This is like forward selection, except that at each step the analysis checks to make sure that each variable already in the model still meets the criterion. If a variable falls below the criterion, it will be removed.
Think of it as forward selection with a backward glance.
There is one additional option that can be useful among the selection options. You can specify INCLUDE=#. This will force SAS to keep the first # variables in the model; they will be in the model to start with and will not be removed. This is good if you have a base model you want to keep intact and want to check for additional variables.
I ran the following program to force a larger model:

proc reg data=SENIC;
  model InfRisk = LtofStay Age CulRatio XRay NoBeds Census Nurses
                  Services / selection=stepwise sle=0.5 sls=0.5;
run;

and got these results:
Step  Variable  Number   Partial   Model
      Entered   Vars In  R-Square  R-Square  C(p)     F Value  Pr > F
1     CulRatio     1      0.3127    0.3127   41.5161   50.49   <.0001
2     LtofStay     2      0.1377    0.4504   13.3525   27.57   <.0001
3     Services     3      0.0430    0.4934    5.9368    9.25   0.0029
4     XRay         4      0.0227    0.5161    2.9592    5.07   0.0263
5     Nurses       5      0.0036    0.5197    4.1729    0.80   0.3731
6     NoBeds       6      0.0029    0.5226    5.5431    0.64   0.4260
7     Age          7      0.0023    0.5249    7.0440    0.50   0.4795

(No variables were removed at any step.)

Other criteria used to determine the number of variables to retain
As you know, when a variable is added to a model the usual R2 always gets larger. The adjusted R2 is "adjusted" such that the value will not get larger unless the variation accounted for by the variable is equal to at least one MSE. Otherwise this value can actually decrease.
The ordinary R2 always increases, while the adjusted R2 value starts decreasing after the 4th variable is added.
[Plot of R2 and adjusted R2 (roughly 0.2 to 0.55) against the number of variables, 0 to 9.]
Plotting the MSE suggests a similar result.
[Plot of MSE (roughly 0.8 to 1.3) against the number of variables, 0 to 9.]
Mallow's C(p) statistic is supposed to indicate the "best" model when C(p) = p.
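The adjusted R2 behavior can be reproduced from the Model R-Square column of the stepwise table above, since adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1) for k variables and n = 113 observations. A short Python check:

```python
n = 113                     # observations in the SENIC data
model_r2 = [0.3127, 0.4504, 0.4934, 0.5161, 0.5197, 0.5226, 0.5249]

# Adjusted R2 for k = 1..7 variables (k + 1 parameters with the intercept).
adj = [1 - (1 - r2) * (n - 1) / (n - k - 1)
       for k, r2 in enumerate(model_r2, start=1)]

for k, (r2, a) in enumerate(zip(model_r2, adj), start=1):
    print(f"{k} variables: R2 = {r2:.4f}, adjusted R2 = {a:.4f}")

best = max(range(len(adj)), key=adj.__getitem__) + 1
print(f"adjusted R2 peaks at {best} variables")
```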
Mallow's C(p) statistic depends on the full model being a pretty good model with no multicollinearity. This is not always true, and as we know it is probably not true for this example.
[Plot of C(p) (0 to 45) against the number of variables, 1 to 7.]
Other selection criteria
The information indices (AIC, BIC and SBC) will be discussed later.
[Plot of AIC, BIC, SBC and C(p) against the number of variables, 1 to 9.]

Multicollinearity of the Reduced Model
Question! Did the stepwise selection and the resulting reduced model cure our "multicollinearity" problems? I reran the reduced model with options to get the collinearity diagnostics.
Variable     Variance Inflation
Intercept    0
LtofStay     1.35777
CulRatio     1.28045
XRay         1.33266
Services     1.15566

The mean VIF is less than two, and no value even reaches the mean, much less the criterion of 10.
Also, the highest condition number was only 16, well below the criterion of 30. We conclude the model selected with stepwise regression clearly has no multicollinearity problems.
Note, however, that stepwise selection does not ALWAYS fix problems with multicollinearity.
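With only two predictors the VIF has a closed form, VIF = 1/(1 - r12^2), where r12 is their correlation, which makes it easy to see how fast near-duplication inflates variances. A Python sketch with made-up predictor values (the names xa, xb, xc are hypothetical):

```python
def vif2(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r12^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r12 = s12 / (s11 * s22) ** 0.5
    return 1 / (1 - r12 ** 2)

xa = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xb = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]   # nearly a copy of xa: collinear
xc = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]   # only loosely related to xa

print(f"VIF(xa, xb) = {vif2(xa, xb):.1f}")   # far above the criterion of 10
print(f"VIF(xa, xc) = {vif2(xa, xc):.2f}")   # comfortably below 10
```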
The reduced model has 4 significant variables. The full model had only 3 significant variables, and they were not the same variables that were significant. The reduced model also has an R2 of 51.61%, while the full model had an R2 of 52.51%.
The simpler model with nearly the same R2 value is most likely a superior model.
The R2 selection option
There is one other "variable selection" option that is very interesting. It is quite different from the stepwise selection method.
Suppose that you are going to fit a model with a number of variables; let's call them a, b, c, d, e, and f. What happens if stepwise selection chooses one set of variables, but for some reason you prefer a different set?
For example, suppose you feel that variables a, b and d should be the best variables, but stepwise selects a, b and e. How much better is that model than the one that you feel is best? Or suppose that variable d is inexpensive and easy to measure while c is expensive and difficult. If you use d instead of c, how much do you lose?
We will examine the RSQUARE selection option. This procedure will show you the best models, not just one but several. It will also show you how good larger (more variables) and smaller (fewer variables) models might be. The major criterion here is the value of R2, which is something of a limitation.
To request the procedure, ask for the model option "selection=rsquare". I also included the options "start=3 stop=6 best=8". This instructs SAS to start with 3-variable models, go up to 6 variables, and show me the best 8 models for each number of variables.
As requested, the RSQUARE selection option first produces the best eight 3-factor models (plus intercept).
Number   R-Square     Variables in Model
3        0.49340010   LTOFSTAY CULRATIO SERVICES
3        0.48523075   LTOFSTAY CULRATIO NURSES
3        0.47356336   LTOFSTAY CULRATIO NOBEDS
3        0.47347050   LTOFSTAY CULRATIO CENSUS
3        0.46955655   LTOFSTAY CULRATIO XRAY
3        0.46300398   CULRATIO XRAY SERVICES
3        0.46191250   CULRATIO XRAY CENSUS
3        0.45384242   CULRATIO XRAY NURSES

And then the best 4-factor, 5-factor, etc.
Number   R-Square     Variables in Model
4        0.51613081   LTOFSTAY CULRATIO XRAY SERVICES
4        0.51023237   LTOFSTAY CULRATIO XRAY NURSES
4        0.50002851   LTOFSTAY CULRATIO XRAY CENSUS
4        0.49971593   LTOFSTAY CULRATIO XRAY NOBEDS
4        0.49556642   LTOFSTAY CULRATIO NURSES SERVICES
4        0.49556459   LTOFSTAY AGE CULRATIO SERVICES
4        0.49348607   LTOFSTAY CULRATIO NOBEDS SERVICES
4        0.49341314   LTOFSTAY CULRATIO CENSUS SERVICES

The best model we found was a 4-factor model. Here we can check for alternative 4-factor models. Note that frequently very little is lost by replacing one or two variables with different variables, often less than a few percentage points on the R2 value. The other variables may be more interpretable, more reliably measured, cheaper and easier to measure, or have some other advantage.
Other Regression Topics
As mentioned earlier, the intercept for our last problem was not very meaningful (when all Xi equal zero we have no beds, no nurses, a length of stay of zero days, etc.). This is not an uncommon problem. In evaluating the abundance of marine organisms against salinity, temperature and depth, for example, a salinity of zero is not a marine environment, a temperature of zero is not liquid, and a depth of zero is not wet, so the intercept is meaningless.
So, if you want to plot your data against one of the Xi values, what can you do? If you just extract the intercept and slope of interest, you are essentially setting all other Xi equal to zero. This can lead to unreasonable values of Yhat even if you do not show the intercept.

Yhati = b0 + b1X1i + b2X2i + b3X3i + b4X4i
Yhati = b0 + b1X1i + b2(0) + b3(0) + b4(0)
Yhati = b0 + b1X1i
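A numeric illustration (all coefficients, means and values below are hypothetical, not from the SENIC fit): zeroing out the other predictors shifts every Yhat by a large constant. One common remedy, holding the other Xi at their sample means instead, is shown for comparison; that remedy is this sketch's assumption, not something stated in the excerpt.

```python
# Hypothetical fitted coefficients b0..b4 (illustrative only).
b0, b1, b2, b3, b4 = 1.0, 0.5, 2.0, -1.0, 3.0
# Hypothetical sample means of the other predictors.
x2bar, x3bar, x4bar = 10.0, 6.0, 2.0

def yhat(x1, x2, x3, x4):
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4

x1 = 5.0                                   # the predictor we want to plot on
at_means = yhat(x1, x2bar, x3bar, x4bar)   # other Xi held at their means
at_zero = yhat(x1, 0.0, 0.0, 0.0)          # other Xi silently set to zero

print(f"Yhat with other Xi at their means: {at_means}")  # 23.5
print(f"Yhat with other Xi set to zero:    {at_zero}")   # 3.5
```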