Statistical Techniques II Page 58
To request the procedure, specify the model option “selection=rsquare”. I also included the options
“start=3 stop=6 best=8”. These instruct SAS to start with 3-variable models, go up to 6-variable
models, and show the best 8 models for each number of variables.
As requested, the RSQUARE selection option first produces the best eight 3-factor models (plus
intercept).
Number in Model   R-square     Variables in Model
3                 0.49340010   LTOFSTAY CULRATIO SERVICES
3                 0.48523075   LTOFSTAY CULRATIO NURSES
3                 0.47356336   LTOFSTAY CULRATIO NOBEDS
3                 0.47347050   LTOFSTAY CULRATIO CENSUS
3                 0.46955655   LTOFSTAY CULRATIO XRAY
3                 0.46300398   CULRATIO XRAY SERVICES
3                 0.46191250   CULRATIO XRAY CENSUS
3                 0.45384242   CULRATIO XRAY NURSES

And then the best 4-factor, 5-factor, etc.
Number in Model   R-square     Variables in Model
4                 0.51613081   LTOFSTAY CULRATIO XRAY SERVICES
4                 0.51023237   LTOFSTAY CULRATIO XRAY NURSES
4                 0.50002851   LTOFSTAY CULRATIO XRAY CENSUS
4                 0.49971593   LTOFSTAY CULRATIO XRAY NOBEDS
4                 0.49556642   LTOFSTAY CULRATIO NURSES SERVICES
4                 0.49556459   LTOFSTAY AGE CULRATIO SERVICES
4                 0.49348607   LTOFSTAY CULRATIO NOBEDS SERVICES
4                 0.49341314   LTOFSTAY CULRATIO CENSUS SERVICES

The best model we found was a 4-factor model. Here we can check for alternative 4-factor
models. Note that frequently very little is lost by replacing one or two variables with different
variables, often less than a few percentage points on the R² value. The other variables may be
more interpretable, more reliably measured, cheaper and easier to measure, or have some other
advantage.
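The search that “selection=rsquare” performs can be sketched outside SAS. The following is a minimal Python sketch, assuming NumPy is available. The function names `r_squared` and `best_subsets`, the predictor names X1 to X7, and the simulated data are all illustrative assumptions; this is not the SAS implementation and not the SENIC data.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R-square of an ordinary least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - float(resid @ resid) / float(sst)


def best_subsets(X, y, names, start, stop, best):
    """For each model size from start to stop, keep the `best` highest R-square subsets."""
    results = {}
    for k in range(start, stop + 1):
        scored = []
        for cols in combinations(range(X.shape[1]), k):
            rsq = r_squared(X[:, list(cols)], y)
            scored.append((rsq, [names[c] for c in cols]))
        scored.sort(key=lambda item: item[0], reverse=True)
        results[k] = scored[:best]
    return results


# Simulated data with 7 hypothetical predictors (assumption, not the SENIC data)
rng = np.random.default_rng(0)
names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
X = rng.normal(size=(100, 7))
y = 1.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=100)

# Mirrors the options start=3 stop=6 best=8
table = best_subsets(X, y, names, start=3, stop=6, best=8)
for k, rows in table.items():
    for rsq, variables in rows:
        print(k, f"{rsq:.8f}", " ".join(variables))
```

With only 7 candidate predictors there are just 7 possible 6-variable models, so the last group lists fewer than 8 rows. An exhaustive search like this grows quickly with the number of candidate variables.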
Other Regression Topics
As mentioned earlier, the intercept for our last problem was not very meaningful (when all Xi
equal zero we have no beds, no nurses, a length of stay of zero days, etc.). This is not an
uncommon problem. In relating the abundance of marine organisms to salinity, temperature
and depth, for example, a salinity of zero is not a marine environment, a temperature of zero is not
liquid and a depth of zero is not wet, so the intercept is meaningless.
So, if you want to plot your data against one of the Xi variables, what can you do? If you just
extract the intercept and the slope of interest, you are essentially setting all other Xi equal to zero.
This can lead to unreasonable values of Yhat even if you do not show the intercept.
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i}$$
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 (0) + b_3 (0) + b_4 (0)$$
$$\hat{Y}_i = b_0 + b_1 X_{1i}$$
James P. Geaghan, Copyright 2011
[Figure: Plot of observed and predicted Infection Risk (y-axis, 0 to 9) against Length of Stay (x-axis, 5 to 19)]
In order to plot Yi and Yhat_i against a single Xj, it is best to set the other Xj variables to their
mean values.
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{4i}$$
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 \bar{X}_2 + b_3 \bar{X}_3 + b_4 \bar{X}_4$$
$$\hat{Y}_i = [\, b_0 + b_2 \bar{X}_2 + b_3 \bar{X}_3 + b_4 \bar{X}_4 \,] + b_1 X_{1i} = b'_0 + b_1 X_{1i}$$
Since all of the $b_j \bar{X}_j$ are “constant”, the part in brackets is a new “intercept”, b'0.
For the final 4-factor model, if I wanted to plot our observed and predicted SENIC values on
Length of Stay (with a meaningful range of values), I would get the following results.

Variable   Parameter Estimate   Means   Constants      SLR
INTERCEP   -0.06358059                  -0.06358059    2.53702
LTOFSTAY    0.18841053                                 0.18841
CULRATIO    0.04644573          15.79    0.733513715
XRAY        0.01205242          81.63    0.983818779
SERVICES    0.02046537          43.16    0.88327088
Sum                                      2.537022785

Each constant is the parameter estimate multiplied by the mean of that variable. Their sum,
including the original intercept, is the new SLR intercept.
Notice the change in intercept: it is no longer negative, suggesting that even for a very short stay in
the hospital (near zero time) there is still a positive risk of infection. This seems more reasonable.
Now let's look at the plot of the adjusted model for observed and predicted infection risk.
[Figure: Plot of observed and predicted Infection Risk from the adjusted model (y-axis, 0 to 9) against Length of Stay (x-axis, 5 to 19)]
The predicted values line up nicely, as we would expect for a simple linear regression, and could
be connected to fit a line.
Also, though the origin is not shown, the intercept of 2.5 would be reasonable.
Compare this to the graph that used the full model slope and intercept to get the predicted values
with all other Xi values essentially set to zero. Obviously the unadjusted line does not fit well, and
its negative intercept is too low. There appears to be a great deal of scatter, but remember we are
looking at the Y variable on only one X variable. There are 3 other significant independent
variables doing their share to explain the variation.
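The adjusted intercept can be recomputed from the parameter estimates and means reported in these notes. The sketch below is a minimal Python illustration; the names `adjusted_intercept` and `predict_on_stay` are hypothetical helpers, and because the means are only given to two decimals the result agrees with the tabled 2.537 only approximately.

```python
# Parameter estimates for the final 4-factor model, copied from the notes
estimates = {
    "INTERCEP": -0.06358059,
    "LTOFSTAY": 0.18841053,
    "CULRATIO": 0.04644573,
    "XRAY": 0.01205242,
    "SERVICES": 0.02046537,
}

# Means of the variables held constant (rounded to two decimals in the notes)
means = {"CULRATIO": 15.79, "XRAY": 81.63, "SERVICES": 43.16}


def adjusted_intercept(estimates, means):
    """Fold b_j * mean(X_j) for the held-constant variables into the intercept."""
    return estimates["INTERCEP"] + sum(estimates[name] * m for name, m in means.items())


b0_adjusted = adjusted_intercept(estimates, means)


def predict_on_stay(length_of_stay):
    """Predicted infection risk with the other variables fixed at their means."""
    return b0_adjusted + estimates["LTOFSTAY"] * length_of_stay


print("adjusted intercept:", b0_adjusted)
for days in (5, 10, 15, 19):
    print(days, predict_on_stay(days))
```

Holding the other variables at their means moves the line into the middle of the observed data, which is why the adjusted predictions sit among the points while the unadjusted ones fall below them.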
Cause & Effect
I must reiterate, you cannot prove cause and effect with correlation or regression. Cause and
effect are “proved” with a controlled experiment. However, once proved, relationships can be
quantified with regression, and a good correlation may prove to be a useful predictive tool even
where there is no cause and effect.
Linear combinations
Regression is a linear combination. It is linear because the terms are additive.
There are some properties of linear combinations that are useful not only for regression, but for
other applications as well. Take the linear combination
$$A_i = aX_i + bY_i + cZ_i$$
The variance is given by
$$\mathrm{Var}(A_i) = a^2\,\mathrm{Var}(X_i) + b^2\,\mathrm{Var}(Y_i) + c^2\,\mathrm{Var}(Z_i) + 2(\text{covariances})$$
$$\sigma^2_{A_i} = a^2 \sigma^2_{X_i} + b^2 \sigma^2_{Y_i} + c^2 \sigma^2_{Z_i} + 2\left(ab\,\sigma_{X_i,Y_i} + ac\,\sigma_{X_i,Z_i} + bc\,\sigma_{Y_i,Z_i}\right)$$
If the variables are independent, the covariances may be assumed to be zero and the covariance terms drop out.
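The variance formula can be checked numerically. The following is a minimal Python sketch on simulated data (the data, the coefficients and the helper names `mean` and `cov` are assumptions for illustration). It compares the variance of A computed directly with the value from the formula, and shows what is lost if the covariances are dropped when the variables are not independent.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance (n - 1 divisor); cov(u, u) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


rng = random.Random(1)
n = 200
X = [rng.gauss(0, 1) for _ in range(n)]
Y = [0.6 * x + rng.gauss(0, 1) for x in X]  # deliberately correlated with X
Z = [rng.gauss(0, 2) for _ in range(n)]

a, b, c = 2.0, -1.0, 0.5
A = [a * x + b * y + c * z for x, y, z in zip(X, Y, Z)]

# Variance of the linear combination computed directly
direct = cov(A, A)

# Variance from the formula, including all covariance terms
formula = (
    a**2 * cov(X, X) + b**2 * cov(Y, Y) + c**2 * cov(Z, Z)
    + 2 * (a * b * cov(X, Y) + a * c * cov(X, Z) + b * c * cov(Y, Z))
)

# Variance if the covariances are (wrongly, here) assumed to be zero
without_covariances = a**2 * cov(X, X) + b**2 * cov(Y, Y) + c**2 * cov(Z, Z)

print(direct, formula, without_covariances)
print(math.isclose(direct, formula, rel_tol=1e-9))
```

The first two values agree to floating-point precision because the identity holds for sample variances and covariances as well as for population values. The third differs because X and Y were generated to be correlated.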
For our variance calculation purposes in multiple regression:
  We need not consider the covariance among observations because they are independent.
  We need not consider the covariance between Yhat_i and e_i because they are independent.
  We DO NOT consider the parameter estimates of a multiple regression independent, and
  we use the covariance estimates from the analysis.
Other applications
In a two-sample t-test, and later on in analysis of variance, you may want to test a hypothesis
about a linear combination of two or more independent estimates, such as
$H_0: \mu_1 = 0.5\mu_2$, or equivalently $H_0: \mu_1 - 0.5\mu_2 = 0$.
We note that since these are independent, the variance for this t-test will be
$$\text{Variance} = \mathrm{Var}(\hat{\mu}_1) + 0.5^2\,\mathrm{Var}(\hat{\mu}_2) = \mathrm{Var}(\hat{\mu}_1) + 0.25\,\mathrm{Var}(\hat{\mu}_2)$$
Linear combinations also are used in sampling.
If random sampling is done on a heterogeneous population, the heterogeneity will cause a large
variance. If the population is broken into smaller, more homogeneous, units the variance of
each of the units will be smaller.
The overall variance is then calculated by summing the individual variances (multiplied by the
square of the coefficients). Since the units are sampled independently no covariance is needed.
For an example, with calculations, see “Linear combinations” under the EXST7005 notes.
Multiple Regression Summary
Although the observation diagnostics are similar between SLR and MLR, there are a number of new
diagnostics for variables. There is also a new problem (multicollinearity) that needs to be
addressed. Don't forget, or underestimate, this problem.
The assumptions for MLR are basically the same as for SLR.
Most diagnostics on assumptions and model adequacy are similar (normality, curvature, etc.).
We have partial residual plots (which could have been done for SLR) as a new diagnostic tool.
Extra SS are important to understanding the various types of SS, and the General Linear Test.
You should now be able to interpret the parameter estimates provided by SLR or MLR, and use
most of the diagnostics produced by SAS to determine variable “importance”, evaluate
observations and determine if the model is adequate and if the assumptions are met!
Curvilinear Regression
As the name implies, these are regressions that fit curves. However, the regressions we will
discuss are also linear models, so most of the techniques and SAS procedures we have discussed
will still be relevant.
We will discuss two basic types of curvilinear model.
  Models that are not linear, but that can be “linearized” by transformation, are called
  intrinsically linear because after transformation they are linear, often SLR. These have
  already been discussed.
  The other category is polynomial regressions. These are an extraordinarily flexible
  family of curves that will fit almost anything. Unfortunately, their parameter estimates
  rarely have a good interpretation.
Polynomial Regression
Polynomial regressions are multiple regressions that use power terms of the Xi variable to fit
curves. As long as the value of the power is known, the model is linear.
Only a single Xi is needed (though more can be used).
The assumptions are the same as for any other multiple regression.
Polynomial regressions are of the form
$$Y_i = b_0 + b_1 X_i + b_2 X_i^2 + b_3 X_i^3 + \dots + b_k X_i^k + e_i$$
The simplest in this family of models is the “linear”, which is just a simple linear regression.
Polynomials proceed:
  Quadratic: $Y_i = b_0 + b_1 X_i + b_2 X_i^2 + e_i$
  Cubic: $Y_i = b_0 + b_1 X_i + b_2 X_i^2 + b_3 X_i^3 + e_i$
  Quartic: $Y_i = b_0 + b_1 X_i + b_2 X_i^2 + b_3 X_i^3 + b_4 X_i^4 + e_i$
  Quintic, etc.
The quadratic fits a simple parabolic curve, either concave or convex, depending on the sign of
the quadratic regression coefficient.
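[Figure: two example quadratic curves, Y against X]
The cubic fits parabolic curves with an inflection. The inflection does not always occur within the
range of the data.
[Figure: two example cubic curves, Y against X, with the inflection marked]
The quartic polynomial adds another inflection, and another peak or valley (maximum or
minimum point). These are not usually symmetric.
[Figure: two example quartic curves, Y against X]
The same pattern continues for larger models.
[Figure: example of a higher-order polynomial curve, Y against X]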
What good are polynomials? They will fit anything. In fact, if no two X values are repeated, then
a large enough polynomial will go through every observation.
  An SLR exactly fits 2 points.
  A quadratic polynomial will exactly fit 3 points.
  A cubic will pass through each of 4 points.
  For n points, a polynomial with n–1 terms (plus the intercept) will pass through every point.
Sounds like a good thing? Only if you want to fit random scatter! How would you interpret
the graph below?
[Figure: plot of Y against X]
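The claim that a polynomial with n–1 terms passes through all n points can be illustrated numerically. This is a minimal Python sketch, assuming NumPy is available and using simulated pure-noise Y values (an assumption for illustration, not data from these notes).

```python
import numpy as np

rng = np.random.default_rng(42)

# Six distinct X values and pure random scatter for Y
x = np.arange(1.0, 7.0)
y = rng.normal(size=x.size)

# Least squares polynomial with n - 1 = 5 power terms plus an intercept
coefs = np.polyfit(x, y, deg=x.size - 1)
fitted = np.polyval(coefs, x)

# The residuals are zero apart from rounding error: a "perfect" fit to noise
max_abs_resid = float(np.max(np.abs(y - fitted)))
print("largest absolute residual:", max_abs_resid)
```

The fit is perfect even though Y contains no signal at all, which is the reason a high-order polynomial should not be taken as evidence of a real relationship.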
About polynomial regressions
Polynomial regressions are fitted successively starting with the linear term (a first order
polynomial). These are tested in order, so Sequential SS are appropriate.
When the highest order term is determined, then all lower order terms are also included.
For example, if we fit a fifth order polynomial, and only the CUBIC term is significant, then we
would OMIT THE HIGHER ORDER NONSIGNIFICANT TERMS, BUT RETAIN THOSE
TERMS OF SMALLER ORDER THAN THE CUBIC.
This does not mean that $Y_i = b_0 + b_3 X_i^3 + e_i$ is not a potentially useful model, only that this is
not a “polynomial” model.
If there are “s” different values of Xi, then s–1 polynomial terms (plus the intercept) will pass
through every point (or the mean of every point if there is more than one observation per Xi
value).
It is often recommended that not more than 1/3 of the total number of points (different Xi
values) be tied up in polynomial terms. For example, if we are fitting a polynomial to the 12
months of the year, don't use more than 4 polynomial terms (quartic).
All of the assumptions for regression apply to polynomials.
Polynomials are WORTHLESS outside the range of observed data!!! Do NOT try to extend
predictions beyond the range of data.
Polynomials generally do not have “biologically interpretable” regression coefficients.
Since the successive variables are all powers of Xi, they are correlated. Multicollinearity could be
an issue, but for two facts.
  Using sequential SS gives exactly the needed tests, so collinearity is not an issue for the tests.
  Regression coefficients may be affected and variances inflated, but we are unlikely to be
  interested in the regression coefficients for polynomials anyway.
Recall that transformations of Xi will not influence variance. This is true for polynomials.
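The sequential (Type I) testing of polynomial terms described above can be sketched as follows. This is a minimal Python illustration, assuming NumPy is available. The simulated data, the function names `sse_for_degree` and `sequential_ss`, and the choice of the highest fitted model's MSE as the F denominator are assumptions of this sketch, not output from SAS.

```python
import numpy as np


def sse_for_degree(x, y, degree):
    """Error sum of squares for a polynomial of the given degree (degree 0 = mean only)."""
    design = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def sequential_ss(x, y, max_degree):
    """Sequential SS for each power term: the drop in SSE when that term is added."""
    sse = [sse_for_degree(x, y, d) for d in range(max_degree + 1)]
    seq = [sse[d - 1] - sse[d] for d in range(1, max_degree + 1)]
    return seq, sse[max_degree]


# Simulated data with a true quadratic relationship (assumption for illustration)
rng = np.random.default_rng(7)
n = 60
x = np.linspace(-1.0, 1.0, n)
y = 2.0 + 1.5 * x - 3.0 * x**2 + rng.normal(scale=0.5, size=n)

max_degree = 4
seq_ss, sse_full = sequential_ss(x, y, max_degree)
mse = sse_full / (n - (max_degree + 1))  # error df = n minus number of parameters
f_stats = [ss / mse for ss in seq_ss]    # each term has 1 d.f.

for label, ss, f in zip(["linear", "quadratic", "cubic", "quartic"], seq_ss, f_stats):
    print(f"{label:10s} SS = {ss:10.4f}  F = {f:8.3f}")
```

Each sequential SS is adjusted only for the lower-order terms already in the model, so the SS add up exactly to the corrected total minus the final SSE, regardless of how correlated the power terms are.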
Polynomial Regression Example (10 K Race Results – Vermont) – Appendix 9
There are separate race results for 527 women and 963 men. We will hypothesize that the fastest
runners will be neither the oldest nor the youngest. This can be fitted with a polynomial.
See the output in Appendix 9.
Examine the scatter plots, done separately for the two sexes.
Examine the regression models, also done separately for the two sexes.
High resolution graphics were prepared in SAS and processed in Freelance. Graphics for the two
models were done separately.
[Figure: Polynomial Regression Example, Marathon race, sex=F. Time to run marathon (min), 125 to 325, against Age (years), 10 to 70]
[Figure: Polynomial Regression Example, Marathon race, sex=M. Time to run marathon (min), 125 to 325, against Age (years), 10 to 70]
Test of separate parameters for the two genders
Remember the General Linear Hypothesis test? Once again we have a full model (3 parameters
fitted to each gender = 6 parameters fitted) versus a single fit to both genders combined (only 3
parameters). The full model is the 6-parameter fit and the reduced model is the 3-parameter fit.
The sums of squares from the separate fits to gender can be added to give the following result.
We fit the Reduced model and the Full model (as two separate models).
Full model results: dfError = 524 + 960 = 1484, SSE = 304949 + 799234 = 1104183
Reduced model results: dfError = 1487, SSE = 1206036.845
Then we set up the table below to test the difference.
Source          d.f.   SSE          MSE          F         P>F
Reduced model   1487   1206036.85
Full model      1484   1104183.01
Difference         3    101853.84   33951.2791   45.6298   3.320764E-28
Full model      1484   1104183.01     744.0586

In this case we would decide that there was clearly a difference between genders. We don’t know
which one or more of the 3 parameters is different (different curvature or different intercepts) but
some difference exists. Later, with analysis of covariance, we could determine which parameters
differ. It actually turns out to be the intercept; the curvatures are the same for both genders.
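The F statistic in the table can be reproduced from the error sums of squares and degrees of freedom given above. This is a minimal Python sketch; the function name `general_linear_test` is a hypothetical helper, and the P value is not computed here because that would need an F distribution routine from a statistics library.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic comparing a reduced model with a full model."""
    df_diff = df_reduced - df_full
    ms_diff = (sse_reduced - sse_full) / df_diff  # mean square for the difference
    mse_full = sse_full / df_full                 # error mean square of the full model
    return ms_diff / mse_full, df_diff, df_full


# Full model: separate fits for women and men, added together
sse_full = 304949 + 799234
df_full = 524 + 960

# Reduced model: a single fit to both genders combined
sse_reduced = 1206036.845
df_reduced = 1487

f_stat, df_num, df_den = general_linear_test(sse_reduced, df_reduced, sse_full, df_full)
print(f"F = {f_stat:.4f} with {df_num} and {df_den} d.f.")
```

The numerator degrees of freedom equal the number of extra parameters in the full model (6 versus 3), which is why the difference line in the table has 3 d.f.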
So, given the curvature, there is an intermediate age that runs the 10 K race fastest, and younger and
older individuals take longer. What is that age?
The fitted model for females is $\text{Time} = 270.94 - 1.7668\,\text{Age} + 0.02906\,\text{Age}^2$
If we take the first derivative, set it equal to zero, and solve for Age we get:
Age at minimum time = 1.7668 / (2 × 0.02906) = 30.4
Using the equation to solve for the average time at age = 30.4 we get a mean of 244 minutes for
women, the best average time for any age.
The fitted model for males is $\text{Time} = 265.60 - 2.3003\,\text{Age} + 0.03392\,\text{Age}^2$
Men had a minimum at 33.9 and had a mean time of 226.6 minutes at that age.
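The minimum of each fitted quadratic follows from setting the derivative b1 + 2 b2 Age to zero. The sketch below is a minimal Python illustration using the coefficients reported above; the function name `quadratic_minimum` is a hypothetical helper.

```python
def quadratic_minimum(b0, b1, b2):
    """Return (x, y) at the minimum of y = b0 + b1*x + b2*x**2; requires b2 > 0."""
    if b2 <= 0:
        raise ValueError("the quadratic has a minimum only when b2 is positive")
    x_min = -b1 / (2 * b2)  # from dy/dx = b1 + 2*b2*x = 0
    y_min = b0 + b1 * x_min + b2 * x_min**2
    return x_min, y_min


# Fitted models from the notes: Time = b0 + b1*Age + b2*Age^2
age_f, time_f = quadratic_minimum(270.94, -1.7668, 0.02906)  # women
age_m, time_m = quadratic_minimum(265.60, -2.3003, 0.03392)  # men

print(f"women: fastest at age {age_f:.1f}, mean time {time_f:.1f} min")
print(f"men:   fastest at age {age_m:.1f}, mean time {time_m:.1f} min")
```

The same calculation gives a maximum instead when the quadratic coefficient is negative, for example when looking for an optimum yield.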
Do the results seem worthwhile? Are they meaningful? Are they interpretable? Do they have
value? Note that the R² for females was only 0.026822 and for males only 0.038138. However,
linear and quadratic terms were significant, indicating that there is a significant fit to the means.
Polynomial Regression Summary
Polynomial regressions are treated like any other multiple regression, except that we use Type I
SS for testing hypotheses.
Note that the FULLY ADJUSTED regression coefficients are still used to fit the model.
The ability to determine a minimum or maximum point is a useful application of polynomials
(optimum performance @ age, optimum yield @ fertilizer level, etc).
We have some new capabilities as far as what we can do with regression.
  Test for a curvilinear relationship between the Y and X.
  Test if the curvature is Quadratic? Cubic? Quartic? ...
  We can now obtain a curvilinear predictive equation for Y on X.