Statistical Methods I (EXST 7005) Page 149
Standard error of the regression line (i.e. Ŷi):
  S(μ̂Y·X) = sqrt{ MSE [ 1/n + (Xi − X̄)² / Σ(Xi − X̄)² ] }
where the sum runs over i = 1 to n.
Standard error of the individual points (i.e. Yi): An individual point is a linear combination of Ŷi and ei, so the variances are the sum of the variances of these two, where the variance of ei is MSE. The standard error is then
  S(Yi) = sqrt{ S²(μ̂Y·X) + MSE }
        = sqrt{ MSE [ 1/n + (Xi − X̄)² / Σ(Xi − X̄)² ] + MSE }
        = sqrt{ MSE [ 1 + 1/n + (Xi − X̄)² / Σ(Xi − X̄)² ] }
The standard error of b0 is the same as the standard error of the regression line where Xi = 0.
Square Root of [5.503603515 (0.0625 + 26.91015625/90.4375)] = 1.407693696
Confidence interval on b0, where b0 = 4.771250864 and t(0.05/2, 14df) = 2.145
P(4.771250864 – 2.145*1.407693696 ≤ β0 ≤ 4.771250864+2.145*1.407693696) = 0.95
P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95
Estimate the standard error of an individual observation for number of parasites for a ten-year-old fish: Ŷ = b0 + b1Xi = 4.77125 + 1.82723(10) = 23.04354
Square Root of [5.503603515 × (1 + 0.0625 + (10 − 5.1875)²/90.4375)]
= Square Root of [5.503603515 × (1 + 0.0625 + 23.16015625/90.4375)] = 2.693881509
Confidence interval on μYX=10
P(23.04353836 – 2.145*2.693881509 ≤ μYX=10 ≤ 23.04353836+2.145*2.693881509) = 0.95
P(17.26516252 ≤ μYX=10 ≤ 28.82191419) = 0.95
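The interval above can be reproduced numerically. Below is a minimal sketch in Python (not the course's SAS; it simply plugs in the MSE, n, mean age, Sxx and fitted coefficients quoted above):

```python
import math

# Quantities quoted in the notes for the fish-parasite example.
mse, n, xbar, sxx = 5.503603515, 16, 5.1875, 90.4375
b0, b1 = 4.771250864, 1.827228749
t_crit = 2.145                     # t(0.05/2, 14 d.f.) from the table

x = 10                             # a ten-year-old fish
yhat = b0 + b1 * x                 # predicted number of parasites
# Standard error of an individual observation at x.
se_ind = math.sqrt(mse * (1 + 1/n + (x - xbar)**2 / sxx))
lo, hi = yhat - t_crit * se_ind, yhat + t_crit * se_ind
print(round(yhat, 5), round(se_ind, 6), round(lo, 5), round(hi, 5))
```

This reproduces the estimate 23.04354, the standard error 2.693882 and the interval from roughly 17.265 to 28.822.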
Calculate the coefficient of determination and correlation:
R² = 0.796700662 or 79.67006617% (see SAS output)
r = 0.892580899
Overview of results and findings from the SAS program
James P. Geaghan Copyright 2010
I. Objective 1: Determine if older fish have more parasites. (SAS can provide this)
A. This determination would be made by examining the slope. The slope is the mean change in
parasite number for each unit increase in age. The hypothesis tested is H0: β1=0 versus H1:
β1≠0
1. If this number does not differ from zero, then there is no apparent relationship between age
and number of parasites. If it differs from zero and is positive, then parasites increase with
age. If it differs from zero and is negative, then parasites decrease with age.
2. For a simple linear regression we can examine the F test of the model, the F test of the Type I SS, the F test of the Type II SS, the F test of the Type III SS, or the t-test of the slope. For a simple linear regression these all provide the same result. For multiple regressions (more than 1 independent variable) we would examine the Type II or Type III F test (these are the same in regression) or the t-test of the regression coefficients. [Alternatively, a confidence interval can be placed on the coefficient, and if the interval does not include 0, the estimate of the coefficient is significantly different from zero.]
B. In this case, the F tests mentioned had values of 54.86, and the probability of this F value with
1 and 14 d.f. is less than 0.0001. Likewise, the t test of the slope was 7.41, which was also
significant at the same level. Note that t² = F; these are the same test. We can therefore
conclude that the slope does differ from zero. Since it is positive we further conclude that
older fish have more parasites.
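The claim that t² = F can be checked directly from the quoted statistics; the small discrepancy is just rounding of the reported t value (an illustrative check, not part of the SAS output):

```python
t, f = 7.41, 54.86   # reported t-test of the slope and F test of the model
print(t**2)           # equals F up to rounding of the reported t
```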
II. Objective 2 : Estimate the rate of accumulation of parasites. (SAS can provide this)
A. The slope for this example is 1.827228749 parasites per year (note the units). It is positive, so
we expect parasite numbers to increase by 1.8 per year.
B. The standard error for the slope was 0.24668872. This value is provided by SAS and can be
used for hypothesis testing or confidence intervals. SAS provides a ttest of H0: β1=0, but
hypotheses about values other than zero must be requested (SAS TEST statement) or
calculated by hand. The confidence interval in this case is: This calculation was done
previously and is partly repeated below.
P[b1 – tα/2,14 d.f. Sb1 ≤ β1 ≤ b1 + tα/2,14 d.f. Sb1]=0.95
P[1.827228749 – 2.144789(0.246689) ≤ β1 ≤ 1.827228749 +
2.144789(0.246689)]=0.95
P[1.298134 ≤ β1 ≤ 2.356324]=0.95
Note that this confidence interval does not include zero, so it differs significantly from zero.
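A quick numeric check of this interval in Python (not part of the original SAS analysis; the slope, standard error and t value are those quoted above):

```python
# Slope estimate, its standard error, and t(0.05/2, 14 d.f.) from the notes.
b1, se_b1, t_crit = 1.827228749, 0.24668872, 2.144789

lo = b1 - t_crit * se_b1
hi = b1 + t_crit * se_b1
print(round(lo, 4), round(hi, 4))   # roughly 1.2981 to 2.3563; excludes 0
```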
III. Estimate the intercept with confidence interval.
A. The intercept may also require a confidence interval. This was calculated previously and was;
P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95
IV. Determine how many parasites a 10-year-old fish would have. (SAS can provide this)
A. Estimating a Yi value for a particular Xi simply requires solving the equation for the line, Ŷ = b0 + b1Xi, which for coefficients of 4.771 and 1.827 and for a 10-year-old fish (Xi = 10) is Ŷ = 4.771 + 1.827(10) = 4.771 + 18.27 = 23.041.
V. Place a confidence interval on the 10 year old fish estimate. (SAS can provide this)
A. The confidence interval for this was estimated previously:
P(17.26516252≤μx=10≤28.82191419)=0.95.
B. There are many reasons why this type of calculation may be of interest. We can place a
confidence interval on any value of Xi, including the intercept where Xi=0 (this was done
previously). The intercept is often the most interesting point on the regression line, but not
always.
C. There is one very special characteristic of the confidence intervals (of either individual points or means). The confidence interval is narrowest at the mean of Xi, and gets wider to either side of the mean. The graph below for our example demonstrates this property.
[Figure: Regression with confidence bands — Parasites (0 to 25) plotted against Age in years (0 to 10); the bands are narrowest near the mean age.]
VI. Determine if a linear model is adequate and assumptions met. (SAS can provide most of this)
A. Independence : This is a difficult assumption to evaluate. There are some techniques in
advanced statistical methods, but these will not be covered here. The best guarantee for
independence is to randomize wherever and whenever possible.
B. Normality: The normality of the "residuals" or deviations from regression can be evaluated with the PROC UNIVARIATE Shapiro-Wilk test. The W value was 0.96 and the P<W was 0.6831. We would not reject the null hypothesis that the data are normally distributed with these results.
C. Homogeneity and other considerations: Residual plots are an important tool in evaluating
possible problems in regression, some of which we have not seen before. The normal residual
plot, when all is well, should reflect just random scatter about the regression line. An example is given below.
[Figure: residual plot of ei against Xi showing random scatter about zero]
The three residual plots below all show possible problems. From left to right the problems
indicated are (1) the data is curved and cannot be adequately described by a straight line, (2)
the variance is not homogeneous and (3) there is an outlier.
[Figure: three residual plots of ei against Xi — curvature, non-homogeneous variance, and an outlier]
An outlier is an observation which appears to be too large or too small in comparison to the other
values. Data should be checked carefully to ensure that the point is correct. If it is correct, but is way out of line relative to other values, it may be necessary to omit the point.
The residual plot for our example is given below. Can you detect any potential problems?
[Figure: residual plot for our example — Residuals plotted against Age (years)]
VII. An old published article states that the rate of accumulation should be about 5 per year. Test our estimate against 5. (SAS can provide this if you ask nicely)
A. SAS automagically tests the hypothesis that H0: β1 = 0. However, any value can be tested. The test is the usual one-sample t-test,
  t = (b1 − β1,H0) / Sb1, where Sb1 = sqrt{ MSE / Σ(Xi − X̄)² } = sqrt{ MSE / Sxx }
as previously mentioned. For this example, t = (1.827 − 5) / 0.2467 = −12.86.
VIII. Final notes on regression and correlation. (SAS can provide most of this)
A. The much overrated R2. The regression accounts for a certain fraction of the total SS. The
fraction of the total SS that is accounted for by the regression is called the coefficient of determination and is denoted "R²". It is calculated as R² = SSReg/SSTotal. This value is usually
multiplied by 100 and expressed as a percent. For our example the value was 79.7% of the
total variation accounted for by the model. This is pretty good, I guess. However, for some
analyses we expect much higher (length  weight relationships for example) and for others
much lower (try to predict how many fish you will get in a net at a particular depth or for a
particular size stream). This statistic does not provide any test, but may be useful for
comparing between similar studies on similar material.
B. The square root of the R² value is equal to the "Pearson product moment correlation" coefficient, usually denoted as "r". This value is calculated as
  r = Σ(Xi − X̄)(Yi − Ȳ) / sqrt{ Σ(Xi − X̄)² Σ(Yi − Ȳ)² } = Sxy / sqrt{Sxx Syy}
and is equal to 0.8926 for our example.
C. The correlation coefficient is "unitless" and ranges from −1 to +1.
D. A perfect inverse correlation gives a value of −1. This corresponds to a negative slope in regression, but the R² value will not reflect the negative because it is squared. A perfect correlation gives a value of +1 (positive slope in regression). A correlation of zero can be represented as random scatter about a horizontal line (slope = 0 in regression).
[Figure: three scatter plots of Y against X — perfect inverse correlation, perfect correlation, and correlation = 0]
E. The perfect correlation value of 1 (+ or −) also corresponds to a "perfect" regression, where the R² value would indicate that 100% of the variation in the total was accounted for by the model. The error in this case would be zero.

About Cross products
Cross products, XiYi, are used in a number of related calculations. Note from the calculations below that when any of the calculations equals zero, all of the others will also go to zero. As a result, when the covariance is zero the slope, correlation coefficient, R² and SSRegression are also zero. Consequently, the common test of hypothesis of interest in regression, H0: β1 = 0, can be tested by testing any of the statistics below. A t-test of the slope or an F test of the MSRegression are both testing the same hypothesis. Recall from the interrelationships of probability distributions that a t² with γ d.f. = F with 1, γ d.f.
(All sums below run over i = 1 to n.)

  Sum of cross products = SXY = Σ(Yi − Ȳ)(Xi − X̄)

  Covariance = SXY / (n − 1)

  Slope = b1 = SXY / SXX = Σ(Yi − Ȳ)(Xi − X̄) / Σ(Xi − X̄)²

  SSRegression = SXY² / SXX = [Σ(Yi − Ȳ)(Xi − X̄)]² / Σ(Xi − X̄)²

  Correlation coefficient = r = SXY / sqrt{SXX SYY} = Σ(Yi − Ȳ)(Xi − X̄) / sqrt{ Σ(Xi − X̄)² Σ(Yi − Ȳ)² }

  R² = r² = SXY² / (SXX SYY) = SSRegression / SSTotal
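These identities can be illustrated numerically. A small Python sketch with a made-up data set (hypothetical numbers, not the fish data); note that every quantity carries the factor SXY, so they all vanish together:

```python
import math

# A small hypothetical data set (made up for illustration, not the fish data).
X = [1, 2, 3, 4, 5, 6]
Y = [2, 1, 4, 3, 6, 5]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))  # sum of cross products
sxx = sum((x - xbar) ** 2 for x in X)
syy = sum((y - ybar) ** 2 for y in Y)

cov    = sxy / (n - 1)            # covariance
slope  = sxy / sxx                # b1
ss_reg = sxy ** 2 / sxx           # SSRegression
r      = sxy / math.sqrt(sxx * syy)
r2     = r ** 2                   # also equal to SSRegression / SSTotal

# Every quantity carries the factor SXY: if SXY = 0, all of them are zero.
print(cov, slope, ss_reg, round(r, 4), round(r2, 4))
```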
n ⎝ i =1 i =1 = n ⎠ James P. Geaghan Copyright 2010 Statistical Methods I (EXST 7005) Page 155 Summary
Regression is used to describe a relationship between two variables using paired observations
from the variables.
The intercept is the point where the line crosses the Y axis and the slope is the change in Y per unit
X.
Variance is derived from the sum of squared deviations from the regression line.
The population regression model is given by Yi = β0 + β1Xi + εi for the observations and μY·X = β0 + β1Xi for the regression line itself.
Estimated from a sample, the regression line is Ŷi = b0 + b1Xi.
There are four assumptions usually made for a regression,
1) Normality (at each value of Xi),
2) Independence (1) of the observations (Yi, Yj) from each other and (2) of the deviations (ei) from the rest of the model,
3) Homogeneity of variance at each value of Xi,
4) The Xi values are measured without error (i.e. all variation and deviations are vertical).

Multiple Regression
The objectives are the same as for simple linear regression, the testing of hypotheses about
potential relationships (correlation), fitting and documenting relationships, and estimating
parameters with confidence intervals.
The big difference is that a multiple regression will correlate a dependent variable (Yi) with
several independent variables (Xi's).
The regression equation is similar. The population equation is Yi = β0 + β1X1i + β2X2i + β3X3i + εi
The assumptions for the regression are the same as for Simple Linear Regression
The degrees of freedom for the error in a simple linear regression were n − 2, where the two degrees of freedom lost from the error represented one for the intercept and one for the slope.
In multiple regression the degrees of freedom are n – p, where “p” is the total number of
regression parameters fitted including one for the intercept.
The interpretation of the parameter estimates is the same (units are Y units per X units, and
measure the change in Y for a 1 unit change in X).
Diagnostics are mostly the same for simple linear regression and multiple regression.
Residuals can still be examined for outliers, homogeneity, curvature, etc. as with SLR. The
only difference is that, since we have several X's, we would usually plot the residuals on Yhat (Ŷi) instead of a single X variable.
Normality would be evaluated with the PROC UNIVARIATE test of normality.
There is only really one new issue here, and this is in the way we estimate the parameters.
If the independent (X) variables were totally and absolutely independent (covariance or
correlation = 0), then it wouldn't make any difference if we fitted them one at a time or all
together, they would have the same value. However, in practice there will always be some
correlation between the X variables.
If two X variables were PERFECTLY correlated, they would both account for the SAME
variation in Y, so which would get the variation?
If two X variables are only partially correlated they would share part of the variation in Y,
so how is it partitioned?
To demonstrate this we will look at a simple example and develop a new notation called the
Extra SS.
For multiple regression there will be, as with simple linear regression, a SS for the “MODEL”.
This SS lumps together all SS for all variables. This is not usually very informative. We
will want to look at the variables individually. To do this there are several types of SS
available in SAS, two of which are of particular interest, TYPE 1 and TYPE 3 SS.
In PROC REG these are not provided by default. To see them you must request them.
This can be done by adding the options SS1 and/or SS2 to the model statement. For
regression the SS Type II and SS Type III are the same.
In PROC GLM, which will do regressions nicely, but has fewer regression diagnostics
than PROC REG, the TYPE 1 and TYPE 3 SS are provided by default.
To do multiple regression in SAS we simply specify a model with the variables of interest. For example, a regression on Y with 3 variables X1, X2 and X3 would be specified as
PROC REG; MODEL Y = X1 X2 X3;
To get the SS1 and SS2 we add
PROC REG; MODEL Y = X1 X2 X3 / ss1 ss2;

Example with Extra SS
The simple example is done with a created data set.
 Y   X1   X2   X3
 1    2    9    2
 3    4    6    5
 5    7    7    9
 3    3    5    5
 6    5    8    9
 4    3    4    2
 2    2    3    6
 8    6    2    1
 9    7    5    3
 3    8    2    4
 5    7    3    7
 6    9    1    4

Now let's look at simple linear regressions for each variable independently, first for variable
X1. If we do a simple linear regression on X1 we get the following result. The SSTotal is
62.91667, and this will not change regardless of the model since it is adjusted only for the
intercept and all models will include an intercept.
If we fit a regression of Y on X1 the result is SSModel = 23.978, so the sum of squared
accounted for by X1 when it enters alone is 23.978. If we fit X2 alone, the result is
SSModel = 4.115.
If we then fit both X1 and X2 together, would the resulting model SS be 23.978 + 4.115 =
28.093? No, the model actually comes out to be 24.074 because of some covariance
between the two variables.
So how much would X1 add to the model if X2 was fitted first and how much would X2 add
if X1 was fitted first? We can calculate the extra SS for X1, fitted after X2, and for X2
fitted after X1. The variable X2 alone accounted for a sum of squares equal to 4.115
and when X1 was added the SS accounted for was 24.074, so X1 entering after X2
accounted for an additional 24.074 – 4.115 = 19.959. Therefore, we can state that the
SS accounted for by X1, entering the model after X2, is 19.959.
Likewise, we can calculate the SS that X2 accounted for entering after X1. Together they
account for SS = 24.074 and X1 alone accounted for 23.978, so X2 accounted for an
additional SS = 24.074 – 23.978 = 0.096 when it entered after X1.
We need a simpler notation to indicate the sum of squares for each variable and which other variables have been adjusted for before it enters the model. The sum of squares for X1 and X2 entering alone will be SSX1 and SSX2, respectively. When X1 is adjusted for X2 and vice versa the notation will be SSX1|X2 and SSX2|X1, respectively. For the calculations above the results were: SSX1 = 23.978, SSX2 = 4.115, SSX1|X2 = 19.959 and SSX2|X1 = 0.096.
Finally, consider a model fitted on all three variables. A model fitted to X2 and X3, without X1, yields SSModel = 4.137. When X1 is added, so that all 3 variables are now in the model, the SS accounted for is 26.190. How much of this is due to X1 entering after X2 and X3 are already in the model? Calculate 26.190 − 4.137 = 22.053. This sum of squares is denoted SSX1|X2, X3. In summary, X1 accounts for 23.978 when it enters alone, 19.959 when it enters after X2 and 22.053 when it enters after both X2 and X3 together. Clearly, how much variation X1 accounts for depends on what variables are already in the model, so we cannot just talk about the sum of squares for X1.
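The bookkeeping above can be reproduced by fitting the models directly. A sketch using Python and numpy rather than the course's SAS (the data are the 12 created observations listed earlier):

```python
import numpy as np

# The 12 created observations from the example.
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], float)

def model_ss(*xs):
    """Model SS, corrected for the intercept, for a least-squares fit of Y on xs."""
    X = np.column_stack([np.ones_like(Y)] + list(xs))
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.sum((Y - Y.mean())**2) - np.sum((Y - X @ b)**2))

ss_x1        = model_ss(X1)                              # SSX1, X1 alone
ss_x1_g_x2   = model_ss(X1, X2) - model_ss(X2)           # SSX1|X2
ss_x1_g_x2x3 = model_ss(X1, X2, X3) - model_ss(X2, X3)   # SSX1|X2,X3
# The notes report 23.978, 19.959 and 22.053 for these three quantities.
print(round(ss_x1, 3), round(ss_x1_g_x2, 3), round(ss_x1_g_x2x3, 3))
```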
We can use the new notation to describe the sum of squares for X1 in a way that indicates which other variables are in the model. This is the notation of the extra sum of squares. The notation is SSX1 for X1 alone in the model (adjusted only for the intercept), SSX1|X2 indicating X1 adjusted for X2 only, and SSX1|X2, X3 indicating that X1 is entered after, or adjusted for, both X2 and X3. For our example:
SSX1 = 23.978
SSX1|X2 = 19.959
SSX1|X2, X3 = 22.053
The same procedure would be done for each of the other two variables. We would calculate the same series of values for the variable X2: SSX2, SSX2|X1 or SSX2|X3, and SSX2|X1, X3. The series for variable X3 would be: SSX3, SSX3|X1 or SSX3|X2, and SSX3|X1, X2.
These values are given in the table below.

Extra SS       SS       d.f. Error   Error SS
SSX1          23.978        10        38.939
SSX2           4.115        10        58.802
SSX3           0.237        10        62.680
SSX1|X2       19.959         9        38.843
SSX2|X1        0.096         9        38.843
SSX1|X3       25.134         9        37.546
SSX3|X1        1.393         9        37.546
SSX2|X3        3.900         9        58.780
SSX3|X2        0.022         9        58.780
SSX1|X2,X3    22.053         8        36.727
SSX2|X1,X3     0.819         8        36.727
SSX3|X1,X2     2.116         8        36.727

All of these SS are previously adjusted only for the intercept (X0, the correction factor), and this will always be the case for our examples. We could include a notation for the intercept in the extra SS (e.g. SSX1|X0; SSX1|X0, X2; SSX1|X0, X2, X3; etc.), but since X0 would always be present we will omit this from our notation.

Partial sums of squares or Type II SS
With so many possible sums of squares, which ones will be useful to us? The sum of squares normally used for a multiple regression is called the partial sum of squares, the sum of squares where each variable is adjusted for all other variables in the model. These are SSX1|X2,X3; SSX2|X1,X3; and SSX3|X1,X2. This type of sum of squares is sometimes called the fully adjusted SS, or uniquely attributable SS. In SAS they are called the TYPE II or TYPE III sum of squares since these two types are the same for regression analysis. SAS provides TYPE II in PROC REG and TYPE III in PROC GLM by default. Testing and evaluation of variables in multiple regression is usually done with the TYPE II or TYPE III SS.
ANOVA table for this analysis (F0.05,1,8 = 5.32), using the TYPE III SS (Partial SS).

Source        d.f.     SS       MS      F value
SSX1|X2,X3      1     22.053   22.053    4.804
SSX2|X1,X3      1      0.819    0.819    0.178
SSX3|X1,X2      1      2.116    2.116    0.461
ERROR           8     36.727    4.591

Sequential sums of squares or Type I SS
When we fit a regression, we are interested in one of two types of SS, normally the partial sum of squares. There is another type of sum of squares called the sequentially adjusted SS. These sums of squares are adjusted in a sequential or serial fashion. Each SS is adjusted for the variables previously entered in the model, but not for variables entered after, so it is important to note the order in which the variables are entered in the model. For the model [Y = X1 X2 X3], X1 would be first and adjusted for nothing else (except the intercept X0). X2 would enter second, be adjusted for X1, but not for X3. X3 enters last and is adjusted for both X1 and X2. Using our extra SS notation these are SSX1; SSX2|X1 and SSX3|X1,X2.
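The sequential SS for this order can be computed by fitting the models in sequence; a Python sketch (not the course's SAS) using the created data set, which also shows that sequential SS always add up to the full model SS:

```python
import numpy as np

# The 12 created observations from the example.
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], float)
X1 = np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], float)
X2 = np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], float)
X3 = np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], float)

def model_ss(*xs):
    """Model SS (corrected for the intercept) for Y regressed on xs."""
    X = np.column_stack([np.ones_like(Y)] + list(xs))
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.sum((Y - Y.mean())**2) - np.sum((Y - X @ b)**2))

# Sequential (Type I) SS for the order X1, X2, X3:
ss1 = model_ss(X1)                                # SSX1
ss2 = model_ss(X1, X2) - model_ss(X1)             # SSX2|X1
ss3 = model_ss(X1, X2, X3) - model_ss(X1, X2)     # SSX3|X1,X2
print(round(ss1, 3), round(ss2, 3), round(ss3, 3))
# Unlike the partial SS, these sum exactly to the full model SS.
print(round(ss1 + ss2 + ss3, 3), round(model_ss(X1, X2, X3), 3))
```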
These sums of squares have a number of potential problems. Unfortunately, the SS are different depending on the order the variables are entered, so different researchers would get different results. As a result the use of this SS type is rare; it is only used where there is a mathematical reason to place the variables in a particular order. Its use is restricted pretty much to polynomial regressions, which use a series of power terms (e.g. Yi = β0 + β1Xi + β2Xi² + β3Xi³ + εi), and some other odd applications (e.g. in some cases Analysis of Covariance). Investigators sometimes feel that they know which variables are more important, but this is not justification for using sequential sums of squares. So, we will not use sequential SS at all, though they are provided by default by SAS PROC GLM.

Multiple Regression with SAS
This same data set was run with SAS. The program was
**********************************************;
*** EXST7005 Multiple Regression Example 1 ***;
**********************************************;
OPTIONS LS=78 PS=78 NODATE nocenter nonumber;
DATA ONE; INFILE CARDS MISSOVER;
TITLE1 'EXST7005 MULTIPLE REGRESSION EXAMPLE #1';
INPUT Y X1 X2 X3;
CARDS;
PROC PRINT DATA=ONE;
TITLE2 'Data Listing'; RUN;
See SAS output in Appendix 8.
Note: the PROC REG section:
PROC REG DATA=ONE LINEPRINTER;
TITLE2 'Analysis with PROC REG';
MODEL Y = X1 X2 X3;
OUTPUT OUT=NEXT P=P R=E STUDENT=student
rstudent=rstudent
lcl=lcl lclm=lclm ucl=ucl uclm=uclm;
RUN; OPTIONS PS=35; TITLE2 'Residual plot';
PLOT RESIDUAL.*PREDICTED.='E';
RUN; QUIT;
Note: the overall model, the statistics for the individual variables, and the residual plot.

Residuals, confidence intervals and univariate analysis:
proc print data=next;
var Y X1 X2 X3 P E student rstudent lcl ucl lclm uclm;
run;
OPTIONS PS=61;
PROC UNIVARIATE DATA=NEXT NORMAL PLOT;
VAR E; RUN;
Note: the output from proc print, in particular the interpretation of the variables student, rstudent, lcl, ucl, lclm and uclm, and the output from proc univariate, especially the test of normality.

This same analysis was done with GLM:
PROC GLM DATA=ONE;
TITLE2 'Analysis with PROC GLM';
MODEL Y = X1 X2 X3;
RUN; QUIT;
The results are the same; we only want to look at the Type I and Type III SS.

Evaluation of Multiple Regression
If your objective is to test the 3 variables jointly ( H0: β1 = 0, β2 = 0 and β3 = 0 ) or individually ( H0: βi = 0), you are done at this point. None of the variables is significantly different from
zero.
If, however, your objective is to develop the simplest possible, most parsimonious model, you may delete the variables one at a time. Why one at a time? Because when you remove a variable everything changes, since they are adjusted for each other. We would remove the least significant variable (the one with the smallest F value). In this case the first step would be to remove X2.
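The one-at-a-time removal described above can be written as a small backward-elimination loop; an illustrative Python sketch (not SAS's implementation, and using the F0.05 critical values quoted in these notes rather than SAS's 0.10 stay level — the outcome is the same here):

```python
import numpy as np

# The 12 created observations from the example.
Y  = np.array([1, 3, 5, 3, 6, 4, 2, 8, 9, 3, 5, 6], float)
xs = {'X1': np.array([2, 4, 7, 3, 5, 3, 2, 6, 7, 8, 7, 9], float),
      'X2': np.array([9, 6, 7, 5, 8, 4, 3, 2, 5, 2, 3, 1], float),
      'X3': np.array([2, 5, 9, 5, 9, 2, 6, 1, 3, 4, 7, 4], float)}

def sse(names):
    """Error SS for a least-squares fit of Y on the named X's plus an intercept."""
    X = np.column_stack([np.ones_like(Y)] + [xs[n] for n in names])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.sum((Y - X @ b) ** 2))

# F(0.05; 1, df) critical values for the error d.f. met here, from the notes.
f_crit = {8: 5.32, 9: 5.117, 10: 4.965}

names = ['X1', 'X2', 'X3']
while names:
    df_err = len(Y) - len(names) - 1
    mse = sse(names) / df_err
    # Partial (Type II/III) F for each variable: its extra SS over the others.
    f = {n: (sse([m for m in names if m != n]) - sse(names)) / mse for n in names}
    worst = min(f, key=f.get)
    if f[worst] >= f_crit[df_err]:
        break                      # all remaining variables are significant
    names.remove(worst)            # drop the least significant variable

print(names)   # as in the notes, X2 then X3 are removed, leaving X1
```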
ANOVA table for analysis of the variables X1 and X3 alone (F0.05,1,9 = 5.117). Note that X1 is now significant, but X3 is not and may be removed as step 2.

Source      d.f.     SS       MS      F value
SSX1|X3       1     25.134   25.134    6.024
SSX3|X1       1      1.393    1.393    0.334
ERROR         9     37.546    4.172

The variable X1 is still significant. (F0.05,1,10 = 4.965)

Source      d.f.     SS       MS      F value
SSX1          1     23.977   23.977    6.158
ERROR        10     38.939    3.894

This one-at-a-time variable removal process is called "stepwise regression". More specifically, it would be called backward selection stepwise regression. It is called backward because it starts with a full model and removes one variable at a time. There also exists a forward stepwise regression, where the best single variable is found to start with and additional variables are added to the model if they meet the significance requirements.

Multiple Regression with SAS (see SAS output in Appendix 9)
SAS has a program for stepwise model development. This is accomplished with PROC REG,
with the specification of a selection option.
PROC REG DATA=ONE LINEPRINTER;
TITLE2 'Stepwise analysis with PROC REG';
MODEL Y = X1 X2 X3 / selection=backward;
RUN;
In the initial step (STEP 0) the full, 3-parameter model is fitted, and the parameter estimates are evaluated.

Backward Elimination: Step 0
All Variables Entered: R-Square = 0.4163 and C(p) = 4.0000

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         26.18995       8.72998      1.90   0.2078
Error               8         36.72672       4.59084
Corrected Total    11         62.91667

Step 1 is the first removal, in this case of the variable X2. The results for the remaining variables are then given.
Backward Elimination: Step 1
Variable X2 Removed: R-Square = 0.4032 and C(p) = 2.1784

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         25.37078      12.68539      3.04   0.0980
Error               9         37.54588       4.17176
Corrected Total    11         62.91667

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              1.91576          1.73473      5.08794      1.22   0.2981
X1                     0.63161          0.25732     25.13390      6.02   0.0365
X3                     0.13650          0.23621      1.39316      0.33   0.5775

Step 2 is the next removal (if needed), in this case of the variable X3. The result for the remaining variable is then given.
Backward Elimination: Step 2
Variable X3 Removed: R-Square = 0.3811 and C(p) = 0.4819

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1         23.97763      23.97763      6.16   0.0325
Error              10         38.93904       3.89390
Corrected Total    11         62.91667

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              1.37613          1.41242      3.69640      0.95   0.3529
X1                     0.61089          0.24618     23.97763      6.16   0.0325

Finally SAS prints a summary of variable removals.
All variables left in the model are significant at the 0.1000 level.

Summary of Backward Elimination
Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      X2                        2                0.0130           0.4032      2.1784      0.18   0.6838
2      X3                        1                0.0221           0.3811      0.4819      0.33   0.5775

Interpretation of regression
Objectives can vary in regression. You may be interested in testing the correlations (actually
“partial” correlations due to the adjustment of one variable for another), or you may be
interested in the parameter estimates and the resulting model (the full model or the reduced
model from stepwise). Most aspects of the evaluation are similar to what we observed with
simple linear regression.
The parameter estimates are interpreted as before, the change in Y per unit X. Of course, now
they are adjusted for other effects.
Standard errors are provided for confidence intervals, as well as a test of each regression
coefficient against 0 (zero).
Confidence intervals are placed on the parameters the same as with SLR although the
calculations differ.
The d.f. for the t value is based on the MSE (for the final model) as with simple linear
regression. The parameters and their standard errors can be estimated in SAS.
Residual evaluation is very similar to SLR, but residuals are usually plotted on Yhat instead of
X, since there are several independent variables (i.e. X's).
Evaluation of the residuals using PROC UNIVARIATE for testing normality and outlier
detection is the same as for SLR.
Fully adjusted SS also mean fully adjusted regression coefficients (also partial reg. coeff.). SAS
REG does not give tests of SS like GLM, but the tests of the βi values are the same as the
tests of the Type III SS.
There are a few things that are different.
The R2 value is now called the coefficient of multiple determination (instead of the coefficient
of determination).
As discussed, we now evaluate SS for the individual variables. Note that the tests of TYPE III
SS are identical to the tests of the regression coefficients (see GLM handout). PROC REG
does only the latter, and will not do the former.
There is a suite of new diagnostics for evaluating the multiple independent variables and their interrelations. We will not discuss these, except to say that if the independent variables are highly correlated with each other (a correlation coefficient, r, of around 0.9), then the parameter estimates can fluctuate wildly and unpredictably and may not be useful.

Also note a curious behavior of the variables when they occur together. When one independent variable Xi is adjusted for another, sometimes its SS is larger than it would be for that variable alone and sometimes the SS is smaller. This is unpredictable and can go either way. For example, SSX1 was 23.978 when the variable was alone, but dropped to 19.959 when adjusted for X2, and increased to 25.134 when adjusted for X3. It dropped to 22.053 when adjusted for both. In essence the variables sometimes compete with each other for sums of squares and at other times enhance each other's ability to account for sums of squares.

Extra SS       SS
SSX1          23.978
SSX2           4.115
SSX3           0.237
SSX1|X2       19.959
SSX2|X1        0.096
SSX1|X3       25.134
SSX3|X1        1.393
SSX2|X3        3.900
SSX3|X2        0.022
SSX1|X2,X3    22.053
SSX2|X1,X3     0.819
SSX3|X1,X2     2.116

Adjusted SS
Not only will the SS of one variable increase or decrease as other variables are added, the
regression coefficient values will change. They may even change sign, and hence
interpretation. Although the interpretation does not usually change, sometimes variables in
combination do not necessarily have the same interpretation as they might have had when
alone. Summary
Multiple regression shares a lot in interpretation and diagnostics with SLR.
Most diagnostics are the same as with SLR.
The coefficients and sums of squares of the variables should be adjusted for each other. This is
the partial sum of squares, the Type II SS or Type III SS in SAS. This is the big and
important difference from SLR.