Statistical Techniques II
Analysis of Covariance
Linear regression is usually done as a least squares technique applied to QUANTITATIVE
VARIABLES. ANOVA is the analysis of categorical (class, indicator, group) variables; there
are no quantitative “X” variables as in regression, but it is still a least squares technique.
It stands to reason that if Regression uses the least squares technique to fit quantitative variables,
and ANOVA uses the same approach to fit qualitative variables, then we should be able to put both
together into a single analysis.
We will call this Analysis of Covariance. There are actually two conceptual approaches:
  Multisource regression – adding class variables to a regression
  Analysis of Covariance – adding quantitative variables to an ANOVA
For now, we will be primarily concerned with Multisource regression.
With multisource regression we start with a regression (one slope and one intercept) and ask,
would the addition of an indicator or class variable improve the model?
Adding a class variable to a regression gives each group its own intercept, fitting a separate
intercept to each group.
[Figure: Y versus X, separate intercepts]
Adding an interaction fits a separate slope to each group.
[Figure: Y versus X, separate slopes]
How do they do that?
For a simple linear regression we start with $Y_i = b_0 + b_1 X_{1i} + e_i$.
Now add an indicator variable. An indicator variable, or dummy variable, is a variable that
uses values of “0” and “1” to distinguish between the members of a group. For example, if the
category, or groups, were MALE and FEMALE, we could give the females a “1” and the males
a “0” and distinguish between the groups. If the groups were FRESHMAN, SOPHOMORE,
JUNIOR and SENIOR we would need 3 variables to distinguish between them: the first would
have a “1” for FRESHMAN and a “0” otherwise; the second and third would have a “1” for
SOPHOMORE and JUNIOR, respectively, and a “0” otherwise. We don’t need a fourth
variable for SENIOR because if we have an observation with values of 0, 0, 0 for the three
variables we know it has to be a SENIOR. So, we always need one fewer dummy variable than
there are categories in the group.
James P. Geaghan - Copyright 2011
Fitting separate intercepts
In our example we will add just one indicator variable, but it could be several. We will call our
indicator variable X2i, but it is actually just a variable with values of 0 or 1 that distinguishes
between the two categories of the indicator variable.
The model is $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$.
When $X_{2i} = 0$ we get $Y_i = b_0 + b_1 X_{1i} + b_2(0) + e_i$,
which reduces to $Y_i = b_0 + b_1 X_{1i} + e_i$, a simple
linear model for the “0” group.
And when $X_{2i} = 1$ we have $Y_i = b_0 + b_1 X_{1i} + b_2(1) + e_i$,
which simplifies to $Y_i = (b_0 + b_2) + b_1 X_{1i} + e_i$,
a simple linear model with an intercept equal to (b0+b2) for the “1” group.
[Figure: Y versus X, with the intercept b0 marked]
So there are two lines with intercepts of b0 and (b0+b2). Note that b2 is the difference between
the two lines, or an adjustment to the first line that gives the second line.
This can be positive or negative, so the second line may be above or below the first.
This term can be tested against zero to determine if there is a difference in the intercepts.
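As a hedged illustration of the two-intercept model, the sketch below fits one common slope and two intercepts to a small made-up data set (the values of `x1`, `group` and `y` are hypothetical, not from these notes). It assumes the closed-form least squares solution for parallel lines: the common slope pools the within-group cross products, and each line passes through its own group means.

```python
# Hypothetical data: x1 is quantitative, group is a two-level class variable.
x1 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
group = ["M", "M", "M", "M", "F", "F", "F", "F"]
y = [2.1, 2.9, 4.2, 4.8, 4.0, 5.1, 5.9, 7.0]

# Indicator (dummy) variable: 1 for "F", 0 for "M".
x2 = [1 if g == "F" else 0 for g in group]


def mean(values):
    return sum(values) / len(values)


def centered_sums(xs, ys):
    """Return (Sxx, Sxy) computed about the means of xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxx, sxy


# Split the observations by indicator value.
xs = {k: [a for a, d in zip(x1, x2) if d == k] for k in (0, 1)}
ys = {k: [b for b, d in zip(y, x2) if d == k] for k in (0, 1)}

# Common slope: pool Sxy and Sxx within the two groups.
sxx0, sxy0 = centered_sums(xs[0], ys[0])
sxx1, sxy1 = centered_sums(xs[1], ys[1])
b1 = (sxy0 + sxy1) / (sxx0 + sxx1)

# Each group's line passes through its own (mean x, mean y).
b0 = mean(ys[0]) - b1 * mean(xs[0])          # intercept of the "0" group
b2 = (mean(ys[1]) - b1 * mean(xs[1])) - b0   # adjustment for the "1" group

print(f"b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}")
print(f"intercepts: group 0 = {b0:.3f}, group 1 = {b0 + b2:.3f}")
```

Here b2 is read directly as the vertical distance between the two parallel lines, which is the quantity tested against zero.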
Fitting separate slopes
Adding an interaction (crossproduct term) between the quantitative variable (X1i) and the
indicator variable (X2i) will fit separate slopes. Still using just one indicator variable for two
classes, the model is $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{1i} X_{2i} + e_i$.
When $X_{2i} = 0$ we get $Y_i = b_0 + b_1 X_{1i} + b_2(0) + b_3 X_{1i}(0) + e_i$, which reduces to $Y_i = b_0 + b_1 X_{1i} + e_i$, a simple
linear model for the “0” group.
When $X_{2i} = 1$ we get $Y_i = b_0 + b_1 X_{1i} + b_2(1) + b_3 X_{1i}(1) + e_i$,
simplifying to $Y_i = (b_0 + b_2) + (b_1 + b_3) X_{1i} + e_i$,
which is a simple linear model with an intercept equal to (b0+b2) and a slope equal to
(b1+b3) for the “1” group.
[Figure: Y versus X]
Note that b0 and b1 are the intercept and slope for one of the lines (whichever was assigned the
0). The values of b2 and b3 are the intercept and slope adjustments (+ or – differences) for the
second line. If these adjustments are not different from zero then the intercepts and/or slopes are not different
from each other.
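The sketch below illustrates the separate-slopes model on made-up data (the numbers are hypothetical). It fits a line to each group on its own, rewrites the two lines as a base line plus adjustments (b2, b3), and then checks that these four coefficients satisfy the least squares conditions of the combined model, that is, the residuals are orthogonal to every column of the design matrix.

```python
# Hypothetical data (not from the notes): x2 is the 0/1 indicator variable.
x1 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2.1, 2.9, 4.2, 4.8, 4.0, 5.1, 5.9, 7.0]


def fit_line(xs, ys):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def subset(values, level):
    """Values belonging to the group whose indicator equals `level`."""
    return [v for v, d in zip(values, x2) if d == level]


# Fit each group separately.
a0, s0 = fit_line(subset(x1, 0), subset(y, 0))
a1, s1 = fit_line(subset(x1, 1), subset(y, 1))

# Express the two lines as a base line plus adjustments.
b0, b1 = a0, s0
b2, b3 = a1 - a0, s1 - s0

# Least squares check for the combined model: residuals must be
# orthogonal to the intercept, X1, X2 and X1*X2 columns.
columns = {
    "intercept": [1.0] * len(y),
    "X1": x1,
    "X2": x2,
    "X1*X2": [a * d for a, d in zip(x1, x2)],
}
resid = [yy - (b0 + b1 * a + b2 * d + b3 * a * d) for a, d, yy in zip(x1, x2, y)]
orthogonality = {
    name: sum(r * c for r, c in zip(resid, col)) for name, col in columns.items()
}

print(f"b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}  b3={b3:.3f}")
print("normal equations satisfied:", all(abs(v) < 1e-9 for v in orthogonality.values()))
```

The check passing is what justifies reading b2 and b3 as the intercept and slope differences between the two separately fitted lines.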
A third or fourth line could be fitted by adding additional indicator variables and interactions
with coefficients b4 and b5, etc.
Statistics quote: There are two kinds of statistics, the kind you look up and the kind you make up. -- Rex Stout
Conceptual application of Analysis of Covariance
Suppose we have a regression, and we would like to know if perhaps the regression is different for
some groups, perhaps male and female or treated and untreated.
Of course, regression starts with a corrected sum of squares (a flat line through the mean) and we
fit a slope. Viewing this as extra sums of squares, we are comparing a model with no slope to a
model with a slope (comparison “a”). A test of the difference between these two models tests the
hypothesis of adding a slope ($H_0: \beta_1 = 0$). The extra SS are $SS_{X_1 \mid X_0}$, or just $SS_{X_1}$.
From the simple linear regression we want to test the hypotheses that the model is improved by
adding separate intercepts or slopes. The concepts are pretty simple. The additional variables can
be tested with extra SS or the General Linear Hypothesis Test (GLHT with full model and reduced
model).
For a single indicator variable the extra SS are easy enough. For several indicator variables it may
be easier to do the GLHT where the single line is the reduced model and the two lines are the full
model. For comparison “b”, a test of the difference between these two models tests the
hypothesis of adding an intercept adjustment ($H_0: \beta_2 = 0$), that is, it tests for the difference between the
intercepts. The extra SS are $SS_{X_2 \mid X_1}$.
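For reference, a full-versus-reduced model comparison of this kind is commonly written as the F statistic below. The notation is introduced here for illustration and is an assumption rather than something taken from these notes: $SSE$ is a residual sum of squares and $df$ is the matching error degrees of freedom.

```latex
F = \frac{\left(SSE_{\text{reduced}} - SSE_{\text{full}}\right) / \left(df_{\text{reduced}} - df_{\text{full}}\right)}
         {SSE_{\text{full}} / df_{\text{full}}}
```

The numerator difference $SSE_{\text{reduced}} - SSE_{\text{full}}$ is the extra SS for the terms dropped from the full model. With a single indicator variable the numerator has 1 d.f.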
The next test compares a model with two intercepts and one slope with a model containing two
slopes and two intercepts. For comparison “c”, we are testing the addition of a slope adjustment, or
second slope ($H_0: \beta_3 = 0$). This test determines if a model with 2 intercepts and 2 slopes is better
than a model with 2 intercepts and one slope. The extra SS are $SS_{X_1 X_2 \mid X_1, X_2}$.
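The three extra sums of squares for comparisons a, b and c can be sketched as differences between the residual SS of successively larger models. The data below are hypothetical, and the closed-form residual SS expressions are the standard ones for a single line, parallel lines and separate lines. The F ratios use the mean square error of the fullest model, which is an assumption about how the sequential tests are formed.

```python
# Hypothetical data (not from the notes); x2 is a 0/1 indicator variable.
x1 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2.1, 2.9, 4.2, 4.8, 4.0, 5.1, 5.9, 7.0]


def sums(xs, ys):
    """Corrected sums of squares and cross products: (Sxx, Sxy, Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    syy = sum((b - my) ** 2 for b in ys)
    return sxx, sxy, syy


def subset(values, level):
    """Values belonging to the group whose indicator equals `level`."""
    return [v for v, d in zip(values, x2) if d == level]


sxx, sxy, syy = sums(x1, y)
within = [sums(subset(x1, k), subset(y, k)) for k in (0, 1)]
w_sxx = sum(w[0] for w in within)
w_sxy = sum(w[1] for w in within)
w_syy = sum(w[2] for w in within)

# Residual SS for the nested models.
sse_mean = syy                                            # b0 only (corrected total)
sse_slr = syy - sxy ** 2 / sxx                            # b0, b1
sse_two_int = w_syy - w_sxy ** 2 / w_sxx                  # b0, b1, b2
sse_full = sum(w[2] - w[1] ** 2 / w[0] for w in within)   # b0, b1, b2, b3

# Sequential (Type I) extra SS, each with 1 d.f.
ss_x1 = sse_mean - sse_slr                    # comparison a
ss_x2_given_x1 = sse_slr - sse_two_int        # comparison b
ss_x1x2_given_x1_x2 = sse_two_int - sse_full  # comparison c

# F ratios against the mean square error of the fullest model.
mse_full = sse_full / (len(y) - 4)
for label, ss in [
    ("X1", ss_x1),
    ("X2 | X1", ss_x2_given_x1),
    ("X1*X2 | X1, X2", ss_x1x2_given_x1_x2),
]:
    print(f"{label:16s} SS={ss:8.4f}  F={ss / mse_full:8.2f}")
```

The three extra SS plus the full-model residual SS add back up to the corrected total, which is a quick check on the arithmetic.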
This series a–b–c is the usual “Multisource regression” series. Note that the extra SS needed are
$SS_{X_1}$; $SS_{X_2 \mid X_1}$; $SS_{X_1 X_2 \mid X_1, X_2}$. These are
the Type I SS for the SAS model Y=X1 X2 X1*X2;. So this is another of those rare cases
where we may wish to use TYPE I tests.
The hypotheses to be tested can be simplified to some extent if we know that for our base model
we want a regression. This is multisource regression.
Start with a simple linear regression.
Test to see if separate intercepts improve the model ($H_0: \beta_2 = 0$).
Test to see if separate slopes improve the model ($H_0: \beta_3 = 0$).
[Figure: the sequence of models compared]
  Correction factor, 1 level = b0 (fit mean)
  1 level (b0), 1 slope (b1)
  2 levels (b0, b2), 1 slope (b1)
  2 levels (b0, b2) and 2 slopes (b1 and b3)
As extra SS we could test extra SS for b2 and then the extra SS for b3.
In practice we usually start with the fullest model (2 slopes and 2 intercepts)
and then reduce to 1 slope and 2 intercepts if the extra SS for b3 is not significant
and then to 1 slope and 1 intercept if extra SS for b2 is not significant.
So, interpretation usually proceeds from the bottom up; do we need 2 slopes and 2 intercepts, if
not do we need 2 intercepts, if not is the SLR significant?
If TYPE I sums of squares are used there would be little if any change from deleting the higher
order components of the model, so in this case we would not have to refit the model after each
step.
The solution (i.e. regression coefficients) provided with ANCOVA is the correct model for
prediction, even though it is fully adjusted (Type III) and our tests of hypothesis are not (Type I).
The progression discussed above corresponds to the likely series in multisource regression (a-b-c
below). In Analysis of Variance (our next topic), the researcher's interest is in differences among
the means of categorical variables. For this analysis the categorical variable can be put first and
the progression is the usual series for Analysis of Variance (d-e-c below). Which series of tests you
need depends on whether you are starting with a regression or an analysis of variance.
[Figure: diagram of model comparisons labeled a, b, c, d, e and f]
Examine the possible full and reduced models on the previous page. What is tested in each case?
What extra SS is needed for each test, and what are the initial and final models for each case?
One other test of occasional interest is the test denoted by the comparison “g”. If you fit a model
with two slopes and two intercepts, and cannot reduce to a model with two intercepts and one
slope, you may wonder if a model with two slopes and one intercept is appropriate.
Full model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{1i} X_{2i} + e_i$
Reduced model: $Y_i = b_0 + b_1 X_{1i} + b_3 X_{1i} X_{2i} + e_i$
The extra SS for this test is $SS_{X_2 \mid X_1, X_1 X_2}$. This is actually a test of “intercepts fitted last”
and would be available as the Type III SS for X2 with the model Y=X1 X2 X1*X2;.
[Figure: comparison “g”]
A few notes on ANCOVA
The values of the slopes and intercepts fitted in the full model (2 slopes and 2 intercepts) are
exactly the same slopes and intercepts that would be fitted if the models were done separately.
However, the combined model has the advantage of having a more powerful pooled error term.
In SAS we can fit our own indicator variable in PROC REG, building dummy variables in the data
step. However, if there are more than just two levels (one dummy variable) the dummy variables
should be treated as an aggregate.
The CLASSES statement in PROC MIXED or GLM automatically creates the indicator
variables, and both will test TYPE I and TYPE III sums of squares.
We will use PROC MIXED in our numerical SAS example.
When we include the CLASSES statement in SAS, the program assumes that we are doing an
ANOVA, and by default does not print the regression coefficients. If we want these we must
add the option “/solution” to the model statement. This is true for both PROC MIXED and
PROC GLM.
Forbes 500 Companies Sales (Appendix 11)
There is a presumed relationship between a company’s sales and its assets. The dataset has sales
and asset data for various industry sectors (Communication, Energy, Finance, HiTech,
Manufacturing, Medical, Retail, Transportation and Other).
We want to determine if the sales / asset relationship is the same for each sector.
Examine the scatter plot. The untransformed data shows little pattern. The log transformed data
shows some increasing trend.
We want to know if the apparent trends are significant and we want to know if the same trend is
apparent for each sector.
This will be fitted in PROC MIXED because I want to use TYPE I SS more than TYPE II or TYPE
III.
Summary
Analysis of Covariance is the combination of quantitative variables and categorical variables.
Multisource regression is the expression of an ANCOVA that reduces to a SLR.
Analyses that culminate in ANOVAs will be
discussed at the end of the course. This type is where
the term “Analysis of Covariance” was developed.
You should be able to examine the graph and
determine what is tested and what Extra SS are tested.
[Figure: diagram of model comparisons a through g, with nodes labeled SLR and ANOVA]
Analysis of Variance and Experimental Design
The simplest model or analysis for Analysis of Variance (ANOVA) is the CRD, the Completely
Randomized Design. This model is also called “One-way” Analysis of Variance.
Unlike regression, which fits slopes for regression lines and calculates a measure of random variation
about those lines, ANOVA fits means and variation about those means.
The hypotheses tested are hypotheses about the equality of means, $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \dots = \mu_t$, where
the $\mu_i$ represent means of the levels of some categorical variable and
“t” is the number of levels in the categorical variable.
$H_1$: some $\mu_i$ is different
We will generically refer to the categorical variable as the “treatment” even though it may not actually
be an experimenter manipulated effect.
The number of treatments will be designated “t”.
The number of observations within treatments will be designated n for a balanced design (the
same number of observations in each treatment), or ni for an unbalanced design (for i = 1 to t).
The assumptions for basic ANOVA are very similar to those of regression.
The residuals, or deviations of observations within groups, should be normally distributed.
The treatments are independently sampled.
The variance of each treatment is the same (homogeneous variance).
ANOVA review
I am borrowing some material from my EXST7005 notes on t-test and ANOVA. See those notes
for a more complete review of the introduction to Analysis of Variance (ANOVA).
Start with the logic behind ANOVA.
Prior to R. A. Fisher's development of ANOVA, investigators were likely to have used a series of t
tests to test among t treatment levels.
What is wrong with that? Recall the Bonferroni adjustment. Each time we do a test we increase
the chance of error. To test among 3 treatments we need to do 3 tests; among 4 treatments, 6 tests;
among 5 treatments, 10 tests; etc.
What is needed is ONE test for a difference among all treatments with one overall value of $\alpha$ specified
by the investigator (usually 0.05).
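The test counts quoted above are the number of distinct pairs, t(t-1)/2. The short sketch below reproduces them and adds a rough figure for how the chance of at least one false rejection grows. That second figure assumes the tests are independent, which pairwise t tests on shared groups are not, so treat it only as an approximation.

```python
from math import comb

alpha = 0.05  # per-test significance level

for t in (3, 4, 5):
    m = comb(t, 2)                      # number of pairwise t tests among t treatments
    familywise = 1 - (1 - alpha) ** m   # approximate; assumes independent tests
    bonferroni = alpha / m              # per-test level under the Bonferroni adjustment
    print(f"t={t}  tests={m:2d}  approx. overall error={familywise:.3f}  "
          f"Bonferroni per-test alpha={bonferroni:.4f}")
```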
Fisher's solution was simple, but elegant.
Suppose we have a treatment with 5 categories or levels. We can calculate a mean and
variance for each treatment level. In order to get one really good estimate of variance we can
pool the individual variances of the 5 categories (assuming homogeneity of variance).
This pooled variance can be calculated as a weighted mean of the variance (weighted by the
degrees of freedom).
[Figure: observations (Y) plotted by Group A, B, C, D, E]
And since $S_1^2 = SS_1 / (n_1 - 1)$, then $(n_1 - 1) S_1^2 = SS_1$, so the weighted mean is simply the sum of the
SS divided by the sum of the d.f.
$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2 + (n_3-1)S_3^2 + (n_4-1)S_4^2 + (n_5-1)S_5^2}{(n_1-1) + (n_2-1) + (n_3-1) + (n_4-1) + (n_5-1)}$$
$$S_p^2 = \frac{SS_1 + SS_2 + SS_3 + SS_4 + SS_5}{(n_1-1) + (n_2-1) + (n_3-1) + (n_4-1) + (n_5-1)}$$
So we have one very good estimate of the random variation, or sampling error, $S_p^2$.
Then what?
Now consider the treatments. Why don't they all fall on the overall mean? Actually, under the
null hypothesis, they should, except for some random variation. So if we estimate that random
variation, it should be equal to the same error we already estimated within groups.
Recall the variance of means is estimated as $S^2/n$, the variance of the sample divided by the
sample size. The standard error is the square root of this.
If we actually use means to estimate a
variance, we are also estimating the variance
of means, $S^2/n$. If we multiply this by “n” it
should actually be equal to $S^2$, which we
estimated with $S_p^2$, the pooled variance
within groups.
So if the null hypothesis is true, the mean
square of the deviations within groups
should be equal to the mean square of the
deviations of the means multiplied by n.
[Figure: group means (Y) plotted by Group A, B, C, D, E]
Now, if the null hypothesis is not true, and some $\mu_i$
is different, then what?
Then, when we calculate a mean square of
deviations of the means from the overall mean, it
should be larger than the previously estimated $S_p^2$.
So we have two estimates of variance, $S_p^2$ and the
variance from the treatment means. If the null
hypothesis is true, they should not be significantly
different.
[Figure: Y plotted by Group A, B, C, D, E]
If the null hypothesis is FALSE, the treatment mean square should be larger. It will therefore
be a ONE TAILED TEST!
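The logic of comparing the two variance estimates can be sketched numerically. The balanced data set below is hypothetical (5 levels, 4 observations each). The within-group estimate is the pooled variance, the between-group estimate is n times the variance of the treatment means, and their ratio is the F statistic.

```python
from statistics import mean, variance

# Hypothetical balanced data: t = 5 treatment levels, n = 4 observations each.
groups = {
    "A": [4.0, 5.0, 6.0, 5.0],
    "B": [7.0, 8.0, 9.0, 8.0],
    "C": [5.0, 6.0, 7.0, 6.0],
    "D": [9.0, 10.0, 11.0, 10.0],
    "E": [6.0, 7.0, 8.0, 7.0],
}
t = len(groups)
n = 4

# Estimate 1: pooled within-group variance (equal n, so a plain average of variances).
ms_error = mean([variance(obs) for obs in groups.values()])

# Estimate 2: variance of the treatment means, scaled up by n.
treatment_means = [mean(obs) for obs in groups.values()]
ms_treatment = n * variance(treatment_means)

# Under the null hypothesis both estimate the same variance, so F should be near 1;
# a large F (upper tail only) suggests the means differ.
f_ratio = ms_treatment / ms_error
print(f"MS treatment = {ms_treatment:.3f} on {t - 1} d.f.")
print(f"MS error     = {ms_error:.3f} on {t * (n - 1)} d.f.")
print(f"F            = {f_ratio:.2f}")
```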
We usually present this in an “Analysis of Variance” table.
Source       d.f.      Sum of Squares   Mean Square
Treatment    t–1       SSTreatment      MSTreatment
Error        t(n–1)    SSError          MSError
Total        tn–1      SSTotal
Degrees of freedom
There are tn observations total ($\sum n_i$ if unbalanced).
After the correction factor, there are tn–1 d.f. for the corrected total.
There are t–1 degrees of freedom for the t treatment levels.
Each group contributes n–1 d.f. to the pooled error term. There are t groups, so the
pooled error (MSE) has t(n–1) d.f.
The SSTreatments is the SS deviations of the treatment means from the overall mean.
Each deviation is denoted $\tau_i$, and is called a treatment “effect”.
Summed over all tn observations of a balanced design, each squared deviation is counted n times:
$SS_{Treatments} = n \sum_{i=1}^{t} (\bar{Y}_i - \bar{Y})^2 = n \sum_{i=1}^{t} \tau_i^2$
The model for regression is $Y_i = b_0 + b_1 X_i + e_i$
The effects model for a CRD is $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$
where the treatments are i=1, 2, ... t and
the observations are j=1, 2, ... n, or ni for unbalanced data
An alternative expression for the CRD, called the means model, is $Y_{ij} = \mu_i + \epsilon_{ij}$
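To connect the effects model with the ANOVA table, the sketch below works through a small hypothetical balanced CRD (3 treatments, 3 observations each). It estimates the treatment effects as treatment mean minus overall mean, builds the sums of squares from them, and confirms that the treatment and error SS add up to the corrected total.

```python
# Hypothetical balanced CRD: t = 3 treatments, n = 3 observations each.
data = {
    "T1": [3.0, 4.0, 5.0],
    "T2": [6.0, 7.0, 8.0],
    "T3": [4.0, 6.0, 8.0],
}
t = len(data)
n = 3

all_obs = [v for obs in data.values() for v in obs]
grand_mean = sum(all_obs) / len(all_obs)
trt_means = {k: sum(obs) / len(obs) for k, obs in data.items()}

# Effects model: estimated effect = treatment mean minus overall mean.
tau = {k: m - grand_mean for k, m in trt_means.items()}

# Sums of squares.
ss_treatment = n * sum(e ** 2 for e in tau.values())
ss_error = sum((v - trt_means[k]) ** 2 for k, obs in data.items() for v in obs)
ss_total = sum((v - grand_mean) ** 2 for v in all_obs)

# Degrees of freedom and mean squares, laid out as in the ANOVA table.
df_treatment, df_error, df_total = t - 1, t * (n - 1), t * n - 1
ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error

print(f"{'Source':<10}{'d.f.':>5}{'SS':>9}{'MS':>9}")
print(f"{'Treatment':<10}{df_treatment:>5}{ss_treatment:>9.3f}{ms_treatment:>9.3f}")
print(f"{'Error':<10}{df_error:>5}{ss_error:>9.3f}{ms_error:>9.3f}")
print(f"{'Total':<10}{df_total:>5}{ss_total:>9.3f}")
```

The estimated effects sum to zero in a balanced design, which is why only t–1 of them are free to vary.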
Statistics quote: Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital. -Aaron Levenstein