Statistical Techniques II Page 83

So, we would test each pair of means using the two-sample t-test as

t = (Ȳ1 − Ȳ2) / sqrt(s²p (1/n1 + 1/n2)).

For ANOVA, using the MSE as our variance estimate, we have

t = (Ȳ1 − Ȳ2) / sqrt(MSE (1/n1 + 1/n2)).

If the design is balanced this simplifies to

t = (Ȳ1 − Ȳ2) / sqrt(2·MSE/n).

Notice that if the calculated value of t is greater than the tabular value of t, we would reject the
null hypothesis. Conversely, if the calculated value of t is less than the tabular value we
would fail to reject.
Call the tabular value t*, and write the case for rejection of H0 as

t = (Ȳ1 − Ȳ2) / sqrt(2·MSE/n) > t*.

So we would reject H0 if

(Ȳ1 − Ȳ2) > t* · sqrt(2·MSE/n)  or  (Ȳ1 − Ȳ2) < −t* · sqrt(2·MSE/n).

So, for any difference (Ȳ1 − Ȳ2) that is greater than t* · sqrt(2·MSE/n) we find the difference between
the means to be statistically significant (reject H0), and for any value less than this value we
find the difference to be consistent with the null hypothesis. Right?
This value, t* · sqrt(2·MSE/n), is what R. A. Fisher called the "Least Significant Difference",
commonly called the LSD (not to be confused with the Latin Square Design, also abbreviated LSD).

LSD = t_critical · sqrt(MSE (1/n1 + 1/n2)),  or  LSD = t_critical · S(Ȳ1 − Ȳ2).

This value is the exact width of an interval for Ȳ1 − Ȳ2 which would give a t-test equal to
t_critical. Any larger value would be "significant" and any smaller value would not; hence the
name "Least Significant Difference".

We calculate this value for each pair of means, and if the observed difference is less, the
treatments are "not significantly different". If greater, they are "significantly different".
One last detail. I have used the simpler version of the variance assuming that n1 = n2. If the
experiment is unbalanced (i.e., there are unequal numbers of observations in the treatment
levels) then the value is MSE (1/n1 + 1/n2).

James P. Geaghan – Copyright 2011

The property of balance is nice because all of the pairwise tests have the same sample
sizes and the same standard error. However, balance is not necessary. For an
unbalanced design we must calculate the standard error, sqrt(MSE (1/n1 + 1/n2)), for each
pairwise test because they will be different.
This is the first of our post ANOVA tests; it is called the "LSD".
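As a quick sketch of this arithmetic (illustrative Python, not SAS; the function name and the t* value below are my own choices), the LSD is just the critical t value times the standard error of the difference:

```python
import math

def lsd(t_crit, mse, n1, n2):
    """Least Significant Difference: t_crit times the standard error of the
    difference between two treatment means, using the ANOVA MSE."""
    return t_crit * math.sqrt(mse * (1.0 / n1 + 1.0 / n2))

# Balanced example: MSE = 25, n = 5 per level, illustrative t* = 2.0
print(round(lsd(2.0, 25, 5, 5), 4))  # → 6.3246
```

Any observed difference |Ȳ1 − Ȳ2| larger than this value would be declared significant.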
But hey, wait a minute! Didn't Fisher invent ANOVA in the first place to avoid doing a bunch
of separate t-tests? So now we are doing a bunch of separate t-tests. What is wrong with this
picture?
So, this is Fisher’s solution.
When we do a bunch of separate t-tests, we don't know if there are any real differences at the α level. After we do the ANOVA test we know that there are some differences. So we only
do the LSD if the ANOVA says that there are actually differences, otherwise, don't do the
LSD.
This is called “Fisher's Protected LSD”: we use the LSD ONLY if the ANOVA shows
differences, otherwise we are NOT justified in using the LSD.
Makes sense. But there were still a lot of nervous statisticians looking for something a little
better. As a result there are MANY alternative calculations. We will discuss the “classic”
solutions.
This least significant difference calculation can be used to either do pairwise tests on observed
differences or to place a confidence interval on observed differences.
The LSD can be done in SAS in one of two ways. The MEANS statement produces a range
test (LINES option) or confidence intervals (CLDIFF option), while the LSMEANS statement
gives pairwise comparisons.
Other Post ANOVA tests

Basically, we calculate the LSD with our chosen value of α. We then do our mean comparisons.
Each test has a pairwise error rate of α.
We have already seen one alternative, the Bonferroni adjustment. If we do 5 tests, or 10 tests, our
error rate is no more than 5(α/2) or 10(α/2). Generally, for "g" tests our error rate is no more than
g(α/2). To keep an experimentwide error rate of α, we simply do each comparison using a t value
for an α equal to α/(2g).

For two-tailed tests (which are the most common) we do each test at α/2, and the Bonferroni test
would use a t for an error rate of α/(2g). One-tailed tests are possible, but usually only done with
Dunnett's test, discussed below.
The Bonferroni adjustment is fine if we are only doing a few tests. However, it is an upper
boundary of the error, the highest that the error can be. The real probability of error is actually
less, perhaps much less. So if we are doing very many tests, Bonferroni gets very
conservative, giving us an actual error rate much lower than the α we really want.
So we seek alternatives.
The major alternatives are Tukey's and Scheffé's. We will also consider Dunnett's and Duncan's
since they are fairly common. Each of the tests is discussed below.
The LSD has an α probability of error on each and every test, or for each comparison. It is
referred to as a comparisonwise error rate. The whole idea of ANOVA is to give a probability of
error that is α for the whole experiment, so much work in statistics has been dedicated to this
problem. Some of the most common and popular alternatives are discussed below. Most of
these are also discussed in your textbook.

The LSD is the LEAST conservative of those discussed, meaning it is the one most likely to
detect a difference and it is also the one most likely to make a Type I error when it finds a
difference. However, since it is unlikely to miss a difference that is real, it is also the most
powerful. The probability distribution used to produce the LSD is the t distribution.
Bonferroni's adjustment. Bonferroni pointed out that in doing k tests, each at a probability of
Type I error equal to α, the overall experimentwise probability of Type I error will be NO
MORE than k·α, where k is the number of tests. Therefore, if we do 7 tests, each at α = 0.05,
the overall rate of error will be NO MORE than α = 0.35, or 35%. So, if we want to do 7 tests and
keep an error rate of 5% overall, we can do each individual test at a rate of α/k = 0.05/7 =
0.007143. For the 7 tests we have an overall rate of 7 × 0.007143 = 0.05. The probability
distribution used to produce the Bonferroni adjustment is the t distribution.
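The Bonferroni arithmetic above is simple enough to check directly (a sketch in plain Python; the function name is mine):

```python
def bonferroni_per_test_alpha(alpha_overall, k):
    """Per-test alpha so that k tests have an overall Type I error rate
    of no more than alpha_overall (Bonferroni adjustment)."""
    return alpha_overall / k

a = bonferroni_per_test_alpha(0.05, 7)
print(round(a, 6))      # each of the 7 tests: 0.007143
print(round(7 * a, 2))  # the overall bound recovers 0.05
```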
Duncan's multiple range test. This test is intended to give groupings of means that are not
significantly different among themselves. The error rate is α for each group, and has sometimes
been called a familywise error rate. This is done in a manner similar to Bonferroni, except the
error rate is calculated as [1 − (1 − α)^(r−1)] instead of the sum of the α values, for
comparing two means that are r steps apart, where for adjacent means r = 2. Two means
separated by 3 other means would have r = 5, and the error rate would be [1 − (1 − α)^(r−1)] =
[1 − (1 − 0.05)^4] = 0.1855. The value of α needed to keep an error rate of α is the reverse of this
calculation, [1 − (1 − 0.05)^(1/4)] = 0.0127.

The Student-Newman-Keuls test is a similar test to Duncan's, controlling the familywise error
rate. The value of α is calculated as 1 − (1 − α)^(k/2).

Tukey's adjustment. This test seems to be most appropriate in most cases since it keeps an
error rate of α for all possible pairwise tests for the whole experiment, which is often what an
investigator wants to do. This test basically allows for all pairwise tests and keeps an
experimentwise error rate of α for all pairwise tests. To allow for this, Tukey developed his own tables (see Appendix Table A.7 in
your book, “Percentage points of the studentized range”). For “t” treatments and a given error
degrees of freedom the table will provide critical values for 5% and 1% error rates that give an experimentwise
rate of Type I error.
Note SAS puts “HSD” by Tukey's. This stands for “Honest Significant Difference”.
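The Duncan/SNK error-rate calculations above can be checked numerically (a sketch; the function names are mine):

```python
def stepwise_error_rate(alpha, r):
    """Duncan-style familywise error rate for two means r steps apart:
    1 - (1 - alpha)**(r - 1)."""
    return 1 - (1 - alpha) ** (r - 1)

def alpha_for_target_rate(target, r):
    """Reverse calculation: per-comparison alpha that keeps the r-step
    familywise rate at `target`."""
    return 1 - (1 - target) ** (1.0 / (r - 1))

print(round(stepwise_error_rate(0.05, 5), 4))    # r = 5: 0.1855
print(round(alpha_for_target_rate(0.05, 5), 4))  # reverse: 0.0127
```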
Scheffé's adjustment. This test is the most conservative. It allows the investigator to do all
possible tests, and still maintain an experimentwise error rate of α. "All possible" tests
includes not only all pairwise tests, but comparisons of all possible combinations of treatments
with other combinations of treatments (e.g. H0: (μ1 + μ2)/2 = (μ2 + μ4 + μ5)/3; such comparisons
are CONTRASTS and will be covered later). The calculation is based on a square root of the F
distribution, and can be used for range-type tests or confidence intervals. The test is more
general than the others mentioned for the special case of pairwise comparisons.

The critical value for Scheffé's test is based on the F distribution. The statistic is given by
sqrt((t − 1) · F(t − 1, t(n − 1))) for a balanced design with t treatments and n observations per treatment.
This test is appropriate for “data dredging”.
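The Scheffé critical value is easy to sketch once a tabular F value is in hand (illustrative Python; the F value below, roughly 3.24 for F(0.05; 3, 16), is taken from an F table and is an assumption of this example):

```python
import math

def scheffe_critical(t_treatments, f_crit):
    """Scheffé critical value for pairwise-type statistics:
    sqrt((t - 1) * F), with F on (t - 1, error df) degrees of freedom."""
    return math.sqrt((t_treatments - 1) * f_crit)

# Example: t = 4 treatments, n = 5 per treatment, error df = t(n - 1) = 16;
# F(0.05; 3, 16) is about 3.24 (from an F table).
print(round(scheffe_critical(4, 3.24), 3))  # → 3.118
```

This critical value plays the same role t* plays in the LSD, but is inflated to cover all possible contrasts.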
Place the post hoc tests above in order from the one most likely to detect a difference (and the one
most likely to be wrong) to the one least likely to detect a difference (and the one least likely to be
wrong). LSD is first, followed by Duncan's test, Tukey's and finally Scheffé's. Dunnett's is a
special test that is similar to Tukey's, but for a specific purpose, so it does not fit well in the
ranking. The Bonferroni approach produces an upper bound on the error rate, so it is conservative
for a given number of tests. It is a useful approach if you want to do a few tests, fewer than
allowed by one of the others (e.g. you may want to do just a few and not all possible pairwise). In
this case, the Bonferroni may be better.
Note that if you want to do a couple of pairwise tests you can calculate Bonferroni and compare the
critical value to Tukey's. Tukey's is for all pairwise tests and would be conservative for fewer than
all pairwise tests. Bonferroni may be overly conservative because it is a bound. For other sets of
tests including some that are not pairwise, compare Bonferroni to Scheffé.
Post ANOVA test comparison
Comparisonwise error rate: LSD
Experimentwise error rate: Tukey (all pairwise), Bonferroni (selected tests), Scheffé (all
possible contrasts).

When doing pairwise tests, the LSD is the test most likely to find differences, and the one most
likely to be wrong when it finds a difference. However, power is the ability to find differences, so
although error prone in the type I error sense, the LSD is the most powerful of the tests.
Scheffé is the test least likely to find a difference, and least likely to be wrong with respect to type
I error.
There are other tests that are used in particular circumstances. We will mention only Dunnett's, which is
used to compare one treatment (usually a control) to all other treatments. This is the only post hoc
test in SAS that has onetailed tests (e.g. DUNNETTL and DUNNETTU).
Applying Post ANOVA test comparisons

All of these tests can be expressed in one of two ways.
If the analysis is BALANCED, then there is a popular expression of pairwise tests that starts with
ranked means.
Suppose we calculate a value of the LSD equal to 8, and we have sorted the means of treatment
levels and have 5, 14, 17, 23, 29, and 38.
Treatment Level:   3    1    6    5    2    4
Mean:             38   29   23   17   14    5

If the critical value of the LSD = 8, then means that differ by less than 8 do not differ
statistically. This is represented by giving them a common letter, so that they share a letter.

For an LSD critical value of 8:
Treatment Level   Mean   Groups
      3            38    A
      1            29    B
      6            23    B C
      5            17    D C
      2            14    D
      4             5    E

Same means compared with a Tukey adjusted critical value of 10:
Treatment Level   Mean   Groups
      3            38    A
      1            29    A B
      6            23    B C
      5            17    C
      2            14    C
      4             5    D

Same means compared with a Scheffé adjusted critical value of 15:
Treatment Level   Mean   Groups
      3            38    A
      1            29    A B
      6            23    A B
      5            17    B C
      2            14    B C
      4             5    C

SAS Example (Appendix 12)

Note the test of homogeneity of variance (RANDOM or REPEATED statement).
Test the effects of TREATMENTS.
Post hoc tests: They can be done from MIXED using the LSMEANS statement. In GLM either
the MEANS or LSMEANS statement can be used.

SAS statement results to compare (post ANOVA or post hoc tests):
Results with the LSD.
Results with Tukey's.
Results with Scheffé's.
Results with Dunnett's.

NOTE that normally only one post-ANOVA examination would be done. We have done
several here in order to compare.
Note the use of a macro to get sorted and labeled means to indicate significant differences.

Comparison of ranked means works very well if the analysis is balanced. If the analysis is
not balanced there can be a problem. It is possible that means that are close together are
significantly different, while means that have a greater difference are not significantly different.
Where variance = MSE (1/n1 + 1/n2) and MSE = 25:

tmt   mean     n
 1     18      5
 2     13    100
 3     12      5

test    diff     se     t value   d.f.   P value
1 v 2     5    2.2913   2.1822    103    0.02398
2 v 3     1    2.2913   0.4364    103    0.66343
1 v 3     6    3.1623   1.8974      8    0.09435
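The standard errors and t values in this table can be reproduced directly from the unbalanced formula (a sketch; the p-values would still come from the t distribution with the listed degrees of freedom):

```python
import math

def pairwise_t(mean1, mean2, n1, n2, mse):
    """Standard error and t statistic for a pairwise comparison,
    using the ANOVA MSE as the variance estimate."""
    se = math.sqrt(mse * (1.0 / n1 + 1.0 / n2))
    return se, (mean1 - mean2) / se

for label, m1, m2, n1, n2 in [("1 v 2", 18, 13, 5, 100),
                              ("2 v 3", 13, 12, 100, 5),
                              ("1 v 3", 18, 12, 5, 5)]:
    se, t = pairwise_t(m1, m2, n1, n2, 25)
    print(label, round(se, 4), round(t, 4))
```

Note how the 1 v 2 comparison (difference of 5) is significant while the larger 1 v 3 difference of 6 is not, because the latter has a much larger standard error and fewer degrees of freedom.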
for each mean and see if the confidence intervals overlap. By default, SAS will use this
approach for unbalanced means. PostANOVA tests Having rejected the Null hypothesis in Analysis of Variance we would usually wish to determine
how the treatment levels differ from each other. This is the “postANOVA” part of the analysis.
These tests fall into two general categories. We have already discussed the post hoc tests (LSD,
Tukey, Scheffé, Duncan's, Dunnett's, etc.). These tests are often (usually?) done with no a priori
hypotheses in mind. That means we do not have any particular comparisons in mind before doing
the experiment; we want to examine many, or all, levels of the treatments for differences from one
another, and each test is done with an α probability of error. The use of an experimentwise error
rate is intended to permit these a posteriori comparisons without inflating the error rate for the
analysis.
We will now discuss a priori tests or preplanned comparisons (contrasts). These a priori tests are
better in many ways because the researcher plans on doing particular tests before the data is
gathered. If we dedicate 1 d.f. to each one we generally feel comfortable doing each test at some
specified level of alpha, usually 0.05. However, since multiple tests do entail risks of higher
experiment wide error rates, it would not be unreasonable to apply some technique, like
Bonferroni's adjustment, to ensure an experimentwise error rate at the desired level of alpha (α).
When we want some lesser number of comparisons, and they are determined a priori (without
looking at the data), then we can use a less stringent criterion. We generally feel comfortable with
one test per degree of freedom at some specified level of alpha (α), just as we did in regression
(looking at each regression coefficient with an α level of error).
This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang, J. during the Fall '08 term at LSU.