Statistical Methods I (EXST 7005) Page 135

Our “sampling unit” is a smaller unit than the experimental unit (a plot), so we have sampling error. It is replicated within blocks as multiple samples in an experimental unit; the error is sampling error.

Another type of error comes from having several plots with a given soybean variety in each field. Here each variety of soybean has several experimental units in each field. In this case the additional replication represents a second experimental error: one for block by treatment combinations and one for replicate plots within a block.

[Diagram: three fields (blocks), each divided into twelve plots, with each of the six treatments t1–t6 assigned to multiple plots within each field.]

In this case we have replicated experimental units in each block.

Factorial EMS
I haven't yet mentioned EMS for factorials.
Developing EMS can be pretty simple. Start with the lowest unit, and move up the source table
adding additional variance components for each new term.
Source        EMS (with reps)
Treatment     σ²ε + nσ²τβ + nb·σ²τ
Block         σ²ε + nσ²τβ + nt·σ²β
Exptl Error   σ²ε + nσ²τβ
Rep Error     σ²ε

Interaction components occur on their own line, and on the source line for each higher effect contained in the interaction.
Each main effect gets its own source.
Now consider whether the effects are fixed or random. Modify fixed effects to show SSEffects instead
of variance components.
Source        EMS (with reps)
Treatment     σ²ε + nσ²τβ + nb·Στ²i / (t−1)
Block         σ²ε + nσ²τβ + nt·σ²β
Exptl Error   σ²ε + nσ²τβ
Rep Error     σ²ε

If the model is an RBD we're done, because the interaction is always a random variable. For factorials that are random models or mixed models we're also done.
Consider what the F test should be for the treatment. Surprise, SAS always uses the residual error term!

James P. Geaghan Copyright 2010
Statistical Methods I (EXST 7005) Page 136

But for factorials there is one last detail. It is perfectly possible in factorial designs that both effects are fixed, and if both effects are fixed the interaction is also fixed!
Source           EMS (with reps)
Treatment A      σ²ε + nb·Στ²Ai / (a−1)
Treatment B      σ²ε + na·Στ²Bj / (b−1)
Interaction A*B  σ²ε + n·Σ(τAτB)²ij / ((a−1)(b−1))
Error            σ²ε

And a FIXED effect occurs only on its own line, no other!! The fixed interaction disappears from the main effects!!!
Now what is the error term for testing treatments and interaction? Maybe SAS is right? Or maybe SAS just doesn't know what is fixed and what is random.

Testing ANOVAs in SAS

So tell SAS what is random and what is fixed. Look for the following additions to the SAS program.
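As a sketch of those additions for the blocked design above (the data set and variable names are hypothetical), the RANDOM statement in PROC GLM declares the random terms, prints the expected mean squares, and its TEST option requests F tests built from the matching error terms:

```
proc glm data=soybeans;               /* hypothetical data set name */
  class field variety;
  model yield = field variety field*variety;
  random field field*variety / test;  /* declare random terms; prints EMS
                                         and tests each effect against the
                                         appropriate error term */
run;
```

PROC MIXED takes the same idea further: random terms go on its RANDOM statement, and the fixed-effect tests are constructed accordingly.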
How do we tell SAS which terms to test with what error term?
How do we get SAS to output EMS?
How do we get SAS to automagically test the right treatment terms with the right error terms?

Summary
Randomized Block Designs modify the model by factoring a source of variation out of the error
term in order to reduce the error variance and increase power. If the basis for blocking is
good, this will be effective. If the basis for blocking is not good, we lose a few degrees of
freedom from the error term and may actually lose power.
The block by treatment combinations (interaction?) provide a measure of variation in the
experimental units and provide an adequate error term.
We have an additional assumption that this error term represents ONLY experimental error, and
not some real interaction between the treatments and blocks.
Expected mean squares for the RBD indicate that the experimental error term is the correct error
term, whether there is a sampling unit or not.
Factorial designs where effects are random or mixed are similar to the RBD EMS. THE TREATMENT INTERACTION IS ACTUALLY USED AS AN ERROR TERM!
When the treatments are fixed, the main effects do not contain the interaction term, and the
residual error term is the appropriate error term.

Sample size in ANOVA
Some textbooks use a slightly different expression for the equation, but it is the same as the equation discussed previously: n ≥ (tα/2 + tβ)²·S²/d². One minor change is the expression of the difference. An alternative to the use of d is expressing the difference as a percentage of the mean. For example, if we wanted to test for a difference that was 10% of the mean we could use the expression n ≥ (tα/2 + tβ)²·S²/(0.1Ȳ)². This expression can be further altered to express the difference in terms of the coefficient of variation, CV = S/Ȳ. Calculating the sample size needed to detect a 10% change in the mean then becomes n ≥ (tα/2 + tβ)²·(10·CV)².

In analysis of variance we may also want to be able to detect a certain difference between two means (μ1 and μ2) out of the treatment means we are studying, so our difference will be μ1 − μ2. A prior analysis, or a pilot study, may provide us with an estimate of the variance (MSE in ANOVA). From here we can use a formula pretty much the same as for the t-test discussed earlier. There is one other little detail, however.
We are basically testing H0: μ1 − μ2 = δ, from the two-sample t-test. Recall from our linear combinations that we have a variance for this linear combination that is the sum of the individual variances of the means. Therefore, the variance will be S₁²/n₁ + S₂²/n₂. Since we are usually pooling variances (ANOVA), the formula simplifies to MSE·(1/n₁ + 1/n₂).

Furthermore, since we usually attempt to have balanced experiments (equal sample size in each group) for analysis of variance, the formula further simplifies to an expression similar to one seen previously, except for the addition of “2”: 2·MSE/n. The additional “2” occurs when we are testing for a difference in two means (H0: μ1 = μ2) as opposed to testing a mean against a hypothesized value (H0: μ = μ0).
Note one very important thing here. In this formula “n” represents each group or population being studied, that is, each “treatment level” in an analysis of variance!

So for ANOVA or a two-sample t-test with equal variance and equal n, the expression for sample size is n ≥ 2(tα/2 + tβ)²·MSE/d². Note that this “n” is for each treatment. In a two-sample t-test, each population would have a sample size of “n”, so the total number of observations would be 2n. In ANOVA we have “t” treatments; each would have a sample size of “n”, so the total number of observations would be tn.

How often are we likely to have situations with equal variance and equal n? Is this realistic? Actually, yes it is.
First, ANOVA traditionally required equal variances, though more modern analytical techniques
can address the lack of homogeneity. If necessary, equal variances may be achieved by a
transformation or some other fix. If variance is nonhomogeneous you could use the larger
estimates and get a conservative estimate of “n”.
Second, the most common application for sample size calculation is in planning NEW studies, and of course in planning new studies you usually do not PLAN on unbalanced designs and non-homogeneous variance.
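A minimal numeric sketch of the calculation n ≥ 2(tα/2 + tβ)²·MSE/d²: the t quantiles depend on n through the error degrees of freedom, so the formula must be iterated. The MSE and d values below are made up, and SciPy is assumed to be available.

```python
# Iterative per-group sample size for detecting a difference d between two
# treatment means, using n >= 2 * (t_alpha/2 + t_beta)^2 * MSE / d^2.
# The degrees of freedom for the t quantiles depend on n, so we iterate.
import math
from scipy import stats

def sample_size_per_group(mse, d, alpha=0.05, power=0.80, groups=2):
    """Per-treatment n; df for the pooled error term is groups*(n - 1)."""
    beta = 1.0 - power
    n = 2  # starting guess
    for _ in range(100):  # iterate until n stabilizes
        df = groups * (n - 1)
        t_alpha = stats.t.ppf(1 - alpha / 2, df)
        t_beta = stats.t.ppf(1 - beta, df)
        n_new = math.ceil(2 * (t_alpha + t_beta) ** 2 * mse / d ** 2)
        if n_new == n:
            break
        n = n_new
    return n

# Hypothetical pilot-study values: MSE = 9, difference to detect d = 3
n = sample_size_per_group(mse=9.0, d=3.0)
print(n)  # per-group n; total observations would be t * n
```

Note the per-group interpretation of n discussed above: for t treatments the total number of observations is t·n.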
So these situations are realistic.

Summary
Finally we saw that this formula is applicable to two-sample t-tests and ANOVA, with some modifications in the estimate of the variance. These modifications are the same ones needed for the two-sample t-test as dictated by our study of linear combinations. However, the calculations are simplified by the common ANOVA assumption of equal variance and the prevalence of balanced experiments.

Review of Analysis of Variance procedures

1) H0: μ1 = μ2 = μ3 = μ4 = . . . = μt = μ
2) H1: some μi is different
3a) Assume that the observations are normally distributed about each mean, or that the
residuals (i.e. deviations) are normally distributed.
b) Assume that the observations are independent
c) Assume that the variances are homogeneous
4) Set the level of type I error. Usually α = 0.05
5) Determine the critical value. For a balanced CRD with a single factor treatment the test is an
F test with t–1 and t(n–1) degrees of freedom (Fα=0.05, t–1, t(n–1) d.f.).
6) Obtain data and evaluate.
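As a sketch of step 6 on a small made-up balanced data set, the sums of squares can be computed directly from the uncorrected-SS and correction-factor formulas developed below:

```python
# Treatment, error, and total sums of squares for a balanced one-way CRD,
# via uncorrected SS and the correction factor.
# The data values are made up for illustration (t = 3 treatments, n = 4).
data = [
    [5.0, 7.0, 6.0, 8.0],   # treatment 1
    [4.0, 5.0, 6.0, 5.0],   # treatment 2
    [9.0, 7.0, 8.0, 8.0],   # treatment 3
]
t = len(data)
n = len(data[0])

uss_trt = sum(sum(row) ** 2 for row in data) / n          # USS(treatments)
uss_total = sum(y ** 2 for row in data for y in row)      # USS(total)
cf = sum(y for row in data for y in row) ** 2 / (t * n)   # correction factor

ss_trt = uss_trt - cf         # corrected treatment SS
ss_total = uss_total - cf     # corrected total SS
ss_error = ss_total - ss_trt  # error SS by subtraction

print(ss_trt, ss_error, ss_total)  # 18.0 9.0 27.0
```

These SS, divided by their degrees of freedom (t−1 and t(n−1)), give the mean squares for the F test.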
The treatment sum of squares, as developed by Fisher, is converted to a “variance” and tested with an F test against the pooled error variance. In practice, the sums of squares are usually calculated and presented with the degrees of freedom in a table called an ANOVA table. For a balanced design (all ni equal) the calculations are as follows.

The uncorrected SS for treatments is USSTreatments = Σᵢ (Σⱼ Yij)² / n = n·Σᵢ (Σⱼ Yij / n)², summing i = 1 to t and j = 1 to n.

The uncorrected SS for the total is USSTotal = Σᵢ Σⱼ Y²ij.

The correction factor for both terms is CF = (Σᵢ Σⱼ Yij)² / (tn).

Our ANOVA analyses will be done with PROC MIXED and PROC GLM. There is a PROC ANOVA, but it is a subset of PROC GLM.

LSMeans calculation
The calculations of LSMeans are different. For a balanced design, the results will be the same. However, for unbalanced designs the results will often differ.

The MEANS statement in SAS calculates a simple mean of all available observations in the treatment cells. The LSMEANS statement will calculate the mean of the treatment cell means.

Example: The MEAN of 4 treatments, where the observations are 3, 4, 8 for a1; 3, 5, 6, 7, 9 for a2; 7, 8, 6, 7 for a3; and 3, 5, 7 for a4, is 5.8667. The individual cell means are 5, 6, 7 and 5 for a1, a2, a3 and a4 respectively. The mean of these 4 values is 5.75. This would be the LSMean.
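The distinction in this example can be sketched in a few lines: the raw mean averages all observations, while the LSMean averages the cell means.

```python
# Raw mean vs. LSMean for the one-way example above.
cells = {
    "a1": [3, 4, 8],
    "a2": [3, 5, 6, 7, 9],
    "a3": [7, 8, 6, 7],
    "a4": [3, 5, 7],
}
all_obs = [y for obs in cells.values() for y in obs]
raw_mean = sum(all_obs) / len(all_obs)                    # 88 / 15
cell_means = [sum(obs) / len(obs) for obs in cells.values()]
lsmean = sum(cell_means) / len(cell_means)                # (5+6+7+5) / 4
print(round(raw_mean, 4), lsmean)  # 5.8667 5.75
```

With equal n per cell the two would agree; the unequal cell sizes are what pull the raw mean (5.8667) away from the LSMean (5.75).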
Raw means

Treatments   a1     a2       a3     Means
b1           5, 7   6, 8, 4  9      6.5
b2           7, 9   5        7, 5   6.6
Means        7      5.75     7      6.55

LSMeans

Treatments   a1   a2    a3    Means
b1           6    6     9     7
b2           8    5     6     6.33
Means        7    5.5   7.5   6.67

Confidence Intervals on Treatments
Like all confidence intervals on normally distributed estimates, this will employ a t value and will be of the form Mean ± tα/2·SȲ.

The treatment mean can be obtained from a MEANS (or LSMEANS) statement, but the standard deviation provided is not the correct standard error for the interval. The standard error in a simple CRD with fixed effects is the square root of MSE/n, where n is the number of observations used in calculating the mean.
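A minimal sketch of this interval for a fixed-effects CRD; the MSE, error degrees of freedom, and treatment mean are made-up values, and SciPy is assumed for the t quantile.

```python
# Confidence interval for a treatment mean in a fixed-effects CRD:
# mean +/- t(alpha/2, error df) * sqrt(MSE / n).
import math
from scipy import stats

mse = 9.0        # pooled error mean square from the ANOVA (made up)
df_error = 12    # error degrees of freedom, e.g. t(n - 1)
n = 4            # observations used in calculating the mean
mean = 20.0      # a treatment mean (made up)

se = math.sqrt(mse / n)                # standard error of the mean
t_crit = stats.t.ppf(0.975, df_error)  # two-sided 95% t value
lower, upper = mean - t_crit * se, mean + t_crit * se
print(round(lower, 2), round(upper, 2))
```

If an error term other than the residual is appropriate, its mean square and its degrees of freedom replace mse and df_error, exactly as described below.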
The calculation requires other considerations when random components are involved. For example, in PROC MIXED the use of the Satterthwaite and Kenward-Roger approximations, the use of various estimation methods (the default is REML), and specifications of covariance structure are all things that can affect degrees of freedom.

The use of MSE in the numerator is the default in PROC GLM, and if a different error is desired it must be specified by the user. PROC MIXED is capable of detecting and using an error other than the MSE where appropriate.

If there are several error terms (e.g. experimental error and sampling error), use the one that is appropriate for testing the treatments. When an error term other than the residual is appropriate for testing the treatments, the degrees of freedom for the tabular t value are the d.f. from the error term used for testing. This variance term would also be used to calculate the standard error for treatment means.
Simple Linear Regression
Simple regression applications are used to fit a model describing a linear relationship between two
variables. The aspects of least squares regression and correlation were developed by Sir
Francis Galton in the late 1800’s.
The application can be used to test for a statistically significant correlation between the variables.
Finding a relationship does not prove a “cause and effect” relationship, but the model can be
used to quantify a relationship where one is known to exist. The model provides a measure of
the rate of change of one variable relative to another variable.
There is a potential change in the value of variable Y as the value of variable X changes.
Variable values will always be paired, one termed an independent variable (often referred to as the X variable) and a dependent variable (termed a Y variable). For each value of X there is assumed to be a normally distributed population of values for the variable Y.

[Figure: scatterplot of Y against X, with a normal distribution of Y values at each value of X.]

The linear model which describes the relationship between two variables is given as

Yi = β0 + β1Xi + εi
The “Y” variable is called the dependent variable or response variable (vertical axis). The “X” variable is called the independent variable or predictor variable (horizontal axis).

μy·x = β0 + β1Xi is the population equation for a straight line, where μy·x is the true population mean of Y at each value of X. No error term is needed in this equation because it describes the line itself. The term μy·x is estimated at each value of Xi with Ŷ.

β0 = the true value of the intercept (the value of Y when X = 0)
β1 = the true value of the slope, the amount of change in Y for each unit change in X (i.e. if X changes by 1 unit, Y changes by β1 units)

The two population parameters to be estimated, β0 and β1, are also referred to as the regression coefficients.

All variability in the model is assumed to be due to Yi, so variance is measured vertically.
The variability is assumed to be normally distributed at each value of Xi
The Xi variable is assumed to have no variance since all variability is in Yi (this is a new
assumption)
The values β0 and β1 (b0 and b1 for a sample) are called the regression coefficients. The β0 value is the value of Y at the point where the line crosses the Y axis. This value is called the intercept. If this value is zero the line crosses at the origin of the X and Y axes.
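As a sketch of how the sample coefficients b0 and b1 are obtained by least squares, on made-up paired data:

```python
# Least-squares estimates of the intercept (b0) and slope (b1) for the
# simple linear regression Y = b0 + b1*X + e. The data are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.1, 8.0, 9.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# b1 = Sxy / Sxx: corrected cross product over the corrected SS of X
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar  # the fitted line passes through (xbar, ybar)

print(round(b0, 3), round(b1, 3))  # 0.09 1.97
```

Note that all of the variability enters through Y: the residuals minimized here are vertical distances from the fitted line, matching the assumption above that X has no variance.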