EXST7005 Fall2010 23a Factorial

Statistical Methods I (EXST 7005) — James P. Geaghan, Copyright 2010

Our "sampling unit" is a smaller unit than the experimental unit (a plot), so we have sampling error. Replication within blocks takes the form of multiple samples within an experimental unit, and the error is sampling error.

Another type of error comes from having several plots with a given soybean variety in each field: each variety of soybean has several experimental units in each field. In this case the additional replication represents a second experimental error, one for block-by-treatment combinations and one for replicate plots within a block.

Field 1     Field 2     Field 3
t5 t3 t2    t3 t5 t1    t1 t4 t2
t1 t2 t6    t4 t3 t2    t6 t5 t4
t6 t5 t1    t6 t5 t1    t5 t1 t3
t4 t4 t3    t4 t2 t6    t3 t6 t2

In this case we have replicated experimental units in each block.

Factorial EMS
I haven't yet mentioned EMS for factorials. Developing EMS can be pretty simple: start with the lowest unit and move up the source table, adding an additional variance component for each new term.

Source        EMS (with reps)
Treatment     σ²ε + nσ²τβ + nbσ²τ
Block         σ²ε + nσ²τβ + ntσ²β
Exptl Error   σ²ε + nσ²τβ
Rep Error     σ²ε

Interaction components occur on their own line, and on the source line for each higher effect contained in the interaction. Each main effect gets its own source.

Now consider whether the effects are fixed or random. Modify fixed effects to show a sum of squared effects (SSEffects) instead of a variance component:

Source        EMS (with reps, treatment fixed)
Treatment     σ²ε + nσ²τβ + nb·Στ²i/(t−1)
Block         σ²ε + nσ²τβ + ntσ²β
Exptl Error   σ²ε + nσ²τβ
Rep Error     σ²ε

If the model is an RBD we're done, because the interaction is always a random variable. For factorials that are random models or mixed models, we're also done. Consider what the F test should be for the treatment. Surprise: SAS always uses the residual error term!

But for factorials there is one last detail.
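Before that detail, the sums of squares behind the "with reps" EMS table above can be computed directly. A minimal stdlib-Python sketch with made-up data (2 blocks × 2 treatments × 2 replicate plots; all names and numbers are hypothetical, not from the soybean example):

```python
# Hypothetical data: replicate plots within each block-by-treatment cell.
data = {
    ("b1", "t1"): [1, 2], ("b1", "t2"): [3, 4],
    ("b2", "t1"): [2, 3], ("b2", "t2"): [5, 5],
}
blocks = sorted({b for b, _ in data})
trts = sorted({t for _, t in data})
n = len(next(iter(data.values())))          # reps per cell (balanced)
N = len(blocks) * len(trts) * n

grand = sum(sum(v) for v in data.values())
cf = grand ** 2 / N                          # correction factor
ss_total = sum(y ** 2 for v in data.values() for y in v) - cf
ss_block = sum(sum(sum(data[(b, t)]) for t in trts) ** 2
               for b in blocks) / (len(trts) * n) - cf
ss_trt = sum(sum(sum(data[(b, t)]) for b in blocks) ** 2
             for t in trts) / (len(blocks) * n) - cf
ss_cells = sum(sum(v) ** 2 for v in data.values()) / n - cf
ss_bt = ss_cells - ss_block - ss_trt         # experimental error (block x trt)
ss_rep = ss_total - ss_cells                 # replicate (within-cell) error

ms_trt = ss_trt / (len(trts) - 1)
ms_bt = ss_bt / ((len(blocks) - 1) * (len(trts) - 1))
f_trt = ms_trt / ms_bt                       # per the EMS: test over B*T, not rep error
```

With these data the block-by-treatment MS, not the within-cell replicate MS, is the denominator the EMS table prescribes for the treatment F test.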
It is perfectly possible in factorial designs that both effects are fixed, and if both effects are fixed the interaction is also fixed!

Source           EMS (with reps, both factors fixed)
Treatment A      σ²ε + nb·Στ²Ai/(a−1)
Treatment B      σ²ε + na·Στ²Bi/(b−1)
Interaction A*B  σ²ε + n·Σ(τAτB)²ij/((a−1)(b−1))
Error            σ²ε

A FIXED effect occurs only on its own line, no other!! The fixed interaction disappears from the main effects!!! Now what is the error term for testing treatments and the interaction? Maybe SAS is right? Or maybe SAS just doesn't know what is fixed and what is random.

Testing ANOVAs in SAS
So tell SAS what is random and what is fixed. Look for the following additions to the SAS program. How do we tell SAS which terms to test with which error term? How do we get SAS to output EMS? How do we get SAS to automagically test the right treatment terms with the right error terms?

Summary
Randomized block designs modify the model by factoring a source of variation out of the error term in order to reduce the error variance and increase power. If the basis for blocking is good, this will be effective. If the basis for blocking is not good, we lose a few degrees of freedom from the error term and may actually lose power.

The block-by-treatment combinations (interaction?) provide a measure of variation in the experimental units and provide an adequate error term. We have an additional assumption: that this error term represents ONLY experimental error, and not some real interaction between the treatments and blocks.

Expected mean squares for the RBD indicate that the experimental error term is the correct error term, whether there is a sampling unit or not. Factorial designs where the effects are random or mixed are similar to the RBD EMS: THE TREATMENT INTERACTION IS ACTUALLY USED AS AN ERROR TERM! When the treatments are fixed, the main effects do not contain the interaction term, and the residual error term is the appropriate error term.
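The EMS rule above can be mechanized: the correct F-test denominator for a source is the source whose EMS contains everything in the numerator's EMS except the effect being tested. A hypothetical sketch (the function and component names are mine, not SAS's), with each EMS written as a set of component names:

```python
# Pick an F-test denominator from EMS component lists (a sketch).
def f_denominator(ems, source):
    target = ems[source] - {source}     # numerator EMS minus the tested effect
    for other, comps in ems.items():
        if other != source and comps == target:
            return other
    return None

# Mixed model (random interaction): EMS as in the "with reps" table.
mixed = {
    "Treatment": {"error", "TxB", "Treatment"},
    "Block":     {"error", "TxB", "Block"},
    "TxB":       {"error", "TxB"},
    "error":     {"error"},
}

# Both factors fixed: the interaction drops out of the main-effect EMS.
fixed = {
    "A":   {"error", "A"},
    "B":   {"error", "B"},
    "AxB": {"error", "AxB"},
    "error": {"error"},
}
```

For the mixed model this returns "TxB" as the denominator for the treatment test; for the all-fixed model it returns the residual "error" for every test, which is exactly when SAS's default is correct.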
Sample size in ANOVA
Some textbooks use a slightly different expression for the equation, but it is the same as the equation discussed previously:

n ≥ (t_{α/2} + t_β)² · S²/d²

One minor change is that an alternative to using d is to express the difference as a percentage of the mean. For example, if we wanted to test for a difference that was 10% of the mean we could use the expression

n ≥ (t_{α/2} + t_β)² · S²/(0.1·Ȳ)²

This expression can be further altered to express the difference in terms of the coefficient of variation, CV = S/Ȳ. Calculating the sample size needed to detect a 10% change in the mean then becomes

n ≥ (t_{α/2} + t_β)² · (10·CV)²

In analysis of variance we may also want to be able to detect a certain difference between two means (μ1 and μ2) out of the treatment means we are studying, so our difference will be μ1 − μ2. A prior analysis, or a pilot study, may provide us with an estimate of the variance (the MSE in ANOVA). From here we can use a formula pretty much the same as for the t-test discussed earlier.

There is one other little detail, however. We are basically testing H0: μ1 − μ2 = δ, from the two-sample t-test. Recall from our linear combinations that this linear combination has a variance equal to the sum of the individual variances of the means, so the variance will be S²1/n1 + S²2/n2. Since we are usually pooling variances (ANOVA), the formula simplifies to MSE·(1/n1 + 1/n2). Furthermore, since we usually attempt to have balanced experiments (equal sample size in each group) for analysis of variance, the formula further simplifies to an expression similar to one seen previously, except for the addition of a "2": 2·MSE/n. The additional "2" occurs because we are testing for a difference between two means (H0: μ1 = μ2) as opposed to testing a mean against an hypothesized value (H0: μ = μ0).

Note one very important thing here.
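Putting the pieces together — the generic formula with S² replaced by the pooled two-mean variance 2·MSE — gives a small calculation that can be sketched in Python. The function names are hypothetical, and normal (z) quantiles stand in for the t-values, which strictly require iterating once the error degrees of freedom implied by n are known:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(mse, d, alpha=0.05, power=0.90):
    """Per-treatment n for detecting a difference d between two means:
    n >= 2*(t_{a/2} + t_b)^2 * MSE / d^2  (z used as a first pass for t)."""
    q = NormalDist()
    za2 = q.inv_cdf(1 - alpha / 2)
    zb = q.inv_cdf(power)                 # t_beta, with beta = 1 - power
    return ceil(2 * (za2 + zb) ** 2 * mse / d ** 2)

def n_for_pct_change(cv, pct=0.10, alpha=0.05, power=0.90):
    """One-mean form using CV = S/Ybar to detect a change of pct*mean:
    n >= (t_{a/2} + t_b)^2 * (CV/pct)^2; with pct = 0.1 this is (10*CV)^2."""
    q = NormalDist()
    za2 = q.inv_cdf(1 - alpha / 2)
    zb = q.inv_cdf(power)
    return ceil((za2 + zb) ** 2 * (cv / pct) ** 2)
```

For example, with MSE = 4 and a difference of d = 2 at α = 0.05 and 90% power, n_per_group returns 22 observations per treatment.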
In this formula "n" represents each group or population being studied, that is, each "treatment level" in an analysis of variance! So for ANOVA, or a two-sample t-test with equal variance and equal n, the expression for sample size is

n ≥ 2·(t_{α/2} + t_β)² · MSE/d²

Note that this "n" is for each treatment. In a two-sample t-test, each population would have a sample size of "n", so the total number of observations would be 2n. In ANOVA we have "t" treatments; each would have a sample size of "n", so the total number of observations would be tn.

How often are we likely to have situations with equal variance and equal n? Is this realistic? Actually, yes it is. First, ANOVA traditionally required equal variances, though more modern analytical techniques can address the lack of homogeneity. If necessary, equal variances may be achieved by a transformation or some other fix. If the variances are nonhomogeneous you could use the larger estimate and get a conservative estimate of "n".

Second, the most common application of sample size calculation is in planning NEW studies, and of course in planning new studies you usually do not PLAN on unbalanced designs and nonhomogeneous variance. So these situations are realistic.

Summary
Finally, we saw that this formula is applicable to two-sample t-tests and ANOVA, with some modifications in the estimate of the variance. These modifications are the same ones needed for the two-sample t-test, as dictated by our study of linear combinations. However, the calculations are simplified by the common ANOVA assumption of equal variance and the prevalence of balanced experiments.

Review of Analysis of Variance procedures
1) H0: μ1 = μ2 = μ3 = μ4 = . . . = μt = μ
2) H1: some μi is different
3) a) Assume that the observations are normally distributed about each mean, or that the residuals (i.e. deviations) are normally distributed.
   b) Assume that the observations are independent.
   c) Assume that the variances are homogeneous.
4) Set the level of Type I error, usually α = 0.05.
5) Determine the critical value. For a balanced CRD with a single-factor treatment the test is an F test with t−1 and t(n−1) degrees of freedom (F_{α=0.05; t−1, t(n−1) d.f.}).
6) Obtain data and evaluate. The treatment sum of squares, as developed by Fisher, is converted to a "variance" and tested with an F test against the pooled error variance. In practice, the sums of squares are usually calculated and presented with their degrees of freedom in a table called an ANOVA table. For a balanced design (all ni equal) the calculations are:

The uncorrected SS for treatments is USS_Treatments = Σ(i=1..t) (Σ(j=1..n) Yij)²/n = n·Σ(i=1..t) Ȳi·²

The uncorrected SS for the total is USS_Total = Σi Σj Yij²

The correction factor for both terms is CF = (Σi Σj Yij)²/(tn)

Our ANOVA analyses will be done with PROC MIXED and PROC GLM. There is a PROC ANOVA, but it is a subset of PROC GLM.

LSMeans calculation
The calculations of LSMeans are different. For a balanced design the results will be the same; for unbalanced designs, however, the results will often differ.

The MEANS statement in SAS calculates a simple mean of all available observations in the treatment cells. The LSMEANS statement calculates the mean of the treatment cell means.

Example: the MEAN of 4 treatments, where the observations are 3, 4, 8 for a1; 3, 5, 6, 7, 9 for a2; 7, 8, 6, 7 for a3; and 3, 5, 7 for a4, is 5.8667. The individual cell means are 5, 6, 7 and 5 for a1, a2, a3 and a4 respectively. The mean of these 4 values is 5.75. This would be the LSMean.
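The worked example above can be checked in a few lines of Python (values taken from the text):

```python
# Raw mean (what MEANS computes) vs. LSMean (mean of cell means).
cells = {
    "a1": [3, 4, 8],
    "a2": [3, 5, 6, 7, 9],
    "a3": [7, 8, 6, 7],
    "a4": [3, 5, 7],
}
all_obs = [y for v in cells.values() for y in v]
raw_mean = sum(all_obs) / len(all_obs)                # 88/15 = 5.8667
cell_means = {k: sum(v) / len(v) for k, v in cells.items()}
lsmean = sum(cell_means.values()) / len(cell_means)   # (5+6+7+5)/4 = 5.75
```

The two answers differ precisely because the cell sizes (3, 5, 4, 3) are unequal; with balanced data they would coincide.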
[Table: raw means and LSMeans for a two-way layout of treatments a1–a3 by b1–b2; with unbalanced data the raw marginal means and the LSMeans differ.]

Confidence Intervals on Treatments
Like all confidence intervals on normally distributed estimates, this will employ a t-value and will be of the form

Mean ± t_{α/2} · S_Ȳ

The treatment mean can be obtained from a MEANS (or LSMEANS) statement, but the standard deviation provided is not the correct standard error for the interval. The standard error in a simple CRD with fixed effects is the square root of MSE/n, where n is the number of observations used in calculating the mean. The calculation requires other considerations when random components are involved. For example, in PROC MIXED the use of the Satterthwaite and Kenward-Roger approximations, the use of various estimation methods (the default is REML) and the specification of the covariance structure are all things that can affect the degrees of freedom.

The use of the MSE in the numerator is the default in PROC GLM; if a different error term is desired, it must be specified by the user. PROC MIXED is capable of detecting and using an error term other than the MSE where appropriate. If there are several error terms (e.g. experimental error and sampling error), use the one that is appropriate for testing the treatments. When an error term other than the residual is appropriate for testing the treatments, the degrees of freedom for the tabular t-value are the d.f. from the error term used for testing. This variance term would also be used to calculate the standard error for treatment means.

Simple Linear Regression
Simple regression applications are used to fit a model describing a linear relationship between two variables. Aspects of least squares regression and correlation were developed by Sir Francis Galton in the late 1800s.
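Returning to the confidence interval on a treatment mean described above (Mean ± t·sqrt(MSE/n) for a fixed-effects CRD), here is a minimal sketch. The function name is hypothetical, and the tabular t-value is supplied by the caller so that the d.f. of whichever error term tests the treatments can be honored:

```python
from math import sqrt

def trt_ci(mean, mse, n, t_crit):
    """CI for a treatment mean in a fixed-effects CRD: mean +/- t * sqrt(MSE/n).
    t_crit should use the d.f. of the error term appropriate for testing the
    treatments (the residual MSE here; another error term would supply both
    the variance and the d.f.)."""
    half = t_crit * sqrt(mse / n)
    return mean - half, mean + half

# Example with simple numbers: mean 10, MSE 4, n = 4, t = 2 gives (8, 12).
lo, hi = trt_ci(10, 4, 4, 2)
```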
The application can be used to test for a statistically significant correlation between the variables. Finding a relationship does not prove a "cause and effect" relationship, but the model can be used to quantify a relationship where one is known to exist. The model provides a measure of the rate of change of one variable relative to another: there is a potential change in the value of variable Y as the value of variable X changes.

Variable values will always be paired, one termed an independent variable (often referred to as the X variable) and one a dependent variable (termed the Y variable). For each value of X there is assumed to be a normally distributed population of values for the variable Y.

The linear model which describes the relationship between the two variables is given as

Yi = β0 + β1·Xi + εi

The "Y" variable is called the dependent variable or response variable (vertical axis).

μ_{Y·X} = β0 + β1·Xi is the population equation for a straight line, where μ_{Y·X} is the true population mean of Y at each value of X. No error term is needed in this equation because it describes the line itself. The term μ_{Y·X} is estimated at each value of Xi with Ŷ.

The "X" variable is called the independent variable or predictor variable (horizontal axis).

β0 = the true value of the intercept (the value of Y when X = 0)
β1 = the true value of the slope, the amount of change in Y for each unit change in X (i.e. if X changes by 1 unit, Y changes by β1 units)

The two population parameters to be estimated, β0 and β1, are also referred to as the regression coefficients (b0 and b1 for a sample).

All variability in the model is assumed to be due to Yi, so variance is measured vertically. The variability is assumed to be normally distributed at each value of Xi. The Xi variable is assumed to have no variance, since all variability is in Yi (this is a new assumption).
The β0 value is the value of Y at the point where the line crosses the Y axis. This value is called the intercept. If this value is zero, the line crosses at the origin of the X and Y axes.
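The least-squares estimates b0 and b1 described in this section can be computed from the usual sums of squares and cross products. A minimal sketch with made-up data:

```python
# Least-squares fit of Y = b0 + b1*X from the sums Sxy and Sxx.
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                # slope: change in Y per unit change in X
    b0 = ybar - b1 * xbar         # intercept: value of Y-hat at X = 0
    return b0, b1

# Exact toy data lying on Y = 2X: slope 2, intercept 0.
b0, b1 = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
```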