EXST 7015, Fall 2011, Lecture 17 - Statistical Techniques II
Analysis of Covariance

Linear regression is usually done as a least squares technique applied to QUANTITATIVE VARIABLES. ANOVA is the analysis of categorical (class, indicator, group) variables; there are no quantitative "X" variables as in regression, but this is still a least squares technique. It stands to reason that if regression uses the least squares technique to fit quantitative variables, and ANOVA uses the same approach to fit qualitative variables, we should be able to put both together into a single analysis. We will call this Analysis of Covariance.

There are actually two conceptual approaches:
Multisource regression - adding class variables to a regression
Analysis of Covariance - adding quantitative variables to an ANOVA

For now, we will be primarily concerned with multisource regression. With multisource regression we start with a regression (one slope and one intercept) and ask: would the addition of an indicator or class variable improve the model? Adding a class variable to a regression gives each group its own intercept, fitting a separate intercept to each group. Adding an interaction fits a separate slope to each group.

[Figures: Y versus X, showing (1) two parallel lines with different intercepts and (2) two lines with different slopes]

How do they do that? For a simple linear regression we start with

Yi = b0 + b1*X1i + ei

Now add an indicator variable. An indicator variable, or dummy variable, is a variable that uses values of "0" and "1" to distinguish between the members of a group. For example, if the categories, or groups, were MALE and FEMALE, we could give the females a "1" and the males a "0" and distinguish between the groups. If the groups were FRESHMAN, SOPHOMORE, JUNIOR and SENIOR, we would need 3 variables to distinguish between them: the first would have a "1" for freshmen and a "0" otherwise, and the second and third would have a "1" for SOPHOMORE and JUNIOR, respectively, and a zero otherwise.
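As a small illustrative sketch (the function and level names are mine, not from the notes), the coding just described, with one fewer dummy variable than categories, can be written as:

```python
# Hypothetical sketch of dummy-variable coding for a k-level class
# variable: k - 1 indicators, with the last level as the baseline.
levels = ["FRESHMAN", "SOPHOMORE", "JUNIOR", "SENIOR"]

def dummy_code(value, levels):
    """Return k-1 indicator values (0/1); the last level gets all zeros."""
    return [1 if value == lev else 0 for lev in levels[:-1]]

print(dummy_code("FRESHMAN", levels))  # [1, 0, 0]
print(dummy_code("JUNIOR", levels))    # [0, 0, 1]
print(dummy_code("SENIOR", levels))    # [0, 0, 0] -- the baseline group
```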
We don't need a fourth variable for SENIOR because if we have an observation with values of 0, 0, 0 for the three variables, we know it has to be a SENIOR. So we always need one less dummy variable than there are categories in the group.

Fitting separate intercepts

In our example we will add just one indicator variable, but there could be several. We will call our indicator variable X2i, but it is simply a variable with values of 0 or 1 that distinguishes between the two categories.

Yi = b0 + b1*X1i + b2*X2i + ei

When X2i = 0 we get Yi = b0 + b1*X1i + b2*(0) + ei, which reduces to Yi = b0 + b1*X1i + ei, a simple linear model for the "0" group. And when X2i = 1 we have Yi = b0 + b1*X1i + b2*(1) + ei, which simplifies to Yi = (b0 + b2) + b1*X1i + ei, a simple linear model with an intercept equal to (b0 + b2) for the "1" group.

[Figure: two parallel lines with intercepts b0 and (b0 + b2)]

So there are two lines with intercepts of b0 and (b0 + b2). Note that b2 is the difference between the two lines, or an adjustment to the first line that gives the second line. It can be positive or negative, so the second line may be above or below the first. This term can be tested against zero to determine if there is a difference in the intercepts.

Fitting separate slopes

Adding an interaction (crossproduct term) between the quantitative variable (X1i) and the indicator variable (X2i) will fit separate slopes. Still using just one indicator variable for two classes, the model is

Yi = b0 + b1*X1i + b2*X2i + b3*X1i*X2i + ei

When X2i = 0 we get Yi = b0 + b1*X1i + b2*(0) + b3*X1i*(0) + ei, which reduces to Yi = b0 + b1*X1i + ei, a simple linear model for the "0" group. When X2i = 1, then Yi = b0 + b1*X1i + b2*(1) + b3*X1i*(1) + ei, simplifying to Yi = (b0 + b2) + (b1 + b3)*X1i + ei, which is a simple linear model with an intercept equal to (b0 + b2) and a slope equal to (b1 + b3) for the "1" group.

[Figure: two lines with intercepts b0 and (b0 + b2) and slopes b1 and (b1 + b3)]
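A numeric sketch of the separate-slopes model (made-up data, standard library only; all names and numbers are mine): fitting Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 by ordinary least squares. The data are constructed so the fit is exact: the "0" group follows y = 1 + 2x and the "1" group follows y = 3 + 5x, so the fitted (b0 + b2) and (b1 + b3) recover the second group's own intercept and slope.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

xs = [1, 2, 3, 1, 2, 3]
grp = [0, 0, 0, 1, 1, 1]        # the 0/1 indicator variable X2
ys = [3, 5, 7, 8, 13, 18]       # group 0: y = 1 + 2x; group 1: y = 3 + 5x

# Design matrix columns: intercept, X1, X2, X1*X2
X = [[1, x, d, x * d] for x, d in zip(xs, grp)]
XtX = [[sum(row[a] * row[b] for row in X) for b in range(4)] for a in range(4)]
Xty = [sum(row[a] * yi for row, yi in zip(X, ys)) for a in range(4)]
b0, b1, b2, b3 = solve(XtX, Xty)

print(round(b0, 6), round(b1, 6))            # intercept/slope of "0" group: 1.0 2.0
print(round(b0 + b2, 6), round(b1 + b3, 6))  # intercept/slope of "1" group: 3.0 5.0
```

As the notes say below, these are exactly the coefficients that would be obtained by fitting the two groups separately; the combined fit gains a pooled error term.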
Note that b0 and b1 are the intercept and slope for one of the lines (whichever was assigned the 0). The values of b2 and b3 are the intercept and slope adjustments (+ or - differences) for the second line. If these adjustments are not different from zero, then the intercepts and/or slopes are not different from each other. A third or fourth line could be fitted by adding additional indicator variables and interactions, with coefficients b4 and b5, etc.

Statistics quote: There are two kinds of statistics, the kind you look up and the kind you make up. -- Rex Stout

Conceptual application of Analysis of Covariance

Suppose we have a regression, and we would like to know if perhaps the regression is different for some groups, perhaps male and female or treated and untreated. Of course, regression starts with a corrected sum of squares (a flat line through the mean) and we fit a slope. Viewing this as extra sums of squares, we are comparing a model with no slope to a model with a slope.

(a) A test of the difference between these two models tests the hypothesis of adding a slope (H0: beta1 = 0). The extra SS is SSX1 | X0, or just SSX1.

From the simple linear regression we want to test the hypotheses that the model is improved by adding separate intercepts or slopes. The concepts are pretty simple. The additional variables can be tested with extra SS or the General Linear Hypothesis Test (GLHT, with full model and reduced model). For a single indicator variable the extra SS are easy enough. For several indicator variables it may be easier to do the GLHT, where the single line is the reduced model and the two lines are the full model.

(b) A test of the difference between these two models tests the hypothesis of adding an intercept adjustment (H0: beta2 = 0), or tests for the difference between the intercepts. The extra SS is SSX2 | X1.
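The extra-SS comparisons just described can be sketched numerically with made-up data (standard library only; all names and numbers are mine): fit each nested model by least squares and difference the error sums of squares. The SSE can only decrease as terms are added, and each decrease is the extra SS for that term fitted in sequence.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def sse(rows, y):
    """Error SS from an ordinary least-squares fit of y on the design rows."""
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

x1 = [1, 2, 3, 4, 1, 2, 3, 4]
x2 = [0, 0, 0, 0, 1, 1, 1, 1]   # indicator variable
y = [2.1, 3.9, 6.2, 7.8, 4.2, 7.1, 9.9, 12.8]

designs = {
    "mean only": [[1] for _ in x1],
    "+ X1":      [[1, a] for a in x1],
    "+ X2":      [[1, a, d] for a, d in zip(x1, x2)],
    "+ X1*X2":   [[1, a, d, a * d] for a, d in zip(x1, x2)],
}
sses = {name: sse(rows, y) for name, rows in designs.items()}
names = list(sses)
for prev, cur in zip(names, names[1:]):
    print(f"extra SS for {cur}: {sses[prev] - sses[cur]:.4f}")
```

The three printed differences are the sequential extra SS: SSX1, SSX2 | X1, and SSX1X2 | X1, X2.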
The next test compares a model with two intercepts and one slope to a model containing two slopes and two intercepts.

(c) Here we are testing the addition of a slope adjustment, or second slope (H0: beta3 = 0). This test determines if a model with 2 intercepts and 2 slopes is better than a model with 2 intercepts and one slope. The extra SS is SSX1X2 | X1, X2.

This series a-b-c is the usual "multisource regression" series. Note that the extra SS needed are: SSX1; SSX2 | X1; SSX1X2 | X1, X2. These are the Type I SS for the SAS model Y = X1 X2 X1*X2;. So this is another of those rare cases where we may wish to use Type I tests.

The hypotheses to be tested can be simplified to some extent if we know that for our base model we want a regression. This is multisource regression. Start with a simple linear regression. Test to see if separate intercepts improve the model (H0: beta2 = 0). Test to see if separate slopes improve the model (H0: beta3 = 0). The sequence of models is:

Correction factor: 1 level = b0 (fit mean)
add b1: 1 level (b0), 1 slope (b1)
add b2: 2 levels (b0, b2), 1 slope (b1)
add b3: 2 levels (b0, b2) and 2 slopes (b1 and b3)

As extra SS we could test the extra SS for b2 and then the extra SS for b3. In practice we usually start with the fullest model (2 slopes and 2 intercepts), reduce to 1 slope and 2 intercepts if the extra SS for b3 is not significant, and then to 1 slope and 1 intercept if the extra SS for b2 is not significant. So interpretation usually proceeds from the bottom up: do we need 2 slopes and 2 intercepts; if not, do we need 2 intercepts; if not, is the SLR significant? If Type I sums of squares are used there would be little if any change from deleting the higher order components of the model, so in this case we would not have to refit the model after each decision. The solution (i.e.
regression coefficients) provided with ANCOVA is the correct model for prediction, even though it is fully adjusted (Type III) and our tests of hypothesis are not (Type I).

The progression discussed above corresponds to the likely series in multisource regression (a-b-c). In Analysis of Variance (our next topic), the researcher's interest is in differences among the means of categorical variables. For that analysis the categorical variable can be put first, and the progression is the usual series for Analysis of Variance (d-e-c). Which series of tests you need depends on whether you are starting with a regression or an analysis of variance.

[Figure: diagram of the possible full and reduced models, with comparisons labeled a through g]

Examine the possible full and reduced models in the diagram. What is tested in each case, what extra SS is needed for each test, and what are the initial and final models for each case?

One other test of occasional interest is the test denoted by the comparison "g". If you fit a model with two slopes and two intercepts, and cannot reduce to a model with two intercepts and one slope, you may wonder if a model with two slopes and one intercept is appropriate.

Full model: Yi = b0 + b1*X1i + b2*X2i + b3*X1i*X2i + ei
Reduced model: Yi = b0 + b1*X1i + b3*X1i*X2i + ei

The extra SS for this test is SSX2 | X1, X1*X2. This is actually a test of "intercepts fitted last" and would be available as the Type III SS for X2 with the model Y = X1 X2 X1*X2;.

A few notes on ANCOVA. The values of the slopes and intercepts fitted in the full model (2 slopes and 2 intercepts) are exactly the same slopes and intercepts that would be fitted if the models were done separately. However, the combined model has the advantage of a more powerful pooled error term. In SAS we can fit our own indicator variables in PROC REG, building the dummy variables in the data step.
However, if there are more than two levels (one dummy variable), the dummy variables should be treated as an aggregate. The CLASS statement in PROC MIXED or GLM automatically creates the indicator variables, and both will test Type I and Type III sums of squares. We will use PROC MIXED in our numerical SAS example. When we include the CLASS statement in SAS, the program assumes that we are doing an ANOVA, and by default does not print the regression coefficients. If we want these we must add the option "/ solution" to the model statement. This is true for both PROC MIXED and PROC GLM.

Forbes 500 Companies Sales (Appendix 11)

There is a presumed relationship between a company's sales and its assets. The dataset has sales and asset data for various industry sectors (Communication, Energy, Finance, HiTech, Manufacturing, Medical, Retail, Transportation and Other). We want to determine if the sales/assets relationship is the same for each sector. Examine the scatter plot. The untransformed data show little pattern. The log transformed data show some increasing trend. We want to know if the apparent trends are significant, and whether the same trend is apparent for each sector. This will be fitted in PROC MIXED because I want to use Type I SS rather than Type II or Type III. See handout.

Summary

Analysis of Covariance is the combination of quantitative variables and categorical variables. Multisource regression is the expression of an ANCOVA that reduces to a SLR. Analyses that culminate in ANOVAs will be discussed at the end of the course; that type of ANOVA is where the term "Analysis of Covariance" was developed. You should be able to examine the graph and determine what is tested and which extra SS are tested.

[Figure: SLR model-comparison diagram with tests labeled a through g]

Analysis of Variance and Experimental Design

The simplest model or analysis for Analysis of Variance (ANOVA) is the CRD, the Completely Randomized Design.
This model is also called "one-way" Analysis of Variance. Unlike regression, which fits slopes for regression lines and calculates a measure of random variation about those lines, ANOVA fits means and variation about those means. The hypotheses tested are hypotheses about the equality of means:

H0: mu1 = mu2 = mu3 = mu4 = ... = mut, where the mui represent the means of the levels of some categorical variable and "t" is the number of levels of the categorical variable.
H1: some mui is different.

We will generically refer to the categorical variable as the "treatment" even though it may not actually be an experimenter-manipulated effect. The number of treatments will be designated "t". The number of observations within treatments will be designated n for a balanced design (the same number of observations in each treatment), or ni for an unbalanced design (for i = 1 to t).

The assumptions for basic ANOVA are very similar to those of regression. The residuals, or deviations of observations within groups, should be normally distributed. The treatments are independently sampled. The variance of each treatment is the same (homogeneous variance).

ANOVA review

I am borrowing some material from my EXST 7005 notes on the t-test and ANOVA. See those notes for a more complete review of the introduction to Analysis of Variance (ANOVA). Start with the logic behind ANOVA. Prior to R. A. Fisher's development of ANOVA, investigators were likely to use a series of t-tests to test among t treatment levels. What is wrong with that? Recall the Bonferroni adjustment: each time we do a test we increase the chance of error. To test among 3 treatments we need to do 3 tests; among 4 treatments, 6 tests; among 5 treatments, 10 tests; etc. What is needed is ONE test for a difference among all treatments, with one overall value of alpha specified by the investigator (usually 0.05). Fisher's solution was simple, but elegant. Suppose we have a treatment with 5 categories or levels.
We can calculate a mean and variance for each treatment level.

[Figure: dot plot of Y by group, groups A through E]

In order to get one really good estimate of variance we can pool the individual variances of the 5 categories (assuming homogeneity of variance). This pooled variance can be calculated as a weighted mean of the variances (weighted by the degrees of freedom). Since SS1 = S1^2 * (n1 - 1), so that (n1 - 1) * S1^2 = SS1, the weighted mean is simply the sum of the SS divided by the sum of the d.f.:

Sp^2 = [(n1-1)S1^2 + (n2-1)S2^2 + (n3-1)S3^2 + (n4-1)S4^2 + (n5-1)S5^2] / [(n1-1) + (n2-1) + (n3-1) + (n4-1) + (n5-1)]
     = (SS1 + SS2 + SS3 + SS4 + SS5) / [(n1-1) + (n2-1) + (n3-1) + (n4-1) + (n5-1)]

So we have one very good estimate of the random variation, or sampling error, Sp^2. Then what?

Now consider the treatments. Why don't they all fall on the overall mean? Actually, under the null hypothesis, they should, except for some random variation. So if we estimate that random variation, it should be equal to the same error we already estimated within groups. Recall that the variance of means is estimated as S^2/n, the variance of the sample divided by the sample size; the standard error is the square root of this. If we actually use the treatment means to estimate a variance, we are estimating the variance of means, S^2/n. If we multiply this by "n" it should be equal to S^2, which we estimated with Sp^2, the pooled variance estimate. So if the null hypothesis is true, the mean square of the deviations within groups should be equal to the mean square of the deviations of the means multiplied by "n"!

[Figure: deviations within groups and deviations of the group means from the overall mean, groups A through E]

Now, if the null hypothesis is not true, and some mui is different, then what? Then, when we calculate a mean square of the deviations of the means from the overall mean, it should be larger than the previously estimated Sp^2.
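The pooled-variance formula above can be checked with a tiny made-up example (the numbers are mine): pooling the SS over the summed d.f. gives the same answer as the d.f.-weighted mean of the group variances.

```python
import statistics

# Three made-up groups (unequal sizes, so the d.f. weighting matters).
groups = [[4.0, 6.0, 5.0], [7.0, 9.0, 8.0, 8.0], [3.0, 5.0]]

# Sum of SS divided by sum of d.f.
ss = [sum((y - statistics.mean(g)) ** 2 for y in g) for g in groups]
df = [len(g) - 1 for g in groups]
pooled = sum(ss) / sum(df)

# Equivalent d.f.-weighted mean of the group variances.
pooled2 = sum((len(g) - 1) * statistics.variance(g) for g in groups) / sum(df)

print(round(pooled, 6), round(pooled2, 6))  # 1.0 1.0
```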
So we have two estimates of variance: Sp^2 and the variance calculated from the treatment means. If the null hypothesis is true, they should not be significantly different.

[Figure: group means scattered about the overall mean, groups A through E]

If the null hypothesis is FALSE, the treatment mean square should be larger. It will therefore be a ONE TAILED TEST! We usually present this in an "Analysis of Variance" table.

Source     d.f.     Sum of Squares   Mean Square
Treatment  t-1      SSTreatment      MSTreatment
Error      t(n-1)   SSError          MSError
Total      tn-1     SSTotal

Degrees of freedom: There are tn observations total (sum of the ni if unbalanced). After the correction factor, there are tn-1 d.f. for the corrected total. There are t-1 degrees of freedom for the t treatment levels. Each group contributes n-1 d.f. to the pooled error term; there are t groups, so the pooled error (MSE) has t(n-1) d.f.

The SSTreatment is the SS of the deviations of the treatment means from the overall mean. Each deviation is denoted taui and is called a treatment "effect":

SSTreatment = sum over i = 1 to t of ni*(Ybari - Ybar)^2 = sum over i = 1 to t of ni*(taui-hat)^2

The model for regression is Yi = b0 + b1*Xi + ei. The effects model for a CRD is Yij = mu + taui + eij, where the treatments are i = 1, 2, ..., t and the observations are j = 1, 2, ..., n (or ni for unbalanced data). An alternative expression for the CRD, called the means model, is Yij = mui + eij.

Statistics quote: Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital. -- Aaron Levenstein

James P. Geaghan - Copyright 2011
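As a minimal numeric sketch of the table above (balanced made-up data with t = 3 treatments and n = 4 observations each; all numbers are mine), the treatment and error sums of squares, mean squares, and F ratio can be computed directly:

```python
# One-way ANOVA (CRD) by hand on made-up balanced data.
data = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [14.0, 15.0, 13.0, 14.0],
    "C": [18.0, 17.0, 19.0, 18.0],
}
t = len(data)
n = len(next(iter(data.values())))
grand = sum(sum(v) for v in data.values()) / (t * n)

means = {k: sum(v) / n for k, v in data.items()}
# SS of treatment-mean deviations from the grand mean, times n.
ss_trt = n * sum((m - grand) ** 2 for m in means.values())
# Pooled within-group SS of deviations from each treatment mean.
ss_err = sum((y - means[k]) ** 2 for k, v in data.items() for y in v)

ms_trt = ss_trt / (t - 1)          # d.f. = t - 1
ms_err = ms_e = ss_err / (t * (n - 1))  # d.f. = t(n - 1)
F = ms_trt / ms_err                # one-tailed test

print(f"Treatment  df={t - 1}   SS={ss_trt:.2f}  MS={ms_trt:.2f}")
print(f"Error      df={t * (n - 1)}   SS={ss_err:.2f}   MS={ms_err:.2f}")
print(f"F = {F:.2f}")
```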