EXST7005 Fall2010 25a Regression


Standard error of the regression line (i.e. of Ŷi):

   S_{\hat\mu_{Y|X}} = \sqrt{ MSE \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \right) }

Standard error of the individual points (i.e. Yi): an individual point is a linear combination of Ŷi and ei, so its variance is the sum of the variances of these two, where the variance of ei is MSE. The variance is S²_{Y|X} = S²_{\hat\mu_{Y|X}} + MSE, and the standard error is then

   S_{Y|X} = \sqrt{ MSE \left( 1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \right) }

The standard error of b0 is the same as the standard error of the regression line where Xi = 0:
   Square root of [5.503603515 (0.0625 + 26.91015625/90.4375)] = 1.407693696

Confidence interval on b0, where b0 = 4.771250864 and t(0.05/2, 14 d.f.) = 2.145:
   P(4.771250864 − 2.145(1.407693696) ≤ β0 ≤ 4.771250864 + 2.145(1.407693696)) = 0.95
   P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95

Estimate the standard error of an individual observation (number of parasites) for a ten-year-old fish:
   Ŷ = b0 + b1 Xi = 4.77125 + 1.82723(10) = 23.04354
   Square root of [5.503603515 (1 + 0.0625 + (10 − 5.1875)²/90.4375)]
      = Square root of [5.503603515 (1 + 0.0625 + 23.16015625/90.4375)] = 2.693881509

Confidence interval on μY|X=10:
   P(23.04353836 − 2.145(2.693881509) ≤ μY|X=10 ≤ 23.04353836 + 2.145(2.693881509)) = 0.95
   P(17.26516252 ≤ μY|X=10 ≤ 28.82191419) = 0.95

Calculate the coefficient of determination and the correlation:
   R² = 0.796700662, or 79.67%
   r  = 0.892580899
   See SAS output.

Overview of results and findings from the SAS program

I. Objective 1: Determine if older fish have more parasites. (SAS can provide this)
   A. This determination would be made by examining the slope. The slope is the mean change in parasite number for each unit increase in age. The hypothesis tested is H0: β1 = 0 versus H1: β1 ≠ 0.
      1. If this number does not differ from zero, then there is no apparent relationship between age and number of parasites. If it differs from zero and is positive, then parasites increase with age. If it differs from zero and is negative, then parasites decrease with age.
      2. For a simple linear regression we can examine the F test of the model, the F test of the Type I SS, the F test of the Type II SS, the F test of the Type III SS, or the t-test of the slope; for a simple linear regression these all provide the same result. For multiple regressions (more than 1 independent variable) we would examine the Type II or Type III F test (these are the same in regression) or the t-tests of the regression coefficients. [Alternatively, a confidence interval can be placed on the coefficient; if the interval does not include 0, the estimate of the coefficient is significantly different from zero.]
   B. In this case, the F tests mentioned had values of 54.86, and the probability of this F value with 1 and 14 d.f. is less than 0.0001. Likewise, the t test of the slope was 7.41, which was also significant at the same level. Note that t² = F; these are the same test. We can therefore conclude that the slope does differ from zero, and since it is positive we further conclude that older fish have more parasites.

II. Objective 2: Estimate the rate of accumulation of parasites. (SAS can provide this)
   A. The slope for this example is 1.827228749 parasites per year (note the units). It is positive, so we expect parasite numbers to increase by about 1.8 per year.
   B. The standard error for the slope was 0.24668872. This value is provided by SAS and can be used for hypothesis testing or confidence intervals. SAS provides a t-test of H0: β1 = 0, but hypotheses about values other than zero must be requested (SAS TEST statement) or calculated by hand. The confidence interval in this case was calculated previously and is partly repeated below.
      P[b1 − t(α/2, 14 d.f.) Sb1 ≤ β1 ≤ b1 + t(α/2, 14 d.f.) Sb1] = 0.95
      P[1.827228749 − 2.144789(0.246689) ≤ β1 ≤ 1.827228749 + 2.144789(0.246689)] = 0.95
      P[1.298134 ≤ β1 ≤ 2.356324] = 0.95
      Note that this confidence interval does not include zero, so the slope differs significantly from zero.
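A minimal SAS sketch of how these intervals and tests could be requested (the data set name FISH and the variable names PARASITES and AGE are assumptions for illustration, not the handout's actual names): CLB adds confidence limits for the parameter estimates (b0 and b1), CLM and CLI add limits for the estimated mean and for an individual prediction, and the TEST statement tests the slope against a value other than zero.

   /* Hedged sketch; data set and variable names are assumed.             */
   /* A row with AGE=10 and a missing PARASITES value can be appended to  */
   /* the data set to obtain the prediction and its limits for age 10.    */
   PROC REG DATA=FISH;
      MODEL PARASITES = AGE / CLB CLM CLI;  /* CIs for parameters, means, individuals */
      TEST AGE = 5;                         /* H0: slope = 5 (used again in objective VII below) */
   RUN;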
III. Estimate the intercept with a confidence interval.
   A. The intercept may also require a confidence interval. This was calculated previously and was:
      P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95

IV. Determine how many parasites a 10-year-old fish would have. (SAS can provide this)
   A. Estimating a Yi value for a particular Xi simply requires solving the equation for the line, Ŷ = b0 + b1 Xi, which for coefficients of 4.771 and 1.827 and for a 10-year-old fish (Xi = 10) is Ŷ = 4.771 + 1.827(10) = 4.771 + 18.27 = 23.041.

V. Place a confidence interval on the 10-year-old fish estimate. (SAS can provide this)
   A. The confidence interval for this was estimated previously: P(17.26516252 ≤ μY|X=10 ≤ 28.82191419) = 0.95.
   B. There are many reasons why this type of calculation may be of interest. We can place a confidence interval on any value of Xi, including the intercept where Xi = 0 (this was done previously). The intercept is often the most interesting point on the regression line, but not always.
   C. There is one very special characteristic of the confidence intervals (of either individual points or means). The confidence interval is narrowest at the mean of Xi and gets wider to either side of the mean. The graph below for our example demonstrates this property.
      [Figure: "Regression with confidence bands" — parasites (0 to 25) plotted against age in years (0 to 10), with confidence bands that are narrowest near the mean age.]
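The shape of the bands follows directly from the standard error formula given at the start of this section; this is simply a restatement of that formula, not new material:

   S_{\hat\mu_{Y|X}} = \sqrt{ MSE \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \right) }

The term (Xi − X̄)² is zero when Xi = X̄ and grows as Xi moves away from the mean in either direction, so the half-width t(α/2)·S is smallest at the mean of X and increases toward the ends of the data. For our example, at Xi = X̄ = 5.1875 the standard error of the line is √(5.5036/16) ≈ 0.59, compared with 1.408 at Xi = 0.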
VI. Determine if a linear model is adequate and the assumptions are met. (SAS can provide most of this)
   A. Independence: This is a difficult assumption to evaluate. There are some techniques in advanced statistical methods, but these will not be covered here. The best guarantee of independence is to randomize wherever and whenever possible.
   B. Normality: The normality of the "residuals," or deviations from regression, can be evaluated with the PROC UNIVARIATE Shapiro-Wilk test. The W value was 0.96 and the P < W was 0.6831. With these results we would not reject the null hypothesis that the data are normally distributed.
   C. Homogeneity and other considerations: Residual plots are an important tool in evaluating possible problems in regression, some of which we have not seen before. The normal residual plot, when all is well, should reflect just random scatter about the regression line. An example is given below.
      [Figure: residual plot of ei (+, 0, −) against Xi showing random scatter.]
      The three residual plots below all show possible problems. From left to right the problems indicated are (1) the data are curved and cannot be adequately described by a straight line, (2) the variance is not homogeneous, and (3) there is an outlier.
      [Figure: three residual plots of ei against Xi illustrating curvature, heterogeneous variance, and an outlier.]
      An outlier is an observation which appears to be too large or too small in comparison to the other values. Data should be checked carefully to ensure that the point is correct. If it is correct, but is far out of line relative to the other values, it may be necessary to omit the point. The residual plot for our example is given below. Can you detect any potential problems?
      [Figure: "Age Residual Plot" — residuals (−6 to 4) plotted against age (0 to 10).]
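The checks in B and C can be produced in SAS by saving the residuals and passing them to PROC UNIVARIATE and a plot. This is only a sketch (the data set name FISH and the variable names PARASITES and AGE are assumed, as before); the same pattern is used later in the multiple regression example.

   /* Save predicted values and residuals, then examine the residuals */
   PROC REG DATA=FISH;
      MODEL PARASITES = AGE;
      OUTPUT OUT=RESIDS P=PHAT R=EHAT;    /* PHAT = predicted value, EHAT = residual */
   RUN;

   PROC UNIVARIATE DATA=RESIDS NORMAL PLOT;  /* NORMAL requests the Shapiro-Wilk test */
      VAR EHAT;
   RUN;

   PROC PLOT DATA=RESIDS;                  /* residual plot: residuals against age */
      PLOT EHAT*AGE;
   RUN;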
VII. An old published article states that the rate of accumulation should be about 5 per year. Test our estimate against 5. (SAS can provide this if you ask nicely)
   A. SAS automagically tests the hypothesis H0: β1 = 0. However, any value can be tested. The test is the usual one-sample t-test,

      t = \frac{b_1 - b_{H_0}}{S_{b_1}}, \quad \text{where } S_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = \sqrt{\frac{MSE}{S_{XX}}}

      as previously mentioned. For this example, t = (1.827 − 5)/0.2467 ≈ −12.86, which far exceeds the critical value of 2.145 in absolute value, so the estimated rate of accumulation differs significantly from 5.

VIII. Final notes on regression and correlation. (SAS can provide most of this)
   A. The much over-rated R². The regression accounts for a certain fraction of the total SS. The fraction of the total SS that is accounted for by the regression is called the coefficient of determination and is denoted "R²". It is calculated as R² = SSRegression/SSTotal. This value is usually multiplied by 100 and expressed as a percent. For our example, 79.7% of the total variation was accounted for by the model. This is pretty good, I guess. However, for some analyses we expect much higher values (length-weight relationships, for example) and for others much lower (try to predict how many fish you will get in a net at a particular depth or for a particular size stream). This statistic does not provide any test, but may be useful for comparing between similar studies on similar material.
   B. The square root of the R² value is equal to the "Pearson product moment correlation" coefficient, usually denoted as "r". This value is calculated as

      r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}

      and is equal to 0.8926 for our example.
   C. The correlation coefficient is "unitless" and ranges from −1 to +1.
   D. A perfect inverse correlation gives a value of −1. This corresponds to a negative slope in regression, but the R² value will not reflect the negative sign because it is squared. A perfect correlation gives a value of +1 (positive slope in regression). A correlation of zero can be represented as random scatter about a horizontal line (slope = 0 in regression).
      [Figure: two Y-versus-X panels, "Perfect inverse correlation" and "Perfect correlation".]
   E. The perfect correlation value of 1 (+ or −) also corresponds to a "perfect" regression, where the R² value would indicate that 100% of the total variation was accounted for by the model. The error in this case would be zero.
      [Figure: Y-versus-X panel, "Correlation = 0", showing random scatter about a horizontal line.]

About Cross products
Cross products, XiYi, are used in a number of related calculations. Note from the calculations below that when any of the quantities equals zero, all of the others will also go to zero. As a result, when the covariance is zero the slope, correlation coefficient, R², and SSRegression are also zero. Because of this, the common hypothesis of interest in regression, H0: β1 = 0, can be tested by testing any of the statistics below. A t-test of the slope or an F test of the MSRegression are both testing the same hypothesis. Recall from the interrelationships of probability distributions that a t² with γ d.f. = F with 1, γ d.f. (for our example, 7.41² ≈ 54.9, matching the F value of 54.86 except for rounding).

Sum of cross products:
   S_{XY} = \sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})

Covariance:
   \frac{S_{XY}}{n-1} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{n-1}

Slope:
   b_1 = \frac{S_{XY}}{S_{XX}} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}

SSRegression:
   \frac{(S_{XY})^2}{S_{XX}} = \frac{\left[\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})\right]^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}

Correlation coefficient:
   r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}

Coefficient of determination:
   R^2 = r^2 = \frac{(S_{XY})^2}{S_{XX} S_{YY}} = \frac{\left[\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})\right]^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2} = \frac{SS_{Regression}}{SS_{Total}}
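Several of these quantities can be read directly from SAS output: PROC REG reports SSRegression and R², and the correlation and covariance can be requested from PROC CORR. The sketch below again uses the assumed names FISH, PARASITES, and AGE; the COV option adds the covariance matrix to the default Pearson correlation.

   /* Pearson correlation (r) and covariance for the two variables */
   PROC CORR DATA=FISH COV;
      VAR AGE PARASITES;
   RUN;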
Summary
Regression is used to describe a relationship between two variables using paired observations from the variables. The intercept is the point where the line crosses the Y axis, and the slope is the change in Y per unit X. The variance is derived from the sum of squared deviations from the regression line.
The population regression model is given by Yi = β0 + β1 Xi + εi for the observations and μY·X = β0 + β1 Xi for the regression line itself. Estimated from a sample, the regression line is Ŷi = b0 + b1 Xi.
There are four assumptions usually made for a regression:
   1) Normality (at each value of Xi).
   2) Independence, (1) of the observations (Yi, Yj) from each other and (2) of the deviations (ei) from the rest of the model.
   3) Homogeneity of variance at each value of Xi.
   4) The Xi values are measured without error (i.e. all variation and deviation is vertical).

Multiple Regression
The objectives are the same as for simple linear regression: the testing of hypotheses about potential relationships (correlation), fitting and documenting relationships, and estimating parameters with confidence intervals. The big difference is that a multiple regression will correlate a dependent variable (Yi) with several independent variables (Xi's). The regression equation is similar; the model is Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi.
The assumptions for the regression are the same as for simple linear regression.
The degrees of freedom for the error in a simple linear regression were n − 2, where the two degrees of freedom lost from the error represented one for the intercept and one for the slope. In multiple regression the error degrees of freedom are n − p, where "p" is the total number of regression parameters fitted, including one for the intercept.
The interpretation of the parameter estimates is the same (units are Y units per X unit, and they measure the change in Y for a 1 unit change in X).
Diagnostics are mostly the same for simple linear regression and multiple regression. Residuals can still be examined for outliers, homogeneity, curvature, etc. as with SLR. The only difference is that, since we have several X's, we would usually plot the residuals on Yhat (Ŷi) instead of a single X variable. Normality would be evaluated with the PROC UNIVARIATE test of normality.
There is really only one new issue here, and this is in the way we estimate the parameters. If the independent (X) variables were totally and absolutely independent (covariance or correlation = 0), then it wouldn't make any difference if we fitted them one at a time or all together; they would have the same values. However, in practice there will always be some correlation between the X variables. If two X variables were PERFECTLY correlated, they would both account for the SAME variation in Y, so which one would get the variation? If two X variables are only partially correlated, they share part of the variation in Y, so how is it partitioned? To demonstrate this we will look at a simple example and develop a new notation called the Extra SS.

For multiple regression there will be, as with simple linear regression, a SS for the "MODEL". This SS lumps together the SS for all variables, which is not usually very informative; we will want to look at the variables individually. To do this there are several types of SS available in SAS, two of which are of particular interest, the TYPE I and TYPE III SS. In PROC REG these are not provided by default; to see them you must request them by adding the options SS1 and/or SS2 to the model statement (for regression the Type II and Type III SS are the same). PROC GLM, which will do regressions nicely but has fewer regression diagnostics than PROC REG, provides the TYPE I and TYPE III SS by default.
To do multiple regression in SAS we simply specify a model with the variables of interest. For example, a regression of Y on the 3 variables X1, X2 and X3 would be specified as

   PROC REG; MODEL Y = X1 X2 X3;

To get the SS1 and SS2 we add

   PROC REG; MODEL Y = X1 X2 X3 / SS1 SS2;

Example with Extra SS
The simple example is done with a created data set.

   Y    X1   X2   X3
   1     2    9    2
   3     4    6    5
   5     7    7    9
   3     3    5    5
   6     5    8    9
   4     3    4    2
   2     2    3    6
   8     6    2    1
   9     7    5    3
   3     8    2    4
   5     7    3    7
   6     9    1    4

Now let's look at simple linear regressions on each variable independently, first for variable X1. The SSTotal is 62.91667, and this will not change regardless of the model, since it is adjusted only for the intercept and all models will include an intercept. If we fit a regression of Y on X1, the result is SSModel = 23.978, so the sum of squares accounted for by X1 when it enters alone is 23.978. If we fit X2 alone, the result is SSModel = 4.115.
If we then fit both X1 and X2 together, would the resulting model SS be 23.978 + 4.115 = 28.093? No, the model SS actually comes out to be 24.074 because of some covariance between the two variables. So how much would X1 add to the model if X2 was fitted first, and how much would X2 add if X1 was fitted first? We can calculate the extra SS for X1 fitted after X2, and for X2 fitted after X1. The variable X2 alone accounted for a sum of squares equal to 4.115, and when X1 was added the SS accounted for was 24.074, so X1 entering after X2 accounted for an additional 24.074 − 4.115 = 19.959. Therefore, we can state that the SS accounted for by X1, entering the model after X2, is 19.959. Likewise, we can calculate the SS that X2 accounts for entering after X1. Together they account for SS = 24.074 and X1 alone accounted for 23.978, so X2 accounted for an additional SS = 24.074 − 23.978 = 0.096 when it entered after X1.
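One way to obtain these extra SS in SAS is to use the sequential (Type I) SS and control the order in which the variables enter the model; this is a sketch, not part of the handout's program. With the SS1 option, each variable's Type I SS is adjusted only for the variables listed before it.

   /* Type I (sequential) SS: the first variable is adjusted only for the     */
   /* intercept, the second is adjusted for the first; reversing the order    */
   /* of the variables gives the other extra SS.                              */
   PROC REG DATA=ONE;
      MODEL Y = X1 X2 / SS1;   /* gives SSX1 and SSX2|X1 */
      MODEL Y = X2 X1 / SS1;   /* gives SSX2 and SSX1|X2 */
   RUN;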
We need a simpler notation to indicate the sum of squares for each variable and which other variables have been adjusted for before it enters the model. The sums of squares for X1 and X2 entering alone will be SSX1 and SSX2, respectively. When X1 is adjusted for X2, and vice versa, the notation will be SSX1|X2 and SSX2|X1, respectively. For the calculations above the results were: SSX1 = 23.978, SSX2 = 4.115, SSX1|X2 = 19.959 and SSX2|X1 = 0.096.

Finally, consider a model fitted on all three variables. A model fitted to X2 and X3, without X1, yields SSModel = 4.137. When X1 is added, so that all 3 variables are now in the model, the SS accounted for is 26.190. How much of this is due to X1 entering after X2 and X3 are already in the model? Calculate 26.190 − 4.137 = 22.053. This sum of squares is denoted SSX1|X2,X3.

In summary, X1 accounts for 23.978 when it enters alone, 19.959 when it enters after X2, and 22.053 when it enters after both X2 and X3 together. Clearly, how much variation X1 accounts for depends on what variables are already in the model, so we cannot just talk about "the" sum of squares for X1. We can use the new notation to describe a sum of squares for X1 that indicates which other variables are in the model. This is the notation of the extra sum of squares. The notation is (SSX1) for X1 alone in the model (adjusted only for the intercept), (SSX1|X2) indicating X1 adjusted for X2 only, and (SSX1|X2, X3) indicating that X1 is entered after, or adjusted for, both X2 and X3. For our example: SSX1 = 23.978, SSX1|X2 = 19.959, SSX1|X2,X3 = 22.053.

The same procedure would be done for each of the other two variables. We would calculate the same series of values for the variable X2: SSX2, SSX2|X1 or SSX2|X3, and SSX2|X1,X3. The series for variable X3 would be: SSX3, SSX3|X1 or SSX3|X2, and SSX3|X1,X2. These values are given in the table below.

   Extra SS        SS       d.f. Error   Error SS
   SSX1           23.978        10        38.939
   SSX2            4.115        10        58.802
   SSX3            0.237        10        62.680
   SSX1|X2        19.959         9        38.843
   SSX2|X1         0.096         9        38.843
   SSX1|X3        25.134         9        37.546
   SSX3|X1         1.393         9        37.546
   SSX2|X3         3.900         9        58.780
   SSX3|X2         0.022         9        58.780
   SSX1|X2,X3     22.053         8        36.727
   SSX2|X1,X3      0.819         8        36.727
   SSX3|X1,X2      2.116         8        36.727

All of these SS are previously adjusted only for the intercept (X0, the correction factor), and this will always be the case for our examples. We could include a notation for the intercept in the extra SS (e.g. SSX1|X0; SSX1|X0, X2; SSX1|X0, X2, X3; etc.), but since X0 will always be present we will omit it from our notation.

Partial sums of squares or Type II SS
With so many possible sums of squares, which ones will be useful to us? The sum of squares normally used for a multiple regression is called the partial sum of squares: the sum of squares where each variable is adjusted for all other variables in the model. These are SSX1|X2,X3; SSX2|X1,X3; and SSX3|X1,X2. This type of sum of squares is sometimes called the fully adjusted SS, or the uniquely attributable SS. In SAS they are called the TYPE II or TYPE III sums of squares, since these two types are the same for regression analysis; SAS provides TYPE II in PROC REG and TYPE III in PROC GLM. Testing and evaluation of variables in multiple regression is usually done with the TYPE II or TYPE III SS.

ANOVA table for this analysis (F0.05,1,8 = 5.32), using the TYPE III SS (partial SS):

   Source        d.f.     SS        MS       F value
   SSX1|X2,X3      1     22.053    22.053     4.804
   SSX2|X1,X3      1      0.819     0.819     0.178
   SSX3|X1,X2      1      2.116     2.116     0.461
   ERROR           8     36.727     4.591
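Each F value in the table is the variable's partial MS divided by the MSE; a quick check with the table's own numbers:

   F_{X1|X2,X3} = \frac{MS_{X1|X2,X3}}{MSE} = \frac{22.053}{4.591} = 4.80 < F_{0.05,1,8} = 5.32

so even X1, the strongest of the three variables, is not significant at the 5% level once it is adjusted for X2 and X3 (and X2 and X3 clearly are not significant either).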
Sequential sums of squares or Type I SS
When we fit a regression we are normally interested in the partial sums of squares, but there is another type, the sequentially adjusted SS. These sums of squares are adjusted in a sequential or serial fashion. Each SS is adjusted for the variables previously entered in the model, but not for variables entered after it, so it is important to note the order in which the variables are entered in the model. For the model [Y = X1 X2 X3], X1 would be first and adjusted for nothing else (except the intercept X0). X2 would enter second and be adjusted for X1, but not for X3. X3 enters last and is adjusted for both X1 and X2. Using our extra SS notation these are SSX1; SSX2|X1; and SSX3|X1,X2.
These sums of squares have a number of potential problems. Unfortunately, the SS differ depending on the order in which the variables are entered, so different researchers would get different results. As a result, the use of this SS type is rare, and it is only appropriate where there is a mathematical reason to place the variables in a particular order. Its use is restricted pretty much to polynomial regressions, which use a series of power terms (e.g. Yi = β0 + β1 Xi + β2 Xi² + β3 Xi³ + εi), and some other odd applications (e.g. in some cases Analysis of Covariance). Investigators sometimes feel that they know which variables are more important, but this is not justification for using sequential sums of squares. So, we will not use sequential SS at all, though they are provided by default by SAS PROC GLM.

Multiple Regression with SAS
This same data set was run with SAS. The program was

   **********************************************;
   *** EXST7005 Multiple Regression Example 1 ***;
   **********************************************;
   OPTIONS LS=78 PS=78 NODATE nocenter nonumber;
   DATA ONE; INFILE CARDS MISSOVER;
      TITLE1 'EXST7005 MULTIPLE REGRESSION EXAMPLE #1';
      INPUT Y X1 X2 X3;
   CARDS;
   PROC PRINT DATA=ONE; TITLE2 'Data Listing'; RUN;

See SAS output in Appendix 8. Note the PROC REG section:

   PROC REG DATA=ONE LINEPRINTER;
      TITLE2 'Analysis with PROC REG';
      MODEL Y = X1 X2 X3;
      OUTPUT OUT=NEXT P=P R=E STUDENT=student rstudent=rstudent
             lcl=lcl lclm=lclm ucl=ucl uclm=uclm;
   RUN;
   OPTIONS PS=35;
   TITLE2 'Residual plot';
   PLOT RESIDUAL.*PREDICTED.='E';
   RUN; QUIT;

In the output, note the overall model, the statistics for the individual variables, the residual plot, and the residuals, confidence intervals and univariate analysis.

   PROC PRINT DATA=NEXT;
      VAR Y X1 X2 X3 P E student rstudent lcl ucl lclm uclm;
   RUN;
   OPTIONS PS=61;
   PROC UNIVARIATE DATA=NEXT NORMAL PLOT; VAR E; RUN;

Note in particular the output from PROC PRINT (the interpretation of the variables student, rstudent, lcl, ucl, lclm and uclm) and the output from PROC UNIVARIATE, especially the test of normality.

This same analysis was done with PROC GLM:

   PROC GLM DATA=ONE;
      TITLE2 'Analysis with PROC GLM';
      MODEL Y = X1 X2 X3;
   RUN; QUIT;

The results are the same; we only want to look at the Type I and Type III SS.

Evaluation of Multiple Regression
If your objective is to test the 3 variables jointly (H0: β1 = 0, β2 = 0 and β3 = 0) or individually (H0: βi = 0), you are done at this point. None of the variables is significantly different from zero. If, however, your objective is to develop the simplest possible, most parsimonious model, you may delete the variables one at a time. Why one at a time? Because when you remove a variable everything changes, since the variables are adjusted for each other. We would remove the least significant variable (the one with the smallest F value). In this case the first step would be to remove X2.
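A minimal sketch of how the refit without X2 could be requested (not part of the handout's listing); only the MODEL statement changes, and the SS2 option again requests the partial SS:

   PROC REG DATA=ONE;
      MODEL Y = X1 X3 / SS2;   /* partial (Type II) SS for X1 and X3 */
   RUN;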
ANOVA table for the analysis of the variables X1 and X3 alone (F0.05,1,9 = 5.117). Note that X1 is now significant, but X3 is not and may be removed as step 2.

   Source      d.f.     SS        MS       F value
   SSX1|X3       1     25.134    25.134     6.024
   SSX3|X1       1      1.393     1.393     0.334
   ERROR         9     37.546     4.172

After X3 is removed, the variable X1 is still significant (F0.05,1,10 = 4.965).

   Source      d.f.     SS        MS       F value
   SSX1          1     23.977    23.977     6.158
   ERROR        10     38.939     3.894

This one-at-a-time variable removal process is called "stepwise regression". More specifically, it would be called backward selection stepwise regression. It is called backward because it starts with a full model and removes one variable at a time. There is also a forward stepwise regression, where the best single variable is found to start with and additional variables are added to the model if they meet the significance requirements.
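A sketch of how forward selection could be requested in PROC REG (the SLENTRY= significance level for entry is shown with an assumed value of 0.10; the handout's own example, which follows, uses backward elimination):

   PROC REG DATA=ONE;
      MODEL Y = X1 X2 X3 / selection=forward slentry=0.10;  /* add variables while they meet the entry level */
   RUN;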
Standard errors are provided for confidence intervals, as well as a test of each regression coefficient against 0 (zero). Confidence intervals are placed on the parameters the same as with SLR although the calculations differ. The d.f. for the t value is based on the MSE (for the final model) as with simple linear regression. The parameter and standard errors can be estimated in SAS. Residual evaluation is very similar to SLR, but residuals are usually plotted on Yhat instead of X, since there are several independent variables (i.e. X's). James P. Geaghan Copyright 2010 Statistical Methods I (EXST 7005) Page 163 Evaluation of the residuals using PROC UNIVARIATE for testing normality and outlier detection is the same as for SLR. Fully adjusted SS also mean fully adjusted regression coefficient (also partial reg. coeff.). SAS REG does not give tests of SS like GLM, but the tests of the βi values are the same as the tests of the Type III SS. There are a few things that are different. The R2 value is now called the coefficient of multiple determination (instead of the coefficient of determination). As discussed, we now evaluate SS for the individual variables. Note that the tests of TYPE III SS are identical to the tests of the regression coefficients (see GLM handout). PROC REG does only the latter, and will not do the former. There is a suite of new diagnostics for evaluating the multiple independent variables and their interrelations. We will not discuss these, except to say that if the independent variables are highly correlated with each other (a correlation coefficient, r, of around 0.9), then the parameter estimates can fluctuate wildly and Extra SS SS unpredictably and may not be useful. SSX1 23.978 Also note a curious behavior of the variables when they SSX2 4.115 occur together. When one independent variable Xi is SSX3 0.237 adjusted for another, sometimes it's SS are larger SSX1|X2 19.959 than what it would be for that variable alone and SSX2|X1 0.096 sometimes athe SS are smaller. This is SSX1|X3 25.134 unpredictable and can go either way. For example. SSX3|X1 1.393 The SSX1 was 23.978 when the variable was alone, SSX2|X3 3.900 but dropped to 19.959 when adjusted for X2, and SSX3|X2 0.022 increased to 25.134 when adjusted for X3. It SSX1|X2,X3 22.053 dropped to 22.053 when adjusted for both. In SSX2|X1,X3 0.819 essence the variables sometimes compete with each SSX3|X1,X2 2.116 other for sums of squares and at other times enhance each others ability to account for sums of squares. Adjusted SS Not only will the SS of one variable increase or decrease as other variables are added, the regression coefficient values will change. They may even change sign, and hence interpretation. Although the interpretation does not usually change, sometimes variables in combination do not necessarily have the same interpretation as they might have had when alone. Summary Multiple regression shares a lot in interpretation and diagnostics with SLR. Most diagnostics are the same as with SLR. The coefficients and sums of squares of the variables should be adjusted for each other. This is the sequential sum of squares or the Type II SS or Type III SS in SAS. This is the big and important difference from SLR. James P. Geaghan Copyright 2010 ...