STA 108 Regression Analysis
Lecture 8

Irina Udaltsova
Department of Statistics
University of California, Davis
January 23rd, 2015

Admin for the Day

Homework 2 is due today, Friday, January 23rd, in class
Homework 3 is assigned (soon) and is due this coming Wednesday, January 28th, in class
Midterm: Friday, January 30th, in class
References for Today: Ch. 2.7-2.10, 3.1-3.2 (Kutner, 5th Ed.)

Topics For Today

Recap:
    Analysis of variance approach to regression
    Measures of Association
Today:
    1. ANOVA in R
    2. General linear test
    3. Diagnostics

ANOVA table

Recap: ANOVA table: a table that summarizes the variance decomposition in the response variable, useful in testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$.

    Source       df                   SS     MS     F*
    Regression   df(SSR) = 1          SSR    MSR    F* = MSR/MSE
    Error        df(SSE) = n - 2      SSE    MSE
    Total        df(SSTO) = n - 1     SSTO

Toluca example: ANOVA table in R

Let's review the commands we need in order to produce the ANOVA table in R for the Toluca Company example.

    > workData = read.table("toluca.txt", header=TRUE)
    > fit = lm(WorkHrs ~ LotSize, data=workData)
    > anova(fit)
    Analysis of Variance Table

    Response: WorkHrs
              Df Sum Sq Mean Sq F value    Pr(>F)
    LotSize    1 252378  252378  105.88 4.449e-10 ***
    Residuals 23  54825    2384
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The F statistic is 105.88 and the P-value is 4.449e-10 (0.000 to three decimal places). Hence, at the 0.05 level of significance, we conclude that there is a linear relationship between work hours and lot size.

Toluca example: additional output in R

Let's review additional important output in R for the Toluca Company example.

    > fit = lm(WorkHrs ~ LotSize, data=workData)
    > summary(fit)

    Call:
    lm(formula = WorkHrs ~ LotSize, data = workData)

    Residuals:
        Min      1Q  Median      3Q     Max
    -83.876 -34.088  -5.982  38.826 103.528

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   62.366     26.177   2.382   0.0259 *
    LotSize        3.570      0.347  10.290 4.45e-10 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 48.82 on 23 degrees of freedom
    Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8138
    F-statistic: 105.9 on 1 and 23 DF,  p-value: 4.449e-10

The residual standard error is the square root of MSE, i.e., $\sqrt{MSE} = 48.82$.
The degrees of freedom associated with SSE are $n - 2 = 23$.
The adjusted $R^2_{adj} = 0.8138$. Hence, 81.38% of the variation in the work hours is explained by the regression relationship on the lot size, after adjusting for the number of parameters in the model.

Coefficient of determination, revisited

Important things to remember about $R^2$ (and $R^2_{adj}$):

A high $R^2$ does not necessarily imply that
    useful predictions could be made, because prediction intervals can be wide;
    the estimated regression line is a good fit, because the actual regression relationship could be curvilinear.
A low $R^2$ does not necessarily imply that
    X and Y are not related, because the actual regression relationship could be curvilinear.

General linear test

Suppose we are interested in testing for dependence on the predictor variable from a different point of view. The ideas of the ANOVA approach to regression are applicable here.

Define a Full model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Define a Reduced model: $Y_i = \beta_0 + \varepsilon_i$

Goal: We want to test $H_0$: Reduced model holds, against $H_1$: Full model holds.

Example: suppose we want to test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. Then, under $H_0: \beta_1 = 0$, we have the reduced model.

In this case, under the full model, $SSE_{full} = \sum_i (Y_i - \hat{Y}_i)^2 = SSE$.
Under the reduced model, $SSE_{red} = \sum_i (Y_i - \bar{Y})^2 = SSTO$.

Observe that d.f.$(SSE_{full}) = n - 2$, d.f.$(SSE_{red}) = n - 1$, and $SSE_{red} - SSE_{full} = SSTO - SSE = SSR$, with $(n - 1) - (n - 2) = 1$ degree of freedom.

Test statistic: $F^* = MSR/MSE$.
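The figures quoted in the R output above are tied together by the ANOVA identities, so they can be cross-checked from the sums of squares alone. The sketch below does this using only the numbers printed in the Toluca ANOVA table (SSR = 252378, SSE = 54825, n = 25). The variable names and the helper `general_linear_test` are our own illustrative choices, not part of the lecture.

```r
# Sums of squares and sample size as printed in the Toluca ANOVA table.
SSR <- 252378
SSE <- 54825
n   <- 25

SSTO <- SSR + SSE                 # total sum of squares
MSR  <- SSR / 1                   # regression mean square, 1 d.f.
MSE  <- SSE / (n - 2)             # error mean square, n - 2 d.f.

resid_se <- sqrt(MSE)             # "Residual standard error" in summary(fit)
R2       <- SSR / SSTO            # "Multiple R-squared"
R2_adj   <- 1 - (SSE / (n - 2)) / (SSTO / (n - 1))   # "Adjusted R-squared"

# Hypothetical helper: general linear test from the SSE and d.f. of two
# nested models (reduced inside full).
general_linear_test <- function(sse_red, df_red, sse_full, df_full) {
  f_star <- ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
  p_val  <- pf(f_star, df_red - df_full, df_full, lower.tail = FALSE)
  list(F = f_star, p = p_val)
}

# H0: beta1 = 0. The reduced model is Y_i = beta0 + eps_i, so SSE_red = SSTO.
glt <- general_linear_test(SSTO, n - 1, SSE, n - 2)

print(round(c(resid_se = resid_se, R2 = R2, R2_adj = R2_adj, F = glt$F), 4))
print(glt$p)
```

Because the helper takes only sums of squares and degrees of freedom, it applies to any pair of nested models, not just the test of a zero slope.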
General linear test

Goal: Test $H_0$: Reduced model holds, against $H_1$: Full model holds.

General linear test statistic:

\[ F^* = \frac{\left(SSE_{red} - SSE_{full}\right) / \left(\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full})\right)}{SSE_{full} / \mathrm{d.f.}(SSE_{full})} \]

For the test of $\beta_1 = 0$ this becomes

\[ F^* = \frac{SSR / \mathrm{d.f.}(SSR)}{SSE / \mathrm{d.f.}(SSE)} = \frac{MSR}{MSE} \]

Under the normal error model, and under $H_0: \beta_1 = 0$,

\[ F^* \sim F_{\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full}),\; \mathrm{d.f.}(SSE_{full})} \]

the F distribution with numerator and denominator degrees of freedom $(\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full}),\ \mathrm{d.f.}(SSE_{full}))$.

P-value: $p = P\left(F_{\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full}),\; \mathrm{d.f.}(SSE_{full})} > F^*\right)$

Decision rule: reject $H_0$ at level of significance $\alpha$ if $F^* > F(1 - \alpha;\ \mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full}),\ \mathrm{d.f.}(SSE_{full}))$, or, equivalently, if $p < \alpha$.

Toluca example: General linear test

In the Toluca example, suppose we were interested in testing $H_0: \beta_0 = 20$ and $\beta_1 = 3$, against $H_1$: either $\beta_0 \neq 20$ or $\beta_1 \neq 3$ or both.

Full model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \dots, n$
    Estimates of $\beta_0$ and $\beta_1$ are $b_0$ and $b_1$, respectively
    Fitted values: $\hat{Y}_i = b_0 + b_1 X_i$
    Residuals: $e_i = Y_i - \hat{Y}_i$
    Residual sum of squares: $SSE_{full} = \sum_i e_i^2$
    d.f.$(SSE_{full}) = n - $ (# of beta parameters estimated) $= n - 2$

Reduced model: $Y_i = 20 + 3 X_i + \varepsilon_i$, $i = 1, \dots, n$
    Fitted values: $\hat{Y}_i = 20 + 3 X_i$
    Residuals: $\tilde{e}_i = Y_i - \hat{Y}_i = Y_i - (20 + 3 X_i)$
    Residual sum of squares: $SSE_{red} = \sum_i \tilde{e}_i^2$
    d.f.$(SSE_{red}) = n - $ (# of beta parameters estimated) $= n - 0 = n$

Numerator d.f. = d.f.$(SSE_{red}) - $ d.f.$(SSE_{full}) = n - (n - 2) = 2$
Denominator d.f. = d.f.$(SSE_{full}) = n - 2$

Toluca example: General linear test

    > anova(fit) # full model
    Analysis of Variance Table

    Response: WorkHrs
              Df Sum Sq Mean Sq F value    Pr(>F)
    LotSize    1 252378  252378  105.88 4.449e-10 ***
    Residuals 23  54825    2384

    > yhat = 20 + 3*workData$LotSize # reduced model
    > SSEred = sum((workData$WorkHrs - yhat)^2)
    > SSEred
    [1] 230513

Test statistic:

\[ F^* = \frac{\left(SSE_{red} - SSE_{full}\right) / \left(\mathrm{d.f.}(SSE_{red}) - \mathrm{d.f.}(SSE_{full})\right)}{SSE_{full} / \mathrm{d.f.}(SSE_{full})} = \frac{(230513 - 54825)/2}{54825/23} = 36.85202 \]

P-value: $p = P(F_{2,23} > F^*) = 6.718263 \times 10^{-8}$ (In R: 1-pf(36.85202,2,23))

Since the p-value is nearly 0 (smaller than 0.05), we conclude at the 0.05 level of significance that either the intercept is not equal to 20 or the slope is not equal to 3 or both.

Considerations in applying regression analysis

Making inferences for the future: We assume the regression model will remain consistent. However, the relationship between X and Y often changes with time. For example, the admissions officer using her model that predicts first-year GPA from ACT scores is assuming that the relationship between GPA and ACT stays the same (consistent). If the ACT changes in any way, or the academic grading policies for first-year college students change, then the model that she estimated may not be valid.

When the assumption of normality is inappropriate, then as long as the sample size is large:
    Most inference methods we discussed (Ch. 1+2) are valid
    With one exception, however: prediction intervals are no longer valid.

Considerations in applying regression analysis

Predicting observations outside the range of values for X: We only assume the linear relationship between X and Y within the range of the observed values of X. Extrapolation outside the range of X is not advised.

A linear association does not imply a cause and effect relationship: If we conclude $\beta_1 \neq 0$, that does not mean there is a causal relationship between X and Y.
There could be another variable that influences both X and Y (a confounding variable). This is a serious problem in observational studies, yet is still possible in controlled experiments.

Model Assumptions

Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \dots, n$, where $\varepsilon_1, \dots, \varepsilon_n$ are independent and normally distributed with mean 0 and variance $\sigma^2$.

Assumptions:
    Linearity
    Equal Variance
    Normality of errors
    Independence of errors

Departures from the model

Possible departures from the model:
    1. The regression function is not linear
    2. The error terms (i.e., the $\varepsilon_i$'s) do not have constant variance
    3. The error terms are not independent
    4. The model fits all but one or a few outlier observations
    5. The error terms are not normally distributed
    6. One or several important predictor variables have been omitted from the model

Diagnostics

Diagnostic plots using residuals or semistudentized residuals:
[Residuals: $e_i = Y_i - \hat{Y}_i$, Semistudentized residuals: $e_i^* = e_i / \sqrt{MSE}$]
    1. Plot of residuals against predictor variable (useful for examining departures 1, 2, 4)
    2. Plot of absolute or squared residuals against predictor variable (useful for examining departure 2)
    3. Plot of residuals against fitted values (useful for examining departures 1, 2, 4)
    4. Plot of residuals against time or other sequence (useful for examining departure 3)
    5. Plot of residuals against omitted predictor variables (useful for examining departure 6)
    6. Box plot, stem-and-leaf plot, histogram, or normal probability plot of residuals (useful for examining departure 5, and also 4)

Residual plots

Residual plots: When plotting residuals vs. X, look for NO PATTERN!
Interpretation: If there is no obvious pattern, the plot gives no evidence of a departure from the model.
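The semistudentized residuals and the first two diagnostic plots in the list above can be sketched as follows. The data here are simulated purely for illustration (an assumption on our part; they are not the Toluca data), and the variable names are hypothetical.

```r
# Simulated data: made-up predictor values and a response generated from a
# straight line with constant-variance normal errors.
set.seed(108)                           # arbitrary seed, for reproducibility
n <- 25
x <- seq(20, 120, length.out = n)
y <- 60 + 3.5 * x + rnorm(n, sd = 50)

fit <- lm(y ~ x)
e   <- residuals(fit)                   # e_i = Y_i - Yhat_i
MSE <- sum(e^2) / (n - 2)               # SSE / (n - 2)
e_star <- e / sqrt(MSE)                 # semistudentized residuals

op <- par(mfrow = c(1, 2))              # two plots side by side
# Plot 1: residuals against the predictor (departures 1, 2, 4).
plot(x, e_star, xlab = "X", ylab = "Semistudentized residual")
abline(h = 0, lty = 2)
# Plot 2: absolute residuals against the predictor (departure 2).
plot(x, abs(e), xlab = "X", ylab = "|Residual|")
par(op)                                 # restore the previous layout
```

Since the simulated errors satisfy the model assumptions, both plots should show a patternless scatter; replacing the constant `sd = 50` with something that grows with `x` is one way to see what a fan-shaped departure looks like.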
Residual plots in R

    > fit = lm(WorkHrs ~ LotSize, data=workData)
    > fit$fitted.values   # Y-hat
    > fit$residuals       # e, residuals
    > residuals(fit)      # e, residuals

    > # Plot Residual vs X
    > # (ylab was cut off in the source; "Residuals" is taken from the plot's axis label)
    > plot(workData$LotSize, fit$resid, xlab="Lot Size", ylab="Residuals")
    > abline(h=0, lty=2)

Work Hours example: Residual plot: $e_i$ vs. X

[Figure: residuals (y-axis, about -50 to 100) against lot size (x-axis, 20 to 120), with a dashed horizontal line at 0]

All points appear without pattern, hence there are no obvious departures from the assumptions of linearity and constant variance. Also, there are no obvious outliers.

Note: in the case of simple linear regression, the e vs. X plot gives the same information as the e vs. $\hat{Y}$ plot, because the fitted values are a linear function of X.

Residual plots in R

    > # Plot Normal Probability plot
    > qqnorm(fit$resid)
    > qqline(fit$resid, col="red")

Work Hours example: Normal Probability Residual plot

[Figure: Normal Q-Q Plot; sample quantiles (y-axis) against theoretical quantiles (x-axis, -2 to 2), with a red reference line]

All points except for one small outlier appear to follow the reference line. Hence, the normality assumption for the error terms appears adequate.
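To see what `qqnorm` is plotting, the normal probability plot can be built by hand: the ordered residuals are plotted against the standard normal quantiles at the plotting positions returned by `ppoints`. The sketch below uses simulated data (an assumption for illustration, not the Toluca data), and the correlation computed at the end is only an informal numeric summary of how straight the plot is, not a procedure from this lecture.

```r
# Simulated data standing in for a fitted simple linear regression.
set.seed(108)                           # arbitrary seed, for reproducibility
n <- 25
x <- seq(20, 120, length.out = n)
y <- 60 + 3.5 * x + rnorm(n, sd = 50)
fit <- lm(y ~ x)
e   <- residuals(fit)

theo <- qnorm(ppoints(n))               # theoretical quantiles used by qqnorm()
samp <- sort(unname(e))                 # ordered residuals (sample quantiles)

# Hand-built normal probability plot, with the same reference line as qqline().
plot(theo, samp,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (by hand)")
qqline(e, col = "red")                  # line through the first and third quartiles

# Informal summary: correlation between ordered residuals and normal quantiles.
r_qq <- cor(theo, samp)
print(r_qq)

# Cross-check against qqnorm() without drawing a second plot.
qq <- qqnorm(e, plot.it = FALSE)
```

A correlation close to 1 corresponds to points lying near a straight line, which is what the visual check in the slides looks for.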