# solution2cb - Statistics 203 Introduction to Regression and...


Statistics 203: Introduction to Regression and Analysis of Variance

Assignment #2 Solutions¹

¹ Thanks to Laura Miller for her contribution to the sample solutions.

Question 1:

a) Note that the following solution is only one of many possible ways to answer this question.

    > library(leaps)
    > library(car)
    > edu <- read.table('education.table', header=T)
    > attach(edu)
    > region <- as.factor(region)
    > lm1 <- lm(education~(income+under18+urban)*region+region)
    > summary(lm1)
    ### Use all subsets regression
    > X <- model.matrix(lm1)[,-1]
    > Cp.leaps <- leaps(X, education, nbest=3, method='Cp')
    > plot(Cp.leaps$size, Cp.leaps$Cp, pch=21, bg=c('red'), cex=1.5)

[Figure: scatter plot of Cp.leaps\$Cp (y-axis, about 0 to 50) against Cp.leaps\$size (x-axis, 2 to 16).]

    > best.model.Cp <- Cp.leaps$which[which((Cp.leaps$Cp == min(Cp.leaps$Cp))),]
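The Cp values plotted above can also be computed by hand, which makes the selection criterion easier to read. The following is a minimal sketch under stated assumptions: it uses R's built-in `mtcars` data rather than the education data, the helper `mallows.cp` is a hypothetical name, and it applies the usual definition Cp = RSS_p / sigma² − n + 2p (with sigma² estimated from the full model and p counting the intercept), which `leaps` is assumed to follow.

```r
# Hypothetical illustration on R's built-in mtcars data (not the education data).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)
sigma2.full <- summary(full)$sigma^2   # residual variance estimate from the full model

# Mallows' Cp for a submodel: RSS_p / sigma2.full - n + 2 * p,
# where p counts the fitted coefficients including the intercept.
mallows.cp <- function(sub) {
  p <- length(coef(sub))
  sum(resid(sub)^2) / sigma2.full - n + 2 * p
}

cp.sub  <- mallows.cp(lm(mpg ~ wt + hp, data = mtcars))
cp.full <- mallows.cp(full)   # equals the number of coefficients in the full model

print(c(cp.sub = cp.sub, cp.full = cp.full))
```

By construction the full model always has Cp equal to its number of coefficients, so the criterion is only informative for comparing proper submodels; small values close to p indicate a submodel with little bias.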

    > best.model.Cp <- which(best.model.Cp)
    > colnames(X)[best.model.Cp]
    [1] "income"          "under18"         "under18:region4" "urban:region4"

The plot above shows the Cp values for the best three models of each size. Based on the Cp criterion (minimize Cp), the optimal size is 4 (excluding the intercept) and the optimal set of regressors is:

income, under18, under18:region4, urban:region4

The output below was evidently produced by the equivalent call lm(education ~ X[, best.model.Cp]), which uses the same four regressors; this explains the coefficient labels.

    > region4 <- X[,6]
    > best.lm <- lm(education~income+under18+under18:region4+urban:region4)
    > summary(best.lm)

    Call:
    lm(formula = education ~ X[, best.model.Cp])

    Residuals:
        Min      1Q  Median      3Q     Max
    -75.420 -24.302  -1.306  16.926  82.276

    Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)                      -3.174e+02  1.366e+02  -2.323  0.02476 *
    X[, best.model.Cp]income          6.562e-02  8.974e-03   7.312 3.52e-09 ***
    X[, best.model.Cp]under18         8.793e-01  3.600e-01   2.443  0.01856 *
    X[, best.model.Cp]under18:region4 4.581e-01  1.679e-01   2.729  0.00904 **
    X[, best.model.Cp]urban:region4  -1.712e-01  7.475e-02  -2.291  0.02673 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    Residual standard error: 37.14 on 45 degrees of freedom
    Multiple R-Squared: 0.6633,     Adjusted R-squared: 0.6334
    F-statistic: 22.16 on 4 and 45 DF,  p-value: 3.677e-10

b) Now we verify whether the model obtained in part a) gives an appropriate fit.

    > par(mfrow=c(2,2))
    > plot(best.lm)
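Looking back at the `summary(best.lm)` output from part a), the reported summary statistics can be cross-checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the printed output (so they carry rounding error), and the variable names are illustrative. The sample size of 50 is inferred from the 45 residual degrees of freedom plus 5 estimated coefficients.

```r
# Values copied from the summary(best.lm) output in part a).
r2 <- 0.6633   # Multiple R-squared
n  <- 50       # 45 residual degrees of freedom + 5 estimated coefficients
k  <- 4        # regressors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of regressors.
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F statistic on k and n - k - 1 degrees of freedom.
f.stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# t value for income: estimate divided by its standard error.
t.income <- 6.562e-02 / 8.974e-03

print(round(c(adj.r2 = adj.r2, f.stat = f.stat, t.income = t.income), 4))
```

Up to rounding, these reproduce the reported adjusted R-squared (0.6334), F-statistic (22.16), and the t value for income (7.312).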
[Figure: four diagnostic plots for best.lm. Residuals vs Fitted (Fitted values against Residuals; labeled observations 15, 10, 7). Normal Q-Q plot (Theoretical Quantiles against Standardized residuals; labeled observations 49, 15, 10). Scale-Location plot (against Fitted values; labeled observations 49, 15, 10). Cook's distance plot (Obs. number against Cook's distance; labeled observations 49, 39, 48).]

The diagnostic plots above for the best model chosen in part a) show that observation 49 (Alaska) has a moderate influence on the regression estimates, as can be seen from the Cook's distance plot. The plot in the upper right corner shows that the residuals are distributed fairly symmetrically around zero. Note, however, that the plot gives strong evidence of heteroscedasticity, i.e. non-constant variance. Ideally
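The two diagnostics discussed above, influence via Cook's distance and non-constant variance, can also be inspected numerically. The following is a minimal sketch on R's built-in `mtcars` data, not the education data. The 4/n cutoff is a common rule of thumb assumed here for illustration, and the auxiliary regression of squared residuals on fitted values is a crude stand-in for a formal test; neither is taken from the solution itself.

```r
# Hypothetical illustration on R's built-in mtcars data (not the education data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

# Cook's distance for each observation; 4/n is a rule-of-thumb cutoff
# (an assumption here, not a threshold used in the solution).
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / n]

# Crude check for non-constant variance: regress squared residuals on
# fitted values. A clearly non-zero slope hints at heteroscedasticity.
aux <- lm(resid(fit)^2 ~ fitted(fit))
slope.p <- summary(aux)$coefficients[2, "Pr(>|t|)"]

print(influential)   # observations flagged as potentially influential
print(slope.p)       # p-value for the slope of the auxiliary regression
```

Flagged observations are candidates for a closer look rather than automatic removal, in the same spirit as the treatment of observation 49 above.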
