solution2cb - Statistics 203 Introduction to Regression and...

Statistics 203
Introduction to Regression and Analysis of Variance
Assignment #2 Solutions [1]

[1] Thanks to Laura Miller for her contribution to the sample solutions.

Question 1:

a) Note that the following solution is only one of many possible ways to answer this question.

> library(leaps)
> library(car)
> edu <- read.table('education.table', header=T)
> attach(edu)
> region <- as.factor(region)
> lm1 <- lm(education~(income+under18+urban)*region+region)
> summary(lm1)
### Use all subsets regression
> X <- model.matrix(lm1)[,-1]
> Cp.leaps <- leaps(X, education, nbest=3, method='Cp')
> plot(Cp.leaps$size, Cp.leaps$Cp, pch=21, bg=c('red'), cex=1.5)

[Figure: scatter plot of Cp.leaps$Cp (y-axis) against Cp.leaps$size (x-axis)]

> best.model.Cp <- Cp.leaps$which[which((Cp.leaps$Cp == min(Cp.leaps$Cp))),]
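For reference, the Cp values that the all-subsets search minimises can be reproduced by hand. The following is a minimal sketch, assuming the usual definition Cp = RSS_p / sigma^2_full - n + 2p, with p counting the intercept. It uses R's built-in `mtcars` data as a stand-in because `education.table` is not reproduced here; the helper `mallows_cp` and the two example models are hypothetical and not part of the assignment.

```r
# Mallows' Cp by hand on built-in data (stand-in for the assignment data).
# mallows_cp() is a hypothetical helper written for this illustration.
mallows_cp <- function(sub_fit, full_fit) {
  n <- length(residuals(full_fit))
  # Error variance estimated from the full model
  sigma2_full <- sum(residuals(full_fit)^2) / df.residual(full_fit)
  # Number of coefficients in the submodel, intercept included
  p <- length(coef(sub_fit))
  sum(residuals(sub_fit)^2) / sigma2_full - n + 2 * p
}

full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
sub_fit  <- lm(mpg ~ wt + hp, data = mtcars)

cp_sub  <- mallows_cp(sub_fit, full_fit)
# For the full model, Cp reduces algebraically to its number of coefficients
cp_full <- mallows_cp(full_fit, full_fit)

print(c(sub = cp_sub, full = cp_full))
```

A submodel whose Cp is small and close to its own p is a good candidate, which is the sense in which the criterion is minimised here.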
> best.model.Cp <- which(best.model.Cp)
> colnames(X)[best.model.Cp]
[1] "income"          "under18"         "under18:region4" "urban:region4"

The plot above shows the Cp values for the best three models of each size. Based on the Cp criterion (minimize), the optimal size is 4 (excluding the intercept) and the optimal set of regressors is:

income
under18
under18:region4
urban:region4

> region4 <- X[,6]
> best.lm <- lm(education~income+under18+under18:region4+urban:region4)
> summary(best.lm)

Call:
lm(formula = education ~ X[, best.model.Cp])

Residuals:
    Min      1Q  Median      3Q     Max
-75.420 -24.302  -1.306  16.926  82.276

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -3.174e+02  1.366e+02  -2.323  0.02476 *
X[, best.model.Cp]income           6.562e-02  8.974e-03   7.312 3.52e-09 ***
X[, best.model.Cp]under18          8.793e-01  3.600e-01   2.443  0.01856 *
X[, best.model.Cp]under18:region4  4.581e-01  1.679e-01   2.729  0.00904 **
X[, best.model.Cp]urban:region4   -1.712e-01  7.475e-02  -2.291  0.02673 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 37.14 on 45 degrees of freedom
Multiple R-Squared: 0.6633,     Adjusted R-squared: 0.6334
F-statistic: 22.16 on 4 and 45 DF,  p-value: 3.677e-10

(The output shown was generated from the equivalent call lm(education ~ X[, best.model.Cp]), which is why the coefficient labels carry the X[, best.model.Cp] prefix.)

b) Now we verify whether the model obtained in part a) gives an appropriate fit.

> par(mfrow=c(2,2))
> plot(best.lm)
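The quantities behind the four panels drawn by plot() on a fitted lm object can also be inspected numerically. The following is a minimal sketch on R's built-in `mtcars` data, used as a stand-in because `education.table` is not reproduced here. The model `fit` and the 4/n cutoff are assumptions for illustration: the cutoff is a common rule of thumb, not something the assignment prescribes.

```r
# Numerical companions to the lm diagnostic panels, on stand-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

cook      <- cooks.distance(fit)   # one Cook's distance per observation
std_res   <- rstandard(fit)        # internally standardized residuals
scale_loc <- sqrt(abs(std_res))    # y-axis quantity of the Scale-Location panel

# Hypothetical cutoff: 4 / n is a rule of thumb, not part of the assignment.
cutoff <- 4 / nrow(mtcars)
influential <- names(cook)[cook > cutoff]

# Three largest Cook's distances, analogous to the labelled points in the plot
top3 <- head(sort(cook, decreasing = TRUE), 3)

print(top3)
print(influential)
```

Listing the largest values this way makes it easy to name the observations behind the labelled points, and a trend of `scale_loc` in the fitted values is the numerical counterpart of a fanning Scale-Location panel.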
[Figure: diagnostic plots for best.lm in a 2x2 layout. Panels: Residuals vs Fitted (observations 7, 10, 15 labelled); Normal Q-Q plot (10, 15, 49 labelled); Scale-Location plot (10, 15, 49 labelled); Cook's distance plot (39, 48, 49 labelled).]

The diagnostic plots above for the best model chosen in part a) show that observation 49 (Alaska) has a moderate influence on the regression estimates, as can be seen from the Cook's distance plot. The plot in the upper right corner shows that the residuals are distributed fairly symmetrically around zero. Note, however, that the plot gives strong evidence of heteroscedasticity, i.e. non-constant variance. Ideally