

Both the histogram and the Q-Q plot suggest that the normality assumption may not hold. The data appear to have a right skew. The model is not a good fit because the normality assumption does not hold.

(2c) 2pts - Using Cook's distance, identify whether there are any outliers in model3.

Response to question (2c):

    # Your code here...
    cook = cooks.distance(model3)
    # Show the maximum Cook's distance
    cat("Maximum Cook's distance:", max(cook))

    ## Maximum Cook's distance: 0.04408242

The maximum Cook's distance is 0.04408242. Against the common rule-of-thumb cutoff of 1, this suggests that there are no outliers in model3.
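The conclusion in (2c) depends on which cutoff is used to flag an influential point. As a minimal sketch, assuming a fitted `lm` object, the following compares two common rules of thumb: a fixed cutoff of 1 and the stricter 4/n. The built-in `mtcars` data and the names `fit`, `n`, `flagged_1`, and `flagged_4n` are stand-ins for illustration and are not part of the original analysis.

```r
# Stand-in model: mtcars is a built-in dataset, used here only so the
# sketch runs on its own; substitute model3 in the actual analysis.
fit <- lm(mpg ~ wt, data = mtcars)

cook <- cooks.distance(fit)   # one distance per observation
n <- nobs(fit)                # number of observations used in the fit

# Two common rules of thumb for flagging influential points
flagged_1  <- which(cook > 1)       # fixed cutoff of 1
flagged_4n <- which(cook > 4 / n)   # stricter cutoff of 4/n

cat("Maximum Cook's distance:", max(cook), "\n")
cat("Points above 1:", length(flagged_1), "\n")
cat("Points above 4/n:", length(flagged_4n), "\n")
```

For the training set used here, the ANOVA table for model3 reports 225 residual degrees of freedom with two estimated coefficients, so n = 227 and 4/n is roughly 0.018. The reported maximum of 0.04408242 would exceed that stricter cutoff while remaining far below 1, so it is worth stating which rule is being applied.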
(2d) 3pts - Build a multiple linear regression model called model4. For this model use the train data set with trestbps as the response variable, and age, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and sexM as the predictors. Display a summary of model4.

Response to question (2d):

    # Your code here...
    model4 = lm(trestbps ~ age + chol + fbs + restecg + thalach + exang +
                  oldpeak + slope + ca + sexM, data = train)
    summary(model4)

    ##
    ## Call:
    ## lm(formula = trestbps ~ age + chol + fbs + restecg + thalach +
    ##     exang + oldpeak + slope + ca + sexM, data = train)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -36.221  -9.973  -1.536   9.026  56.641
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) 79.40879   14.15661   5.609 6.18e-08 ***
    ## age          0.61340    0.13776   4.453 1.36e-05 ***
    ## chol        -0.01028    0.02254  -0.456   0.6489
    ## fbs          7.13089    3.06100   2.330   0.0208 *
    ## restecg     -4.31342    2.10051  -2.054   0.0412 *
    ## thalach      0.12859    0.05873   2.189   0.0296 *
    ## exang        0.38038    2.54115   0.150   0.8811
    ## oldpeak      1.70896    1.29577   1.319   0.1886
    ## slope        0.72143    2.21864   0.325   0.7454
    ## ca           0.55546    1.13136   0.491   0.6239
    ## sexM        -1.03626    2.46870  -0.420   0.6751
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 16.13 on 216 degrees of freedom
    ## Multiple R-squared:  0.1631, Adjusted R-squared:  0.1244
    ## F-statistic:  4.21 on 10 and 216 DF,  p-value: 2.378e-05

(2e) 2pts - At an alpha level of 0.01, compare the explanatory power of model3 to model4.

Response to question (2e):

We will use a partial F-test to compare the explanatory power of model3 to model4.

    # Your code here...
    anova(model3, model4)

    ## Analysis of Variance Table
    ##
    ## Model 1: trestbps ~ age
    ## Model 2: trestbps ~ age + chol + fbs + restecg + thalach + exang + oldpeak +
    ##     slope + ca + sexM
    ##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)
    ## 1    225 60735
    ## 2    216 56181  9    4554.3 1.9456 0.04708 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of 0.04708 is greater than the alpha level of 0.01. Therefore we fail to reject the null hypothesis that the coefficients of the nine additional predictors in model4 are all zero. We can state that model4 does not have statistically significantly higher explanatory power than model3 at an alpha level of 0.01.

(2f) 2pts - Calculate the variance inflation factors for model4. Comment on the results and possible implications with respect to multicollinearity.

Response to question (2f):

    # Your code here...
    # vif() is assumed to come from the car package, loaded earlier in the document
    vif(model4)

    ##      age     chol      fbs  restecg  thalach    exang  oldpeak    slope
    ## 1.412361 1.119516 1.066431 1.063537 1.576659 1.229537 1.565487 1.540519
    ##       ca     sexM
    ## 1.174302 1.076817

None of the VIF values in model4 is greater than 2 (the largest is 1.576659, for thalach), suggesting that multicollinearity should not be a problem with the model.

(2g) 2pts - Calculate the mean absolute percentage error (MAPE) of model3 and model4 on the test data set. Using MAPE as the metric, which model would you suggest is optimal?

Response to question (2g):

    # Your code here
    mape3 = mean(abs(predict(model3, test) - test$trestbps) / test$trestbps)
    cat("MAPE of model3:", mape3, "\n")

    ## MAPE of model3: 0.1042716

    mape4 = mean(abs(predict(model4, test) - test$trestbps) / test$trestbps)
    cat("MAPE of model4:", mape4, "\n")

    ## MAPE of model4: 0.1044115

model3 has the lower MAPE (0.1042716 versus 0.1044115), which suggests it is the better model under this metric, although the difference is very small.
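The MAPE computation in (2g) repeats the same expression for each model, so it could be wrapped in a small helper. The following is a sketch only: the function name `mape` and the toy vectors are hypothetical, chosen to make the example self-contained, and are not part of the original solution.

```r
# Mean absolute percentage error, returned as a fraction (0.10 means 10%).
# Assumes no actual value is zero, since each error is divided by it.
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mean(abs(predicted - actual) / abs(actual))
}

# Toy values for illustration only (not taken from the heart data set)
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Per-observation errors are 0.10, 0.10 and 0, so the mean is 0.2 / 3
result <- mape(actual, predicted)
print(result)
```

In the analysis above, the helper would be called as `mape(test$trestbps, predict(model3, test))`, and likewise for `model4`, assuming both models and the `test` data frame are in scope.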