
# Lecture 19 slides (AIC & BIC & Residuals & Studentization)


## Maximum likelihood vs. SSE

Therefore, for linear models, ML is almost equivalent to minimizing SSE! Similarly, strictly using maximum likelihood (i.e. minimizing $-2 \log p(y \mid \hat{\theta})$) will overfit the data as we increase the number of parameters. So we can look again at "significant" reductions in log-likelihood (or SSE), just viewing things through a different prism.

*(Lecture 22 – p. 1/44)*

## AIC

Akaike suggested the AIC (Akaike's Information Criterion):

$$\mathrm{AIC} = -2 \log p(y \mid \hat{\theta}) + 2d$$

where $d$ is the number of unconstrained parameters in the model (for a linear model with $p$ predictors plus an intercept, $d = p + 1$). Of course, for the normal model, that gives us:

$$\mathrm{AIC} = n \log \hat{\sigma}^2 + \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\hat{\sigma}^2} + c_n + 2(p+1)$$

where $c_n = n \log 2\pi$ is a constant.

Like adjusted $R^2$, this is a penalized criterion, only now it is a penalized *deviance* criterion. Using the AIC is justified because it asymptotically approximates the Kullback-Leibler information, so choosing the model with minimum AIC should choose the model with the smallest K-L distance to the true model. In English, this is the model that asymptotically gives the best predictive fit to the true model.

## Bayesian Information Criterion

Another information criterion is the BIC (also known as the SIC), postulated by Schwarz (1978) and justified further by Kass and Raftery (1995) and Kass and Wasserman (1995). The BIC approximation is

$$\mathrm{BIC} = D(y, \hat{\theta}) + d \log n$$

Note that the complexity penalty now depends on $n$, so the BIC penalizes severely for large numbers of parameters. Upshot: the BIC will choose less complex models than the AIC, since it tries to choose the model that is asymptotically *most likely to be the true model*, not the one that is closest to the true model.
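As a sanity check on these formulas, here is a minimal Python sketch. The helper `normal_aic_bic` and the residual values are invented for illustration; a real analysis would use the residuals of a fitted model.

```python
import math

def normal_aic_bic(residuals, p):
    """AIC and BIC for a normal linear model with p predictors (so d = p + 1
    unconstrained parameters), using the MLE sigma_hat^2 = SSE / n."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    sigma2 = sse / n                       # MLE of the error variance
    # deviance -2 log p(y | theta_hat) for the normal model
    deviance = n * math.log(sigma2) + sse / sigma2 + n * math.log(2 * math.pi)
    d = p + 1                              # intercept + p slopes
    return deviance + 2 * d, deviance + d * math.log(n)

# tiny made-up residual vector from a one-predictor fit (p = 1, n = 5)
aic, bic = normal_aic_bic([0.5, -1.2, 0.3, 0.9, -0.5], p=1)
```

Note that the two criteria share the same deviance term and differ only in the penalty: $2d$ versus $d \log n$, so for $n > e^2 \approx 7.4$ the BIC penalty is the harsher one.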

## Example

```r
# Using the AIC function
> AIC(treemodel.N)
[1] 48.34248
> AIC(treemodel.age)
[1] 37.46445
> AIC(treemodel.hd)
[1] 31.44173
> AIC(treemodel.agehd)
[1] 33.44107
> AIC(treemodel.NHD)
[1] 20.28018
> AIC(treemodel.ageN)
[1] 36.75299
> AIC(treemodel.agehdN)
[1] 18.37409
```

```r
# Can calculate BIC with the same function, only with
# penalty k = log(n), where n is the sample size
> n <- nrow(treedata)
# NOTE: THESE ARE BIC VALUES
> AIC(treemodel.N, k=log(n))
[1] 51.32967
> AIC(treemodel.age, k=log(n))
[1] 40.45165
> AIC(treemodel.hd, k=log(n))
[1] 34.42892
> AIC(treemodel.agehd, k=log(n))
[1] 37.424
> AIC(treemodel.NHD, k=log(n))
[1] 24.26311
> AIC(treemodel.ageN, k=log(n))
[1] 40.73592
> AIC(treemodel.agehdN, k=log(n))
[1] 23.35275
```

## Comparison

| Model | Adj. $R^2$ | PRESS | CV | $C_p$ | AIC | BIC |
|---|---|---|---|---|---|---|
| AGE | 0.43 | 7.12 | 0.38 | 34.8 | 37.5 | 40.5 |
| N | 0.01 | 12.7 | 0.71 | 71.4 | 48.3 | 51.3 |
| HD | 0.58 | 5.2 | 0.29 | 21.6 | 31.4 | 34.4 |
| AGE + N | 0.47 | 6.82 | 0.42 | 30.3 | 36.8 | 40.7 |
| AGE + HD | 0.55 | 6.26 | 0.39 | 23.6 | 33.4 | 37.4 |
| HD + N | 0.77 | 3.01 | 0.18 | 5.5 | 20.3 | 24.3 |
| AGE + HD + N | 0.80 | 3.00 | 0.19 | 4.0 | 18.4 | 23.4 |

## Stepwise selection

When I have large numbers of predictors, I may not be able to fit all possible models to find the one with minimum AIC, BIC, adjusted $R^2$, etc. Often, then, we resort to stepwise procedures (which are also useful when using criteria like F-tests, which can only be used for nested models).

Forward stepwise regression:

1. Start with the model with just the intercept.
2. Calculate which covariate, when added to the model, gives the "best" improvement in model fit, if any (i.e. the smallest AIC, BIC, or F-test p-value, or the largest adjusted $R^2$).
3. Put that covariate in the model, giving you a new
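The forward stepwise procedure can be sketched as a greedy AIC-based search. This is a toy Python illustration, not the R `step` function used in practice; the function names and the synthetic data are invented for the example.

```python
import math
import numpy as np

def aic_of_fit(X, y):
    """AIC of an OLS fit with intercept, via the normal-model formula
    n*log(SSE/n) + n*(1 + log(2*pi)) + 2*(number of mean parameters)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sse = float(resid @ resid)
    return n * math.log(sse / n) + n * (1 + math.log(2 * math.pi)) + 2 * Xd.shape[1]

def forward_stepwise(candidates, y):
    """Greedy forward selection: repeatedly add the candidate predictor
    that lowers AIC the most; stop when no addition improves AIC."""
    chosen = []
    current = aic_of_fit(np.empty((len(y), 0)), y)   # intercept-only model
    remaining = dict(candidates)
    while remaining:
        trials = {
            name: aic_of_fit(
                np.column_stack([candidates[c] for c in chosen] + [col]), y)
            for name, col in remaining.items()
        }
        best = min(trials, key=trials.get)
        if trials[best] >= current:                  # no improvement: stop
            break
        chosen.append(best)
        current = trials[best]
        del remaining[best]
    return chosen, current

# synthetic illustration: y depends strongly on x1 but not on x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)
y = 2.0 * x1 + rng.normal(scale=0.5, size=40)
chosen, best_aic = forward_stepwise({"x1": x1, "x2": x2}, y)
```

Because each step only compares models that differ by one added covariate, this search fits at most $p$ models per step instead of all $2^p$ subsets, which is the point of the stepwise shortcut.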