Maximum likelihood vs. SSE

Therefore, for linear models, ML is almost equivalent to minimizing SSE! Similarly, strictly using maximum likelihood (i.e., minimizing -2 log p(y | θ̂)) will overfit the data as we increase the number of parameters. So we can look again at "significant" reductions in log-likelihood (or SSE), just viewing things through a different prism.

Lecture 22 – p. 1/44

AIC

Akaike suggested the AIC (Akaike's Information Criterion):

    AIC = -2 log p(y | θ̂) + 2d

where d is the number of unconstrained parameters in the model. Of course, for the normal model, that gives us:

    AIC = n log σ̂² + Σ_{i=1}^n (y_i − ŷ_i)² / σ̂² + c_n + 2(p + 1)

where c_n = n log 2π is a constant.

AIC

Like adjusted R², this is a penalized criterion, only now it is a penalized deviance criterion. Using the AIC is justified because it asymptotically approximates the Kullback-Leibler information, so choosing the minimum-AIC model should choose the model with the smallest KL distance to the true model. In English, this is the model that asymptotically gives the best predictive fit to the true model.

Bayesian Information Criterion

Another information criterion is the BIC (also known as the SIC), postulated by Schwarz (1978) and justified further by Kass and Raftery (1995) and Kass and Wasserman (1995). The BIC approximation:

    BIC = D(y, θ̂) + d log n

Note that the complexity penalty now depends on n, so it penalizes severely for large numbers of parameters. Upshot: the BIC will choose less complex models than the AIC, as it tries to choose the model that is asymptotically most likely to be the true model, not the one that is closest to the true model.
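The normal-model formulas above can be checked numerically. Below is a minimal sketch (in Python rather than R, purely for illustration; the function name `normal_aic_bic` and the data names are hypothetical) that fits OLS and plugs the MLE σ̂² = SSE/n into the slide's expressions. Note that R's AIC() also counts σ̂² as a parameter, so its absolute values differ by a constant; model comparisons are unaffected.

```python
import numpy as np

def normal_aic_bic(y, X):
    """AIC and BIC for an OLS fit of y on the predictor columns of X,
    using the slide's normal-model formula with the MLE sigma^2 = SSE/n.
    An intercept column is prepended here."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sse = float(resid @ resid)
    sigma2_hat = sse / n                         # MLE of the error variance
    p = X.shape[1]                               # number of predictors
    d = p + 1                                    # the slide's p + 1 (slopes + intercept);
                                                 # R's AIC() also counts sigma^2, adding one more
    # -2 log-likelihood at the MLE: n log sigma^2 + SSE/sigma^2 + n log 2*pi
    neg2loglik = n * np.log(sigma2_hat) + sse / sigma2_hat + n * np.log(2 * np.pi)
    aic = neg2loglik + 2 * d
    bic = neg2loglik + d * np.log(n)
    return aic, bic
```

Because both criteria share the same -2 log-likelihood term, BIC − AIC = d(log n − 2), which is why the BIC penalty dominates once n > e² ≈ 7.4.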
Example

# Using the AIC function
> AIC(treemodel.N)
[1] 48.34248
> AIC(treemodel.age)
[1] 37.46445
> AIC(treemodel.hd)
[1] 31.44173
> AIC(treemodel.agehd)
[1] 33.44107
> AIC(treemodel.NHD)
[1] 20.28018
> AIC(treemodel.ageN)
[1] 36.75299
> AIC(treemodel.agehdN)
[1] 18.37409

Example

# Can calculate BIC with the same function, only with
# penalty k = log(n), where n is the sample size
> n <- nrow(treedata)
# NOTE: THESE ARE BIC VALUES
> AIC(treemodel.N, k=log(n))
[1] 51.32967
> AIC(treemodel.age, k=log(n))
[1] 40.45165
> AIC(treemodel.hd, k=log(n))
[1] 34.42892
> AIC(treemodel.agehd, k=log(n))
[1] 37.424
> AIC(treemodel.NHD, k=log(n))
[1] 24.26311
> AIC(treemodel.ageN, k=log(n))
[1] 40.73592
> AIC(treemodel.agehdN, k=log(n))
[1] 23.35275

Comparison

Model          R²adj   PRESS   CV     Cp     AIC    BIC
AGE            0.43    7.12    0.38   34.8   37.5   40.5
N              0.01    12.7    0.71   71.4   48.3   51.3
HD             0.58    5.2     0.29   21.6   31.4   34.4
AGE + N        0.47    6.82    0.42   30.3   36.8   40.7
AGE + HD       0.55    6.26    0.39   23.6   33.4   37.4
HD + N         0.77    3.01    0.18   5.5    20.3   24.3
AGE + HD + N   0.80    3.00    0.19   4.0    18.4   23.4

Stepwise selection

When I have large numbers of predictors, I may not be able to fit all possible models to find the one with minimum AIC, BIC, adjusted R², etc.
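The stepwise idea can be sketched as greedy forward selection on AIC: start from the intercept-only model, repeatedly add the single predictor that lowers AIC the most, and stop when no addition helps. A minimal sketch (Python for illustration; `forward_stepwise` and the variable names are hypothetical, and the constant c_n is dropped since it cancels in comparisons):

```python
import numpy as np

def aic_normal(y, X):
    # Normal-model AIC for an OLS fit, up to the constant c_n = n log 2*pi,
    # which cancels when comparing models on the same data
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / n
    return n * np.log(sigma2) + 2 * (X.shape[1] + 1)

def forward_stepwise(y, X, names):
    """Greedy forward selection: add the predictor giving the largest AIC
    drop; stop when no candidate improves on the current model."""
    n = len(y)
    current = np.ones((n, 1))                    # intercept-only design
    best_aic = aic_normal(y, current)
    selected = []
    remaining = list(range(X.shape[1]))
    improved = True
    while improved and remaining:
        improved = False
        # Score every one-predictor extension of the current model
        scores = [(aic_normal(y, np.column_stack([current, X[:, j]])), j)
                  for j in remaining]
        cand_aic, j = min(scores)
        if cand_aic < best_aic:
            best_aic = cand_aic
            current = np.column_stack([current, X[:, j]])
            selected.append(names[j])
            remaining.remove(j)
            improved = True
    return selected, best_aic
```

R's step() implements this idea (and backward/both directions); using k = log(n) there performs stepwise selection on BIC instead, exactly as with AIC() above.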
This note was uploaded on 01/15/2010 for the course MATH 423 taught by Professor Steele during the Spring '06 term at McGill.