Course: STAT 882, Fall 2008
School: Ohio State

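Prediction Error and its estimates for three model selection methods under controlled and random covariates
Prasenjit Kapat
Model Selection, Stat 882, AU 2006, Oct 19

Reference: Leo Breiman. Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37, no. 4 (1995), 373-384.

Talk Outline
1. Setup
2. Prediction Error
   Controlled covariates
   Random covariates
3. Review of model selection methods
   Subset selection
   Ridge regression
   Nonnegative Garrote
4. Model selection strategy
5. Estimation of Prediction Error
   Controlled covariates: Subset, Ridge, NN-Garrote
   Random covariates: Subset, Ridge, NN-Garrote

Setup
What do we know?
Data: $T_n = \{(y_i, x_i),\ i = 1, \dots, n\}$; $x_i = (x_{i1}, \dots, x_{ip})$.
Controlled covariates: $\{x_i\}_1^n$ are chosen and fixed. $\{Y_i\}_1^n$ are random following some model.
Random covariates: $\{(Y_i, X_i)\}_1^n$ are iid from some joint distribution.
Prediction Error (PE): Expected (average) squared error in predicting $y$ from $x$ for future (new) cases.

Prediction Error [controlled X]
Model: $Y_i = \mu(x_i) + \epsilon_i$ where $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
New data: $\{(y_i^{new}, x_i),\ i = 1, \dots, n\}$, $Y^{new} \sim [Y]$.
Model predictor: $\hat\mu(\cdot)$ based on $T_n$.
\[ PE(\hat\mu) = E_{[Y]} \sum_{i=1}^n \big(Y_i^{new} - \hat\mu(x_i)\big)^2 = n\sigma^2 + \sum_{i=1}^n \big(\mu(x_i) - \hat\mu(x_i)\big)^2 \]
Linear model: $\mu(x) = x\beta$, $\hat\mu(x) = x\hat\beta$,
\[ PE(\hat\mu) = n\sigma^2 + (\beta - \hat\beta)' X'X (\beta - \hat\beta). \]

Prediction Error [random X]
$(Y_i, X_i) \overset{iid}{\sim} D$; new data: $(Y, X) \sim D$.
Model: $Y = \mu(X) + \epsilon$ where $E\{\epsilon \mid X\} = 0$, $E\epsilon^2 < \infty$.
Model predictor: $\hat\mu(\cdot)$ based on $T_n$.
\[ PE(\hat\mu) = n E_D \big(Y - \hat\mu(X)\big)^2 = n E\epsilon^2 + n E_D \big(\mu(X) - \hat\mu(X)\big)^2 \]
Linear model: $\mu(x) = x\beta$, $\hat\mu(x) = x\hat\beta$, $\Gamma = \big(E X^{(k)} X^{(l)}\big)_{p \times p}$,
\[ PE(\hat\mu) = n E\epsilon^2 + (\beta - \hat\beta)' (n\Gamma) (\beta - \hat\beta). \]

Review: Subset selection
k-subset: $\zeta_k$: indices of a good k-subset model; $|\zeta_k| = k$, $k = 1, \dots, K$; $K = p$.
$X_{\zeta_k} = (x_{ij})_{n \times k}$, $i = 1, \dots, n$, $j \in \zeta_k$.
Contending models: $\hat\mu_k(\cdot)$, $k = 1, \dots, K$.
Linear models (OLS):
\[ \hat\beta(\zeta_k) = (X_{\zeta_k}' X_{\zeta_k})^{-1} X_{\zeta_k}' Y, \qquad \hat\mu_k(x_i) = \sum_{j \in \zeta_k} \hat\beta_j(\zeta_k)\, x_{ij}. \]

Review: Ridge regression
Ridge estimate:
\[ \hat\beta(\lambda) = (X'X + \lambda I)^{-1} X'Y, \qquad \hat\mu_\lambda(x_i) = \sum_{j=1}^p \hat\beta_j(\lambda)\, x_{ij}. \]
Contending models: $\hat\mu_{\lambda_1}, \dots, \hat\mu_{\lambda_K}$, for some $K$.
\[ \hat\mu_{\lambda_k}(x_i) = \sum_{j=1}^p \hat\beta_j(\lambda_k)\, x_{ij} \]

Review: Nonnegative garrote
\[ \min_{\{c_j\}} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p c_j \hat\beta_j^{LS} x_{ij} \Big)^2 \quad \text{subject to } c_j \ge 0 \text{ and } \sum_{j=1}^p c_j \le s, \text{ for some } s > 0. \]
Garrote estimate: $\hat\mu_s(x_i) = \sum_{j=1}^p c_j(s)\, \hat\beta_j^{LS} x_{ij}$.
Contending models: $\hat\mu_{s_1}, \dots, \hat\mu_{s_K}$, for some $K$.
\[ \hat\mu_{s_k}(x_i) = \sum_{j=1}^p c_j(s_k)\, \hat\beta_j^{LS} x_{ij} \quad \text{subject to } c_j(s_k) \ge 0 \text{ and } \sum_{j=1}^p c_j(s_k) \le s_k. \]

Model selection strategy
Choose $k_0$ to minimize $PE(\hat\mu_k)$.
BUT, $PE(\hat\mu_k)$ depends on the unknown $\mu$!
Need some estimate, $\widehat{PE}(\hat\mu_k)$.
The winning model: $\hat\mu_{k_0}(\cdot)$ where
\[ k_0 = \arg\min_{1 \le k \le K} \widehat{PE}(\hat\mu_k). \]
$\widehat{PE}(\hat\mu_k)$ will depend on the covariate type (controlled or random).

$\widehat{PE}$: Subset selection [controlled X]
Standard estimate: Mallows' $C_p$ type,
\[ \widehat{PE}(\hat\mu_k) = 2k\hat\sigma^2 + SSE_k = \hat\sigma^2 \big(C_p(k) + n\big) \]
where
\[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat\mu_p(x_i)\big)^2, \qquad SSE_k = \sum_{i=1}^n \big(y_i - \hat\mu_k(x_i)\big)^2. \]

$\widehat{PE}$: Subset selection, Little Bootstrap [controlled X]
Little Bootstrap: a better estimate.
Let $\tilde y_i = y_i + \tilde\epsilon_i$, where $\tilde\epsilon_i \overset{iid}{\sim} N(0, t^2\hat\sigma^2)$, for some $0 < t \le 1$.
Let $\tilde\mu_k(\cdot)$ be the predicted model based on $\tilde\beta(\tilde\zeta_k)$, using the bootstrapped data $\{(\tilde y_i, x_i),\ i = 1, \dots, n\}$ but only the $X_{\tilde\zeta_k}$ variables.
Let
\[ B_t(k) = \frac{1}{t^2} E_{\tilde Y} \Big[ \sum_{i=1}^n \tilde\epsilon_i\, \tilde\mu_k(x_i) \Big], \qquad \widehat{PE}(\hat\mu_k) = SSE_k + 2 B_t(k). \]

$\widehat{PE}$: Subset selection, Little Bootstrap in practice [controlled X]
Repeated bootstrap: for $r = 1, \dots, R$,
\[ y_i^{(r)} = y_i + \epsilon_i^{(r)}, \qquad \epsilon_i^{(r)} \overset{iid}{\sim} N(0, t^2\hat\sigma^2), \]
\[ \hat\mu_k^{(r)}(x_i) = \sum_{j \in \zeta_k^{(r)}} \hat\beta_j^{(r)}\big(\zeta_k^{(r)}\big)\, x_{ij}, \]
\[ B_t^{(r)}(k) = \frac{1}{t^2} \sum_{i=1}^n \epsilon_i^{(r)}\, \hat\mu_k^{(r)}(x_i), \qquad \hat B_t(k) = \frac{1}{R} \sum_{r=1}^R B_t^{(r)}(k). \]

$\widehat{PE}$: Ridge regression, Repeated Little Bootstrap estimate [controlled X]
\[ \widehat{PE}(\hat\mu_{\lambda_k}) = SSE(\lambda_k) + 2 \hat B_t(\lambda_k) = \sum_{i=1}^n \big(y_i - \hat\mu_{\lambda_k}(x_i)\big)^2 + \frac{2}{R t^2} \sum_{r=1}^R \sum_{i=1}^n \epsilon_i^{(r)}\, \hat\mu_{\lambda_k}^{(r)}(x_i) \]
where, for $r = 1, \dots, R$,
\[ \hat\mu_{\lambda_k}^{(r)}(x_i) = \sum_{j=1}^p \hat\beta_j^{(r)}(\lambda_k)\, x_{ij}, \qquad \hat\beta^{(r)}(\lambda_k) = [X'X + \lambda_k I]^{-1} X' Y^{(r)}, \qquad Y^{(r)} = Y + \epsilon^{(r)}. \]

$\widehat{PE}$: Nonnegative Garrote, Repeated Little Bootstrap estimate [controlled X]
\[ \widehat{PE}(\hat\mu_{s_k}) = SSE(s_k) + 2 \hat B_t(s_k) = \sum_{i=1}^n \big(y_i - \hat\mu_{s_k}(x_i)\big)^2 + \frac{2}{R t^2} \sum_{r=1}^R \sum_{i=1}^n \epsilon_i^{(r)}\, \hat\mu_{s_k}^{(r)}(x_i) \]
where, for $r = 1, \dots, R$,
\[ \hat\mu_{s_k}^{(r)}(x_i) = \sum_{j=1}^p c_j^{(r)}(s_k)\, \hat\beta_j^{LS(r)} x_{ij}, \qquad \hat\beta^{LS(r)} = (X'X)^{-1} X' Y^{(r)}, \qquad Y^{(r)} = Y + \epsilon^{(r)}. \]

Points to note
All methods: small $t$ $\Rightarrow$ bias of $\hat B_t$ is small.
Subset selection: As $t \to 0$, $\mathrm{Var}(\hat B_t) \to \infty$ (no limit). $R = 25$ and $t \in [.6, .8]$ is recommended.
Ridge: As $t \to 0$, $\mathrm{Var}(\hat B_t)$ increases but has a limit. In the limit,
\[ \widehat{PE}(\hat\mu_{\lambda_k}) = SSE(\lambda_k) + 2\hat\sigma^2\, \mathrm{tr}(H_{\lambda_k}), \qquad \min_k \widehat{PE}(\hat\mu_{\lambda_k}) \approx \min_k PE(\hat\mu_{\lambda_k}), \]
where $H_\lambda = X (X'X + \lambda I)^{-1} X'$ is the ridge hat matrix.
Garrote: Performance somewhere in between. $\mathrm{Var}(\hat B_t)$ increases faster than for ridge, but has a finite limit.
Yardstick: Behavior of $\hat B_t$ for small $t$ is a reflection of the stability of the regression procedure used.

$\widehat{PE}$: Subset selection [random X]
V-fold C.V. estimates: $T_n = L_1 \cup L_2 \cup \dots \cup L_V$ (disjoint partition). $L_\nu^c = T_n \setminus L_\nu$.
$\hat\mu_k^{(\nu)}(\cdot)$ is based on $\{(y_i, x_i) \in L_\nu^c\}$.
Prediction on $L_\nu$: predict $y_j$ by $\hat\mu_k^{(\nu)}(x_j)$ for each $(y_j, x_j) \in L_\nu$.
\[ \widehat{PE}(\hat\mu_k) = \sum_{\nu=1}^V \sum_{(y_j, x_j) \in L_\nu} \big(y_j - \hat\mu_k^{(\nu)}(x_j)\big)^2 \]
n-fold (leave-one-out) C.V. gives poor results. $V = 5$ or $10$ is good enough.

$\widehat{PE}$: Ridge, Nonnegative Garrote [random X]
Nonnegative garrote: similar to subset selection, using V-fold C.V. 10-fold works better than n-fold.
Ridge regression: Leave-one-out C.V. does a decent job.
\[ \widehat{PE}(\hat\mu_{\lambda_k}) = \sum_{i=1}^n \big(y_i - \hat\mu_{\lambda_k}^{[i]}(x_i)\big)^2 = \sum_{i=1}^n \left( \frac{y_i - \hat\mu_{\lambda_k}(x_i)}{1 - h_{ii}(\lambda_k)} \right)^2 \]
Usually $h_{ii}(\lambda) \approx \frac{1}{n} \mathrm{tr}(H_\lambda)$.
G.C.V.:
\[ \widehat{PE}(\hat\mu_{\lambda_k}) = \frac{\sum_{i=1}^n \big(y_i - \hat\mu_{\lambda_k}(x_i)\big)^2}{\big(1 - n^{-1} \mathrm{tr}(H_{\lambda_k})\big)^2} \]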
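The repeated little bootstrap estimate for ridge regression with controlled covariates can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`ridge_fit_predict`, `little_bootstrap_pe_ridge`, `limiting_pe_ridge`) are hypothetical, the model has no intercept, and the noise variance estimate `sigma2_hat` is supplied by the caller. The defaults `t=0.6` and `R=25` are borrowed from the subset-selection recommendation in the slides, not a ridge-specific rule.

```python
import numpy as np

def ridge_fit_predict(X, y, lam):
    """Fitted values X @ beta(lam), with beta(lam) = (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return X @ beta

def little_bootstrap_pe_ridge(X, y, lam, sigma2_hat, t=0.6, R=25, rng=None):
    """Repeated little bootstrap estimate SSE(lam) + 2 * B_t(lam).

    sigma2_hat is an assumed, caller-supplied estimate of the noise variance.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    sse = np.sum((y - ridge_fit_predict(X, y, lam)) ** 2)
    b_sum = 0.0
    for _ in range(R):
        # Perturb the responses: y^(r) = y + eps^(r), eps^(r) ~ N(0, t^2 sigma2_hat)
        eps = rng.normal(0.0, t * np.sqrt(sigma2_hat), size=n)
        mu_r = ridge_fit_predict(X, y + eps, lam)
        # B_t^(r) = (1 / t^2) * sum_i eps_i^(r) * mu^(r)(x_i)
        b_sum += (eps @ mu_r) / t**2
    return sse + 2.0 * b_sum / R

def limiting_pe_ridge(X, y, lam, sigma2_hat):
    """Small-t limit for ridge: SSE(lam) + 2 * sigma2_hat * tr(H_lam)."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    sse = np.sum((y - H @ y) ** 2)
    return sse + 2.0 * sigma2_hat * np.trace(H)
```

Because ridge is linear in the responses, averaging over many replicates should bring `little_bootstrap_pe_ridge` close to `limiting_pe_ridge`; for subset selection or the garrote, the fitting step inside the loop would be replaced by the corresponding procedure rerun on each perturbed response vector.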

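The leave-one-out shortcut and the G.C.V. formula for ridge regression with random covariates can be checked numerically. The sketch below assumes a ridge fit without an intercept and a fixed penalty `lam`; all function names are hypothetical. A brute-force refit is included only to confirm that the shortcut based on $h_{ii}$ reproduces true leave-one-out prediction.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """H_lam = X (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def loo_cv_pe(X, y, lam):
    """Leave-one-out PE via the shortcut sum_i ((y_i - mu(x_i)) / (1 - h_ii))^2."""
    H = ridge_hat_matrix(X, lam)
    resid = y - H @ y
    return np.sum((resid / (1.0 - np.diag(H))) ** 2)

def gcv_pe(X, y, lam):
    """G.C.V.: replace every h_ii by the average tr(H_lam) / n."""
    n = X.shape[0]
    H = ridge_hat_matrix(X, lam)
    resid = y - H @ y
    return np.sum(resid ** 2) / (1.0 - np.trace(H) / n) ** 2

def loo_cv_pe_bruteforce(X, y, lam):
    """Refit ridge n times, each time leaving out one case (for checking only)."""
    n, p = X.shape
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        total += (y[i] - X[i] @ beta) ** 2
    return total
```

The shortcut needs a single fit instead of n refits, which is what makes leave-one-out practical for ridge; G.C.V. goes one step further and needs only the trace of the hat matrix.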