ESL Chapter 3 -- Linear Methods for Regression
Trevor Hastie and Rob Tibshirani

Outline
- The simple linear regression model
- Multiple linear regression
- Model selection and shrinkage -- the state of the art

Preliminaries
- Data (x1, y1), ..., (xN, yN).
- xi is the predictor (regressor, covariate, feature, independent variable).
- yi is the response (dependent variable, outcome).
- We denote the regression function by η(x) = E(Y | x). This is the conditional expectation of Y given x.
- The linear regression model assumes a specific linear form for η(x),
      η(x) = α + βx,
  which is usually thought of as an approximation to the truth.

Linearity assumption?
η(x) = α + βx is almost always thought of as an approximation to the truth: functions in nature are rarely linear.
[Figure: f(X) plotted against X.]

Fitting by least squares
Minimize:
      (β̂0, β̂) = argmin_{β0, β} Σ_{i=1}^N (yi - β0 - β xi)^2
The solutions are
      β̂ = Σ_i (xi - x̄) yi / Σ_i (xi - x̄)^2
      β̂0 = ȳ - β̂ x̄
- ŷi = β̂0 + β̂ xi are called the fitted or predicted values.
- ri = yi - β̂0 - β̂ xi are called the residuals.

[Figure: least squares estimation for the linear regression model.]

Standard errors & confidence intervals
We often assume further that yi = β0 + β xi + εi, where E(εi) = 0 and Var(εi) = σ^2. Then
      se(β̂) = σ / [Σ_i (xi - x̄)^2]^{1/2}.
Estimate σ^2 by σ̂^2 = Σ_i (yi - ŷi)^2 / (N - 2).
Under the additional assumption of normality for the εi, a 95% confidence interval for β is
      β̂ ± 1.96 ŝe(β̂),   where ŝe(β̂) = σ̂ / [Σ_i (xi - x̄)^2]^{1/2}.

Fitted Line and Standard Errors
      μ̂(x) = β̂0 + β̂ x = ȳ + β̂ (x - x̄)
      se[μ̂(x)] = [Var(ȳ) + Var(β̂)(x - x̄)^2]^{1/2}
               = [σ^2/N + σ^2 (x - x̄)^2 / Σ_i (xi - x̄)^2]^{1/2}

[Figure: fitted regression line with pointwise standard errors, μ̂(x) ± 2 se[μ̂(x)].]

Multiple linear regression
The model is
      f(xi) = β0 + Σ_{j=1}^p xij βj.
Equivalently, in matrix notation, f = Xβ, where
- f is the N-vector of predicted values,
- X is the N × (p + 1) matrix of regressors, with ones in the first column,
- β is the (p + 1)-vector of parameters.

Estimation by least squares
      β̂ = argmin_β Σ_i (yi - β0 - Σ_{j=1}^p xij βj)^2
         = argmin_β (y - Xβ)^T (y - Xβ)
The solution satisfies the normal equations
      X^T (y - Xβ̂) = 0.
If X has full column rank,
      β̂ = (X^T X)^{-1} X^T y,   ŷ = Xβ̂.
Also Var(β̂) = (X^T X)^{-1} σ^2.

[Figure: the (p + 1)-dimensional geometry of least squares estimation -- the fitted surface in (X1, X2, Y) space.]

[Figure: the N-dimensional geometry of least squares estimation -- ŷ is the projection of y onto the space spanned by x1 and x2, with (y - ŷ) ⊥ xj, j = 1, ..., p.]
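The normal-equations solution above is what lm() computes (via a QR decomposition rather than an explicit inverse). A minimal R sketch on simulated data, with all object names illustrative, checking the two against each other:

## Least squares "by hand" via the normal equations, compared with lm().
set.seed(7)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))            # ones in the first column
colnames(X) <- c("(Intercept)", paste0("x", 1:p))
beta <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta) + rnorm(N)

beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))   # (X^T X)^{-1} X^T y
beta_hat

dat <- data.frame(y = y, X[, -1])
fit <- lm(y ~ ., data = dat)
coef(fit)                                            # agrees with beta_hat

sigma2_hat <- sum(residuals(fit)^2) / (N - p - 1)    # estimate of sigma^2
sigma2_hat * solve(crossprod(X))                     # Var(beta_hat); compare vcov(fit)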
Properties of OLS
- If X1 and X2 are mutually orthogonal matrices (X1^T X2 = 0), then the joint regression coefficients for X = (X1, X2) on y can be found from the separate regressions.
  Proof: X1^T (y - Xβ̂) = X1^T (y - X1 β̂1) = 0, and the same for β̂2.
- OLS is equivariant under non-singular linear transformations of X: if β̂ is the OLS solution for X, then β̂* = A^{-1} β̂ is the OLS solution for X* = XA, for any p × p nonsingular A.
  Proof: OLS is defined by orthogonal projection onto the column space of X, so ŷ = Xβ̂ = X*β̂*.
- Let X(p) be the submatrix of X excluding the last column xp, and let zp = xp - X(p)γ (for any γ). Then the OLS coefficient of xp is the same as the OLS coefficient of zp if we replace xp by zp.
  Proof: previous point.

Now let γ̂ be the OLS coefficient from regressing xp on X(p), so that zp = xp - X(p)γ̂ is the residual obtained by adjusting xp for all the other variables in the model.
- X(p)^T zp = 0, so the regression of y on (X(p), zp) decouples.
- The multiple regression coefficient of xp is the same as the univariate coefficient in the regression of y on zp -- i.e., xp adjusted for the rest:
      β̂p = <zp, y> / ||zp||^2,   Var(β̂p) = σ^2 / ||zp||^2.
- These statements are true for all j, not just the last term p.

The course website has some additional, more technical notes (linearR.pdf) on multiple linear regression, with an emphasis on computations.
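The "adjusted for the rest" interpretation above is easy to verify numerically. A minimal R sketch with simulated, correlated predictors (names are illustrative):

## The multiple regression coefficient of x3 equals the univariate coefficient
## of y on z3, where z3 is x3 adjusted for (regressed on) the other predictors.
set.seed(8)
N <- 100
x1 <- rnorm(N)
x2 <- 0.5 * x1 + rnorm(N)                    # correlated predictors
x3 <- 0.3 * x1 - 0.4 * x2 + rnorm(N)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(N)

full <- lm(y ~ x1 + x2 + x3)
coef(full)["x3"]                             # multiple regression coefficient

z3 <- residuals(lm(x3 ~ x1 + x2))            # x3 adjusted for the rest
sum(z3 * y) / sum(z3^2)                      # <z3, y> / ||z3||^2: the same number

## and the standard error formula Var(beta_p_hat) = sigma^2 / ||zp||^2
summary(full)$sigma / sqrt(sum(z3^2))
summary(full)$coefficients["x3", "Std. Error"]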
Example: Prostate cancer
- 97 observations on 9 variables (Stamey et al., 1989).
- Goal: predict log(PSA) from 8 clinical measurements/demographics on men who were about to have their prostate removed.
- The figure below shows a scatterplot matrix of all the data, created with the R expression pairs(lpsa ~ ., data=lprostate).
- Notice that several variables are correlated, and that svi is binary.

[Figure: scatterplot matrix of the Prostate Cancer data (lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). The top variable is the response lpsa, and the top row shows its relationship with each of the others.]

> lmfit = lm(lpsa ~ ., data = lprostate)
> summary(lmfit)

Call:
lm(formula = lpsa ~ ., data = lprostate)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.180899   1.320601   0.137  0.89136
lcavol       0.564355   0.087831   6.425 6.54e-09
lweight      0.622081   0.200892   3.097  0.00263
age         -0.021248   0.011084  -1.917  0.05848
lbph         0.096676   0.057915   1.669  0.09862
svi          0.761652   0.241173   3.158  0.00218
lcp         -0.106055   0.089866  -1.180  0.24112
gleason      0.049287   0.155340   0.317  0.75178
pgg45        0.004458   0.004365   1.021  0.30999

Residual standard error: 0.6995 on 88 degrees of freedom
Multiple R-squared: 0.6634,  Adjusted R-squared: 0.6328
F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16
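A minimal sketch for reproducing the fit above. It assumes a data frame lprostate holding lpsa and the eight predictors; the commented lines show one way to build it from the prostate data shipped with the (archived) ElemStatLearn R package, which is an assumption about your setup:

## Sketch, assuming 'lprostate' contains lpsa and the eight predictors.
# library(ElemStatLearn); data(prostate)
# lprostate <- prostate[, setdiff(names(prostate), "train")]

pairs(lprostate)                        # scatterplot matrix, as in the figure
lmfit <- lm(lpsa ~ ., data = lprostate)
summary(lmfit)                          # coefficient table shown above
confint(lmfit)                          # 95% confidence intervals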
The Bias-variance tradeoff
A good measure of the quality of an estimator f̂(x) is the mean squared error. Let f0(x) be the true value of f(x) at the point x. Then
      Mse[f̂(x)] = E[f̂(x) - f0(x)]^2.
This can be written as
      Mse[f̂(x)] = Var[f̂(x)] + [E f̂(x) - f0(x)]^2,
which is variance plus squared bias.
Typically, when bias is low, variance is high, and vice versa. Choosing estimators often involves a tradeoff between bias and variance.

- If the linear model is correct for a given problem, then the least squares prediction f̂ is unbiased, and has the lowest variance among all unbiased estimators that are linear functions of y.
- But there can be (and often are) biased estimators with smaller Mse.
- Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile.
- Examples of regularization: subset selection (forward, backward, all subsets); ridge regression; the lasso.
- In reality models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth.

Model Selection
Often we prefer a restricted estimate because of its reduced estimation variance.
[Figure: the truth and its closest fit in population within the MODEL SPACE; for a given realization, the closest fit and the shrunken fit in the RESTRICTED MODEL SPACE, with the model bias, estimation bias, and estimation variance indicated.]

Analysis of time series data
Two approaches:
- Frequency domain (Fourier) -- see the discussion of wavelet smoothing.
- Time domain. The main tool is the auto-regressive (AR) model of order k:
      y_t = β1 y_{t-1} + β2 y_{t-2} + ... + βk y_{t-k} + ε_t.
Fit by linear least squares regression on the lagged data:
      y_t     = β1 y_{t-1} + β2 y_{t-2} + ... + βk y_{t-k}
      y_{t-1} = β1 y_{t-2} + β2 y_{t-3} + ... + βk y_{t-k-1}
      ...
      y_{k+1} = β1 y_k     + β2 y_{k-1} + ... + βk y_1

Example: NYSE data
- Time series of 6200 daily measurements, 1962-1987.
- volume -- log(trading volume) -- the outcome.
- volume.Lj -- log(trading volume) lagged j days, j = 1, 2, 3.
- retd.Lj -- log(Dow Jones) lagged j days, j = 1, 2, 3.
- aretd.Lj -- |log(Dow Jones)| lagged j days, j = 1, 2, 3.
- vola.Lj -- volatility lagged j days, j = 1, 2, 3.
- Source: Weigend and LeBaron (1994).
We randomly selected a training set of size 50 and a test set of size 500 from the first 600 observations.

[Figure: scatterplot matrix of the NYSE variables -- volume, volume.L1-L3, retd.L1-L3, aretd.L1-L3, vola.L1-L3.]

OLS Fit
Results of the ordinary least squares analysis of the NYSE data:

Term        Coefficient  Std. Error  t-Statistic
Intercept         -0.02        0.04        -0.64
volume.L1          0.09        0.05         1.80
volume.L2          0.06        0.05         1.19
volume.L3          0.04        0.05         0.81
retd.L1            0.00        0.04         0.11
retd.L2           -0.02        0.05        -0.46
retd.L3           -0.03        0.04        -0.65
aretd.L1           0.08        0.07         1.12
aretd.L2          -0.02        0.05        -0.45
aretd.L3           0.03        0.04         0.77
vola.L1            0.20        0.30         0.66
vola.L2           -0.50        0.40        -1.25
vola.L3            0.27        0.34         0.78
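A fit like the one above really is just least squares regression on lagged copies of the series. A minimal R sketch on a simulated series (the order and coefficients are made up for illustration):

## Fit an AR(3) model by ordinary least squares on lagged data.
set.seed(1)
y <- arima.sim(model = list(ar = c(0.5, 0.2, 0.1)), n = 600)  # simulated series

k <- 3
Z <- embed(y, k + 1)                     # each row: (y_t, y_{t-1}, ..., y_{t-k})
lagged <- data.frame(yt = Z[, 1], Z[, -1])
names(lagged)[-1] <- paste0("lag", 1:k)

arfit <- lm(yt ~ lag1 + lag2 + lag3, data = lagged)
coef(arfit)                              # compare with ar(y, order.max = 3, aic = FALSE)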
Variable subset selection
We retain only a subset of the coefficients and set the coefficients of the rest to zero. There are different strategies:
- All subsets regression finds, for each s ∈ {0, 1, 2, ..., p}, the subset of size s that gives the smallest residual sum of squares. The question of how to choose s involves the tradeoff between bias and variance: we can use cross-validation (see below).
- Rather than search through all possible subsets, we can seek a good path through them. Forward stepwise selection starts with the intercept and then sequentially adds into the model the variable that most improves the fit. The improvement in fit is usually based on the F ratio
      F = [RSS(β̂_old) - RSS(β̂_new)] / [RSS(β̂_new) / (N - s)].
- Backward stepwise selection starts with the full OLS model, and sequentially deletes variables.
- There are also hybrid stepwise selection strategies which add in the best variable and delete the least important variable, in a sequential manner.
Each procedure has one or more tuning parameters:
- subset size,
- P-values for adding or dropping terms.

Model Assessment
Objectives:
1. Choose a value of a tuning parameter for a technique.
2. Estimate the prediction performance of a given model.
For both of these purposes, the best approach is to run the procedure on an independent test set, if one is available. If possible one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2). Often there is insufficient data to create a separate validation or test set. In this instance cross-validation is useful.

K-Fold Cross-Validation
The primary method for estimating a tuning parameter λ (such as subset size). Divide the data into K roughly equal parts (typically K = 5 or 10), for example:
      1: Train   2: Train   3: Test   4: Train   5: Train
For each k = 1, 2, ..., K, fit the model with parameter λ to the other K - 1 parts, giving β̂^{-k}(λ), and compute its error in predicting the kth part:
      E_k(λ) = Σ_{i in kth part} (yi - xi^T β̂^{-k}(λ))^2.
This gives the cross-validation error
      CV(λ) = (1/K) Σ_{k=1}^K E_k(λ).
Do this for many values of λ and choose the value of λ that makes CV(λ) smallest. (A short R sketch of this loop appears below.)

In our variable subsets example, λ is the subset size:
- β̂^{-k}(λ) are the coefficients for the best subset of size λ, found from the training set that leaves out the kth part of the data.
- E_k(λ) is the estimated test error for this best subset.
- From the K cross-validation training sets, the K test error estimates are averaged to give CV(λ) = (1/K) Σ_{k=1}^K E_k(λ).
- Note that different subsets of size λ will (probably) be found from each of the K cross-validation training sets. This doesn't matter: the focus is on subset size, not the actual subset.

[Figure: all-subsets cross-validation curve, CV error against subset size, for the NYSE data. The focus is on subset size -- not which variables are in the model. Variance increases slowly -- typically σ^2/N per variable.]

[Figure: all possible subset models for the prostate cancer example. At each subset size is shown the residual sum of squares for each model of that size.]
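The K-fold loop takes only a few lines of R. In the sketch below the "model with tuning parameter lambda" is, for simplicity, the linear model on the first lambda predictors; a genuine best-subset search would replace the placeholder fit_model(). Data and names are illustrative:

## Minimal K-fold cross-validation sketch.
set.seed(2)
N <- 100; p <- 10
X <- matrix(rnorm(N * p), N, p)
colnames(X) <- paste0("x", 1:p)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(N)
dat <- data.frame(y, X)

K <- 5
fold <- sample(rep(1:K, length.out = N))         # random fold labels 1..K

fit_model <- function(train, lambda)             # placeholder model fitter
  lm(y ~ ., data = train[, c("y", paste0("x", 1:lambda))])

cv_error <- sapply(1:p, function(lambda) {
  Ek <- sapply(1:K, function(k) {
    fit  <- fit_model(dat[fold != k, ], lambda)
    pred <- predict(fit, newdata = dat[fold == k, ])
    mean((dat$y[fold == k] - pred)^2)            # E_k(lambda)
  })
  mean(Ek)                                       # CV(lambda)
})
which.min(cv_error)                              # chosen complexity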
The Bootstrap approach
- The bootstrap works by sampling N times with replacement from the training set to form a "bootstrap" data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set.
- This process is repeated many times and the results are averaged.
- The bootstrap is most useful for estimating standard errors of predictions.
- Modified versions of the bootstrap can also be used to estimate prediction error, and sometimes produce better estimates than cross-validation. E.g., for each bootstrap sample, estimate errors using only the observations excluded from that bootstrap sample.

Cross-validation revisited
Consider a simple classifier for wide data:
1. Starting with 5000 predictors, find the 200 predictors having the largest correlation with the class labels.
2. Carry out nearest-centroid classification using only these 200 genes.
How do we estimate the test set performance of this classifier?
- Wrong: apply cross-validation in step 2.
- Right: apply cross-validation to steps 1 and 2.
It is easy to simulate realistic data with the class labels independent of the features -- so that the true test error is 50% -- but the wrong CV error estimate is zero! We have seen this error made in 4 high-profile microarray papers in the last couple of years. See Ambroise and McLachlan, PNAS 2002. A little cheating goes a long way!

Validation and test set issues
- It is important to have both cross-validation and test sets, since we often run CV many times, fiddling with different parameters. This can bias the CV results.
- A separate test set provides a convincing, independent assessment of a model's performance.
- Test-set results might still overestimate actual performance, as a real future test set may differ in many ways from today's data.

Does cross-validation really work?
- Consider a scenario with N = 20 samples in two equal-sized classes, and p = 500 quantitative features that are independent of the class labels. The true error rate of any classifier is 50%.
- Consider a simple univariate classifier -- a single split that minimizes the misclassification error (a "stump").
- Fitting to the entire training set, we will find a feature that splits the data very well.
- If we do 5-fold CV, this same feature should split any 4/5ths and 1/5th of the data well too, and hence its CV error will be small (much less than 50%).
- Thus CV does not give an accurate estimate of error.
Is this argument correct? (Details in Section 7.10.3.)

[Figure: for this simulation -- class labels plotted against predictor 436 (blue), the error of the best stump on the full training set and on 4/5 of the data, its error on the held-out 1/5, and a histogram of the CV errors.]
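The question above is easy to investigate by simulation (the resolution is in Section 7.10.3). The sketch below uses a brute-force stump fitter; all function and variable names are illustrative:

## Simulation: N = 20, p = 500 features independent of the labels; true error 50%.
fit_stump <- function(x, y) {
  # brute-force best single split: threshold and orientation with least error
  best <- list(err = Inf)
  for (thr in c(-Inf, sort(unique(x)))) for (flip in c(FALSE, TRUE)) {
    pred <- as.numeric(x > thr); if (flip) pred <- 1 - pred
    e <- mean(pred != y)
    if (e < best$err) best <- list(err = e, thr = thr, flip = flip)
  }
  best
}
predict_stump <- function(fit, x) {
  pred <- as.numeric(x > fit$thr)
  if (fit$flip) 1 - pred else pred
}

set.seed(3)
N <- 20; p <- 500
X <- matrix(rnorm(N * p), N, p)
y <- rep(0:1, each = N / 2)

## best stump chosen on the full training set: its training error looks very good
min(sapply(1:p, function(j) fit_stump(X[, j], y)$err))

## 5-fold CV that repeats the feature search inside each fold
K <- 5
fold <- sample(rep(1:K, length.out = N))
cv_err <- sapply(1:K, function(k) {
  tr <- fold != k
  errs <- sapply(1:p, function(j) fit_stump(X[tr, j], y[tr])$err)
  j_best <- which.min(errs)
  best <- fit_stump(X[tr, j_best], y[tr])
  mean(predict_stump(best, X[!tr, j_best]) != y[!tr])
})
mean(cv_err)                              # compare with the true error rate of 0.50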
NYSE example continued
The table shows the coefficients from a number of different selection and shrinkage methods, applied to the NYSE data.

Term         OLS     VSS   Ridge   Lasso     PCR     PLS
Intercept  -0.02    0.00   -0.01   -0.02   -0.02   -0.04
volume.L1   0.09    0.16    0.06    0.09    0.05    0.06
volume.L2   0.06    0.00    0.04    0.02    0.06    0.06
volume.L3   0.04    0.00    0.04    0.03    0.04    0.05
retd.L1     0.00    0.00    0.01    0.01    0.02    0.01
retd.L2    -0.02    0.00   -0.01    0.00   -0.01   -0.02
retd.L3    -0.03    0.00   -0.01    0.00   -0.02    0.00
aretd.L1    0.08    0.00    0.03    0.02   -0.02    0.00
aretd.L2   -0.02   -0.05   -0.03   -0.03   -0.01   -0.01
aretd.L3    0.03    0.00    0.01    0.00    0.02    0.01
vola.L1     0.20    0.00    0.00    0.00   -0.01   -0.01
vola.L2    -0.50    0.00   -0.01    0.00   -0.01   -0.01
vola.L3     0.27    0.00   -0.01    0.00   -0.01   -0.01
Test err    0.050   0.041   0.042   0.039   0.045   0.044
SE          0.007   0.005   0.005   0.005   0.006   0.006

CV was used on the 50 training observations (except for OLS). Test error for the constant model: 0.061.

[Figure: cross-validation results for the NYSE data -- estimated prediction error curves for the various selection and shrinkage methods (all subsets vs. subset size, ridge regression vs. degrees of freedom, lasso vs. s, PC regression and partial least squares vs. number of directions). The arrow indicates the estimated minimizing value of the complexity parameter. Training sample size = 50.]

[Figure: cross-validation results for the Prostate Cancer data -- estimated prediction error curves for the same methods (all subsets vs. subset size, ridge regression vs. degrees of freedom, lasso vs. shrinkage factor s, principal components regression and partial least squares vs. number of directions). The arrow indicates the estimated minimizing value of the complexity parameter. Training sample size = 67.]

Ridge Regression
The ridge estimator is defined by
      β̂_ridge = argmin_β (y - Xβ)^T (y - Xβ) + λ β^T β.
Equivalently,
      β̂_ridge = argmin_β (y - Xβ)^T (y - Xβ)   subject to Σ_j βj^2 ≤ s.
The parameter λ > 0 penalizes each βj in proportion to its squared size βj^2. The solution is
      β̂ = (X^T X + λI)^{-1} X^T y,
where I is the identity matrix. This is a biased estimator that for some value of λ > 0 may have smaller mean squared error than the least squares estimator.
Note that λ = 0 gives the least squares estimator; if λ → ∞, then β̂ → 0.

[Figure: ridge coefficient profiles for the prostate cancer data, plotted against the effective degrees of freedom df(λ), for lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp.]

The Lasso
The lasso is a shrinkage method like ridge, but acts in a nonlinear manner on the outcome y. The lasso is defined by
      β̂_lasso = argmin_β (y - Xβ)^T (y - Xβ)   subject to Σ_j |βj| ≤ t.
- Notice that the ridge penalty Σ_j βj^2 is replaced by Σ_j |βj|.
- This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them.
- Because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection.
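Returning to ridge for a moment: because the ridge estimator defined above has a closed form, it can be computed in one line. A minimal sketch on simulated, standardized inputs (the λ grid is arbitrary; in practice λ is chosen by cross-validation, and the intercept is left unpenalized, which centering handles here):

## Ridge regression via the closed form (X^T X + lambda I)^{-1} X^T y.
set.seed(4)
N <- 60; p <- 8
X <- scale(matrix(rnorm(N * p), N, p))            # standardized predictors
y <- drop(X %*% c(3, 1.5, 0, 0, 2, 0, 0, 0)) + rnorm(N)
y <- y - mean(y)                                  # centered response (no intercept)

ridge <- function(X, y, lambda)
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))

ridge(X, y, lambda = 0)                           # = least squares coefficients
ridge(X, y, lambda = 10)                          # shrunk towards zero

## coefficient paths over a grid of lambda values
lambdas <- exp(seq(log(0.01), log(1000), length.out = 50))
path <- sapply(lambdas, function(l) ridge(X, y, l))
matplot(log(lambdas), t(path), type = "l",
        xlab = "log(lambda)", ylab = "coefficient")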
- The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using say cross-validation.
- Ridge vs lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero.
[Figure: transformed coefficient as a function of the OLS coefficient, for the lasso (soft thresholding) and ridge (proportional shrinkage).]

Lasso in Action
[Figure: profiles of the coefficients for the NYSE data as the lasso shrinkage is varied, with s = t/t0 ∈ [0, 1], where t0 = Σ_j |β̂_OLS,j|.]

[Figure: profiles of the lasso coefficients for the prostate cancer data as a function of the shrinkage factor s, for lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp.]

[Figure: estimation picture for the lasso (left) and ridge regression (right) -- the contours of the least squares error function centered at β̂, together with the constraint regions |β1| + |β2| ≤ t and β1^2 + β2^2 ≤ t^2.]

A family of shrinkage estimators
Consider the criterion
      β̃ = argmin_β Σ_{i=1}^N (yi - xi^T β)^2   subject to Σ_j |βj|^q ≤ s,
for q ≥ 0. The contours of constant value of Σ_j |βj|^q are shown for the case of two inputs.
[Figure: contours of constant value of Σ_j |βj|^q for q = 4, 2, 1, 0.5 and 0.1.]
Thinking of |βj|^q as the log-prior density for βj, these are also the equi-contours of the prior.

Use of derived input directions -- Principal Component Regression
- We choose a set of linear combinations of the Xj's, and then regress the outcome on these linear combinations.
- The largest principal component Z1 is the standardized linear combination of the Xj's with largest variance. Subsequent principal components Z2, Z3, ... maximize variance subject to being uncorrelated with the preceding components.
- If S is the sample covariance matrix of x1, ..., xp, then the eigenvector equations
      S vj = dj^2 vj,   j = 1, ..., p
  define the principal components of S.

[Figure: a two-dimensional point cloud in (X1, X2), with the largest and smallest principal component directions.]

PCA regression continued
- The vj are the (ordered) principal component directions; the derived principal component variables are given by zj = X vj. The variance of zj is dj^2, and determines the ordering.
- Principal components regression then regresses y on z1, z2, ..., zJ for some J ≤ p. Since the zj's are orthogonal, this regression is just a sum of univariate regressions:
      ŷ_pcr = ȳ + Σ_{j=1}^J θ̂j zj,
  where θ̂j is the univariate regression coefficient of y on zj.
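Principal components regression follows directly from the SVD of the centered input matrix (discussed next). A minimal sketch on simulated data (J is fixed arbitrarily here; in practice it would be chosen by cross-validation):

## Principal components regression via the SVD of the centered inputs.
set.seed(5)
N <- 80; p <- 6
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(N)

Xc <- scale(X, center = TRUE, scale = FALSE)      # centered model matrix
sv <- svd(Xc)                                     # Xc = U D V^T
Z  <- Xc %*% sv$v                                 # principal component variables z_j

J <- 3                                            # number of components retained
theta <- sapply(1:J, function(j) sum(Z[, j] * y) / sum(Z[, j]^2))
yhat_pcr <- drop(mean(y) + Z[, 1:J] %*% theta)    # ybar + sum_j theta_j z_j

## with J = p, PCR recovers the ordinary least squares fit
theta_all <- sapply(1:p, function(j) sum(Z[, j] * y) / sum(Z[, j]^2))
all.equal(drop(mean(y) + Z %*% theta_all), unname(fitted(lm(y ~ X))))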
Principal components and the SVD
Let X̃ = UDV^T be the SVD of the centered version X̃ of the model matrix X. This SVD provides the principal components of X.
Proof:
      S = (1/N) X̃^T X̃ = (1/N) V D^2 V^T.
Digression: there are some notes on the coursework materials webpage on the SVD and PCA: SVD.pdf.

Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix. Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal components regression discards the p - J smallest-eigenvalue components.
[Figure: shrinkage factor against component index for ridge regression and PCR.]

Ridge fit vector:
      Xβ̂ = X (X^T X + λI)^{-1} X^T y.
With the SVD X = UDV^T,
      Xβ̂ = UDV^T (V D^2 V^T + λI)^{-1} V D U^T y
          = U D (D^2 + λI)^{-1} D U^T y
          = Σ_{j=1}^p uj [dj^2 / (dj^2 + λ)] <uj, y>.
Here uj is the jth (standardized) principal component (zj = X vj = uj dj), so <uj, y> is the regression coefficient of y on uj. If λ = 0, this is the OLS fit -- a projection onto the column space of U; with λ > 0, the fit is shrunk, increasingly so for the smaller principal components.

Partial least squares
- This technique also constructs a set of linear combinations of the xj's for regression, but unlike principal components regression, it uses y (in addition to X) for this construction.
- We assume that y is centered and begin by computing the univariate regression coefficients φ̂j of y on each xj. From these we construct the derived input z1 = Σ_j φ̂j xj, which is the first partial-least-squares direction.
- The outcome y is regressed on z1, giving coefficient θ̂1, and then we orthogonalize y, x1, ..., xp with respect to z1: r1 = y - θ̂1 z1, and xj ← xj - [<z1, xj> / <z1, z1>] z1.
- We repeat this process, until J directions have been obtained.

In this manner, partial least squares produces a sequence of derived inputs or directions z1, z2, ..., zJ.
- As with principal components regression, if we continue on to construct J = p new directions we get back the ordinary least squares estimates; use of J < p directions produces a reduced regression.
- Notice that in the construction of each zj, the inputs are weighted by the strength of their univariate effect on y.
- It can also be shown that the sequence z1, z2, ..., zp represents the conjugate gradient sequence for computing the ordinary least squares solutions.

Ridge vs PCR vs PLS vs Lasso
- Frank and Friedman (1993) show that ridge and PCR outperform PLS in prediction, and that they are simpler to understand.
- The lasso outperforms ridge when there are a moderate number of sizable effects, rather than many small effects. It also produces more interpretable models.
- These are all topics for ongoing research, and have become extremely relevant with massively wide datasets.
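The partial least squares construction translates almost line for line into code. A minimal sketch that follows the description above (simulated data; an illustrative implementation, not a packaged one):

## Partial least squares: build directions z_1, ..., z_J, regress y on each,
## and orthogonalize the working inputs after every step.
pls_fit <- function(X, y, J) {
  Xw  <- X                            # working (successively orthogonalized) inputs
  fit <- numeric(length(y))           # accumulated fitted values (y assumed centered)
  for (m in 1:J) {
    phi   <- drop(crossprod(Xw, y))   # weights: inner products of each input with y
    z     <- drop(Xw %*% phi)         # m-th partial least squares direction
    theta <- sum(z * y) / sum(z^2)    # regress y on z
    fit   <- fit + theta * z
    Xw    <- Xw - outer(z, drop(crossprod(z, Xw)) / sum(z^2))  # orthogonalize w.r.t. z
  }
  fit
}

set.seed(6)
N <- 80; p <- 6
X <- scale(matrix(rnorm(N * p), N, p))            # standardized inputs
y <- drop(X %*% c(2, 1, 0, 0, -1, 0)) + rnorm(N)
y <- y - mean(y)                                  # centered response, as assumed

yhat_pls2 <- pls_fit(X, y, J = 2)                 # reduced regression on 2 directions

## with J = p directions we recover the ordinary least squares fit
all.equal(pls_fit(X, y, J = p), unname(fitted(lm(y ~ X))))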