A Solution Manual and Notes for the Text: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani

John L. Weatherwax*

December 15, 2009

* [email protected]

Chapter 2 (Overview of Supervised Learning)

Notes on the Text

Statistical Decision Theory

Our expected prediction error (EPE) under squared error loss, assuming a linear model for $y$, i.e. $y = f(x) \approx x^T \beta$, is given by

$$
\mathrm{EPE}(\beta) = \int (y - x^T \beta)^2 \Pr(dx, dy) . \tag{1}
$$

Considering this as a function of the components $\beta_i$ of $\beta$, we minimize it by taking the derivative with respect to $\beta_i$, setting the resulting expression equal to zero, and solving for $\beta_i$. Taking the vector derivative with respect to the vector $\beta$ we obtain

$$
\frac{\partial \mathrm{EPE}}{\partial \beta} = \int 2 (y - x^T \beta)(-1) x \Pr(dx, dy) = -2 \int (y - x^T \beta) x \Pr(dx, dy) . \tag{2}
$$

This expression contains two parts. The first has the integrand $yx$ and the second has the integrand $x^T \beta x$. In terms of its components, the latter expression is

$$
x^T \beta x = (x_0 \beta_0 + x_1 \beta_1 + x_2 \beta_2 + \cdots + x_p \beta_p)
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}
= \begin{bmatrix}
x_0 x_0 \beta_0 + x_0 x_1 \beta_1 + x_0 x_2 \beta_2 + \cdots + x_0 x_p \beta_p \\
x_1 x_0 \beta_0 + x_1 x_1 \beta_1 + x_1 x_2 \beta_2 + \cdots + x_1 x_p \beta_p \\
\vdots \\
x_p x_0 \beta_0 + x_p x_1 \beta_1 + x_p x_2 \beta_2 + \cdots + x_p x_p \beta_p
\end{bmatrix}
= x x^T \beta .
$$

Recognizing that $x^T \beta x$ can be written as $x x^T \beta$, we see that setting $\frac{\partial \mathrm{EPE}}{\partial \beta} = 0$ gives

$$
E[yx] - E[x x^T \beta] = 0 . \tag{3}
$$

Since $\beta$ is a constant, it can be taken out of the expectation to give

$$
\beta = E[x x^T]^{-1} E[yx] , \tag{4}
$$

which gives a very simple derivation of equation 2.16 in the book. Note that since $y \in \mathbb{R}$ and $x \in \mathbb{R}^p$, $x$ and $y$ commute, i.e. $xy = yx$.

Exercise Solutions

Ex. 2.1 (target coding)

Suppose each of our samples from $K$ classes is coded as a target vector $t_k$ which has a one in the $k$th spot and zeros elsewhere. Then one way of developing a classifier is to regress the target vectors $t_k$ on the independent variables.
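A minimal sketch of this target-vector regression, assuming we replace the expectations in equation (4) with sums over a training sample, which gives the usual least-squares fit. The toy data, the restriction to an intercept plus a single feature (so the matrix to invert is only 2 by 2), and all function names are illustrative assumptions rather than anything taken from the text.

```python
# Sketch: least-squares regression onto one-hot class targets.
# Data, the intercept-plus-one-feature model, and all names are illustrative assumptions.

def one_hot(k, num_classes):
    """Target vector t_k: a one in position k (0-based here), zeros elsewhere."""
    return [1.0 if i == k else 0.0 for i in range(num_classes)]

def fit_indicator_regression(xs, labels, num_classes):
    """Solve the sample normal equations (sum x x^T) B = sum x t^T
    for inputs x = (1, x_1), i.e. an intercept plus one feature."""
    # Entries of the symmetric 2x2 matrix sum_n x_n x_n^T.
    s00 = float(len(xs))
    s01 = sum(xs)
    s11 = sum(x * x for x in xs)
    det = s00 * s11 - s01 * s01
    if det == 0.0:
        raise ValueError("inputs must contain at least two distinct values")
    # Rows of the 2 x K right-hand side sum_n x_n t_n^T.
    rhs0 = [0.0] * num_classes
    rhs1 = [0.0] * num_classes
    for x, k in zip(xs, labels):
        t = one_hot(k, num_classes)
        for i in range(num_classes):
            rhs0[i] += t[i]
            rhs1[i] += x * t[i]
    # Apply the explicit inverse of the 2x2 matrix, one class column at a time.
    intercepts = [(s11 * rhs0[i] - s01 * rhs1[i]) / det for i in range(num_classes)]
    slopes = [(-s01 * rhs0[i] + s00 * rhs1[i]) / det for i in range(num_classes)]
    return intercepts, slopes

def predict(x, intercepts, slopes):
    """Regression output y_hat (one component per class) for a single input x."""
    return [a + b * x for a, b in zip(intercepts, slopes)]

# Toy training set: three classes spread along one feature axis.
xs = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

intercepts, slopes = fit_indicator_regression(xs, labels, 3)
y_hat = predict(0.2, intercepts, slopes)
predicted_class = max(range(3), key=lambda i: y_hat[i])
print(y_hat, predicted_class)
```

At the fitted coefficients the residuals are orthogonal to the inputs, which is the sample counterpart of equation (3).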
Our classification procedure then becomes the following. Given the measurement vector $X$, predict a target vector $\hat{y}$ via linear regression and select the class $k$ corresponding to the component of $\hat{y}$ which has the largest value. That is, $k = \operatorname{argmax}_i (\hat{y}_i)$. Now consider the expression $\operatorname{argmin}_k \|\hat{y} - t_k\|$, which finds the index of the target vector that is closest to the produced regression output $\hat{y}$. By expanding the quadratic we find that

$$
\begin{aligned}
\operatorname{argmin}_k \|\hat{y} - t_k\|
&= \operatorname{argmin}_k \|\hat{y} - t_k\|^2 \\
&= \operatorname{argmin}_k \sum_{i=1}^{K} \left( \hat{y}_i - (t_k)_i \right)^2 \\
&= \operatorname{argmin}_k \sum_{i=1}^{K} \left( (\hat{y}_i)^2 - 2 \hat{y}_i (t_k)_i + (t_k)_i^2 \right) \\
&= \operatorname{argmin}_k \sum_{i=1}^{K} \left( -2 \hat{y}_i (t_k)_i + (t_k)_i^2 \right) ,
\end{aligned}
$$

since the sum $\sum_{i=1}^{K} \hat{y}_i^2$ is the same for all classes $k$ and we have denoted $(t_k)_i$ to be the $i$th component of the...
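The two rules discussed here, choosing the largest component of the regression output and choosing the nearest target vector, can be compared numerically. The following is a minimal sketch of such a check, assuming one-hot targets with 0-based class indices; the example output vectors are made up for illustration and the function names are hypothetical.

```python
# Sketch: compare the largest-component rule with the nearest-target rule.
# The example regression outputs below are invented for illustration.

def one_hot(k, num_classes):
    """Target vector t_k: a one in position k (0-based here), zeros elsewhere."""
    return [1.0 if i == k else 0.0 for i in range(num_classes)]

def largest_component(y_hat):
    """argmax_i y_hat_i."""
    return max(range(len(y_hat)), key=lambda i: y_hat[i])

def nearest_target(y_hat):
    """argmin_k ||y_hat - t_k||^2, computed directly from the definition."""
    num_classes = len(y_hat)

    def squared_distance(k):
        t = one_hot(k, num_classes)
        return sum((y - ti) ** 2 for y, ti in zip(y_hat, t))

    return min(range(num_classes), key=squared_distance)

# Outputs need not lie in [0, 1]; none of these has tied components.
examples = [
    [0.87, 0.33, -0.20],
    [0.10, 0.50, 0.40],
    [-0.30, 0.20, 1.10],
    [2.00, -1.00, 0.00],
]

agree = all(nearest_target(y) == largest_component(y) for y in examples)
print(agree)
```

Examples with tied components are deliberately avoided, since the winner would then depend on tie-breaking order rather than on the two rules themselves.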
This note was uploaded on 02/29/2012 for the course STATS 315A taught by Professor R. Tibshirani during the Winter '10 term at Stanford.
