CS 229, Public Course
Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression

In contrast to ordinary least squares, which has the cost function

    J(\theta) = \frac{1}{2} \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})^2,

we can also add a term that penalizes large weights in \theta. In ridge regression, our least squares cost is regularized by adding a term \lambda \|\theta\|^2, where \lambda > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then

    J(\theta) = \frac{1}{2} \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})^2 + \frac{\lambda}{2} \|\theta\|^2.

(a) Use the vector notation described in class to find a closed-form expression for the value of \theta which minimizes the ridge regression cost function.

Answer: Using the design matrix notation, we can rewrite J(\theta) as

    J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) + \frac{\lambda}{2} \theta^T \theta.

Then the gradient is

    \nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y} + \lambda \theta.

Setting the gradient to 0 gives us

    0 = X^T X \theta - X^T \vec{y} + \lambda \theta
    \theta = (X^T X + \lambda I)^{-1} X^T \vec{y}.

(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite-dimensional) space. Using a feature mapping \phi, the ridge regression cost function becomes

    J(\theta) = \frac{1}{2} \sum_{i=1}^m (\theta^T \phi(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \|\theta\|^2.

Making a prediction on a new input x_{new} would now be done by computing \theta^T \phi(x_{new}). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing \phi(x_{new}). You may assume that the parameter vector \theta can be expressed as a linear combination of the input feature vectors; i.e., \theta = \sum_{i=1}^m \alpha_i \phi(x^{(i)}) for some set of parameters \alpha_i.

[Hint: You may find the following identity useful:

    (\lambda I + BA)^{-1} B = B (\lambda I + AB)^{-1}.

If you want, you can try to prove this as well, though this is not required for the problem.]

Answer: Let \Phi be the design matrix associated with the feature vectors \phi(x^{(i)}). Then from part (a),

    \theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{y}
           = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} \vec{y}
           = \Phi^T (K + \lambda I)^{-1} \vec{y},

where K is the kernel matrix for the training set (since (\Phi \Phi^T)_{i,j} = \phi(x^{(i)})^T \phi(x^{(j)}) = K_{ij}). The second equality follows from the identity in the hint with B = \Phi^T and A = \Phi.

To predict a new value y_{new}, we can compute

    \hat{y}_{new} = \theta^T \phi(x_{new})
                  = \vec{y}^{\,T} (K + \lambda I)^{-1} \Phi \, \phi(x_{new})
                  = \sum_{i=1}^m \alpha_i K(x^{(i)}, x_{new}),

where \alpha = (K + \lambda I)^{-1} \vec{y}. All of these terms can be computed efficiently using the kernel function.
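As a quick numerical sanity check (not part of the original solution; all variable names here are illustrative), the following minimal NumPy sketch verifies that the primal closed form from part (a) and the kernelized dual form from part (b) make identical predictions when the kernel is the plain inner product K(x, z) = x^T z:

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 20, 5                     # m training examples, d features
    X = rng.normal(size=(m, d))      # design matrix; rows are x^(i)
    y = rng.normal(size=m)           # targets
    lam = 0.1                        # the regularization constant lambda > 0

    # Part (a): primal closed form, theta = (X^T X + lambda I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Part (b): dual form with the linear kernel, K_ij = x^(i)^T x^(j)
    K = X @ X.T                                     # m x m kernel (Gram) matrix
    alpha = np.linalg.solve(K + lam * np.eye(m), y) # alpha = (K + lambda I)^{-1} y

    # Prediction on a new input: theta^T x_new vs. sum_i alpha_i K(x^(i), x_new)
    x_new = rng.normal(size=d)
    pred_primal = theta @ x_new
    pred_dual = alpha @ (X @ x_new)  # here K(x^(i), x_new) = x^(i)^T x_new

    assert np.isclose(pred_primal, pred_dual)

Note that the dual prediction touches the inputs only through inner products, which is what makes the kernel trick work: replacing X @ X.T and X @ x_new with evaluations of a nonlinear kernel (e.g., an RBF kernel) performs ridge regression in the corresponding feature space without ever forming \phi(x) explicitly.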