
CS 229, Public Course
Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression

In contrast to ordinary least squares, which has the cost function

    J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2,

we can also add a term that penalizes large weights in \theta. In ridge regression, our least squares cost is regularized by adding a term \lambda \|\theta\|^2, where \lambda > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then

    J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2 + \frac{\lambda}{2} \|\theta\|^2.

(a) Use the vector notation described in class to find a closed-form expression for the value of \theta which minimizes the ridge regression cost function.

Answer: Using the design matrix notation, we can rewrite J(\theta) as

    J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) + \frac{\lambda}{2} \theta^T \theta.

Then the gradient is

    \nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y} + \lambda \theta.

Setting the gradient to 0 gives us

    0 = X^T X \theta - X^T \vec{y} + \lambda \theta
    \theta = (X^T X + \lambda I)^{-1} X^T \vec{y}.

(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite-dimensional) space. Using a feature mapping \phi, the ridge regression cost function becomes

    J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta^T \phi(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2} \|\theta\|^2.

Making a prediction on a new input x_{\rm new} would now be done by computing \theta^T \phi(x_{\rm new}). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing \phi(x_{\rm new}). You may assume that the parameter vector \theta can be expressed as a linear combination of the input feature vectors; i.e., \theta = \sum_{i=1}^{m} \alpha_i \phi(x^{(i)}) for some set of parameters \alpha_i.

[Hint: You may find the following identity useful:

    (\lambda I + BA)^{-1} B = B (\lambda I + AB)^{-1}.

If you want, you can try to prove this as well, though this is not required for the problem.]

Answer: Let \Phi be the design matrix associated with the feature vectors \phi(x^{(i)}). Then, from part (a) and the identity in the hint (taking B = \Phi^T and A = \Phi),

    \theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{y}
           = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} \vec{y}
           = \Phi^T (K + \lambda I)^{-1} \vec{y},

where K is the kernel matrix for the training set, since (\Phi \Phi^T)_{i,j} = \phi(x^{(i)})^T \phi(x^{(j)}) = K_{ij}. To predict a new value y_{\rm new}, we can compute

    y_{\rm new} = \theta^T \phi(x_{\rm new})
                = \vec{y}^T (K + \lambda I)^{-1} \Phi \phi(x_{\rm new})
                = \sum_{i=1}^{m} \alpha_i K(x^{(i)}, x_{\rm new}),

where \alpha = (K + \lambda I)^{-1} \vec{y}; the last step uses the fact that the i-th entry of \Phi \phi(x_{\rm new}) is \phi(x^{(i)})^T \phi(x_{\rm new}) = K(x^{(i)}, x_{\rm new}). All of these terms can be computed efficiently using only the kernel function. ...
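To make the two closed forms above concrete, here is a minimal numerical sketch in Python; it is not part of the original solutions, and the variable and function names (lam for \lambda, ridge_theta, kernel_ridge_predict) are my own. It implements the part (a) closed form and the part (b) kernelized prediction, then checks that with the linear kernel K(x, z) = x^T z the two agree, as the hint's identity guarantees.

    import numpy as np

    def ridge_theta(X, y, lam):
        # Part (a): theta = (X^T X + lam*I)^{-1} X^T y, computed with a
        # linear solve rather than an explicit matrix inverse.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def kernel_ridge_predict(K, y, k_new, lam):
        # Part (b): alpha = (K + lam*I)^{-1} y, then
        # y_new = sum_i alpha_i K(x^(i), x_new) = k_new^T alpha,
        # where k_new[i] = K(x^(i), x_new).
        m = K.shape[0]
        alpha = np.linalg.solve(K + lam * np.eye(m), y)
        return k_new @ alpha

    # Sanity check: with the linear kernel K(x, z) = x^T z (i.e., phi is
    # the identity map), the kernelized prediction must match the primal.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # 20 training inputs in R^3
    y = rng.normal(size=20)
    lam = 0.5
    x_new = rng.normal(size=3)

    primal = ridge_theta(X, y, lam) @ x_new
    dual = kernel_ridge_predict(X @ X.T, y, X @ x_new, lam)
    assert np.isclose(primal, dual)

Note the design point the kernel trick buys: the kernelized path solves an m-by-m system in K and never touches the feature dimension, so it remains usable even when \phi maps into a very high- or infinite-dimensional space.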