CS 229, Public Course
Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression

In contrast to ordinary least squares, which has the cost function
$$J(\theta) = \frac{1}{2} \sum_{i=1}^m \left(\theta^T x^{(i)} - y^{(i)}\right)^2,$$
we can also add a term that penalizes large weights in $\theta$. In ridge regression, our least squares cost is regularized by adding a term $\lambda \|\theta\|^2$, where $\lambda > 0$ is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then
$$J(\theta) = \frac{1}{2} \sum_{i=1}^m \left(\theta^T x^{(i)} - y^{(i)}\right)^2 + \frac{\lambda}{2} \|\theta\|^2.$$

(a) Use the vector notation described in class to find a closed-form expression for the value of $\theta$ which minimizes the ridge regression cost function.

Answer: Using the design matrix notation, we can rewrite $J(\theta)$ as
$$J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) + \frac{\lambda}{2} \theta^T \theta.$$
Then the gradient is
$$\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y} + \lambda \theta.$$
Setting the gradient to zero gives us
$$0 = X^T X \theta - X^T \vec{y} + \lambda \theta$$
$$\theta = (X^T X + \lambda I)^{-1} X^T \vec{y}.$$

(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite-dimensional) space. Using a feature mapping $\phi$, the ridge regression cost function becomes
$$J(\theta) = \frac{1}{2} \sum_{i=1}^m \left(\theta^T \phi(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2} \|\theta\|^2.$$
Making a prediction on a new input $x_{\text{new}}$ would now be done by computing $\theta^T \phi(x_{\text{new}})$. Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing $\phi(x_{\text{new}})$. You may assume that the parameter vector $\theta$ can be expressed as a linear combination of the input feature vectors; i.e., $\theta = \sum_{i=1}^m \alpha_i \phi(x^{(i)})$ for some set of parameters $\alpha_i$.
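As a numerical sanity check of the closed form from part (a), the following NumPy sketch fits ridge regression on synthetic data (the data, dimensions, and variable names are illustrative assumptions, not part of the problem set):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))                 # design matrix, one example per row
theta_true = np.array([1.0, -2.0, 0.5])     # hypothetical ground-truth weights
y = X @ theta_true + 0.01 * rng.normal(size=m)

lam = 0.1
# Closed form from part (a): theta = (X^T X + lambda I)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# The gradient X^T X theta - X^T y + lambda*theta should vanish at the minimizer.
grad = X.T @ (X @ theta - y) + lam * theta
```

With a small $\lambda$ and little noise, the recovered `theta` lands close to `theta_true`, and the gradient at the solution is numerically zero.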
[Hint: You may find the following identity useful:
$$(\lambda I + BA)^{-1} B = B (\lambda I + AB)^{-1}.$$
If you want, you can try to prove this as well, though this is not required for the problem.]

Answer: Let $\Phi$ be the design matrix associated with the feature vectors $\phi(x^{(i)})$. Then from part (a), and applying the identity from the hint with $B = \Phi^T$ and $A = \Phi$,
$$\theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \vec{y} = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} \vec{y} = \Phi^T (K + \lambda I)^{-1} \vec{y},$$
where $K$ is the kernel matrix for the training set (since $(\Phi \Phi^T)_{ij} = \phi(x^{(i)})^T \phi(x^{(j)}) = K_{ij}$). To predict a new value $y_{\text{new}}$, we can compute
$$y_{\text{new}} = \theta^T \phi(x_{\text{new}}) = \vec{y}^T (K + \lambda I)^{-1} \Phi \, \phi(x_{\text{new}}) = \sum_{i=1}^m \alpha_i K(x^{(i)}, x_{\text{new}}),$$
where $\alpha = (K + \lambda I)^{-1} \vec{y}$. All of these terms can be computed efficiently using only the kernel function.
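The kernelized prediction from part (b) can be sketched in NumPy as follows. The Gaussian/RBF kernel, the data, and all names here are illustrative assumptions; any valid kernel function would do, and $\phi(x_{\text{new}})$ is never formed explicitly:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Hypothetical kernel choice: Gaussian/RBF, K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
m = 40
X = rng.uniform(-1, 1, size=(m, 2))          # training inputs x^{(i)}
y = np.sin(X[:, 0]) + X[:, 1] ** 2           # synthetic targets

lam = 0.1
# Kernel matrix K_{ij} = K(x^{(i)}, x^{(j)})
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
# alpha = (K + lambda I)^{-1} y, as derived above
alpha = np.linalg.solve(K + lam * np.eye(m), y)

def predict(x_new):
    # y_new = sum_i alpha_i K(x^{(i)}, x_new) -- only kernel evaluations needed
    return sum(a_i * rbf(xi, x_new) for a_i, xi in zip(alpha, X))
```

Training costs one $m \times m$ linear solve, and each prediction costs $m$ kernel evaluations, independent of the (possibly infinite) dimension of the feature space.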
This note was uploaded on 01/24/2010 for the course CS 229 at Stanford.