CS 598 and STAT 598A: Homework 5

Due: 20th April 2010

1. Attempt as many problems as possible.
2. No points for random guessing. You have to explain your answers.
3. Mail your source code to [email protected] before the class on 20th April 2010. You may email a PDF of your reports or hand them to me in class. No late submissions will be accepted!
4. Program files should be named after the problem (e.g. the solution to problem 1 should be problem1.c, etc.). Include detailed instructions for how to run your code on a Linux machine (e.g. include makefiles, or instructions to run scripts as appropriate).

Problem 1 (2 pt)

Show that the distance of a point $x_i$ to a hyperplane $H = \{ x \mid \langle w, x \rangle + b = 0 \}$ is given by $|\langle w, x_i \rangle + b| / \|w\|$.

Problem 2 (4 pt)

A homogeneous hyperplane is one which passes through the origin, that is,

$$H = \{ x \mid \langle w, x \rangle = 0 \}. \tag{1}$$

If we devise a soft margin classifier which uses the homogeneous hyperplane as a decision boundary, then the corresponding primal optimization problem can be written as

$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \tag{2a}$$
$$\text{s.t.} \quad y_i \langle w, x_i \rangle \geq 1 - \xi_i \quad \text{for all } i \tag{2b}$$
$$\xi_i \geq 0. \tag{2c}$$

Derive the dual of (2) and contrast it with (??). What changes to the SMO algorithm would you make to solve this dual?

Problem 3 (4 pt)

LIBSVM is a widely used solver for training an SVM. The aim of this problem is to familiarize yourselves with LIBSVM and use it for our familiar ECML/PAKDD spam detection problem.

• Download LIBSVM and read its documentation.
• Use LIBSVM with three different kernels to classify the ECML/PAKDD discovery challenge dataset that you used in the previous HW.
• Compare the performance of the different kernels on your problem.
• Write a report (5 pages max) to document your findings. In particular, mention how you chose the kernel parameters.
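One possible way to organize the kernel and parameter comparison in Problem 3 is a cross-validated grid search. The sketch below only builds the `svm-train` command lines and parses the accuracy line that `svm-train -v` prints; it does not run LIBSVM itself. It assumes the `svm-train` binary is on your PATH and relies on its `-t` (kernel type), `-c` (cost), `-g` (gamma) and `-v` (n-fold cross validation) options. The file name `spam.train`, the helper names, and the grid values are hypothetical placeholders, not part of the assignment.

```python
import itertools
import re

# LIBSVM's svm-train kernel codes for the -t option.
KERNEL_CODES = {"linear": 0, "polynomial": 1, "rbf": 2}


def build_cv_commands(train_file, costs, gammas, folds=5):
    """Return one svm-train argument list per (kernel, C, gamma) setting."""
    commands = []
    for kernel, code in KERNEL_CODES.items():
        # The linear kernel has no gamma, so only C is varied for it.
        gamma_values = [None] if kernel == "linear" else gammas
        for cost, gamma in itertools.product(costs, gamma_values):
            argv = ["svm-train", "-t", str(code), "-c", str(cost), "-v", str(folds)]
            if gamma is not None:
                argv += ["-g", str(gamma)]
            argv.append(train_file)
            commands.append(argv)
    return commands


def parse_cv_accuracy(output):
    """Extract the accuracy (in percent) from svm-train -v output, or None."""
    match = re.search(r"Cross Validation Accuracy = ([0-9.]+)%", output)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # "spam.train" is a hypothetical file name in LIBSVM's sparse format.
    for argv in build_cv_commands("spam.train", costs=[0.1, 1, 10], gammas=[0.01, 0.1]):
        print(" ".join(argv))
```

Each argument list can be passed to `subprocess.run(argv, capture_output=True, text=True)` and the resulting `stdout` to `parse_cv_accuracy`. Recording the best accuracy per kernel gives the comparison the report asks for, along with a reproducible account of how the parameters were chosen.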

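Returning to Problem 1, a quick numerical sanity check can help catch sign or normalization mistakes before you write up the derivation. The sketch below compares the stated expression with the distance from a point to its orthogonal projection onto $H$. The function names and the example values of `w`, `b`, and `x` are illustrative assumptions, and agreement on one example does not replace the proof.

```python
import math


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def distance_by_formula(w, b, x):
    """|<w, x> + b| / ||w||, the expression stated in Problem 1."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def distance_by_projection(w, b, x):
    """Project x onto H along w, then measure the Euclidean distance to it."""
    t = (dot(w, x) + b) / dot(w, w)
    projection = [xi - t * wi for xi, wi in zip(x, w)]
    # The projected point should lie on H, i.e. satisfy <w, p> + b = 0.
    assert abs(dot(w, projection) + b) < 1e-9
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, projection)))


if __name__ == "__main__":
    w, b, x = [3.0, 4.0], -5.0, [4.0, 7.0]  # illustrative values only
    print(distance_by_formula(w, b, x), distance_by_projection(w, b, x))
```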