6 Linear Models

A hyperplane in a space H endowed with a dot product ⟨·, ·⟩ is described by the set

    {x ∈ H | ⟨w, x⟩ + b = 0},    (6.1)

where w ∈ H and b ∈ ℝ. Such a hyperplane naturally divides H into two half-spaces, {x ∈ H | ⟨w, x⟩ + b ≥ 0} and {x ∈ H | ⟨w, x⟩ + b < 0}, and hence can be used as the decision boundary of a binary classifier. In this chapter we will study a number of algorithms which employ such linear decision boundaries. Although such models look restrictive at first glance, when combined with kernels (Chapter 5) they yield a large class of useful algorithms.

All the algorithms we will study in this chapter maximize the margin. Given a set X = {x_1, ..., x_m}, the margin is the distance of the closest point in X to the hyperplane (6.1). Elementary geometric arguments (Problem 6.1) show that the distance of a point x_i to the hyperplane is |⟨w, x_i⟩ + b| / ‖w‖, and hence the margin is simply

    min_{i=1,...,m} |⟨w, x_i⟩ + b| / ‖w‖.    (6.2)

Note that the parameterization of the hyperplane (6.1) is not unique: if we multiply both w and b by the same nonzero constant, then we obtain the same hyperplane. One way to resolve this ambiguity is to set

    min_{i=1,...,m} |⟨w, x_i⟩ + b| = 1.

In this case, the margin simply becomes 1/‖w‖. We postpone the justification of margin maximization and jump straight to the description of various algorithms.

6.1 Support Vector Classification

Consider a binary classification task, where we are given a training set {(x_1, y_1), ..., (x_m, y_m)} with x_i ∈ H and y_i ∈ {±1}. Our aim is to find a linear decision boundary parameterized by (w, b) such that ⟨w, x_i⟩ + b ≥ 0 whenever y_i = +1, and ⟨w, x_i⟩ + b < 0 whenever y_i = −1.

[Fig. 6.1. A linearly separable toy binary classification problem of separating the diamonds from the circles. We normalize (w, b) to ensure that min_{i=1,...,m} |⟨w, x_i⟩ + b| = 1; in this case the margin is given by 1/‖w‖, as the calculation in the inset shows: from ⟨w, x_1⟩ + b = +1 and ⟨w, x_2⟩ + b = −1 it follows that ⟨w/‖w‖, x_1 − x_2⟩ = 2/‖w‖.]

Furthermore, as discussed above, we fix the scaling of w by requiring min_{i=1,...,m} |⟨w, x_i⟩ + b| = 1. A compact way to write our desiderata is to require y_i(⟨w, x_i⟩ + b) ≥ 1 for all i (also see Figure 6.1). The problem of maximizing the margin therefore reduces to

    max_{w,b} 1/‖w‖    (6.3a)
    s.t. y_i(⟨w, x_i⟩ + b) ≥ 1 for all i,    (6.3b)

or equivalently

    min_{w,b} (1/2)‖w‖²    (6.4a)
    s.t. y_i(⟨w, x_i⟩ + b) ≥ 1 for all i.    (6.4b)

This is a constrained convex optimization problem with a quadratic objective function and linear constraints (see Section 3.3). In deriving (6.4) we implicitly assumed that the data is linearly separable, that is, there is a hyperplane which correctly classifies the training data. Such a classifier is called a hard margin classifier. If the data is not linearly separable, then (6.4) does not have a solution. To deal with this situation we introduce ...
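The margin formula (6.2) is straightforward to evaluate numerically. The following is a minimal sketch; the hyperplane parameters w, b and the data points are illustrative values chosen here, not taken from the text.

```python
import numpy as np

# Illustrative hyperplane parameters and points (assumed for this sketch).
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [3.0, 2.0]])

# Distance of each x_i to the hyperplane {x | <w, x> + b = 0} is
# |<w, x_i> + b| / ||w||  (Problem 6.1).
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin (6.2) is the distance of the closest point.
margin = distances.min()
```

Note that rescaling (w, b) by any nonzero constant leaves `distances`, and hence `margin`, unchanged, which is exactly the ambiguity the normalization min_i |⟨w, x_i⟩ + b| = 1 removes.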
This note was uploaded on 02/23/2012 for the course STAT 598 taught by Professor Staff during the Spring '08 term at Purdue.
