CS229 Lecture notes
Andrew Ng
Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) “off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll need to first talk about margins and the idea of separating data with a large “gap.” Next, we’ll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We’ll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite dimensional) feature spaces, and finally, we’ll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.
1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the “confidence” of our predictions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability $p(y = 1 \mid x; \theta)$ is modeled by $h_\theta(x) = g(\theta^T x)$. We would then predict “1” on an input $x$ if and only if $h_\theta(x) \geq 0.5$, or equivalently, if and only if $\theta^T x \geq 0$. Consider a positive training example ($y = 1$). The larger $\theta^T x$ is, the larger also is $h_\theta(x) = p(y = 1 \mid x; \theta)$, and thus also the higher our degree of “confidence” that the label is 1. Thus, informally we can think of our prediction as being a very confident one that $y = 1$ if $\theta^T x \gg 0$. Similarly, we think of logistic regression as making a very confident prediction of $y = 0$ if $\theta^T x \ll 0$. Given
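To make this concrete, here is a minimal sketch of the prediction rule above (in Python with NumPy; the function names and the numbers are our own illustration, not from the notes). The sigmoid $g$ squashes $\theta^T x$ into $(0, 1)$, so inputs with large $|\theta^T x|$ yield predictions very close to 0 or 1, while inputs near the boundary yield predictions near 0.5.

import numpy as np

def g(z):
    # Sigmoid function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Logistic regression hypothesis h_theta(x) = g(theta^T x).
    return g(theta @ x)

def predict(theta, x):
    # Predict "1" iff h_theta(x) >= 0.5, i.e. iff theta^T x >= 0.
    return 1 if theta @ x >= 0 else 0

theta = np.array([2.0, -1.0])            # illustrative, made-up parameters
print(h(theta, np.array([5.0, 0.0])))    # theta^T x = 10  -> ~0.99995 (very confident y = 1)
print(h(theta, np.array([0.1, 0.1])))    # theta^T x = 0.1 -> ~0.525   (barely confident)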
a training set, again informally it seems that we’d have found a good fit to the training data if we can find $\theta$ so that $\theta^T x^{(i)} \gg 0$ whenever $y^{(i)} = 1$, and
$\theta^T x^{(i)} \ll 0$ whenever $y^{(i)} = 0$, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we’ll soon formalize this idea using the notion of functional margins.
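As a rough illustration of this informal criterion (a sketch of our own, not part of the notes), one could check how strongly the sign of $\theta^T x^{(i)}$ agrees with each label. The arbitrary cutoff tau stands in for “much greater than 0,” which the functional margin will later make precise.

import numpy as np

def confidently_correct(theta, X, y, tau=2.0):
    # Informally check that theta^T x^(i) >> 0 whenever y^(i) = 1
    # and theta^T x^(i) << 0 whenever y^(i) = 0, using an arbitrary
    # threshold tau in place of ">> 0".
    scores = X @ theta                    # theta^T x^(i) for every example i
    return bool(np.all(scores[y == 1] > tau) and np.all(scores[y == 0] < -tau))

X = np.array([[5.0, 0.0], [0.1, 0.1]])   # made-up training inputs
y = np.array([1, 0])
print(confidently_correct(np.array([2.0, -1.0]), X, y))
# False: the negative example's score (0.1) is not << 0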
For a different type of intuition, consider the following figure, in which x’s represent positive training examples, o’s denote negative training examples, a decision boundary (this is the line given by the equation $\theta^T x = 0$, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B and C.
[Figure: training data separated by the decision boundary $\theta^T x = 0$, with three points labeled A, B, and C.]
Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of $y$ at A, it seems we should be quite confident that $y = 1$ there. Conversely, the point C is very close to the decision boundary, and while it’s on the side of the decision boundary on which we would predict $y = 1$, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be $y = 0$.
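This geometric picture can be made quantitative: the signed distance from a point $x$ to the hyperplane $\theta^T x = 0$ is $\theta^T x / \|\theta\|$. Below is a minimal sketch (the coordinates of A and C are made up for illustration) showing that A’s large distance corresponds to a confident prediction, while C’s small distance means a slight shift of the boundary could flip its label.

import numpy as np

def distance_to_boundary(theta, x):
    # Signed distance from x to the hyperplane theta^T x = 0.
    return (theta @ x) / np.linalg.norm(theta)

theta = np.array([2.0, -1.0])            # illustrative parameters
A = np.array([4.0, 3.0])                 # far from the boundary
C = np.array([0.5, 0.9])                 # very close to the boundary
print(distance_to_boundary(theta, A))    # ~2.24: confidently y = 1
print(distance_to_boundary(theta, C))    # ~0.04: a small change flips the prediction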