CS229 Lecture notes
Andrew Ng
Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning
algorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms.
To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.
1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our
predictions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability $p(y = 1 \mid x; \theta)$
is modeled by $h_\theta(x) = g(\theta^T x)$. We would then predict “1” on an
input $x$ if and only if $h_\theta(x) \geq 0.5$, or equivalently, if and only
if $\theta^T x \geq 0$. Consider a positive training example ($y = 1$). The
larger $\theta^T x$ is, the larger also is
$h_\theta(x) = p(y = 1 \mid x; \theta)$, and thus also the higher our degree
of “confidence” that the label is 1. Thus, informally we can think of our
prediction as being a very confident one that $y = 1$ if
$\theta^T x \gg 0$. Similarly, we think of logistic regression as making a
very confident prediction of $y = 0$ if $\theta^T x \ll 0$. Given a training
set, again informally it seems that we’d have found a good fit to the
training data if we can find $\theta$ so that $\theta^T x^{(i)} \gg 0$
whenever $y^{(i)} = 1$, and
$\theta^T x^{(i)} \ll 0$ whenever $y^{(i)} = 0$, since this would reflect a
very confident (and correct) set of classifications for all the training
examples. This seems to be a nice goal to aim for, and we’ll soon formalize
this idea using the notion of functional margins.
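As a rough illustration of the logistic regression discussion above, the following Python sketch computes $\theta^T x$ and $h_\theta(x) = g(\theta^T x)$ for a few inputs. The parameter vector `theta`, the example inputs, and the helper name `predict_with_confidence` are all made up for illustration and are not taken from these notes.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_with_confidence(theta, x):
    """Return (label, p), where p = h_theta(x) = g(theta^T x).

    The label is 1 exactly when theta^T x >= 0, i.e. when p >= 0.5.
    """
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    p = sigmoid(z)
    label = 1 if z >= 0 else 0
    return label, p


# Hypothetical parameters and inputs, chosen only to show the three regimes.
theta = [1.0, -2.0]
examples = {
    "theta^T x >> 0": [5.0, 0.0],   # theta^T x = 5
    "theta^T x near 0": [0.2, 0.0],  # theta^T x = 0.2
    "theta^T x << 0": [0.0, 3.0],   # theta^T x = -6
}

for name, x in examples.items():
    label, p = predict_with_confidence(theta, x)
    print(f"{name}: predict y = {label}, h_theta(x) = {p:.4f}")
```

Under these assumed values, the first input gets a probability close to 1, the third a probability close to 0, and the second sits only slightly above 0.5. That is the sense in which a large $|\theta^T x|$ corresponds to a confident prediction.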
For a different type of intuition, consider the following figure, in which x’s
represent positive training examples, o’s denote negative training examples,
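a decision boundary (this is the line given by the equation $\theta^T x = 0$,
and is also called the separating hyperplane) is also shown, and three points
have also been labeled A, B and C.
[Figure: positive (x) and negative (o) training examples, the decision
boundary, and the labeled points A, B and C.]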
Notice that the point A is very far from the decision boundary. If we are
asked to make a prediction for the value of $y$ at A, it seems we should be
quite confident that $y = 1$ there. Conversely, the point C is very close to
the decision boundary, and while it’s on the side of the decision boundary
on which we would predict $y = 1$, it seems likely that just a small change to
the decision boundary could easily have caused our prediction to be $y = 0$.
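The contrast between A and C can be sketched numerically. The coordinates and parameter vectors below are hypothetical, since the figure's actual values are not given; they are chosen only so that A lies far from the boundary $\theta^T x = 0$ and C lies close to it. The sketch then perturbs $\theta$ slightly and compares the predictions.

```python
def predict(theta, x):
    """Predict 1 if theta^T x >= 0, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0


# Hypothetical decision boundary and a slightly perturbed version of it.
theta = [1.0, 1.0]
theta_perturbed = [1.0, 1.3]

# Hypothetical points: A is far from the boundary, C is very close to it.
points = {
    "A": [4.0, 4.0],    # theta^T x = 8
    "C": [0.6, -0.5],   # theta^T x is about 0.1
}

for name, x in points.items():
    before = predict(theta, x)
    after = predict(theta_perturbed, x)
    print(f"{name}: y = {before} before, y = {after} after the perturbation")
```

With these assumed numbers, the prediction at A stays $y = 1$ under the perturbed boundary, while the prediction at C flips to $y = 0$. This matches the intuition that predictions far from the boundary are the more trustworthy ones.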
 Fall '09