SVMs—a practical consequence of learning theory
Bernhard Schölkopf, GMD First
Is there anything worthwhile to learn about the new SVM algorithm, or does it fall into the category of "yet-another-algorithm," in which case readers should stop here and save their time for something more useful? In this short overview, I will try to argue that studying support-vector learning is very useful in two respects. First, it is quite satisfying from a theoretical point of view: SV learning is based on some beautifully simple ideas and provides a clear intuition of what learning from examples is about. Second, it can lead to high performances in practical applications.
The SV algorithm can be considered to lie at the intersection of learning theory and practice in the following sense: for certain simple types of algorithms, statistical learning theory can identify rather precisely the factors that need to be taken into account to learn successfully. Real-world applications, however, often mandate the use of more complex models and algorithms—such as neural networks—that are much harder to analyze theoretically. The SV algorithm achieves both. It constructs models that are complex enough: it contains a large class of neural nets, radial basis function (RBF) nets, and polynomial classifiers as special cases. Yet it is simple enough to be analyzed mathematically, because it can be shown to correspond to a linear method in a high-dimensional feature space nonlinearly related to input space. Moreover, even though we can think of it as a linear algorithm in a high-dimensional space, in practice it does not involve any computations in that high-dimensional space. By the use of kernels, all necessary computations are performed directly in input space. This is the characteristic twist of SV methods—we are dealing with complex algorithms for nonlinear pattern recognition,¹ regression,² or feature extraction,³ but for the sake of analysis and algorithmics, we can pretend that we are working with a simple linear algorithm.
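As a small illustration of this twist (my own sketch, not an example from the article), consider the polynomial kernel k(x, y) = (x · y)² on R². Evaluating k requires only a dot product in the two-dimensional input space, yet it equals the dot product of the images of x and y under an explicit nonlinear map into a three-dimensional feature space:

```python
import math

def kernel(x, y):
    # k(x, y) = (x . y)^2, computed entirely in input space (R^2)
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    # Explicit nonlinear map into feature space (R^3):
    # (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# The kernel value agrees with the feature-space dot product.
x, y = (1.0, 2.0), (3.0, 0.5)
assert abs(kernel(x, y) - dot(feature_map(x), feature_map(y))) < 1e-9
```

For higher polynomial degrees or RBF kernels the corresponding feature space becomes very high- (even infinite-) dimensional, but the kernel evaluation stays equally cheap—which is exactly why the computations never need to leave input space.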
I will explain the gist of SV methods by describing their roots in learning theory, the optimal hyperplane algorithm, the kernel trick, and SV function estimation. For details and further references, see Vladimir Vapnik's authoritative treatment,² the collection my colleagues and I have put together,⁴ and the SV Web page at http://svm.first.gmd.de.
Learning pattern recognition from examples
For pattern recognition, we try to estimate a function f: R^N → {±1} using training data—that is, N-dimensional patterns x_i and class labels y_i,

(x_1, y_1), …, (x_ℓ, y_ℓ) ∈ R^N × {±1},   (1)

such that f will correctly classify new examples (x, y)—that is, f(x) = y for examples (x, y) that were generated from the same underlying probability distribution P(x, y) as the training data. If we put no restriction on the class of functions that we choose our estimate f from, however, even a function that does well on the training data—for example, by satisfying f(x_i) = y_i for all i—need not generalize well to unseen examples.
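To see why zero training error alone is not enough, here is a deliberately bad classifier (my own hypothetical sketch): it simply memorizes the training pairs, so it satisfies f(x_i) = y_i on every training point, yet it has no principled answer for inputs it has never seen.

```python
# Toy training set of patterns and labels in {-1, +1}; the points and
# labels are invented for illustration.
train = {(0.0, 0.0): -1, (1.0, 1.0): +1, (2.0, 2.0): +1}

def memorizer(x):
    # Zero training error by construction...
    if x in train:
        return train[x]
    # ...but an arbitrary answer on anything new.
    return -1

# Perfect on the training data:
assert all(memorizer(x) == y for x, y in train.items())
# On an unseen point it just emits the default, right or wrong:
assert memorizer((3.0, 3.0)) == -1
```

Restricting the class of candidate functions—and measuring its capacity—is precisely what statistical learning theory, and with it the SV algorithm, is about.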