SVMs—a practical consequence of learning theory
Bernhard Schölkopf, GMD First
Is there anything worthwhile to learn
about the new SVM algorithm, or does it
fall into the category of “yet-another-algo-
rithm,” in which case readers should stop
here and save their time for something
more useful? In this short overview, I will
try to argue that studying support-vector
learning is very useful in two respects.
First, it is quite satisfying from a theoreti-
cal point of view: SV learning is based on
some beautifully simple ideas and provides
a clear intuition of what learning from examples is about. Second, it can lead to high performance in practical applications.
The SV algorithm can be considered to lie at the intersection of learning theory and practice in the following sense: for
certain simple types of algorithms, statisti-
cal learning theory can identify rather pre-
cisely the factors that need to be taken into
account to learn successfully. Real-world
applications, however, often mandate the
use of more complex models and algori-
thms—such as neural networks—that are
much harder to analyze theoretically. The
SV algorithm achieves both. It constructs
models that are complex enough: it con-
tains a large class of neural nets, radial
basis function (RBF) nets, and polynomial
classifiers as special cases. Yet it is simple
enough to be analyzed mathematically,
because it can be shown to correspond to a linear method in a high-dimensional feature space nonlinearly related to input space. Moreover, even though we can think of it as a linear algorithm in a high-dimensional space, in practice it does not involve any computations in that high-dimensional space. By the use of kernels, all necessary computations are performed directly in input space. This is the characteristic twist of SV methods—we are dealing with complex algorithms for nonlinear pattern recognition [1], regression [2], or feature extraction [3], but for the sake of analysis and algorithmics, we can pretend that we are working with a simple linear algorithm.
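As a concrete illustration of this kernel shortcut (a sketch of my own, not taken from the article): for the degree-2 polynomial kernel on 2-D inputs, evaluating k(x, y) = (x · y)^2 directly in input space yields exactly the dot product that an explicit feature map into the space of second-order monomials would produce, without ever constructing the feature vectors.

```python
import math

def phi(x):
    # Explicit degree-2 feature map for a 2-D input:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    # Degree-2 polynomial kernel, evaluated purely in input space
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)

# Dot product computed the expensive way, in feature space
dot_in_feature_space = sum(a * b for a, b in zip(phi(x), phi(y)))

print(dot_in_feature_space)  # 16.0
print(poly_kernel(x, y))     # 16.0 -- same value, no feature map needed
```

For higher input dimensions and higher polynomial degrees, the explicit feature space grows combinatorially while the kernel evaluation stays a single dot product and a power, which is the point of the trick.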
I will explain the gist of SV methods by
describing their roots in learning theory,
the optimal hyperplane algorithm, the ker-
nel trick, and SV function estimation. For
details and further references, see Vladimir Vapnik's authoritative treatment [2], the collection my colleagues and I have put together [4], and the SV Web page at http://svm.first.gmd.de.
Learning pattern recognition from examples
For pattern recognition, we try to estimate a function f: R^N → {±1} using training data—that is, N-dimensional patterns x_i and class labels y_i,

    (x_1, y_1), …, (x_ℓ, y_ℓ) ∈ R^N × {±1},    (1)
such that f will correctly classify new examples (x, y)—that is, f(x) = y for examples (x, y), which were generated from the same underlying probability distribution P(x, y) as the training data. If we put no restriction on the class of functions that we choose our estimate f from, however, even a function that does well on the training data—for example, by satisfying f(x_i) = y_i