Data Mining and Knowledge Discovery, 2, 121–167 (1998)
c
±
1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
A Tutorial on Support Vector Machines for Pattern
Recognition
CHRISTOPHER J.C. BURGES
burges@lucent.com
Bell Laboratories, Lucent Technologies
Editor:
Usama Fayyad
Abstract.
The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization.
We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through
a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique
and when they are global. We describe how support vector training can be practically implemented, and discuss
in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the
data. We show how Support Vector machines can have very large (even inﬁnite) VC dimension by computing
the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC
dimension would normally bode ill for generalization performance, and while at present there exists no theory
which shows that good generalization performance is
guaranteed
for SVMs, there are several arguments which
support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired
by these arguments are also presented. We give numerous examples and proofs of most of the key theorems.
There is new material, and I hope that the reader will ﬁnd that even old material is cast in a fresh light.
Keywords:
support vector machines, statistical learning theory, VC dimension, pattern recognition
1.
Introduction
The purpose of this paper is to provide an introductory yet extensive tutorial on the basic
ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995; Vapnik, 1998)
contain excellent descriptions of SVMs, but they leave room for an account whose purpose
from the start is to teach. Although the subject can be said to have started in the late
seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time
appears suitable for an introductory review. The tutorial dwells entirely on the pattern
recognition problem. Many of the ideas there carry directly over to the cases of regression
estimation and linear operator inversion, but space constraints precluded the exploration of
these topics here.
The tutorial contains some new material. All of the proofs are my own versions, where
I have placed a strong emphasis on their being both clear and self-contained, to make the
material as accessible as possible. This was done at the expense of some elegance and
generality: however generality is usually easily added once the basic ideas are clear. The
longer proofs are collected in the Appendix.