16 Ensemble Learning

16.1 Introduction
The idea of ensemble learning is to build a prediction model by combining
the strengths of a collection of simpler base models. We have already seen
a number of examples that fall into this category.
Bagging in Section 8.7 and random forests in Chapter 15 are ensemble methods for classification, where a committee of trees each cast a vote for the predicted class. Boosting in Chapter 10 was initially proposed as a committee method as well, although unlike random forests, the committee of weak learners evolves over time, and the members cast a weighted vote.
Stacking (Section 8.8) is a novel approach to combining the strengths of a number of fitted models. In fact, one could characterize any dictionary method, such as regression splines, as an ensemble method, with the basis functions serving the role of weak learners.
Bayesian methods for nonparametric regression can also be viewed as ensemble methods: a large number of candidate models are averaged with respect to the posterior distribution of their parameter settings (e.g., Neal and Zhang, 2006).
Ensemble learning can be broken down into two tasks: developing a population of base learners from the training data, and then combining them to form the composite predictor. In this chapter we discuss boosting technology that goes a step further; it builds an ensemble model by conducting a regularized and supervised search in a high-dimensional space of weak learners.
An early example of a learning ensemble is a method designed for multiclass classification using error-correcting output codes (ECOC; Dietterich and Bakiri, 1995). Consider the 10-class digit classification problem, and the coding matrix C given in Table 16.1.
TABLE 16.1. Part of a 15-bit error-correcting coding matrix C for the 10-class digit classification problem. Each column defines a two-class classification problem.

Digit  C1  C2  C3  C4  C5  C6  ···  C15
  0     1   1   0   0   0   0  ···   1
  1     0   0   1   1   1   1  ···   0
  2     1   0   0   1   0   0  ···   1
  ·     ·   ·   ·   ·   ·   ·  ···   ·
  ·     ·   ·   ·   ·   ·   ·  ···   ·
  8     1   1   0   1   0   1  ···   1
  9     0   1   1   1   0   0  ···   0
Note that the ℓth column of the coding matrix, C_ℓ, defines a two-class variable that merges all the original classes into two groups. The method works as follows:
1. Learn a separate classifier for each of the L = 15 two-class problems defined by the columns of the coding matrix.
2. At a test point x, let p̂_ℓ(x) be the predicted probability of a one for the ℓth response.
3. Define δ_k(x) = ∑_{ℓ=1}^{L} |C_{kℓ} − p̂_ℓ(x)|, the discriminant function for the kth class, where C_{kℓ} is the entry for row k and column ℓ in Table 16.1.
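These three steps translate directly into code. The following is a minimal sketch in Python, under stated assumptions: scikit-learn's LogisticRegression serves as the base two-class learner, the data are the digits that ship with scikit-learn, and the coding matrix is generated at random for illustration rather than copied from Table 16.1.

    # A minimal sketch of the ECOC procedure. The coding matrix C is
    # randomly generated here, NOT the exact matrix of Table 16.1.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Draw a 10 x 15 coding matrix; redraw until every column contains
    # both a 0 and a 1, so each column defines a genuine two-class problem.
    while True:
        C = rng.integers(0, 2, size=(10, 15))
        if (C.min(axis=0) == 0).all() and (C.max(axis=0) == 1).all():
            break

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Step 1: one classifier per column; column l relabels class k as C[k, l].
    classifiers = []
    for l in range(C.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(X_train, C[y_train, l])
        classifiers.append(clf)

    # Step 2: predicted probability of a one for each of the L responses.
    P_hat = np.column_stack(
        [clf.predict_proba(X_test)[:, 1] for clf in classifiers]
    )  # shape (n_test, L)

    # Step 3: discriminant delta_k(x) = sum_l |C[k,l] - p_hat_l(x)|;
    # classify to the row of C nearest in L1 distance.
    delta = np.abs(C[None, :, :] - P_hat[:, None, :]).sum(axis=2)
    y_pred = delta.argmin(axis=1)
    print("test accuracy:", (y_pred == y_test).mean())

Note that using the predicted probabilities in an L1 distance, rather than hard 0/1 votes, amounts to a soft version of Hamming decoding.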
Each row of C is a binary code for representing that class. The rows have more bits than is necessary, and the idea is that the redundant "error-correcting" bits allow for some inaccuracies, and can improve performance.
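A standard fact from coding theory makes this precise: if every pair of rows of C differs in at least d of the L bits, then decoding to the nearest row in Hamming distance still recovers the correct class whenever fewer than d/2 of the binary classifiers err, so up to ⌊(d − 1)/2⌋ errors can be corrected.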