Machine Learning, 45, 5–32, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
Statistics Department, University of California, Berkeley, CA 94720
Editor: Robert E. Schapire
Random forests are a combination of tree predictors such that each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in the forest. The generalization
error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization
error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the
correlation between them. Using a random selection of features to split each node yields error rates that compare
favorably to Adaboost (Y. Freund & R. Schapire, Proceedings of the Thirteenth International
Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error,
strength, and correlation and these are used to show the response to increasing the number of features used in
the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to
regression.
Keywords: classification, regression, ensemble
Significant improvements in classification accuracy have resulted from growing an ensemble
of trees and letting them vote for the most popular class. In order to grow these ensembles,
often random vectors are generated that govern the growth of each tree in the ensemble.
An early example is bagging (Breiman, 1996), where to grow each tree a random selection
(with replacement) is made from the examples in the training set.
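Bagging, as defined in Breiman (1996), draws each bootstrap replicate with replacement and combines the trees by plurality vote. A minimal sketch of those two steps follows; the helper names `bootstrap_sample` and `majority_vote` and the toy data are illustrative assumptions, not part of any library or of the original paper, and the tree-growing step itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(examples, rng):
    """Draw len(examples) items uniformly at random, with replacement."""
    n = len(examples)
    return [examples[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Toy training set of (input, label) pairs; purely illustrative data.
training_set = [(i, i % 2) for i in range(10)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One bootstrap replicate per tree; some examples repeat, others are left out.
replicates = [bootstrap_sample(training_set, rng) for _ in range(5)]
```

Each replicate has the same size as the training set, so on average a little over a third of the original examples are absent from any given replicate.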
Another example is random split selection (Dietterich, 1998) where at each node the split
is selected at random from among the $K$ best splits. Breiman (1999) generates new training
sets by randomizing the outputs in the original training set. Another approach is to select
the training set from a random set of weights on the examples in the training set. Ho (1998)
has written a number of papers on “the random subspace” method which does a random
selection of a subset of features to use to grow each tree.
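Two of the randomization schemes in this paragraph can be sketched in a few lines each: choosing a split at random from among the best-scoring candidates, and choosing a random subset of features for a tree. The function names, the `(score, label)` representation of a candidate split, and the convention that a higher score is better are all assumptions made for illustration; the cited papers define their own scoring criteria.

```python
import random


def random_split_from_k_best(scored_splits, k, rng):
    """Pick one split uniformly at random from the k highest-scoring ones.

    scored_splits is a list of (score, label) pairs; higher score is
    assumed to mean a better split.
    """
    best = sorted(scored_splits, key=lambda s: s[0], reverse=True)[:k]
    return rng.choice(best)


def random_subspace(n_features, m, rng):
    """Pick m distinct feature indices out of n_features for one tree."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical candidate splits with made-up scores.
candidates = [(0.9, "a"), (0.1, "b"), (0.5, "c")]
chosen_split = random_split_from_k_best(candidates, 2, rng)

# A tree restricted to 3 of 10 features.
chosen_features = random_subspace(10, 3, rng)
```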
In an important paper on written character recognition, Amit and Geman (1997) define
a large number of geometric features and search over a random selection of these for the
best split at each node. This latter paper has been influential in my thinking.
The common element in all of these procedures is that for the $k$th tree, a random vector
$\Theta_k$ is generated, independent of the past random vectors $\Theta_1, \ldots, \Theta_{k-1}$ but with the same
distribution; and a tree is grown using the training set and $\Theta_k$, resulting in a classifier
$h(\mathbf{x}, \Theta_k)$ where $\mathbf{x}$ is an input vector. For instance, in bagging the random vector