...a final node.

n_i, n_A : number of observations in the sample that are class i; number of observations in node A.

n_{iA} : number of observations in the sample that are class i and in node A.

P(A) : probability of A (for future observations)

    = \sum_{i=1}^C \pi_i P\{x \in A \mid \tau(x) = i\}

    \approx \sum_{i=1}^C \pi_i n_{iA} / n_i

p(i|A) : P\{\tau(x) = i \mid x \in A\} (for future observations)

    = \pi_i P\{x \in A \mid \tau(x) = i\} / P\{x \in A\}

    \approx \pi_i (n_{iA}/n_i) / \sum_i \pi_i (n_{iA}/n_i)

R(A) : risk of A

    = \sum_{i=1}^C p(i|A) L(i, \tau(A))

    where \tau(A) is chosen to minimize this risk.

R(T) : risk of a model (or tree) T

    = \sum_{j=1}^k P(A_j) R(A_j)

    where the A_j are the terminal nodes of the tree.

If L(i, j) = 1 for all i \ne j, and we set the prior probabilities \pi equal to the observed class frequencies in the sample, then p(i|A) = n_{iA}/n_A and R(T) is the proportion misclassified.

3 Building the tree
3.1 Splitting criteria

If we split a node A into two sons A_L and A_R (left and right sons), we will have

P(A_L) R(A_L) + P(A_R) R(A_R) \le P(A) R(A)

(this is proven in [1]). Using this, one obvious way to build a tree is to choose that split which maximizes \Delta R, the decrease in risk. There are defects with this, however, as the following example shows.
Suppose losses are equal and that the data is 80% class 1's, and that some trial split results in A_L being 54% class 1's and A_R being 100% class 1's. (Class 1 versus class 0 is the outcome variable in this example.) Since the minimum risk prediction for both the left and right son is \tau(A_L) = \tau(A_R) = 1, this split will have \Delta R = 0, yet scientifically this is a very informative division of the sample. In real data with such a majority, the first few splits very often can do no better than this.
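The \Delta R = 0 outcome above can be checked numerically. The sketch below is not rpart code (rpart is an R package); it is a small Python illustration using the proportions from the example, with the left-son weight derived from the parent mix rather than stated in the text.

```python
# Under 0/1 loss the minimum-risk label is the majority class, so a node's
# risk is 1 minus its majority proportion.

def node_risk(p_class1):
    """Risk of a node under unit losses: the minority-class proportion."""
    return min(p_class1, 1.0 - p_class1)

p_A = 0.80                       # parent: 80% class 1 (from the text)
p_L, p_R = 0.54, 1.00            # sons: 54% and 100% class 1 (from the text)
w_L = (p_A - p_R) / (p_L - p_R)  # P(A_L) implied by the mixture of proportions
w_R = 1.0 - w_L

delta_R = node_risk(p_A) - (w_L * node_risk(p_L) + w_R * node_risk(p_R))
print(round(delta_R, 10))        # 0.0: the split looks worthless to Delta R
```

Both sons still predict class 1, so the weighted risk of the sons exactly equals the parent's risk and the split earns no credit.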
A more serious defect with maximizing \Delta R is that the risk reduction is essentially linear. If there were two competing splits, one separating the data into groups of 85% and 50% purity respectively, and the other into 70%-70%, we would usually prefer the former, if for no other reason than because it better sets things up for the next splits.
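Because node risk is linear in the class proportion on either side of 1/2, \Delta R often cannot separate such competing splits at all. A hedged Python sketch (the 70% parent proportion is an assumption chosen to make both splits feasible; it is not stated in the text):

```python
# Two competing splits of the same node: one into 85%/50% sons, one into
# 70%/70% sons. Delta R under 0/1 loss ranks them as equally useless.

def risk(p):                 # 0/1 loss, majority-label prediction
    return min(p, 1 - p)

p_parent = 0.70              # assumed parent proportion of class 1
# split 1: sons at 85% and 50%; the left-son weight is forced by the parent mix
w = (p_parent - 0.50) / (0.85 - 0.50)
dR1 = risk(p_parent) - (w * risk(0.85) + (1 - w) * risk(0.50))
# split 2: both sons at 70%, so any weighting gives the parent's own risk
dR2 = risk(p_parent) - risk(0.70)
print(dR1, dR2)              # both ~0
```

Both decreases in risk are (numerically) zero, even though the 85%/50% split is clearly the more promising one.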
One way around both of these problems is to use look-ahead rules; but these are computationally very expensive. Instead rpart uses one of several measures of impurity, or diversity, of a node. Let f be some impurity function and define the impurity of a node A as

I(A) = \sum_{i=1}^C f(p_{iA})

where p_{iA} is the proportion of those in A that belong to class i for future samples. Since we would like I(A) = 0 when A is pure, f must be concave with f(0) = f(1) = 0.
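As a concrete illustration of I(A), here is a minimal Python sketch (not rpart's implementation) of the two candidate impurity terms named just below, evaluated on made-up class proportions:

```python
import math

# I(A) = sum over classes of f(p_iA), for two common choices of f.

def entropy_term(p):                 # information index f(p) = -p log(p)
    return 0.0 if p == 0 else -p * math.log(p)

def gini_term(p):                    # Gini index f(p) = p(1 - p)
    return p * (1 - p)

def impurity(proportions, f):
    """Node impurity I(A) given the class proportions p_iA of node A."""
    return sum(f(p) for p in proportions)

print(impurity([0.5, 0.5], gini_term))   # 0.5: maximal for two classes
print(impurity([1.0, 0.0], gini_term))   # 0.0: a pure node
```

Both choices satisfy the requirements stated above: they are concave in p and vanish at p = 0 and p = 1, so a pure node has zero impurity.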
Two candidates for f are the information index f(p) = -p \log(p) and the Gini index f(p) = p(1 - p). We then use that split with maximal impurity reduction

\Delta I = p(A) I(A) - p(A_L) I(A_L) - p(A_R) I(A_R)

The two impurity functions are plotted in figure 2, with the second plot scaled so that the maximum for both measures is at 1. For the two-class problem the measures differ only slightly, and will nearly always choose the same split point.
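Unlike \Delta R, the impurity reduction \Delta I does separate the two competing splits described earlier (85%/50% sons versus 70%/70% sons). A Python sketch with the Gini index; as before, the 70% parent proportion is an assumed value, and the son weights within the node stand in for the p(A) terms:

```python
# Gini impurity of a two-class node with class-1 proportion p:
# f(p) + f(1-p) = p(1-p) + (1-p)p = 2p(1-p).

def gini(p):
    return 2 * p * (1 - p)

p_parent = 0.70                                   # assumed parent proportion
w = (p_parent - 0.50) / (0.85 - 0.50)             # left-son weight, split 1
dI1 = gini(p_parent) - (w * gini(0.85) + (1 - w) * gini(0.50))
dI2 = gini(p_parent) - gini(0.70)                 # split 2: sons match parent
print(dI1 > dI2)                                  # True: the 85%/50% split wins
```

Because the Gini function is strictly concave, any split that changes the son proportions lowers the weighted impurity, so the more informative split now gets positive credit.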
Another convex criterion, not quite of the above class, is twoing, for which

I(A) = \min_{C_1 C_2} [f(p_{C_1}) + f(p_{C_2})]

where C_1, C_2...
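The definition is cut off here, but the formula as given can still be sketched: collapse the C classes into two disjoint "superclasses" C_1 and C_2 and evaluate f on their pooled proportions. A hedged Python illustration with the Gini term and made-up four-class proportions (the partition enumeration is my own; the text's full definition of C_1, C_2 is truncated):

```python
from itertools import combinations

def f(p):                                  # Gini term f(p) = p(1 - p)
    return p * (1 - p)

def twoing_impurity(props):
    """min over two-group class partitions of f(p_C1) + f(p_C2)."""
    classes = range(len(props))
    best = float("inf")
    # enumerate nonempty proper subsets as C1; its complement is C2
    # (each partition is visited twice, which does not affect the minimum)
    for r in range(1, len(props)):
        for c1 in combinations(classes, r):
            p1 = sum(props[i] for i in c1)
            best = min(best, f(p1) + f(1 - p1))
    return best

print(twoing_impurity([0.4, 0.3, 0.2, 0.1]))
```

For the C = 2 case this reduces to the ordinary two-class impurity, since the only partition is the two classes themselves.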