# Rpart Technical Report 61

## 2 Notation

- $x$ : the vector of predictor variables
- $\tau(x)$ : the true class of an observation $x$
- $\tau(A)$ : the class assigned to $A$, if $A$ were a terminal node
- $\pi_i$ : the prior probability of class $i$
- $L(i, j)$ : the loss incurred when a class $i$ observation is classified as class $j$
- $n_i$, $n_A$ : number of observations in the sample that are class $i$; number of observations in node $A$
- $n_{iA}$ : number of observations in the sample that are class $i$ and in node $A$
- $P(A)$ : probability of $A$, for future observations:

$$P(A) = \sum_{i=1}^{C} \pi_i \, P\{x \in A \mid \tau(x) = i\} \approx \sum_{i=1}^{C} \pi_i \, n_{iA} / n_i$$

- $p(i|A) = P\{\tau(x) = i \mid x \in A\}$, for future observations:

$$p(i|A) = \pi_i \, P\{x \in A \mid \tau(x) = i\} \big/ P\{x \in A\} \approx \pi_i (n_{iA}/n_i) \Big/ \sum_{i} \pi_i (n_{iA}/n_i)$$

- $R(A)$ : risk of $A$:

$$R(A) = \sum_{i=1}^{C} p(i|A) \, L(i, \tau(A))$$

  where $\tau(A)$ is chosen to minimize this risk.
- $R(T)$ : risk of a model (or tree) $T$:

$$R(T) = \sum_{j=1}^{k} P(A_j) \, R(A_j)$$

  where the $A_j$ are the terminal nodes of the tree.

If $L(i, j) = 1$ for all $i \neq j$, and we set the prior probabilities $\pi$ equal to the observed class frequencies in the sample, then $p(i|A) = n_{iA}/n_A$ and $R(T)$ is the proportion misclassified.

## 3 Building the tree

### 3.1 Splitting criteria

If we split a node $A$ into two sons $A_L$ and $A_R$ (left and right sons), we will have

$$P(A_L) R(A_L) + P(A_R) R(A_R) \le P(A) R(A) \tag{5}$$

(this is proven in [1]). Using this, one obvious way to build a tree is to choose the split that maximizes $\Delta R$, the decrease in risk. There are defects with this, however, as the following example shows.

Suppose losses are equal and the data are 80% class 1's, and that some trial split results in $A_L$ being 54% class 1's and $A_R$ being 100% class 1's. (Class 1 versus class 0 is the outcome variable in this example.) Since the minimum-risk prediction for both the left and right son is $\tau(A_L) = \tau(A_R) = 1$, this split will have $\Delta R = 0$, yet scientifically this is a very informative division of the sample. In real data with such a majority, the first few splits very often can do no better than this.

A more serious defect with maximizing $\Delta R$ is that the risk reduction is essentially linear.
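The zero-risk-reduction defect can be checked numerically. The following minimal Python sketch (not part of the report) reproduces the 80%/54%/100% example under unit loss; the helper `node_risk` and the left-son weight `w_L` are our own illustrative choices, with `w_L` fixed so the two sons mix back to the parent's 80% class-1 proportion:

```python
def node_risk(p):
    """Risk of a node under unit loss: 1 - max class proportion,
    since the majority class is the minimum-risk prediction."""
    return 1.0 - max(p)

# Parent: 80% class 1.  Trial split: left son 54% class 1, right son
# 100% class 1.  The left-son weight is fixed by
#   w_L * 0.54 + (1 - w_L) * 1.00 = 0.80  =>  w_L = 0.2 / 0.46
w_L = 0.2 / 0.46

parent = [0.2, 0.8]          # [P(class 0), P(class 1)] in the parent
left = [0.46, 0.54]          # left son: 54% class 1
right = [0.0, 1.0]           # right son: pure class 1

# Delta R = R(A) - [P(A_L) R(A_L) + P(A_R) R(A_R)]
delta_R = node_risk(parent) - (w_L * node_risk(left) + (1 - w_L) * node_risk(right))
print(f"delta_R = {abs(delta_R):.6f}")  # prints delta_R = 0.000000
```

Both sons predict class 1, so the split buys no reduction in risk even though it isolates a pure subgroup.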
If there were two competing splits, one separating the data into groups of 85% and 50% purity respectively, and the other into two groups each of 70% purity, we would usually prefer the former, if for no other reason than because it better sets things up for the next splits. One way around both of these problems is to use look-ahead rules; but these are computationally very expensive. Instead rpart uses one of several measures of impurity, or diversity, of a node. Let $f$ be some impurity function and define the impurity of a node $A$ as

$$I(A) = \sum_{i=1}^{C} f(p_{iA})$$

where $p_{iA}$ is the proportion of those in $A$ that belong to class $i$ for future samples. Since we would like $I(A) = 0$ when $A$ is pure, $f$ must be concave with $f(0) = f(1) = 0$. Two candidates for $f$ are the information index $f(p) = -p \log(p)$ and the Gini index $f(p) = p(1 - p)$. We then use the split with maximal impurity reduction

$$\Delta I = p(A) I(A) - p(A_L) I(A_L) - p(A_R) I(A_R)$$

The two impurity functions are plotted in figure 2, with the second plot scaled so that the maximum for both measures is at 1. For the two-class problem the measures differ only slightly, and will nearly always choose the same split point.

Another convex criterion, not quite of the above class, is *twoing*, for which

$$I(A) = \min_{C_1 C_2} \left[ f(p_{C_1}) + f(p_{C_2}) \right]$$

where $C_1$, $C_2$ ...