Rpart_TechReport61



... is some partition of the $C$ classes into two disjoint sets. If $C = 2$, twoing is equivalent to the usual impurity index for $f$. Surprisingly, twoing can be calculated almost as efficiently as the usual impurity index. One potential advantage of twoing is that the output may give the user additional insight concerning the structure of the data. It can be viewed as the partition of $C$ into two superclasses which are in some sense the most dissimilar for those observations in $A$. For certain problems there may be a natural ordering of the response categories (e.g. level of education), in which case ordered twoing can be naturally defined by restricting $C_1$ to be an interval $[1, 2, \ldots, k]$ of classes. Twoing is not part of rpart.

[Figure 2: Comparison of Gini and Information indices. Two panels plot Impurity and Scaled Impurity against P for the Gini and Information criteria.]

## 3.2 Incorporating losses

One salutary aspect of the risk reduction criteria not found in the impurity measures is inclusion of the loss function. Two different ways of extending the impurity criteria to also include losses are implemented in CART: the generalized Gini index and altered priors. The rpart software implements only the altered priors method.

### 3.2.1 Generalized Gini index

The Gini index has the following interesting interpretation. Suppose an object is selected at random from one of $C$ classes according to the probabilities $(p_1, p_2, \ldots, p_C)$ and is randomly assigned to a class using the same distribution. The probability of misclassification is

$$\sum_i \sum_{j \ne i} p_i p_j = \sum_i \sum_j p_i p_j - \sum_i p_i^2 = 1 - \sum_i p_i^2 = \text{Gini index for } p$$

Let $L(i, j)$ be the loss of assigning class $j$ to an object which actually belongs to class $i$. The expected cost of misclassification is $\sum_i \sum_j L(i, j)\, p_i p_j$.
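The interpretation above can be checked numerically. The following is a minimal sketch, not part of rpart: the function names and the example probabilities are hypothetical, and the loss matrix is assumed to be stored as `loss[i][j]` = cost of assigning class `j` to an object of true class `i`.

```python
from itertools import product


def gini_index(p):
    """Gini index 1 - sum_i p_i^2 for a list of class probabilities p."""
    return 1.0 - sum(pi * pi for pi in p)


def misclassification_probability(p):
    """Sum of p_i * p_j over all pairs with i != j.

    This is the chance that the true class and the randomly assigned
    class differ when both are drawn independently from p.
    """
    C = len(p)
    return sum(p[i] * p[j] for i, j in product(range(C), repeat=2) if i != j)


def expected_cost(p, loss):
    """Expected misclassification cost: sum_i sum_j L(i, j) * p_i * p_j."""
    C = len(p)
    return sum(loss[i][j] * p[i] * p[j] for i, j in product(range(C), repeat=2))


# Hypothetical three-class example (values chosen only for illustration).
p = [0.5, 0.3, 0.2]
zero_one = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

print(gini_index(p))
print(misclassification_probability(p))
print(expected_cost(p, zero_one))
```

Under zero-one loss the three quantities agree up to floating-point rounding, which is the sense in which the expected cost extends the Gini index.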
This suggests defining a generalized Gini index of impurity by

$$G(p) = \sum_i \sum_j L(i, j)\, p_i p_j$$

The corresponding splitting criterion appears to be promising for applications involving variable misclassification costs. But there are several reasonable objections to it. First, $G(p)$ is not necessarily a concave function of $p$, which was the motivating factor behind impurity measures. More seriously, $G$ symmetrizes the loss matrix before using it. To see this, note that

$$G(p) = \frac{1}{2} \sum_i \sum_j \left[ L(i, j) + L(j, i) \right] p_i p_j$$

In particular, for two-class problems, $G$ in effect ignores the loss matrix.

### 3.2.2 Altered priors

Remember the definition of $R(A)$:

$$R(A) = \sum_{i=1}^{C} p_{iA}\, L(i, \tau(A)) = \sum_{i=1}^{C} \pi_i\, L(i, \tau(A))\, (n_{iA}/n_i)(n/n_A)$$

Assume there exist $\tilde\pi$ and $\tilde L$ such that

$$\tilde\pi_i \tilde L(i, j) = \pi_i L(i, j) \qquad \forall\, i, j \in C$$

Then $R(A)$ is unchanged under the new losses and priors. If $\tilde L$ is proportional to the zero-one loss matrix then the priors $\tilde\pi$ should be used in the splitting criteria. This is possible only if $L$ is of the form

$$L(i, j) = \begin{cases} L_i & i \ne j \\ 0 & i = j \end{cases}$$

in which case

$$\tilde\pi_i = \frac{\pi_i L_i}{\sum_j \pi_j L_j}$$

This is always possible when $C = 2$, and hence altered priors are exact for the two class prob...
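Two of the claims in sections 3.2.1 and 3.2.2 can be illustrated with a short sketch. It is an illustration under stated assumptions, not rpart's implementation: the function names, the two-class probabilities, and the loss values are all hypothetical, and `loss[i][j]` is assumed to hold $L(i, j)$.

```python
def generalized_gini(p, loss):
    """G(p) = sum_i sum_j L(i, j) * p_i * p_j."""
    C = len(p)
    return sum(loss[i][j] * p[i] * p[j] for i in range(C) for j in range(C))


def symmetrize(loss):
    """Return the matrix with entries (L(i, j) + L(j, i)) / 2."""
    C = len(loss)
    return [[0.5 * (loss[i][j] + loss[j][i]) for j in range(C)] for i in range(C)]


def altered_priors(prior, class_loss):
    """Altered priors: pi~_i = pi_i * L_i / sum_j pi_j * L_j.

    class_loss[i] is L_i, the single loss attached to misclassifying
    an object of true class i (the special form of L in section 3.2.2).
    """
    weights = [pi * li for pi, li in zip(prior, class_loss)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-class example.
p = [0.7, 0.3]
loss_a = [[0, 1], [5, 0]]   # missing class 2 costs 5
loss_b = [[0, 5], [1, 0]]   # missing class 1 costs 5

# G depends only on L(1,2) + L(2,1), so it cannot tell these apart.
g_a = generalized_gini(p, loss_a)
g_b = generalized_gini(p, loss_b)
g_sym = generalized_gini(p, symmetrize(loss_a))

# Altered priors for the same problem, with L_1 = 1 and L_2 = 5.
prior = [0.7, 0.3]
class_loss = [1.0, 5.0]
new_prior = altered_priors(prior, class_loss)
scale = sum(pi * li for pi, li in zip(prior, class_loss))  # constant in L~

print(g_a, g_b, g_sym)
print(new_prior)
```

Here `scale` times the zero-one loss plays the role of $\tilde L$: the product `new_prior[i] * scale` equals `prior[i] * class_loss[i]` for each class, so the defining identity holds for every off-diagonal entry.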