If $X_j$ is categorical, taking values in, say, $\{1, 2, ..., M\}$, then $Q$ contains all questions of the form {Is $X_j \in A$?}, where $A$ ranges over all subsets of $\{1, 2, ..., M\}$. The splits for all $p$ variables constitute the standard set of questions.

Jia Li http://www.stat.psu.edu/jiali Classification/Decision Trees (I)

Goodness of Split

The goodness of split is measured by an impurity function defined for each node. Intuitively, we want each leaf node to be "pure", that is, one class dominates.

The Impurity Function
Definition: An impurity function is a function $\phi$ defined on the set of all $K$-tuples of numbers $(p_1, ..., p_K)$ satisfying $p_j \geq 0$, $j = 1, ..., K$, $\sum_j p_j = 1$, with the properties:
1. $\phi$ is a maximum only at the point $(1/K, 1/K, ..., 1/K)$.
2. $\phi$ achieves its minimum only at the points $(1, 0, ..., 0)$, $(0, 1, 0, ..., 0)$, ..., $(0, 0, ..., 0, 1)$.
3. $\phi$ is a symmetric function of $p_1, ..., p_K$, i.e., if you permute the $p_j$, $\phi$ remains constant.

Definition: Given an impurity function $\phi$, define the impurity measure $i(t)$ of a node $t$ as

$$i(t) = \phi(p(1 \mid t), p(2 \mid t), ..., p(K \mid t)),$$

where $p(j \mid t)$ is the estimated probability of class $j$ within node $t$. The goodness of a split $s$ for node $t$, denoted by $\Phi(s, t)$, is defined by

$$\Phi(s, t) = \Delta i(s, t) = i(t) - p_R \, i(t_R) - p_L \, i(t_L),$$

where $p_R$ and $p_L$ are the proportions of the samples in node $t$ that go to the right node $t_R$ and the left node $t_L$ respectively.

Define $I(t) = i(t)p(t)$...
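
The definitions above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the slides: the Gini index is used as one common choice of impurity function (it satisfies the three properties, but the preview does not name a specific function), samples answering "yes" to a question are sent to the left node, and all function names are hypothetical. Trivial questions (empty $A$ or $A$ equal to the full value set) are skipped because they do not split the node.

```python
from collections import Counter
from itertools import combinations


def gini(probs):
    """Gini index, an assumed choice of impurity function phi."""
    return 1.0 - sum(p * p for p in probs)


def node_impurity(labels, classes, phi=gini):
    """i(t) = phi(p(1|t), ..., p(K|t)); p(j|t) is estimated by class frequency."""
    counts = Counter(labels)
    n = len(labels)
    return phi([counts[c] / n for c in classes])


def goodness_of_split(left, right, classes, phi=gini):
    """Phi(s, t) = i(t) - p_R * i(t_R) - p_L * i(t_L)."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return (node_impurity(parent, classes, phi)
            - p_right * node_impurity(right, classes, phi)
            - p_left * node_impurity(left, classes, phi))


def categorical_questions(values):
    """Yield nonempty proper subsets A for questions 'Is X_j in A?'."""
    values = sorted(values)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            yield set(subset)


def best_categorical_split(x, y):
    """Return (A, goodness) for the first subset A with the largest goodness."""
    classes = sorted(set(y))
    best = None
    for A in categorical_questions(set(x)):
        # Assumption: "yes" answers go to the left node.
        left = [yi for xi, yi in zip(x, y) if xi in A]
        right = [yi for xi, yi in zip(x, y) if xi not in A]
        gain = goodness_of_split(left, right, classes)
        if best is None or gain > best[1]:
            best = (A, gain)
    return best


# Toy data: category 2 contains only class "b", so {2} separates the classes.
x = [1, 1, 2, 2, 3, 3]
y = ["a", "a", "b", "b", "a", "a"]
best_A, best_gain = best_categorical_split(x, y)
print(best_A, round(best_gain, 4))
```

Note that $A$ and its complement induce the same partition, so this enumeration evaluates each split twice; a practical implementation would avoid that duplication.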