Classification/Decision Trees (II)
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali

Right Sized Trees

Let the expected misclassification rate of a tree $T$ be $R^*(T)$. Recall the resubstitution estimate for $R^*(T)$ is

  $R(T) = \sum_{t \in \tilde{T}} r(t)\,p(t) = \sum_{t \in \tilde{T}} R(t)$,

where $\tilde{T}$ is the set of terminal nodes of $T$, $r(t)$ is the resubstitution error rate within node $t$, $p(t)$ is the proportion of samples falling in $t$, and $R(t) = r(t)p(t)$.

$R(T)$ is biased downward: for any split of a node $t$ into $t_L$ and $t_R$,

  $R(t) \geq R(t_L) + R(t_R)$,

so growing the tree can never increase the resubstitution estimate.

Digit recognition example

  No. terminal nodes   R(T)   R^ts(T)
  71                   .00    .42
  63                   .00    .40
  58                   .03    .39
  40                   .10    .32
  34                   .12    .32
  19                   .29    .31
  10                   .29    .30
   9                   .32    .34
   7                   .41    .47
   6                   .46    .54
   5                   .53    .61
   2                   .75    .82
   1                   .86    .91

The estimate $R(T)$ becomes increasingly less accurate as the tree grows larger. The test-set estimate $R^{ts}(T)$ first decreases as the tree grows, hits its minimum at the tree with 10 terminal nodes, and then increases as the tree grows further.

Preliminaries for Pruning

Grow a very large initial tree $T_{max}$, for example:
1. until all terminal nodes are pure (contain only one class) or contain only identical measurement vectors; or
2. until the number of data points in each terminal node is no greater than a certain threshold, say 5, or even 1.
As long as the tree is sufficiently large, the exact size of the initial tree is not critical.

Terminology:
1. Descendant: a node $t'$ is a descendant of node $t$ if there is a connected path down the tree leading from $t$ to $t'$.
2. Ancestor: $t$ is an ancestor of $t'$ if $t'$ is its descendant.
3. A branch $T_t$ of $T$ with root node $t \in T$ consists of the node $t$ and all descendants of $t$ in $T$.
4. Pruning a branch $T_t$ from a tree $T$ consists of deleting from $T$ all descendants of $t$, that is, cutting off all of $T_t$ except its root node. The tree pruned this way is denoted by $T - T_t$.
5. If $T'$ is obtained from $T$ by successively pruning off branches, then $T'$ is called a pruned subtree of $T$, denoted by $T' \preceq T$.

Subtrees

Even for a moderately sized $T_{max}$, there is an enormously large number of subtrees, and an even larger number of ways to prune the initial tree down to them. A "selective" pruning procedure is therefore needed:
- the pruning should be optimal in a certain sense;
- the search over different ways of pruning should have a manageable computational load.

Minimal Cost-Complexity Pruning

Definition of the cost-complexity measure: for any subtree $T \preceq T_{max}$, define its complexity as $|\tilde{T}|$, the number of terminal nodes in $T$. Let $\alpha \geq 0$ be a real number called the complexity parameter, and define the cost-complexity measure $R_\alpha(T)$ as

  $R_\alpha(T) = R(T) + \alpha |\tilde{T}|$.

For each value of $\alpha$, find the subtree $T(\alpha)$ that minimizes $R_\alpha(T)$, i.e.,

  $R_\alpha(T(\alpha)) = \min_{T \preceq T_{max}} R_\alpha(T)$.

If $\alpha$ is small, the penalty for having a large number of terminal nodes is small and $T(\alpha)$ tends to be large. For $\alpha$ sufficiently large, the minimizing subtree $T(\alpha)$ consists of the root node only.

Since there are at most a finite number of subtrees of $T_{max}$, $R_\alpha(T(\alpha))$ yields different values for only finitely many $\alpha$'s. $T(\alpha)$ continues to be the minimizing tree as $\alpha$ increases, until a jump point is reached.
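Minimal cost-complexity pruning is implemented directly in scikit-learn. Below is a minimal sketch, not from the original slides, assuming scikit-learn is installed and using its bundled digits data as a stand-in for the digit recognition example: cost_complexity_pruning_path returns the jump points of $\alpha$, and refitting with ccp_alpha set to a given $\alpha$ yields the minimizing subtree $T(\alpha)$.

```python
# A minimal sketch of minimal cost-complexity pruning using scikit-learn
# (an assumption of this example; the slides do not prescribe a library).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a large tree T_max, then recover the sequence of jump points alpha_k.
tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X_train, y_train)

# For each alpha, refitting with ccp_alpha=alpha yields the minimizing
# subtree T(alpha): larger alpha, smaller tree, down to the root alone.
for alpha in path.ccp_alphas[::20]:     # every 20th jump point, for brevity
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}  "
          f"test error={1 - pruned.score(X_test, y_test):.3f}")
```

Note that scikit-learn measures $R(T)$ by total impurity rather than misclassification cost, so the jump points differ numerically from the slides' $R(T)$, but the nested-subtree structure is the same.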
Two questions arise:
- Is there a unique subtree $T \preceq T_{max}$ which minimizes $R_\alpha(T)$?
- In the minimizing sequence of trees $T_1, T_2, \ldots$, is each subtree obtained by pruning upward from the previous subtree, i.e., does the nesting $T_1 \succ T_2 \succ \cdots \succ \{t_1\}$ hold?

Definition: the smallest minimizing subtree $T(\alpha)$ for complexity parameter $\alpha$ is defined by the conditions:
1. $R_\alpha(T(\alpha)) = \min_{T \preceq T_{max}} R_\alpha(T)$;
2. if $R_\alpha(T) = R_\alpha(T(\alpha))$, then $T(\alpha) \preceq T$.

If the subtree $T(\alpha)$ exists, it must be unique. It can be proved that for every value of $\alpha$, a smallest minimizing subtree exists.

The starting point for the pruning is not $T_{max}$, but rather $T_1 = T(0)$, the smallest subtree of $T_{max}$ satisfying $R(T_1) = R(T_{max})$. To obtain $T_1$: let $t_L$ and $t_R$ be any two terminal nodes in $T_{max}$ descended from the same parent node $t$. If $R(t) = R(t_L) + R(t_R)$, prune off $t_L$ and $t_R$. Continue this process until no more pruning is possible; the resulting tree is $T_1$.

For $T_t$ any branch of $T_1$, define

  $R(T_t) = \sum_{t' \in \tilde{T}_t} R(t')$,

where $\tilde{T}_t$ is the set of terminal nodes of $T_t$. For $t$ any nonterminal node of $T_1$, $R(t) > R(T_t)$.

Weakest-Link Cutting

For any node $t \in T_1$, set $R_\alpha(\{t\}) = R(t) + \alpha$. For any branch $T_t$, define $R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T}_t|$.

When $\alpha = 0$, $R_0(T_t) < R_0(\{t\})$. The inequality holds for sufficiently small $\alpha$, but at some critical value of $\alpha$ the two cost-complexities become equal, and for $\alpha$ exceeding this threshold the inequality is reversed. Solving the inequality $R_\alpha(T_t) < R_\alpha(\{t\})$ gives

  $\alpha < \dfrac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}$.

The right-hand side is always positive.

Define a function $g_1(t)$ on $t \in T_1$ by

  $g_1(t) = \dfrac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}$ if $t \notin \tilde{T}_1$, and $g_1(t) = +\infty$ if $t \in \tilde{T}_1$.

Define the weakest link $\bar{t}_1$ in $T_1$ as the node such that

  $g_1(\bar{t}_1) = \min_{t \in T_1} g_1(t)$,

and put $\alpha_2 = g_1(\bar{t}_1)$.

As $\alpha$ increases, $\bar{t}_1$ is the first node at which the single node $\{t\}$ becomes preferable to the branch $T_{\bar{t}_1}$ descended from it. $\alpha_2$ is the first value after $\alpha_1 = 0$ that yields a strict subtree of $T_1$ with a smaller cost-complexity at that complexity parameter. That is, for all $\alpha_1 \leq \alpha < \alpha_2$, the tree with smallest cost-complexity is $T_1$. Let $T_2 = T_1 - T_{\bar{t}_1}$.

Repeat the previous steps with $T_2$ in place of $T_1$: find the weakest link in $T_2$ and prune off the branch at that node:

  $g_2(t) = \dfrac{R(t) - R(T_{2t})}{|\tilde{T}_{2t}| - 1}$ for $t \in T_2$, $t \notin \tilde{T}_2$, and $g_2(t) = +\infty$ for $t \in \tilde{T}_2$;
  $g_2(\bar{t}_2) = \min_{t \in T_2} g_2(t)$, $\alpha_3 = g_2(\bar{t}_2)$, $T_3 = T_2 - T_{\bar{t}_2}$.

If at any stage there are multiple weakest links, for instance $g_k(\bar{t}_k) = g_k(\bar{t}'_k)$, then prune both: $T_{k+1} = T_k - T_{\bar{t}_k} - T_{\bar{t}'_k}$. Two branches are either nested or share no node.

A decreasing sequence of nested subtrees is obtained:

  $T_1 \succ T_2 \succ T_3 \succ \cdots \succ \{t_1\}$.

Theorem: the $\{\alpha_k\}$ form an increasing sequence, that is, $\alpha_k < \alpha_{k+1}$, $k \geq 1$, where $\alpha_1 = 0$; and for $k \geq 1$ and $\alpha_k \leq \alpha < \alpha_{k+1}$, $T(\alpha) = T(\alpha_k) = T_k$.

At the initial steps of pruning, the algorithm tends to cut off large sub-branches with many leaf nodes; as the tree becomes smaller, it tends to cut off fewer.
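To make the weakest-link computation concrete, here is a from-scratch sketch of one pruning step. The Node class and its field names are hypothetical, introduced only for illustration; each node stores its own resubstitution cost $R(t) = r(t)p(t)$, and the tree is assumed full binary as in CART.

```python
# A from-scratch sketch of one weakest-link cutting step. The Node class
# and its fields are hypothetical, introduced only to illustrate g(t).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    R: float                       # resubstitution cost R(t) = r(t) p(t)
    left: Optional["Node"] = None  # full binary tree: both children or none
    right: Optional["Node"] = None

def branch_cost_and_leaves(t: Node):
    """Return (R(T_t), |tilde T_t|): summed terminal cost and leaf count."""
    if t.left is None:             # terminal node
        return t.R, 1
    rl, nl = branch_cost_and_leaves(t.left)
    rr, nr = branch_cost_and_leaves(t.right)
    return rl + rr, nl + nr

def weakest_link(t: Node, best=None):
    """Minimize g(t) = (R(t) - R(T_t)) / (|tilde T_t| - 1) over internal nodes."""
    if t.left is None:             # g is +infinity at terminal nodes
        return best
    r_branch, n_leaves = branch_cost_and_leaves(t)
    g = (t.R - r_branch) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, t)
    best = weakest_link(t.left, best)
    return weakest_link(t.right, best)

# One pruning step on a toy tree: cut the branch below the weakest link,
# turning that node into a terminal node (T_{k+1} = T_k - T_{t_bar}).
root = Node(R=0.50,
            left=Node(R=0.20, left=Node(R=0.05), right=Node(R=0.10)),
            right=Node(R=0.25, left=Node(R=0.12), right=Node(R=0.12)))
alpha_next, t_bar = weakest_link(root)
t_bar.left = t_bar.right = None
print(f"next jump point alpha = {alpha_next:.3f}")   # 0.010 for this toy tree
```

Repeating weakest_link and the cut until only the root remains generates the nested sequence $T_1 \succ T_2 \succ \cdots \succ \{t_1\}$ together with the increasing jump points $\alpha_k$.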
Digit recognition example:

  Tree    T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13
  |T~k|   71  63  58  40  34  19  10   9   7    6    5    2    1

Best Pruned Subtree

Two approaches to choosing the best pruned subtree:
- use a test sample set;
- use cross-validation.

Test set: use a test set to compute the classification error rate of each minimal cost-complexity subtree, then choose the subtree with the minimum test error rate.

Cross-validation is trickier because tree structures are not stable: when the training data set changes slightly, there may be large structural changes in the tree. It is difficult to match a subtree trained on the entire data set with a subtree trained on a majority part of it. The solution is to focus on choosing the right complexity parameter $\alpha$ rather than a particular subtree.

Pruning by Cross-Validation

Consider $V$-fold cross-validation. The original learning sample $L$ is divided by random selection into $V$ subsets $L_v$, $v = 1, \ldots, V$. Let the training sample set in each fold be $L^{(v)} = L - L_v$.

The tree grown on the original set is $T_{max}$. $V$ accessory trees $T^{(v)}_{max}$ are grown on the $L^{(v)}$.

For each value of the complexity parameter $\alpha$, let $T(\alpha)$ and $T^{(v)}(\alpha)$, $v = 1, \ldots, V$, be the corresponding minimal cost-complexity subtrees of $T_{max}$ and $T^{(v)}_{max}$.

Each maximum tree yields a sequence of jump points of $\alpha$: $\alpha_1 < \alpha_2 < \alpha_3 < \cdots$. To find the minimal cost-complexity subtree at a given $\alpha$, find the $\alpha_k$ in the list such that $\alpha_k \leq \alpha < \alpha_{k+1}$; the subtree corresponding to $\alpha_k$ is the subtree for $\alpha$.

The cross-validation error rate of $T(\alpha)$ is computed by

  $R^{CV}(T(\alpha)) = \dfrac{1}{V} \sum_{v=1}^{V} \dfrac{N^{(v)}_{miss}}{N^{(v)}}$,

where $N^{(v)}$ is the number of samples in the test set $L_v$ of fold $v$, and $N^{(v)}_{miss}$ is the number of misclassified samples in $L_v$ using $T^{(v)}(\alpha)$, a pruned tree of $T^{(v)}_{max}$ trained on $L^{(v)}$.

Although $\alpha$ is continuous, there are only finitely many minimal cost-complexity trees grown on $L$. Let $T_k = T(\alpha_k)$. To compute the cross-validation error rate of $T_k$, let $\alpha'_k = \sqrt{\alpha_k \alpha_{k+1}}$ and set

  $R^{CV}(T_k) = R^{CV}(T(\alpha'_k))$.

For the root-node tree $\{t_1\}$, $R^{CV}(\{t_1\})$ is set to the resubstitution cost $R(\{t_1\})$.

Choose the subtree $T_k$ with minimum cross-validation error rate $R^{CV}(T_k)$.

Computation Involved

1. Grow $V + 1$ maximum trees.
2. For each of the $V + 1$ trees, find the sequence of subtrees with minimal cost-complexity.
3. Suppose the maximum tree grown on the original data set, $T_{max}$, has $K$ subtrees.
4. For each of the $K - 1$ values $\alpha'_k$, compute the misclassification rate on each of the $V$ test sample sets, average the error rates, and take the mean as the cross-validation error rate.
5. Find the subtree of $T_{max}$ with minimum $R^{CV}(T_k)$.
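Tying the cross-validation selection together, here is a minimal sketch, again assuming scikit-learn and its digits data (assumptions of this example, not the slides' prescription): the jump points come from cost_complexity_pruning_path on the full sample $L$, each $T_k$ is scored at the geometric-mean point $\alpha'_k = \sqrt{\alpha_k \alpha_{k+1}}$, and cross_val_score plays the role of growing and pruning the $V$ accessory trees.

```python
# A minimal sketch of choosing the complexity parameter by V-fold
# cross-validation; scikit-learn and its digits data are assumptions.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Jump points alpha_1 < alpha_2 < ... from the tree grown on all of L.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Score each T_k at alpha'_k = sqrt(alpha_k * alpha_{k+1}), a representative
# value inside the interval [alpha_k, alpha_{k+1}) on which T(alpha) = T_k.
alpha_primes = np.sqrt(alphas[:-1] * alphas[1:])

V = 10
cv_error = []
for a in alpha_primes:
    # cross_val_score grows an accessory tree on each L^(v) = L - L_v,
    # prunes it at alpha'_k, and tests it on the held-out L_v.
    acc = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=V)
    cv_error.append(1.0 - acc.mean())

best = int(np.argmin(cv_error))
print(f"chosen alpha' = {alpha_primes[best]:.5f}, "
      f"CV error = {cv_error[best]:.3f}")
```

Scoring every $\alpha'_k$ this way refits $V$ trees per candidate, so for a large $T_{max}$ the list of candidates can be thinned before the loop without changing the idea.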