Classification and Regression Trees

Setting the derivative of the complete-data Q-function to zero yields the re-estimation formulas for the remaining Gaussian mixture parameters:

$$\hat{\boldsymbol{\mu}}_k = \frac{\displaystyle\sum_{i=1}^{N} \frac{c_k\, p_k(\mathbf{y}_i \mid \Phi_k)}{p(\mathbf{y}_i \mid \Phi)}\, \mathbf{y}_i}{\displaystyle\sum_{i=1}^{N} \frac{c_k\, p_k(\mathbf{y}_i \mid \Phi_k)}{p(\mathbf{y}_i \mid \Phi)}} \qquad (4.107)$$

$$\hat{\boldsymbol{\Sigma}}_k = \frac{\displaystyle\sum_{i=1}^{N} \frac{c_k\, p_k(\mathbf{y}_i \mid \Phi_k)}{p(\mathbf{y}_i \mid \Phi)}\, (\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^t}{\displaystyle\sum_{i=1}^{N} \frac{c_k\, p_k(\mathbf{y}_i \mid \Phi_k)}{p(\mathbf{y}_i \mid \Phi)}} \qquad (4.108)$$

The quantity $\gamma_i^k$ defined in Eq. (4.103) can be interpreted as the posterior probability that the observed data $\mathbf{y}_i$ belong to Gaussian component $k$ ($N_k(\mathbf{y} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$). This information as to whether the observed data $\mathbf{y}_i$ should belong to Gaussian component $k$ is hidden and can only be observed through the hidden variable $\mathbf{x}$ ($c_k$). The EM algorithm described above is used to uncover how likely the observed data $\mathbf{y}_i$ are to fall in each Gaussian component. The re-estimation formulas are consistent with our intuition: these MLE formulas calculate the weighted contribution of each data sample according to the mixture posterior probability $\gamma_i^k$.

In fact, VQ is an approximate version of the EM algorithm. A traditional VQ with the Mahalanobis distance measure is equivalent to a mixture Gaussian VQ with the following conditions:

$$c_k = 1/K \qquad (4.109)$$

$$\gamma_i^k = \begin{cases} 1, & \mathbf{y}_i \in C_k \\ 0, & \text{otherwise} \end{cases} \qquad (4.110)$$

The difference between VQ and the EM algorithm is that VQ performs a hard assignment of the data sample $\mathbf{y}_i$ to clusters (cells), while the EM algorithm performs a soft assignment of the data sample $\mathbf{y}_i$ to clusters. As discussed in Chapter 8, this difference carries over to the case of the Viterbi algorithm vs. the Baum-Welch algorithm in hidden Markov models.

4.5. CLASSIFICATION AND REGRESSION TREES

Classification and regression trees (CART) [15, 82] have been used in a variety of pattern recognition applications. Binary decision trees, with splitting questions attached to each node, provide an easy representation that interprets and predicts the structure of a set of data. The application of binary decision trees is much like playing the number-guessing game, where the examinee tries to deduce the chosen number by asking a series of binary number-comparing questions.

Consider a simple binary decision tree for height classification. Every person's data in the study may consist of several measurements, including race, gender, weight, age, occupation, and so on. The goal of the study is to develop a classification method to assign a person one of the following five height classes: tall (T), medium-tall (t), medium (M), medium-short (s) and short (S). Figure 4.14 shows an example of such a binary tree structure. With this binary decision tree, one can easily predict the height class for any new person (with all the measured data, but no height information) by traversing the binary tree. Traversing the binary tree is done by answering a series of yes/no questions in the traversed nodes with the measured data. When the answer is no, the right branch is traversed next; otherwise, the left branch is traversed instead. When the path ends at a leaf node, you can use its attached label as the height class for the new person. If you have the average height for each leaf node (computed by averaging the heights from those people who fall in the same leaf node during training), you can actually use the average height in the leaf node to predict the height for the new person.
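The traversal just described is easy to express in code. The sketch below is a minimal illustration rather than the book's implementation: the node structure, the question functions, and the two-question example tree for the height study are all hypothetical, and a yes answer follows the left branch as in the text.

```python
class Node:
    """A binary decision-tree node: internal nodes hold a yes/no question,
    leaf nodes hold a class label and (optionally) an average height."""
    def __init__(self, question=None, yes=None, no=None, label=None, avg_height=None):
        self.question = question      # callable: sample dict -> bool
        self.yes, self.no = yes, no   # child nodes
        self.label = label            # height class at a leaf, e.g. "T"
        self.avg_height = avg_height  # average training height at a leaf

    def is_leaf(self):
        return self.question is None

def classify(node, sample):
    """Traverse the tree: a yes answer follows the left (yes) branch, a no answer
    the right (no) branch; return the leaf's label and its average height."""
    while not node.is_leaf():
        node = node.yes if node.question(sample) else node.no
    return node.label, node.avg_height

# A hypothetical two-question tree for the height example.
tree = Node(
    question=lambda s: s["gender"] == "male",
    yes=Node(question=lambda s: s["age"] > 12,
             yes=Node(label="T", avg_height=182.0),
             no=Node(label="M", avg_height=150.0)),
    no=Node(question=lambda s: s["age"] > 12,
            yes=Node(label="M", avg_height=168.0),
            no=Node(label="s", avg_height=145.0)),
)

print(classify(tree, {"gender": "male", "age": 30}))   # ('T', 182.0)
```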
This classification process is similar to a rule-based system, where the classification is carried out by a sequence of decision rules. The choice and order of rules applied in a rule-based system is typically designed subjectively by hand, through an introspective analysis based on the impressions and intuitions of a limited number of data samples. CART, on the other hand, provides an automatic and data-driven framework to construct the decision process based on objective criteria. Most statistical pattern recognition techniques are designed for data samples having a standard structure with homogeneous variables. CART is designed instead to handle data samples with high dimensionality, mixed data types, and nonstandard data structure. It has the following advantages over other pattern recognition techniques:

- CART can be applied to any data structure through appropriate formulation of the set of potential questions.
- The binary tree structure allows for compact storage, efficient classification, and easily understood interpretation of the predictive structure of the data.
- It often provides, without additional effort, not only classification and recognition, but also an estimate of the misclassification rate for each class.
- It not only handles missing data, but also is very robust to outliers and mislabeled data samples.

To construct a CART from the training samples with their classes (let's denote the set as $\mathcal{S}$), we first need to find a set of questions regarding the measured variables; e.g., "Is age > 12?", "Is occupation = professional basketball player?", "Is gender = male?" and so on. Once the question set is determined, CART uses a greedy algorithm to generate the decision trees. All training samples $\mathcal{S}$ are placed in the root of the initial tree. The best question is then chosen from the question set to split the root into two nodes. Of course, we need a measurement of how well each question splits the data samples in order to pick the best question. The algorithm recursively splits the most promising node with the best question until the right-sized tree is obtained. We describe next how to construct the question set, how to measure each split, how to grow the tree, and how to choose the right-sized tree.

4.5.1. Choice of Question Set

Assume that the training data has the following format:

$$\mathbf{x} = (x_1, x_2, \ldots, x_d) \qquad (4.111)$$

where each variable $x_i$ is of a discrete or continuous data type. We can construct a standard set of questions $Q$ as follows:

1. Each question is about the value of only a single variable. Questions of this type are called simple or singleton questions.
2. If $x_i$ is a discrete variable from the set $\{c_1, c_2, \ldots, c_K\}$, $Q$ includes all questions of the following form:

$$\{\text{Is } x_i \in S?\} \qquad (4.112)$$

where $S$ is any subset of $\{c_1, c_2, \ldots, c_K\}$.

3. If $x_i$ is a continuous variable, $Q$ includes all questions of the following form:

$$\{\text{Is } x_i \le c?\} \quad \text{for } c \in (-\infty, \infty) \qquad (4.113)$$
The question subset generated from discrete variables (in condition 2 above) is clearly a finite set ($2^{K-1} - 1$ distinct splits). On the other hand, the question subset generated from continuous variables (in condition 3 above) seems to be an infinite set based on the definition. Fortunately, since the training data samples are finite, there are only a finite number of distinct splits for the training data. For a continuous variable $x_i$, the data points in $\mathcal{S}$ contain at most $M$ distinct values $v_1, v_2, \ldots, v_M$. There are only at most $M$ different splits generated by the set of questions in the form:

$$\{\text{Is } x_i \le c_n?\} \quad n = 1, 2, \ldots, M \qquad (4.114)$$

where $c_n = \dfrac{v_{n-1} + v_n}{2}$ and $v_0 = 0$. Therefore, questions related to a continuous variable also form a finite subset. The fact that $Q$ is a finite set allows the enumeration of all possible questions in each node during tree growing.

The construction of a question set is similar to that of rules in a rule-based system. Instead of using the all-possible question set $Q$, some people use knowledge selectively to pick a subset of $Q$ which is sensitive to pattern classification. For example, the vowel subset and consonant subset are a natural choice for these sensitive questions for phoneme classification. However, the beauty of CART is its ability to use all possible questions related to the measured variables, because CART has a statistical data-driven framework to determine the decision process (as described in subsequent sections). Instead of setting some constraints on the questions (splits), most CART systems use all the possible questions for $Q$.
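To make the finiteness argument concrete, the sketch below enumerates the candidate simple questions for a single variable; it is a hypothetical helper rather than part of any CART library. Discrete questions are kept one per distinct split (hence the $2^{K-1}-1$ count), and continuous questions use midpoint thresholds between adjacent distinct observed values in the spirit of Eq. (4.114).

```python
from itertools import combinations

def candidate_questions(column, is_discrete):
    """Enumerate the simple questions for one variable.

    Discrete variable with values {c1,...,cK}: all subset questions "Is x in S?";
    only 2^(K-1) - 1 of them give distinct splits, since a subset and its
    complement induce the same partition.
    Continuous variable: threshold questions "Is x <= c?" with c the midpoint
    between adjacent distinct observed values.
    """
    values = sorted(set(column))
    if is_discrete:
        kept, seen = [], set()
        for r in range(1, len(values)):                 # proper, non-empty subsets
            for subset in combinations(values, r):
                s = frozenset(subset)
                comp = frozenset(values) - s
                if comp not in seen:                    # keep one question per split
                    seen.add(s)
                    kept.append(("in", s))
        return kept
    midpoints = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    return [("<=", c) for c in midpoints]

# Hypothetical columns from the height study.
print(len(candidate_questions(["white", "black", "asian"], is_discrete=True)))  # 3 = 2^(3-1) - 1
print(candidate_questions([12, 35, 35, 60], is_discrete=False))                 # [('<=', 23.5), ('<=', 47.5)]
```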
4.5.2. Splitting Criteria

A question in the CART framework represents a split (partition) of data samples. All the leaf nodes ($L$ in total) represent $L$ disjoint subsets $A_1, A_2, \ldots, A_L$. Now that we have the entire potential question set $Q$, the task is how to find the best question for a node split. The selection of the best question is equivalent to finding the best split for the data samples of the node.

Since each node $t$ in the tree contains some training samples, we can compute the corresponding class probability density function $P(\omega_i \mid t)$. The classification process for the node can then be interpreted as a random process based on $P(\omega \mid t)$. Since our goal is classification, the objective of a decision tree is to reduce the uncertainty of the event being decided upon. We want the leaf nodes to be as pure as possible in terms of the class distribution. Let $Y$ be the random variable of classification decision for data sample $X$. We can define the weighted entropy for any node $t$ as follows:

$$\bar{H}_t(Y) = H_t(Y)P(t) \qquad (4.115)$$

$$H_t(Y) = -\sum_i P(\omega_i \mid t)\log P(\omega_i \mid t) \qquad (4.116)$$

where $P(\omega_i \mid t)$ is the percentage of data samples for class $i$ in node $t$, and $P(t)$ is the prior probability of visiting node $t$ (equivalent to the ratio of the number of data samples in node $t$ to the total number of training data samples). With this weighted entropy definition, the splitting criterion is equivalent to finding the question which gives the greatest entropy reduction, where the entropy reduction for a question $q$ that splits a node $t$ into nodes $l$ and $r$ can be defined as:

$$\Delta \bar{H}_t(q) = \bar{H}_t(Y) - \left(\bar{H}_l(Y) + \bar{H}_r(Y)\right) = \bar{H}_t(Y) - \bar{H}_t(Y \mid q) \qquad (4.117)$$

The reduction in entropy is also the mutual information between $Y$ and question $q$. The task becomes that of evaluating the entropy reduction $\Delta \bar{H}_t(q)$ for each potential question (split), and picking the question with the greatest entropy reduction, that is,

$$q^* = \arg\max_q \left(\Delta \bar{H}_t(q)\right) \qquad (4.118)$$

If we define the entropy for a tree $T$ as the sum of weighted entropies for all the terminal nodes, we have:

$$\bar{H}(T) = \sum_{t \text{ is terminal}} \bar{H}_t(Y) \qquad (4.119)$$

It can be shown that the tree-growing (splitting) process repeatedly reduces the entropy of the tree. The resulting tree thus has better classification power. For continuous pdfs, likelihood gain is often used instead, since there is no straightforward entropy measurement [43]. Suppose one specific split divides the data into two groups, $X_1$ and $X_2$, which can then be used to train two Gaussian distributions $N_1(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $N_2(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. The log-likelihoods for generating these two data groups are:

$$L_1(X_1 \mid N_1) = \log \prod_i N(\mathbf{x}_i, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = -\left(d\log 2\pi + \log|\boldsymbol{\Sigma}_1| + d\right) a / 2 \qquad (4.120)$$

$$L_2(X_2 \mid N_2) = \log \prod_i N(\mathbf{x}_i, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) = -\left(d\log 2\pi + \log|\boldsymbol{\Sigma}_2| + d\right) b / 2 \qquad (4.121)$$

where $d$ is the dimensionality of the data, and $a$ and $b$ are the sample counts for the data groups $X_1$ and $X_2$ respectively. Now if the data $X_1$ and $X_2$ are merged into one group and modeled by one Gaussian $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, according to MLE we have

$$\boldsymbol{\mu} = \frac{a}{a+b}\boldsymbol{\mu}_1 + \frac{b}{a+b}\boldsymbol{\mu}_2 \qquad (4.122)$$

$$\boldsymbol{\Sigma} = \frac{a}{a+b}\left[\boldsymbol{\Sigma}_1 + (\boldsymbol{\mu}_1 - \boldsymbol{\mu})(\boldsymbol{\mu}_1 - \boldsymbol{\mu})^t\right] + \frac{b}{a+b}\left[\boldsymbol{\Sigma}_2 + (\boldsymbol{\mu}_2 - \boldsymbol{\mu})(\boldsymbol{\mu}_2 - \boldsymbol{\mu})^t\right] \qquad (4.123)$$

Thus, the likelihood gain of splitting the data $X$ into two groups $X_1$ and $X_2$ is:

$$\Delta L_t(q) = L_1(X_1 \mid N_1) + L_2(X_2 \mid N_2) - L(X \mid N) = \left[(a+b)\log|\boldsymbol{\Sigma}| - a\log|\boldsymbol{\Sigma}_1| - b\log|\boldsymbol{\Sigma}_2|\right] / 2 \qquad (4.124)$$

For regression purposes, the most popular splitting criterion is the mean squared error measure, which is consistent with common least-squares regression methods. For instance, suppose we need to investigate the real height as a regression function of the measured variables in the height study. Instead of finding a height classification, we could simply use the average height in each node to predict the height for any data sample. Suppose $Y$ is the actual height for training data $X$; then the overall regression (prediction) error for a node $t$ can be defined as:

$$E(t) = \sum_{X \in t} \left| Y - d(X) \right|^2 \qquad (4.125)$$

where $d(X)$ is the regression (predictive) value of $Y$. Now, instead of finding the question with the greatest entropy reduction, we want to find the question with the largest squared-error reduction. That is, we want to pick the question $q^*$ that maximizes:

$$\Delta E_t(q) = E(t) - \left(E(l) + E(r)\right) \qquad (4.126)$$

where $l$ and $r$ are the leaves of node $t$. We define the expected squared error $V(t)$ for a node $t$ as the overall regression error divided by the total number of data samples in the node:

$$V(t) = E\!\left[\left| Y - d(X)\right|^2\right] = \frac{1}{N(t)}\sum_{X \in t}\left| Y - d(X)\right|^2 \qquad (4.127)$$

Note that $V(t)$ is actually the variance estimate of the height if $d(X)$ is made to be the average height of the data samples in the node. With $V(t)$, we define the weighted squared error $\bar{V}(t)$ for a node $t$ as follows:

$$\bar{V}(t) = V(t)P(t) = \left[\frac{1}{N(t)}\sum_{X \in t}\left| Y - d(X)\right|^2\right] P(t) \qquad (4.128)$$

Finally, the splitting criterion can be rewritten as:

$$\Delta \bar{V}_t(q) = \bar{V}(t) - \left(\bar{V}(l) + \bar{V}(r)\right) \qquad (4.129)$$

Based on Eqs. (4.117) and (4.129), one can see the analogy between entropy and variance in the splitting criteria for CART. The use of entropy or variance as the splitting criterion rests on the assumption of uniform misclassification costs and uniform prior distributions. When nonuniform misclassification costs and prior distributions are used, other measures might be used as splitting criteria. Noteworthy ones are the Gini index of diversity and the twoing rule. Those interested in alternative splitting criteria can refer to [11, 15].

For a wide range of splitting criteria, the properties of the resulting CARTs are empirically insensitive to these choices. Instead, the criterion used to get the right-sized tree is much more important. We discuss this issue in Section 4.5.6.
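As a concrete illustration of Eqs. (4.115)-(4.118), here is a minimal sketch of the entropy-reduction criterion for one node. The function names and the data layout (a list of class labels plus one boolean mask per candidate question) are assumptions made for this example, not CART library calls.

```python
import math
from collections import Counter

def weighted_entropy(labels, n_total):
    """H_bar_t(Y) = H_t(Y) * P(t), Eqs. (4.115)-(4.116): node entropy weighted
    by the fraction of all training samples that fall in the node."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    n = len(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h * (n / n_total)

def entropy_reduction(labels, split_mask, n_total):
    """Delta H_bar_t(q) of Eq. (4.117) for one candidate question q,
    where split_mask[i] is True if sample i answers 'yes'."""
    left = [y for y, m in zip(labels, split_mask) if m]
    right = [y for y, m in zip(labels, split_mask) if not m]
    return (weighted_entropy(labels, n_total)
            - weighted_entropy(left, n_total)
            - weighted_entropy(right, n_total))

def best_question(labels, candidate_masks, n_total):
    """Eq. (4.118): pick the question with the greatest entropy reduction."""
    return max(range(len(candidate_masks)),
               key=lambda i: entropy_reduction(labels, candidate_masks[i], n_total))

# Toy node: 6 samples, two candidate splits.
labels = ["T", "T", "T", "S", "S", "S"]
masks = [[True, True, True, False, False, False],   # separates the classes perfectly
         [True, False, True, False, True, False]]   # uninformative split
print(best_question(labels, masks, n_total=6))      # 0
```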
4.5.3. Growing the Tree

Given the question set $Q$ and the splitting criterion $\Delta \bar{H}_t(q)$, the tree-growing algorithm starts from the initial root-only tree. At each node of the tree, the algorithm searches through the variables one by one, from $x_1$ to $x_d$. For each variable, it uses the splitting criterion to find the best question (split). It can then pick the best question out of the $d$ best single-variable questions. The procedure continues splitting each node until one of the following conditions is met for the node:

1. No more splits are possible; that is, all the data samples in the node belong to the same class.
2. The greatest entropy reduction of the best question (split) falls below a pre-set threshold $\beta$, i.e.:

$$\max_{q \in Q} \Delta \bar{H}_t(q) < \beta \qquad (4.130)$$

3. The number of data samples falling in the leaf node $t$ is below some threshold. This is to assure that there are enough training tokens for each leaf node if one needs to estimate some parameters associated with the node.

When a node cannot be further split, it is declared a terminal node. When all active (non-split) nodes are terminal, the tree-growing algorithm stops.

The algorithm is greedy because the question selected for any given node is the one that seems to be the best, without regard to subsequent splits and nodes. Thus, the algorithm constructs a tree that is locally optimal, but not necessarily globally optimal (though hopefully globally good enough). This tree-growing algorithm has been successfully applied in many applications [5, 39, 60]. A dynamic programming algorithm for determining global optimality is described in [78]; however, it is suitable only in restricted applications with relatively few variables.
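The greedy procedure can be written as a short recursive routine. The sketch below is an illustrative outline under simplifying assumptions (questions supplied as description/predicate pairs, trees represented as nested dictionaries), not a production implementation; the early returns implement stop conditions 1-3 above.

```python
import math
from collections import Counter

def node_entropy(labels):
    """Unweighted class entropy H_t(Y) of Eq. (4.116) for the samples in a node."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def grow_tree(samples, labels, questions, beta=1e-3, min_samples=2, n_total=None):
    """Greedy CART-style growing (Section 4.5.3) with hypothetical types:
    each question is (description, predicate), predicate(sample) -> bool.
    Returns nested dicts; leaves carry the majority class of their samples."""
    n_total = n_total or len(samples)
    majority = Counter(labels).most_common(1)[0][0]

    # Stop conditions 1 and 3: pure node, or too few samples to split further.
    if len(set(labels)) == 1 or len(samples) < min_samples:
        return {"leaf": majority}

    def reduction(pred):
        """Weighted entropy reduction of Eq. (4.117) for one candidate split."""
        yes = [y for s, y in zip(samples, labels) if pred(s)]
        no = [y for s, y in zip(samples, labels) if not pred(s)]
        if not yes or not no:
            return -math.inf
        h = lambda part: node_entropy(part) * len(part) / n_total
        return h(labels) - h(yes) - h(no)

    best = max(questions, key=lambda q: reduction(q[1]))
    # Stop condition 2: the best split gains less than the threshold beta (Eq. 4.130).
    if reduction(best[1]) < beta:
        return {"leaf": majority}

    yes_set = {i for i, s in enumerate(samples) if best[1](s)}
    pick = lambda idx, xs: [xs[i] for i in idx]
    yes_idx = sorted(yes_set)
    no_idx = [i for i in range(len(samples)) if i not in yes_set]
    return {"question": best[0],
            "yes": grow_tree(pick(yes_idx, samples), pick(yes_idx, labels),
                             questions, beta, min_samples, n_total),
            "no": grow_tree(pick(no_idx, samples), pick(no_idx, labels),
                            questions, beta, min_samples, n_total)}

# Toy run on the height example.
samples = [{"age": 5}, {"age": 10}, {"age": 20}, {"age": 40}]
labels = ["S", "S", "T", "T"]
questions = [("age > 12", lambda s: s["age"] > 12), ("age > 30", lambda s: s["age"] > 30)]
print(grow_tree(samples, labels, questions))
# {'question': 'age > 12', 'yes': {'leaf': 'T'}, 'no': {'leaf': 'S'}}
```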
4.5.4. Missing Values and Conflict Resolution

Sometimes the available data sample $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ has some value $x_j$ missing. This missing-value case can be handled by the use of surrogate questions (splits). The idea is intuitive. We define a similarity measurement between any two questions (splits) $q$ and $\tilde{q}$ of a node $t$. If the best question of node $t$ is the question $q$ on the variable $x_i$, we can find the question $\tilde{q}$ that is most similar to $q$ on a variable other than $x_i$; $\tilde{q}$ is our best surrogate question. Similarly, we find the second-best surrogate question, the third-best, and so on. The surrogate questions are considered as backup questions for the case of missing $x_i$ values in the data samples, and they are used in descending order to continue tree traversal for those data samples. The surrogate question gives CART a unique ability to handle the case of missing data. The similarity measurement is basically a measurement reflecting the similarity of the class probability density functions [15].

When choosing the best question for splitting a node, several questions on the same variable $x_i$ may achieve the same entropy reduction and generate the same partition. As in rule-based problem-solving systems, a conflict resolution procedure [99] is needed to decide which question to use. For example, discrete questions $q_1$ and $q_2$ have the following form:

$$q_1: \{\text{Is } x_i \in S_1?\} \qquad (4.131)$$

$$q_2: \{\text{Is } x_i \in S_2?\} \qquad (4.132)$$

Suppose $S_1$ is a subset of $S_2$, and one particular node contains only data samples whose $x_i$ value takes values in $S_1$ and no other. Then question $q_1$ or $q_2$ produces the same splitting pattern and therefore achieves exactly the same amount of entropy reduction. In this case, we call $q_1$ a sub-question of question $q_2$, because $q_1$ is a more specific version.

A specificity-ordering conflict resolution strategy is used to favor the discrete question with fewer elements, because it is more specific to the current node. In other words, if the elements of a question are a subset of the elements of another question with the same entropy reduction, the question with the subset of elements is preferred. Preferring more specific questions prevents decision trees from over-generalizing. Specificity-ordering conflict resolution can be implemented easily by presorting the set of discrete questions by the number of elements they contain, in descending order, before applying them to decision trees. A similar specificity-ordering conflict resolution can also be implemented for continuous-variable questions.
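A rough sketch of surrogate-question selection follows. The similarity measure here is a deliberately simple stand-in, namely the fraction of the node's samples that the two questions send to the same branch, whereas the measure in [15] is based on the similarity of the resulting class probability density functions; the data layout and function names are hypothetical.

```python
def split_agreement(pred_a, pred_b, samples):
    """Fraction of the node's samples that two questions send to the same branch
    (a simple stand-in for the class-pdf-based similarity measure of [15])."""
    return sum(pred_a(s) == pred_b(s) for s in samples) / len(samples)

def surrogate_questions(best_var, best_pred, questions, samples):
    """Rank questions on variables other than best_var by similarity to the best
    question; they are consulted in descending order when best_var is missing."""
    others = [q for q in questions if q[0] != best_var]
    return sorted(others, key=lambda q: split_agreement(best_pred, q[2], samples),
                  reverse=True)

def answer_with_surrogates(sample, best_var, best_pred, surrogates):
    """Answer the node's question, falling back to surrogates when best_var is missing."""
    if sample.get(best_var) is not None:
        return best_pred(sample)
    for var, desc, pred in surrogates:
        if sample.get(var) is not None:      # first surrogate whose variable is observed
            return pred(sample)
    return False                             # nothing observed: default to the "no" branch

# Hypothetical node: the best question is on age; occupation acts as a surrogate.
samples = [{"age": 8, "occupation": "student"}, {"age": 35, "occupation": "engineer"},
           {"age": 40, "occupation": "engineer"}, {"age": 10, "occupation": "student"}]
questions = [("age", "age > 12", lambda s: s["age"] > 12),
             ("occupation", "occupation != student", lambda s: s["occupation"] != "student")]
surr = surrogate_questions("age", questions[0][2], questions, samples)
print(answer_with_surrogates({"age": None, "occupation": "engineer"},
                             "age", questions[0][2], surr))   # True
```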
4.5.5. Complex Questions

One problem with allowing only simple questions is that the data may be over-fragmented, resulting in similar leaves in different locations of the tree. For example, when the best question (rule) to split a node is actually a composite question of the form "Is $x_i \in S_1$?" or "Is $x_i \in S_2$?", a system allowing only simple questions will generate two separate questions to split the data into three clusters rather than two, as shown in Figure 4.15. Also, the data for which the answer is yes are inevitably fragmented across two shaded nodes. This is inefficient and ineffective, since these two very similar data clusters may now both contain insufficient training examples, which could potentially handicap future tree growing. Splitting data unnecessarily across different nodes leads to unnecessary computation, redundant clusters, reduced trainability, and less accurate entropy reduction.

[Figure 4.15 An over-split tree for the question "Is $x_i \in S_1$?" or "Is $x_i \in S_2$?"]

We deal with this problem by using a composite-question construction [38, 40]. It involves conjunctive and disjunctive combinations of all questions (and their negations). A composite question is formed by first growing a tree with simple questions only and then clustering the leaves into two sets. Figure 4.16 shows the formation of one composite question. After merging, the structure is still a binary question. To construct the composite question, multiple OR operators are used to describe the composite condition leading to either one of the final clusters, and AND operators are used to describe the relation within a particular route. Finally, a Boolean reduction algorithm is used to simplify the Boolean expression of the composite question.

[Figure 4.16 The formation of a composite question from simple questions.]

To speed up the process of constructing composite questions, we constrain the number of leaves or the depth of the binary tree through heuristics. The most frequently used heuristic is a limit on the depth when searching for a composite question. Since composite questions are essentially binary questions, we use the same greedy tree-growing algorithm to find the best composite question for each node and keep growing the tree until the stop criterion is met. The use of composite questions not only enables flexible clustering, but also improves entropy reduction. Growing the sub-tree a little deeper before constructing the composite question may achieve a longer-range optimum, which is preferable to the local optimum achieved in the original greedy algorithm that uses simple questions only.

The construction of composite questions can also be applied to continuous variables to obtain complex rectangular partitions. However, other techniques are used to obtain general partitions generated by hyperplanes that are not perpendicular to the coordinate axes. Questions of this type typically involve a linear combination of continuous variables in the following form [15]:

$$\left\{\text{Is } \sum_i a_i x_i \le c?\right\} \qquad (4.133)$$
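The composite-question idea can be sketched as a Boolean combination of the routes through a small simple-question subtree: OR across the routes that reach the leaves clustered into the yes set, AND along each route. The subtree, the clustering, and the questions below are hypothetical.

```python
def composite_predicate(yes_paths):
    """Build a composite question from the routes that lead to the leaves
    clustered into the 'yes' set: OR over routes, AND within a route.
    Each route is a list of (predicate, required_answer) pairs."""
    def ask(sample):
        return any(all(pred(sample) == ans for pred, ans in route)
                   for route in yes_paths)
    return ask

# Hypothetical sub-tree grown with simple questions q1, q2 whose leaves were
# clustered so that the routes (q1=yes) and (q1=no AND q2=yes) form one cluster.
q1 = lambda s: s["x"] in {"a", "b"}
q2 = lambda s: s["x"] in {"c"}
composite = composite_predicate([[(q1, True)], [(q1, False), (q2, True)]])

print(composite({"x": "a"}))   # True  (route q1=yes)
print(composite({"x": "c"}))   # True  (route q1=no, q2=yes)
print(composite({"x": "d"}))   # False
```

In this toy case a Boolean reduction step would simplify the composite down to the single question "Is $x \in \{a, b, c\}$?", which is exactly the kind of split that a tree restricted to the two simple questions above cannot express in one node.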
4.5.6. The Right-Sized Tree

One of the most critical problems for CART is that the tree may be strictly tailored to the training data and have no generalization capability. If you keep splitting leaf nodes for entropy reduction until each leaf node contains data from one single class, the tree possesses a zero percent classification error on the training set. This is an over-optimistic estimate of the test-set misclassification rate. Independent test sample estimation or cross-validation is often used to prevent decision trees from over-modeling idiosyncrasies of the training data. To get a right-sized tree, you minimize the misclassification rate for future independent test data.

Before we describe the solution for finding the right-sized tree, let's define a couple of useful terms. Naturally, we use the plurality rule $\delta(t)$ to choose the class for a node $t$:

$$\delta(t) = \arg\max_i P(\omega_i \mid t) \qquad (4.134)$$

Similar to the notation used in Bayes' decision theory, we can define the misclassification rate $R(t)$ for a node $t$ as:

$$R(t) = r(t)P(t) \qquad (4.135)$$

where $r(t) = 1 - \max_i P(\omega_i \mid t)$ and $P(t)$ is the frequency (probability) of the data falling in node $t$. The overall misclassification rate for the whole tree $T$ is defined as:

$$R(T) = \sum_{t \in \tilde{T}} R(t) \qquad (4.136)$$

where $\tilde{T}$ represents the set of terminal nodes. If a nonuniform misclassification cost $c(i \mid j)$, the cost of misclassifying class $j$ data as class $i$ data, is used, $r(t)$ is redefined as:

$$r(t) = \min_i \sum_j c(i \mid j) P(j \mid t) \qquad (4.137)$$

As we mentioned, $R(T)$ can be made arbitrarily small (eventually reduced to zero) for the training data if we keep growing the tree. The key now is how we choose the tree that can minimize $R^*(T)$, which denotes the misclassification rate on independent test data. Almost no tree as initially grown performs well on independent test data. In fact, using more complicated stopping rules to limit the tree growing seldom works: the growing is either stopped too soon at some terminal nodes, or continued too far in other parts of the tree. Instead of inventing some clever stopping criterion to stop the tree growing at the right size, we let the tree over-grow (based on the rules in Section 4.5.3). We then use a pruning strategy to gradually cut back the tree until the minimum $R^*(T)$ is achieved. In the next section we describe an algorithm to prune an over-grown tree, minimum cost-complexity pruning.

4.5.6.1. Minimum Cost-Complexity Pruning

To prune a tree, we need to find the subtree (or branch) that makes the least impact in terms of a cost measure, whether it is pruned or not. This candidate to be pruned is called the weakest subtree. To define such a weakest subtree, we first need to define the cost measure.

DEFINITION 1: For any sub-tree $T$ of $T_{\max}$ ($T \preceq T_{\max}$), let $|\tilde{T}|$ denote the number of terminal nodes in tree $T$.

DEFINITION 2: Let $\alpha \ge 0$ be a real number called the complexity parameter. The cost-complexity measure can be defined as:

$$R_\alpha(T) = R(T) + \alpha|\tilde{T}| \qquad (4.138)$$

DEFINITION 3: For each $\alpha$, define the minimal cost-complexity subtree $T(\alpha) \preceq T_{\max}$ that minimizes $R_\alpha(T)$, that is:

$$T(\alpha) = \arg\min_{T \preceq T_{\max}} R_\alpha(T) \qquad (4.139)$$

Based on Definitions 2 and 3, if $\alpha$ is small, the penalty for having a large tree is small and $T(\alpha)$ will be large. In fact, $T(0) = T_{\max}$, because $T_{\max}$ has a zero misclassification rate and so minimizes $R_0(T)$. On the other hand, as $\alpha$ increases, $T(\alpha)$ becomes smaller and smaller. For a sufficiently large $\alpha$, $T(\alpha)$ may collapse into a tree with only the root.
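The definitions above translate directly into a small amount of code. The sketch below assumes a hypothetical tree representation in which each leaf stores the class counts of the training samples that reached it; it computes $R(T)$ and the cost-complexity measure of Eq. (4.138).

```python
def misclassification_rate(leaf, n_total):
    """R(t) = r(t) * P(t) of Eq. (4.135) for a leaf, where `leaf` is a hypothetical
    dict holding the class counts of the training samples that reached it."""
    counts = leaf["class_counts"]
    n = sum(counts.values())
    r = 1.0 - max(counts.values()) / n          # r(t) = 1 - max_i P(w_i | t)
    return r * (n / n_total)                    # P(t) = n(t) / N

def terminal_nodes(tree):
    """Collect the leaves of a tree represented as nested dicts."""
    if "yes" not in tree:
        return [tree]
    return terminal_nodes(tree["yes"]) + terminal_nodes(tree["no"])

def cost_complexity(tree, alpha, n_total):
    """R_alpha(T) = R(T) + alpha * |T~| of Eqs. (4.136) and (4.138)."""
    leaves = terminal_nodes(tree)
    r_tree = sum(misclassification_rate(leaf, n_total) for leaf in leaves)
    return r_tree + alpha * len(leaves)

# Toy tree: two leaves, 10 training samples in total.
tree = {"question": "age > 12",
        "yes": {"class_counts": {"T": 5, "M": 1}},
        "no": {"class_counts": {"S": 3, "M": 1}}}
print(cost_complexity(tree, alpha=0.0, n_total=10))   # 0.2  (pure misclassification)
print(cost_complexity(tree, alpha=0.05, n_total=10))  # 0.3  (penalized for 2 leaves)
```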
The increase of $\alpha$ produces a sequence of pruned trees, and this is the basis of the pruning process. The pruning algorithm rests on two theorems. The first is given as follows.

THEOREM 1: For every value of $\alpha$, there exists a unique minimal cost-complexity subtree $T(\alpha)$ as defined in Definition 3. (You can find the proof in [15].)

To progressively prune the tree, we need to find the weakest subtree (node). The idea of the weakest subtree is the following: if we collapse the weakest subtree into a single terminal node, the cost-complexity measure increases least. For any node $t$ in the tree $T$, let $\{t\}$ denote the subtree containing only the node $t$, and let $T_t$ denote the branch starting at node $t$. Then we have:

$$R_\alpha(T_t) = R(T_t) + \alpha|\tilde{T}_t| \qquad (4.140)$$

$$R_\alpha(\{t\}) = R(t) + \alpha \qquad (4.141)$$

When $\alpha$ is small, $T_t$ has a smaller cost-complexity than the single-node tree $\{t\}$. However, when $\alpha$ increases to a point where the cost-complexity measures for $T_t$ and $\{t\}$ are the same, it makes sense to collapse $T_t$ into a single terminal node $\{t\}$. Therefore, we decide the critical value for $\alpha$ by solving the following inequality:

$$R_\alpha(T_t) \le R_\alpha(\{t\}) \qquad (4.142)$$

We obtain:

$$\alpha \le \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1} \qquad (4.143)$$

Based on Eq. (4.143), we define a measurement $\eta(t)$ for each node $t$ in tree $T$:

$$\eta(t) = \begin{cases} \dfrac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}, & t \notin \tilde{T} \\[2ex] +\infty, & t \in \tilde{T} \end{cases} \qquad (4.144)$$

Based on the measurement $\eta(t)$, we then define the weakest subtree $T_{t_1}$ as the tree branch starting at the node $t_1$ such that:

$$t_1 = \arg\min_{t \in T} \eta(t) \qquad (4.145)$$

$$\alpha_1 = \eta(t_1) \qquad (4.146)$$

As $\alpha$ increases, the node $t_1$ is the first node for which $R_\alpha(\{t\})$ becomes equal to $R_\alpha(T_t)$. At this point, it makes sense to prune subtree $T_{t_1}$ (collapse $T_{t_1}$ into a single-node subtree $\{t_1\}$), and $\alpha_1$ is the value of $\alpha$ at which the pruning occurs. The tree $T$ after this pruning is referred to as $T_1$, i.e.:

$$T_1 = T - T_{t_1} \qquad (4.147)$$

We then use the same process to find the weakest subtree $T_{t_2}$ in $T_1$ and the new pruning point $\alpha_2$. After pruning away $T_{t_2}$ from $T_1$ to form the new pruned tree $T_2$, we repeat the same process to find the next weakest subtree and pruning point. If we continue the process, we get a sequence of decreasing pruned trees:

$$T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\} \qquad (4.148)$$

where $\{r\}$ is the single-node tree containing the root of tree $T$, with corresponding pruning points:

$$\alpha_0 < \alpha_1 < \alpha_2 < \alpha_3 < \cdots \qquad (4.149)$$

where $\alpha_0 = 0$. With the process above, the following theorem (which is basic for minimum cost-complexity pruning) can be proved.

THEOREM 2: Let $T_0$ be the original tree $T$. For $k \ge 0$ and $\alpha_k \le \alpha < \alpha_{k+1}$,

$$T(\alpha) = T(\alpha_k) = T_k \qquad (4.150)$$
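The weakest-link computation of Eqs. (4.144)-(4.147) can be sketched directly on such a tree. The representation (nested dictionaries in which every node carries its own resubstitution misclassification $R(t)$) and the helper names are assumptions made for illustration; calling prune_once repeatedly yields the decreasing sequence of Eq. (4.148) together with the pruning points of Eq. (4.149).

```python
import copy

def leaves(tree):
    """Terminal nodes of a tree stored as nested dicts with 'yes'/'no' children."""
    if "yes" not in tree:
        return [tree]
    return leaves(tree["yes"]) + leaves(tree["no"])

def weakest_link(tree):
    """Return (eta, node) minimizing Eq. (4.144); every node carries its own
    resubstitution misclassification R(t) under the key 'R'."""
    if "yes" not in tree:
        return float("inf"), tree               # eta(t) = +inf for terminal nodes
    branch_R = sum(leaf["R"] for leaf in leaves(tree))
    eta_here = (tree["R"] - branch_R) / (len(leaves(tree)) - 1)
    best = (eta_here, tree)
    for child in (tree["yes"], tree["no"]):
        best = min(best, weakest_link(child), key=lambda p: p[0])
    return best

def prune_once(tree):
    """One minimum cost-complexity pruning step: collapse the weakest subtree
    (Eq. 4.147) and report the pruning point alpha_k (Eq. 4.146)."""
    tree = copy.deepcopy(tree)
    alpha, node = weakest_link(tree)
    node.pop("yes", None)
    node.pop("no", None)                        # the branch becomes a single leaf
    return alpha, tree

# Toy over-grown tree with per-node R(t) values (hypothetical numbers).
T = {"R": 0.30,
     "yes": {"R": 0.12, "yes": {"R": 0.05}, "no": {"R": 0.06}},
     "no":  {"R": 0.10, "yes": {"R": 0.03}, "no": {"R": 0.05}}}
alpha1, T1 = prune_once(T)
print(round(alpha1, 3), len(leaves(T1)))        # 0.01 3  -> the left branch was collapsed
```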
4.5.6.2. Independent Test Sample Estimation

The minimum cost-complexity pruning algorithm can progressively prune the over-grown tree to form a decreasing sequence of subtrees $T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\}$, where $T_k = T(\alpha_k)$, $\alpha_0 = 0$ and $T_0 = T$. The task now is simply to choose one of those subtrees as the optimal-sized tree. Our goal is to find the optimal-sized tree that minimizes the misclassification rate on independent test data. In the independent test sample estimation approach, it is straightforward to set aside an independent test set $\Re$, typically about one third of the training set $\mathcal{S}$. We use the remaining two thirds of the training set, $\mathcal{S} - \Re$ (still abundant), to train the initial tree $T$ and apply the minimum cost-complexity pruning algorithm to attain the decreasing sequence of subtrees $T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\}$. Next, the test set $\Re$ is run through the sequence of subtrees to get the corresponding estimates of test-set misclassification $R^*(T), R^*(T_1), R^*(T_2), \ldots, R^*(\{r\})$. The optimal-sized tree is then picked as the one with the minimum test-set misclassification measure, i.e.:

$$k^* = \arg\min_k R^*(T_k) \qquad (4.151)$$

The independent test sample estimation approach has the drawback that it reduces the effective training sample size. This is why it is used only when there is abundant training data. Under most circumstances, where training data is limited, cross-validation is often used.

4.5.6.3. Cross-Validation

CART can be pruned via $v$-fold cross-validation, which follows the same principle of cross-validation described in Section 4.2.3. First it randomly divides the training set $\mathcal{S}$ into $v$ disjoint subsets $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_v$, each containing roughly the same number of data samples. It then defines the $i$-th training set as

$$\mathcal{S}^i = \mathcal{S} - \mathcal{S}_i \quad i = 1, 2, \ldots, v \qquad (4.152)$$

so that $\mathcal{S}^i$ contains the fraction $(v-1)/v$ of the original training set. $v$ is usually chosen to be large, such as 10.

In $v$-fold cross-validation, $v$ auxiliary trees are grown together with the main tree $T$ grown on $\mathcal{S}$. The $i$-th tree is grown on the $i$-th training set $\mathcal{S}^i$. By applying minimum cost-complexity pruning, for any given value of the cost-complexity parameter $\alpha$, we can obtain the corresponding minimum cost-complexity subtrees $T(\alpha)$ and $T^i(\alpha)$, $i = 1, 2, \ldots, v$. According to Theorem 2 in Section 4.5.6.1, those minimum cost-complexity subtrees form $v + 1$ sequences of subtrees:

$$T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\} \qquad (4.153)$$

$$T^i \succ T_1^i \succ T_2^i \succ T_3^i \cdots \succ \{r^i\} \quad i = 1, 2, \ldots, v \qquad (4.154)$$

The basic assumption of cross-validation is that the procedure is stable if $v$ is large; that is, $T(\alpha)$ should have the same classification accuracy as $T^i(\alpha)$. Although we cannot directly estimate the test-set misclassification $R^*(T(\alpha))$ for the main tree, we can approximate it via the test-set misclassification measures $R^*(T^i(\alpha))$, since each data sample in $\mathcal{S}$ occurs in one and only one test set $\mathcal{S}_i$. The $v$-fold cross-validation estimate $R^{CV}(T(\alpha))$ can be computed as:

$$R^{CV}(T(\alpha)) = \frac{1}{v}\sum_{i=1}^{v} R^*\!\left(T^i(\alpha)\right) \qquad (4.155)$$

Similar to Eq. (4.151), once $R^{CV}(T(\alpha))$ is computed, the optimal $v$-fold cross-validation tree $T_{k^{CV}}$ can be found through

$$k^{CV} = \arg\min_k R^{CV}(T_k) \qquad (4.156)$$

Cross-validation is computationally expensive in comparison with independent test sample estimation, though it makes more effective use of all training data and reveals useful information regarding the stability of the tree structure. Since the auxiliary trees are grown on a smaller training set (a fraction $(v-1)/v$ of the original training data), they tend to have a higher misclassification rate. Therefore, the cross-validation estimates $R^{CV}(T)$ tend to be an over-estimation of the misclassification rate.
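A sketch of the $v$-fold estimate of Eq. (4.155) follows. It is a simplified outline: the grower and the error measure are passed in as hypothetical callables, and the candidate complexity values are supplied as a list rather than taken from the main tree's pruning points, as the full procedure would do.

```python
import random

def v_fold_partition(samples, v, seed=0):
    """Split the training set S into v disjoint subsets S_1..S_v (Eq. 4.152)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::v] for i in range(v)]

def cross_validation_estimate(samples, v, grow_and_prune, test_error, alphas):
    """R^CV(T(alpha)) of Eq. (4.155): for each fold, grow an auxiliary tree on
    S - S_i and average its held-out error at each complexity value alpha.

    grow_and_prune(train, alpha) -> pruned auxiliary tree  (hypothetical callable)
    test_error(tree, held_out)   -> misclassification rate (hypothetical callable)
    """
    folds = v_fold_partition(samples, v)
    estimates = {}
    for alpha in alphas:
        errors = []
        for i in range(v):
            held_out = folds[i]
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            aux_tree = grow_and_prune(train, alpha)
            errors.append(test_error(aux_tree, held_out))
        estimates[alpha] = sum(errors) / v
    return estimates

def pick_best_alpha(estimates):
    """Eq. (4.156): choose the complexity value with minimum cross-validation error."""
    return min(estimates, key=estimates.get)

# Tiny demonstration with stub callables standing in for a real tree grower.
data = list(range(20))
stub_grow = lambda train, alpha: {"alpha": alpha}                   # pretend tree
stub_error = lambda tree, held: 0.1 + 0.05 * abs(tree["alpha"] - 0.02)
est = cross_validation_estimate(data, v=5, grow_and_prune=stub_grow,
                                test_error=stub_error, alphas=[0.0, 0.02, 0.1])
print(pick_best_alpha(est))   # 0.02
```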
The algorithm for generating a CART tree is summarized in Algorithm 4.5.

ALGORITHM 4.5: THE CART ALGORITHM

Step 1: Question Set: Create a standard set of questions $Q$ that consists of all possible questions about the measured variables.
Step 2: Splitting Criterion: Pick a splitting criterion that can evaluate all the possible questions in any node. Usually it is either an entropy-like measurement for classification trees or mean squared error for regression trees.
Step 3: Initialization: Create a tree with one (root) node, consisting of all training samples.
Step 4: Split Candidates: Find the best composite question for each terminal node:
  a. Generate a tree with several simple-question splits as described in Section 4.5.3.
  b. Cluster the leaf nodes into two classes according to the same splitting criterion.
  c. Based on the clustering done in (b), construct a corresponding composite question.
Step 5: Split: Out of all the split candidates in Step 4, perform the split with the best criterion value.
Step 6: Stop Criterion: If all the leaf nodes contain data samples from a single class, or all the potential splits yield an improvement below a pre-set threshold $\beta$, go to Step 7; otherwise go to Step 4.
Step 7: Prune: Use the independent test sample estimate or the cross-validation estimate to prune the original tree to the optimal size.
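Putting the pieces together, a top-level driver mirroring Algorithm 4.5 might look like the outline below. Every callable it composes is a hypothetical stand-in for the routines sketched earlier in this section, not a reference implementation.

```python
def cart(train_samples, train_labels, build_questions, grow, prune_sequence,
         select_best, heldout=None):
    """Outline of Algorithm 4.5. All arguments except the data are hypothetical
    callables supplied by the caller:
      build_questions(samples)        -> question set Q            (Steps 1-2)
      grow(samples, labels, Q)        -> over-grown tree           (Steps 3-6)
      prune_sequence(tree)            -> [(alpha_k, T_k), ...]     (Section 4.5.6.1)
      select_best(subtrees, heldout)  -> right-sized tree          (Sections 4.5.6.2-4.5.6.3)
    """
    questions = build_questions(train_samples)            # Step 1: question set Q
    tree = grow(train_samples, train_labels, questions)   # Steps 3-6: greedy growing
    subtrees = prune_sequence(tree)                       # pruning sequence T > T1 > ... > {r}
    return select_best(subtrees, heldout)                 # Step 7: pick the right-sized tree
```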