Classification and Regression Trees 175

$$\hat{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_k^i y_i}{\sum_{i=1}^{N} \gamma_k^i} = \frac{\sum_{i=1}^{N} \frac{c_k p_k(y_i \mid \Phi_k)}{P(y_i \mid \Phi)} y_i}{\sum_{i=1}^{N} \frac{c_k p_k(y_i \mid \Phi_k)}{P(y_i \mid \Phi)}} \tag{4.107}$$

$$\hat{\Sigma}_k = \frac{\sum_{i=1}^{N} \gamma_k^i (y_i - \mu_k)(y_i - \mu_k)^t}{\sum_{i=1}^{N} \gamma_k^i} = \frac{\sum_{i=1}^{N} \frac{c_k p_k(y_i \mid \Phi_k)}{P(y_i \mid \Phi)} (y_i - \mu_k)(y_i - \mu_k)^t}{\sum_{i=1}^{N} \frac{c_k p_k(y_i \mid \Phi_k)}{P(y_i \mid \Phi)}} \tag{4.108}$$

The quantity $\gamma_k^i$ defined in Eq. (4.103) can be interpreted as the posterior probability that the observed data $y_i$ belong to Gaussian component $k$ ($N_k(y \mid \mu_k, \Sigma_k)$). This information as to whether the observed data $y_i$ should belong to Gaussian component $k$ is hidden and can only be observed through the hidden variable $x$ ($c_k$). The EM algorithm described above is used to uncover how likely the observed data $y_i$ are expected to be in each Gaussian component. The re-estimation formulas are consistent with our intuition. These MLE formulas calculate the weighted contribution of each data sample according to the mixture posterior probability $\gamma_k^i$.

In fact, VQ is an approximate version of EM algorithms. A traditional VQ with the Mahalanobis distance measure is equivalent to a mixture Gaussian VQ with the following conditions:

$$c_k = 1/K \tag{4.109}$$

$$\gamma_k^i = \begin{cases} 1, & y_i \in C_k \\ 0, & \text{otherwise} \end{cases}$$

The difference between VQ and the EM algorithm is that VQ performs a hard assignment of the data sample $y_i$ to clusters (cells) while the EM algorithm performs a soft assignment of the data sample $y_i$ to clusters. As discussed in Chapter 8, this difference carries over to the case of the Viterbi algorithm vs. the Baum-Welch algorithm in hidden Markov models.

4.5. CLASSIFICATION AND REGRESSION TREES

Classification and regression trees (CART) [15, 82] have been used in a variety of pattern
recognition applications. Binary decision trees, with splitting questions attached to each node, provide an easy representation that interprets and predicts the structure of a set of data. The application of binary decision trees is much like playing the number-guessing game, where the examinee tries to deduce the chosen number by asking a series of binary number-comparing questions.

Consider a simple binary decision tree for height classification. Every person's data in the study may consist of several measurements, including race, gender, weight, age, occupation, and so on. The goal of the study is to develop a classification method to assign a person one of the following five height classes: tall (T), medium-tall (t), medium (M), medium-short (s) and short (S). Figure 4.14 shows an example of such a binary tree structure. With this binary decision tree, one can easily predict the height class for any new person (with all the measured data, but no height information) by traversing the binary tree. Traversing the binary tree is done through answering a series of yes/no questions in the traversed nodes with the measured data. When the answer is no, the right branch is traversed next; otherwise, the left branch will be traversed instead. When the path ends at a leaf node, you can use its attached label as the height class for the new person. If you have the average height for each leaf node (computed by averaging the heights from those people who fall in the same leaf node during training), you can actually use the average height in the leaf node to predict the height for the new person.
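The traversal just described can be sketched in a few lines of Python. This is a minimal illustration, not the tree of Figure 4.14: the `Node` class, the `traverse` function, the two questions, and the leaf labels and average heights are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A binary decision-tree node: a question node or a leaf (hypothetical layout)."""
    question: Optional[Callable[[dict], bool]] = None  # yes/no test on one sample
    yes: Optional["Node"] = None          # left branch, taken when the answer is yes
    no: Optional["Node"] = None           # right branch, taken when the answer is no
    label: Optional[str] = None           # height class stored at a leaf
    mean_height: Optional[float] = None   # average training height at a leaf


def traverse(node: Node, sample: dict) -> Node:
    """Answer the question at each node until a leaf is reached."""
    while node.question is not None:
        node = node.yes if node.question(sample) else node.no
    return node


# Toy tree; questions, labels, and heights are made up for illustration.
tree = Node(
    question=lambda s: s["age"] > 12,
    yes=Node(
        question=lambda s: s["gender"] == "male",
        yes=Node(label="T", mean_height=180.0),
        no=Node(label="M", mean_height=165.0),
    ),
    no=Node(label="S", mean_height=130.0),
)

leaf = traverse(tree, {"age": 30, "gender": "female"})
print(leaf.label, leaf.mean_height)
```

The same leaf serves both uses mentioned above: `label` gives the height class, and `mean_height` gives a regression-style prediction.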
This classification process is similar to a rule-based system where the classification is carried out by a sequence of decision rules. The choice and order of rules applied in a rule-based system is typically designed subjectively by hand through an introspective analysis based on the impressions and intuitions of a limited number of data samples. CART, on the other hand, provides an automatic and data-driven framework to construct the decision process based on objective criteria. Most statistical pattern recognition techniques are designed for data samples having a standard structure with homogeneous variables. CART is designed instead to handle data samples with high dimensionality, mixed data types, and nonstandard data structure. It has the following advantages over other pattern recognition techniques:

• CART can be applied to any data structure through appropriate formulation of the set of potential questions.
• The binary tree structure allows for compact storage, efficient classification, and easily understood interpretation of the predictive structure of the data.
• It often provides, without additional effort, not only classification and recognition, but also an estimate of the misclassification rate for each class.
• It not only handles missing data, but also is very robust to outliers and mislabeled data samples.

To construct a CART from the training samples with their classes (let's denote the set as $S$), we first need to find a set of questions regarding the measured variables; e.g., "Is age > 12?", "Is occupation = professional basketball player?", "Is gender = male?" and so on. Once the question set is determined, CART uses a greedy algorithm to generate the decision trees. All training samples $S$ are placed in the root of the initial tree. The best question is then chosen from the question set to split the root into two nodes. Of course, we need a measurement of how well each question splits the data samples to pick the best question.
The algorithm recursively splits the most promising node with the best question until the right-sized tree is obtained. We describe next how to construct the question set, how to measure each split, how to grow the tree, and how to choose the right-sized tree.

4.5.1. Choice of Question Set

Assume that the training data has the following format:

$$x = (x_1, x_2, \ldots, x_d) \tag{4.111}$$

where each variable $x_i$ is a discrete or continuous data type. We can construct a standard set of questions $Q$ as follows:

1. Each question is about the value of only a single variable. Questions of this type are called simple or singleton questions.

2. If $x_i$ is a discrete variable from the set $\{c_1, c_2, \ldots, c_K\}$, $Q$ includes all questions of the following form:

$$\{\text{Is } x_i \in S?\} \tag{4.112}$$

where $S$ is any subset of $\{c_1, c_2, \ldots, c_K\}$.

3. If $x_i$ is a continuous variable, $Q$ includes all questions of the following form:

$$\{\text{Is } x_i \le c?\} \quad \text{for } c \in (-\infty, \infty) \tag{4.113}$$

The question subset generated from discrete variables (in condition 2 above) is clearly a finite set ($2^{K-1} - 1$). On the other hand, the question subset generated from continuous variables (in condition 3 above) seems to be an infinite set based on the definition. Fortunately, since the training data samples are finite, there are only a finite number of distinct splits for the training data. For a continuous variable $x_i$, the data points in $S$ contain at most $M$ distinct values $v_1, v_2, \ldots, v_M$. There are only at most $M$ different splits generated by the set of questions in the form:

$$\{\text{Is } x_i \le c_n?\} \quad n = 1, 2, \ldots, M \tag{4.114}$$
where $c_n = \frac{v_{n-1} + v_n}{2}$ and $v_0 = 0$. Therefore, questions related to a continuous variable also form a finite subset. The fact that $Q$ is a finite set allows the enumerating of all possible questions in each node during tree growing.

The construction of a question set is similar to that of rules in a rule-based system. Instead of using the all-possible question set $Q$, some people use knowledge selectively to pick a subset of $Q$, which is sensitive to pattern classification. For example, the vowel subset and consonant subset are a natural choice for these sensitive questions for phoneme classification. However, the beauty of CART is the ability to use all possible questions related to the measured variables, because CART has a statistical data-driven framework to determine the decision process (as described in subsequent sections). Instead of setting some constraints on the questions (splits), most CART systems use all the possible questions for $Q$.

4.5.2. Splitting Criteria

A question in the CART framework represents a split (partition) of data samples. All the leaf nodes ($L$ in total) represent $L$ disjoint subsets $A_1, A_2, \ldots, A_L$. Now that we have the entire potential question set $Q$, the task is to find the best question for a node split. The selection of the best question is equivalent to finding the best split for the data samples of the node.

Since each node $t$ in the tree contains some training samples, we can compute the corresponding class probability density function $P(\omega \mid t)$. The classification process for the node can then be interpreted as a random process based on $P(\omega \mid t)$. Since our goal is classification, the objective of a decision tree is to reduce the uncertainty of the event being decided upon. We want the leaf nodes to be as pure as possible in terms of the class distribution. Let $Y$ be the random variable of classification decision for data sample $X$. We could define the weighted entropy for any node $t$ as follows:

$$\bar{H}_t(Y) = H_t(Y) P(t) \tag{4.115}$$

$$H_t(Y) = -\sum_i P(\omega_i \mid t) \log P(\omega_i \mid t) \tag{4.116}$$

where $P(\omega_i \mid t)$ is the percentage of data samples for class $i$ in node $t$; and $P(t)$ is the prior probability of visiting node $t$ (equivalent to the ratio of the number of data samples in node $t$ and the total number of training data samples). With this weighted entropy definition, the splitting criterion is equivalent to finding the question which gives the greatest entropy reduction, where the entropy reduction for a question $q$ to split a node $t$ into nodes $l$ and $r$ can be defined as:

$$\Delta \bar{H}_t(q) = \bar{H}_t(Y) - \left(\bar{H}_l(Y) + \bar{H}_r(Y)\right) = \bar{H}_t(Y) - \bar{H}_t(Y \mid q) \tag{4.117}$$

The reduction in entropy is also the mutual information between $Y$ and question $q$.
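As a rough illustration of Eqs. (4.115) to (4.117), the sketch below computes the weighted entropy of a node and the entropy reduction obtained by a candidate question. The function names, the toy ages and class labels, and the candidate thresholds are assumptions made for this example; the natural logarithm is used, which only rescales the values.

```python
import math
from collections import Counter


def entropy(labels):
    """H_t(Y) = -sum_i P(w_i|t) log P(w_i|t), Eq. (4.116), natural log."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def weighted_entropy(labels, n_total):
    """Weighted entropy H_t(Y) * P(t), Eq. (4.115), with P(t) = |t| / n_total."""
    if not labels:
        return 0.0
    return entropy(labels) * len(labels) / n_total


def entropy_reduction(samples, labels, question, n_total):
    """Entropy reduction of Eq. (4.117) when `question` splits one node."""
    left = [y for x, y in zip(samples, labels) if question(x)]
    right = [y for x, y in zip(samples, labels) if not question(x)]
    return (weighted_entropy(labels, n_total)
            - weighted_entropy(left, n_total)
            - weighted_entropy(right, n_total))


# Toy root node: ages with height classes (made-up data).
ages = [5, 8, 11, 20, 35, 50]
classes = ["S", "S", "S", "T", "T", "M"]
n = len(ages)

gains = {}
for threshold in (9.5, 15.5, 42.5):
    # Question of the form "Is x <= c?" for a continuous variable.
    gains[threshold] = entropy_reduction(
        ages, classes, lambda a, c=threshold: a <= c, n)
    print(threshold, round(gains[threshold], 4))

best_threshold = max(gains, key=gains.get)
```

Because the root holds all training samples here, $P(t) = 1$ for the parent; for a deeper node the same code applies with `n_total` kept at the size of the full training set.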
The task becomes that of evaluating the entropy reduction $\Delta \bar{H}_t(q)$ for each potential question (split), and picking the question with the greatest entropy reduction, that is,

$$q^* = \arg\max_q \left(\Delta \bar{H}_t(q)\right) \tag{4.118}$$

If we define the entropy for a tree, $T$, as the sum of weighted entropies for all the terminal nodes, we have:

$$\bar{H}(T) = \sum_{t \text{ is terminal}} \bar{H}_t(Y) \tag{4.119}$$

It can be shown that the tree-growing (splitting) process repeatedly reduces the entropy of the tree. The resulting tree thus has a better classification power. For continuous pdf, likelihood gain is often used instead, since there is no straightforward entropy measurement [43]. Suppose one specific split divides the data into two groups, $X_1$ and $X_2$, which can then be used to train two Gaussian distributions $N_1(\mu_1, \Sigma_1)$ and $N_2(\mu_2, \Sigma_2)$. The log-likelihoods for generating these two data groups are:

$$L_1(X_1 \mid N_1) = \log \prod_{x_1} N(x_1, \mu_1, \Sigma_1) = -\left(d \log 2\pi + \log|\Sigma_1| + d\right) a/2 \tag{4.120}$$
$$L_2(X_2 \mid N_2) = \log \prod_{x_2} N(x_2, \mu_2, \Sigma_2) = -\left(d \log 2\pi + \log|\Sigma_2| + d\right) b/2 \tag{4.121}$$

where $d$ is the dimensionality of the data; and $a$ and $b$ are the sample counts for the data groups $X_1$ and $X_2$ respectively. Now if the data $X_1$ and $X_2$ are merged into one group and modeled by one Gaussian $N(\mu, \Sigma)$, according to MLE, we have

$$\mu = \frac{a}{a+b}\mu_1 + \frac{b}{a+b}\mu_2 \tag{4.122}$$

$$\Sigma = \frac{a}{a+b}\left[\Sigma_1 + (\mu_1 - \mu)(\mu_1 - \mu)^t\right] + \frac{b}{a+b}\left[\Sigma_2 + (\mu_2 - \mu)(\mu_2 - \mu)^t\right] \tag{4.123}$$

Thus, the likelihood gain of splitting the data $X$ into two groups $X_1$ and $X_2$ is:

$$\Delta L_t(q) = L_1(X_1 \mid N_1) + L_2(X_2 \mid N_2) - L_X(X \mid N) = \frac{1}{2}\left[(a+b)\log|\Sigma| - a\log|\Sigma_1| - b\log|\Sigma_2|\right] \tag{4.124}$$

For regression purposes, the most popular splitting criterion is the mean squared error measure, which is consistent with the common least squared regression methods. For instance, suppose we need to investigate the real height as a regression function of the measured variables in the height study. Instead of finding height classification, we could simply use the average height in each node to predict the height for any data sample. Suppose $Y$ is the actual height for training data $X$; then the overall regression (prediction) error for a node $t$ can be defined as:

$$E(t) = \sum_{X \in t} |Y - d(X)|^2 \tag{4.125}$$

where $d(X)$ is the regression (predictive) value of $Y$.

Now, instead of finding the question with greatest entropy reduction, we want to find the question with largest squared error reduction. That is, we want to pick the question $q^*$ that maximizes:

$$\Delta E_t(q) = E(t) - \left(E(l) + E(r)\right) \tag{4.126}$$

where $l$ and $r$ are the leaves of node $t$. We define the expected square error $V(t)$ for a node $t$ as the overall regression error divided by the total number of data samples in the node, written $N(t)$ here.

$$V(t) = E\left(\sum_{X \in t} |Y - d(X)|^2\right) = \frac{1}{N(t)} \sum_{X \in t} |Y - d(X)|^2 \tag{4.127}$$

Note that $V(t)$ is actually the variance estimate of the height, if $d(X)$ is made to be the average height of data samples in the node. With $V(t)$, we define the weighted squared error $\bar{V}(t)$ for a node $t$ as follows.

$$\bar{V}(t) = V(t) P(t) = \left(\frac{1}{N(t)} \sum_{X \in t} |Y - d(X)|^2\right) P(t) \tag{4.128}$$
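The regression quantities of Eqs. (4.125) to (4.128) can be sketched as follows, assuming $d(X)$ is the average height of the node. The function names, the toy ages and heights, and the candidate thresholds are hypothetical; the split is scored with the squared-error reduction of Eq. (4.126).

```python
def node_error(heights):
    """E(t) of Eq. (4.125), with d(X) taken as the node's average height."""
    if not heights:
        return 0.0
    mean = sum(heights) / len(heights)
    return sum((y - mean) ** 2 for y in heights)


def expected_square_error(heights):
    """V(t) of Eq. (4.127): E(t) divided by the number of samples in the node."""
    return node_error(heights) / len(heights)


def weighted_square_error(heights, n_total):
    """Weighted squared error of Eq. (4.128): V(t) * P(t), P(t) = |t| / n_total."""
    return expected_square_error(heights) * len(heights) / n_total


def error_reduction(xs, heights, threshold):
    """Squared-error reduction of Eq. (4.126) for the question 'Is x <= threshold?'."""
    left = [y for x, y in zip(xs, heights) if x <= threshold]
    right = [y for x, y in zip(xs, heights) if x > threshold]
    return node_error(heights) - (node_error(left) + node_error(right))


# Made-up training data: age (measured variable) and height in cm.
ages = [5, 8, 11, 20, 35, 50]
heights = [110.0, 128.0, 145.0, 178.0, 182.0, 170.0]

best = max((9.5, 15.5, 27.5), key=lambda c: error_reduction(ages, heights, c))
print(best, error_reduction(ages, heights, best))
```

Since $P(t) = N(t)/N$ for a training set of $N$ samples, the weighted squared error equals $E(t)/N$, so ranking splits by Eq. (4.126) or by the weighted form gives the same order.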
Finally, the splitting criterion can be rewritten as:

$$\Delta \bar{V}_t(q) = \bar{V}(t) - \left(\bar{V}(l) + \bar{V}(r)\right) \tag{4.129}$$

Based on Eqs. (4.117) and (4.129), one can see the analogy between entropy and variance in the splitting criteria for CART. The use of entropy or variance as splitting criteria is under the assumption of uniform misclassification costs and uniform prior distributions. When nonuniform misclassification costs and prior distributions are used, some other splitting criteria might be used. Noteworthy ones are the Gini index of diversity and the twoing rule. Those interested in alternative splitting criteria can refer to [11, 15].

For a wide range of splitting criteria, the properties of the resulting CARTs are empirically insensitive to these choices. Instead, the criterion used to get the right-sized tree is much more important. We discuss this issue in Section 4.5.6.

4.5.3. Growing the Tree

Given the question set $Q$ and splitting criteria $\Delta \bar{H}_t(q)$, the tree-growing algorithm starts from the initial root-only tree. At each node of the tree, the algorithm searches through the variables one by one, from $x_1$ to $x_N$. For each variable, it uses the splitting criteria to find the best question (split). Then it can pick the best question out of the $N$ best single-variable questions. The procedure can continue splitting each node until either of the following conditions is met for a node:

1. No more splits are possible; that is, all the data samples in the node belong to the same class;

2. The greatest entropy reduction of the best question (split) falls below a preset threshold $\beta$, i.e.:

$$\max_{q \in Q} \Delta \bar{H}_t(q) < \beta \tag{4.130}$$

3. The number of data samples falling in the leaf node $t$ is below some threshold $\alpha$. This is to assure that there are enough training tokens for each leaf node if one needs to estimate some parameters associated with the node.

When a node cannot be further split, it is declared a terminal node. When all active (non-split) nodes are terminal, the tree-growing algorithm stops.

The algorithm is greedy because the question selected for any given node is the one that seems to be the best, without regard to subsequent splits and nodes. Thus, the algorithm constructs a tree that is locally optimal, but not necessarily globally optimal (but hopefully globally good enough). This tree-growing algorithm has been successfully applied in many applications [5, 39, 60]. A dynamic programming algorithm for determining global optimality is described in [78]; however, it is suitable only in restricted applications with relatively
few variables.

4.5.4. Missing Values and Conflict Resolution

Sometimes, the available data sample $X = (x_1, x_2, \ldots, x_d)$ has some value $x_j$ missing. This missing-value case can be handled by the use of surrogate questions (splits). The idea is intuitive. We define a similarity measurement between any two questions (splits) $q$ and $\tilde{q}$ of a node $t$. If the best question of node $t$ is the question $q$ on the variable $x_i$, we can find the question $\tilde{q}$ that is most similar to $q$ on a variable other than $x_i$. $\tilde{q}$ is our best surrogate question. Similarly, we find the second-best surrogate question, third-best and so on. The surrogate questions are considered as the backup questions in the case of missing $x_i$ values in the data samples. The surrogate questions are used in descending order to continue tree traversing for those data samples. The surrogate questions give CART a unique ability to handle the case of missing data. The similarity measurement is basically a measurement reflecting the similarity of the class probability density function [15].

When choosing the best question for splitting a node, several questions on the same variable $x_i$ may achieve the same entropy reduction and generate the same partition. As in rule-based problem solving systems, a conflict resolution procedure [99] is needed to decide which question to use. For example, discrete questions $q_1$ and $q_2$ have the following format:

$$q_1: \{\text{Is } x_i \in S_1?\} \tag{4.131}$$

$$q_2: \{\text{Is } x_i \in S_2?\} \tag{4.132}$$

Suppose $S_1$ is a subset of $S_2$, and one particular node contains only data samples whose $x_i$ value contains only values in $S_1$, but no other. Now question $q_1$ or $q_2$ performs the same splitting pattern and therefore achieves exactly the same amount of entropy reduction. In this case, we call $q_1$ a sub-question of question $q_2$, because $q_1$ is a more specific version.

A specificity ordering conflict resolution strategy is used to favor the discrete question with fewer elements because it is more specific to the current node. In other words, if the elements of a question are a subset of the elements of another question with the same entropy reduction, the question with the subset of elements is preferred. Preferring more specific questions will prevent decision trees from over-generalizing. The specificity ordering conflict resolution can be implemented easily by presorting the set of discrete questions by
the number of elements they contain in descending order, before applying them to decision trees. A similar specificity ordering conflict resolution can also be implemented for continuous-variable questions.

4.5.5. Complex Questions

One problem with allowing only simple questions is that the data may be over-fragmented, resulting in similar leaves in different locations of the tree. For example, when the best question (rule) to split a node is actually a composite question of the form "Is $x_i \in S_1$?" or "Is $x_i \in S_2$?", a system allowing only simple questions will generate two separate questions to split the data into three clusters rather than two as shown in Figure 4.15. Also data for which the answer is yes are inevitably fragmented across two shaded nodes. This is inefficient and ineffective since these two very similar data clusters may now both contain insufficient training examples, which could potentially handicap future tree growing. Splitting data unnecessarily across different nodes leads to unnecessary computation, redundant clusters, reduced trainability, and less accurate entropy reduction.

Figure 4.15 An over-split tree for the question "Is $x_i \in S_1$?" or "Is $x_i \in S_2$?"

We deal with this problem by using a composite-question construction [38, 40]. It involves conjunctive and disjunctive combinations of all questions (and their negations). A
composite question is formed by first growing a tree with simple questions only and then clustering the leaves into two sets. Figure 4.16 shows the formation of one composite question. After merging, the structure is still a binary question. To construct the composite question, multiple OR operators are used to describe the composite condition leading to either one of the final clusters, and AND operators are used to describe the relation within a particular route. Finally, a Boolean reduction algorithm is used to simplify the Boolean expression of the composite question.

To speed up the process of constructing composite questions, we constrain the number of leaves or the depth of the binary tree through heuristics. The most frequently used heuristic is the limitation of the depth when searching a composite question. Since composite questions are essentially binary questions, we use the same greedy tree-growing algorithm to find the best composite question for each node and keep growing the tree until the stop criterion is met. The use of composite questions not only enables flexible clustering, but also improves entropy reduction. Growing the sub-tree a little deeper before constructing the composite question may achieve a longer-range optimum, which is preferable to the local optimum achieved in the original greedy algorithm that used simple questions only.

The construction of composite questions can also be applied to continuous variables to obtain complex rectangular partitions. However, some other techniques are used to obtain general partitions generated by hyperplanes not perpendicular to the coordinate axes. Questions typically have a linear combination of continuous variables in the following form [15]:

$$\left\{\text{Is } \sum_i a_i x_i \le c?\right\} \tag{4.133}$$

Figure 4.16 The formation of a composite question from simple questions.

4.5.6. The Right-Sized Tree

One of the most critical problems for CART is that the tree may be strictly tailored to the
training data and has no generalization capability. When you split a leaf node in the tree to get entropy reduction until each leaf node contains data from one single class, that tree possesses a zero percent classification error on the training set. This is an over-optimistic estimate of the test-set misclassification rate. Independent test sample estimation or cross-validation is often used to prevent decision trees from over-modeling idiosyncrasies of the training data. To get a right-sized tree, you can minimize the misclassification rate for future independent test data.

Before we describe the solution for finding the right-sized tree, let's define a couple of useful terms. Naturally we will use the plurality rule $\delta(t)$ to choose the class for a node $t$:

$$\delta(t) = \arg\max_i P(\omega_i \mid t) \tag{4.134}$$
Similar to the notation used in Bayes' decision theory, we can define the misclassification rate $R(t)$ for a node $t$ as:

$$R(t) = r(t) P(t) \tag{4.135}$$

where $r(t) = 1 - \max_i P(\omega_i \mid t)$ and $P(t)$ is the frequency (probability) of the data falling in node $t$. The overall misclassification rate for the whole tree $T$ is defined as:

$$R(T) = \sum_{t \in \tilde{T}} R(t) \tag{4.136}$$

where $\tilde{T}$ represents the set of terminal nodes. If a nonuniform misclassification cost $c(i \mid j)$, the cost of misclassifying class $j$ data as class $i$ data, is used, $r(t)$ is redefined as:

$$r(t) = \min_i \sum_j c(i \mid j) P(j \mid t) \tag{4.137}$$

As we mentioned, $R(T)$ can be made arbitrarily small (eventually reduced to zero) for the training data if we keep growing the tree. The key now is how we choose the tree that can minimize $R^*(T)$, which is denoted as the misclassification rate of independent test data. Almost no tree initially grown can perform well on independent test data. In fact, using more complicated stopping rules to limit the tree growing seldom works, and it is either stopped too soon at some terminal nodes, or continued too far in other parts of the tree. Instead of inventing some clever stopping criteria to stop the tree growing at the right size, we let the tree over-grow (based on rules in Section 4.5.3). We use a pruning strategy to gradually cut back the tree until the minimum $R^*(T)$ is achieved. In the next section we describe an algorithm to prune an over-grown tree, minimum cost-complexity pruning.

4.5.6.1. Minimum Cost-Complexity Pruning

To prune a tree, we need to find a subtree (or branch) that makes the least impact in terms of a cost measure, whether it is pruned or not. This candidate to be pruned is called the weakest subtree. To define such a weakest subtree, we first need to define the cost measure.

DEFINITION 1: For any subtree $T$ of $T_{max}$ ($T \prec T_{max}$), let $|\tilde{T}|$ denote the number of terminal nodes in tree $T$.

DEFINITION 2: Let $\alpha \ge 0$ be a real number called the complexity parameter. The cost-complexity measure can be defined as:

$$R_\alpha(T) = R(T) + \alpha |\tilde{T}| \tag{4.138}$$

DEFINITION 3: For each $\alpha$, define the minimal cost-complexity subtree $T(\alpha) \prec T_{max}$ that minimizes $R_\alpha(T)$, that is,

$$T(\alpha) = \arg\min_{T \prec T_{max}} R_\alpha(T) \tag{4.139}$$

Based on Definitions 2 and 3, if $\alpha$ is small, the penalty for having a large tree is small and $T(\alpha)$ will be large. In fact, $T(0) = T_{max}$ because $T_{max}$ has a zero misclassification rate, so it will minimize $R_0(T)$. On the other hand, when $\alpha$ increases, $T(\alpha)$ becomes smaller and smaller. For a sufficiently large $\alpha$, $T(\alpha)$ may collapse into a tree with only the root. The increase of $\alpha$ produces a sequence of pruned trees and it is the basis of the pruning process.
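The effect of $\alpha$ in Definitions 2 and 3 can be seen on a small numeric example. The sketch below assumes a hypothetical list of nested candidate subtrees, each summarized only by its misclassification rate $R(T)$ and its number of terminal nodes; the names and numbers are invented, and a real implementation would search the subtrees of an actual tree rather than a fixed table.

```python
def cost_complexity(misclassification_rate, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T~|, Eq. (4.138)."""
    return misclassification_rate + alpha * n_leaves


def minimal_subtree(candidates, alpha):
    """Pick the candidate minimizing R_alpha(T), in the spirit of Eq. (4.139).

    `candidates` maps a subtree name to (R(T), number of terminal nodes).
    Ties are broken in favor of the tree with fewer terminal nodes.
    """
    return min(
        candidates,
        key=lambda name: (cost_complexity(*candidates[name], alpha),
                          candidates[name][1]),
    )


# Hypothetical nested subtrees of an over-grown tree (made-up numbers).
candidates = {
    "T_max": (0.00, 8),
    "T_1": (0.04, 5),
    "T_2": (0.10, 3),
    "root": (0.40, 1),
}

for alpha in (0.0, 0.02, 0.05, 0.2):
    print(alpha, minimal_subtree(candidates, alpha))
```

With these numbers the selected subtree shrinks from `T_max` toward `root` as `alpha` grows, which is the behavior described above.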
The pruning algorithm rests on two theorems. The first is given as follows.

THEOREM 1: For every value of $\alpha$, there exists a unique minimal cost-complexity subtree $T(\alpha)$ as defined in Definition 3.¹²

To progressively prune the tree, we need to find the weakest subtree (node). The idea of a weakest subtree is the following: if we collapse the weakest subtree into a single terminal node, the cost-complexity measure would increase least. For any node $t$ in the tree $T$, let $\{t\}$ denote the subtree containing only the node $t$, and $T_t$ denote the branch starting at node $t$. Then we have

$$R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T}_t| \tag{4.140}$$

$$R_\alpha(\{t\}) = R(t) + \alpha \tag{4.141}$$

When $\alpha$ is small, $T_t$ has a smaller cost-complexity than the single-node tree $\{t\}$.
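Eqs. (4.140) and (4.141) can be compared numerically for one node. In the sketch below, the misclassification rates of the node and of its branch and the number of terminal nodes in the branch are hypothetical values, and the tested values of $\alpha$ are chosen away from the crossover point to avoid floating-point ties.

```python
def branch_cost(r_branch, n_leaves, alpha):
    """R_alpha(T_t) = R(T_t) + alpha * |T~_t|, Eq. (4.140)."""
    return r_branch + alpha * n_leaves


def node_cost(r_node, alpha):
    """R_alpha({t}) = R(t) + alpha, Eq. (4.141)."""
    return r_node + alpha


def keep_branch(r_node, r_branch, n_leaves, alpha):
    """True while the branch T_t is strictly cheaper than the single node {t}."""
    return branch_cost(r_branch, n_leaves, alpha) < node_cost(r_node, alpha)


# Hypothetical node: R(t) = 0.10, branch with R(T_t) = 0.04 and 4 terminal nodes.
r_node, r_branch, n_leaves = 0.10, 0.04, 4

for alpha in (0.0, 0.01, 0.03, 0.05):
    decision = "keep branch" if keep_branch(r_node, r_branch, n_leaves, alpha) else "collapse to {t}"
    print(alpha, decision)
```

For small `alpha` the branch is kept because its lower misclassification rate outweighs the penalty on its extra terminal nodes; once `alpha` is large enough the single node becomes the cheaper choice.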
However, when $\alpha$ increases to a point where the cost-complexity measures for $T_t$ and $\{t\}$ are the same, it makes sense to collapse $T_t$ into a single terminal node $\{t\}$. Therefore, we decide the critical value for $\alpha$ by solving the following inequality:

$$R_\alpha(T_t) \le R_\alpha(\{t\}) \tag{4.142}$$

We obtain:

$$\alpha \le \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1} \tag{4.143}$$

Based on Eq. (4.143), we define a measurement $\eta(t)$ for each node $t$ in tree $T$:

$$\eta(t) = \begin{cases} \dfrac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}, & t \notin \tilde{T} \\ +\infty, & t \in \tilde{T} \end{cases} \tag{4.144}$$

¹² You can find the proof to this in [15].

Based on measurement $\eta(t)$, we then define the weakest subtree $T_{t_1}$ as the tree branch starting at the node $t_1$ such that
$$t_1 = \arg\min_{t \in T} \eta(t) \tag{4.145}$$

$$\alpha_1 = \eta(t_1) \tag{4.146}$$

As $\alpha$ increases, the node $t_1$ is the first node such that $R_\alpha(\{t\})$ becomes equal to $R_\alpha(T_t)$. At this point, it would make sense to prune subtree $T_{t_1}$ (collapse $T_{t_1}$ into a single-node subtree $\{t_1\}$), and $\alpha_1$ is the value of $\alpha$ where the pruning occurs.

Now the tree $T$ after pruning is referred to as $T_1$, i.e.,

$$T_1 = T - T_{t_1} \tag{4.147}$$

We then use the same process to find the weakest subtree $T_{t_2}$ in $T_1$ and the new pruning point $\alpha_2$. After pruning away $T_{t_2}$ from $T_1$ to form the new pruned tree $T_2$, we repeat the same process to find the next weakest subtree and pruning point. If we continue the process, we get a sequence of decreasing pruned trees:

$$T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\} \tag{4.148}$$

where $\{r\}$ is the single-node tree containing the root of tree $T$, with corresponding pruning points:

$$\alpha_0 < \alpha_1 < \alpha_2 < \alpha_3 < \cdots \tag{4.149}$$

where $\alpha_0 = 0$.

With the process above, the following theorem (which is basic for the minimum cost-complexity pruning) can be proved.

THEOREM 2: Let $T_0$ be the original tree $T$. For $k \ge 0$, $\alpha_k \le \alpha < \alpha_{k+1}$,

$$T(\alpha) = T(\alpha_k) = T_k \tag{4.150}$$

4.5.6.2. Independent Test Sample Estimation

The minimum cost-complexity pruning algorithm can progressively prune the over-grown tree to form a decreasing sequence of subtrees $T \succ T_1 \succ T_2 \succ T_3 \cdots \succ \{r\}$, where $T_k = T(\alpha_k)$, $\alpha_0 = 0$ and $T_0 = T$. The task now is simply to choose one of those subtrees as the optimal-sized tree. Our goal is to find the optimal-sized tree that minimizes the misclassification for
tions ab0
Step 2: 8 training set S. We use the remaining two thirds of the training set 8 —9t (still abundant) to
train the initial tree T and apply the minimum costcomplexity pruning algorithm to attain
the decreasing sequence of subtrees T > I; > T2 >T2 > {r}. Next, the test set 9? is run through the sequence of subtrees to get corresponding estimates of testset misclassiﬁcation in any n(
R'(T),R'(T;),R°(I; ),~~,R'({r}). The optimalsized tree is then picked as the one with square er
minimum testset misclassiﬁcation measure, i.e.: Step 3: Ir
Step 4: S
k' = argminR°(I},) (4.151) a
t b.
C.
The independent test sample estimation approach has the drawback that it reduces the Step 5; 3
effective training sample size. This is why it is used only when there is abundant training Step 6; 5
data. Under most circumstances where training data is limited, crossvalidation is often the poten
used. erwise go
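The pruning sequence of Eqs. (4.145) to (4.149), on which this selection step relies, can be sketched in a few lines. This is a minimal illustration rather than the book's implementation: the `Node` class and its field names are hypothetical, and $\eta(t)$ is taken to be $(R(t) - R(T_t)) / (|\tilde{T}_t| - 1)$, the value of $\alpha$ at which $R_\alpha(\{t\}) = R_\alpha(T_t)$ under the assumed cost-complexity measure $R_\alpha(T) = R(T) + \alpha|\tilde{T}|$, which this excerpt does not restate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; `cost` plays the role of R(t)."""
    name: str
    cost: float
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Terminal nodes of the subtree rooted at `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def internal_nodes(node):
    """Non-terminal nodes of the subtree rooted at `node`."""
    if not node.children:
        return []
    found = [node]
    for child in node.children:
        found.extend(internal_nodes(child))
    return found


def eta(node):
    """Assumed form of eta(t): (R(t) - R(T_t)) / (number of leaves - 1).

    Assumes every internal node has at least two leaves below it.
    """
    sub = leaves(node)
    return (node.cost - sum(leaf.cost for leaf in sub)) / (len(sub) - 1)


def pruning_sequence(root):
    """Collapse the weakest subtree repeatedly (mutates the tree in place).

    Returns a list of (alpha_k, number of leaves of T_k), starting at alpha_0 = 0.
    """
    sequence = [(0.0, len(leaves(root)))]
    while root.children:
        weakest = min(internal_nodes(root), key=eta)   # Eq. (4.145)
        alpha = eta(weakest)                           # Eq. (4.146)
        weakest.children = []                          # Eq. (4.147)
        sequence.append((alpha, len(leaves(root))))
    return sequence


# Made-up costs chosen so the arithmetic is exact in binary floating point.
tree = Node("r", 0.5, [
    Node("A", 0.25, [Node("A1", 0.0625), Node("A2", 0.125)]),
    Node("B", 0.125),
])
for alpha, n_leaves in pruning_sequence(tree):
    print(f"alpha = {alpha:.4f}, leaves = {n_leaves}")
```

The sketch collapses one node per step. When several nodes tie for the minimum, it will emit repeated values of $\alpha$ instead of the strictly increasing sequence in Eq. (4.149); a fuller implementation would prune all tied nodes together.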
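The hold-out procedure and the selection rule of Eq. (4.151) amount to a random split followed by an argmin. The sketch below assumes the test-set misclassification rates of the pruned subtrees have already been measured; the function names and the error values are made up for illustration.

```python
import random


def hold_out_third(samples, seed=0):
    """Randomly set aside one third of the samples as the test set.

    Returns (training_part, test_set); the training part holds the
    remaining two thirds.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 3
    return shuffled[cut:], shuffled[:cut]


def select_optimal_tree(test_errors):
    """Return the index k minimizing the test-set error, as in Eq. (4.151).

    test_errors[k] is the measured test-set misclassification of T_k.
    """
    if not test_errors:
        raise ValueError("need at least one subtree")
    return min(range(len(test_errors)), key=lambda k: test_errors[k])


# Hypothetical test-set misclassification rates for T_0, T_1, ..., {r}.
errors = [0.28, 0.24, 0.21, 0.25, 0.40]
k_star = select_optimal_tree(errors)
print(k_star)
```

With ties, `min` returns the smallest index, which corresponds to the largest of the tied trees. Preferring the smaller tree instead would be an equally defensible choice that the text does not settle.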
4.5.6.3. Cross-Validation

CART can be pruned via v-fold cross-validation. It follows the same principle of cross-validation described in Section 4.2.3. First it randomly divides the training set $S$ into $v$ disjoint subsets $S_1, S_2, \ldots, S_v$, each containing roughly the same number of data samples. It then defines the $i$-th training set

$$S^i = S - S_i, \quad i = 1, 2, \ldots, v \tag{4.152}$$

so that $S^i$ contains the fraction $(v-1)/v$ of the original training set. $v$ is usually chosen to be large, like 10.

In v-fold cross-validation, $v$ auxiliary trees are grown together with the main tree $T$ grown on $S$. The $i$-th tree is grown on the $i$-th training set $S^i$. By applying minimum cost-complexity pruning, for any given value of the cost-complexity parameter $\alpha$, we can obtain the corresponding minimum cost-complexity subtrees $T(\alpha)$ and $T^i(\alpha)$, $i = 1, 2, \ldots, v$. According to Theorem 2 in Section 4.5.6.1, those minimum cost-complexity subtrees will form $v + 1$ sequences of subtrees:

$$T \succ T_1 \succ T_2 \succ \cdots \succ \{r\} \tag{4.153}$$

and

$$T^i \succ T^i_1 \succ T^i_2 \succ \cdots \succ \{r^i\}, \quad i = 1, 2, \ldots, v \tag{4.154}$$
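The fold construction of Eq. (4.152) can be sketched as follows. The function name, the shuffling seed, and the round-robin way of dealing samples into folds are illustrative choices, not something the text prescribes.

```python
import random


def v_fold_partition(samples, v=10, seed=0):
    """Split samples into v disjoint subsets S_1..S_v and build S^i = S - S_i.

    Returns (folds, training_sets), where training_sets[i] contains every
    sample that is not in folds[i].
    """
    shuffled = list(samples)
    if v < 2 or v > len(shuffled):
        raise ValueError("v must be between 2 and the number of samples")
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::v] for i in range(v)]
    training_sets = [
        [x for j, fold in enumerate(folds) if j != i for x in fold]
        for i in range(v)
    ]
    return folds, training_sets


folds, training_sets = v_fold_partition(range(25), v=5)
print([len(f) for f in folds], [len(t) for t in training_sets])
```

Each training set holds a fraction $(v-1)/v$ of the data, here 20 of 25 samples, and every sample is held out exactly once.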
ALGORITHM 4.5: THE CART ALGORITHM

Step 1: Question Set: Create a standard set of questions Q that consists of all possible questions about the measured variables.
Step 2: Splitting Criterion: Pick a splitting criterion that can evaluate all the possible questions in any node. Usually it is either an entropy-like measurement for classification trees or mean squared error for regression trees.
Step 3: Initialization: Create a tree with one (root) node, consisting of all training samples.
Step 4: Split Candidates: Find the best composite question for each terminal node:
    a. Generate a tree with several simple-question splits as described in Section 4.5.3.
    b. Cluster leaf nodes into two classes according to the same splitting criterion.
    c. Based on the clustering done in (b), construct a corresponding composite question.
Step 5: Split: Out of all the split candidates in Step 4, split the one with the best criterion.
Step 6: Stop Criterion: If all the leaf nodes contain data samples from the same class, or all the potential splits generate an improvement that falls below a preset threshold $\beta$, go to Step 7; otherwise go to Step 4.
Step 7: Use independent test sample estimate or cross-validation estimate to prune the original tree into the optimal size.

The basic assumption of cross-validation is that the procedure is stable if $v$ is large.
That is, $T(\alpha)$ should have the same classification accuracy as $T^i(\alpha)$. Although we cannot directly estimate the test-set misclassification for the main tree, $R^{ts}(T(\alpha))$, we could approximate it via the test-set misclassification measure $R^{ts}(T^i(\alpha))$, since each data sample in $S$ occurs in one and only one test set $S_i$. The v-fold cross-validation estimate $R^{cv}(T(\alpha))$ can be computed as:

$$R^{cv}(T(\alpha)) = \frac{1}{v} \sum_{i=1}^{v} R^{ts}(T^i(\alpha)) \tag{4.155}$$

Similar to Eq. (4.151), once $R^{cv}(T(\alpha))$ is computed, the optimal v-fold cross-validation tree $T_{k^{cv}}$ can be found through

$$k^{cv} = \arg\min_{k} R^{cv}(T_k) \tag{4.156}$$
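A small sketch may help connect Theorem 2 to Eqs. (4.155) and (4.156). It assumes each auxiliary tree has already been pruned, so that its pruning points and the held-out error of each of its subtrees are known. It also makes two assumptions of its own: the estimate is normalized as a plain average over the $v$ folds, and each main-tree subtree $T_k$ is evaluated at $\alpha = \alpha_k$. All names and numbers are hypothetical.

```python
from bisect import bisect_right


def subtree_index(alphas, alpha):
    """Theorem 2: T(alpha) = T_k for alpha_k <= alpha < alpha_{k+1}.

    `alphas` is the increasing list of pruning points, with alphas[0] == 0.
    """
    return bisect_right(alphas, alpha) - 1


def cv_estimate(alpha, fold_alphas, fold_errors):
    """Average held-out error of the auxiliary subtrees T^i(alpha)."""
    total = 0.0
    for alphas, errors in zip(fold_alphas, fold_errors):
        total += errors[subtree_index(alphas, alpha)]
    return total / len(fold_alphas)


def select_cv_tree(main_alphas, fold_alphas, fold_errors):
    """Return (k_cv, estimates), evaluating each T_k at alpha = alpha_k."""
    estimates = [cv_estimate(a, fold_alphas, fold_errors) for a in main_alphas]
    k_cv = min(range(len(estimates)), key=lambda k: estimates[k])
    return k_cv, estimates


# Made-up pruning points and held-out errors for a main tree and v = 2 folds.
main_alphas = [0.0, 0.02, 0.05, 0.10]
fold_alphas = [[0.0, 0.01, 0.06], [0.0, 0.03, 0.04, 0.12]]
fold_errors = [[0.30, 0.22, 0.35], [0.28, 0.24, 0.20, 0.40]]
k_cv, estimates = select_cv_tree(main_alphas, fold_alphas, fold_errors)
print(k_cv, [round(e, 3) for e in estimates])
```

The auxiliary trees generally have different pruning points from the main tree, which is why the lookup goes through $\alpha$ rather than through the index $k$.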
Cross-validation is computationally expensive in comparison with independent test sample estimation, though it makes more effective use of all training data and reveals useful information regarding the stability of the tree structure. Since the auxiliary trees are grown on a smaller training set (a fraction $(v-1)/v$ of the original training data), they tend to have a higher misclassification rate. Therefore, the cross-validation estimates $R^{cv}(T)$ tend to be an overestimation of the misclassification rate. The algorithm for generating a CART tree is illustrated in Algorithm 4.5.
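The growing phase of Algorithm 4.5 (Steps 1 to 6) can be sketched for a classification tree as below. This is a deliberately simplified illustration and not the algorithm as stated: it uses only simple threshold questions of the form "is feature f at most t?" on numeric features, so the composite-question construction of Step 4 is omitted, as is the pruning of Step 7. The dictionary-based tree layout, the function names, and the default value of the threshold `beta` are all assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Best simple question for one node, as (gain, feature, threshold).

    Gain is the sample-weighted entropy reduction (Step 2). Returns None
    when no threshold separates the samples.
    """
    parent = len(rows) * entropy(labels)
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:          # Step 1
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = parent - len(left) * entropy(left) - len(right) * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def grow_tree(rows, labels, beta=1e-9):
    """Repeatedly split the leaf whose best question improves the most."""
    root = {"rows": list(rows), "labels": list(labels)}      # Step 3
    leaves = [root]
    while True:
        candidates = []
        for leaf in leaves:                                  # Step 4 (simplified)
            split = best_split(leaf["rows"], leaf["labels"])
            if split is not None:
                candidates.append((split, leaf))
        if not candidates:
            break
        (gain, f, t), leaf = max(candidates, key=lambda c: c[0][0])
        if gain < beta:                                      # Step 6
            break
        pairs = list(zip(leaf["rows"], leaf["labels"]))      # Step 5
        yes = [(r, y) for r, y in pairs if r[f] <= t]
        no = [(r, y) for r, y in pairs if r[f] > t]
        leaf["question"] = (f, t)
        leaf["yes"] = {"rows": [r for r, _ in yes], "labels": [y for _, y in yes]}
        leaf["no"] = {"rows": [r for r, _ in no], "labels": [y for _, y in no]}
        leaves = [l for l in leaves if l is not leaf] + [leaf["yes"], leaf["no"]]
    return root


def classify(tree, x):
    """Follow the questions down to a leaf and return its majority label."""
    while "question" in tree:
        f, t = tree["question"]
        tree = tree["yes"] if x[f] <= t else tree["no"]
    return Counter(tree["labels"]).most_common(1)[0][0]


rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (4.0, 6.0)]
labels = ["a", "a", "b", "b"]
tree = grow_tree(rows, labels)
print(tree["question"], classify(tree, (1.5, 0.0)), classify(tree, (3.5, 0.0)))
```

Recomputing the best split of every leaf on each pass keeps the sketch short but is wasteful; caching each leaf's best candidate would avoid the repeated work.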