2.5 85 40 Prog 0.4706 0.5294
  6 g2 13.2 40 17 No 0.5750 0.4250
    12 ploidy:diploid,tetraploid 31 11 No 0.6452 0.3548
      24 g2 11.845 7 1 No 0.8571 0.1429 *
      25 g2 11.845 24 10 No 0.5833 0.4167
        50 g2 11.005 17 5 No 0.7059 0.2941 *
        51 g2 11.005 7 2 Prog 0.2857 0.7143 *
    13 ploidy:aneuploid 9 3 Prog 0.3333 0.6667 *
  7 g2 13.2 45 17 Prog 0.3778 0.6222
    14 g2 17.91 22 8 No 0.6364 0.3636
      28 age 62.5 15 4 No 0.7333 0.2667 *
      29 age 62.5 7 3 Prog 0.4286 0.5714 *
    15 g2 17.91 23 3 Prog 0.1304 0.8696 *
text(cfit)

The creation of a labeled factor variable as the response improves the labeling
of the printout.
We have explicitly directed the routine to treat progstat as a categorical variable by asking for method='class'. Since progstat is a factor, this would have
been the default choice. Since no optional classification parameters are specified, the routine will use the Gini rule for splitting, prior probabilities that are
proportional to the observed data frequencies, and 0/1 losses.
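
As a rough illustration of the Gini rule mentioned above, the following Python sketch computes the textbook Gini index, 1 minus the sum of squared class proportions, for one parent node and its two children. The class counts are derived from the printout under the stated proportional priors and 0/1 losses (n minus loss for the predicted class, loss for the other class). The function names are hypothetical, and this is not a transcription of the routine's internals, which may scale the improvement differently.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_improvement(parent, left, right):
    """Decrease in impurity from a split; children are weighted by their share of the parent."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Class counts as (No, Prog), inferred from the n and loss columns of the printout.
parent_node = [40, 45]  # n = 85, loss = 40, predicted Prog
node_6 = [23, 17]       # n = 40, loss = 17, predicted No
node_7 = [17, 28]       # n = 45, loss = 17, predicted Prog

print("parent impurity:", round(gini(parent_node), 4))
print("improvement from the g2 split:", round(gini_improvement(parent_node, node_6, node_7), 4))
```

A pure node has impurity 0 and an evenly mixed two-class node has impurity 0.5, so a candidate split is better the more it lowers the weighted impurity of the children.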
The child nodes of node x are always numbered 2x (left) and 2x + 1 (right),
to help in navigating the printout (compare the printout to figure 3).
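
The numbering rule can be used to move around the printout mechanically. The Python sketch below is only an illustration with hypothetical helper names, and it assumes the root is node 1 (which the 2x and 2x + 1 rule implies for nodes 2 and 3).

```python
def children(x):
    """Return the (left, right) child numbers of node x."""
    return 2 * x, 2 * x + 1


def parent(x):
    """Return the parent of node x; the root (assumed to be node 1) has none."""
    if x <= 1:
        raise ValueError("the root node has no parent")
    return x // 2


def depth(x):
    """Number of splits between the root (node 1) and node x."""
    return x.bit_length() - 1


def path_from_root(x):
    """List the node numbers visited on the way from the root down to node x."""
    path = [x]
    while x > 1:
        x = parent(x)
        path.append(x)
    return path[::-1]


print(children(25))        # the two nodes created by splitting node 25
print(path_from_root(51))  # ancestors of node 51, starting at the root
```

Even-numbered nodes are left children and odd-numbered nodes (other than the root) are right children, so the parity of a node number tells which branch was taken.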
Other items in the list are the definition of the variable and split used to create
a node, the number of subjects at the node, the loss or error at the node (for
this example, with proportional priors and unit losses, this will be the number
misclassified), the classification of the node, and the predicted class probabilities for the
node. An asterisk (*) indicates that the node is terminal.
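
To see how the columns fit together, the Python sketch below recovers the two printed class proportions from the number of subjects, the loss, and the classification of a node. It is an illustration only: the function name is hypothetical, and the calculation assumes two classes, proportional priors, and unit losses, so that the loss is simply the number misclassified.

```python
def class_proportions(n, loss, predicted, classes=("No", "Prog")):
    """Proportion of each class at a two-class node, given n, loss, and the predicted class.

    Assumes the loss equals the number of subjects not in the predicted class.
    """
    share_predicted = (n - loss) / n
    return {c: (share_predicted if c == predicted else 1.0 - share_predicted)
            for c in classes}


# Rows taken from the printout: node 6 and node 15.
node_6 = class_proportions(n=40, loss=17, predicted="No")
node_15 = class_proportions(n=23, loss=3, predicted="Prog")

for label, props in (("node 6", node_6), ("node 15", node_15)):
    print(label, {c: round(p, 4) for c, p in props.items()})
```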
Grades 1 and 2 go to the left, grades 3 and 4 go to the right. The tree is
arranged so that the branches with the largest "average class" go to the right.

4 Pruning the tree
4.1 Definitions

We have built a complete tree, possibly quite large and/or complex, and must now
decide how much of that model to retain. In forward stepwise regression, for instance, this issue is addressed sequentially and no additional variables are added
when the F-test for the remaining variables fails to achieve some level $\alpha$.
Let $T_1, T_2, \ldots, T_k$ be the terminal nodes of a tree $T$. Define
$|T|$ = number of terminal nodes
risk of $T$ = $R(T) = \sum_{i=1}^{k} P(T_i) R(T_i)$
In comparison to regression, $|T|$ is analogous to the model degrees of freedom and
$R(T)$ to the residual sum of squares.
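
The risk sum can be checked by hand on the part of the tree shown in the printout. The Python sketch below is an illustration under stated assumptions: it uses only the seven terminal nodes below the 85-subject node (the rest of the tree is not shown here), estimates P(T_i) by the share of subjects in the node, and estimates R(T_i) by the node's misclassification rate. The names are hypothetical.

```python
# Terminal nodes below the 85-subject node: node number -> (subjects, loss).
terminal_nodes = {
    24: (7, 1),
    50: (17, 5),
    51: (7, 2),
    13: (9, 3),
    28: (15, 4),
    29: (7, 3),
    15: (23, 3),
}


def tree_risk(leaves, n_total):
    """Sum over terminal nodes of P(T_i) * R(T_i), with both terms estimated from counts."""
    risk = 0.0
    for n, loss in leaves.values():
        p_node = n / n_total   # estimated probability of reaching the node
        r_node = loss / n      # estimated misclassification rate at the node
        risk += p_node * r_node
    return risk


n_total = sum(n for n, _ in terminal_nodes.values())
print("terminal nodes |T|:", len(terminal_nodes))
print("risk with all splits:", round(tree_risk(terminal_nodes, n_total), 4))
print("risk with no splits: ", round(40 / n_total, 4))  # the 85-subject node has loss 40
```

With these estimates each term reduces to loss / n_total, so the risk is the overall fraction misclassified, which is why it behaves like a residual sum of squares.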
Now let $\alpha$ be some number between 0 and $\infty$ which measures the 'cost' of adding
another variable to the model; $\alpha$ will be called a complexity parameter. Let $R(T_0)$
be the risk for the zero split tree. Define
$R_\alpha(T) = R(T) + \alpha |T|$
to be the cost for the tree, and define $T_\alpha$ to be that subtree of the full model which
has minimal cost. Obviously $T_0$ = the full model and $T_\infty$ = the model with no splits
at all. The following results are shown in [1].
1. If $T_1$ and $T_2$ are subtrees of $T$ with $R_\alpha(T_1) = R_\alpha(T_2)$, then either $T_1$ is a
subtree of $T_2$ or $T_2$ is a subtree of $T_1$; hence either $|T_1| < |T_2|$ or $|T_2| < |T_1|$.
2. If $\alpha > \beta$ then either $T_\alpha = T_\beta$ or $T_\alpha$ is a strict subtree of $T_\beta$.
3. Given some set of...
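
The cost defined above, risk plus the complexity parameter times the number of terminal nodes, can be tried out on a small example. The Python sketch below is an illustration under assumptions: the candidates are a hand-picked nested sequence of prunings of the printed subtree, not every possible subtree; risk is measured as a count of misclassified subjects, so the complexity parameter is in those units; ties are broken toward the smaller tree; and all names are hypothetical.

```python
# (description, misclassified subjects, number of terminal nodes)
candidates = [
    ("no split", 40, 1),
    ("leaves 6, 7", 34, 2),
    ("leaves 6, 14, 15", 28, 3),
    ("leaves 6, 28, 29, 15", 27, 4),
    ("leaves 12, 13, 28, 29, 15", 24, 5),
    ("leaves 24, 25, 13, 28, 29, 15", 24, 6),
    ("all seven leaves", 21, 7),
]


def cost(risk, size, alpha):
    """Cost-complexity of a subtree: risk plus alpha times the number of terminal nodes."""
    return risk + alpha * size


def best_subtree(alpha):
    """Candidate with minimal cost; ties go to the smaller tree (an assumption)."""
    return min(candidates, key=lambda c: (cost(c[1], c[2], alpha), c[2]))


for alpha in (0, 1, 2, 4, 8, 100):
    name, risk, size = best_subtree(alpha)
    print(f"alpha={alpha:>3}: {name} (|T|={size}, cost={cost(risk, size, alpha)})")
```

As the complexity parameter grows, the selected tree never gets larger, which matches the nesting results listed above.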