node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root



      2.5 85 40 Prog (0.4706 0.5294)
     6) g2 13.2 40 17 No (0.5750 0.4250)
      12) ploidy:diploid,tetraploid 31 11 No (0.6452 0.3548)
        24) g2 11.845 7 1 No (0.8571 0.1429) *
        25) g2 11.845 24 10 No (0.5833 0.4167)
          50) g2 11.005 17 5 No (0.7059 0.2941) *
          51) g2 11.005 7 2 Prog (0.2857 0.7143) *
      13) ploidy:aneuploid 9 3 Prog (0.3333 0.6667) *
     7) g2 13.2 45 17 Prog (0.3778 0.6222)
      14) g2 17.91 22 8 No (0.6364 0.3636)
        28) age 62.5 15 4 No (0.7333 0.2667) *
        29) age 62.5 7 3 Prog (0.4286 0.5714) *
      15) g2 17.91 23 3 Prog (0.1304 0.8696) *

> plot(cfit)
> text(cfit)

The creation of a labeled factor variable as the response improves the labeling of the printout.

We have explicitly directed the routine to treat progstat as a categorical variable by asking for method='class'. Since progstat is a factor, this would have been the default choice. Since no optional classification parameters are specified, the routine will use the Gini rule for splitting, prior probabilities that are proportional to the observed data frequencies, and 0/1 losses.

The child nodes of node x are always numbered 2x (left) and 2x + 1 (right), to help in navigating the printout (compare the printout to figure 3).

Other items in the list are the definition of the variable and split used to create a node, the number of subjects at the node, the loss or error at the node (for this example, with proportional priors and unit losses, this will be the number misclassified), the classification of the node, and the predicted class for the node.

* indicates that the node is terminal.

Grades 1 and 2 go to the left, grades 3 and 4 go to the right. The tree is arranged so that the branches with the largest "average class" go to the right.

4 Pruning the tree

4.1 Definitions

We have built a complete tree, possibly quite large and/or complex, and must now decide how much of that model to retain.
In forward stepwise regression, for instance, this issue is addressed sequentially and no additional variables are added when the F-test for the remaining variables fails to achieve some level $\alpha$.

Let $T_1, T_2, \ldots, T_k$ be the terminal nodes of a tree $T$. Define

$|T|$ = number of terminal nodes
risk of $T$ = $R(T) = \sum_{i=1}^{k} P(T_i) R(T_i)$

In comparison to regression, $|T|$ is analogous to the model degrees of freedom and $R(T)$ to the residual sum of squares.

Now let $\alpha$ be some number between 0 and $\infty$ which measures the 'cost' of adding another variable to the model; $\alpha$ will be called a complexity parameter. Let $R(T_0)$ be the risk for the zero split tree. Define

$R_\alpha(T) = R(T) + \alpha |T|$

to be the cost for the tree, and define $T_\alpha$ to be that subtree of the full model which has minimal cost. Obviously $T_0$ = the full model and $T_\infty$ = the model with no splits at all. The following results are shown in [1].

1. If $T_1$ and $T_2$ are subtrees of $T$ with $R_\alpha(T_1) = R_\alpha(T_2)$, then either $T_1$ is a subtree of $T_2$ or $T_2$ is a subtree of $T_1$; hence either $|T_1| < |T_2|$ or $|T_2| < |T_1|$.
2. If $\alpha > \beta$ then either $T_\alpha = T_\beta$ or $T_\alpha$ is a strict subtree of $T_\beta$.
3. Given some set of...
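
The trade-off between risk and the number of terminal nodes can be illustrated with the numbers in the printout. The following is a minimal Python sketch, not part of rpart: it takes the seven terminal nodes below the node with n = 85 and loss = 40, and compares the cost of keeping that whole branch with the cost of collapsing it into a single terminal node. It assumes risk is measured as a raw count of misclassified subjects (so alpha is in the same units), and it compares only these two candidates, ignoring intermediate prunings. The names `leaves_below_node3`, `node3` and `cost` are hypothetical.

```python
# Terminal nodes below the n=85 node: node id -> (n, loss).
# Values are copied from the printout; loss is the number misclassified.
leaves_below_node3 = {
    24: (7, 1), 50: (17, 5), 51: (7, 2), 13: (9, 3),
    28: (15, 4), 29: (7, 3), 15: (23, 3),
}
node3 = (85, 40)  # n and loss if this node were made terminal


def cost(risk, n_leaves, alpha):
    """Cost of a tree: risk plus alpha times the number of terminal nodes."""
    return risk + alpha * n_leaves


branch_risk = sum(loss for _, loss in leaves_below_node3.values())
branch_size = len(leaves_below_node3)
collapsed_risk = node3[1]

# alpha at which keeping the branch and collapsing it cost the same
# (assumption: risk in misclassification counts, not probabilities)
break_even = (collapsed_risk - branch_risk) / (branch_size - 1)
print("break-even alpha:", break_even)

for alpha in (0.0, 2.0, 4.0):
    keep = cost(branch_risk, branch_size, alpha)
    prune = cost(collapsed_risk, 1, alpha)
    print(alpha, keep, prune, "keep" if keep < prune else "prune")
```

Under these assumptions the branch has risk 21 on 7 terminal nodes versus risk 40 on 1, so the full branch is cheaper for small alpha and the collapsed node becomes cheaper once alpha exceeds 19/6. The rest of the tree contributes the same amount to both costs, which is why the comparison can be made on the branch alone.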

This document was uploaded on 09/26/2013.
