Rpart_TechReport61


For each predictor an optimal split point and a misclassification error are computed. Losses and priors do not enter in (none are defined for the age groups), so the risk is simply (# misclassified)/$n$. Also evaluated is the blind rule `go with the majority', which has misclassification error $\min(p, 1-p)$, where $p = (\#\text{ in } A \text{ with age} < 40)/n_A$. The surrogates are ranked, and any variables which do no better than the blind rule are discarded from the list.

Assume that the majority of subjects have age $\le 40$ and that there is another variable $x$ which is uncorrelated with age; however, the subject with the largest value of $x$ is also over 40 years of age. Then the surrogate variable $x < \max$ versus $x \ge \max$ will have one less error than the blind rule, sending 1 subject to the right and $n-1$ to the left. A continuous variable that is completely unrelated to age has probability $1-p^2$ of generating such a trim-one-end surrogate by chance alone. For this reason the rpart routines impose one more constraint during the construction of the surrogates: a candidate split must send at least 2 observations to the left and at least 2 to the right.

Any observation which is missing the split variable is then classified using the first surrogate variable, or if that is missing too, the second surrogate, and so on. If an observation is missing all the surrogates, the blind rule is used. Other strategies for these `missing everything' observations can be convincingly argued, but we hope there will be few or no observations of this type.

5.3 Example: Stage C prostate cancer (cont.)

Let us return to the stage C prostate cancer data of the earlier example. For a more detailed listing of the rpart object, we use the summary function. It includes the information from the CP table (not repeated below), plus information about each node. It is easy to print a subtree based on a different cp value using the cp option.
Any value between 0.0555 and 0.1049 would produce the same result as is listed below, that is, the tree with 3 splits. Because the printout is long, the file option of summary.rpart is often useful.

    > printcp(fit)

    Classification tree:
    rpart(formula = progstat ~ age + eet + g2 + grade + gleason + ploidy,
        data = stagec)

    Variables actually used in tree construction:
    [1] age    g2     grade  ploidy

    Root node error: 54/146 = 0.36986

            CP nsplit rel error xerror    xstd
    1 0.104938      0   1.00000 1.0000 0.10802
    2 0.055556      3   0.68519 1.1852 0.11103
    3 0.027778      4   0.62963 1.0556 0.10916
    4 0.018519      6   0.57407 1.0556 0.10916
    5 0.010000      7   0.55556 1.0556 0.10916

    > summary(cfit, cp=.06)

    Node number 1: 146 observations,    complexity param=0.1049
      predicted class= No   expected loss= 0.3699
        class counts:    92     54
        probabilities: 0.6301 0.3699
      left son=2 (61 obs)  right son=3 (85 obs)
      Primary splits:
          grade   < 2.5  to the left,  improve=10.360, (0 missing)
          gleason < 5.5  to the left,  improve= 8.400, (3 missing)
          ploidy  splits as LRR,       improve= 7.657, (0 missing)
          g2      < 13.2 to the left,  improve= 7.187, (7 missing)
          age     < 58.5 to the right, improve= 1.388,...
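As a rough check on how to read the CP table in the printcp listing, the sketch below copies its printed values into base R (nothing is refit, and rpart itself is not called). It rescales the "rel error" column by the root node error count to recover misclassification counts, and looks up the number of splits that a given cp value selects. The helper name nsplit_for_cp is hypothetical and not part of rpart, and its behavior exactly at a tabulated CP value is an assumption.

```r
# Values copied from the printcp() listing; no model is fit here.
root_errors <- 54    # misclassified observations at the root node
n_obs       <- 146   # total number of observations

cp_table <- data.frame(
  CP        = c(0.104938, 0.055556, 0.027778, 0.018519, 0.010000),
  nsplit    = c(0, 3, 4, 6, 7),
  rel_error = c(1.00000, 0.68519, 0.62963, 0.57407, 0.55556)
)

# Root node error: the scaling factor behind the "rel error" column.
root_node_error <- root_errors / n_obs

# rel error times the root error count recovers the number of
# misclassified observations per subtree (rounded, because the printed
# table carries only five decimals).
cp_table$n_misclassified <- round(cp_table$rel_error * root_errors)

# Hypothetical helper (not part of rpart): number of splits in the
# subtree listed for a given cp, taken as the first row with CP <= cp.
# Handling of cp exactly equal to a tabulated CP is an assumption.
nsplit_for_cp <- function(cp, table = cp_table) {
  eligible <- which(table$CP <= cp)
  if (length(eligible) == 0) return(max(table$nsplit))
  table$nsplit[eligible[1]]
}

print(round(root_node_error, 5))
print(cp_table)
print(nsplit_for_cp(0.06))
```

Under these assumptions, cp = 0.06 selects the row with 3 splits, in line with the statement that any value between 0.0555 and 0.1049 gives the 3-split tree, and that tree misclassifies 37 of the 146 observations on the training data.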

This document was uploaded on 09/26/2013.
