Form $L(i,j) = L_i$ if $i \neq j$ and $0$ if $i = j$, in which case ...


...lem. For an arbitrary loss matrix of dimension $C > 2$, rpart uses the above formula with $L_i = \sum_j L(i,j)$.

A second justification for altered priors is this. An impurity index $I(A) = \sum_i f(p_i)$ has its maximum at $p_1 = p_2 = \ldots = p_C = 1/C$. If a problem had, for instance, a misclassification loss for class 1 which was twice the loss for a class 2 or 3 observation, one would wish $I(A)$ to have its maximum at $p_1 = 1/5$, $p_2 = p_3 = 2/5$, since this is the worst possible set of proportions on which to decide a node's class. The altered priors technique does exactly this, by shifting the $p_i$.

Two final notes:

- When altered priors are used, they affect only the choice of split. The ordinary losses and priors are used to compute the risk of the node. The altered priors simply help the impurity rule choose splits that are likely to be "good" in terms of the risk.
- The argument for altered priors is valid for both the gini and information splitting rules.

3.3 Example: Stage C prostate cancer (class method)

This first example is based on a data set of 146 stage C prostate cancer patients [4]. The main clinical endpoint of interest is whether the disease recurs after initial surgical removal of the prostate, and the time interval to that progression (if any). The endpoint in this example is status, which takes on the value 1 if the disease has progressed and 0 if not. Later we'll analyze the data using the exponential (exp) method, which will take into account time to progression. A short description of each of the variables is listed below.

The main predictor variable of interest in this study was DNA ploidy, as determined by flow cytometry. For diploid and tetraploid tumors, the flow cytometric method was also able to estimate the percent of tumor cells in a G2 growth stage of their cell cycle; G2% is systematically missing for most aneuploid tumors.
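As an aside on the altered-priors argument that precedes Section 3.3, the three-class example can be checked by hand. The sketch below assumes losses $L_1 = 2$ and $L_2 = L_3 = 1$ (class 1 twice as costly, as in the text) and assumes that altering the priors rescales the node proportions as $\tilde{p}_i \propto L_i p_i$. The symbol $\tilde{p}_i$ is notation introduced here for illustration and does not appear in the excerpt.

```latex
\begin{align*}
\tilde{p}_i &= \frac{L_i\,p_i}{\sum_{j=1}^{3} L_j\,p_j},
  \qquad L_1 = 2,\quad L_2 = L_3 = 1
  && \text{(proportions under altered priors)} \\
\tilde{p}_1 = \tilde{p}_2 = \tilde{p}_3 = \tfrac{1}{3}
  &\iff 2p_1 = p_2 = p_3
  && \text{(impurity is maximal)} \\
p_1 + p_2 + p_3 = 1
  &\implies p_1 + 2p_1 + 2p_1 = 1
  \implies p_1 = \tfrac{1}{5},\quad p_2 = p_3 = \tfrac{2}{5}
\end{align*}
```

At these proportions the expected loss is the same whichever class the node is assigned: choosing class $k$ costs $\sum_{i \neq k} L_i p_i = 6/5 - 2/5 = 4/5$ for every $k$, so no class is preferable there.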
The variables in the data set are:

    pgtime    time to progression, or last follow-up free of progression
    pgstat    status at last follow-up (1=progressed, 0=censored)
    age       age at diagnosis
    eet       early endocrine therapy (1=no, 0=yes)
    ploidy    diploid/tetraploid/aneuploid DNA pattern
    g2        % of cells in G2 phase
    grade     tumor grade (1-4)
    gleason   Gleason grade (3-10)

[Figure 3: Classification tree for the Stage C data]

The model is fit by using the rpart function. The first argument of the function is a model formula, with the ~ symbol standing for "is modeled as". The print function gives an abbreviated output, as for other S models. The plot and text commands plot the tree and then label the plot; the result is shown in Figure 3.

    library(rpart)  # provides rpart() and the stagec data set
    # Recode progression status as a two-level factor
    progstat <- factor(stagec$pgstat, levels=0:1, labels=c("No", "Prog"))
    # Fit a classification tree
    cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data=stagec, method='class')
    print(cfit)

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 146 54 No (0.6301 0.3699)
      2) grade<2.5 61 9 No (0.8525 0.1475) *
      3) grade...

This document was uploaded on 09/26/2013.
