lem. For an arbitrary loss matrix of dimension $C > 2$, rpart uses the above
formula with $L_i = \sum_j L(i, j)$.
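The formula referred to as "above" is not reproduced in this excerpt. As a minimal sketch, assume it rescales each prior by its class loss and renormalizes, $\tilde\pi_i = \pi_i L_i / \sum_j \pi_j L_j$. The helper name `altered_priors`, the loss matrix, and the uniform priors below are all hypothetical and are not part of rpart.

```r
# Hypothetical helper (not part of rpart): altered priors from a loss matrix.
# Assumes loss[i, j] is the cost of classifying a true class i as class j,
# with zeros on the diagonal.
altered_priors <- function(prior, loss) {
  stopifnot(nrow(loss) == ncol(loss), length(prior) == nrow(loss))
  L <- rowSums(loss)            # L_i = sum_j L(i, j)
  prior * L / sum(prior * L)    # rescale by loss, renormalize to sum to 1
}

# Illustrative 3-class loss matrix: errors on class 1 cost twice as much.
loss <- matrix(c(0, 2, 2,
                 1, 0, 1,
                 1, 1, 0), nrow = 3, byrow = TRUE)
prior <- rep(1/3, 3)

altered <- altered_priors(prior, loss)
print(altered)
```

Under these assumptions the class with the larger loss receives a larger altered prior, which is the intended effect of the row-sum reduction.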
A second justification for altered priors is this. An impurity index
$I(A) = \sum_i f(p_i)$ has its maximum at $p_1 = p_2 = \ldots = p_C = 1/C$. If a
problem had, for instance, a misclassification loss for class 1 which was twice
the loss for a class 2 or 3 observation, one would wish $I(A)$ to have its
maximum at $p_1 = 1/5$, $p_2 = p_3 = 2/5$, since this is the worst possible set
of proportions on which to decide a node's class. The altered priors technique
does exactly this, by shifting the $p_i$.
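The quoted proportions can be checked by hand. The worked step below assumes (the formula itself is not shown in this excerpt) that the shift takes the form $\tilde p_i = p_i L_i / \sum_j p_j L_j$, with $L_1 = 2$ and $L_2 = L_3 = 1$ standing in for the example's losses.

```latex
\tilde p_1 = \frac{(1/5)(2)}{(1/5)(2) + (2/5)(1) + (2/5)(1)}
           = \frac{2/5}{6/5} = \frac{1}{3},
\qquad
\tilde p_2 = \tilde p_3 = \frac{(2/5)(1)}{6/5} = \frac{1}{3}.
```

Under that assumption the shifted proportions are uniform at $p = (1/5, 2/5, 2/5)$, which is where the impurity index is largest. The same point also equalizes the expected loss of each decision: predicting class 1 costs $p_2 + p_3 = 4/5$, and predicting class 2 or class 3 costs $2p_1 + 2/5 = 4/5$.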
Two final notes:

- When altered priors are used, they affect only the choice of split. The
  ordinary losses and priors are used to compute the risk of the node. The
  altered priors simply help the impurity rule choose splits that are likely
  to be "good" in terms of the risk.
- The argument for altered priors is valid for both the gini and information
  splitting rules.

3.3 Example: Stage C prostate cancer (class method)

This first example is based on a data set of 146 stage C prostate cancer patients
[4]. The main clinical endpoint of interest is whether the disease recurs after initial
surgical removal of the prostate, and the time interval to that progression if any.
The endpoint in this example is status, which takes on the value 1 if the disease has
progressed and 0 if not. Later we'll analyze the data using the exponential (exp)
method, which will take into account time to progression. A short description of
each of the variables is listed below. The main predictor variable of interest in this
study was DNA ploidy, as determined by flow cytometry. For diploid and tetraploid
tumors, the flow cytometric method was also able to estimate the percent of tumor
cells in a G2 growth stage of their cell cycle; G2 is systematically missing for
most aneuploid tumors.
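The missingness pattern described here can be inspected directly. The sketch below assumes the rpart package is installed and that its bundled `stagec` data frame is the data set discussed in this section; the object names `g2_missing` and `miss_tab` are illustrative.

```r
library(rpart)  # assumed to provide the stagec data frame

# Flag observations with no G2 measurement
g2_missing <- is.na(stagec$g2)

# Cross-tabulate DNA ploidy against G2 missingness
miss_tab <- table(ploidy = stagec$ploidy, g2_missing = g2_missing)
print(miss_tab)
```

Each row of the table is one ploidy pattern, so the share of missing G2 values can be compared across diploid, tetraploid, and aneuploid tumors.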
[Figure 3: Classification tree for the Stage C data. Split labels recoverable
from the figure: grade<2.5, g2<13.2, ploidy:ab, g2>11.845, g2<11.005, g2>17.91,
age>62.5. Terminal nodes are labeled No or Prog.]

The variables in the data set are
pgtime    time to progression, or last follow-up free of progression
pgstat    status at last follow-up (1=progressed, 0=censored)
age       age at diagnosis
eet       early endocrine therapy (1=no, 0=yes)
ploidy    diploid/tetraploid/aneuploid DNA pattern
g2        % of cells in G2 phase
grade     tumor grade (1-4)
gleason   Gleason grade (3-10)
The model is fit by using the rpart function. The first argument of the function
is a model formula, with the ~ symbol standing for "is modeled as". The print
function gives an abbreviated output, as for other S models. The plot and text
commands plot the tree and then label the plot; the result is shown in figure 3.
> progstat <- factor(stagec$pgstat, levels=0:1, labels=c("No", "Prog"))
> cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                data=stagec, method='class')
> print(cfit)
node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 146 54 No (0.6301 0.3699)
   2) grade< 2.5 61 9 No (0.8525 0.1475) *
   3) grade...
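The plot and text step mentioned in the text is not shown in this excerpt. A minimal sketch, assuming the rpart package (with its bundled `stagec` data) is installed, might look like the following. The fit repeats the call above so the block is self-contained, and the default arguments to `plot` and `text` are an assumption, since the source does not give the exact call used for figure 3.

```r
library(rpart)  # provides rpart() and, by assumption, the stagec data

# Recode the 0/1 progression status as a labeled factor, as in the text
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))

# Classification tree on the six predictors used in the text
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = "class")

plot(cfit)   # draw the tree skeleton
text(cfit)   # label the splits and the terminal nodes
```

`text` must be called after `plot`, since it annotates the tree already drawn on the active graphics device.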