...cross-validation using the 1-SE rule of 0.3444 + 0.0363369. This subtree is
extracted with a call to prune and saved in fit9. The pruned tree is shown in figure
4. Two options have been used in the plot. The compress option tries to narrow the
printout by vertically overlapping portions of the plot. It has only a small effect on
this particular dendrogram. The branch option controls the shape of the branches
that connect a node to its children. The section on plotting (9) will discuss this and
other options in more detail.
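The 1-SE selection can be illustrated with a small sketch. It is written in Python purely to show the arithmetic and is not rpart code. Only the two numbers 0.3444 and 0.0363369 come from the text; every other row of the table, including the number of splits attached to each row, is a made-up value for illustration, and the function name one_se_choice is hypothetical.

```python
def one_se_choice(cp_table):
    """Pick the smallest subtree whose cross-validated error is within
    one standard error of the minimum cross-validated error.

    cp_table: list of (n_splits, xerror, xstd) tuples, one per subtree.
    Returns (chosen_row, threshold).
    """
    # Row with the smallest cross-validated error.
    best = min(cp_table, key=lambda row: row[1])
    # 1-SE threshold: minimum error plus its standard error.
    threshold = best[1] + best[2]
    # Among subtrees at or under the threshold, take the one with fewest splits.
    eligible = [row for row in cp_table if row[1] <= threshold]
    return min(eligible, key=lambda row: row[0]), threshold


# Hypothetical table: only 0.3444 and 0.0363369 are taken from the text.
cp_table = [
    (0, 1.0000, 0.0500),
    (2, 0.5200, 0.0420),
    (5, 0.3700, 0.0370),
    (9, 0.3444, 0.0363369),
    (20, 0.3600, 0.0368),
]
chosen, threshold = one_se_choice(cp_table)
print(chosen, round(threshold, 7))
```

With these invented rows the threshold is 0.3444 + 0.0363369 = 0.3807369, and the smallest subtree under it is the one with 5 splits. The subtree actually chosen in the text depends on the real table, which is not reproduced here.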
The largest tree, with 36 terminal nodes, correctly classifies 170/200 = 85%
(1 - 0.15) of the observations, but uses several of the random predictors in doing
so and seriously overfits the data. If the number of observations per terminal node
(minbucket) had been set to 1 instead of 2, then every observation would be classified
correctly in the final model, many in terminal nodes of size 1.

5 Missing data

5.1 Choosing the split

Missing values are one of the curses of statistical models and analysis. Most
procedures deal with them by refusing to deal with them: incomplete observations are
tossed out. Rpart is somewhat more ambitious. Any observation with values for
the dependent variable and at least one independent variable will participate in the
modeling.
The quantity to be maximized is still
$$\Delta I = p(A) I(A) - p(A_L) I(A_L) - p(A_R) I(A_R)$$
The leading term is the same for all variables and splits irrespective of missing
data, but the right two terms are somewhat modified. Firstly, the impurity indices
$I(A_R)$ and $I(A_L)$ are calculated only over the observations which are not missing
a particular predictor. Secondly, the two probabilities $p(A_L)$ and $p(A_R)$ are also
calculated only over the relevant observations, but they are then adjusted so that
they sum to $p(A)$. This entails some extra bookkeeping as the tree is built, but
ensures that the terminal node probabilities sum to 1.
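A minimal sketch of this calculation follows, in Python for illustration only. It assumes the Gini index as the impurity measure and plain observed frequencies as the probabilities; neither choice is stated in this section, and rpart's own bookkeeping (priors, losses, weights) is not reproduced. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (assumed impurity measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def delta_impurity(x, y, cut, p_node=1.0):
    """Impurity reduction for the split x < cut vs. x >= cut.

    x: predictor values, with None marking a missing value.
    y: class labels for every observation in the node.
    p_node: probability of the node, p(A).
    """
    # Child impurities use only observations not missing this predictor.
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi >= cut]
    if not left or not right:
        return 0.0  # not a real split: one side is empty
    n_obs = len(left) + len(right)
    # Child probabilities come from the non-missing observations only,
    # then are rescaled so that they sum to p(A).
    p_left = p_node * len(left) / n_obs
    p_right = p_node * len(right) / n_obs
    # The leading term uses every observation in the node.
    return p_node * gini(list(y)) - p_left * gini(left) - p_right * gini(right)


y = ["a", "a", "b", "b", "a", "b"]
x_full = [1, 6, 8, 9, 2, 3]                 # no missing values
x_sparse = [1, None, None, 9, None, None]   # only two values observed

print(delta_impurity(x_full, y, cut=5))
print(delta_impurity(x_sparse, y, cut=5))
```

In this toy node the sparse predictor scores 0.5, the largest value possible here, while the fully observed predictor scores about 0.056. Each child of the sparse split holds a single observation, so both child impurities are zero.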
In the extreme case of a variable for which only 2 observations are nonmissing,
the impurities of the two sons will both be zero when splitting on that variable. Hence
$\Delta I$ will be maximal, and this `almost all missing' coordinate is guaranteed to be
chosen as best; the method is certainly flawed in this extreme case. It is difficult to
say whether this bias toward missing coordinates carries through to the nonextreme
cases, however, since a more complete variable also affords for itself more possible
values at which to split.

5.2 Surrogate variables

Once a splitting variable and a split point for it have been decided, what is to
be done with observations missing that variable? One approach is to estimate the
missing datum using the other independent variables; rpart uses a variation of this
to define surrogate variables.
As an example, assume that the split (age < 40, age ≥ 40) has been chosen. The
surrogate variables are found by reapplying the partitioning algorithm (without
recursion) to predict the two categories `age < 40' vs. `age ≥ 40' using the other
independent variables...
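The idea of a one-level search for a surrogate can be sketched as follows, in Python for illustration only. The sketch assumes the primary split is age < 40 versus age ≥ 40 and scores each candidate cutpoint by how many observations it sends to the same side as the primary split. That agreement count is an assumed criterion, not a description of rpart's internals. The predictors income and height, the data, and the function name best_surrogate are all hypothetical.

```python
def best_surrogate(primary, cut, candidates):
    """One-level search for the split that best mimics the primary split.

    primary: values of the primary variable (None = missing).
    cut: primary cutpoint; observations with primary < cut go left.
    candidates: dict mapping a variable name to its list of values.
    Returns (name, cutpoint, direction, n_agree), where direction "<"
    means 'value < cutpoint goes left' and ">=" means the reverse.
    """
    best = None
    for name, values in candidates.items():
        # Use only observations where both variables are observed.
        pairs = [(v, p < cut) for v, p in zip(values, primary)
                 if v is not None and p is not None]
        for c in sorted({v for v, _ in pairs}):
            # Count observations sent to the same side as the primary split.
            agree = sum((v < c) == goes_left for v, goes_left in pairs)
            # The reversed rule agrees exactly where this rule disagrees.
            for direction, score in (("<", agree), (">=", len(pairs) - agree)):
                if best is None or score > best[3]:
                    best = (name, c, direction, score)
    return best


age = [25, 32, 38, 45, 51, 60, None]  # last observation is missing age
candidates = {
    "income": [20, 28, 35, 50, 58, 61, 30],
    "height": [170, 182, 165, 175, 168, 180, 172],
}
surrogate = best_surrogate(age, 40, candidates)
print(surrogate)
```

On this toy data the rule income < 50 reproduces the primary split for all six observations with a known age. The observation missing age has income 30 and would therefore be sent to the `age < 40' side.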