

... cross-validation using the 1-SE rule of 0.3444 + 0.0363369. This subtree is extracted with a call to prune and saved in fit9. The pruned tree is shown in figure 4. Two options have been used in the plot. The compress option tries to narrow the printout by vertically overlapping portions of the plot. It has only a small effect on this particular dendrogram. The branch option controls the shape of the branches that connect a node to its children. The section on plotting (9) will discuss this and other options in more detail.

The largest tree, with 36 terminal nodes, correctly classifies 170/200 = 85% (1 - 0.15) of the observations, but uses several of the random predictors in doing so and seriously overfits the data. If the number of observations per terminal node (minbucket) had been set to 1 instead of 2, then every observation would be classified correctly in the final model, many in terminal nodes of size 1.

5 Missing data

5.1 Choosing the split

Missing values are one of the curses of statistical models and analysis. Most procedures deal with them by refusing to deal with them: incomplete observations are tossed out. Rpart is somewhat more ambitious. Any observation with values for the dependent variable and at least one independent variable will participate in the modeling.

The quantity to be maximized is still

$$\Delta I = p(A)\,I(A) - p(A_L)\,I(A_L) - p(A_R)\,I(A_R)$$

The leading term is the same for all variables and splits irrespective of missing data, but the right two terms are somewhat modified. Firstly, the impurity indices $I(A_R)$ and $I(A_L)$ are calculated only over the observations which are not missing a particular predictor. Secondly, the two probabilities $p(A_L)$ and $p(A_R)$ are also calculated only over the relevant observations, but they are then adjusted so that they sum to $p(A)$. This entails some extra bookkeeping as the tree is built, but ensures that the terminal node probabilities sum to 1.
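The adjustment described above can be illustrated with a small sketch. It is not rpart's implementation. It assumes the Gini index as the impurity measure and empirical proportions as the probabilities, and the names gini and split_improvement are hypothetical. A missing value is represented by None.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_improvement(x, y, cut, p_node=1.0):
    """Improvement for the split x < cut vs. x >= cut.

    x      : predictor values, with None marking a missing value (assumption)
    y      : class labels for every observation in the node
    p_node : probability of the node, p(A)
    """
    # Leading term p(A) * I(A): uses every observation in the node,
    # so it is the same for all variables and splits.
    parent = p_node * gini(y)

    # Child impurities use only observations where x is not missing.
    left = [yi for xi, yi in zip(x, y) if xi is not None and xi < cut]
    right = [yi for xi, yi in zip(x, y) if xi is not None and xi >= cut]
    if not left or not right:
        return 0.0  # the cut does not separate the observed values

    # Child probabilities come from the observed cases only, then are
    # rescaled so that p(A_L) + p(A_R) = p(A).
    n_obs = len(left) + len(right)
    p_left = p_node * len(left) / n_obs
    p_right = p_node * len(right) / n_obs
    return parent - p_left * gini(left) - p_right * gini(right)


# Toy node with six observations and two predictors.
y = ["a", "a", "b", "b", "a", "b"]
x_complete = [1, 2, 3, 4, 5, 6]
x_sparse = [None, None, 1, None, 9, None]  # only two values observed

print(split_improvement(x_complete, y, cut=3))
print(split_improvement(x_sparse, y, cut=5))
```

In this toy data the predictor with only two observed values yields two pure children, so its improvement equals the full parent impurity, while the complete predictor does not reach that value at the cut shown.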
In the extreme case of a variable for which only 2 observations are non-missing, the impurity of the two sons will both be zero when splitting on that variable. Hence $\Delta I$ will be maximal, and this `almost all missing' coordinate is guaranteed to be chosen as best; the method is certainly flawed in this extreme case. It is difficult to say whether this bias toward missing coordinates carries through to the non-extreme cases, however, since a more complete variable also affords for itself more possible values at which to split.

5.2 Surrogate variables

Once a splitting variable and a split point for it have been decided, what is to be done with observations missing that variable? One approach is to estimate the missing datum using the other independent variables; rpart uses a variation of this to define surrogate variables.

As an example, assume that the split (age < 40, age ≥ 40) has been chosen. The surrogate variables are found by re-applying the partitioning algorithm (without recursion) to predict the two categories `age < 40' vs. `age ≥ 40' using the other independent variables...

This document was uploaded on 09/26/2013.
