causes slightly different cross-validation groups.

11 Relation to other programs
11.1 CART

Almost all of the definitions in rpart are equivalent to those used in CART, and the
output should usually be very similar. The printout given by summary.rpart was
also strongly influenced by some early CART output. Some known differences are:
- Surrogate splits: CART uses the percentage agreement between the surrogate
  and the primary split, and rpart uses the total number of agreements. When
  one of the surrogate variables has missing values this can lead to a different
  ordering. For instance, assume that the best surrogate based on x1 has 45/50
  = 90% agreement (with 10 missing), and the best based on x2 has 46/60 (none
  missing). Then rpart will pick x2. This is only a serious issue when there
  are a large number of missing values for one variable, and indeed the change
  was motivated by examples where a variable with nearly 100% missing values
  was being chosen as the best surrogate due to perfect concordance with the
  primary split.
- Computation: Some versions of the CART code have been optimized for very
  large data problems, and include such features as subsampling from the larger
  nodes. Large data sets can be a problem in S-plus.
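The surrogate ordering difference in the x1/x2 example above can be checked with a little arithmetic. The following is a minimal Python sketch and is not code from either program: the `Surrogate` record and the two ranking functions are hypothetical names introduced only for illustration, and the counts are the ones quoted in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surrogate:
    """Agreement counts for one candidate surrogate (hypothetical structure)."""
    name: str
    agree: int       # observations sent the same way as the primary split
    nonmissing: int  # observations where the surrogate variable is observed


def rank_by_percentage(candidates):
    """Order candidates by the fraction of agreement among non-missing rows."""
    return sorted(candidates, key=lambda s: s.agree / s.nonmissing, reverse=True)


def rank_by_count(candidates):
    """Order candidates by the total number of agreements."""
    return sorted(candidates, key=lambda s: s.agree, reverse=True)


# Counts from the example: x1 agrees on 45 of 50, x2 on 46 of 60.
candidates = [Surrogate("x1", 45, 50), Surrogate("x2", 46, 60)]

best_by_percentage = rank_by_percentage(candidates)[0].name
best_by_count = rank_by_count(candidates)[0].name
print(best_by_percentage, best_by_count)
```

Under the percentage rule x1 ranks first (0.90 against roughly 0.77), while under the count rule x2 ranks first (46 against 45), which matches the ordering described in the text.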
11.2 Tree

The user interface to rpart is almost identical to that of the tree functions. In
fact, the rpart object was designed to be a simple superset of the tree class, and
to inherit most methods from it (saving a lot of code writing). However, this close
connection had to be abandoned. The rpart object is still very similar to tree
objects, differing in 3 respects:
- Addition of a method component. This was the single largest reason for
  divergence. In tree, splitting of a categorical variable results in a yprob element
  in the data structure, but regression does not. Most of the "downstream"
  functions then contain the code fragment "if the object contains a yprob
  component, then do A, else do B". rpart has more than two methods, and
  this simple approach does not work. Rather, the method used is itself retained
  in the output.
- Additional components to describe the tree. This includes the yval2
  component, which contains further response information beyond the primary value.
  For the gini method, for instance, the primary response value is the predicted
  class for a node, and the additional value is the complete vector of class counts.
  The predicted probability vector yprob is a function of these, the priors, and
  the tree topology. Other additional components store the competitor and
  surrogate split information.
- The xlevels component in rpart is a list containing, for each factor variable,
  the list of levels for that factor. In tree, the list also contains NULL values for
  the non-factor predictors. In one problem with a very large number (4096) of
  predictors we found that the processing of this list consumed nearly as much
  time and memory as the problem itself. The inefficiency in S-plus that caused
  this may have since been corrected.
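The first difference above, branching on a stored method instead of on the presence of a yprob component, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the rpart or tree source: fitted objects are modelled as plain dictionaries, the function names and handler strings are hypothetical, and the method names are used only as example labels.

```python
def describe_by_probe(fit):
    """Two-way branch: infer the model type from whether yprob is present."""
    if "yprob" in fit:
        return "classification"
    return "regression"  # every object without yprob falls through to here


# One entry per method name. Names and descriptions are illustrative only.
HANDLERS = {
    "anova": "regression",
    "class": "classification",
    "poisson": "event rate",
}


def describe_by_method(fit):
    """Branch on the method recorded in the fitted object itself."""
    return HANDLERS[fit["method"]]


# A hypothetical fit from a third method that carries no yprob component.
poisson_fit = {"method": "poisson"}

probe_answer = describe_by_probe(poisson_fit)
method_answer = describe_by_method(poisson_fit)
print(probe_answer, method_answer)
```

The two-way probe cannot tell the third method apart from regression, whereas the lookup keyed on the stored method handles any number of methods by adding one entry per method.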
Although the rpart structure does not inherit from c...