

... causes slightly different cross-validation groups.

11 Relation to other programs

11.1 CART

Almost all of the definitions in rpart are equivalent to those used in CART, and the output should usually be very similar. The printout given by summary.rpart was also strongly influenced by some early CART output. Some known differences are:

Surrogate splits: CART uses the percentage agreement between the surrogate and the primary split, and rpart uses the total number of agreements. When one of the surrogate variables has missing values this can lead to a different ordering. For instance, assume that the best surrogate based on x1 has 45/50 = 90% agreement (with 10 missing), and the best based on x2 has 46/60 (about 77%) agreement (none missing). Then rpart will pick x2. This is only a serious issue when there are a large number of missing values for one variable, and indeed the change was motivated by examples where a variable that was nearly 100% missing was chosen as the best surrogate due to perfect concordance with the primary split.

Computation: Some versions of the CART code have been optimized for very large data problems, and include such features as subsampling from the larger nodes. Large data sets can be a problem in S-Plus.

11.2 Tree

The user interface to rpart is almost identical to that of the tree functions. In fact, the rpart object was designed to be a simple superset of the tree class, and to inherit most methods from it (saving a lot of code writing). However, this close connection had to be abandoned. The rpart object is still very similar to tree objects, differing in three respects:

Addition of a method component. This was the single largest reason for divergence. In tree, splitting of a categorical variable results in a yprob element in the data structure, but regression does not. Most of the "downstream" functions then contain the code fragment "if the object contains a yprob component, then do A, else do B". rpart has more than two methods, and this simple approach does not work.
Rather, the method used is itself retained in the output.

Additional components to describe the tree. This includes the yval2 component, which contains further response information beyond the primary value. For the gini method, for instance, the primary response value is the predicted class for a node, and the additional value is the complete vector of class counts. The predicted probability vector yprob is a function of these, the priors, and the tree topology. Other additional components store the competitor and surrogate split information.

The xlevels component in rpart is a list containing, for each factor variable, the list of levels for that factor. In tree, the list also contains NULL values for the non-factor predictors. In one problem with a very large number (4096) of predictors we found that the processing of this list consumed nearly as much time and memory as the problem itself. The inefficiency in S-Plus that caused this may have since been corrected.

Although the rpart structure does not inherit from c...
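
The surrogate-split difference described in Section 11.1 can be illustrated with a small sketch. It is written in Python only as a neutral way to show the arithmetic; it is not code from rpart or CART. The `Surrogate` type and both function names are hypothetical, and the sketch assumes the two ranking rules are exactly as the text states them, ignoring everything else the real programs do when choosing surrogates.

```python
from dataclasses import dataclass


@dataclass
class Surrogate:
    """Agreement summary for one candidate surrogate (hypothetical type)."""
    name: str
    agree: int        # observations sent the same way as the primary split
    non_missing: int  # observations where the surrogate variable is observed


def pick_by_percentage(candidates):
    # Rule attributed to CART in the text: highest agreement rate
    # among the observations where the surrogate is not missing.
    return max(candidates, key=lambda s: s.agree / s.non_missing)


def pick_by_count(candidates):
    # Rule attributed to rpart in the text: largest total number of agreements.
    return max(candidates, key=lambda s: s.agree)


# The two candidates from the example in the text.
candidates = [
    Surrogate("x1", agree=45, non_missing=50),  # 45/50 = 90%, 10 missing
    Surrogate("x2", agree=46, non_missing=60),  # 46/60, about 77%, none missing
]

by_percentage = pick_by_percentage(candidates).name
by_count = pick_by_count(candidates).name
print(by_percentage, by_count)  # prints "x1 x2"
```

Ranking by percentage favors x1 even though it is missing for 10 observations, while ranking by count favors x2, which matches the choice the text attributes to rpart.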

This document was uploaded on 09/26/2013.
