Rpart_TechReport61

# An event is $c_i \log(c_i/0)$, which is infinite


...unfortunately it is not so. One given of tree-based modeling is that a right-sized model is arrived at by purposely over-fitting the data and then pruning back the branches. A program that aborts due to a numeric exception during the first stage is uninformative, to say the least. Of more concern is that this edge effect does not seem to be limited to the pathologic case detailed above. Any near approach to the boundary value $\lambda = 0$ leads to large values of the deviance, and the procedure tends to discourage any final node with a small number of events.

An ad hoc solution is to use the revised estimate

$$\hat\lambda = \max\left(\hat\lambda,\ \frac{k}{\sum t_i}\right),$$

where $k$ is 1/2 or 1/6. That is, pure nodes are given a partial event. (This is similar to the starting estimates used in the GLM program for a Poisson regression.) This is unsatisfying, however, and we propose instead using a shrinkage estimate.

Assume that the true rates $\lambda_j$ for the leaves of the tree are random values from a Gamma($\mu, \sigma$) distribution. Set $\mu$ to the observed overall event rate $\sum c_i / \sum t_i$, and let the user choose as a prior the coefficient of variation $k = \sigma/\mu$. A value of $k = 0$ represents extreme pessimism ("the leaf nodes will all give the same result"), whereas $k = \infty$ represents extreme optimism. The Bayes estimate of the event rate for a node works out to be

$$\hat\lambda_k = \frac{\alpha + \sum c_i}{\beta + \sum t_i},$$

where $\alpha = 1/k^2$ and $\beta = \alpha/\mu$.

This estimate is scale invariant, has a simple interpretation, and shrinks least those nodes with a large amount of information. In practice, a value of $k = 10$ does essentially no shrinkage. For method='poisson', the optional parameters list is the single number $k$, with a default value of 1. This corresponds to a prior coefficient of variation of 1 for the estimated $\lambda_j$. We have not nearly enough experience to decide if this is a good value. (It does stop the log(0) message, though.)

Cross-validation does not work very well.
The procedure gives very conservative results, and quite often declares the no-split tree to be the best. This may be another artifact of the edge effect.

## 8.3 Example: solder data

The solder data frame, as explained in the Splus help file, is a design object with 900 observations, which are the results of an experiment varying 5 factors relevant to the wave-soldering procedure for mounting components on printed circuit boards. The response variable, skips, is a count of how many solder skips appeared to a visual inspection. The other variables are listed below:

| Variable | Type | Description |
|---|---|---|
| Opening | factor | amount of clearance around the mounting pad (S, M, L) |
| Solder | factor | amount of solder used (Thin, Thick) |
| Mask | factor | type of solder mask used (5 possible) |
| PadType | factor | mounting pad used (10 possible) |
| Panel | factor | panel (1, 2 or 3) on board being counted |

In this call, the rpart.control options are modified: maxcompete = 2 means that only 2 other competing splits are listed (default is 4); cp = .05 means that a smaller tree will be built initially (default is .01). The y variable for Poisson partitioning may be a two column matrix con...
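Returning to the shrinkage estimate proposed earlier for the Poisson method, a small numeric sketch may help show why it avoids the infinite deviance. The Python below is an illustration only and is not the rpart code. The function names `shrunken_rate` and `poisson_deviance` and the event counts are hypothetical. The deviance is assumed to take the standard Poisson form $2[c \log(c/(\lambda t)) - (c - \lambda t)]$, and the prior is parameterised by a shape of $1/k^2$ and a rate equal to that shape divided by the overall event rate.

```python
import math


def shrunken_rate(events, time, overall_rate, k=1.0):
    """Gamma-prior (shrinkage) estimate of a node's event rate.

    events, time : total event count and total exposure time in the node
    overall_rate : overall event rate (total events / total time), the prior mean
    k            : prior coefficient of variation; must be > 0 in this sketch
    """
    alpha = 1.0 / k**2            # prior shape, acts like a count of prior events
    beta = alpha / overall_rate   # prior rate, acts like prior exposure time
    return (alpha + events) / (beta + time)


def poisson_deviance(count, time, rate):
    """Deviance contribution of one observation under a predicted rate."""
    expected = rate * time
    if count == 0:
        return 2.0 * expected
    if expected == 0:
        return math.inf  # an event where the model says events are impossible
    return 2.0 * (count * math.log(count / expected) - (count - expected))


# Hypothetical data: 30 events in 200 time units overall,
# and a pure node with no events in 20 time units.
overall = 30 / 200
raw = 0 / 20
shrunk = shrunken_rate(0, 20, overall, k=1.0)

print("raw estimate:", raw, "shrunken estimate:", shrunk)
# Deviance for a held-out subject with one event over one time unit:
print("deviance with raw estimate:", poisson_deviance(1, 1.0, raw))
print("deviance with shrunken estimate:", poisson_deviance(1, 1.0, shrunk))
```

With the raw estimate the held-out event has infinite deviance, while the shrunken estimate keeps it finite. Calling `shrunken_rate(5, 20, overall, k=10)` stays very close to the raw rate of 5/20, in line with the remark that $k = 10$ does essentially no shrinkage.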

## This document was uploaded on 09/26/2013.
