nately it is not
so. One given of tree-based modeling is that a right-sized model is arrived at by
purposely over-fitting the data and then pruning back the branches. A program
that aborts due to a numeric exception during the first stage is uninformative, to
say the least. Of more concern is that this edge effect does not seem to be limited
to the pathological case detailed above. Any near approach to the boundary value
$\lambda = 0$ leads to large values of the deviance, and the procedure tends to discourage
any final node with a small number of events.
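
The edge effect can be seen numerically. The following is a minimal sketch in Python, not
code from the program discussed here: the function name poisson_deviance and the inputs are
assumptions made for illustration, and it uses the standard Poisson deviance contribution
2[c log(c/(rate*t)) - (c - rate*t)], which may differ from the text's convention by a
constant factor.

```python
import math


def poisson_deviance(count, exposure, rate):
    """Deviance contribution of one observation under a Poisson model.

    Uses 2 * [c * log(c / (rate * t)) - (c - rate * t)], with the log term
    taken as 0 when c == 0. (Hypothetical helper, for illustration only.)
    """
    expected = rate * exposure
    if count == 0:
        return 2.0 * expected
    if expected == 0.0:
        # An event was observed where the model says none can occur.
        return math.inf
    return 2.0 * (count * math.log(count / expected) - (count - expected))


# One held-out subject with a single event over one unit of time,
# scored against smaller and smaller estimated rates.
for rate in (0.5, 0.05, 0.0005, 0.0):
    print(rate, poisson_deviance(1, 1.0, rate))
```

The contribution grows without bound as the estimated rate approaches zero and is infinite
at zero, which is why a node with very few events is penalized so heavily.
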
An ad hoc solution is to use the revised estimate
$$\hat\lambda = \max\left(\hat\lambda,\ \frac{k}{\sum t_i}\right),$$
where $k$ is 1/2 or 1/6. That is, pure nodes are given a partial event. (This is similar
to the starting estimates used in the GLM program for a Poisson regression.) This
is unsatisfying, however, and we propose instead using a shrinkage estimate.
Assume that the true rates $\lambda_j$ for the leaves of the tree are random values from
a Gamma($\mu$, $\sigma$) distribution. Set $\mu$ to the observed overall event rate
$\sum c_i / \sum t_i$, and let the user choose as a prior the coefficient of variation
$k = \sigma/\mu$. A value of $k = 0$ represents extreme pessimism ("the leaf nodes will
all give the same result"), whereas $k = \infty$ represents extreme optimism. The Bayes
estimate of the event rate for a node works out to be
$$\hat\lambda_k = \frac{\alpha + \sum c_i}{\beta + \sum t_i},$$
where $\alpha = 1/k^2$ and $\beta = \alpha/\mu$.
This estimate is scale invariant, has a simple interpretation, and shrinks least
those nodes with a large amount of information. In practice, a value of $k = 10$ does
essentially no shrinkage. For method='poisson', the optional parameters list is the
single number $k$, with a default value of 1. This corresponds to a prior coefficient of
variation of 1 for the estimated $\lambda_j$. We have not nearly enough experience to decide
if this is a good value. (It does stop the log(0) message, though.)
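
To make the behavior of the shrinkage estimate concrete, here is a minimal sketch in Python.
It is not the program's implementation: the function name shrunken_rate, the overall rate of
0.10, and the node with 20 units of exposure and no events are all hypothetical values chosen
for illustration. The Gamma prior is parameterized so that its mean is the overall rate and
its coefficient of variation is k.

```python
def shrunken_rate(node_events, node_time, overall_rate, k=1.0):
    """Gamma-prior (Bayes) estimate of a node's event rate.

    alpha = 1 / k**2 and beta = alpha / overall_rate, so the prior has mean
    overall_rate and coefficient of variation k. (Hypothetical helper.)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    alpha = 1.0 / k**2
    beta = alpha / overall_rate
    return (alpha + node_events) / (beta + node_time)


overall = 0.10          # assumed overall rate: sum(c_i) / sum(t_i)
events, time = 0, 20.0  # a "pure" node: 20 units of exposure, no events

# Small k pulls the node toward the overall rate; large k leaves it
# close to the raw estimate events / time, but never exactly zero.
for k in (0.1, 1.0, 10.0):
    print(k, shrunken_rate(events, time, overall, k))
```

Because the estimate is strictly positive even for a node with no events, the logarithm in
the deviance stays finite. Rescaling the time unit rescales the estimate by the same factor,
which is the scale invariance noted above.
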
Cross-validation does not work very well. The procedure gives very conservative
results, and quite often declares the no-split tree to be the best. This may be another
artifact of the edge effect.

8.3 Example: solder data

The solder data frame, as explained in the Splus help file, is a design object with 900
observations, which are the results of an experiment varying 5 factors relevant to the
wave-soldering procedure for mounting components on printed circuit boards. The
response variable, skips, is a count of how many solder skips appeared to a visual
inspection. The other variables are listed below:

Opening  factor: amount of clearance around the mounting pad (S < M < L)
Solder   factor: amount of solder used (Thin < Thick)
Mask     factor: type of solder mask used (5 possible)
PadType  factor: mounting pad used (10 possible)
Panel    factor: panel (1, 2 or 3) on board being counted

In this call, the rpart.control options are modified: maxcompete = 2 means that
only 2 other competing splits are listed (default is 4); cp = .05 means that a smaller
tree will be built initially (default is .01). The y variable for Poisson partitioning
may be a two column matrix con...
This document was uploaded on 09/26/2013.