nately it is not
so. One given of tree-based modeling is that a right-sized model is arrived at by
purposely overfitting the data and then pruning back the branches. A program
that aborts due to a numeric exception during the first stage is uninformative, to
say the least. Of more concern is that this edge effect does not seem to be limited
to the pathological case detailed above. Any near approach to the boundary value
$\lambda = 0$ leads to large values of the deviance, and the procedure tends to discourage
any final node with a small number of events.
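The edge effect can be seen numerically. The sketch below assumes the usual Poisson deviance contribution for a single observation, 2[c log(c / (λt)) − (c − λt)]; the factor of 2 and the helper name `poisson_deviance` are assumptions made for illustration, not taken from the text. It evaluates the contribution of one observed event as the fitted rate approaches 0.

```python
import math

def poisson_deviance(c, t, lam):
    """Deviance contribution of one observation with c events in time t,
    under a fitted rate lam (hypothetical helper; standard Poisson form)."""
    mu = lam * t
    if c == 0:
        return 2.0 * mu          # the c*log(c/mu) term vanishes when c == 0
    if mu == 0:
        return math.inf          # an event under a zero rate: log(c/0)
    return 2.0 * (c * math.log(c / mu) - (c - mu))

# One event in one unit of time, with the fitted rate shrinking toward 0.
for lam in (1.0, 0.1, 0.001, 1e-6, 0.0):
    print(lam, poisson_deviance(1, 1.0, lam))
```

The contribution grows without bound as the rate nears 0 and is infinite at exactly 0, which is the behaviour that penalizes nodes with few events.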
An ad hoc solution is to use the revised estimate
$$\hat\lambda = \max\left(\hat\lambda,\; \frac{k}{\sum t_i}\right),$$
where $k$ is 1/2 or 1/6. That is, pure nodes are given a partial event. This is similar
to the starting estimates used in the GLM program for a Poisson regression. This
is unsatisfying, however, and we propose instead using a shrinkage estimate.
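As a minimal sketch of the ad hoc floor described above (the function name `floored_rate` and its arguments are hypothetical, chosen only for this illustration), a node's rate is the larger of its raw rate and k divided by its total exposure time:

```python
def floored_rate(events, times, k=0.5):
    """Ad hoc estimate max(lambda_hat, k / sum(t_i)) for one node.
    `floored_rate` and its argument names are illustrative only."""
    total_time = sum(times)
    lam_hat = sum(events) / total_time
    return max(lam_hat, k / total_time)

# A pure node: 20 subjects, one unit of follow-up each, no events.
print(floored_rate([0] * 20, [1.0] * 20, k=0.5))

# A node that already has events is left unchanged.
print(floored_rate([1, 0, 2], [1.0, 1.0, 1.0], k=0.5))
```

With k = 1/2 the pure node is treated as if it had half an event, so its rate is small but nonzero.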
Assume that the true rates $\lambda_j$ for the leaves of the tree are random values from
a Gamma($\mu$, $\sigma$) distribution. Set $\mu$ to the observed overall event rate $\sum c_i / \sum t_i$,
and let the user choose as a prior the coefficient of variation $k = \sigma/\mu$. A value of
$k = 0$ represents extreme pessimism ("the leaf nodes will all give the same result"),
whereas $k = \infty$ represents extreme optimism. The Bayes estimate of the event rate
for a node works out to be
$$\hat\lambda_k = \frac{\alpha + \sum c_i}{\beta + \sum t_i},$$
where $\alpha = 1/k^2$ and $\beta = \alpha/\hat\lambda$.
This estimate is scale invariant, has a simple interpretation, and shrinks least
those nodes with a large amount of information. In practice, a value of k = 10 does
essentially no shrinkage. For method='poisson', the optional parameters list is the
single number k, with a default value of 1. This corresponds to a prior coefficient of
variation of 1 for the estimated $\lambda_j$. We have not nearly enough experience to decide
if this is a good value. It does stop the log(0) message, though.
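The shrinkage estimate can be sketched in a few lines. This is an illustration under stated assumptions, not the rpart implementation: the name `shrunk_rate` and its arguments are hypothetical, and the rate in the definition of beta is taken to be the overall observed event rate, so that the prior mean alpha/beta equals that overall rate.

```python
import math

def shrunk_rate(node_events, node_time, total_events, total_time, k=1.0):
    """Shrinkage estimate (alpha + sum c_i) / (beta + sum t_i) for one node.

    alpha = 1/k^2 and beta = alpha/mu, where mu is the overall event rate
    (an assumption about the notation). Names are illustrative, not the
    rpart API.
    """
    mu = total_events / total_time
    if k == 0:
        return mu                       # limit k -> 0: every node gets the overall rate
    if math.isinf(k):
        return node_events / node_time  # limit k -> infinity: no shrinkage
    alpha = 1.0 / k**2
    beta = alpha / mu
    return (alpha + node_events) / (beta + node_time)

# Overall: 50 events in 100 time units (overall rate 0.5).
# Node: 0 events in 10 time units (raw rate 0).
for k in (0.0, 1.0, 10.0, math.inf):
    print(k, shrunk_rate(0, 10.0, 50, 100.0, k))
```

In this made-up example, k = 1 moves the pure node's rate from 0 to 1/12, while k = 10 leaves it within 0.001 of the raw value. Doubling every time value halves the estimate, which is the scale invariance mentioned above.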
Cross-validation does not work very well. The procedure gives very conservative
results, and quite often declares the no-split tree to be the best. This may be another
artifact of the edge effect.

8.3 Example: solder data

The solder data frame, as explained in the Splus help file, is a design object with 900
observations, which are the results of an experiment varying 5 factors relevant to the
wave-soldering procedure for mounting components on printed circuit boards. The
response variable, skips, is a count of how many solder skips appeared to a visual
inspection. The other variables are listed below:
Opening   factor: amount of clearance around the mounting pad (S, M, L)
Solder    factor: amount of solder used (Thin, Thick)
Mask      factor: type of solder mask used (5 possible)
PadType   factor: mounting pad used (10 possible)
Panel     factor: panel (1, 2 or 3) on board being counted

In this call, the rpart.control options are modified: maxcompete = 2 means that
only 2 other competing splits are listed (default is 4); cp = .05 means that a smaller
tree will be built initially (default is .01). The y variable for Poisson partitioning
may be a two column matrix con...
This document was uploaded on 09/26/2013.