numbers α_1, α_2, ..., α_m; both T_1, ..., T_m and R(T_1), ..., R(T_m) can be computed efficiently.
Using the first result, we can uniquely define T_α as the smallest tree T for which R_α(T) is minimized.
Since any sequence of nested trees based on T has at most |T| members, result 2 implies that all of the possible values of α can be grouped into m intervals, m ≤ |T|,

    I_1 = [0, α_1]
    I_2 = (α_1, α_2]
    ...
    I_m = (α_(m-1), ∞]

where all α ∈ I_i share the same minimizing subtree.
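As a small illustration, looking up which interval a given α falls in is a one-line binary search. The breakpoint values below are made up for the example; only the interval structure comes from the text.

```python
import bisect

# Hypothetical breakpoints alpha_1 < alpha_2 < ... < alpha_(m-1), defining
# I_1 = [0, a_1], I_2 = (a_1, a_2], ..., I_m = (a_(m-1), inf].
breakpoints = [0.01, 0.05, 0.20]  # made-up values for illustration

def interval_index(alpha, breakpoints):
    """Return i such that alpha falls in interval I_i (1-based)."""
    # bisect_left counts breakpoints strictly below alpha, so a value
    # equal to a breakpoint is assigned to the lower (closed) interval.
    return bisect.bisect_left(breakpoints, alpha) + 1

print(interval_index(0.0, breakpoints))   # 1: in I_1 = [0, 0.01]
print(interval_index(0.01, breakpoints))  # 1: boundary belongs to I_1
print(interval_index(0.03, breakpoints))  # 2: in I_2 = (0.01, 0.05]
print(interval_index(1.0, breakpoints))   # 4: in I_4 = (0.20, inf]
```

All α in the same interval select the same minimizing subtree, so pruning only needs to be done once per interval.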
4.2 Cross-validation

Cross-validation is used to choose a best value for α by the following steps:
1. Fit the full model on the data set:
   - compute α_1, α_2, ..., α_m
   - set β_1 = 0, β_2 = √(α_1 α_2), β_3 = √(α_2 α_3), ..., β_(m-1) = √(α_(m-2) α_(m-1)), β_m = ∞
   - each β_i is a 'typical value' for its I_i.
2. Divide the data set into s groups G_1, G_2, ..., G_s, each of size n/s, and for each group separately:
   - fit a full model on the data set 'everyone except G_i' and determine T_β1, T_β2, ..., T_βm for this reduced data set,
   - compute the predicted class for each observation in G_i, under each of the models T_βj for 1 ≤ j ≤ m,
   - from this compute the risk for each subject.
3. Sum over the G_i to get an estimate of risk for each β_j. For the complexity parameter β with smallest risk, compute T_β for the full data set; this is chosen as the best trimmed tree.
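The 'typical values' of step 1 are geometric means of adjacent breakpoints. A minimal sketch (the α values below are made up for the example):

```python
import math

def typical_values(alphas):
    """Given breakpoints alpha_1 < ... < alpha_(m-1), return
    beta_1 = 0, beta_i = sqrt(alpha_(i-1) * alpha_i), beta_m = inf."""
    betas = [0.0]
    # geometric mean of each consecutive pair of breakpoints
    for lo, hi in zip(alphas, alphas[1:]):
        betas.append(math.sqrt(lo * hi))
    betas.append(math.inf)
    return betas

print(typical_values([0.01, 0.04, 0.16]))  # [0.0, 0.02, 0.08, inf]
```

The geometric mean is the natural midpoint here because the α breakpoints typically spread over several orders of magnitude.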
In actual practice, we may use instead the 1-SE rule. A plot of β versus risk often has an initial sharp drop followed by a relatively flat plateau and then a slow rise. The choice of β among those models on the plateau can be essentially random. To avoid this, both an estimate of the risk and its standard error are computed during the cross-validation. Any risk within one standard error of the achieved minimum is marked as being equivalent to the minimum, i.e., considered to be part of the flat plateau. Then the simplest model, among all those 'tied' on the plateau, is chosen.
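The 1-SE selection itself can be sketched in a few lines. The ordering assumption below (models listed from most complex to simplest) is an illustration choice, not something the text specifies:

```python
def one_se_choice(risks, std_errs):
    """Apply the 1-SE rule: among all models whose cross-validated risk is
    within one standard error of the minimum, pick the simplest one.
    Assumes models are ordered from most complex (index 0) to simplest
    (last index), so the largest qualifying index is returned."""
    best = min(risks)
    threshold = best + std_errs[risks.index(best)]
    # Scan from the simplest end; the first model on the "plateau" wins.
    for i in range(len(risks) - 1, -1, -1):
        if risks[i] <= threshold:
            return i

# Minimum risk is 0.24 (index 2), but 0.26 at index 3 is within one SE,
# so the simpler model at index 3 is chosen.
print(one_se_choice([0.30, 0.25, 0.24, 0.26, 0.40], [0.02] * 5))  # 3
```

Without the SE tolerance, the argmin would pick index 2 even though its advantage over index 3 is within the noise of the cross-validation estimate.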
In the usual definition of cross-validation we would have taken s = n above, i.e., each of the G_i would contain exactly one observation, but for moderate n this is computationally prohibitive. A value of s = 10 has been found to be sufficient, but users can vary this if they wish.
In Monte-Carlo trials, this method of pruning has proven very reliable for screening out 'pure noise' variables in the data set.
4.3 Example: The Stochastic Digit Recognition Problem

This example is found in section 2.6 of [1], and is used as a running example throughout much of their book. Consider the segments of an unreliable digital readout:
[Figure: a seven-segment digital readout, with the individual lights numbered 1 through 7.]
where each light is correct with probability 0.9; e.g., if the true digit is a 2, the lights 1, 3, 4, 5, and 7 are on with probability 0.9 and lights 2 and 6 are on with probability 0.1. Construct test data where Y ∈ {0, 1, ..., 9}, each with proportion 1/10, and the X_i, i = 1, ..., 7 are i.i.d. Bernoulli variables with parameter depending on Y. X_8, ..., X_24 are generated as i.i.d. Bernoulli with P{X_i = 1} = 0.5, and are independent of Y. They correspond to embedding the readout in a larger rectangle of random lights.
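Generating this test data can be sketched as follows. The segment pattern for digit 2 matches the text; the patterns for the other digits are an assumption based on the standard seven-segment layout:

```python
import random

# Segments 1-7 that are "on" for each digit, assuming the usual layout
# (1=top, 2=upper-left, 3=upper-right, 4=middle, 5=lower-left,
# 6=lower-right, 7=bottom); digit 2's pattern {1,3,4,5,7} matches the text.
SEGMENTS = {
    0: {1, 2, 3, 5, 6, 7}, 1: {3, 6},          2: {1, 3, 4, 5, 7},
    3: {1, 3, 4, 6, 7},    4: {2, 3, 4, 6},    5: {1, 2, 4, 6, 7},
    6: {1, 2, 4, 5, 6, 7}, 7: {1, 3, 6},       8: set(range(1, 8)),
    9: {1, 2, 3, 4, 6, 7},
}

def make_observation(rng=random):
    """One observation: Y uniform on 0..9; X1-X7 show the true segment
    state with probability 0.9; X8-X24 are pure-noise Bernoulli(0.5)."""
    y = rng.randrange(10)
    x = []
    for i in range(1, 8):
        on = i in SEGMENTS[y]
        # each light flips to the wrong state with probability 0.1
        x.append(int(on) if rng.random() < 0.9 else int(not on))
    x += [int(rng.random() < 0.5) for _ in range(8, 25)]
    return y, x

y, x = make_observation()
print(y, x)  # a digit 0-9 and its 24 binary lights
```

Repeating `make_observation` n times yields the test data set; a classification tree fit to it should retain only the informative lights X_1 through X_7.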