A data set of size 200 was generated accordingly and the procedure applied using the Gini index (see section 3.2.1) to build the tree. The S-plus code to compute the simulated data and the fit are shown below.
n <- 200
y <- rep(0:9, length = 200)
temp <- c(1,1,1,0,1,1,1,
          0,0,1,0,0,1,0,
          1,0,1,1,1,0,1,
          1,0,1,1,0,1,1,
          0,1,1,1,0,1,0,
          1,1,0,1,0,1,1,
          0,1,0,1,1,1,1,
          1,0,1,0,0,1,0,
          1,1,1,1,1,1,1,
          1,1,1,1,0,1,0)
lights <- matrix(temp, 10, 7, byrow = T)      # The true light pattern 0-9
temp1 <- matrix(rbinom(n*7, 1, .9), n, 7)     # Noisy lights
temp1 <- ifelse(lights[y+1, ] == 1, temp1, 1-temp1)
temp2 <- matrix(rbinom(n*17, 1, .5), n, 17)   # Random lights
x <- cbind(temp1, temp2)                      # x is the matrix of predictors

Figure 4: Optimally pruned tree for the stochastic digit recognition data

The particular data set of this example can be replicated by setting .Random.seed to c(21, 14, 49, 32, 43, 1, 32, 22, 36, 23, 28, 3) before the call to rbinom. Now we fit the model:
temp3 <- rpart.control(xval = 10, minbucket = 2, minsplit = 4, cp = 0)
dfit <- rpart(y ~ x, method = 'class', control = temp3)
printcp(dfit)

Classification tree:
rpart(formula = y ~ x, method = "class", control = temp3)

Variables actually used in tree construction:
 [1] x.1  x.10 x.12 x.13 x.15 x.19 x.2  x.20 x.22 x.3  x.4  x.5  x.6  x.7  x.8

Root node error: 180/200 = 0.9

           CP nsplit rel error  xerror      xstd
1   0.1055556      0   1.00000 1.09444 0.0095501
2   0.0888889      2   0.79444 1.01667 0.0219110
3   0.0777778      3   0.70556 0.90556 0.0305075
4   0.0666667      5   0.55556 0.75000 0.0367990
5   0.0555556      8   0.36111 0.56111 0.0392817
6   0.0166667      9   0.30556 0.36111 0.0367990
7   0.0111111     11   0.27222 0.37778 0.0372181
8   0.0083333     12   0.26111 0.36111 0.0367990
9   0.0055556     16   0.22778 0.35556 0.0366498
10  0.0027778     27   0.16667 0.34444 0.0363369
11  0.0013889     31   0.15556 0.36667 0.0369434
12  0.0000000     35   0.15000 0.36667 0.0369434

fit9 <- prune(dfit, cp = .02)
plot(fit9, branch = .3, compress = T)
text(fit9)

The cp table differs from that in section 3.5 of [1] in several ways, the last two of which are somewhat important.
- The actual values are different, of course, because of different random number generators in the two runs.
- The table is printed from the smallest tree (no splits) to the largest one (35 splits). We find it easier to compare one tree to another when they start at the same place.
- The number of splits is listed, rather than the number of nodes. The number of nodes is always 1 + the number of splits.
- For easier reading, the error columns have been scaled so that the first node has an error of 1. Since in this example the model with no splits must make 180/200 misclassifications, multiply columns 3-5 by 180 to get a result in terms of absolute error. (Computations are done on the absolute error scale, and printed on the relative scale.)
- The complexity parameter column (cp) has been similarly scaled.
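To make the rescaling concrete (my arithmetic, not from the original text): the last row of the table has a relative error of 0.15000, so the 35-split tree makes 0.15 * 180 = 27 absolute misclassifications out of the 200 observations. The same rescaling applies to the xerror column. In S/R:

```r
# relative error of the 35-split tree, times the root node error count
0.15000 * 180    # 27 misclassified observations
```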
Looking at the cp table, we see that the best tree has 10 terminal nodes (9 splits), based on cross-...
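One common way to formalize this choice is the one-SE rule: take the smallest tree whose cross-validated error is within one standard error of the minimum. A sketch of that selection (a standard rpart idiom, my addition rather than the original text; dfit and its cptable as fitted above):

```r
cp.tab <- dfit$cptable                       # columns: CP, nsplit, rel error, xerror, xstd
i.min  <- which.min(cp.tab[, "xerror"])      # row with the smallest cross-validated error
cutoff <- cp.tab[i.min, "xerror"] + cp.tab[i.min, "xstd"]
i.best <- min(which(cp.tab[, "xerror"] <= cutoff))  # smallest tree within one SE
fit.best <- prune(dfit, cp = cp.tab[i.best, "CP"])
```

Applied to the table above, the minimum xerror (0.34444, at 27 splits) plus one standard error gives a cutoff of about 0.381, and the smallest tree under that cutoff is the 9-split (10-node) tree, consistent with the text.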