…to some criterion. This may not be optimal for the final tree, even for the same criterion, as you will see in your homework. However, the greedy approach is computationally efficient, so it is popular.

How to Apply Hunt's Algorithm (continued)
Using the greedy approach we still have to decide 3
things:
#1) What attribute test conditions to consider
#2) What criterion to use to select the “best” split
#3) When to stop splitting
For #1 we will consider only binary splits for both numeric and categorical predictors, as discussed on the next slide.
For #2 we will consider misclassification error, Gini index, and entropy.
#3 is a subtle business involving model selection. It is tricky because we don't want to overfit or underfit.

#1) What Attribute Test Conditions to Consider (Section 4.3.3, Page 155)
We will consider only binary splits for both numeric and categorical predictors, as discussed, but your book talks about multiway splits also.
Nominal: any two-way grouping of the categories is allowed, e.g., CarType split into {Sports, Luxury} vs. {Family}.
Ordinal: like nominal, but don't break the order with the split, e.g., Size split into {Small, Medium} vs. {Large} OR {Small} vs. {Medium, Large}.
Numeric: often use midpoints between observed values as thresholds, e.g., Taxable Income > 80K? with Yes/No branches.
A sketch of how such candidate splits can be enumerated follows.
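As a rough illustration, here is a minimal R sketch (not from the lecture; the income values and size levels are made-up toy data) of enumerating these candidate binary splits:

# Numeric predictor: candidate thresholds are the midpoints between
# consecutive sorted values (toy incomes in $1000s, assumed here)
income <- c(60, 70, 75, 85, 90, 95, 100, 120)
vals <- sort(unique(income))
midpoints <- (head(vals, -1) + tail(vals, -1)) / 2
midpoints  # each midpoint m defines the binary test: income > m?

# Ordinal predictor: only splits that preserve the category order
sizes <- c("Small", "Medium", "Large")
for (k in 1:(length(sizes) - 1)) {
  cat("{", paste(sizes[1:k], collapse = ", "), "} vs. {",
      paste(sizes[(k + 1):length(sizes)], collapse = ", "), "}\n")
}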
#2) What criterion to use to select the “best” split (Section 4.3.4, Page 158)
We will consider misclassification error, Gini index, and entropy:
Misclassification Error: Error(t) = 1 − max_i P(i | t)
Gini Index: GINI(t) = 1 − Σ_j [p(j | t)]²
Entropy: Entropy(t) = −Σ_j p(j | t) log₂ p(j | t)
A sketch computing all three measures from a node's class counts follows.
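To make the formulas concrete, here is a minimal R sketch (our own helper, not from the lecture) computing all three measures for a single node from its vector of class counts:

# Impurity measures for one node, given class counts such as c(1, 5)
impurity <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # drop zero-count classes so log2(0) never occurs
  c(error   = 1 - max(p),          # misclassification error
    gini    = 1 - sum(p^2),        # Gini index
    entropy = -sum(p * log2(p)))   # entropy in bits
}

impurity(c(1, 5))  # gini = 0.278, matching a single-node example below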
Misclassification Error
Error(t) = 1 − max_i P(i | t)
Misclassification error is usually our final metric, which we want to minimize on the test set, so there is a logical argument for using it as the split criterion.
It is simply the fraction of total cases misclassified.
1 − Misclassification error = “Accuracy” (page 149)

In class exercise #28:
This is textbook question #7 part (a) on page 201.

Gini Index
GINI(t) = 1 − Σ_j [p(j | t)]²
This is commonly used in many algorithms like CART and by the rpart() function in R.
After the Gini index is computed in each child node, the overall value for the split is computed as the weighted average of the Gini index across the nodes:
GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)
where n_i is the number of records at child node i and n is the total number of records at the parent. A small sketch of this weighted average follows.
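Here is a minimal R sketch of this weighted average (the helpers gini and gini_split are our own names); the commented rpart() call shows the usual way to request Gini-based splits, with the built-in iris data as a placeholder:

gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

# GINI_split: weighted average over a list of child class-count vectors
gini_split <- function(children) {
  n_i <- sapply(children, sum)               # records in each child node
  sum((n_i / sum(n_i)) * sapply(children, gini))
}

gini_split(list(c(3, 0), c(4, 3)))  # 0.343, as in a later slide's example

# library(rpart)
# fit <- rpart(Species ~ ., data = iris, method = "class",
#              parms = list(split = "gini"))  # Gini is rpart's default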
Gini Examples for a Single Node
GINI(t) = 1 − Σ_j [p(j | t)]²

Node counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

Node counts C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

Node counts C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444

In class exercise #29:
This is textbook question #3 part (f) on page 200.

Misclassification Error vs. Gini Index
Parent node: C1 = 7, C2 = 3, Gini = 1 − (7/10)² − (3/10)² = 0.42
Split on attribute A: Yes → Node N1, No → Node N2
N1 counts: C1 = 3, C2 = 0
N2 counts: C1 = 4, C2 = 3
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.490
Gini(Children) = 3/10 × 0 + 7/10 × 0.490 = 0.343
The Gini index decreases from 0.42 to 0.343, while the misclassification error stays at 30%: before the split the best guess (C1) misclassifies 3 of 10 records, and after it N1 misclassifies 0 and N2 misclassifies 3, still 3 of 10. This illustrates why we often want to use a surrogate loss function like the Gini index even if we really only care about misclassification. A quick sketch verifying these numbers follows.
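A quick R check of these numbers (class counts copied from the slide):

error_rate <- function(counts) { p <- counts / sum(counts); 1 - max(p) }
gini       <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }

parent <- c(7, 3); n1 <- c(3, 0); n2 <- c(4, 3)

gini(parent)                                             # 0.42
(sum(n1) * gini(n1) + sum(n2) * gini(n2)) / sum(parent)  # 0.343: decreases

error_rate(parent)                                       # 0.30
(sum(n1) * error_rate(n1) + sum(n2) * error_rate(n2)) / sum(parent)
# also 0.30: the split looks worthless to misclassification error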