Stats 202 - Lecture 7

Greedy means that the optimal split is chosen at each step according to some criterion.

This may not be optimal at the end, even for the same criterion, as you will see in your homework. However, the greedy approach is computationally efficient, so it is popular.

How to Apply Hunt's Algorithm (continued)

Using the greedy approach we still have to decide 3 things:
#1) What attribute test conditions to consider
#2) What criterion to use to select the "best" split
#3) When to stop splitting

For #1 we will consider only binary splits for both numeric and categorical predictors, as discussed on the next slide.
For #2 we will consider misclassification error, the Gini index and entropy.
#3 is a subtle business involving model selection. It is tricky because we don't want to overfit or underfit.

#1) What Attribute Test Conditions to Consider (Section 4.3.3, Page 155)

We will consider only binary splits for both numeric and categorical predictors, but your book also discusses multiway splits.

Nominal: split the categories into two groups, e.g. CarType into {Sports, Luxury} vs. {Family}.
Ordinal: like nominal, but don't break the order with the split, e.g. Size into {Small, Medium} vs. {Large} OR {Small} vs. {Medium, Large}.
Numeric: often use midpoints between observed values, e.g. Taxable Income > 80K? Yes / No.

#2) What Criterion to Use to Select the "Best" Split (Section 4.3.4, Page 158)

We will consider misclassification error, the Gini index and entropy:

Misclassification Error: Error(t) = 1 - max_i P(i | t)
Gini Index: GINI(t) = 1 - sum_j [p(j | t)]^2
Entropy: Entropy(t) = -sum_j p(j | t) log2 p(j | t)

Misclassification Error

Error(t) = 1 - max_i P(i | t)

Misclassification error is usually our final metric, which we want to minimize on the test set, so there is a logical argument for using it as the split criterion. It is simply the fraction of total cases misclassified. 1 - misclassification error = "accuracy" (page 149).

In-class exercise #28: This is textbook question #7 part (a) on page 201.

Gini Index

GINI(t) = 1 - sum_j [p(j | t)]^2

This is commonly used in many algorithms, such as CART and the rpart() function in R. After the Gini index is computed in each node, the overall value of the Gini index for the split is computed as the weighted average of the Gini index over the child nodes:

GINI_split = sum_{i=1}^{k} (n_i / n) * GINI(i)

Gini Examples for a Single Node

GINI(t) = 1 - sum_j [p(j | t)]^2

Node counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 - 0^2 - 1^2 = 0
Node counts C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6, Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Node counts C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6, Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

In-class exercise #29: This is textbook question #3 part (f) on page 200.

Misclassification Error vs. Gini Index

Parent node: C1 = 7, C2 = 3, Gini = 1 - (7/10)^2 - (3/10)^2 = 0.42.
Splitting on attribute A (Yes / No) gives two child nodes:
Node N1: C1 = 3, C2 = 0, Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Node N2: C1 = 4, C2 = 3, Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.490
Gini(Children) = 3/10 * 0 + 7/10 * 0.49 = 0.343

The Gini index decreases from 0.42 to 0.343, while the misclassification error stays at 30%. This illustrates why we often want to use a surrogate loss function like the Gini index even if we really only care about misclassification.
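As a concrete illustration of the three criteria, here is a minimal R sketch (not from the lecture; the function name node_impurity is made up for illustration) that computes the misclassification error, Gini index and entropy of a single node from its vector of class counts, reproducing the single-node Gini examples above.

# Minimal sketch (not from the lecture): the three impurity measures for a
# single node, given that node's vector of class counts.
node_impurity <- function(counts) {
  p <- counts / sum(counts)              # p(j | t): class proportions in node t
  p_pos <- p[p > 0]                      # drop zeros so 0 * log2(0) is treated as 0
  c(error   = 1 - max(p),                # Error(t)   = 1 - max_i P(i | t)
    gini    = 1 - sum(p^2),              # GINI(t)    = 1 - sum_j p(j | t)^2
    entropy = -sum(p_pos * log2(p_pos))) # Entropy(t) = -sum_j p(j | t) log2 p(j | t)
}

# The single-node Gini examples from above:
node_impurity(c(0, 6))["gini"]   # 0
node_impurity(c(1, 5))["gini"]   # 0.2777...
node_impurity(c(2, 4))["gini"]   # 0.4444...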
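A second minimal R sketch (again not from the lecture; gini_split is a made-up helper name) computes the weighted-average Gini index of a split and reproduces the misclassification vs. Gini comparison above. The rpart() call at the end is shown on R's built-in iris data purely as an illustration.

# GINI_split = sum_i (n_i / n) * GINI(i); children is a list of class-count
# vectors, one per child node.
gini_split <- function(children) {
  n <- sum(unlist(children))
  w <- sapply(children, sum) / n                                   # n_i / n
  g <- sapply(children, function(cnt) 1 - sum((cnt / sum(cnt))^2)) # GINI(i)
  sum(w * g)                                                       # weighted average
}

# Parent (C1 = 7, C2 = 3) split into N1 = (3, 0) and N2 = (4, 3):
1 - (7/10)^2 - (3/10)^2              # parent Gini = 0.42
gini_split(list(c(3, 0), c(4, 3)))   # children Gini ~ 0.343
# Misclassification error is 3/10 = 0.30 both before and after the split.

# rpart() uses the Gini index by default for classification trees;
# entropy can be requested instead via parms = list(split = "information").
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "gini"))

Weighting each child's Gini by n_i / n is what makes a pure child count for more when it holds a larger share of the training cases.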