Information Gain (IG)
● Now we want to measure how informative an attribute is with respect to the target class
● How much gain in information it gives us about the value of the target class
● The information we gain by splitting the set on all values of a single attribute
● Parent set: the original set of examples
● Children sets: the result of splitting on the attribute values
● The entropy of each child (c_i) is weighted by the proportion of instances belonging to that child, p(c_i)
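Putting the definition together: IG(parent, children) = entropy(parent) − Σ_i p(c_i) · entropy(c_i). A minimal Python sketch of this; the function names (entropy, information_gain) and the example counts are ours, purely for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the weighted average entropy of the children.
    `children` is one list of labels per attribute value."""
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted

# Illustrative counts only: 7 "good" / 5 "bad", split into two children
parent = ["good"] * 7 + ["bad"] * 5
children = [["good"] * 6 + ["bad"] * 1, ["good"] * 1 + ["bad"] * 4]
print(round(information_gain(parent, children), 3))  # ~0.33 bits with these counts
```
The larger the result, the more the split reduces the impurity of the parent set.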
Example:

● Two-class problem (circle = good credit, star = bad credit)
● Looking at the figure on the previous slide, the children sets appear purer than the parent set
● This split (by Balance) reduces entropy substantially
● The Balance attribute provides a lot of information on the value of the target class:
○ good credit vs. bad credit

● The results of splitting by the Residence variable:
● The Residence variable has a positive information gain
● However, it is lower than that of Balance (which was 0.37)
● Therefore, the Balance variable is more informative than Residence
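Choosing an attribute is just "compute IG for each candidate split and keep the largest". A minimal sketch, assuming SciPy is available; the class counts below are made up for illustration and are not the data behind the 0.37 figure:

```python
from scipy.stats import entropy  # Shannon entropy; raw counts are normalized for us

def info_gain(parent_counts, child_counts):
    """IG from splitting a set with per-class `parent_counts` into children,
    each child given as its own per-class count list."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c, base=2) for c in child_counts)
    return entropy(parent_counts, base=2) - weighted

parent = [16, 14]                       # [good credit, bad credit], made-up counts
balance_split = [[12, 1], [4, 13]]      # fairly pure children
residence_split = [[9, 6], [7, 8]]      # still very mixed children
print(info_gain(parent, balance_split))    # larger -> Balance is more informative
print(info_gain(parent, residence_split))  # smaller, but still positive
```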
Why trees?
● Decision trees (DTs), or classification trees, are one of the most popular data mining tools (along with linear and logistic regression)
● They are:
○ Easy to understand
○ Easy to implement
○ Easy to use
○ Computationally cheap
● Almost all data mining packages include DTs
Trees as Sets of Rules
● The classification tree is equivalent to this rule set
● Each rule consists of the attribute tests along the path, connected with AND
● IF (Employed = Yes) THEN Class = No Write-off
● IF (Employed = No) AND (Balance < 50k) THEN Class = No Write-off
● IF (Employed = No) AND (Balance ≥ 50k) AND (Age < 45) THEN Class = No Write-off
● IF (Employed = No) AND (Balance ≥ 50k) AND (Age ≥ 45) THEN Class = Write-off
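The same rule set reads naturally as code. A minimal Python sketch (the function name and parameters are ours; the logic follows the four rules above):

```python
def predict_write_off(employed: bool, balance: float, age: float) -> str:
    """Each IF ... THEN rule above is one root-to-leaf path of the tree."""
    if employed:
        return "No Write-off"
    if balance < 50_000:
        return "No Write-off"
    if age < 45:
        return "No Write-off"
    return "Write-off"

# Only the unemployed / high-balance / older path leads to a write-off:
print(predict_write_off(employed=False, balance=60_000, age=50))  # Write-off
print(predict_write_off(employed=True, balance=60_000, age=50))   # No Write-off
```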


● 2+ = 2 "yes" examples
● 3- = 3 "no" examples
● When choosing a split, follow the path with the purest children, i.e. the largest information gain
● Entropy = 0 means the set is pure (all examples belong to one class)
● The closer the children's entropy is to zero, the larger the information gain
● Entropy close to 1 means the set is maximally mixed (roughly half "yes" and half "no"); the closer it is to 0, the purer the set (see the quick check below)
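For the 2-"yes" / 3-"no" set above, the entropy comes out close to 1, i.e. a very mixed set. A quick check in plain Python:

```python
import math

p_yes, p_no = 2 / 5, 3 / 5
h = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(h, 3))  # 0.971 bits -- near 1, so the set is highly impure
```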
Visualizing Segmentations – Probabilities
[Figure: visualizing segmentations with class probabilities; circle = Write-off, plus = No Write-off]

Geometric Interpretation of a Model
● Split over income
● Split over age
● Pattern:
○ IF Balance >= 50K & Age > 45
○ THEN Default = 'no'
○ ELSE Default = 'yes'
● A single linear boundary cannot capture this rectangular pattern; the tree's axis-parallel splits (first on Balance, then on Age) can

● What alternatives are there to partitioning this way?
● The "true" boundary may not be closely approximated by a linear boundary! (see the sketch below)
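To make the geometric point concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the data is synthetic and generated to follow the rectangular pattern above, not the course's credit dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: Default = 'no' only when Balance >= 50K AND Age > 45
rng = np.random.default_rng(0)
balance = rng.uniform(0, 100_000, size=500)
age = rng.uniform(20, 70, size=500)
X = np.column_stack([balance, age])
y = np.where((balance >= 50_000) & (age > 45), "no", "yes")

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
linear = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

print(export_text(tree, feature_names=["Balance", "Age"]))  # two axis-parallel splits
print("tree accuracy:  ", tree.score(X, y))    # ~1.0: the rectangle is exactly representable
print("linear accuracy:", linear.score(X, y))  # lower: one straight line cannot carve out the corner
```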
Different Faces of Classification
Classification Problem
● Most general case: the target takes on discrete values that are NOT ordered
● Most common: binary classification, where the target is either 0 or 1
Solutions to Classification
● Classifier model: the model predicts the same set of discrete values as the data had
● Probability estimation: the model predicts a score between 0 and 1 that is meant to be the probability of being in that class (see the sketch below)
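A minimal scikit-learn sketch of the two kinds of output; the tiny dataset and feature values are made up for illustration, and max_depth=1 keeps one leaf impure so the probabilities are not just 0 or 1:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up examples: [balance, age] -> 1 = write-off, 0 = no write-off
X = [[10_000, 30], [20_000, 55], [60_000, 40], [80_000, 33], [90_000, 60], [85_000, 50]]
y = [0, 0, 0, 1, 1, 0]

clf = DecisionTreeClassifier(max_depth=1).fit(X, y)

new_case = [[75_000, 45]]
print(clf.predict(new_case))        # classifier view: a discrete label, here [1]
print(clf.predict_proba(new_case))  # probability view: class proportions in the matching leaf, here ~[[0.33, 0.67]]
```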
MegaTelCo

Predicting Churn with Tree Induction


Cornell Notes
Topic: Course Class 5
Date: Oct 2, 2019
Essential Question:
● [check chapter]
● Read B1 - C34
● DUE: ASSIGNMENT 1 ON OCTOBER 25
Recap: Data Mining vs. Use of the Model

Summary

