11s1: COMP9417 Machine Learning and Data Mining

Decision Tree Learning
March 8, 2011

Acknowledgement: Material derived from slides by:
Tom M. Mitchell, http://www2.cs.cmu.edu/~tom/mlbook.html
Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials
and Eibe Frank, http://www.cs.waikato.ac.nz/ml/weka/
COMP9417: March 8, 2011 Decision Tree Learning: Slide 1

Aims
This lecture will enable you to describe decision tree learning, the use of
entropy and the problem of overfitting. Following it you should be able
to:
• define the decision tree representation
• list representation properties of data and models for which decision
trees are appropriate
• reproduce the basic top-down algorithm for decision tree induction
(TDIDT)
• define entropy in the context of learning a Boolean classifier from
examples
• describe the inductive bias of the basic TDIDT algorithm
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rule
generation, numerical attributes, many-valued attributes, costs, missing
values
[Recommended reading: Mitchell, Chapter 3]
[Recommended exercises: 3.1, 3.2, 3.4(a,b)]

Introduction
• Decision trees are the single most popular data mining tool
– Easy to understand
– Easy to implement
– Easy to use
– Computationally cheap
• There are some drawbacks, though! (such as overfitting)
• They do classification: predict a categorical output from categorical
and/or real inputs

Decision Tree for PlayTennis
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

A Tree to Predict C-Section Risk
Learned from medical records of 1000 women
Negative examples are C-sections
[833+,167−] .83+ .17−
Fetal_Presentation = 1: [822+,116−] .88+ .12−
| Previous_Csection = 0: [767+,81−] .90+ .10−
| | Primiparous = 0: [399+,13−] .97+ .03−
| | Primiparous = 1: [368+,68−] .84+ .16−
| | | Fetal_Distress = 0: [334+,47−] .88+ .12−
| | | | Birth_Weight < 3349: [201+,10.6−] .95+ .05−
| | | | Birth_Weight >= 3349: [133+,36.4−] .78+ .22−
| | | Fetal_Distress = 1: [34+,21−] .62+ .38−
| Previous_Csection = 1: [55+,35−] .61+ .39−
Fetal_Presentation = 2: [3+,29−] .11+ .89−
Fetal_Presentation = 3: [8+,22−] .27+ .73−

Decision Trees
Decision tree representation:
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
How would we represent:
• ∧, ∨, XOR
• (A ∧ B) ∨ (C ∧ ¬D ∧ E)
• M of N

Decision Trees
X ∧ Y
X = t:
| Y = t: true
| Y = f: no
X = f: no

X ∨ Y
X = t: true
X = f:
| Y = t: true
| Y = f: no
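
The tree diagrams above can be written down directly as nested data. The sketch below is one possible encoding, not something the slides prescribe: a leaf is a Boolean, an internal node is a pair of attribute name and a dictionary of branches, and the names `classify`, `AND_TREE`, `OR_TREE` and `XOR_TREE` are hypothetical. It also shows why XOR needs a larger tree: Y has to be tested on both branches of X.

```python
# A tree is either a leaf (a bool) or a pair (attribute, branches), where
# branches maps each attribute value to a subtree. This encoding is an
# assumption made for illustration only.

def classify(tree, instance):
    """Follow the branch matching the instance until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

AND_TREE = ("X", {True: ("Y", {True: True, False: False}), False: False})
OR_TREE = ("X", {True: True, False: ("Y", {True: True, False: False})})
# XOR: no leaf can sit directly below the root, so Y is tested twice.
XOR_TREE = ("X", {True: ("Y", {True: False, False: True}),
                  False: ("Y", {True: True, False: False})})

for x in (True, False):
    for y in (True, False):
        inst = {"X": x, "Y": y}
        print(x, y, classify(AND_TREE, inst), classify(OR_TREE, inst),
              classify(XOR_TREE, inst))
```

Each root-to-leaf path that ends in `True` is one conjunction, and the tree as a whole is the disjunction of those paths.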

Decision Trees
2 of 3
X = t:
| Y = t: true
| Y = f:
| | Z = t: true
| | Z = f: false
X = f:
| Y = t:
| | Z = t: true
| | Z = f: false
| Y = f: false
So in general decision trees represent a disjunction of conjunctions of
constraints on the attribute values of instances.

When to Consider Decision Trees
• Instances describable by attribute–value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
Examples:
• Equipment or medical diagnosis
• Credit risk analysis
• Modeling calendar scheduling preferences

Top-Down Induction of Decision Trees (TDIDT)
Main loop:
1. A ← the “best” decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, Then STOP, Else iterate over
new leaf nodes

Which attribute is best?
S: [29+,35−]
A1=?
  t: [21+,5−]
  f: [8+,30−]
S: [29+,35−]
A2=?
  t: [18+,33−]
  f: [11+,2−]

Bits
You are watching a set of independent random samples of X
You observe that X has four possible values
P(X = A) = 1/4   P(X = B) = 1/4   P(X = C) = 1/4   P(X = D) = 1/4
So you might see: BAACBADCDADDDA...
You transmit data over a binary serial link. You can encode each reading
with two bits (e.g. A = 00, B = 01, C = 10, D = 11)
0100001001001110110011111100...

Fewer Bits
Someone tells you that the probabilities are not equal
P(X = A) = 1/2   P(X = B) = 1/4   P(X = C) = 1/8   P(X = D) = 1/8
It’s possible . . .
. . . to invent a coding for your transmission that only uses 1.75 bits on
average per symbol. How ?
A 0
B 10
C 110
D 111
(This is just one of several ways)
Average length: (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/8)(3) = 1.75 bits per symbol.

Fewer Bits
Suppose there are three equally likely values
P(X = A) = 1/3   P(X = B) = 1/3   P(X = C) = 1/3
Here’s a naïve coding, costing 2 bits per symbol
A 00
B 01
C 10
Can you think of a coding that would need only about 1.67 bits per symbol on
average ?
Using the same approach as before, we can get a coding costing about 1.67 bits
per symbol on average . . .
A 0
B 10
C 11
This gives us, on average, 1/3 × 1 bit for A and 2 × 1/3 × 2 bits for B and C,
which equals 5/3 ≈ 1.67 bits.
Is this the best we can do ?
From information theory, the optimal number of bits to encode a symbol
with probability p is −log2 p . . .
So the best we can do for this case is −log2(1/3) bits for each of A, B and
C, or 1.5849625007211563 bits per symbol

General Case
Suppose X can have one of m values . . . V1, V2, . . . Vm
P(X = V1) = p1   P(X = V2) = p2   ...   P(X = Vm) = pm
What’s the smallest possible number of bits, on average, per symbol,
needed to transmit a stream of symbols drawn from X’s distribution ?
It’s
$$H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \ldots - p_m \log_2 p_m = -\sum_{j=1}^{m} p_j \log_2 p_j$$
H(X) = the entropy of X

General Case
“High entropy” means the distribution of X is close to uniform (“boring”)
“Low entropy” means the distribution of X is far from uniform (“varied and interesting”)

Entropy
Entropy measures the “impurity” of S
$$\mathrm{Entropy}(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$
A “pure” sample is one in which all examples are of the same class.
[Figure: Entropy(S) plotted against p⊕; both axes run from 0.0 to 1.0]
Where:
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
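
The entropy formulas above are easy to check numerically. The following is a minimal sketch; the function names `entropy` and `sample_entropy` are hypothetical, and terms with zero probability are skipped on the usual convention that 0 · log 0 counts as 0.

```python
import math

def entropy(probabilities):
    """H = -sum(p * log2(p)) over the outcomes; zero-probability outcomes are skipped."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def sample_entropy(positives, negatives):
    """Entropy(S) of a Boolean-labelled sample, given its class counts."""
    total = positives + negatives
    return entropy([positives / total, negatives / total])

print(entropy([1/2, 1/4, 1/8, 1/8]))   # the four-symbol source with unequal probabilities
print(entropy([1/3, 1/3, 1/3]))        # three equally likely symbols
print(sample_entropy(29, 35))          # the [29+,35-] sample
print(sample_entropy(7, 0))            # a pure sample
```

The first two results match the 1.75 bits and the −log2(1/3) bits per symbol quoted earlier, and a pure sample has entropy 0.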

Entropy
Entropy(S) = expected number of bits needed to encode class (⊕ or
⊖) of randomly drawn member of S (under the optimal, shortest-length
code)
Why ?
Information theory: optimal length code assigns −log2 p bits to message
having probability p.
So, expected number of bits to encode ⊕ or ⊖ of random member of S:
$$p_\oplus(-\log_2 p_\oplus) + p_\ominus(-\log_2 p_\ominus)$$
$$\mathrm{Entropy}(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A
$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
S: [29+,35−]
A1=?
  t: [21+,5−]
  f: [8+,30−]
S: [29+,35−]
A2=?
  t: [18+,33−]
  f: [11+,2−]

Information Gain
$$\mathrm{Gain}(S, A1) = \mathrm{Entropy}(S) - \left(\frac{|S_t|}{|S|}\,\mathrm{Entropy}(S_t) + \frac{|S_f|}{|S|}\,\mathrm{Entropy}(S_f)\right)$$
$$= 0.9936 - \left(\frac{26}{64}\left(-\frac{21}{26}\log_2\frac{21}{26} - \frac{5}{26}\log_2\frac{5}{26}\right) + \frac{38}{64}\left(-\frac{8}{38}\log_2\frac{8}{38} - \frac{30}{38}\log_2\frac{30}{38}\right)\right)$$
= 0.9936 − (0.2869 + 0.4408)
= 0.2658

Information Gain
Gain(S, A2) = 0.9936 − (0.7464 + 0.1258)
= 0.1214

Information Gain
So we choose A1, since it gives a larger expected reduction in entropy.

Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Which attribute is the best classifier?
S: [9+,5−], E = 0.940
Humidity
  High: [3+,4−], E = 0.985
  Normal: [6+,1−], E = 0.592
Gain(S, Humidity) = .940 − (7/14).985 − (7/14).592 = .151
S: [9+,5−], E = 0.940
Wind
  Weak: [6+,2−], E = 0.811
  Strong: [3+,3−], E = 1.00
Gain(S, Wind) = .940 − (8/14).811 − (6/14)1.0 = .048

{D1, D2, ..., D14} [9+,5−]
Outlook
  Sunny: {D1,D2,D8,D9,D11} [2+,3−] ?
  Overcast: {D3,D7,D12,D13} [4+,0−] Yes
  Rain: {D4,D5,D6,D10,D14} [3+,2−] ?
Which attribute should be tested here?
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019

Brief History of Decision Tree Learning Algorithms
• late 1950s – Bruner et al. in psychology work on modelling concept
acquisition
• early 1960s – Hunt et al. in computer science work on Concept Learning
Systems (CLS)
• late 1970s – Quinlan’s Iterative Dichotomizer 3 (ID3) based on CLS is
efficient at learning on then-large data sets
• early 1990s – ID3 adds features, develops into C4.5, becomes the
“default” machine learning algorithm
• late 1990s – C5.0, commercial version of C4.5 (available from SPSS
and www.rulequest.com)
• current – widely available and applied; influential techniques

Hypothesis Space Search by ID3
[Figure: hypothesis space search by ID3, showing partial trees that test attributes A1, A2, A3, A4 over examples labelled + and –]

Hypothesis Space Search by ID3
• Hypothesis space is complete! (contains all finite discrete-valued
functions w.r.t. attributes)
– Target function surely in there...
• Outputs a single hypothesis (which one?)
– Can’t play 20 questions...
• No back tracking
– Local minima...
• Statistically-based search choices
– Robust to noisy data...
• Inductive bias: approx “prefer shortest tree”

Inductive Bias in ID3
Note H is the power set of instances X
→ Unbiased?
Not really...
• Preference for short trees, and for those with high information gain
attributes near the root
• Bias is a preference for some hypotheses, rather than a restriction of
hypothesis space H
• an incomplete search of a complete hypothesis space versus a complete
search of an incomplete hypothesis space (as in learning conjunctive
concepts)
• Occam’s razor: prefer the shortest hypothesis that fits the data
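
The greedy, no-backtracking search described above can be sketched in a few lines. This is an illustrative reading of the TDIDT main loop run on the PlayTennis training examples, not Quinlan's ID3 code: the names `id3`, `gain` and `entropy` and the tuple-based tree encoding are hypothetical, and returning the majority label when no attributes remain is an added assumption to keep the recursion well defined.

```python
import math
from collections import Counter

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")
TARGET = "PlayTennis"
# The 14 PlayTennis training examples D1..D14, one per line.
RAW = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
EXAMPLES = [dict(zip(ATTRIBUTES + (TARGET,), line.split()))
            for line in RAW.splitlines()]

def entropy(examples):
    """Entropy of the target label over a list of examples."""
    counts = Counter(e[TARGET] for e in examples)
    n = len(examples)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attribute):
    """Expected reduction in entropy from splitting on attribute."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Return a leaf label, or (attribute, {value: subtree}) for an internal node."""
    labels = Counter(e[TARGET] for e in examples)
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]   # pure node, or nothing left to test
    # Greedy choice: the attribute is fixed here and never revisited.
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining)
    return (best, branches)

print(id3(EXAMPLES, ATTRIBUTES))
```

On this data the sketch picks Outlook at the root, then Humidity under Sunny and Wind under Rain, which is the PlayTennis tree shown earlier.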

Occam’s Razor
William of Ockham (c. 1287–1347)
Entities should not be multiplied beyond necessity
Why prefer short hypotheses?
Argument in favour:
• Fewer short hypotheses than long hypotheses
→ a short hyp that fits data unlikely to be coincidence
→ a long hyp that fits data might be coincidence

Occam’s Razor
Argument opposed:
• There are many ways to define small sets of hypotheses
– e.g., all trees with a prime number of nodes that use attributes
beginning with “Z”
• What’s so special about small sets based on size of hypothesis??
We will come back to this topic again.

Overfitting in Decision Tree Learning
Consider adding noisy training example #15:
Sunny, Hot, Normal, Strong, PlayTennis = No
What effect on earlier tree?
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes

{D1, D2, ..., D14} [9+,5−]
Outlook
  Sunny: {D1,D2,D8,D9,D11} [2+,3−] ?
  Overcast: {D3,D7,D12,D13} [4+,0−] Yes
  Rain: {D4,D5,D6,D10,D14} [3+,2−] ?
Which attribute should be tested here?
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019
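
The three gains for Ssunny can be recomputed from class counts alone. The sketch below is illustrative only: `entropy` and `gain` are hypothetical helper names, and each branch is given as a (positives, negatives) pair read off the training examples D1, D2, D8, D9 and D11.

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts; empty classes contribute 0."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """parent: (positives, negatives); children: one such pair per attribute value."""
    total = sum(parent)
    return entropy(parent) - sum(sum(child) / total * entropy(child)
                                 for child in children)

S_SUNNY = (2, 3)
print(gain(S_SUNNY, [(0, 3), (2, 0)]))           # Humidity: High, Normal
print(gain(S_SUNNY, [(0, 2), (1, 1), (1, 0)]))   # Temperature: Hot, Mild, Cool
print(gain(S_SUNNY, [(1, 2), (1, 1)]))           # Wind: Weak, Strong
```

The printed values agree with .970, .570 and .019 to within about 0.001; the quoted figures appear to be truncated rather than rounded.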

Overfitting in General
Consider error of hypothesis h over
• training data: $error_{train}(h)$
• entire distribution D of data: $error_{D}(h)$
Definition
Hypothesis h ∈ H overfits training data if there is an alternative
hypothesis h′ ∈ H such that
$$error_{train}(h) < error_{train}(h')$$
and
$$error_{D}(h) > error_{D}(h')$$

Overfitting in Decision Tree Learning
[Figure: accuracy (0.5 to 0.9) against size of tree (0 to 100 nodes), plotted on training data and on test data]

Avoiding Overfitting
How can we avoid overfitting? Pruning
• pre-pruning: stop growing when data split not statistically significant
• post-pruning: grow full tree, then remove subtrees which are overfitting
Post-pruning avoids problem of “early stopping”
How to select “best” tree:
• Measure performance over training data ?
• Measure performance over separate validation data set ?
• MDL: minimize size(tree) + size(misclassifications(tree)) ?

Avoiding Overfitting
Pre-pruning
• Usually based on statistical significance test
• Stops growing the tree when there is no statistically significant
association between any attribute and the class at a particular node
• Most popular test: chi-squared test
• ID3: chi-squared test plus information gain
– Only statistically significant attributes were allowed to be selected
by information gain procedure
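
A pre-pruning test of this kind can be sketched with the Pearson chi-squared statistic. This is an illustration under stated assumptions, not the test as implemented in ID3 or C4.5: the function name `chi_squared` is hypothetical, both classes are assumed to occur at the node (otherwise an expected count is zero), and the constant is the standard 5% critical value for one degree of freedom, which applies to a two-way split of a two-class problem.

```python
def chi_squared(branches):
    """Pearson chi-squared statistic for a split.

    branches: one (positives, negatives) pair per attribute value.
    """
    total_pos = sum(p for p, _ in branches)
    total_neg = sum(n for _, n in branches)
    total = total_pos + total_neg
    statistic = 0.0
    for pos, neg in branches:
        size = pos + neg
        for observed, class_total in ((pos, total_pos), (neg, total_neg)):
            # Count expected if attribute and class were independent.
            expected = size * class_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Critical value of chi-squared with 1 degree of freedom at the 5% level.
CRITICAL_1DF_5PCT = 3.841

a1 = chi_squared([(21, 5), (8, 30)])     # the A1 split of [29+,35-]
a2 = chi_squared([(18, 33), (11, 2)])    # the A2 split of [29+,35-]
flat = chi_squared([(1, 1), (1, 1)])     # a split that separates nothing
print(a1, a1 > CRITICAL_1DF_5PCT)
print(a2, a2 > CRITICAL_1DF_5PCT)
print(flat, flat > CRITICAL_1DF_5PCT)
```

Both A1 and A2 pass the test here, while the last split scores 0 and would stop tree growth at that node.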

Avoiding Overfitting
Early stopping
• Pre-pruning may suffer from early stopping: may stop the growth of
tree prematurely
• Classic example: XOR/Parity problem
– No individual attribute exhibits a significant association with the
class
– Target structure only visible in fully expanded tree
– Pre-pruning won’t expand the root node
• But: XOR-type problems not common in practice
• And: pre-pruning faster than post-pruning

Avoiding Overfitting
Post-pruning
• Builds full tree first and prunes it afterwards
– Attribute interactions are visible in fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance
effects
• Two main pruning operations:
– Subtree replacement
– Subtree raising
• Possible strategies: error estimation, significance testing, MDL principle
• We examine two methods: Reduced-error Pruning and Error-based
Pruning

Reduced-Error Pruning
Split data into training and validation set
Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus
those below it)
2. Greedily remove the one that most improves validation set accuracy
• Good: produces smallest version of most accurate subtree
• Bad: reduces effective size of training set

Effect of Reduced-Error Pruning
[Figure: accuracy (0.5 to 0.9) against size of tree (0 to 100 nodes), plotted on training data, on test data, and on test data during pruning]
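
The pruning loop can be sketched as follows. Everything about the encoding is an assumption made for illustration: a leaf is a label string, an internal node is a dictionary with hypothetical keys `attr`, `branches` and `majority` (the most common training label at that node, assumed to have been stored when the tree was grown), and pruning a node means replacing it by a leaf with that label. Ties are pruned, so the loop keeps the smallest tree among equally accurate ones.

```python
def classify(tree, example):
    """Walk down the tree; unseen attribute values fall back to the majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples):
    """Fraction of (attributes, label) pairs that the tree classifies correctly."""
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def internal_paths(tree, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["branches"].items():
            yield from internal_paths(child, path + (value,))

def pruned_at(tree, path):
    """Copy of tree with the node at path replaced by its majority-label leaf."""
    if not path:
        return tree["majority"]
    branches = dict(tree["branches"])
    branches[path[0]] = pruned_at(branches[path[0]], path[1:])
    return {**tree, "branches": branches}

def reduced_error_prune(tree, validation):
    """Greedily prune while validation accuracy does not get worse."""
    while isinstance(tree, dict):
        candidates = [pruned_at(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < accuracy(tree, validation):
            break   # every remaining pruning step is harmful
        tree = best
    return tree

# A small tree whose Wind test under Humidity = Normal does not hold up on validation data.
TREE = {"attr": "Humidity", "majority": "Yes", "branches": {
    "High": "No",
    "Normal": {"attr": "Wind", "majority": "Yes",
               "branches": {"Weak": "Yes", "Strong": "No"}},
}}
VALIDATION = [
    ({"Humidity": "High", "Wind": "Weak"}, "No"),
    ({"Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Humidity": "Normal", "Wind": "Weak"}, "Yes"),
    ({"Humidity": "High", "Wind": "Strong"}, "No"),
]
print(reduced_error_prune(TREE, VALIDATION))
```

In this toy case the Wind subtree is replaced by a Yes leaf, which raises validation accuracy, and the root is then kept because collapsing it would lower accuracy.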

Subtree replacement
Bottom-up: tree is considered for replacement once all its subtrees have
been considered

Error-based pruning (C4.5)
Quinlan (1993) describes the successor to ID3 – C4.5
• many extensions – see below
• post-pruning using training set
• includes subtree replacement and subtree raising
• also: pruning by converting tree to rules
• commercial version – C5.0 – is widely used
– go to RuleQuest.com

Subtree raising
Deletes node and redistributes instances – more complicated, slow

Error-based pruning
Goal is to improve estimate of error on unseen data using all and only
data from training set
• Pruning operation is performed if this does not increase the estimated
error
• C4.5’s method: using upper limit of 25% confidence interval derived
from the training data
– Standard Bernoulli-process-based method
– Note: statistically motivated but not statistically valid
– But: works well in practice !

Error-based pruning
• Error estimate for subtree is weighted sum of error estimates for all its
leaves
• Error estimate for a node:
$$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
• If c = 25% then z = 0.69 (from normal distribution)
• f is the error on the training data
• N is the number of instances covered by the leaf

Error-based pruning
• health plan contribution: node measures
f = 0.36, e = 0.46
• subtree measures:
– none: f = 0.33, e = 0.47
– half: f = 0.5, e = 0.72
– full: f = 0.33, e = 0.47
• subtrees combined 6 : 2 : 6 gives 0.51
• subtrees estimated to give greater error
so prune away
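
The error estimate and the worked example can be checked with a short calculation. The function name `pessimistic_error` is hypothetical, and the raw counts are assumptions chosen to match the quoted rates and the 6 : 2 : 6 weighting: 2 errors in 6, 1 in 2 and 2 in 6 for the leaves, and 5 in 14 for the node.

```python
import math

def pessimistic_error(f, n, z=0.69):
    """Upper confidence limit on the error rate of a node.

    f: observed error rate on the training data, n: instances covered,
    z: normal deviate for the chosen confidence (0.69 for c = 25%).
    """
    numerator = (f + z * z / (2 * n)
                 + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

# Assumed (training errors, instances) for the three leaves: none, half, full.
leaves = [(2, 6), (1, 2), (2, 6)]
estimates = [pessimistic_error(errors / n, n) for errors, n in leaves]
total = sum(n for _, n in leaves)
# Subtree estimate: leaf estimates weighted by the share of instances each covers.
subtree = sum(n / total * e for (_, n), e in zip(leaves, estimates))
node = pessimistic_error(5 / 14, 14)   # the same node collapsed to a single leaf
print([round(e, 2) for e in estimates], round(subtree, 2), subtree > node)
```

The leaf estimates and the combined 0.51 reproduce the quoted figures. With these assumed counts the node estimate evaluates to about 0.45 rather than the quoted 0.46, which leaves the comparison unchanged: the subtree is estimated to do worse than the single node.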

Rule Post-Pruning
This method was introduced in Quinlan’s C4.5
1. Convert tree to equivalent set of rules
2. Prune each rule independently of others
3. Sort final rules into desired sequence for use
For: simpler classifiers, people prefer rules to trees
Against: does not scale well, slow for large trees & datasets

Converting A Tree to Rules
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes
...
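
Step 1 of rule post-pruning, turning each root-to-leaf path into a rule, can be sketched as below. The tuple-based tree encoding and the names `tree_to_rules` and `format_rule` are hypothetical; the rule pruning and sorting of steps 2 and 3 are not shown.

```python
PLAY_TENNIS_TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if not isinstance(tree, tuple):
        return [(conditions, tree)]          # reached a leaf: emit the rule
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

def format_rule(conditions, label, target="PlayTennis"):
    """Render a rule in the IF ... THEN ... form used on the slide."""
    tests = " ∧ ".join(f"({a} = {v})" for a, v in conditions)
    return f"IF {tests} THEN {target} = {label}"

for rule in tree_to_rules(PLAY_TENNIS_TREE):
    print(format_rule(*rule))
```

The tree has five leaves, so five rules come out, one per path.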

Rules from Trees
Rules can be simpler than trees but just as accurate, e.g., in C4.5Rules:
• path from root to leaf in (unpruned) tree forms a rule
– i.e., tree forms a set of rules
• can simplify rules independently by deleting conditions
– i.e., rules can be generalized while maintaining accuracy
• greedy rule simplification algorithm
– drop the condition giving lowest estimated error (as for pruning)
– continue while estimated error does not increase

Rules from Trees (Rule Post-Pruning)
Select a “good” subset of rules within a class (C4.5Rules):
• goal: remove rules not useful in terms of accuracy
• find a subset of rules which minimises an MDL criterion
• trade off accuracy and complexity of rule set
• stochastic search using simulated annealing
Sets of rules can be ordered by class (C4.5Rules):
• order classes by increasing chance of making false positive errors
• set as a default the class with the most training instances not covered
by any rule

Continuous Valued Attributes
Decision trees originated for discrete attributes only. Now: continuous
attributes.
Can create a discrete attribute to test continuous value:
• Temperature = 82.5
• (Temperature > 72.3) ∈ {t, f}
• Usual method: continuous attributes have a binary split
• Note:
– discrete attributes – one split exhausts all values
– continuous attributes – can have many splits in a tree
Temperature:  40  48  60   72   80   90
PlayTennis:   No  No  Yes  Yes  Yes  No

Continuous Valued Attributes
• Splits evaluated on all possible split points
• More computation: n − 1 possible splits for n values of an attribute in
training set
• Fayyad (1991)
– sort examples on continuous attribute
– find midway boundaries where class changes, e.g. for Temperature
(48+60)/2 and (80+90)/2
• Choose best split point by info gain (or evaluation of choice)
• Note: C4.5 uses actual values in data
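
The midway-boundary idea can be sketched in a few lines. This is a simplified illustration that assumes the attribute values are distinct; the name `candidate_thresholds` is hypothetical. Each returned threshold would then be scored by information gain like any other Boolean attribute.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b and a != b]

TEMPERATURE = [40, 48, 60, 72, 80, 90]
PLAY_TENNIS = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(TEMPERATURE, PLAY_TENNIS))   # (48+60)/2 and (80+90)/2
```

Only two of the five possible split points survive, because the class changes only between 48 and 60 and between 80 and 90.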

Attributes with Many Values
Problem:
• If attribute has many values, Gain will select it
• Why ? more likely to split instances into “pure” subsets
• Imagine using Date = Jun 3 1996 as attribute
• High gain on training set, useless for prediction

Attributes with Many Values
One approach: use GainRatio instead
$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$
$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
where S_i is subset of S for which A has value v_i

Attributes with Many Values
Why does this help ?
• sensitive to how broadly and uniformly attribute splits instances
• actually the entropy of S w.r.t. values of A
• therefore higher for many-valued attributes, especially if mostly
uniformly distributed across possible values

Attributes with Costs
Consider
• medical diagnosis, BloodTest has cost $150
• robotics, Width_from_1ft has cost 23 sec.
How to learn a consistent tree with low expected cost?

Attributes with Costs
One approach: replace gain by
• Tan and Schlimmer (1990)
$$\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}$$
• Nunez (1988)
$$\frac{2^{\mathrm{Gain}(S, A)} - 1}{(\mathrm{Cost}(A) + 1)^w}$$
where w ∈ [0, 1] determines importance of cost

Attributes with Costs
Key idea: evaluate gain relative to cost, so prefer decision trees using
lower-cost attributes.
More recently
• Domingos (1999) – MetaCost, a meta-learning wrapper approach
• uses ensemble learning method to estimate probabilities
• decision-theoretic approach
General problem: class costs, instance costs, . . .
See5 / C5.0 can use costs . . .

Unknown Attribute Values
What if some examples missing values of A?
Use training example anyway, sort through tree. Here are 3 possible
approaches
• If node n tests A, assign most common value of A among other
examples sorted to node n
• assign most common value of A among other examples with same
target value
• assign probability pi to each possible value vi of A
– assign fraction pi of example to each descendant in tree
Note: need to classify new (unseen) examples in same fashion

Windowing
Early implementations – training sets too large for memory
As a solution ID3 implemented windowing:
1. select subset of instances – the window
2. construct decision tree from all instances in the window
3. use tree to classify training instances not in window
4. if all instances correctly classified then halt, else
5. add selected misclassified instances to the window
6. go to step 2
Windowing retained in C4.5 because it can lead to more accurate trees.
Related to ensemble learning.

Summary
• Decision tree learning is a practical method for concept learning and
other classifier learning tasks
• TDIDT family descended from ID3 searches complete hypothesis space
– the hypothesis is there, somewhere...
• Uses a search or preference bias, search for optimal tree is not tractable
• Overfitting is inevitable with an expressive hypothesis space and noisy
data, so pruning is important
• Decades of research into extensions and refinements of the general
approach
• The “default” machine learning method, illustrates many general issues
• Can be updated with use of “ensemble” methods
This note was uploaded on 06/20/2011 for the course COMP 9417 taught by Professor Some during the Three '11 term at University of New South Wales.