2.1 Decision Tree Learning

Aims

11s1: COMP9417 Machine Learning and Data Mining
Decision Tree Learning, March 8, 2011
Acknowledgement: Material derived from slides by Tom M. Mitchell (http://www-2.cs.cmu.edu/~tom/mlbook.html), Andrew W. Moore (http://www.cs.cmu.edu/~awm/tutorials) and Eibe Frank (http://www.cs.waikato.ac.nz/ml/weka/)

This lecture will enable you to describe decision tree learning, the use of entropy and the problem of overfitting. Following it you should be able to:

• define the decision tree representation
• list representation properties of data and models for which decision trees are appropriate
• reproduce the basic top-down algorithm for decision tree induction (TDIDT)
• define entropy in the context of learning a Boolean classifier from examples
• describe the inductive bias of the basic TDIDT algorithm
• define overfitting of a training set by a hypothesis
• describe developments of the basic TDIDT algorithm: pruning, rule generation, numerical attributes, many-valued attributes, costs, missing values

[Recommended reading: Mitchell, Chapter 3]
[Recommended exercises: 3.1, 3.2, 3.4(a,b)]

Introduction

• Decision trees are the single most popular data mining tool
  – easy to understand
  – easy to implement
  – easy to use
  – computationally cheap
• There are some drawbacks, though, such as overfitting
• They do classification: they predict a categorical output from categorical and/or real-valued inputs

Decision Tree for PlayTennis

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes

A Tree to Predict C-Section Risk

Learned from medical records of 1000 women; negative examples are C-sections.

[833+,167-] .83+ .17-
Fetal_Presentation = 1: [822+,116-] .88+ .12-
|  Previous_Csection = 0: [767+,81-] .90+ .10-
|  |  Primiparous = 0: [399+,13-] .97+ .03-
|  |  Primiparous = 1: [368+,68-] .84+ .16-
|  |  |  Fetal_Distress = 0: [334+,47-] .88+ .12-
|  |  |  |  Birth_Weight < 3349: [201+,10.6-] .95+ .05-
|  |  |  |  Birth_Weight >= 3349: [133+,36.4-] .78+ .22-
|  |  |  Fetal_Distress = 1: [34+,21-] .62+ .38-
|  Previous_Csection = 1: [55+,35-] .61+ .39-
Fetal_Presentation = 2: [3+,29-] .11+ .89-
Fetal_Presentation = 3: [8+,22-] .27+ .73-

Decision Trees

Decision tree representation:

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

How would we represent:

• ∧, ∨, XOR
• (A ∧ B) ∨ (C ∧ ¬D ∧ E)
• M of N

For example:

X ∧ Y
X = t:
|  Y = t: true
|  Y = f: no
X = f: no

X ∨ Y
X = t: true
X = f:
|  Y = t: true
|  Y = f: no

2 of 3 (at least two of X, Y, Z are true)
X = t:
|  Y = t: true
|  Y = f:
|  |  Z = t: true
|  |  Z = f: false
X = f:
|  Y = t:
|  |  Z = t: true
|  |  Z = f: false
|  Y = f: false

So, in general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
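As an illustration of this representation, here is a minimal Python sketch (not part of the original slides) of the PlayTennis tree above as nested dictionaries, with a classify function that follows attribute tests from the root to a leaf. Each root-to-leaf path ending in "Yes" corresponds to one conjunction in the equivalent disjunction of conjunctions.

    # A minimal sketch (not from the lecture) of the PlayTennis tree as nested dicts.
    # Internal nodes test an attribute, branches are attribute values, leaves are labels.
    PLAY_TENNIS_TREE = {
        "Outlook": {
            "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
            "Overcast": "Yes",
            "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
        }
    }

    def classify(tree, example):
        """Follow attribute tests from the root until a leaf (a class label) is reached."""
        while isinstance(tree, dict):
            attribute, branches = next(iter(tree.items()))
            tree = branches[example[attribute]]
        return tree

    print(classify(PLAY_TENNIS_TREE, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes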
When to Consider Decision Trees

• Instances describable by attribute–value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data

Examples:

• Equipment or medical diagnosis
• Credit risk analysis
• Modelling calendar scheduling preferences

Top-Down Induction of Decision Trees (TDIDT)

Main loop:

1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, then STOP, else iterate over new leaf nodes

Which attribute is best?

Suppose a sample S of 64 examples, [29+,35-], can be split either by attribute A1 or by attribute A2:

A1 = t: [21+,5-]    A1 = f: [8+,30-]
A2 = t: [18+,33-]   A2 = f: [11+,2-]

Bits

You are watching a set of independent random samples of X. You observe that X has four possible values:

P(X = A) = 1/4   P(X = B) = 1/4   P(X = C) = 1/4   P(X = D) = 1/4

So you might see: BAACBADCDADDDA...

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):

0100001001001110110011111100...

Fewer Bits

Someone tells you that the probabilities are not equal:

P(X = A) = 1/2   P(X = B) = 1/4   P(X = C) = 1/8   P(X = D) = 1/8

It's possible to invent a coding for your transmission that only uses 1.75 bits per symbol on average. How? For example:

A = 0   B = 10   C = 110   D = 111

(This is just one of several ways.)

Fewer Bits

Suppose there are three equally likely values:

P(X = A) = 1/3   P(X = B) = 1/3   P(X = C) = 1/3

Here's a naïve coding, costing 2 bits per symbol:

A = 00   B = 01   C = 10

Can you think of a coding that would need only 1.6 bits per symbol on average?

Fewer Bits

Using the same approach as before, we can get a coding costing 1.6 bits per symbol on average:

A = 0   B = 10   C = 11

This gives us, on average, 1/3 × 1 bit for A and 2 × 1/3 × 2 bits for B and C, which equals 5/3 ≈ 1.6 bits. Is this the best we can do?

From information theory, the optimal number of bits to encode a symbol with probability p is −log2 p. So the best we can do in this case is −log2(1/3) ≈ 1.585 bits for each of A, B and C, i.e. about 1.585 bits per symbol on average.

General Case

Suppose X can have one of m values V1, V2, ..., Vm with

P(X = V1) = p1   P(X = V2) = p2   ...   P(X = Vm) = pm

What is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is

H(X) = −p1 log2 p1 − p2 log2 p2 − ... − pm log2 pm = −Σ_{j=1}^{m} pj log2 pj

H(X) is called the entropy of X. "High entropy" means the distribution of X is close to uniform (flat and "boring"); "low entropy" means the distribution is varied, with peaks and valleys, so the values of X are more predictable.
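The following small Python sketch (not from the lecture) computes H(X) and reproduces the bit counts discussed above.

    import math

    def entropy(probs):
        """H(X) = -sum_j p_j log2 p_j: the optimal average number of bits per symbol."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits  (four equally likely values)
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (matched by the A=0, B=10, C=110, D=111 code)
    print(entropy([1/3, 1/3, 1/3]))            # ~1.585 bits (slightly better than the 5/3-bit code)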
Entropy

Entropy measures the "impurity" of a sample S of training examples:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S. A "pure" sample is one in which all examples are of the same class.

[Figure: Entropy(S) plotted against p⊕; it is 0 when p⊕ = 0 or p⊕ = 1 and reaches its maximum of 1.0 at p⊕ = 0.5.]

Entropy(S) is the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).

Why? Information theory: an optimal-length code assigns −log2 p bits to a message having probability p. So the expected number of bits to encode ⊕ or ⊖ for a random member of S is

p⊕(−log2 p⊕) + p⊖(−log2 p⊖)

Information Gain

Gain(S, A) is the expected reduction in entropy due to sorting S on attribute A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)

For the [29+,35-] sample split by A1 and A2 as above:

Gain(S, A1) = Entropy(S) − (|St|/|S|) Entropy(St) − (|Sf|/|S|) Entropy(Sf)
            = 0.9936 − ((26/64)(−(21/26) log2(21/26) − (5/26) log2(5/26))
                      + (38/64)(−(8/38) log2(8/38) − (30/38) log2(30/38)))
            = 0.9936 − (0.2869 + 0.4408)
            = 0.2658

Gain(S, A2) = 0.9936 − (0.7464 + 0.1258)
            = 0.1213

So we choose A1, since it gives a larger expected reduction in entropy.

Training Examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Which attribute is the best classifier?

S: [9+,5-], Entropy = 0.940

Humidity = High:   [3+,4-]  E = 0.985
Humidity = Normal: [6+,1-]  E = 0.592
Gain(S, Humidity) = .940 − (7/14).985 − (7/14).592 = .151

Wind = Weak:   [6+,2-]  E = 0.811
Wind = Strong: [3+,3-]  E = 1.00
Gain(S, Wind) = .940 − (8/14).811 − (6/14)1.0 = .048

Outlook gives the largest gain, so it becomes the root test:

{D1, D2, ..., D14} [9+,5-]
Outlook = Sunny:    {D1,D2,D8,D9,D11}   [2+,3-]  ?
Outlook = Overcast: {D3,D7,D12,D13}     [4+,0-]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14}  [3+,2-]  ?

Which attribute should be tested at the Sunny branch? With Ssunny = {D1,D2,D8,D9,D11}:

Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019
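To check the gain calculations above, here is a short Python sketch (not part of the lecture) that computes the information gain of each attribute over the full set of 14 training examples; the tuples simply transcribe the table above.

    from collections import Counter
    from math import log2

    # (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
    DATA = [
        ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
    ]
    ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

    def entropy(examples):
        n = len(examples)
        return -sum(c / n * log2(c / n) for c in Counter(e[-1] for e in examples).values())

    def gain(examples, attr):
        i, n = ATTRS[attr], len(examples)
        remainder = 0.0
        for v in set(e[i] for e in examples):
            subset = [e for e in examples if e[i] == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(examples) - remainder

    for a in ATTRS:
        print(a, round(gain(DATA, a), 3))
    # Outlook ~0.247, Temperature ~0.029, Humidity ~0.152, Wind ~0.048 -- Outlook wins at the root.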
Brief History of Decision Tree Learning Algorithms

• late 1950s – Bruner et al. in psychology work on modelling concept acquisition
• early 1960s – Hunt et al. in computer science work on Concept Learning Systems (CLS)
• late 1970s – Quinlan's Iterative Dichotomizer 3 (ID3), based on CLS, is efficient at learning on then-large data sets
• early 1990s – ID3 adds features, develops into C4.5, becomes the "default" machine learning algorithm
• late 1990s – C5.0, commercial version of C4.5 (available from SPSS and www.rulequest.com)
• current – widely available and applied; influential techniques

Hypothesis Space Search by ID3

[Figure: ID3 searches the space of decision trees, starting from the empty tree and repeatedly adding attribute tests (A1, A2, A3, A4, ...) guided by the training examples.]

• Hypothesis space is complete! (contains all finite discrete-valued functions w.r.t. the attributes)
  – the target function is surely in there...
• Outputs a single hypothesis (which one?)
  – can't play 20 questions...
• No backtracking
  – local minima...
• Statistically-based search choices
  – robust to noisy data...
• Inductive bias: approximately "prefer the shortest tree"

Inductive Bias in ID3

Note that H is the power set of instances X.

→ Unbiased? Not really...

• Preference for short trees, and for those with high information gain attributes near the root
• Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
• An incomplete search of a complete hypothesis space, versus a complete search of an incomplete hypothesis space (as in learning conjunctive concepts)
• Occam's razor: prefer the shortest hypothesis that fits the data

Occam's Razor

William of Ockham (c. 1287–1347): "Entities should not be multiplied beyond necessity."

Why prefer short hypotheses?

Argument in favour:

• There are fewer short hypotheses than long hypotheses
  → a short hypothesis that fits the data is unlikely to be a coincidence
  → a long hypothesis that fits the data might be a coincidence

Argument opposed:

• There are many ways to define small sets of hypotheses
  – e.g., all trees with a prime number of nodes that use attributes beginning with "Z"
• What's so special about small sets based on the size of the hypothesis?

We will come back to this topic again.

Overfitting in Decision Tree Learning

Recall the tree learned from D1–D14:

Outlook = Sunny:
|  Humidity = High: No
|  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
|  Wind = Strong: No
|  Wind = Weak: Yes

(For Ssunny = {D1,D2,D8,D9,D11}: Gain(Ssunny, Humidity) = .970, Gain(Ssunny, Temperature) = .570, Gain(Ssunny, Wind) = .019, so Humidity was chosen below Sunny.)

Now consider adding a noisy training example #15:

Sunny, Hot, Normal, Strong, PlayTennis = No

What effect does this have on the earlier tree? The new example reaches the Humidity = Normal leaf but is labelled No, so that leaf is no longer pure and the tree must grow additional splits just to fit the noise.
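A small Python sketch (not from the lecture; the added noisy example is the hypothetical #15 above) makes this concrete by recomputing the gain of Humidity on the Sunny subset before and after the extra example is added.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def humidity_gain(examples):
        """examples: (Humidity, PlayTennis) pairs for the Sunny subset."""
        n = len(examples)
        remainder = 0.0
        for v in ("High", "Normal"):
            subset = [label for humidity, label in examples if humidity == v]
            if subset:
                remainder += len(subset) / n * entropy(subset)
        return entropy([label for _, label in examples]) - remainder

    s_sunny = [("High", "No"), ("High", "No"), ("High", "No"), ("Normal", "Yes"), ("Normal", "Yes")]
    print(humidity_gain(s_sunny))                 # 0.971: both branches are pure, so growth stops here

    s_sunny += [("Normal", "No")]                 # the noisy example #15
    print(humidity_gain(s_sunny))                 # ~0.459: the Normal branch is impure, so the
                                                  # tree grows extra tests below it to fit the noise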
Overfitting in General

Consider the error of hypothesis h over

• the training data: error_train(h)
• the entire distribution D of data: error_D(h)

Definition: hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

[Figure: accuracy (0.5–0.9) versus size of tree (number of nodes), on training data and on test data; training accuracy keeps increasing as the tree grows, while test accuracy peaks and then declines.]

Avoiding Overfitting

How can we avoid overfitting?

• pre-pruning: stop growing the tree when a data split is not statistically significant
• post-pruning: grow the full tree, then remove sub-trees which are overfitting

Post-pruning avoids the problem of "early stopping".

How to select the "best" tree:

• measure performance over the training data?
• measure performance over a separate validation data set?
• MDL: minimize size(tree) + size(misclassifications(tree))?

Avoiding Overfitting: Pre-pruning

• Usually based on a statistical significance test
• Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• Most popular test: the chi-squared test
• ID3: chi-squared test plus information gain
  – only statistically significant attributes were allowed to be selected by the information gain procedure
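As an illustration (a sketch only, not the actual ID3/C4.5 code), the chi-squared stopping test can be phrased as below. The contingency table is Humidity versus PlayTennis over the 14 training examples, and the 5% significance level is an assumed threshold, not one from the lecture.

    # Requires scipy. Rows are attribute values (High, Normal), columns are classes (Yes, No).
    from scipy.stats import chi2_contingency

    observed = [[3, 4],   # Humidity = High:   3 Yes, 4 No
                [6, 1]]   # Humidity = Normal: 6 Yes, 1 No

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(round(chi2, 3), round(p_value, 3))

    if p_value > 0.05:    # assumed significance level
        print("association not significant: pre-pruning would stop growing at this node")
    else:
        print("association significant: keep growing the tree")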
Avoiding Overfitting: Early Stopping

• Pre-pruning may suffer from early stopping: it may stop the growth of the tree prematurely
• Classic example: the XOR/parity problem
  – no individual attribute exhibits a significant association with the class
  – the target structure is only visible in the fully expanded tree
  – pre-pruning won't expand the root node
• But: XOR-type problems are not common in practice
• And: pre-pruning is faster than post-pruning

Avoiding Overfitting: Post-pruning

• Builds the full tree first and prunes it afterwards
  – attribute interactions are visible in the fully-grown tree
• Problem: identifying the subtrees and nodes that are due to chance effects
• Two main pruning operations:
  – subtree replacement
  – subtree raising
• Possible strategies: error estimation, significance testing, the MDL principle
• We examine two methods: reduced-error pruning and error-based pruning

Reduced-Error Pruning

Split the data into a training set and a validation set.

Do until further pruning is harmful:

1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy

• Good: produces the smallest version of the most accurate subtree
• Bad: reduces the effective size of the training set

[Figure: accuracy versus size of tree (number of nodes) on training data, on test data, and on test data during pruning; reduced-error pruning restores the test-set accuracy lost to overfitting.]

Sub-tree replacement

Bottom-up: a tree is considered for replacement once all of its sub-trees have been considered.

Sub-tree raising

Deletes a node and redistributes its instances – more complicated and slow.

Error-based pruning (C4.5)

Quinlan (1993) describes the successor to ID3 – C4.5:

• many extensions – see below
• post-pruning using the training set
• includes sub-tree replacement and sub-tree raising
• also: pruning by converting the tree to rules
• commercial version – C5.0 – is widely used (see RuleQuest.com)

Error-based pruning

The goal is to improve the estimate of error on unseen data using all and only the data from the training set.

• The pruning operation is performed if it does not increase the estimated error
• C4.5's method: use the upper limit of a 25% confidence interval derived from the training data
  – standard Bernoulli-process-based method
  – note: statistically motivated, but not statistically valid
  – but: it works well in practice!
• The error estimate for a subtree is the weighted sum of the error estimates for all its leaves
• The error estimate for a node is

  e = ( f + z²/(2N) + z · sqrt( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

  where f is the error on the training data, N is the number of instances covered by the leaf, and z comes from the normal distribution: if c = 25% then z = 0.69.

Example (the "health plan contribution" node):

• the node itself measures f = 0.36, e = 0.46
• its sub-trees measure:
  – none: f = 0.33, e = 0.47
  – half: f = 0.5, e = 0.72
  – full: f = 0.33, e = 0.47
• the sub-trees combined in the ratio 6 : 2 : 6 give an estimate of 0.51
• the sub-trees are estimated to give a greater error, so they are pruned away
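The estimate is easy to compute. The sketch below (not C4.5's code) implements the formula above and roughly reproduces the health-plan example, assuming the three sub-tree leaves cover 6, 2 and 6 instances (consistent with the 6 : 2 : 6 weighting and the quoted f values) and that the node itself covers the same 14 instances with 5 errors.

    from math import sqrt

    def pessimistic_error(f, N, z=0.69):
        """Upper limit of the confidence interval on the error rate (c = 25% gives z = 0.69).
        f = observed error rate on the training data, N = number of instances covered."""
        return (f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

    leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]                # (f, N) for the three leaves (assumed sizes)
    estimates = [pessimistic_error(f, n) for f, n in leaves]
    combined = sum(e * n for e, (_, n) in zip(estimates, leaves)) / sum(n for _, n in leaves)

    print([round(e, 2) for e in estimates], round(combined, 2))  # [0.47, 0.72, 0.47] 0.51
    print(round(pessimistic_error(5 / 14, 14), 2))               # ~0.45 for the node (the slide quotes 0.46)
    # The combined sub-tree estimate (0.51) exceeds the node estimate, so the sub-tree is pruned away.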
Rule Post-Pruning

This method was introduced in Quinlan's C4.5:

1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence for use

For: simpler classifiers; people prefer rules to trees.
Against: does not scale well; slow for large trees and datasets.

Converting A Tree to Rules

For the PlayTennis tree (Outlook at the root, Humidity under Sunny, Wind under Rain), each path from the root to a leaf becomes a rule:

IF (Outlook = Sunny) ∧ (Humidity = High)   THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...

Rules from Trees

Rules can be simpler than trees but just as accurate, e.g., in C4.5Rules:

• a path from the root to a leaf in the (unpruned) tree forms a rule
  – i.e., the tree forms a set of rules
• rules can be simplified independently by deleting conditions
  – i.e., rules can be generalized while maintaining accuracy
• greedy rule simplification algorithm:
  – drop the condition giving the lowest estimated error (as for pruning)
  – continue while the estimated error does not increase

Rules from Trees (Rule Post-Pruning)

Select a "good" subset of rules within a class (C4.5Rules):

• goal: remove rules that are not useful in terms of accuracy
• find a subset of rules which minimises an MDL criterion
• trade off accuracy against the complexity of the rule set
• stochastic search using simulated annealing

Sets of rules can be ordered by class (C4.5Rules):

• order classes by increasing chance of making false positive errors
• set as a default the class with the most training instances not covered by any rule

Continuous Valued Attributes

Decision trees originated for discrete attributes only. Now: continuous attributes.

Can create a discrete attribute to test a continuous value:

• Temperature = 82.5
• (Temperature > 72.3) ∈ {t, f}

• Usual method: continuous attributes get a binary split
• Note:
  – discrete attributes – one split exhausts all values
  – continuous attributes – can have many splits in a tree
• Splits are evaluated on all possible split points
• More computation: n − 1 possible splits for n values of an attribute in the training set
• Fayyad (1991):
  – sort the examples on the continuous attribute
  – find midway boundaries where the class changes, e.g. for Temperature at (48+60)/2 and (80+90)/2
• Choose the best split point by information gain (or the evaluation measure of choice)
• Note: C4.5 uses actual values in the data

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No

Attributes with Many Values

Problem:

• If an attribute has many values, Gain will select it
• Why? It is more likely to split the instances into "pure" subsets
• Imagine using Date = Jun 3 1996 as an attribute
• High gain on the training set, useless for prediction

One approach: use GainRatio instead

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si|/|S|) log2 (|Si|/|S|)

where Si is the subset of S for which A has value vi.

Why does this help?

• SplitInformation is sensitive to how broadly and uniformly the attribute splits the instances
• it is actually the entropy of S with respect to the values of A
• therefore it is higher for many-valued attributes, especially if the instances are spread mostly uniformly across the possible values
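A short sketch (not from the lecture) shows the penalty at work. The split sizes below are the Humidity (7/7) and Outlook (5/4/5) splits of the 14 PlayTennis examples, plus a hypothetical Date-like attribute that puts every example in its own subset.

    from math import log2

    def split_information(subset_sizes):
        """SplitInformation(S, A) = -sum_i |Si|/|S| * log2(|Si|/|S|):
        the entropy of S with respect to the values of A."""
        n = sum(subset_sizes)
        return -sum(c / n * log2(c / n) for c in subset_sizes if c > 0)

    def gain_ratio(gain, subset_sizes):
        return gain / split_information(subset_sizes)

    print(split_information([7, 7]))        # 1.0   (Humidity)
    print(split_information([5, 4, 5]))     # ~1.58 (Outlook)
    print(split_information([1] * 14))      # ~3.81 = log2(14): a Date-like attribute is heavily penalised
    print(gain_ratio(0.940, [1] * 14))      # ~0.25: even a "perfect" gain of 0.940 is sharply discounted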
Attributes with Costs

Consider:

• medical diagnosis, where BloodTest has a cost of $150
• robotics, where Width_from_1ft has a cost of 23 seconds

How do we learn a consistent tree with low expected cost?

Key idea: evaluate gain relative to cost, so as to prefer decision trees that use lower-cost attributes.

• Tan and Schlimmer (1990): replace gain by

  Gain²(S, A) / Cost(A)

• Nunez (1988): replace gain by

  (2^Gain(S,A) − 1) / (Cost(A) + 1)^w

  where w ∈ [0, 1] determines the importance of cost

More recently:

• Domingos (1999) – MetaCost, a meta-learning wrapper approach
  – uses an ensemble learning method to estimate probabilities
  – decision-theoretic approach
• General problem: class costs, instance costs, ...
• See5 / C5.0 can use costs

Unknown Attribute Values

What if some examples are missing values of A? Use the training example anyway, and sort it through the tree. Here are three possible approaches:

• If node n tests A, assign the most common value of A among the other examples sorted to node n
• Assign the most common value of A among the other examples with the same target value
• Assign probability pi to each possible value vi of A
  – assign fraction pi of the example to each descendant in the tree

Note: new (unseen) examples need to be classified in the same fashion.

Windowing

Early implementations faced training sets too large for memory. As a solution, ID3 implemented windowing:

1. select a subset of instances – the window
2. construct a decision tree from all instances in the window
3. use the tree to classify the training instances not in the window
4. if all instances are correctly classified then halt, else
5. add selected misclassified instances to the window
6. go to step 2

Windowing was retained in C4.5 because it can lead to more accurate trees. It is related to ensemble learning.

Summary

• Decision tree learning is a practical method for concept learning and other classifier learning tasks
• The TDIDT family descended from ID3 searches a complete hypothesis space – the hypothesis is there, somewhere...
• It uses a search (or preference) bias; the search for the optimal tree is not tractable
• Overfitting is inevitable with an expressive hypothesis space and noisy data, so pruning is important
• Decades of research have gone into extensions and refinements of the general approach
• It is the "default" machine learning method, and illustrates many general issues
• It can be updated with the use of "ensemble" methods