# DSCI 4520/5240 Data Mining - Lecture 4: Decision Tree Algorithms

*Some slide material taken from: Witten & Frank 2000; Olson & Shi 2007; de Ville 2006; SAS Education 2005.*

## Objective

A review of some Decision Tree algorithms.
## Decision Trees: a credit-risk example

This example concerns determining credit risks. We have 10 people in total: 6 are good risks and 4 are bad. We first split the tree on employment status, which yields 7 employed and 3 not employed. All 3 of the not-employed people are bad credit risks, so this split has taught us something about our data.

The not-employed node cannot be split any further, since all of its records belong to a single class; this is called a *pure node*. The other node, however, can be split again based on a different criterion, so we can continue to grow the tree on that side.

Corresponding rules:

- IF employed = yes AND married = yes THEN risk = good
- IF employed = yes AND married = no THEN risk = good
- IF employed = no THEN risk = bad
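The three rules above can be written directly as a small Python function (a minimal sketch; the attribute names follow the slide, and the function name is illustrative):

```python
def credit_risk(employed: bool, married: bool) -> str:
    """Apply the lecture's credit-risk decision-tree rules."""
    if not employed:
        # Pure node: every not-employed person in the data is a bad risk,
        # so this branch needs no further split.
        return "bad"
    # Employed branch: whether married or not, the predicted risk is good.
    return "good"
```

Note that `married` never changes the prediction here; the two "employed = yes" rules collapse into one, which is exactly why a tree-growing algorithm would stop splitting that node in practice.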

## Decision Tree performance: confidence and support

*Confidence* is the degree of accuracy of a rule. *Support* is the degree to which the rule's conditions occur in the data.

EXAMPLE: If 10 customers purchased Zane Grey's *The Young Pitcher* and 8 of them also purchased *The Short Stop*, then the rule {IF basket has *The Young Pitcher* THEN basket has *The Short Stop*} has confidence 8/10 = 0.80. If these were the only 10 purchases covering these books out of 10,000,000 purchases, the support is only 10/10,000,000 = 0.000001.
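The confidence computation can be sketched in Python as follows (a minimal illustration; the function name is not from the slides, and support here follows the slide's usage as the fraction of all purchases containing the rule's condition):

```python
def rule_stats(baskets, antecedent, consequent):
    """Confidence and support for: IF antecedent in basket THEN consequent.

    confidence = P(consequent | antecedent), estimated from the baskets
    support    = fraction of all baskets containing the antecedent
    """
    has_a = [b for b in baskets if antecedent in b]
    has_both = [b for b in has_a if consequent in b]
    confidence = len(has_both) / len(has_a) if has_a else 0.0
    support = len(has_a) / len(baskets)
    return confidence, support

# The slide's 10 relevant baskets: 8 contain both books, 2 only the first.
baskets = ([{"The Young Pitcher", "The Short Stop"}] * 8
           + [{"The Young Pitcher"}] * 2)
conf, _ = rule_stats(baskets, "The Young Pitcher", "The Short Stop")
# conf == 0.8
```

On the full data set of 10,000,000 purchases (too many to list here), the same function would return support 10/10,000,000 = 0.000001 for this rule.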
## Rule interestingness

*Interestingness* is the idea that Data Mining should discover something unexpected. Consider the rule {IF basket has eggs THEN basket has bacon}, and suppose its confidence is 0.90 and its support is 0.20. This may be a useful rule, but it is not interesting if the grocer was already aware of the association. Recall the definition of DM as the discovery of previously *unknown* knowledge!

## Rule Induction algorithms

Rule induction algorithms are recursive: they identify data partitions of progressively better separation with respect to the outcome, then organize the partitions into a decision tree.

Common algorithms: 1R, ID3, C4.5/C5.0, CART, CHAID, CN2, BruteDL, SDL.
## Illustration of two tree algorithms

- 1R, and discretization in 1R
- Naïve Bayes classification
- ID3: minimum entropy and maximum information gain

## 1R: Inferring Rudimentary Rules

1R learns a 1-level decision tree; in other words, it generates a set of rules that all test on one particular attribute. The basic version (assuming nominal attributes) creates one branch for each of the attribute's values.
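The basic 1R procedure can be sketched in Python: for each attribute, predict the majority class for each of its values, then keep the attribute whose rule set makes the fewest errors. This is a minimal illustration assuming nominal attributes; the toy rows and function name are hypothetical, not from the slides.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Basic 1R: return (best_attribute, {value: predicted_class}, errors)."""
    best = None
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One branch per attribute value, predicting that value's majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors = records not in their branch's majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Toy data in the spirit of the credit-risk example (rows are made up).
rows = [
    {"employed": "yes", "married": "yes", "risk": "good"},
    {"employed": "yes", "married": "no",  "risk": "good"},
    {"employed": "yes", "married": "no",  "risk": "bad"},
    {"employed": "no",  "married": "yes", "risk": "bad"},
    {"employed": "no",  "married": "no",  "risk": "bad"},
]
attr, rule, errors = one_r(rows, ["employed", "married"], "risk")
# attr == "employed"; rule == {"yes": "good", "no": "bad"}; errors == 1
```

On this toy data, the `employed` rule set misclassifies only 1 of 5 rows while `married` misclassifies 2, so 1R selects `employed`, matching the first split in the credit-risk example.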
