Lecture 9: Data Mining CS57300, Purdue University, September 23, 2010
Project proposal

Due: Tuesday, Oct 5
Length: 1/2 page
Content:
• The project's goals, including the primary task and possible hypotheses
• A description of the data that you will use
• A list of the algorithms that you will develop and/or analyze
Decision trees (cont.)
When to stop growing

• Full-growth methods stop splitting when (sketched below):
  • All samples at a node belong to the same class
  • There are no attributes left for further splits
  • There are no samples left
• What impact does this have on the quality of the learned trees?
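A minimal sketch of the full-growth stopping test, assuming `labels` holds the class labels of the training samples reaching a node and `attributes` holds the attributes still available for splitting (both names are hypothetical):

```python
def should_stop(labels, attributes):
    """Full-growth stopping test: stop when any of the three conditions holds."""
    if len(labels) == 0:          # no samples left
        return True
    if len(set(labels)) == 1:     # all samples belong to the same class
        return True
    if len(attributes) == 0:      # no attributes left for further splits
        return True
    return False
```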
Overfitting

• Overfitting the training data (illustrated below)
  • Given a model space M, a model m ∈ M overfits the training data if there exists m' ∈ M such that m has smaller error than m' on the training data, but m' has smaller error on the entire distribution of instances
• Approaches for avoiding overfitting
  • Prepruning
  • Postpruning
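A small illustration of the definition, using scikit-learn on synthetic data (an assumption; the lecture names no library or dataset): as the tree grows deeper, training error keeps falling, while held-out error, a stand-in for error on the entire distribution, eventually rises.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy binary classification data.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 5, 10, None):   # None grows the tree to full depth
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train error={1 - m.score(X_tr, y_tr):.3f}, "
          f"held-out error={1 - m.score(X_te, y_te):.3f}")
```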
Pruning

• Postpruning (a sketch follows below)
  • Use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown)
• Prepruning
  • Apply a statistical test to decide whether to expand a node
  • Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)
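A minimal sketch of the postpruning idea, in the reduced-error style: prune a node to a leaf whenever that does not hurt accuracy on a held-out validation set. The `Node` class, its fields, and the `(example, label)` data format are hypothetical, chosen only to make the sketch self-contained.

```python
class Node:
    """A decision-tree node; a node with no children acts as a leaf."""
    def __init__(self, attr=None, children=None, majority_class=None):
        self.attr = attr                      # attribute tested at this node
        self.children = children or {}        # attribute value -> subtree
        self.majority_class = majority_class  # most common training class here

def predict(node, x):
    while node.children and x.get(node.attr) in node.children:
        node = node.children[x[node.attr]]
    return node.majority_class

def accuracy(tree, data):                     # data: list of (example, label)
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def reduced_error_prune(root, node, val_data):
    """Bottom-up pass: keep a prune only if validation accuracy does not drop."""
    for child in node.children.values():
        reduced_error_prune(root, child, val_data)
    if not node.children:
        return
    before = accuracy(root, val_data)
    subtrees = node.children
    node.children = {}                        # tentatively prune to a leaf
    if accuracy(root, val_data) < before:
        node.children = subtrees              # pruning hurt: undo it
```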
Algorithm comparison

• CART
  • Evaluation criterion: Gini index
  • Search algorithm: simple-to-complex, hill-climbing search
  • Stopping criterion: when leaves are pure
  • Pruning mechanism: cross-validation to select the Gini threshold
• C4.5
  • Evaluation criterion: information gain
  • Search algorithm: simple-to-complex, hill-climbing search
  • Stopping criterion: when leaves are pure
  • Pruning mechanism: reduced-error pruning

Both evaluation criteria are sketched below.
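The two evaluation criteria, computed for a node whose training samples have class labels `labels` (a hypothetical input format):

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index, 1 - sum_k p_k^2: CART's split-evaluation criterion."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy, -sum_k p_k log2(p_k): the quantity behind information gain."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """C4.5's criterion: parent entropy minus the weighted entropy of the
    children, where children is a list of label lists, one per branch."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
```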
Pruning methods

• Reduced-error pruning (Quinlan 1987; Mingers 1987; Esposito et al. 1996)
• Minimal error pruning
• Pessimistic error pruning (Quinlan 1987)
• Error-based pruning (Quinlan 1987)
• Cost-complexity pruning (Breiman et al. 1984; sketched below)

Source: www.ailab.si/blaz/predavanja/uisp/slides/uisp05-PostPruning.ppt
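As one concrete example from the list, a sketch of cost-complexity pruning via scikit-learn's implementation (an assumption; the lecture does not prescribe a library): each alpha penalizes tree size and yields a distinct pruned subtree, and cross-validation picks among them, as in CART.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Candidate alphas: each corresponds to a distinct pruned subtree of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree has the best cross-validated accuracy.
best = max((cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                            X, y, cv=5).mean(), a)
           for a in path.ccp_alphas)
print(f"CV accuracy {best[0]:.3f} at alpha={best[1]:.4f}")
```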