Overfitting-L5

# Overfitting-L5 - CSE572:DataMining Lecture 7 Model...

1 CSE 572: Data Mining Lecture 7: Model Overfitting

2 Classification Errors

- Training errors (apparent errors): errors committed on the training set
- Test errors: errors committed on the test set
- Generalization errors: expected error of a model over a random selection of records from the same distribution
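
As a minimal sketch of how training and test errors are measured, the snippet below counts misclassified records on each set. The classifier `toy_model` and the four-record data sets are hypothetical and invented purely for illustration; they are not from the lecture.

```python
from typing import Callable, List, Tuple

Record = Tuple[float, float]


def error_rate(model: Callable[[Record], str],
               records: List[Record],
               labels: List[str]) -> float:
    """Fraction of records whose predicted label differs from the true label."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    if not records:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(1 for r, y in zip(records, labels) if model(r) != y)
    return wrong / len(records)


def toy_model(record: Record) -> str:
    """Hypothetical classifier: a single threshold on the first attribute."""
    return "+" if record[0] < 10.0 else "o"


# Invented records, for illustration only.
train_X = [(2.0, 3.0), (12.0, 5.0), (8.0, 15.0), (15.0, 15.0)]
train_y = ["+", "o", "+", "o"]
test_X = [(4.0, 1.0), (9.0, 5.0), (11.0, 14.0), (16.0, 2.0)]
test_y = ["+", "o", "o", "+"]

training_error = error_rate(toy_model, train_X, train_y)  # errors on the training set
test_error = error_rate(toy_model, test_X, test_y)        # errors on the test set
print(f"training error = {training_error:.2f}, test error = {test_error:.2f}")
```

The generalization error cannot be computed this way, because it is an expectation over the whole distribution; the test error serves as an estimate of it.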
3 Example Data Set

- Two-class problem: +, o
- 3000 data points (30% for training, 70% for testing)
- Data set for the + class is generated from a uniform distribution
- Data set for the o class is generated from a mixture of 3 Gaussian distributions, centered at (5,15), (10,5), and (15,15)
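
A data set of this shape could be generated roughly as follows. This is a sketch under several assumptions the slide does not state: the uniform range of [0, 20] on both axes, an even 1500/1500 split between the classes, equal mixture weights, and unit standard deviation for each Gaussian. All of these are exposed as parameters.

```python
import random

# Mixture centers as given on the slide.
CENTERS = [(5.0, 15.0), (10.0, 5.0), (15.0, 15.0)]


def make_dataset(n_total=3000, n_plus=1500, low=0.0, high=20.0,
                 sigma=1.0, seed=0):
    """Return a shuffled list of ((x1, x2), label) pairs.

    n_plus, low, high and sigma are assumptions, not values from the slide.
    """
    rng = random.Random(seed)
    data = []
    # "+" class: uniform over the square [low, high] x [low, high].
    for _ in range(n_plus):
        data.append(((rng.uniform(low, high), rng.uniform(low, high)), "+"))
    # "o" class: pick one of the three centers at random, then add Gaussian noise.
    for _ in range(n_total - n_plus):
        cx, cy = rng.choice(CENTERS)
        data.append(((rng.gauss(cx, sigma), rng.gauss(cy, sigma)), "o"))
    rng.shuffle(data)
    return data


def split(data, train_fraction=0.3):
    """Split into a training part (first 30% by default) and a test part."""
    n_train = int(round(len(data) * train_fraction))
    return data[:n_train], data[n_train:]


data = make_dataset()
train, test = split(data)
print(len(train), "training records,", len(test), "test records")
```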

4 Decision Trees

[Figure: two decision trees with splits on x1 and x2, one with 11 leaf nodes and one with 24 leaf nodes; the tree layouts are not recoverable from the preview.]

Which tree is better?
5 Model Overfitting

- Underfitting: when the model is too simple, both training and test errors are large
- Overfitting: when the model is too complex, training error is small but test error is large
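
The two failure modes can be illustrated with a toy sketch. Everything here is invented for illustration: the one-dimensional data, the underlying rule (x < 5 means "+"), the two flipped training labels that stand in for noise, and the three models. The memorizing model is a 1-nearest-neighbor lookup, used as a stand-in for an overly complex model rather than the decision trees discussed in the lecture.

```python
# Toy 1-D data, invented for illustration. The underlying rule is x < 5 -> "+".
# Two training labels (x = 2.5 and x = 7.5) are deliberately flipped as noise.
train = [(0.5, "+"), (1.5, "+"), (2.5, "o"), (3.5, "+"), (4.5, "+"),
         (5.5, "o"), (6.5, "o"), (7.5, "+"), (8.5, "o"), (9.5, "o")]
test = [(1.1, "+"), (2.4, "+"), (2.7, "+"), (3.6, "+"), (4.4, "+"),
        (5.6, "o"), (6.4, "o"), (7.4, "o"), (7.7, "o"), (8.6, "o")]


def constant_model(x):
    """Too simple: ignores the input entirely."""
    return "+"


def threshold_model(x):
    """Moderate complexity: a single split."""
    return "+" if x < 5.0 else "o"


def memorizing_model(x):
    """Too complex: copies the label of the nearest training record, noise included."""
    _, nearest_label = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest_label


def error_rate(model, data):
    """Fraction of (x, label) pairs the model gets wrong."""
    return sum(1 for x, y in data if model(x) != y) / len(data)


models = [("constant", constant_model),
          ("threshold", threshold_model),
          ("memorizing", memorizing_model)]
results = {name: (error_rate(m, train), error_rate(m, test)) for name, m in models}
for name, (train_err, test_err) in results.items():
    print(f"{name:>10}: train err = {train_err:.0%}, test err = {test_err:.0%}")
```

On this toy data the constant model has large errors on both sets (underfitting), while the memorizing model reaches zero training error but a much larger test error (overfitting), because it reproduces the flipped labels.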

6 Mammal Classification Problem

[Figure: training set table (not reproduced) and a decision tree model that splits on Body Temperature (Warm / Cold) and Give Birth (Yes / No), with leaves Mammals, Non-mammals, Non-mammals.]

- Training error = 0%
7 Effect of Noise

Example: Mammal Classification problem

[Figure: training set and test set tables (not reproduced) and two decision trees. One tree splits on Body Temperature (Warm-blooded / Cold-blooded) and Give Birth (Yes / No); the other additionally splits on Four-legged (Yes / No). Leaves are labeled Mammals or Non-mammals.]

- Model M1: train err = 0%, test err = 30%
- Model M2: train err = 20%, test err = 10%

8 Lack of Representative Samples

[Figure: training set and test set tables (not reproduced) and a decision tree that splits on Body Temperature (Warm-blooded / Cold-blooded), Hibernates (Yes / No), and Four-legged (Yes / No), with leaves labeled Mammals or Non-mammals.]

- Model M3: train err = 0%, test err = 30%
- Lack of training records at the leaf nodes for making reliable classification
9 Effect of Multiple Comparison Procedure

- Consider the task of predicting whether the stock market will rise or fall in each of the next 10 trading days
- Random guessing: P(correct) = 0.5
- Make 10 random guesses in a row:

| Day | Guess |
|-----|-------|
| 1   | Up    |
| 2   | Down  |
| 3   | Down  |
| 4   | Up    |
| 5   | Down  |
| 6   | Down  |
| 7   | Up    |
| 8   | Up    |
| 9   | Up    |
| 10  | Down  |
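
With P(correct) = 0.5 per day, the number of correct guesses out of 10 follows a binomial distribution, so the chance of a lucky streak can be computed directly. The sketch below is an illustration under assumed numbers: the threshold of 8 correct days and the pool of 50 independent guessers are hypothetical choices, not values taken from the preview.

```python
from math import comb


def prob_at_least(k, n=10, p=0.5):
    """P(at least k of n independent guesses are correct), each correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def prob_any_of(m, k, n=10, p=0.5):
    """P(at least one of m independent guessers gets at least k of n correct)."""
    return 1.0 - (1.0 - prob_at_least(k, n, p)) ** m


single = prob_at_least(8)    # one guesser, at least 8 of 10 correct (assumed threshold)
pooled = prob_any_of(50, 8)  # best of 50 guessers (assumed pool size)
print(f"one guesser:  {single:.4f}")
print(f"best of 50:   {pooled:.4f}")
```

A single random guesser rarely looks good, but when many candidates are compared and only the best is kept, an impressive record becomes likely by chance alone. The same effect arises when a learning algorithm picks the best of many candidate splits or models on the training set.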


## This note was uploaded on 04/08/2010 for the course CS 420 taught by Professor Dawsonengler during the Spring '02 term at San Jose State.

