# Lecture08 - Data Mining: Principles and Algorithms Jianyong...


Data Mining: Principles and Algorithms (12/3/2009)

Jianyong Wang, Database Lab, Institute of Software, Department of Computer Science and Technology, Tsinghua University

## Chapter 4. Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Artificial Neural Networks
- Support Vector Machines (SVM)
- Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- Ensemble methods
- Prediction
- Accuracy and error measures
- Summary

## Classification: Definition

- Given a collection of records (the training set):
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Training set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 1   | Yes     | Large   | 125K    | No    |
| 2   | No      | Medium  | 100K    | No    |
| 3   | No      | Small   | 70K     | No    |
| 4   | Yes     | Medium  | 120K    | No    |
| 5   | No      | Large   | 95K     | Yes   |
| 6   | No      | Medium  | 60K     | No    |
| 7   | Yes     | Large   | 220K    | No    |
| 8   | No      | Small   | 85K     | Yes   |
| 9   | No      | Medium  | 75K     | No    |
| 10  | No      | Small   | 90K     | Yes   |

Test set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 11  | No      | Small   | 55K     | ?     |
| 12  | Yes     | Medium  | 80K     | ?     |
| 13  | Yes     | Large   | 110K    | ?     |
| 14  | No      | Small   | 95K     | ?     |
| 15  | No      | Large   | 67K     | ?     |

Figure: the training set is passed to a learning algorithm, which learns a model (induction); the model is then applied to the test set (deduction).
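
The induction and deduction steps in the definition above can be tried on the slide's own training and test records. The following is a minimal sketch, not the lecture's method: it assumes a deliberately simple one-attribute rule learner (in the spirit of OneR) restricted to the two categorical attributes, and the names `learn_one_rule` and `apply_model` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter, defaultdict

# Training records from the slide: (Attrib1, Attrib2, Attrib3 in thousands, Class)
TRAIN = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]

# Test records from the slide (class unknown): (Attrib1, Attrib2, Attrib3)
TEST = [
    ("No", "Small", 55),
    ("Yes", "Medium", 80),
    ("Yes", "Large", 110),
    ("No", "Small", 95),
    ("No", "Large", 67),
]


def learn_one_rule(records, attr_indices):
    """Induction (assumed learner): choose the single attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    for i in attr_indices:
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[i]][rec[-1]] += 1  # rec[-1] is the class label
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[rec[i]] != rec[-1] for rec in records)
        if best is None or errors < best[2]:
            best = (i, rule, errors)
    return best  # (attribute index, value -> class mapping, training errors)


def apply_model(model, record, default):
    """Deduction: look up the record's attribute value in the learned rule,
    falling back to a default class for values never seen in training."""
    attr_index, rule, _ = model
    return rule.get(record[attr_index], default)


# Only the categorical attributes (indices 0 and 1) are considered here;
# handling the numeric Attrib3 would need a threshold search.
model = learn_one_rule(TRAIN, attr_indices=[0, 1])
default = Counter(rec[-1] for rec in TRAIN).most_common(1)[0][0]
predictions = [apply_model(model, rec, default) for rec in TEST]

for tid, label in zip(range(11, 16), predictions):
    print(tid, label)
```

On this data the sketch selects Attrib2 (two training errors, versus three for Attrib1). Any real learner from the chapter outline, such as decision tree induction, would follow the same learn-then-apply shape with a richer model.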

## Examples of Classification Tasks

- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

## Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each record/tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - Training set: the set of records/tuples used for model construction.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model:
    - The known label of each test sample is compared with the model's classification result.
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
    - The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
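
The accuracy estimate in the model-usage step amounts to a short computation. The sketch below assumes a simple random hold-out split to keep the test set disjoint from the training set; the function names `holdout_split` and `accuracy_rate`, the 30% test fraction, and the example labels are all illustrative assumptions rather than anything specified in the slides.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into disjoint
    training and test sets (assumed hold-out scheme)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test samples whose predicted class matches the known label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical known labels and model outputs for five test samples.
known = ["No", "Yes", "No", "No", "Yes"]
predicted = ["No", "Yes", "Yes", "No", "Yes"]
print(accuracy_rate(known, predicted))  # 4 of 5 correct -> 80.0

# Splitting ten record ids: no id appears in both sets.
train_ids, test_ids = holdout_split(range(1, 11))
print(len(train_ids), len(test_ids))
```

Measuring accuracy on records that were also used for training would reward a model for memorizing them, which is why the split keeps the two sets disjoint.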


## This note was uploaded on 06/02/2010 for the course COMPUTER DM2009F taught by Professor Wangwei during the Fall '09 term at Tsinghua University.
