Stats 202 - Lecture 7

# Stats 202 - Lecture 7 - Statistics 202 Statistical Aspects...

1 Statistics 202: Statistical Aspects of Data Mining. Professor Rajan Patel. Lecture 7: Start Chapter 4.

Agenda:

1) Assign Homework 3
2) Start lecturing over Chapter 4

2 Introduction to Data Mining by Tan, Steinbach, Kumar. Chapter 4: Classification: Basic Concepts, Decision Trees, and Model Evaluation

3 Illustration of the Classification Task

Training Set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 1   | Yes     | Large   | 125K    | No    |
| 2   | No      | Medium  | 100K    | No    |
| 3   | No      | Small   | 70K     | No    |
| 4   | Yes     | Medium  | 120K    | No    |
| 5   | No      | Large   | 95K     | Yes   |
| 6   | No      | Medium  | 60K     | No    |
| 7   | Yes     | Large   | 220K    | No    |
| 8   | No      | Small   | 85K     | Yes   |
| 9   | No      | Medium  | 75K     | No    |
| 10  | No      | Small   | 90K     | Yes   |

Test Set:

| Tid | Attrib1 | Attrib2 | Attrib3 | Class |
|-----|---------|---------|---------|-------|
| 11  | No      | Small   | 55K     | ?     |
| 12  | Yes     | Medium  | 80K     | ?     |
| 13  | Yes     | Large   | 110K    | ?     |
| 14  | No      | Small   | 95K     | ?     |
| 15  | No      | Large   | 67K     | ?     |

Diagram flow: Training Set -> Learning Algorithm -> Learn Model (induction) -> Model -> Apply Model to the Test Set (deduction).

4 Classification: Definition

- Given a collection of records (training set). Each record contains a set of attributes (x), with one additional attribute which is the class (y).
- Find a model to predict the class as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

5 Classification Examples

- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Predicting tumor cells as benign or malignant
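
The train/test procedure in the definition on slide 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten labeled records are the ones from slide 3, the split (first seven for training, last three for testing) is arbitrary, and the majority-class "model" is a stand-in baseline, not the decision tree method this chapter covers. The names `learn_majority_model` and `accuracy` are hypothetical.

```python
from collections import Counter

# Labeled records from slide 3: (Attrib1, Attrib2, Attrib3 in thousands, Class)
records = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]

# Assumption: an arbitrary split, first 7 records for training, last 3 for testing.
train, test = records[:7], records[7:]


def learn_majority_model(training_records):
    """Induction step: return a model that always predicts the most common class."""
    majority = Counter(r[-1] for r in training_records).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, labeled_records):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(model(r[:-1]) == r[-1] for r in labeled_records)
    return correct / len(labeled_records)


model = learn_majority_model(train)      # build the model on the training set
test_accuracy = accuracy(model, test)    # validate it on the held-out test set
print(f"test accuracy: {test_accuracy:.2f}")
```

With this particular split the baseline gets one of the three held-out records right. A learned decision tree would take the place of `learn_majority_model`, while the accuracy computation would stay the same.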

6 Classification Techniques

- There are many techniques/algorithms for carrying out classification
- In this chapter we will study only decision trees
- In Chapter 5 we will study other techniques, including some very modern and effective techniques

7 An Example of a Decision Tree

Training Data:

| Tid | Refund (categorical) | Marital Status (categorical) | Taxable Income (continuous) | Cheat (class) |
|-----|----------------------|------------------------------|-----------------------------|---------------|
| 1   | Yes                  | Single                       | 125K                        | No            |
| 2   | No                   | Married                      | 100K                        | No            |
| 3   | No                   | Single                       | 70K                         | No            |
| 4   | Yes                  | Married                      | 120K                        | No            |
| 5   | No                   | Divorced                     | 95K                         | Yes           |
| 6   | No                   | Married                      | 60K                         | No            |
| 7   | Yes                  | Divorced                     | 220K                        | No            |
| 8   | No                   | Single                       | 85K                         | Yes           |
| 9   | No                   | Married                      | 75K                         | No            |
| 10  | No                   | Single                       | 90K                         | Yes           |

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

- Refund = Yes: NO
- Refund = No: split on MarSt
  - MarSt = Married: NO
  - MarSt = Single, Divorced: split on TaxInc
    - TaxInc < 80K: NO
    - TaxInc > 80K: YES
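
The tree on slide 7 can be written as a small Python function and checked against the ten training records. This is a sketch, not the lecture's code: the function name `predict_cheat` is hypothetical, and because the slide only labels the income branches "< 80K" and "> 80K", sending an income of exactly 80K down the NO branch is an assumption.

```python
# Training data from slide 7:
# (Refund, Marital Status, Taxable Income in thousands, Cheat)
training_data = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, taxable_income):
    """Follow the slide-7 tree from the root to a leaf and return the class."""
    if refund == "Yes":
        return "No"                      # Refund = Yes leaf
    if marital_status == "Married":
        return "No"                      # Refund = No, MarSt = Married leaf
    # Single or Divorced: split on taxable income at 80K.
    # Assumption: exactly 80K is treated as the "< 80K" (NO) branch.
    return "Yes" if taxable_income > 80 else "No"


# Count training records the tree misclassifies.
training_errors = sum(
    predict_cheat(refund, status, income) != cheat
    for refund, status, income, cheat in training_data
)
print(f"training errors: {training_errors} of {len(training_data)}")
```

Each `if` corresponds to one splitting attribute, so the nesting order of the checks mirrors the depth of the nodes in the tree.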

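8 Applying the Tree Model to Predict the Class for a New Observation

Model: Decision Tree

- Refund = Yes: NO
- Refund = No: split on MarSt
  - MarSt = Married: NO
  - MarSt = Single, Divorced: split on TaxInc
    - TaxInc < 80K: NO
    - TaxInc > 80K: YES

New observation:

| Refund | Marital Status | Taxable Income | Cheat |
|--------|----------------|----------------|-------|
| No     | Married        | 80K            | ?     |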
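
Applying the tree on slide 8 amounts to walking from the root to a leaf, one attribute test at a time. The sketch below stores the tree as nested dictionaries and records the path taken for the new observation. The dictionary layout, the key names (`attribute`, `branches`, `threshold`, `below`, `above`), and the function `classify` are all illustrative assumptions, as is the choice to send an income of exactly 80K to the `below` branch.

```python
# Continuous split on taxable income (in thousands).
# Assumption: a value equal to the threshold follows the "below" branch.
taxinc_node = {"attribute": "TaxInc", "threshold": 80, "below": "No", "above": "Yes"}

# The tree from slide 8; leaves are plain class labels.
tree = {
    "attribute": "Refund",
    "branches": {
        "Yes": "No",
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "No",
                "Single": taxinc_node,
                "Divorced": taxinc_node,
            },
        },
    },
}


def classify(node, record):
    """Walk the tree for one record; return (predicted class, path of tests taken)."""
    path = []
    while isinstance(node, dict):        # internal node: apply its attribute test
        attribute = node["attribute"]
        value = record[attribute]
        if "threshold" in node:          # continuous attribute
            branch = "above" if value > node["threshold"] else "below"
            node = node[branch]
        else:                            # categorical attribute
            branch = value
            node = node["branches"][value]
        path.append((attribute, branch))
    return node, path                    # node is now a leaf label


new_observation = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
prediction, path = classify(tree, new_observation)
print(prediction, path)
```

For this observation the walk tests Refund and then MarSt and stops at a leaf, so the taxable income is never examined and the 80K boundary assumption does not affect the result.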
