
Statistical Data Mining
ORIE 474, Fall 2007
Tatiyana V. Apanasovich
Introduction, 08/27/07

1.2 Nature of Data Sets

What is a data set?
  A set of measurements taken from some environment or process.
  Simplest case: a collection of n objects, with the same p measurements recorded for each object, giving an n x p data matrix.
  The observed data matrix can also be referred to as: data set, training data, sample, database.
  [Figure: n x p data matrix, rows indexed 1, ..., n and columns indexed 1, 2, ..., p]
Ex: Adult data

ID  Age  Work class  Education       Marital Status  Sex     Income class
25  59   Private     HS graduate     Divorced        Female  <=50K
26  56   Local gov   Bachelors       Married         Male    >50K
27  19   Private     HS graduate     Never married   Male    <=50K
28  54   N/A         Some college    Married         Male    >50K
29  39   Private     HS graduate     Divorced        Male    <=50K
30  49   Private     HS graduate     Married         Male    <=50K
31  23   Local gov   Assoc academic  Never married   Male    <=50K

Source: UCI Machine Learning Repository
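
As a rough illustration of the n x p data matrix, the Adult excerpt can be held as a list of rows in Python. This is only a sketch: treating ID as a row label rather than one of the p measurements is an assumption, and the names `columns`, `ids`, and `data` are chosen here for illustration.

```python
# Rows copied from the Adult data excerpt; "N/A" marks a missing value.
columns = ["Age", "Work class", "Education", "Marital Status", "Sex", "Income class"]
ids = [25, 26, 27, 28, 29, 30, 31]  # assumed to be row labels, not measurements

data = [
    [59, "Private", "HS graduate", "Divorced", "Female", "<=50K"],
    [56, "Local gov", "Bachelors", "Married", "Male", ">50K"],
    [19, "Private", "HS graduate", "Never married", "Male", "<=50K"],
    [54, "N/A", "Some college", "Married", "Male", ">50K"],
    [39, "Private", "HS graduate", "Divorced", "Male", "<=50K"],
    [49, "Private", "HS graduate", "Married", "Male", "<=50K"],
    [23, "Local gov", "Assoc academic", "Never married", "Male", "<=50K"],
]

n = len(data)      # number of objects (rows)
p = len(columns)   # number of measurements per object (columns)

# Every object must carry the same p measurements.
assert all(len(row) == p for row in data)

# A simple column summary: mean of the Age measurement.
age_col = columns.index("Age")
mean_age = sum(row[age_col] for row in data) / n

# Locate missing entries as (ID, column name) pairs.
missing = [
    (ids[i], columns[j])
    for i, row in enumerate(data)
    for j, value in enumerate(row)
    if value == "N/A"
]

print(f"data matrix is {n} x {p}")
print(f"mean age: {mean_age:.2f}")
print(f"missing entries: {missing}")
```

Note that the matrix mixes a numeric column (Age) with categorical ones, which is typical of such data sets and is why a plain list of rows is used here rather than a numeric array.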

1.3 Models and Patterns

Model structure: a global summary of a data set.
  A model can make a statement about any point in the full measurement space (treating the rows as p-dimensional vectors).
  Ex: diagnosis based on test results.

Pattern structure: makes statements only about restricted regions of the space spanned by the variables.
  Ex: mail order purchases may reveal that people who buy certain combinations of products are likely to buy others.
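
To make the mail order example concrete, one possible way to quantify such a pattern is the fraction of purchases containing a given combination that also contain another product. The sketch below assumes this measure (often called rule confidence); the baskets are made-up toy data and the function name `rule_confidence` is hypothetical.

```python
# Hypothetical mail-order baskets (made-up data, for illustration only).
baskets = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag", "stove", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"boots", "socks"},
    {"sleeping bag", "stove"},
]


def rule_confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`.

    Returns None when no basket contains the antecedent, because the
    pattern then says nothing about the data.
    """
    matching = [b for b in baskets if antecedent <= b]
    if not matching:
        return None
    return sum(consequent <= b for b in matching) / len(matching)


# The pattern only speaks about baskets with both a tent and a sleeping bag,
# that is, a restricted region of the data, not every possible basket.
conf = rule_confidence(baskets, {"tent", "sleeping bag"}, {"stove"})
print(f"confidence of (tent, sleeping bag) -> stove: {conf:.2f}")
```

Unlike a model, this statement is silent about baskets outside the restricted region, such as the one containing only boots and socks.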
1.4 DM Objectives

Model Building
  A. Exploratory Data Analysis
  B. Descriptive Modeling
  C. Predictive Modeling
Pattern Recognition
  D. Discovering Patterns and Rules
  E. Retrieval by Content

A. Exploratory Data Analysis (Chapter 3)

Goal: explore the data without a clear idea of what we are looking for.
Techniques:
  For p > 3, projection techniques are useful and necessary
  Principal components
  Spatial displays
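
A minimal sketch of a principal components projection, assuming NumPy is available and using synthetic data in place of a real data set. The data, the choice of two components, and all variable names are assumptions made for illustration; the components are obtained here from the singular value decomposition of the centered data matrix.

```python
import numpy as np

# Synthetic n x p data matrix (n = 100 objects, p = 5 measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make two columns strongly correlated

# Center each column so the components describe variation around the mean.
Xc = X - X.mean(axis=0)

# SVD of the centered matrix: rows of Vt are the principal directions,
# ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project the p-dimensional rows onto the first two principal directions.
Z = Xc @ Vt[:2].T

# Share of total variance carried by each component.
explained = s**2 / np.sum(s**2)

print("projected shape:", Z.shape)
print("variance share of first two components:", explained[:2].sum())
```

The two columns of `Z` can then be drawn as an ordinary scatter plot, which is what makes the projection useful once p exceeds 3.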
B. Descriptive Modeling (Chapter 9)

Goal: describe all of the data (or the process generating it).
Ex:
  Models for the overall probability distribution of the data (density estimation)
  Partitioning of the p-dim. measurement space into
  Models describing the relationship between variables (dependency modeling)
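
As one simple instance of density estimation, a normalized histogram gives a piecewise-constant estimate of a one-dimensional probability density. The sketch below assumes NumPy and a synthetic sample; the sample, the number of bins, and the variable names are all illustrative choices, and a histogram is only one of many possible estimators.

```python
import numpy as np

# Synthetic one-dimensional sample standing in for one measured variable.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# density=True rescales the bin counts so the histogram integrates to 1.
density, edges = np.histogram(x, bins=20, density=True)

# Check the normalization: sum of (height * bin width) over all bins.
widths = np.diff(edges)
total = np.sum(density * widths)

# The estimated density at a point is the height of the bin containing it.
point = 0.0
bin_index = np.searchsorted(edges, point, side="right") - 1
bin_index = min(max(bin_index, 0), len(density) - 1)  # clamp to a valid bin

print("integral of estimate:", total)
print("estimated density at 0:", density[bin_index])
```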

C. Predictive Modeling (Chapter 10
