Chapter 3
Data Mining Concepts: Data Preparation and Model Evaluation
Data Mining 2011 - Volinsky - Columbia University
Data Preparation
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
– Assessment of quality reflects on confidence in results
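The three kinds of dirt listed above can be counted with a very small audit. The following is only a sketch: the records, the field names (`age`, `state`), the plausible age range, and the list of valid state codes are all hypothetical and would have to come from knowledge of the actual data.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"id": 1, "age": 34, "state": "NY"},
    {"id": 2, "age": None, "state": "ny"},       # missing age, lowercase code
    {"id": 3, "age": 212, "state": "New York"},  # implausible age, spelled-out name
    {"id": 4, "age": 51, "state": "NJ"},
]

def audit(rows, valid_states=("NY", "NJ"), age_range=(0, 120)):
    """Count incomplete, noisy, and inconsistent values in a list of dicts.

    valid_states and age_range are assumptions supplied by the analyst.
    """
    report = Counter()
    for row in rows:
        age = row.get("age")
        if age is None:
            report["incomplete"] += 1        # attribute value is lacking
        elif not age_range[0] <= age <= age_range[1]:
            report["noisy"] += 1             # likely error or outlier
        if row.get("state") not in valid_states:
            report["inconsistent"] += 1      # discrepancy in codes or names
    return dict(report)

print(audit(records))
```

A count like this does not fix anything; it only indicates where to look before any mining is attempted.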
Preparing Data for Analysis
Think about your data: how is it measured, and what does it mean?
• nominal or categorical
– jersey numbers, IDs, colors, simple labels
– sometimes recoded into integers - careful!
• ordinal
– rank has meaning, but the numeric value does not necessarily
– educational attainment, military rank
• interval
– distances between numeric values have meaning
– temperature, time
• ratio
– zero value has meaning, which means that fractions and ratios are sensible
– money, age, height
It might seem obvious what kind of value a given data item is, but not always: pain index, movie ratings, etc.
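A short sketch of why the measurement type matters. All values here are made up for illustration, and the ranking in `EDU_ORDER` is an assumed ordering, not something given in the slides.

```python
from statistics import mean

# Hypothetical toy values, chosen only to illustrate the measurement types.
jersey_numbers = [7, 10, 23, 99]               # nominal labels stored as integers
education = ["HS", "BA", "MA", "PhD", "BA"]    # ordinal
EDU_ORDER = ["HS", "BA", "MA", "PhD"]          # assumed ranking, lowest to highest

# Nominal: the mean is computable, but it has no interpretation.
jersey_mean = mean(jersey_numbers)

# Ordinal: order is meaningful, so picking the middle rank is defensible.
ranks = sorted(EDU_ORDER.index(e) for e in education)
median_edu = EDU_ORDER[ranks[len(ranks) // 2]]

def c_to_f(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

# Temperature: distances are meaningful, but ratios depend on the unit,
# because the zero point is arbitrary.
ratio_c = 20.0 / 10.0
ratio_f = c_to_f(20.0) / c_to_f(10.0)

# Age: zero is a true zero, so ratios survive a change of unit.
age_ratio_years = 40 / 20
age_ratio_months = (40 * 12) / (20 * 12)

print("mean jersey number (meaningless):", jersey_mean)
print("median education level:", median_edu)
print("temperature ratio in C vs F:", ratio_c, ratio_f)
print("age ratio in years vs months:", age_ratio_years, age_ratio_months)
```

The temperature ratio changes with the unit while the age ratio does not, which is the practical difference between a value whose zero is arbitrary and one whose zero has meaning.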
Investigate your data carefully!
• Example: lapsed donors to a charity (KDD Cup 1998)
– Made their last donation to PVA 13 to 24 months prior to June 1997
– 200,000 (training and test sets)
– Who should get the current mailing?
– What is the cost-effective strategy?
– “tcode” was an important variable…
Tasks in Data Preprocessing
• Data cleaning
– Check for data quality
– Missing data
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
– Combination of reduction and transformation, but of particular importance, especially for numerical data
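Two of these tasks are easy to sketch in a few lines. The slides do not say which normalization or discretization method is intended, so min-max scaling and equal-width binning are used here purely as common examples; the `incomes` values are hypothetical.

```python
def min_max(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Discretize values into n_bins intervals of equal width.

    Returns a bin index from 0 to n_bins - 1 for each value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

incomes = [20, 35, 50, 80, 120]        # hypothetical values, in thousands
print(min_max(incomes))
print(equal_width_bins(incomes, n_bins=4))
```

Equal-width bins are sensitive to outliers, since a single extreme value stretches every interval; equal-frequency bins are a common alternative when the distribution is skewed.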
Data Cleaning / Quality
• Individual measurements
– Random noise in individual measurements
   Outliers
   Random data entry errors
   Noise in label assignment (e.g., class labels in medical data sets)
   Can be corrected or smoothed out
– Systematic errors
   E.g., all ages > 99 recorded as 99
   More individuals aged 20, 30, 40, etc. than expected
– Missing information
   Missing at random
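The two systematic errors mentioned for ages can be screened for with simple counts. This is a rough sketch on hypothetical data: the cap value of 99 comes from the example above, while the assumption that ages ending in 0 should make up about one tenth of the records is a crude baseline, not a formal test.

```python
def ceiling_share(ages, cap=99):
    """Fraction of records sitting exactly at the cap value."""
    return sum(a == cap for a in ages) / len(ages)

def heaping_ratio(ages, base=10):
    """Observed share of ages divisible by `base`, relative to the 1/base
    share expected if final digits were roughly uniform (an assumption)."""
    share = sum(a % base == 0 for a in ages) / len(ages)
    return share * base

# Hypothetical ages showing both problems: a pile-up at 99 and at round numbers.
ages = [20, 20, 30, 30, 40, 40, 23, 37, 99, 99]

print("share recorded at the cap:", ceiling_share(ages))
print("heaping ratio for multiples of 10:", heaping_ratio(ages))
```

A heaping ratio well above 1 suggests that ages were rounded or estimated at entry; a large share at the cap suggests that larger values were truncated rather than observed.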