Topic3-DMConcepts

Chapter 3 Data Mining Concepts: Data Preparation and Model Evaluation
Data Mining 2011 - Volinsky - Columbia University

Data Preparation
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – Data warehouse needs consistent integration of quality data
  – Assessment of quality reflects on confidence in results
Preparing Data for Analysis
• Think about your data: how is it measured, what does it mean?
  – nominal or categorical: jersey numbers, ids, colors, simple labels
    – sometimes recoded into integers - careful!
  – ordinal: rank has meaning, the numeric value not necessarily
    – educational attainment, military rank
  – interval valued: distances between numeric values have meaning
    – temperature, time
  – ratio: zero value has meaning, which means that fractions and ratios are sensible
    – money, age, height
• It might seem obvious what kind of value a given data value is, but not always
  – pain index, movie ratings, etc.
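The warning about recoding labels into integers can be illustrated with a minimal sketch. The category lists and helper names below (`EDUCATION_ORDER`, `encode_ordinal`, `one_hot`) are hypothetical and chosen only for illustration; they are not from the slides.

```python
# Hypothetical example values; names are illustrative assumptions only.
EDUCATION_ORDER = ["high school", "bachelor", "master", "doctorate"]  # ordinal
COLORS = ["red", "green", "blue"]                                     # nominal


def encode_ordinal(value, order):
    """Map an ordinal label to its rank (0 = lowest).

    Only the ordering of the ranks is meaningful, not their differences.
    """
    return order.index(value)


def one_hot(value, categories):
    """Encode a nominal label as indicator variables so no order is implied."""
    return [1 if value == c else 0 for c in categories]


print(one_hot("green", COLORS))  # [0, 1, 0]

# Ordinal: comparisons make sense, arithmetic on the ranks does not.
print(encode_ordinal("master", EDUCATION_ORDER)
      > encode_ordinal("bachelor", EDUCATION_ORDER))  # True

# Interval vs. ratio: 20 degrees C is not "twice as hot" as 10 degrees C.
# On the Kelvin scale (true zero, hence ratio) the ratio is close to 1.
print((20 + 273.15) / (10 + 273.15))
```

Coding nominal values as 0, 1, 2 would let a model treat "blue" as greater than "red", which is why the sketch uses indicator variables for the nominal case.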

Investigate your data carefully!
• Example: lapsed donors to a charity (KDD Cup 1998)
  – Made their last donation to PVA 13 to 24 months prior to June 1997
  – 200,000 (training and test sets)
  – Who should get the current mailing?
  – What is the cost effective strategy?
  – “tcode” was an important variable…

Tasks in Data Preprocessing
• Data cleaning
  – Check for data quality
  – Missing data
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
  – Combination of reduction and transformation but with particular importance, especially for numerical data
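As a rough sketch of what normalization and discretization can look like in practice, the example below uses min-max scaling, z-scores, and equal-width binning. These are common choices picked here as assumptions; the slides do not say which methods the course uses. The function names and the `ages` sample are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, k):
    """Discretize into k equal-width bins, returning a bin index 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall into bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [18, 25, 31, 47, 52, 66, 78]  # made-up sample
print(min_max(ages))
print(z_score(ages))
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 1, 2, 2]
```

Equal-width bins are simple but sensitive to outliers; equal-frequency (quantile) bins are a frequent alternative when the distribution is skewed.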

Data Cleaning / Quality
• Individual measurements
  – Random noise in individual measurements
    – Outliers
    – Random data entry errors
    – Noise in label assignment (e.g., class labels in medical data sets)
    – can be corrected or smoothed out
  – Systematic errors
    – E.g., all ages > 99 recorded as 99
    – More individuals aged 20, 30, 40, etc. than expected
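The two systematic errors named above (top-coding at 99 and heaping on round ages) can be screened for with simple counts. The sketch below is one possible check, not a method from the slides; the function names and the `ages` sample are hypothetical, and the heaping check assumes final digits would be roughly uniform in clean data.

```python
def top_code_share(ages, cap=99):
    """Fraction of records exactly at the cap.

    A spike at the cap suggests larger values were recorded as the cap.
    """
    return sum(1 for a in ages if a == cap) / len(ages)


def heaping_ratio(ages, base=10):
    """Observed share of ages divisible by `base`, relative to the 1/base
    share expected if final digits were uniform (a simplifying assumption).

    Values well above 1 suggest ages were rounded to multiples of `base`.
    """
    share = sum(1 for a in ages if a % base == 0) / len(ages)
    return share * base


# Made-up sample with both problems present.
ages = [20, 20, 23, 30, 30, 30, 37, 40, 40, 41, 50, 58, 99, 99, 99, 99]

print(top_code_share(ages))  # 0.25
print(heaping_ratio(ages))   # 5.0
```

Neither number proves an error on its own. A high value is a prompt to look at how the field was collected before deciding whether to correct, flag, or drop the affected records.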