Data-L9 - CSE572:DataMining Lecture 3: Data Preprocessing...

CSE 572: Data Mining
Lecture 3: Data Preprocessing
Read Section 2.3

Data Quality Issues
- Noise
- Outliers
- Missing values
- Duplicate data

Noise
- Noise refers to modification of original values
- Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen
- Figure: Two Sine Waves; Two Sine Waves + Noise
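The two-panel figure can be approximated with a short sketch. The frequencies, noise level, and seed are illustrative assumptions rather than values from the lecture, and the noise is modeled as additive Gaussian noise, which is only one possible choice.

```python
import math
import random

def two_sine_waves(n_points=200, freq_a=3.0, freq_b=7.0):
    """Sum of two sine waves sampled on [0, 1). Frequencies are assumed values."""
    ts = [i / n_points for i in range(n_points)]
    return [math.sin(2 * math.pi * freq_a * t) + math.sin(2 * math.pi * freq_b * t)
            for t in ts]

def add_noise(signal, sigma=0.5, seed=0):
    """Return a copy of the signal with additive Gaussian noise (assumed noise model)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]

clean = two_sine_waves()
noisy = add_noise(clean)

# Mean absolute distortion introduced by the noise.
distortion = sum(abs(a - b) for a, b in zip(clean, noisy)) / len(clean)
print(f"mean absolute distortion: {distortion:.3f}")
```

Raising `sigma` makes the original two-wave shape progressively harder to recover from `noisy`.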

Outliers
- Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

Missing Values
- Reasons for missing values
  - Information is not collected (e.g., people decline to give their age and weight)
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
- Handling missing values
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
  - Replace with all possible values (weighted by their probabilities)
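The first three handling strategies can be sketched as follows. The records, the use of `None` as the missing-value marker, and the choice of the mean as the estimate are assumptions made for illustration; the lecture does not prescribe a particular estimator.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 51, "weight": None},
    {"age": 29, "weight": 64.0},
]

def eliminate(rows):
    """Drop every data object that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def estimate(rows, attr):
    """Fill missing values of attr with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def mean_ignoring_missing(rows, attr):
    """Compute a statistic over the observed values only."""
    return mean(r[attr] for r in rows if r[attr] is not None)

print(len(eliminate(records)))
print(estimate(records, "age")[1]["age"])
print(mean_ignoring_missing(records, "weight"))
```

Eliminating objects is the simplest option but discards half of this small data set, which is why estimating or ignoring missing values is often preferred. The fourth strategy (replacing with all possible values weighted by their probabilities) is not shown here.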

Duplicate Data
- Data set may include data objects that are duplicates, or almost duplicates, of one another
- Major issue when merging data from heterogeneous sources
- Examples:
  - Person with multiple email addresses
  - Paper citations:
    - L. Breiman, L. Friedman, and P. Stone, (1984). Classication and Regression. Wadsworth, Belmont, CA.
    - Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984.
- Deduplication: process of dealing with duplicate data issues
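A minimal sketch of deduplication for the "person with multiple email addresses" case. The records and the matching rule (same name after lowercasing and collapsing whitespace) are assumptions for illustration, not a method given in the lecture.

```python
from collections import defaultdict

# Hypothetical records merged from two sources.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace ", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def deduplicate(rows):
    """Merge rows that share a normalized name, keeping every distinct email."""
    merged = defaultdict(set)
    for row in rows:
        merged[normalize(row["name"])].add(row["email"])
    return {name: sorted(emails) for name, emails in merged.items()}

people = deduplicate(records)
for name, emails in people.items():
    print(name, emails)
```

Exact matching on a normalized key would not catch the two citations above, where author names are abbreviated and the title differs; near-duplicates of that kind generally need some form of approximate matching.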

Data Preprocessing
- Aggregation
- Sampling
- Feature subset selection
- Discretization and Binarization
- Feature creation (see Textbook)
- Attribute Transformation (see Textbook)
- Dimensionality Reduction

Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object)
- Purpose
  - Data reduction: reduce the number of attributes or objects
  - Change of scale: cities aggregated into regions, states, countries, etc.
  - More “stable” data: aggregated data tends to have less variability

Aggregation
- Variation of Precipitation in Australia
- Figure: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation
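The claim that aggregated data tends to have less variability can be checked on synthetic data. The numbers below are randomly generated assumptions, not the Australian precipitation data behind the figure, and the sketch compares variability over time rather than across locations.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)

# Synthetic monthly precipitation for 30 years (assumed values, not real data).
years = [[max(0.0, rng.gauss(50.0, 20.0)) for _ in range(12)] for _ in range(30)]

monthly_values = [m for year in years for m in year]
yearly_averages = [mean(year) for year in years]  # aggregate 12 months into 1 value

monthly_sd = pstdev(monthly_values)
yearly_sd = pstdev(yearly_averages)
print(f"std of monthly values:  {monthly_sd:.2f}")
print(f"std of yearly averages: {yearly_sd:.2f}")
```

Averaging twelve months into one yearly value smooths out month-to-month fluctuations, so the yearly series has the smaller standard deviation.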
Background image of page 9

Sampling
- Sampling is the main technique employed for data selection
  - It is often used for both the preliminary investigation of the data and the final data analysis
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
- Sampling is used in data mining because it is too expensive or time consuming to process all the data

Sampling …
- The key principle for effective sampling is to find a representative sample
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
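One way to check representativeness is to compare the property of interest in the sample with the same property in the full data set. In this sketch the population, the property (the fraction of items labeled 1), and the sample sizes are all assumptions chosen for illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population: 30% of the items have the property of interest.
population = [1] * 3000 + [0] * 7000

def proportion(items):
    """Fraction of items that have the property of interest."""
    return sum(items) / len(items)

print("population:", proportion(population))
for size in (10, 100, 1000):
    sample = rng.sample(population, size)
    print(size, round(proportion(sample), 3))
```

Larger samples tend to track the population proportion more closely, at the cost of processing more data.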
Background image of page 11

Types of Sampling
- Simple Random Sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - As each item is selected, it is removed from the population
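Simple random sampling without replacement can be written out step by step. The function name and the `seed` parameter are hypothetical; in practice the standard library's `random.sample` gives the same kind of result.

```python
import random

def sample_without_replacement(population, k, seed=None):
    """Draw k items uniformly at random, removing each one once selected."""
    if k > len(population):
        raise ValueError("cannot draw more items than the population holds")
    rng = random.Random(seed)
    pool = list(population)  # copy so the caller's data is untouched
    chosen = []
    for _ in range(k):
        index = rng.randrange(len(pool))  # every remaining item is equally likely
        chosen.append(pool.pop(index))    # remove it from the population
    return chosen

items = list(range(10))
picked = sample_without_replacement(items, 4, seed=1)
print(picked)
```

Because each selected item is removed from the pool, no item can appear twice in the result.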

This note was uploaded on 04/08/2010 for the course CS 420 taught by Professor Dawsonengler during the Spring '02 term at San Jose State University.
