lecture3 - Data Mining CS57300 Purdue University September...

Data Mining CS57300 Purdue University September 2, 2010
Data quality
• Examples of data quality problems:
  • Noise
  • Outliers
  • Missing values
  • Duplicate data
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
Noise
• Noise refers to measurement error in data values
• Could be random error or systematic error
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
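A minimal sketch of the two kinds of error, loosely modeled on the "Two Sine Waves" figure. The frequencies, the noise level `sigma`, and the constant `bias` are illustrative assumptions, not values taken from the slide.

```python
import math
import random

def two_sine_waves(t):
    """Sum of two sine waves; the 1 Hz and 3 Hz frequencies are arbitrary choices."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)

def add_noise(values, sigma, bias=0.0, seed=0):
    """Add Gaussian random error (std. dev. sigma) plus a constant systematic bias."""
    rng = random.Random(seed)
    return [v + bias + rng.gauss(0.0, sigma) for v in values]

ts = [i / 100 for i in range(100)]
clean = [two_sine_waves(t) for t in ts]
noisy = add_noise(clean, sigma=0.3)              # random error only
shifted = add_noise(clean, sigma=0.0, bias=0.5)  # systematic error only
```

Random error scatters values around the true signal, while systematic error shifts every value in the same direction.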
Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
• Could indicate “interesting” cases, or could indicate errors in the data
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
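One common way to flag candidate outliers in a single numeric attribute is the interquartile-range (IQR) rule. The slide does not name a detection method, so this is only an illustrative sketch; the multiplier `k=1.5` and the `ages` data are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [23, 25, 27, 29, 31, 33, 35, 120]
print(iqr_outliers(ages))
```

A flagged value such as 120 still needs a human decision: it may be a data-entry error or a genuinely interesting case.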
Missing values
• Reasons for missing values
  • Information is not collected (e.g., people decline to give their age)
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Ways to handle missing values
  • Eliminate entities with missing values
  • Estimate attributes with missing values
  • Ignore the missing values during analysis
  • Replace with all possible values (weighted by their probabilities)
  • Impute missing values
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
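Three of the handling strategies can be sketched in plain Python. The `records` data, the use of `None` as the missing-value marker, and mean imputation as the estimator are assumptions made for illustration.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"name": "a", "age": 34, "income": 52000},
    {"name": "b", "age": None, "income": 48000},
    {"name": "c", "age": 29, "income": None},
    {"name": "d", "age": 41, "income": 61000},
]

def drop_incomplete(rows):
    """Eliminate entities that have any missing attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]

def mean_ignoring_missing(rows, attr):
    """Ignore missing values during analysis: average only the observed ones."""
    return statistics.mean(r[attr] for r in rows if r[attr] is not None)

def impute_mean(rows, attr):
    """Estimate a missing attribute with the mean of its observed values."""
    fill = mean_ignoring_missing(rows, attr)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

complete = drop_incomplete(records)
imputed = impute_mean(records, "age")
```

Dropping rows is simple but discards data; imputation keeps every entity at the cost of introducing estimated values.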
Duplicate data
• Data set may include data entities that are duplicates, or almost duplicates, of one another
  • Major issue when merging data from heterogeneous sources
• Example: same person with multiple email addresses
• Data cleaning
  • Finding and dealing with duplicate entities
  • Finding and correcting measurement error
  • Dealing with missing values
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
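A minimal sketch of the "same person with multiple email addresses" case. It assumes that records with the same normalized name refer to the same person, which is a strong simplification; the `normalize` key and the sample rows are hypothetical.

```python
from collections import defaultdict

def normalize(name):
    """Crude matching key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def merge_duplicates(rows):
    """Group (name, email) rows by normalized name, collecting distinct emails."""
    merged = defaultdict(set)
    for name, email in rows:
        merged[normalize(name)].add(email)
    return {key: sorted(emails) for key, emails in merged.items()}

rows = [
    ("Ada Lovelace", "ada@example.com"),
    ("ada  lovelace", "alovelace@example.org"),  # near duplicate
    ("Ada Lovelace", "ada@example.com"),         # exact duplicate
    ("Alan Turing", "alan@example.com"),
]
print(merge_duplicates(rows))
```

Real entity resolution usually needs fuzzier matching than an exact key, since two different people can share a name and one person can spell a name several ways.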
Other data preprocessing methods
• Sampling
• Dimensionality reduction
• Feature construction and selection
• Attribute transformation
• Examples: discretization, distance calculations
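As one example of discretization, a numeric attribute can be cut into equal-width bins. The slide does not say which discretization scheme the lecture uses, so equal-width binning and the `ages` data are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 17, 25, 38, 52, 64, 80]
print(equal_width_bins(ages, 4))
```

Equal-width bins are easy to compute but can be dominated by outliers; equal-frequency bins are a common alternative.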
Representing data in Euclidean space
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Many data mining techniques then use similarity/dissimilarity measures to characterize relationships between the instances
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
Distance measures
• Many data mining techniques utilize similarity/dissimilarity measures to characterize relationships between instances
  • Nearest-neighbor classification
  • Cluster analysis
• Proximity: general term to indicate similarity and dissimilarity
• Distance: dissimilarity only
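To illustrate how a distance measure drives nearest-neighbor classification, here is a 1-nearest-neighbor sketch. It assumes Euclidean distance (via the standard library's `math.dist`) and a small hypothetical training set.

```python
import math

def nearest_neighbor_label(query, training):
    """Return the label of the training point closest to query.

    training is a list of (point, label) pairs; distance is Euclidean.
    """
    _, label = min(training, key=lambda pair: math.dist(query, pair[0]))
    return label

training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((6.0, 5.0), "B"),
    ((7.0, 7.0), "B"),
]
print(nearest_neighbor_label((2.0, 1.5), training))
print(nearest_neighbor_label((6.5, 6.0), training))
```

Swapping in a different distance function changes which training point counts as "nearest", and therefore can change the prediction.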
Metric properties
• A metric d(i,j) is a dissimilarity measure that satisfies the following properties:
  • d(i,j) ≥ 0 for all i,j, and d(i,j) = 0 iff i = j
  • d(i,j) = d(j,i) for all i,j (symmetry)
  • d(i,j) ≤ d(i,k) + d(k,j) for all i,j,k (triangle inequality)
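The three properties can be spot-checked numerically on a finite set of points. Such a check can only refute a candidate, never prove that it is a metric. The helper name `is_metric_on`, the tolerance, and the sample points are assumptions for illustration.

```python
import itertools
import math

def is_metric_on(points, d, tol=1e-12):
    """Check the metric properties of d on every pair and triple of points."""
    for i, j in itertools.product(points, repeat=2):
        # Non-negativity, and d(i,j) = 0 exactly when i = j.
        if d(i, j) < 0 or (d(i, j) == 0) != (i == j):
            return False
        # Symmetry.
        if abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in itertools.product(points, repeat=3):
        # Triangle inequality.
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True

def squared_euclidean(x, y):
    """A dissimilarity measure that is not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(is_metric_on(points, math.dist))
print(is_metric_on(points, squared_euclidean))
```

Squared Euclidean distance fails here because d((0,0),(2,0)) = 4 exceeds d((0,0),(1,0)) + d((1,0),(2,0)) = 2, violating the triangle inequality.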
Distance metrics
• Manhattan distance (L1): $d_M(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
• Euclidean distance (L2): $d_E(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
  • Most common metric
  • Assumes variables are commensurate
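Both distances translate directly into code. This is a plain-Python sketch that assumes `x` and `y` are equal-length sequences of numbers; the sample vectors are made up.

```python
import math

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
print(manhattan(x, y))  # 3 + 4 + 0 = 7.0
print(euclidean(x, y))  # sqrt(9 + 16 + 0) = 5.0
```

Because both sums treat every coordinate equally, an attribute measured on a large scale dominates the result, which is what "assumes variables are commensurate" warns about.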
• Normalization
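The content of this slide is not visible in the preview, so the following is only a sketch of two widely used normalization schemes, min-max scaling and z-score standardization; the lecture may present something different. The `ages` and `incomes` columns are hypothetical.

```python
import statistics

def min_max(column):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Center on the mean and divide by the population standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

ages = [20, 30, 40, 50]
incomes = [20000, 40000, 60000, 80000]
print(min_max(ages))
print(min_max(incomes))
```

After rescaling, both attributes span the same range, so neither dominates a Euclidean or Manhattan distance simply because of its units.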

This note was uploaded on 03/13/2012 for the course CS 573 taught by Professor Staff during the Fall '08 term at Purdue University-West Lafayette.
