Lecture03 - Data Mining: Principles and Algorithms Jianyong...

Data Mining: Principles and Algorithms
October 14, 2009

Jianyong Wang
Database Lab, Institute of Software
Department of Computer Science and Technology
Tsinghua University
jianyong@tsinghua.edu.cn

Chapter 2: Data Preprocessing

- What is data?
- Why preprocess the data?
- Data summarization
- Data cleaning <==
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Data Cleaning

Importance
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing" (DCI survey)
- Similar situation in data mining

Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data

Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
- Equipment malfunction
- Values that were inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data not being considered important at the time of entry
- History or changes of the data not being registered

Missing data may need to be inferred.

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification).

Fill in the missing value manually: tedious, and possibly infeasible.

Fill it in automatically with
- A global constant, e.g., "unknown" (which may in effect become a new class)
- The attribute mean
- The attribute mean for all samples belonging to the same class: smarter
- The most probable value: inference-based, such as a Bayesian formula or a decision tree
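
The automatic fill-in strategies listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the toy `rows` data, the attribute names `income` and `label`, and the helper names are all hypothetical. It covers ignoring unlabeled tuples, a global constant, the attribute mean, and the class-conditional mean. The inference-based "most probable value" option is not shown.

```python
from collections import defaultdict
from statistics import mean

# Toy sales tuples (made-up values); None marks a missing entry.
rows = [
    {"income": 30.0, "label": "low"},
    {"income": None, "label": "low"},
    {"income": 50.0, "label": "low"},
    {"income": 90.0, "label": "high"},
    {"income": None, "label": "high"},
    {"income": 110.0, "label": "high"},
    {"income": 70.0, "label": None},
]


def drop_unlabeled(rows, label="label"):
    """Ignore tuples whose class label is missing."""
    return [r for r in rows if r[label] is not None]


def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label="label"):
    """Replace missing values of attr with the mean over the same class.

    Assumes every class has at least one observed value of attr;
    otherwise the lookup below raises KeyError.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


labeled = drop_unlabeled(rows)
print(fill_mean(labeled, "income")[1]["income"])        # 70.0
print(fill_class_mean(labeled, "income")[1]["income"])  # 40.0
```

On this toy data the overall mean (70.0) sits between the two classes, while the class-conditional mean (40.0 for "low", 100.0 for "high") keeps the filled value close to its own class, which is the sense in which the slide calls that option smarter.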

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
- Faulty data collection instruments
- Data entry problems (human factors)
- Data transmission problems

Data quality problems are pervasive in large databases, e.g.,
- 2% of records in customer files become obsolete in one month (deaths, name changes, etc.), according to the "Data Warehousing Institute report 2002"
- Pricing anomalies: UA tickets selling for $5, 32GB of memory selling for $1.99 at amazon.com, a house in Detroit selling for $1 in 2008

Massive financial impact
- $611B/year loss in the US due to poor customer data
- $2.5B/year loss due to incorrect prices in retail DBs

Source: http://www.dmreview.com/article_sub.cfm?articleId=2073

How to Handle Noisy Data?