# Lecture03 - Data Mining: Principles and Algorithms Jianyong...

October 14, 2009

Data Mining: Principles and Algorithms

Jianyong Wang, Database Lab, Institute of Software, Department of Computer Science and Technology, Tsinghua University (jianyong@tsinghua.edu.cn)

## Chapter 2: Data Preprocessing

- What is data?
- Why preprocess the data?
- Data summarization
- **Data cleaning** (current section)
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

## Data Cleaning

- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
  - Similar situation in data mining
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

## Missing Data

- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - Equipment malfunction
  - Values that were inconsistent with other recorded data and thus deleted
  - Data not entered due to misunderstanding
  - Certain data not being considered important at the time of entry
  - History or changes of the data not being registered
- Missing data may need to be inferred.

## How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification)
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with
  - A global constant, e.g., “unknown” (which may effectively act as a new class)
  - The attribute mean
  - The attribute mean for all samples belonging to the same class: smarter
  - The most probable value: inference-based, such as a Bayesian formula or a decision tree
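The first three automatic fill-in strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the records, the class labels, and the function names `fill_constant`, `fill_attribute_mean`, and `fill_class_mean` are all hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
records = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 40.0),
]


def fill_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if v is None else v) for label, v in rows]


def fill_attribute_mean(rows):
    """Replace missing values with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]


def fill_class_mean(rows):
    """Replace missing values with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = defaultdict(list)
    for label, v in rows:
        if v is not None:
            by_class[label].append(v)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [(label, class_means[label] if v is None else v) for label, v in rows]


print(fill_attribute_mean(records)[1])  # ('low_risk', 45.0)
print(fill_class_mean(records)[1])      # ('low_risk', 60.0)
print(fill_class_mean(records)[4])      # ('high_risk', 30.0)
```

On this toy data the overall mean (45.0) sits between the two classes, while the class-conditional means (60.0 and 30.0) stay closer to each group's observed values, which is the sense in which the per-class variant is "smarter".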

## Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - Faulty data collection instruments
  - Data entry problems (human factors)
  - Data transmission problems
- Data quality problems are pervasive in large databases, e.g.,
  - 2% of records in customer files become obsolete in one month (deaths, name changes, etc.), according to the “Data Warehousing Institute report 2002”
  - Pricing anomalies: UA tickets selling for \$5, 32GB of memory selling for \$1.99 at amazon.com, a house in Detroit selling for \$1 in 2008
- Massive financial impact
  - \$611B/year loss in the US due to poor customer data
  - \$2.5B/year loss due to incorrect prices in retail DBs

## How to Handle Noisy Data?
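As one illustrative possibility for the outlier-identification task listed under Data Cleaning, and not necessarily the approach the lecture goes on to describe, pricing anomalies such as the \$1.99 memory listing could be flagged by comparing each price with the median for comparable items. The function name `flag_price_anomalies`, the `max_ratio` threshold of 10, and the sample prices are all assumptions made for this sketch.

```python
from statistics import median


def flag_price_anomalies(prices, max_ratio=10.0):
    """Return indices of prices that differ from the median by more than max_ratio times.

    A simple rule of thumb for illustration only; the threshold is an assumption.
    """
    typical = median(prices)
    if typical <= 0:
        raise ValueError("median price must be positive")
    flagged = []
    for i, price in enumerate(prices):
        # Non-positive prices are always suspicious; otherwise compare the ratio
        # to the median in both directions (far too cheap or far too expensive).
        if price <= 0 or typical / price > max_ratio or price / typical > max_ratio:
            flagged.append(i)
    return flagged


# Hypothetical listing prices for the same product, in dollars.
prices = [89.99, 92.50, 1.99, 95.00, 90.25]
print(flag_price_anomalies(prices))  # [2]
```

The median is used rather than the mean so that the anomalous value itself has little influence on the reference price it is compared against.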
