03Preprocessing - Data Mining Concepts and Techniques(3rd...

Data Mining: Concepts and Techniques (3rd ed.), Chapter 3
Jiawei Han, Micheline Kamber, and Jian Pei
Simon Fraser University
09/16/09
©2009 Han, . All rights reserved.

Chapter 3: Data Preprocessing
- Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary

Data Quality: Multi-Dimensional Measure
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation

Data Cleaning
Data in the real world is dirty:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., Occupation = “ ” (missing data)
- Noisy: containing noise, errors, or outliers
  - e.g., Salary = “−10” (an error)
- Inconsistent: containing discrepancies in codes or names, e.g.,
  - Age = “42”, Birthday = “03/07/1997”
  - Was rating “1, 2, 3”, now rating “A, B, C”
  - Discrepancy between duplicate records
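The three kinds of dirtiness above can be illustrated with a small check over a single record. This is only a sketch under stated assumptions: the helper names `age_on` and `find_problems`, the MM/DD/YYYY reading of the birthday, and the reference date are hypothetical choices made for illustration, not part of the slides.

```python
from datetime import date, datetime


def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    years = today.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_problems(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    problems = []

    # Incomplete: the attribute value is empty or only whitespace.
    if not record.get("Occupation", "").strip():
        problems.append("incomplete: Occupation is missing")

    # Noisy: a salary cannot be negative, so treat it as an error.
    if record["Salary"] < 0:
        problems.append("noisy: Salary is negative")

    # Inconsistent: the stated age disagrees with the recorded birthday.
    # Assumption: the birthday is written as MM/DD/YYYY.
    birthday = datetime.strptime(record["Birthday"], "%m/%d/%Y").date()
    if age_on(birthday, today) != record["Age"]:
        problems.append("inconsistent: Age does not match Birthday")

    return problems


# The record combines the three examples from the slide.
record = {"Occupation": " ", "Salary": -10, "Age": 42, "Birthday": "03/07/1997"}
problems = find_problems(record, today=date(2009, 9, 16))
for problem in problems:
    print(problem)
```

Real cleaning rules depend on the data set; the thresholds and attribute names here only mirror the slide's examples.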

Incomplete (Missing) Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred
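Before deciding how to treat missing values, it can help to measure how much is missing per attribute. The following is a minimal sketch, assuming missing values appear as `None` or blank strings; the helper names `is_missing` and `missing_fraction` and the sample sales rows are hypothetical.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_fraction(rows, attributes):
    """Return {attribute: fraction of rows with no recorded value}."""
    if not rows:
        return {attr: 0.0 for attr in attributes}
    result = {}
    for attr in attributes:
        missing = sum(1 for row in rows if is_missing(row.get(attr)))
        result[attr] = missing / len(rows)
    return result


# Hypothetical sales data in which customer income is often unrecorded.
sales = [
    {"customer": "A", "income": 52000},
    {"customer": "B", "income": None},
    {"customer": "C", "income": ""},
    {"customer": "D", "income": 61000},
]
fractions = missing_fraction(sales, ["customer", "income"])
print(fractions)
```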
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
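The first option above, ignoring tuples whose class label is missing, might look like the following minimal sketch. The attribute name `class`, the helper `drop_unlabeled`, and the sample rows are hypothetical, and `None` or an empty string is assumed to mean "missing".

```python
def drop_unlabeled(rows, label="class"):
    """Keep only the tuples whose class label is present (hypothetical helper)."""
    return [row for row in rows if row.get(label) not in (None, "")]


# Hypothetical training tuples for a classification task.
rows = [
    {"income": 52000, "class": "buys"},
    {"income": 48000, "class": None},   # class label missing: tuple is ignored
    {"income": None, "class": "no"},    # other attributes may still be missing
]
labeled = drop_unlabeled(rows)
print(len(labeled), "of", len(rows), "tuples kept")
```

Note that the third tuple is kept even though its income is missing: dropping on the class label alone does nothing about gaps in the other attributes, which is why this option works poorly when the share of missing values varies widely between attributes.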