class03

class03 - DataMining Chapter2 JiaweiHan...

August 29, 2011
Data Mining: Concepts and Techniques
Chapter 2
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved

Chapter 2: Data Preprocessing
  - Why preprocess the data?
  - Descriptive data summarization
  - Data cleaning
  - Data integration and transformation
  - Data reduction
  - Discretization and concept hierarchy generation
  - Summary

Why Data Preprocessing?
  - Data in the real world is dirty
    - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      - e.g., occupation=“ ”
    - noisy: containing errors or outliers
      - e.g., Salary=“-10”
    - inconsistent: containing discrepancies in codes or names
      - e.g., Age=“42” Birthday=“03/07/1997”
      - e.g., Was rating “1,2,3”, now rating “A, B, C”
      - e.g., discrepancy between duplicate records
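
The three kinds of dirty data above can be illustrated with a small check on a single record. This is only a sketch under stated assumptions: the record layout, the function name `find_quality_issues`, and the rule that a negative salary counts as noise are all hypothetical, and the slide's birthday "03/07/1997" is read here as March 7, 1997.

```python
from datetime import date

def find_quality_issues(record, today):
    """Return a list of issue labels for one record (a plain dict).

    Hypothetical helper for illustration; the field names are assumptions.
    """
    issues = []
    # Incomplete: an attribute of interest is missing or blank.
    if not (record.get("occupation") or "").strip():
        issues.append("incomplete: occupation")
    # Noisy: a value outside its plausible range, e.g. a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary")
    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    birthday = record.get("birthday")
    age = record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append("inconsistent: age vs. birthday")
    return issues

# The slide's example values combined into one made-up record.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
issues = find_quality_issues(record, today=date(2011, 8, 29))
print(issues)
```

With the slide date as the reference day, this record is flagged on all three counts, since a person born in 1997 cannot be 42 in 2011.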

Why Is Data Dirty?
  - Incomplete data may come from
    - “Not applicable” data value when collected
    - Different considerations between the time when the data was collected and when it is analyzed
    - Human/hardware/software problems
  - Noisy data (incorrect values) may come from
    - Faulty data collection instruments
    - Human or computer error at data entry
    - Errors in data transmission
  - Inconsistent data may come from
    - Different data sources
    - Functional dependency violation (e.g., modify some linked data)
  - Duplicate records also need data cleaning

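
A functional dependency violation of the kind listed above can be detected by grouping on the determining attribute. The sketch below assumes a hypothetical dependency dept_id → dept_name over made-up rows; the function name `fd_violations` is also an assumption, not something from the slides.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of rhs values} wherever lhs -> rhs fails.

    The dependency holds only if each lhs value maps to exactly one rhs value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Made-up rows: the name for department 20 was changed in one record only.
rows = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "Marketing"},
    {"dept_id": 20, "dept_name": "Mktg"},
]
violations = fd_violations(rows, "dept_id", "dept_name")
print(violations)
```

Department 10 is consistent, while department 20 maps to two names and is reported as a violation.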
Why Is Data Preprocessing Important?
  - No quality data, no quality mining results!
    - Quality decisions must be based on quality data
      - e.g., duplicate or missing data may cause incorrect or even misleading statistics
    - A data warehouse needs consistent integration of quality data
  - Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
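
A small worked example may make the point about misleading statistics concrete. The rows below are made up, and treating a missing amount as 0 is just one assumed form of careless handling, chosen for illustration.

```python
from statistics import mean

# Hypothetical (customer_id, purchase_amount) rows; None marks a missing amount.
rows = [(1, 100.0), (2, 200.0), (2, 200.0), (3, None), (4, 300.0)]

# Naive: keep the duplicate of customer 2 and count the missing amount as 0.
naive_mean = mean(amount if amount is not None else 0.0 for _, amount in rows)

# Cleaned: drop exact duplicate rows (keeping first occurrences), then skip missing amounts.
unique_rows = list(dict.fromkeys(rows))
clean_mean = mean(amount for _, amount in unique_rows if amount is not None)

print(naive_mean, clean_mean)  # 160.0 200.0
```

The same data yields an average purchase of 160 before cleaning and 200 after, so a decision based on the uncleaned figure would rest on a distorted statistic.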

Multi-Dimensional Measure of Data Quality
  - A well-accepted multidimensional view:
    - Accuracy
    - Completeness
    - Consistency
    - Timeliness
    - Believability
    - Value added
    - Interpretability
    - Accessibility
  - Broad categories:
    - Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
  - Data cleaning
    - Fill in missing values, smooth noisy data, identify or remove


This note was uploaded on 08/29/2011 for the course CAP 4770 taught by Professor Staff during the Fall '08 term at FIU.
