L12preprocessing

Preprocessing Lecture Notes (cse352) Professor Anita Wasilewska

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

TYPES OF DATA (1)
• Generally we distinguish:
  – Quantitative Data
  – Qualitative Data
• Bivaluated: often very useful
• Remember: Null Values are not applicable
• Missing data usually not acceptable

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names

Why Data Preprocessing?
• No quality data, no quality results!
  – Quality decisions must be based on quality data
  – Data warehouse needs consistent integration of quality data

Data Quality
• A well-accepted multidimensional view of data quality:
  – Accuracy
  – Completeness
  – Consistency
  – Timeliness
  – Believability
  – Interpretability
  – Accessibility

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration (if needed)
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation

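The slide names normalization as a transformation step without saying which form is meant. As a minimal sketch, assuming min-max scaling (one common choice, not necessarily the one used later in these notes), a numeric attribute can be rescaled into a target range like this. The function name and the sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max.

    Assumption: min-max scaling is used here only as an illustration of
    "normalization"; the slides do not specify a method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical attribute values (e.g. income in thousands).
incomes = [20.0, 40.0, 60.0, 100.0]
normalized = min_max_normalize(incomes)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, attributes measured in very different units fall in the same range, which keeps any single attribute from dominating distance-based methods.
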
Major Tasks in Data Preprocessing
• Data reduction
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

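To make discretization of numerical data concrete, here is a small sketch that assumes equal-width binning, which is only one of several possible discretization schemes and is not named on this slide. The function name and the sample ages are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value the index (0..k-1) of one of k equal-width intervals.

    Assumption: equal-width binning is chosen purely for illustration.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical numeric attribute.
ages = [18, 22, 35, 41, 58, 63, 78]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 2, 2, 2]
```

Seven distinct ages are replaced by three interval labels, which is the sense in which discretization is a form of data reduction.
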
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data

Missing Data
• Data is not always available
• Missing data may be due to
  – equipment malfunction
  – data inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not considered important at the time of entry
  – history or changes of the data not registered
• Missing data may need to be inferred.

How to Handle Missing Data?
(1) Ignore the tuple (record): usually done when the class label (a value of the classification attribute) is missing (assuming the task is classification). It is not effective when the percentage of missing values per attribute varies considerably.
(2) Fill in the missing value manually: tedious and often infeasible.
(3) Fill in the missing value automatically (methods to follow).

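Option (1) can be sketched as follows. This is a minimal illustration, assuming records are stored as dictionaries with `None` marking a missing value; the attribute names, the `label` class attribute, and both helper functions are hypothetical.

```python
# Hypothetical records: None marks a missing value; "label" is the class attribute.
records = [
    {"age": 25, "income": 40000, "label": "yes"},
    {"age": None, "income": 52000, "label": "no"},
    {"age": 31, "income": None, "label": None},
    {"age": 47, "income": 61000, "label": "yes"},
]


def drop_unlabeled(rows, class_attr="label"):
    """Option (1): keep only the tuples whose class label is present."""
    return [row for row in rows if row.get(class_attr) is not None]


def missing_rate(rows, attr):
    """Fraction of tuples in which the given attribute is missing."""
    return sum(row.get(attr) is None for row in rows) / len(rows)


labeled = drop_unlabeled(records)
print(len(labeled))  # 3

# Per-attribute missing rates help judge whether ignoring tuples is reasonable.
for attr in ("age", "income", "label"):
    print(attr, missing_rate(records, attr))
```

Checking the per-attribute missing rate first reflects the caveat in (1): when some attributes are missing far more often than others, dropping whole tuples can discard a lot of otherwise usable data.
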
Fill in Missing Data
(1) Use a global constant to fill in the missing value (not efficient and often incorrect)
(2) Use the attribute's mean value to fill in the missing value
(3) Use the attribute's mean value over all samples belonging to the same class (in the case of classification) to fill in the missing value: smarter than (2)
(4) Use the most probable value to fill in the missing value

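The four strategies above can be compared on a toy dataset. This is a sketch under stated assumptions: the records, attribute names, and function names are hypothetical, and for (4) the most frequent observed value is used as a simple stand-in for "most probable value". The slide does not say how that value is determined, and a model-based estimate would be another option.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: None marks a missing "income" value.
rows = [
    {"income": 40.0, "label": "yes"},
    {"income": None, "label": "yes"},
    {"income": 60.0, "label": "yes"},
    {"income": 20.0, "label": "no"},
    {"income": None, "label": "no"},
    {"income": 20.0, "label": "no"},
]


def fill_with(rows, attr, value_for):
    """Return copies of rows, replacing a missing attr with value_for(row)."""
    return [{**r, attr: value_for(r)} if r[attr] is None else dict(r) for r in rows]


def fill_constant(rows, attr, constant):
    """(1) Global constant."""
    return fill_with(rows, attr, lambda r: constant)


def fill_mean(rows, attr):
    """(2) Mean of the observed values of the attribute."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return fill_with(rows, attr, lambda r: m)


def fill_class_mean(rows, attr, class_attr):
    """(3) Mean of the observed values within the same class.

    Assumes every class has at least one observed value for attr.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    means = {c: mean(vals) for c, vals in by_class.items()}
    return fill_with(rows, attr, lambda r: means[r[class_attr]])


def fill_most_frequent(rows, attr):
    """(4) Stand-in for 'most probable value': the most frequent observed value."""
    mode = Counter(r[attr] for r in rows if r[attr] is not None).most_common(1)[0][0]
    return fill_with(rows, attr, lambda r: mode)


print(fill_constant(rows, "income", -1.0)[1]["income"])        # -1.0
print(fill_mean(rows, "income")[1]["income"])                  # 35.0
print(fill_class_mean(rows, "income", "label")[1]["income"])   # 50.0
print(fill_class_mean(rows, "income", "label")[4]["income"])   # 20.0
print(fill_most_frequent(rows, "income")[1]["income"])         # 20.0
```

On this toy data the global mean (35.0) sits between the two classes and matches neither, while the class-conditional mean gives 50.0 for "yes" and 20.0 for "no", which illustrates why (3) is described as smarter than (2).
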