on, data often
needs to be extracted from different databases and joined, and perhaps sampled.
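
As a rough illustration of joining and sampling extracted data, the sketch below inner-joins two small in-memory tables and draws a random sample. The table names, field names, and the helper join_on are hypothetical and stand in for whatever the source databases actually contain.

```python
import random

# Hypothetical extracts from two different databases (illustrative data only).
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 1, "total": 30.0}, {"id": 3, "total": 12.5}, {"id": 4, "total": 8.0}]

def join_on(left, right, key):
    """Inner-join two lists of records on a shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

joined = join_on(customers, orders, "id")

rng = random.Random(0)            # fixed seed so the sample is reproducible
sample = rng.sample(joined, k=1)  # simple random sample without replacement
```

Only records present in both extracts survive the join; a different join type may suit other tasks.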
2. Data Preprocessing. After the selection of data, the selected data needs to be
preprocessed (cleaned). Data preprocessing/cleaning involves looking for obvious
flaws in the data and removing them, and removing records with errors or
insignificant values. Removal of flaws in the data involves collecting the
necessary information to model or account for the flaws, and deciding on
strategies for handling missing and unknown values in data fields. Data cleaning
also involves data conversions to ensure uniform representation of data in all
records. This is mainly required when the target data is derived from different
databases in which values for the same field may be represented using different
notations. For example, the sex field may be represented as 1 for male and 0 for
female in one database, and M for male and F for female in another database. The
process of data cleaning is often referred to as data scrubbing.
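
The sex-field example above might be handled along the following lines. This is a minimal sketch: the canonical codes, the record layout, and the choice to drop records with missing or unknown values are assumptions made for illustration, and other strategies (such as imputing a value) are equally possible.

```python
# Assumed canonical notation: "M"/"F". The 1/0 and M/F inputs come from the
# example in the text; everything else here is hypothetical.
CANONICAL_SEX = {1: "M", 0: "F", "1": "M", "0": "F", "M": "M", "F": "F"}

def normalize_sex(value):
    """Map a raw sex value to 'M' or 'F', or None when missing or unknown."""
    if isinstance(value, str):
        value = value.strip().upper()
    return CANONICAL_SEX.get(value)

def clean_records(records):
    """Return records with a uniform 'sex' field, dropping unmappable ones."""
    cleaned = []
    for record in records:
        sex = normalize_sex(record.get("sex"))
        if sex is None:
            continue  # one possible strategy: discard missing/unknown values
        cleaned.append({**record, "sex": sex})
    return cleaned

db_a = [{"id": 1, "sex": 1}, {"id": 2, "sex": 0}, {"id": 3, "sex": None}]
db_b = [{"id": 4, "sex": "m"}, {"id": 5, "sex": "F"}, {"id": 6, "sex": "?"}]
merged = clean_records(db_a + db_b)
```

After cleaning, records from both databases share one notation, and the two records with a missing or unrecognized value have been removed.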
3. Data Transformation. After the selection and cleaning process, certain
transformations on the preprocessed data may be necessary. These range from
conversions from one type of data to another, to deriving new variables using
mathematical or logical formulae. This step also involves finding useful features
to represent the data, depending on the goal of the knowledge discovery task, and
using dimensionality reduction or transformation methods to properly summarize
the data for the intended knowledge discovery.
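
A small sketch of the first two kinds of transformation, type conversion and derived variables, is given below. The field names (height_cm, weight_kg, age) and the derived variables are hypothetical examples, not taken from the text, and dimensionality reduction is not covered.

```python
def transform(record):
    """Convert string fields to numbers and derive two new variables."""
    height_m = float(record["height_cm"]) / 100.0   # type conversion: str -> float
    weight_kg = float(record["weight_kg"])
    bmi = weight_kg / height_m ** 2                  # derived via a mathematical formula
    return {
        "height_m": height_m,
        "weight_kg": weight_kg,
        "bmi": round(bmi, 1),
        "is_adult": int(record["age"]) >= 18,        # derived via a logical formula
    }

# Hypothetical preprocessed records whose values are still stored as strings.
raw = [
    {"height_cm": "180", "weight_kg": "81", "age": "42"},
    {"height_cm": "150", "weight_kg": "45", "age": "12"},
]
transformed = [transform(r) for r in raw]
```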
4. Data Mining. This step involves application of the core discovery procedures
on the transformed data to reveal patterns and new knowledge. This includes
searching for patterns of interest in a particular representational form or a set of
such representations, including classification rules or trees, regression, clustering,
sequence modeling, dependency and line analysis. All this requires selection of
method(s) to be used for searching for useful patterns in the data,...
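
To make one of the pattern types named in the data mining step concrete, here is a minimal sketch of clustering using k-means (Lloyd's algorithm) on one-dimensional values. The function name, the data, and the initial centers are illustrative assumptions. The text does not prescribe any particular method.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given initial centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters

# Hypothetical transformed data with two visible groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

The result depends on the initial centers and on the number of clusters chosen, which is part of the method selection the step describes.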
This document was uploaded on 04/07/2014.
- Spring '14