However as shown in the figure the data mining


on, data often needs to be extracted from different databases and joined, and perhaps sampled.

2. Data Preprocessing. After the selection of data, the selected data needs to be preprocessed (cleaned). Data preprocessing/cleaning involves looking for obvious flaws in the data and removing them, and removing records with errors or insignificant values. Removal of flaws in the data involves collecting the necessary information to model or account for the flaws, and deciding on strategies for handling missing and unknown values in data fields. Data cleaning also involves data conversions to ensure uniform representation of data in all records. This is mainly required when the target data is derived from different databases in which values for the same field may be represented using different notations. For example, the sex field may be represented as 1 for male and 0 for female in one database, and M for male and F for female in another database. The process of data cleaning is often referred to as data scrubbing.

3. Data Transformation. After the selection and cleaning process, certain transformations on the preprocessed data may be necessary. These range from conversions from one type of data to another, to deriving new variables using mathematical or logical formulae. This step also involves finding useful features to represent the data, depending on the goal of the knowledge discovery task, and using dimensionality reduction or transformation methods to properly summarize the data for the intended knowledge discovery.

4. Data Mining. This step involves application of the core discovery procedures on the transformed data to reveal patterns and new knowledge. This includes searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency and line analysis.
All this requires selection of method(s) to be used for searching for useful patterns in the data,...
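The preprocessing and transformation steps described above can be illustrated with a small sketch. It is only one possible reading of the text, under stated assumptions: the record layout, the field names (`height_cm`, `weight_kg`, `bmi`), the sample values, and the helper names `clean` and `transform` are all hypothetical. Filling missing values with the mean and dropping records with an unrecognized sex code are just two of the many strategies the text alludes to, not a recommendation from the source.

```python
from statistics import mean

# Hypothetical records merged from two databases: one encodes sex as 1/0,
# the other as "M"/"F". Values are invented for illustration only.
RAW_RECORDS = [
    {"id": 1, "sex": 1, "height_cm": 180.0, "weight_kg": 81.0},
    {"id": 2, "sex": 0, "height_cm": 165.0, "weight_kg": None},
    {"id": 3, "sex": "M", "height_cm": 175.0, "weight_kg": 70.0},
    {"id": 4, "sex": "F", "height_cm": 160.0, "weight_kg": 64.0},
    {"id": 5, "sex": "?", "height_cm": 170.0, "weight_kg": 68.0},
]

# Map both notations onto one uniform representation ("M"/"F").
SEX_CODES = {1: "M", 0: "F", "M": "M", "F": "F"}


def clean(records):
    """Data preprocessing: unify notations, drop bad records, fill gaps."""
    kept = []
    for rec in records:
        sex = SEX_CODES.get(rec["sex"])
        if sex is None:
            # Assumed strategy: discard records whose sex code is unrecognized.
            continue
        kept.append({**rec, "sex": sex})

    # Assumed strategy: replace a missing weight with the mean of known weights.
    known = [r["weight_kg"] for r in kept if r["weight_kg"] is not None]
    fill_value = mean(known)
    for rec in kept:
        if rec["weight_kg"] is None:
            rec["weight_kg"] = fill_value
    return kept


def transform(records):
    """Data transformation: derive a new variable with a formula (here, BMI)."""
    out = []
    for rec in records:
        height_m = rec["height_cm"] / 100.0
        bmi = rec["weight_kg"] / (height_m * height_m)
        out.append({**rec, "bmi": round(bmi, 1)})
    return out


if __name__ == "__main__":
    for row in transform(clean(RAW_RECORDS)):
        print(row)
```

Keeping cleaning and transformation as separate functions mirrors the separate steps in the text: the output of `clean` is the preprocessed data, and `transform` only adds derived variables to it before any mining procedure is applied.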

This document was uploaded on 04/07/2014.
