novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or
can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.
Given these notions, we can consider a
pattern to be knowledge if it exceeds some interestingness threshold, which is by no
means an attempt to define knowledge in the
philosophical or even the popular view. As a
matter of fact, knowledge in this definition is
purely user oriented and domain specific and
is determined by whatever functions and
thresholds the user chooses.
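The thresholding idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Pattern class, the equal weights, and the scores in [0, 1] are hypothetical and are not taken from the article, which leaves the choice of functions and thresholds to the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    # Hypothetical container; each score is assumed to lie in [0, 1].
    description: str
    novelty: float
    usefulness: float
    simplicity: float


def interestingness(pattern, weights=(1 / 3, 1 / 3, 1 / 3)):
    """One possible explicit interestingness function: a weighted average.

    The weights are an assumption; a user could pick any other function.
    """
    scores = (pattern.novelty, pattern.usefulness, pattern.simplicity)
    return sum(w * s for w, s in zip(weights, scores))


def select_knowledge(patterns, threshold):
    """Return the patterns whose interestingness exceeds the threshold.

    The result is sorted from most to least interesting, which is the
    implicit ordering a system could present instead of raw scores.
    """
    ranked = sorted(patterns, key=interestingness, reverse=True)
    return [p for p in ranked if interestingness(p) > threshold]
```

Changing the weights or the threshold changes which patterns count as knowledge, which is the sense in which the definition is user oriented and domain specific.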
Data mining is a step in the KDD process
that consists of applying data analysis and
discovery algorithms that, under acceptable
computational efficiency limitations, produce a particular enumeration of patterns (or
models) over the data. Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of
search in this space. Practical computational
constraints place severe limits on the subspace that can be explored by a data-mining algorithm.
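As a toy illustration of enumeration as bounded search, the sketch below lists item sets that occur in at least a minimum number of transactions. It is an assumption-laden example, not a method from the article: the function name, the min_support criterion, and the max_size cap (standing in for a computational limit on the explored subspace) are all hypothetical, and the pattern space here is finite, unlike the general case.

```python
from itertools import combinations


def enumerate_itemsets(transactions, min_support, max_size):
    """Enumerate item sets found in at least min_support transactions.

    transactions: list of sets of items.
    max_size: largest item-set size explored; larger sets are never
    examined, so part of the pattern space is deliberately left unsearched.
    """
    items = sorted({item for t in transactions for item in t})
    found = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing every item.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                found[candidate] = support
    return found
```

Raising max_size widens the searched subspace at the cost of examining many more candidates, which is the trade-off the paragraph describes.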
The KDD process involves using the
database along with any required selection,
preprocessing, subsampling, and transformations of it; applying data-mining methods
(algorithms) to enumerate patterns from it;
and evaluating the products of data mining
to identify the subset of the enumerated patterns deemed knowledge. The data-mining
component of the KDD process is concerned
with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible
interpretation of the mined patterns to determine which patterns can be considered
new knowledge. The KDD process also includes all the additional steps described in
the next section.
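The relationship between data mining and the overall KDD process can be sketched as a small pipeline in which mining is one stage among several. Everything here is a hypothetical illustration: the function names, the sales records, and the threshold of 100 are assumptions made for the example and do not come from the article.

```python
def kdd_process(database, select, preprocess, transform, mine, evaluate):
    """Run the stages in order and keep only the patterns judged knowledge."""
    target = select(database)            # selection / subsampling
    cleaned = preprocess(target)         # preprocessing
    transformed = transform(cleaned)     # transformation
    patterns = mine(transformed)         # data mining: enumerate patterns
    return [p for p in patterns if evaluate(p)]  # evaluation


# Hypothetical stage implementations over a list of dict records.
def select_known_regions(db):
    return [r for r in db if r["region"] in {"north", "south"}]


def drop_missing_sales(records):
    return [r for r in records if r["sales"] is not None]


def group_sales_by_region(records):
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["sales"])
    return groups


def mean_sales_patterns(groups):
    # Each "pattern" is a (region, mean sales) pair.
    return [(region, sum(v) / len(v)) for region, v in groups.items()]


def is_interesting(pattern):
    return pattern[1] > 100  # assumed interestingness threshold


records = [
    {"region": "north", "sales": 120},
    {"region": "north", "sales": None},
    {"region": "south", "sales": 80},
    {"region": "south", "sales": 100},
    {"region": "east", "sales": 500},
    {"region": "north", "sales": 140},
]
```

Calling kdd_process(records, select_known_regions, drop_missing_sales, group_sales_by_region, mean_sales_patterns, is_interesting) runs every stage in order; only the mine argument corresponds to the data-mining step, while the remaining stages belong to the surrounding process.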
The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics
(Hand 1994) and in machine learning (Brodley and Smyth 1996).
The KDD Process
The KDD process is interactive and iterative,
involving numerous steps with many decisions made by the user. Brachman and Anand
(1996) give a practical view of the KDD process, emphasizing the interactive nature of
the process. Here, we broadly outline some of
its basic steps:
First is developing an understanding of the
application domain and the relevant prior
knowledge and identifying the goal of the
KDD process from the customer’s viewpoint.
Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is
to be performed.
Third is data cleaning and preprocessing.
Basic operations include removing noise if
appropriate, collecting the necessary information to model or account for noise, deciding
on strategies for handling missing data fields,
and accounting for time-sequence information and known changes.
Fourth is data reduction and projection:
finding useful features to represent the data