d offline).
Once organizations and individuals have
solved the problem of how to store and access their data, the natural next step is the
question, What else do we do with all the data? This is where opportunities for KDD naturally arise.
A popular approach for analysis of data
warehouses is called online analytical processing
(OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on
providing multidimensional data analysis,
which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis,
but the goal of KDD tools is to automate as
much of the process as possible. Thus, KDD is
a step beyond what is currently supported by
most standard database systems.

Basic Definitions
Figure 1. An Overview of the Steps That Compose the KDD Process. (Labels recoverable from the figure: Data, Target Data, Preprocessing, Preprocessed Data, Transformation, Interpretation, Knowledge.)

KDD is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad,
Piatetsky-Shapiro, and Smyth 1996).
Here, data are a set of facts (for example,
cases in a database), and pattern is an expression in some language describing a subset of
the data or a model applicable to the subset.
Hence, in our usage here, extracting a pattern
also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data.
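As a minimal sketch of these definitions, assume the data are a list of records and a pattern is a boolean expression over a single record. The field names (`income`, `defaulted`), the values, and the threshold are hypothetical and chosen only for illustration; the pattern then describes the subset of cases it covers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass(frozen=True)
class Pattern:
    """An expression in some language, here a named boolean predicate."""
    description: str
    matches: Callable[[Record], bool]

    def subset(self, data: List[Record]) -> List[Record]:
        """Return the cases in the data that the pattern describes."""
        return [row for row in data if self.matches(row)]


# Hypothetical data: a set of facts (cases in a database).
data = [
    {"income": 20.0, "defaulted": 1.0},
    {"income": 35.0, "defaulted": 1.0},
    {"income": 60.0, "defaulted": 0.0},
    {"income": 90.0, "defaulted": 0.0},
]

# Hypothetical pattern: a simple expression over one field.
low_income = Pattern("income < 50", lambda row: row["income"] < 50.0)
covered = low_income.subset(data)
print(low_income.description, "covers", len(covered), "of", len(data), "cases")
```

A fitted model could play the same role as the predicate here; the sketch uses a hand-written expression only to keep the example short.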
The term process implies that KDD comprises
many steps, which involve data preparation,
search for patterns, knowledge evaluation,
and refinement, all repeated in multiple iterations. By nontrivial, we mean that some
search or inference is involved; that is, it is
not a straightforward computation of
predefined quantities like computing the average value of a set of numbers.
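The contrast can be sketched as follows, assuming a toy one-dimensional data set and a hypothetical rule language of the form "x < t". Computing the average is a single predefined computation, whereas finding a good threshold requires a search over candidate expressions. The data, the function names, and the scoring function are illustrative assumptions, not part of the definition.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical cases: (x, label) pairs; label 1 marks the class of interest.
cases: List[Tuple[float, int]] = [
    (1.0, 1), (2.0, 1), (3.0, 1), (6.0, 0), (7.0, 0), (4.0, 0),
]

# Straightforward computation of a predefined quantity: no search involved.
average_x = mean(x for x, _ in cases)


def accuracy(threshold: float) -> float:
    """Fraction of cases where the rule 'x < threshold' agrees with the label."""
    hits = sum((x < threshold) == bool(label) for x, label in cases)
    return hits / len(cases)


def best_threshold() -> Tuple[float, float]:
    """Search candidate thresholds (midpoints between sorted x values)."""
    xs = sorted(x for x, _ in cases)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(((t, accuracy(t)) for t in candidates), key=lambda pair: pair[1])


threshold, score = best_threshold()
print("average:", average_x)
print("best rule: x <", threshold, "accuracy:", score)
```

Even in this tiny case the second computation has to compare several candidate patterns before it can report one, which is the sense of "search or inference" above.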
The discovered patterns should be valid on
new data with some degree of certainty. We
also want patterns to be novel (at least to the
system and preferably to the user) and potentially useful, that is, lead to some benefit to
the user or task. Finally, the patterns should
be understandable, if not immediately then
after some postprocessing.
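One common way to approximate "valid on new data with some degree of certainty" is to score a pattern on cases that played no part in finding it. The sketch below assumes a hypothetical rule and an invented holdout set, and it uses a normal-approximation interval as a rough stand-in for the degree of certainty. None of these choices is prescribed by the text, and the approximation is crude for a sample this small.

```python
import math
from typing import List, Tuple

Case = Tuple[float, int]  # (x, label); hypothetical toy representation


def rule(x: float) -> int:
    """Hypothetical discovered pattern: predict 1 when x < 3.5."""
    return 1 if x < 3.5 else 0


def holdout_accuracy(cases: List[Case]) -> Tuple[float, float]:
    """Return accuracy on unseen cases and an approximate 95% half-width."""
    n = len(cases)
    correct = sum(rule(x) == label for x, label in cases)
    acc = correct / n
    # Normal approximation to the binomial; illustrative only.
    half_width = 1.96 * math.sqrt(acc * (1.0 - acc) / n)
    return acc, half_width


# New cases, invented for illustration, that were not used to choose the rule.
new_cases: List[Case] = [
    (1.2, 1), (2.8, 1), (3.1, 0), (4.4, 0), (5.0, 0),
    (3.9, 1), (0.5, 1), (6.3, 0), (2.2, 1), (7.1, 0),
]

acc, half_width = holdout_accuracy(new_cases)
print(f"holdout accuracy: {acc:.2f} +/- {half_width:.2f}")
```

The width of the interval shrinks as more new cases are scored, which is one simple way to read "some degree of certainty" quantitatively.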
The previous discussion implies that we can
define quantitative measures for evaluating
extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in
dollars saved because of better predictions or
speedup in response time of a system). Notions such as novelty and understandability
are much more subjective. In certain contexts,
understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called
interestingness (for example, see Silberschatz
and Tuzhilin; Piatetsky-Shapiro and
Matheus), is usually taken as an overall
measure of pattern value, combining validity,