data errors) are another consideration. High amounts of noise make it hard to
identify patterns unless a large number of cases can mitigate random noise and help clarify
the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier
to retrain a system than a human. Finally,
and perhaps one of the most important considerations, is prior knowledge. It is useful to
know something about the domain: what
are the important fields, what are the likely
relationships, what is the user utility function, what patterns are already known, and so
on.

Research and Application Challenges
We outline some of the current primary research and application challenges for KDD.
This list is by no means exhaustive and is intended to give the reader a feel for the types
of problem that KDD practitioners wrestle with.
Larger databases: Databases with hundreds of fields and tables, millions of
records, and multigigabyte size are commonplace, and terabyte (10^12 bytes) databases
are beginning to appear. Methods for dealing
with large data volumes include more
efficient algorithms (Agrawal et al. 1996),
sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).
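As one illustration of the sampling approach listed above, the following is a minimal sketch of reservoir sampling, which draws a fixed-size uniform sample in a single pass without loading the whole table into memory. The source does not name this particular technique; it is swapped in here as a common example, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # A generator stands in for a table too large to hold in memory (an assumption).
    sample = reservoir_sample((n for n in range(100_000)), k=5, seed=0)
    print(sample)
```

A mining algorithm can then run on the sample rather than the full table, trading some accuracy for a large reduction in data volume.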
High dimensionality: Not only is there often a large number of records in the database,
but there can also be a large number of fields
(attributes, variables); so, the dimensionality
of the problem is high. A high-dimensional
data set increases the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that
a data-mining algorithm will find spurious
patterns that are not valid in general. Approaches to this problem include methods to
reduce the effective dimensionality of the
problem and the use of prior knowledge to
identify irrelevant variables.
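The risk of spurious patterns described above can be seen in a small simulation. This sketch, which uses only made-up random data, generates a random target and a growing number of unrelated random fields, then reports the strongest correlation found by chance. The function names and the choice of Pearson correlation as the "pattern" are assumptions for illustration.

```python
import math
import random
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strongest_chance_correlation(n_records: int, n_fields: int, seed: int = 0) -> float:
    """Largest |correlation| between a random target and any of n_fields random fields."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_fields):
        # Each field is pure noise, so any correlation with the target is spurious.
        field = [rng.gauss(0, 1) for _ in range(n_records)]
        best = max(best, abs(pearson(field, target)))
    return best


if __name__ == "__main__":
    for d in (1, 10, 100, 1000):
        print(d, round(strongest_chance_correlation(n_records=50, n_fields=d), 3))
```

With the number of records held fixed, the strongest chance correlation can only grow as more fields are searched, which is why reducing the effective dimensionality or discarding irrelevant variables helps.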
Overfitting: When the algorithm searches
for the best parameters for one particular
model using a limited set of data, it can model not only the general patterns in the data
but also any noise specific to the data set, resulting in poor performance of the model on
test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.
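Cross-validation, the first remedy mentioned above, can be sketched in a few lines. This is a minimal k-fold example under stated assumptions: the data are synthetic, the model is a one-variable least-squares line, and the names `k_fold_splits`, `fit_line`, and `cross_val_mse` are hypothetical helpers rather than any library's API.

```python
import random
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]


def k_fold_splits(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) pairs with disjoint test folds.

    Assumes n >= k so that no fold is empty.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_val_mse(xs: Sequence[float], ys: Sequence[float],
                  fit: Callable[[List[float], List[float]], Model],
                  k: int = 5, seed: int = 0) -> float:
    """Average mean squared error on the held-out fold, over k folds."""
    errors = []
    for train, test in k_fold_splits(len(xs), k, seed):
        # Fit only on the training indices, then score on unseen records.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(sum((model(xs[i]) - ys[i]) ** 2 for i in test) / len(test))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [rng.uniform(0, 10) for _ in range(200)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]  # synthetic noisy line
    print("held-out MSE:", cross_val_mse(xs, ys, fit_line, k=5))
```

Because every record is scored by a model that never saw it during fitting, the held-out error gives a more honest estimate of performance on new data than the training error does.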
Assessing statistical significance: A
problem (related to overfitting) occurs when
the system is searching over many possible
models. For example, if a system tests N models
at the 0.001 significance level, then on average, with purely random data, N/1000 of
these models will be accepted as significant.
FALL 1996 49 Articles
This point is frequently missed by many initial attempts at KDD. One way to deal with
this problem is to use methods that adjust
the test statistic as a function of the search,
for example, Bonferroni adjustments for independent tests or randomization testing.
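The arithmetic behind this point, and the Bonferroni adjustment named above, can be sketched as follows. The function names and the example p-values are hypothetical; the adjustment shown is the standard one of dividing the significance level by the number of models tested.

```python
from typing import List


def expected_false_positives(n_models: int, alpha: float) -> float:
    """Expected number of models accepted by chance on purely random data."""
    return n_models * alpha


def bonferroni_threshold(n_models: int, alpha: float) -> float:
    """Per-test significance level after dividing alpha across all tests."""
    return alpha / n_models


def bonferroni_significant(p_values: List[float], alpha: float) -> List[bool]:
    """Flag the p-values that remain significant after the Bonferroni adjustment."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Searching 10,000 models at the 0.001 level: about N/1000 pass by chance.
    print(expected_false_positives(10_000, 0.001))
    # Made-up p-values for three candidate models.
    print(bonferroni_significant([0.00001, 0.004, 0.03], alpha=0.05))
```

The more models the search covers, the stricter the per-test threshold becomes, so a pattern has to be correspondingly stronger to be accepted.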
Changing data and knowledge: Rapidly
changing (nonstationary) data can m...