From Data Mining to Knowledge Discovery in Databases




data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.
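The point about spurious patterns in high-dimensional data can be illustrated with a small simulation. The sketch below is not from the article: the function names, the Gaussian noise model, and the use of Pearson correlation as the "pattern" are all assumptions chosen for illustration. It generates fields of pure noise and reports the strongest correlation any of them has with an equally random target.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_records, n_fields, seed=0):
    """Largest |correlation| between a random target and n_fields random fields.

    Hypothetical helper: every value is independent Gaussian noise, so any
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_fields):
        field = [rng.gauss(0, 1) for _ in range(n_records)]
        best = max(best, abs(pearson(field, target)))
    return best


if __name__ == "__main__":
    for n_fields in (5, 50, 500):
        print(n_fields, strongest_spurious_correlation(50, n_fields))
```

With a fixed seed, the fields examined for a small `n_fields` are a prefix of those examined for a larger one, so the reported maximum can only stay the same or grow as fields are added. This is the effect the paragraph describes: the more attributes a search covers, the stronger the best-looking pattern found by chance.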
Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing of statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant. This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

FALL 1996 49 Articles

Changing data and knowledge: Rapidly changing (nonstationary) data can m...
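The N/1000 arithmetic in the statistical-significance challenge, and the Bonferroni adjustment mentioned as one remedy, can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's method: the function names are hypothetical, and p-values under purely random data are modeled as independent uniform draws on [0, 1).

```python
import random


def expected_false_positives(n_models, alpha):
    """Expected number of models accepted by chance: N * alpha."""
    return n_models * alpha


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests


def simulate_null_p_values(n_models, seed=0):
    """Hypothetical null model: one uniform p-value per tested model."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_models)]


def count_accepted(p_values, threshold):
    """Number of models whose p-value falls below the threshold."""
    return sum(1 for p in p_values if p < threshold)


if __name__ == "__main__":
    n_models, alpha = 10_000, 0.001
    p_values = simulate_null_p_values(n_models)
    print("expected by chance:", expected_false_positives(n_models, alpha))
    print("accepted, unadjusted:", count_accepted(p_values, alpha))
    adjusted = bonferroni_threshold(alpha, n_models)
    print("accepted, Bonferroni:", count_accepted(p_values, adjusted))
```

Because the adjusted threshold is alpha divided by the number of tests, it can never accept more models than the unadjusted test does on the same p-values. The adjustment is conservative, and the independence assumption matches the article's wording ("Bonferroni adjustments for independent tests").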

This document was uploaded on 02/15/2014.
