From Data Mining to Knowledge Discovery in Databases

…d offline). Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise. A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996). Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations.

[Figure 1. An Overview of the Steps That Compose the KDD Process: Selection, Preprocessing, Transformation, Data Mining, and Interpretation/Evaluation, yielding Target Data, Preprocessed Data, Transformed Data, Patterns, and, finally, Knowledge.]
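The sequence of steps just described (selection, preprocessing, transformation, data mining, and interpretation/evaluation) can be sketched as a toy pipeline. This is an illustration only, not an implementation from the article; the records, field names, and the simple churn-rate "pattern" are all assumptions made for the example.

```python
# A minimal, illustrative sketch of the KDD process steps:
# selection -> preprocessing -> transformation -> data mining ->
# interpretation/evaluation. All records and field names are hypothetical.

raw_records = [
    {"age": 25, "income": 48_000, "churned": False},
    {"age": 41, "income": 67_000, "churned": False},
    {"age": 35, "income": None,   "churned": True},   # missing value
    {"age": 52, "income": 83_000, "churned": True},
    {"age": 29, "income": 51_000, "churned": False},
]

# Selection: choose the target data relevant to the analysis task.
target = [r for r in raw_records if r["age"] >= 25]

# Preprocessing: handle noise and missing values (here, drop incomplete rows).
preprocessed = [r for r in target if r["income"] is not None]

# Transformation: derive features useful for the mining step.
transformed = [
    {"high_income": r["income"] > 60_000, "churned": r["churned"]}
    for r in preprocessed
]

# Data mining: search for a simple pattern -- here, the churn rate
# broken down by the derived high_income feature.
def churn_rate(rows, flag):
    subset = [r for r in rows if r["high_income"] == flag]
    return sum(r["churned"] for r in subset) / len(subset)

pattern = {
    "high_income": churn_rate(transformed, True),
    "low_income": churn_rate(transformed, False),
}

# Interpretation/evaluation: a human (or a later step) judges whether
# the extracted pattern is valid, novel, and useful.
print(pattern)
```

In a real KDD application each step would itself be iterative; the point of the sketch is only that the output of one step is the input of the next, and that the loop is repeated until the evaluated patterns are acceptable.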
By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers. The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, n...
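Two of the measures mentioned above can be made concrete with a small sketch: certainty as estimated prediction accuracy on held-out data, and understandability approximated by simplicity (bits needed to describe the pattern). The data, the one-condition rule, and the 8-bits-per-condition coding scheme are assumptions chosen for illustration, not a standard from the article.

```python
# Illustrative sketch of two quantitative pattern measures:
# certainty (holdout accuracy) and simplicity (description length in bits).
import math

# Labeled examples: (feature value, class label). Hypothetical data.
train = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
holdout = [(1, "a"), (2, "b"), (4, "b")]

# A candidate pattern: a one-condition rule found on the training data.
def rule(x):
    return "a" if x <= 2 else "b"

# Certainty: estimated accuracy on data not used to find the pattern.
accuracy = sum(rule(x) == y for x, y in holdout) / len(holdout)

# Simplicity: a crude description length -- an assumed 8 bits per rule
# condition (one condition here), plus bits to encode the threshold,
# which ranges over the 5 values 0..4.
description_bits = 8 * 1 + math.ceil(math.log2(5))

print(accuracy, description_bits)  # -> 0.6666..., 11
```

Note that accuracy is measured on the holdout set, not the training set: validity, in the definition above, means the pattern should hold on new data with some degree of certainty, and training-set accuracy would overstate that.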