From Data Mining to Knowledge Discovery in Databases

...ersal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories.

The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes. It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few...
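
The point about attribute relevance can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the article: the synthetic cases, the field names "a" and "b", and the simple majority-vote rule. Field "a" determines the label, whereas field "b" carries no information about it. Accuracy is measured on the same cases used to build the rule.

```python
from collections import Counter, defaultdict


def make_cases(n):
    """Build n synthetic cases (hypothetical data, not from the article).

    Field "a" determines the label; field "b" is unrelated to it.
    """
    cases = []
    for i in range(n):
        a = i % 2          # relevant attribute: equals the label
        b = (i // 2) % 2   # irrelevant attribute: balanced across both labels
        cases.append({"a": a, "b": b, "label": a})
    return cases


def majority_rule_accuracy(cases, field):
    """Predict the most common label for each value of `field`.

    Returns the fraction of the given cases that this rule labels correctly.
    """
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[field]][case["label"]] += 1
    correct = sum(max(label_counts.values()) for label_counts in counts.values())
    return correct / len(cases)


if __name__ == "__main__":
    # Growing the number of cases helps only when the attribute is relevant.
    for n in (8, 800, 80000):
        cases = make_cases(n)
        print(n, majority_rule_accuracy(cases, "a"), majority_rule_accuracy(cases, "b"))
```

In this toy setup the rule built on "a" is always correct, whereas the rule built on "b" stays at 0.5 (chance level) however many cases are supplied.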