From Data Mining to Knowledge Discovery in Databases




...le: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a sensible metric between the variables. Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

[Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set. Axes: Income, Debt. Regions: Loan, No Loan.]

[Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set. Axes: Income, Debt. Regions: Loan, No Loan.]

FALL 1996 47 Articles

Probabilistic Graphic Dependency Models

Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby...
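Returning to the nearest-neighbor classifier described for the loan data set, the following is a minimal sketch of how such a classifier and a cross-validation estimate of its prediction error might look. It is not taken from the article: the data points, the labels "loan" and "no_loan", and the function names knn_predict and loo_error are all hypothetical, and Euclidean distance is assumed only because income and debt are said to be measured in the same units. The number of neighbors k and the distance function are left as parameters, since the text names both as quantities that can be estimated.

```python
import math
from collections import Counter

# Hypothetical (income, debt) examples, both in the same units.
# These values are invented for illustration; they are not the article's data.
TRAIN = [
    ((20.0, 40.0), "no_loan"), ((25.0, 35.0), "no_loan"),
    ((30.0, 45.0), "no_loan"), ((35.0, 30.0), "no_loan"),
    ((15.0, 25.0), "no_loan"),
    ((60.0, 20.0), "loan"), ((70.0, 15.0), "loan"),
    ((80.0, 30.0), "loan"), ((65.0, 10.0), "loan"),
    ((90.0, 25.0), "loan"),
]


def euclidean(a, b):
    """Euclidean distance; an assumed metric, reasonable when units match."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, point, k=1, dist=euclidean):
    """Return the majority class among the k stored examples closest to point.

    With k=1 the class of a new point is the class of the closest
    training example, as in the nearest-neighbor description.
    """
    neighbors = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def loo_error(train, k, dist=euclidean):
    """Leave-one-out cross-validation estimate of the prediction error."""
    errors = 0
    for i, (x, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]  # hold out example i
        if knn_predict(rest, x, k, dist) != y:
            errors += 1
    return errors / len(train)


if __name__ == "__main__":
    # Classify a new applicant and compare candidate values of k.
    print(knn_predict(TRAIN, (75.0, 20.0), k=1))
    for k in (1, 3, 5):
        print(k, loo_error(TRAIN, k))
```

Passing a different dist function is where a metric over mixed variables (loan duration, sex, profession) would have to be defined; the sketch deliberately does not attempt one. Note also that the "model" here is just the stored TRAIN list, which is the sense in which the text calls it implicit in the data.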

This document was uploaded on 02/15/2014.
