features such as parallel database access. No single product, however, can support the large
variety of database servers, so a gateway must be used for all but the four or five leading DBMSs.
The most common gateway supported is Microsoft’s ODBC (Open Database Connectivity). In
some instances it is useful if the data mining tool can consolidate data from multiple sources.
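Consolidating rows from several sources into a single table for the mining tool can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a runnable stand-in for two separate data sources; with a real ODBC gateway you would connect via a driver such as pyodbc and a DSN instead (the table and column names here are invented for the example).

```python
import sqlite3

# Two separate databases standing in for two ODBC data sources.
src_a = sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
src_a.executemany("INSERT INTO customers VALUES (?, ?)",
                  [(1, "east"), (2, "west")])

src_b = sqlite3.connect(":memory:")
src_b.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
src_b.executemany("INSERT INTO orders VALUES (?, ?)",
                  [(1, 120.0), (1, 35.5), (2, 99.9)])

# Consolidate: pull rows from each source into one table
# that the data mining tool can then read as a single input.
merged = sqlite3.connect(":memory:")
merged.execute("CREATE TABLE mining_input (id INTEGER, region TEXT, total REAL)")
totals = {cid: amt for cid, amt in src_b.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")}
for cid, region in src_a.execute("SELECT id, region FROM customers"):
    merged.execute("INSERT INTO mining_input VALUES (?, ?, ?)",
                   (cid, region, totals.get(cid, 0.0)))

rows = list(merged.execute("SELECT * FROM mining_input ORDER BY id"))
```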
Algorithms. You must understand the characteristics of the algorithms the data mining product
uses so you can determine if they match the characteristics of your problem. In particular, learn
how the algorithms treat the data types of your response and predictor variables, how fast they
train, and how fast they work on new data.
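The train-time/score-time trade-off can be made concrete with a toy sketch (assumed example, not from any particular product): a nearest-neighbour model does almost no work at training time but must scan all stored rows to score each new case, while a simple mean model does its work up front and scores in constant time.

```python
# 1-nearest-neighbour: "training" just stores the data (instant),
# but scoring each new case scans every stored row.
def knn_train(rows):
    return rows  # lazy learner: no work at training time

def knn_predict(model, x):
    nearest = min(model, key=lambda r: abs(r[0] - x))  # O(n) per case
    return nearest[1]

# Mean model: one pass over the data at training time, O(1) scoring.
def mean_train(rows):
    return sum(y for _, y in rows) / len(rows)

def mean_predict(model, x):
    return model

data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
knn_model = knn_train(data)
mean_model = mean_train(data)
```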
Another important algorithm feature is sensitivity to noise. Real data has irrelevant columns, rows
(cases) that don’t conform to the pattern your model finds, and missing or incorrect values. How
much of this noise can your model-building tool stand before its accuracy drops? In other words,
how sensitive is the algorithm to missing data, and how robust are the patterns it discovers in the
face of extraneous and incorrect data? In some instances, simply adding more data may be
enough to compensate for noise, but if the additional data itself is very noisy, it may actually
reduce accuracy. In fact, a major part of data preparation is reducing the amount of noise in
your data, at least the noise that is under your control.
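The effect of a single incorrect value on a model can be seen in a minimal sketch (invented numbers): a least-squares slope fit on clean data versus the same data with one mis-keyed value.

```python
# Least-squares slope: how much does one bad value distort the fit?
def slope(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    var = sum((x - mx) ** 2 for x, _ in points)
    return cov / var

clean = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
noisy = clean[:-1] + [(5.0, 100.0)]   # one mis-keyed y value

clean_slope = slope(clean)   # true relationship is y = 2x
noisy_slope = slope(noisy)   # the single bad row drags the fit far off
```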
Interfaces to other products. There are many tools that can help you understand your data
before you build your model, and help you interpret the results of your model. These include
traditional query and reporting tools, graphics and visualization tools, and OLAP tools. Data
mining software that provides an easy integration path with other vendors’ products provides the
user with many additional ways to get the most out of the knowledge discovery process.
Model evaluation and interpretation. Products can help the user understand the results by
providing measures (of accuracy, significance, etc.) in useful formats such as confusion matrices
and ROI charts, by allowing the user to perform sensitivity analysis on the result, and by
presenting the result in alternative ways, such as graphically.
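A confusion matrix, one of the measures mentioned above, is straightforward to compute: rows are actual classes, columns are predicted classes, and the diagonal counts correct predictions. A minimal sketch with invented labels:

```python
# Build a confusion matrix from actual and predicted class labels.
def confusion_matrix(actual, predicted, classes):
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = ["buy", "buy", "no", "no", "no", "buy"]
predicted = ["buy", "no",  "no", "buy", "no", "buy"]
cm = confusion_matrix(actual, predicted, ["buy", "no"])

# Overall accuracy is the diagonal total over all cases.
accuracy = (cm["buy"]["buy"] + cm["no"]["no"]) / len(actual)
```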
Model deployment. The results of a model may be applied by writing directly to a database or
extracting records from it. When you need to apply the model to new cases as they come, it is
usually necessary to incorporate the model into a program using an API or code generated by the
data mining tool. In either case, one of the key problems in deploying models is to deal with the
transformations necessary to make predictions. Many data mining tools leave this as a separate
job for the user or programmer.

© 1999 Two Crows Corporation 35

Scalability. How effective is the tool in dealing with large amounts of data — both rows and
columns — and with sophisticated validation techniques? These challenges require the ability to
take advantage of powerful hardware. What kinds of parallelism does the tool support? Is there
parallel use of a parallel DBMS and are the algorithms themselves parallel? What kind of parallel
computers does it support — SMP servers or MPP servers? How well does it scale as the number
of processors increases? Does it support parallel data access?
Data mining algorithms written for a uniprocessor machine won’t automatically run faster on a
parallel machine; they must be rewritten to take advantage of the parallel processors. There are
two basic ways of accomplishing this. In the first method, independent pieces of the application
are assigned to different processors. The more processors available, the more pieces can execute
simultaneously, increasing throughput. This is called inter-model parallelism. This kind of scale-up is also
useful in building multiple independent models. For example, a neural net application could build
multiple models using different architectures (e.g., each with a different number of nodes or
hidden layers) simultaneously on each processor. But what happens if building each model takes a
long time? We then need to break this model into tasks, execute those tasks on separate
processors, and recombine them for the answer. This second method is called intra-model
parallelism.
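Inter-model parallelism, the first method above, can be sketched with Python's standard library. This is a toy illustration: the "models" are simple mean predictors trained on different data subsets, standing in for, say, neural nets with different architectures, and ThreadPoolExecutor is used so the sketch runs anywhere; genuinely CPU-bound training would use ProcessPoolExecutor instead.

```python
import concurrent.futures

# Each worker builds one independent model; adding workers lets more
# models build simultaneously (inter-model parallelism).
def train_model(job):
    name, rows = job
    return name, sum(rows) / len(rows)  # stand-in for real training

jobs = [
    ("model_a", [1.0, 2.0, 3.0]),
    ("model_b", [10.0, 20.0]),
    ("model_c", [5.0, 5.0, 5.0, 5.0]),
]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    models = dict(pool.map(train_model, jobs))
```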
User interface. To facilitate model building, some products provide a GUI (graphical user
interface) for semi-automatic model building, while others provide a scripting language. Some
products also provide data mining APIs that can be embedded in a programming language
such as C, Visual Basic, or PowerBuilder. Because important technical decisions arise in data
preparation, variable selection, and the choice of modeling strategy, even a GUI that simplifies
the model building itself still requires expertise to find the most effective models.
Keep in mind that the people who build, deploy, and use the results of the models may be
different groups with varying skills. You must evaluate a product’s user interface as to suitability
for each of these groups.

SUMMARY
Data mining offers great promise in helping organizations uncover patterns hidden in their data that
can be used to predict the behavior of customers, products and processes. However, data mining tools
need to be guided by users who understand the business, the data, and the general nature of the
analytical methods involved. Realistic expectations can yield rewarding results across a wide range of
applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It’s vital to properly collect and prepare the
data, and to check the models against the real world. The “best” model is often found after building
models of several different types, or by trying different technologies or algorithms.
Choosing the right data mining products means finding a tool with good basic capabilities, an
interface that matches the skill level of the people who’ll be using it, and features relevant to your
specific business problems. After you’ve narrowed down the list of potential solutions, get a hands-on
trial of the likeliest ones.