...er features such as parallel database access. No single product, however, can support the large variety of database servers, so a gateway must be used for all but the four or five leading DBMSs. The most common gateway supported is Microsoft’s ODBC (Open Database Connectivity). In some instances it is useful if the data mining tool can consolidate data from multiple sources.

Algorithms. You must understand the characteristics of the algorithms the data mining product uses so you can determine if they match the characteristics of your problem. In particular, learn how the algorithms treat the data types of your response and predictor variables, how fast they train, and how fast they work on new data.

Another important algorithm feature is sensitivity to noise. Real data has irrelevant columns, rows (cases) that don’t conform to the pattern your model finds, and missing or incorrect values. How much of this noise can your model-building tool stand before its accuracy drops? In other words, how sensitive is the algorithm to missing data, and how robust are the patterns it discovers in the face of extraneous and incorrect data? In some instances, simply adding more data may be enough to compensate for noise, but if the additional data itself is very noisy, it may actually reduce accuracy. In fact, a major aspect of data preparation is to reduce the amount of noise in your data that is under your control.

Interfaces to other products. There are many tools that can help you understand your data before you build your model, and help you interpret the results of your model. These include traditional query and reporting tools, graphics and visualization tools, and OLAP tools. Data mining software that provides an easy integration path with other vendors’ products provides the user with many additional ways to get the most out of the knowledge discovery process.

Model evaluation and interpretation.
Products can help the user understand the results by providing measures (of accuracy, significance, etc.) in useful formats such as confusion matrices and ROI charts, by allowing the user to perform sensitivity analysis on the result, and by presenting the result in alternative ways, such as graphically.

Model deployment. The results of a model may be applied by writing directly to a database or extracting records from it. When you need to apply the model to new cases as they come, it is usually necessary to incorporate the model into a program using an API or code generated by the data mining tool. In either case, one of the key problems in deploying models is to deal with the transformations necessary to make predictions. Many data mining tools leave this as a separate job for the user or programmer.

© 1999 Two Crows Corporation

Scalability. How effective is the tool in dealing with large amounts of data (both rows and columns) and with sophisticated validation techniques? These challenges require the ability to take advantage of powerful hardware. What kinds of parallelism does the tool support? Is there parallel use of a parallel DBMS and are the algorithms themselves parallel? What kind of parallel computers does it support: SMP servers or MPP servers? How well does it scale as the number of processors increases? Does it support parallel data access?

Data mining algorithms written for a uniprocessor machine won’t automatically run faster on a parallel machine; they must be rewritten to take advantage of the parallel processors. There are two basic ways of accomplishing this. In the first method, independent pieces of the application are assigned to different processors. The more processors, the more pieces can be executed without reducing throughput. This is called inter-model parallelism. This kind of scale-up is also useful in building multiple independent models.
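As a rough sketch of inter-model parallelism, the Python below hands each candidate model to its own worker process and collects the scores afterwards. The toy nearest-neighbour model, the data, and the names build_and_score and build_models are illustrative assumptions and do not come from any particular data mining product.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy data following y = 2x + 1 (an assumption for illustration).
TRAIN = [(x, 2 * x + 1) for x in range(0, 20, 2)]
HOLDOUT = [(x, 2 * x + 1) for x in range(1, 20, 4)]


def build_and_score(k):
    """Build one k-nearest-neighbour model and return (k, holdout MSE)."""
    def predict(x):
        # Average the responses of the k training points closest to x.
        nearest = sorted(TRAIN, key=lambda point: abs(point[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    mse = sum((predict(x) - y) ** 2 for x, y in HOLDOUT) / len(HOLDOUT)
    return k, mse


def build_models(candidate_ks, executor):
    """Inter-model parallelism: every candidate model is an independent task."""
    return dict(executor.map(build_and_score, candidate_ks))


if __name__ == "__main__":
    # One worker process per available processor by default.
    with ProcessPoolExecutor() as pool:
        scores = build_models([1, 2, 4], pool)
    best_k = min(scores, key=scores.get)
    print("holdout MSE by k:", scores)
    print("best k:", best_k)
```

Because the models do not depend on one another, the workers need no coordination beyond returning their scores. This sketch does not help when a single model is itself slow to build.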
For example, a neural net application could build multiple models using different architectures (e.g., each with a different number of nodes or hidden layers) simultaneously, one on each processor. But what happens if building each model takes a long time? We then need to break this model into tasks, execute those tasks on separate processors, and recombine them for the answer. This second method is called intra-model parallelism.

User interface. To facilitate model building, some products provide a GUI (graphical user interface) for semi-automatic model building, while others provide a scripting language. Some products also provide data mining APIs which can be embedded in a programming language like C, Visual Basic, or PowerBuilder. Because of important technical decisions in data preparation and selection and choice of modeling strategies, even a GUI that simplifies the model building itself requires expertise to find the most effective models. Keep in mind that the people who build, deploy, and use the results of the models may be different groups with varying skills. You must evaluate a product’s user interface as to suitability for each of these groups.

SUMMARY

Data mining offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behavior of customers, products and processes. However, data mining tools need to be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.

Building models is only one step in knowledge discovery. It’s vital to properly collect and prepare the data, and to check the models against the real world. The “best” model is often found after building models of several different types, or by trying different technologies or algorithms.

Choosing the right data mining products means finding a tool with good basic capabilities, an interface that matches the skill level of the people who’ll be using it, and features relevant to your specific business problems. After you’ve narrowed down the list of potential solutions, get a hands-on trial of the likeliest ones.
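
When trying out a product, it can help to know what a confusion matrix, one of the evaluation formats mentioned under model evaluation and interpretation, actually contains. The following is a minimal Python sketch; the "yes"/"no" labels and the sample predictions are made-up values for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical outcomes: what actually happened vs. what a model predicted.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "no", "no", "no", "yes", "no"]


def confusion_matrix(actual_labels, predicted_labels):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of cases where the predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


if __name__ == "__main__":
    matrix = confusion_matrix(actual, predicted)
    labels = ["yes", "no"]
    # Rows are actual labels, columns are predicted labels.
    print("actual/predicted", *labels)
    for a in labels:
        print(a, *(matrix[(a, p)] for p in labels))
    print("accuracy:", accuracy(matrix))
```

The off-diagonal cells show which kinds of mistakes the model makes, which a single accuracy figure hides.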
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.
