Different names may be used for the same entity (synonyms). There are often unit incompatibilities, especially when data sources are consolidated from different countries; for example, U.S. dollars and Canadian dollars cannot be added without conversion.

f. Metadata construction. The information in the Dataset Description and Data Description reports is the basis for the metadata infrastructure. In essence this is a database about the database itself. It provides information that will be used in the creation of the physical database, as well as information that will be used by analysts in understanding the data and building the models.

g. Load the data mining database. In most cases the data should be stored in its own database. For large amounts of data or for complex data, this will usually be a DBMS as opposed to a flat file. Having collected, integrated, and cleaned the data, it is now necessary to actually load the database itself. Depending on the DBMS and hardware being used, the amount of data, and the complexity of the database design, loading may turn out to be a serious undertaking that requires the expertise of information systems professionals.

h. Maintain the data mining database. Once created, a database needs to be cared for. It needs to be backed up periodically; its performance should be monitored; and it may need occasional reorganization to reclaim disk storage or to improve performance. For a large, complex database stored in a DBMS, maintenance may also require the services of information systems professionals.

3. Explore the data. See the DATA DESCRIPTION FOR DATA MINING section above for a detailed discussion of visualization, link analysis, and other means of exploring the data. The goal is to identify the most important fields in predicting an outcome and to determine which derived values may be useful.
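As a minimal sketch of steps f and g, the fragment below loads a small cleaned data set into its own SQLite database and stores simple column-level metadata alongside it, giving a concrete sense of a "database about the database." The table names, columns, and metadata fields here are illustrative assumptions, not part of the original text; a real project would draw them from the Dataset Description and Data Description reports.

```python
import sqlite3

# Cleaned, integrated rows ready for loading (illustrative data).
rows = [
    ("C001", 52000.0, 13000.0),
    ("C002", 48000.0, 9600.0),
]

conn = sqlite3.connect(":memory:")  # a file path or a full DBMS in practice

# g. Load the data mining database.
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, income REAL, debt REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# f. Metadata construction: one row per column, describing units and
# provenance so analysts can interpret the data when building models.
conn.execute("""CREATE TABLE metadata
                (table_name TEXT, column_name TEXT, units TEXT, source TEXT)""")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("customers", "income", "USD/year", "billing system"),
        ("customers", "debt", "USD", "credit bureau feed"),
    ],
)
conn.commit()

# Analysts consult the metadata rather than guessing at units.
for table, col, units, source in conn.execute("SELECT * FROM metadata"):
    print(f"{table}.{col}: {units} (from {source})")
```

Keeping the metadata in the same database as the data is only one design choice; the point is that units and provenance are recorded somewhere machine-readable, so the Canadian-dollar versus U.S.-dollar problem above is caught before rows are combined.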
In a data set with hundreds or even thousands of columns, exploring the data can be as time-consuming and labor-intensive as it is illuminating. A good interface and fast computer response are very important in this phase, because the very nature of your exploration changes when you have to wait even 20 minutes for some graphs, let alone a day.

4. Prepare data for modeling. This is the final data preparation step before building models. There are four main parts to this step:

a. Select variables
b. Select rows
c. Construct new variables
d. Transform variables

a. Select variables. Ideally, you would take all the variables you have, feed them to the data mining tool, and let it find those which are the best predictors. In practice, this doesn't work very well. One reason is that the time it takes to build a model increases with the number of variables. Another reason is that blindly including extraneous columns can lead to incorrect models. A very common error, for example, is to use as a predictor variable data that can only be known if you know the value of the response variable. People have actually used date of birth to "predict" age without realizing it. While in principle some data mining algorithms will automatically ignore irrelevant variables and properly account for related (covariant) columns, in practice it is wise to avoid depending solely on the tool. Often your knowledge of the problem domain can let you make many of these selections correctly. For example, including an ID number or Social Security number as a predictor variable will at best have no benefit and at worst may reduce the weight of other important variables.

b. Select rows. As in the case of selecting variables, you would like to use all the rows you have to build models. If you have a lot of data, however, this may take too long or require buying a bigger computer than you would like.
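The variable-selection advice above can be sketched as a simple domain-knowledge filter. This is a hypothetical example, not a method from the original text: the column names are invented, and the sets of excluded columns stand in for judgments a human analyst would make.

```python
# Candidate columns for a hypothetical model whose response variable is "age".
columns = ["customer_id", "ssn", "date_of_birth", "income", "region", "age"]
response = "age"

# Excluded by domain knowledge:
#  - identifiers (ID number, Social Security number) carry no predictive
#    signal and at worst dilute the weight of real predictors;
#  - date_of_birth is a leakage column: it only "predicts" age because
#    it effectively determines the response variable.
identifiers = {"customer_id", "ssn"}
leakage = {"date_of_birth"}

predictors = [c for c in columns
              if c != response and c not in identifiers and c not in leakage]
print(predictors)  # -> ['income', 'region']
```

Even tools that claim to ignore irrelevant columns automatically will not catch leakage like date_of_birth, since to the algorithm it looks like a perfect predictor; that screening has to come from the analyst.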
Consequently it is often a good idea to sample the data when the database is large. This yields no loss of information for most business problems, although sample selection must be done carefully to ensure that the sample is truly random. Given a choice of either investigating a few models built on all the data or investigating more models built on a sample, the latter approach will usually help you develop a more accurate and robust model.

You may also want to throw out data that are clearly outliers. While in some cases outliers may contain information important to your model building, often they can be ignored based on your understanding of the problem. For example, they may be the result of incorrectly entered data, or of a one-time occurrence such as a labor strike. Sometimes you may need to add new records (e.g., for customers who made no purchases).

c. Construct new variables. It is often necessary to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Certain variables that have little effect alone may need to be combined with others, using various arithmetic or algebraic operations (e.g., a...
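Row sampling and variable construction can both be sketched in a few lines, using the text's debt-to-income example. The synthetic records, the sample size, and the fixed random seed are assumptions made for illustration only.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# A "large" table of (debt, income) records (synthetic values).
records = [(1000.0 + i, 40000.0 + 10 * i) for i in range(100_000)]

# b. Select rows: a simple random sample is usually sufficient, and
# lets you try many models in the time one full-data model would take.
sample = random.sample(records, 1_000)

# c. Construct new variables: a single debt-to-income ratio is often a
# better credit-risk predictor than debt and income used separately.
ratios = [debt / income for debt, income in sample]
print(len(sample), "sampled rows; first ratio:", round(ratios[0], 4))
```

Note that `random.sample` draws without replacement and treats every row equally; if the business problem calls for stratified or time-aware sampling, a plain random draw like this is not enough.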
This note was uploaded on 01/19/2014 for the course STATS 315B taught by Professor Friedman during the Winter '08 term at Stanford.