For example, many European data sets are constrained in their use by privacy regulations that are far stricter than those in the United States.

b. Data Description. Describe the contents of each file or database table. Some of the properties documented in a Data Description Report are:

• Number of fields/columns
• Number/percentage of records with missing values
• Field names

For each field:

• Data type
• Definition
• Description
• Source of field
• Unit of measure
• Number of unique values
• List of values
• Range of values
• Number/percentage of missing values
• Collection information (e.g., how, where, conditions)
• Timeframe (e.g., daily, weekly, monthly)
• Specific time data (e.g., every Monday or every Tuesday)
• Primary key/foreign key relationships

c. Selection. The next step in preparing the data mining database is to select the subset of data to mine. This is not the same as sampling the database or choosing predictor variables; rather, it is a gross elimination of irrelevant or unneeded data. Other criteria for excluding data may include resource constraints, cost, restrictions on data use, or quality problems.

d. Data quality assessment and data cleansing. GIGO (Garbage In, Garbage Out) applies squarely to data mining: if you want good models, you need good data. A data quality assessment identifies characteristics of the data that will affect model quality. Essentially, you are trying to ensure not only that values are correct and consistent, but also that all of your data measures the same thing in the same way.

Data quality problems come in several forms. A single field may hold an incorrect value; in one case, a man's nine-digit Social Security identification number was accidentally entered as his income when the government computed his taxes. Even when individual fields appear to hold correct values, combinations of fields may be impossible, such as pregnant males. Sometimes the value for a field is simply missing. Inconsistencies must be identified and removed when consolidating data from multiple sources.

Missing data can be a particularly pernicious problem. If you throw out every record with a missing field, you may wind up with a very small database, or with an inaccurate picture of the whole database. The fact that a value is missing may be significant in itself; perhaps only wealthy customers regularly leave the "income" field blank, for instance. It can be worthwhile to create a new variable that flags missing values, build a model using it, and compare the results with those achieved by substituting for the missing value, to see which leads to better predictions.

Another approach is to calculate a substitute value. Common strategies for calculating missing values include using the modal value (for nominal variables), the median (for ordinal variables), or the mean (for continuous variables). A less common strategy is to assign the missing value based on the distribution of observed values for that variable: if a database consists of 40% females and 60% males, you might assign a missing gender entry the value "female" 40% of the time and "male" 60% of the time. Sometimes people build predictive models using data mining techniques to predict missing values; this usually gives a better result than a simple calculation, but is much more time-consuming. Two brief sketches follow: one of automated quality checks, and one of these missing-value substitutions.
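The single-field and cross-field errors described above lend themselves to simple automated checks. The sketch below is illustrative only: the field names (income, sex, pregnant) and the nine-digit income threshold are hypothetical, not taken from the report.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> list[str]:
    """Flag suspicious single-field values, impossible field combinations,
    and per-field missing-value counts.

    Field names and the income threshold are hypothetical examples;
    'pregnant' is assumed to be a boolean column.
    """
    problems = []

    # Single-field check: an "income" large enough to be a nine-digit ID,
    # e.g. a Social Security number accidentally keyed into the income field.
    for idx, value in df.loc[df["income"] >= 100_000_000, "income"].items():
        problems.append(f"row {idx}: income {value:,.0f} looks like an ID number")

    # Cross-field check: impossible combinations such as pregnant males.
    for idx in df.index[(df["sex"] == "M") & df["pregnant"]]:
        problems.append(f"row {idx}: inconsistent sex/pregnancy combination")

    # Missing-value counts per field, as a Data Description Report records.
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            pct = 100.0 * n_missing / len(df)
            problems.append(f"column {col!r}: {n_missing} missing ({pct:.1f}%)")

    return problems
```

In practice, each check would be driven by the ranges and value lists documented in the Data Description Report rather than by hard-coded constants.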
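The substitution strategies just described (mode for nominal variables, median for ordinal, mean for continuous, a missing-value indicator, and distribution-based assignment) can be sketched as follows. This is a minimal illustration assuming a pandas DataFrame; the column names ('gender', 'satisfaction', 'income') are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Substitute missing values using the strategies described above.

    Column names ('gender', 'satisfaction', 'income') are hypothetical.
    """
    out = df.copy()

    # Flag missingness first: the fact that a value is missing may itself
    # be predictive (e.g., wealthy customers leaving income blank).
    out["income_was_missing"] = out["income"].isna()

    # Nominal variable: fill with the modal (most frequent) value.
    out["gender"] = out["gender"].fillna(out["gender"].mode().iloc[0])

    # Ordinal variable: fill with the median.
    out["satisfaction"] = out["satisfaction"].fillna(out["satisfaction"].median())

    # Continuous variable: fill with the mean.
    out["income"] = out["income"].fillna(out["income"].mean())
    return out

def fill_by_distribution(col: pd.Series) -> pd.Series:
    """Assign missing entries in proportion to the observed distribution,
    e.g. "female" 40% of the time for a 40/60 gender column."""
    out = col.copy()
    probs = col.dropna().value_counts(normalize=True)
    missing = out.isna()
    out[missing] = rng.choice(probs.index.to_numpy(),
                              size=int(missing.sum()),
                              p=probs.to_numpy())
    return out
```

A predictive model trained on the complete records, the last approach mentioned above, typically outperforms these simple substitutions but takes considerably more effort.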
Recognize that you will not be able to fix all the problems, so you will need to work around them as best you can. It is far preferable, and more cost-effective, to put procedures and checks in place that prevent data quality problems in the first place ("an ounce of prevention"). Usually, however, you must build the models you need with the data you have now, and prevention is something you work toward for the future.

e. Integration and consolidation. The data you need may reside in a single database or in multiple databases. The source databases may be transaction databases used by the operational systems of your company. Other data may be in data warehouses or data marts built for specific purposes. Still other data may reside in a proprietary database belonging to another company, such as a credit bureau.

Data integration and consolidation combines data from different sources into a single mining database, and requires reconciling differences in data values across those sources. Improperly reconciled data is a major source of quality problems, because there are often large differences in the way data are defined and used in different databases. Some inconsistencies are easy to uncover, such as different addresses for the same customer. What makes these problems harder to resolve is that they are often subtle. For example, the same customer may have different names or, worse, multiple customer identification numbers; the same name may be used for different entities (homonyms), or different names may be used for the same entity (synonyms). A simplified sketch of this kind of record matching follows.
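Reconciling "the same customer under different names or IDs" is essentially a record-matching problem. The sketch below is a deliberately simple illustration that matches on a normalized name-plus-address key; the column names are hypothetical.

```python
import pandas as pd

def consolidate_customers(sources: list[pd.DataFrame]) -> pd.DataFrame:
    """Combine customer tables from several source databases into one
    mining table, collapsing rows that appear to be the same person.

    Assumes each source has 'name' and 'address' columns; matching on a
    normalized name-plus-address key is a deliberate simplification.
    """
    combined = pd.concat(sources, ignore_index=True)

    # Normalize the matching fields: case, punctuation, extra whitespace.
    name = (combined["name"].str.lower()
            .str.replace(r"[^\w\s]", "", regex=True)
            .str.split().str.join(" "))
    address = combined["address"].str.lower().str.strip()
    combined["match_key"] = name + "|" + address

    # Keep one row per key, and record how many source rows merged into it
    # so conflicting field values can be reviewed later.
    merged = combined.drop_duplicates(subset="match_key", keep="first").copy()
    merged["source_rows"] = merged["match_key"].map(
        combined["match_key"].value_counts())
    return merged
```

Real consolidation also needs survivorship rules (deciding which source wins when merged rows disagree) and approximate string matching to catch misspellings, nicknames, and synonyms.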