Another approach to handling missing values is imputation. You try to guess the missing value by one of several techniques, so that you do not have to discard records or attributes that might also contain valuable information. Some imputation techniques, in order of increasing sophistication, are:

1. Fill in a random value taken from the other records.
2. Take the mode, median, or average of the attribute from the other records.
3. Make a statistical model of the distribution of the value in the other records and randomly choose a value according to that distribution.
4. Try to predict the missing value with statistical or mining techniques from the values found in similar records.

The last technique requires more work and also carries the risk of self-reinforcing results, because the imputed values are themselves produced by a model and are then treated as correct data when we build the final model. This may introduce a bias that defeats an important aspect of data mining: that information is generated from the data only, without any assumptions.
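The first three imputation techniques described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular product: the function name impute and its strategy parameter are hypothetical, missing values are represented as None, and the "distribution" strategy assumes that a normal distribution is a reasonable model for the attribute.

```python
import random
import statistics


def impute(values, strategy="median", rng=None):
    """Return a copy of `values` in which every None is replaced by a guess.

    strategy is one of "random", "mode", "median", "mean", "distribution".
    All names here are hypothetical and chosen for illustration only.
    """
    rng = rng or random.Random()
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")

    if strategy == "random":
        # Technique 1: copy a randomly chosen value from another record.
        def fill():
            return rng.choice(known)
    elif strategy in ("mode", "median", "mean"):
        # Technique 2: one summary statistic replaces every missing value.
        constant = getattr(statistics, strategy)(known)

        def fill():
            return constant
    elif strategy == "distribution":
        # Technique 3: draw from a fitted model of the attribute.
        # Assumption: the attribute is roughly normally distributed.
        mu = statistics.mean(known)
        sigma = statistics.stdev(known) if len(known) > 1 else 0.0

        def fill():
            return rng.gauss(mu, sigma)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [fill() if v is None else v for v in values]


if __name__ == "__main__":
    ages = [30, None, 40, 50, None, 40, 70]
    for name in ("random", "mode", "median", "mean", "distribution"):
        print(name, impute(ages, name, rng=random.Random(1)))
```

The fourth technique is left out on purpose: predicting the missing value from similar records means building a model, which is a small mining task in its own right.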
26 Intelligent Miner for Data: Enhance Your Business Intelligence

2.4.4.2 Data Manipulation

Some techniques require normalization of the data, where the values found for a certain attribute are rescaled so that they have an average of 0 and a standard deviation of 1, matching the standard normal distribution in those two respects.

Sometimes the available attributes show a high degree of intercorrelation, meaning that the same information is present in several attributes. To prevent this information from dominating the model, we can use several techniques for dimension reduction. These techniques try to reduce the number of attributes to the minimum that still contains the original information. They can also speed up the data mining process because of the reduced number of attributes.

One of the data mining techniques sometimes used in the data preparation step is clustering. It is described in more detail in 4.6.3, “Clustering” on page 58. The reason for using clustering is that it splits up the data into more or less homogeneous groups. When these groups are very different, it might be better not to handle them in one model, but to build a model for each separate group. When we apply clustering in this step, it might become a small-scale data mining process in itself.

The last important action in data preparation is to split the records into a training set and a test set. This ensures that, in the end, we have data available that was not used to build the model. This data is used to validate the model. Validation guards against overfitting, which means that the model is fitted so closely to the training set that it is not general enough to handle records outside that set of data. Sometimes cross-validation is used. In that case, we also treat the sets the other way around, building a model on the test set and testing it on the training set. This can be extended to n-fold cross-validation, which uses multiple sets and therefore multiple models, each tested on records that were not used to build it.
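Two of the preparation steps above, normalization and splitting the records for n-fold cross-validation, can be sketched in Python as follows. The function names standardize and n_fold_splits are hypothetical. The split follows the common convention in which each set serves once as the test set while the model is built on all remaining sets, which is an assumption here and only one possible reading of the description above.

```python
import random
import statistics


def standardize(values):
    """Rescale values so that they have mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mu) / sigma for v in values]


def n_fold_splits(records, n, seed=0):
    """Yield (training, test) pairs, one pair per fold.

    Assumption: each fold is the test set exactly once, and the
    training set is the union of the other n - 1 folds.
    """
    if not 2 <= n <= len(records):
        raise ValueError("n must be between 2 and the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # avoid any ordering in the source data
    folds = [shuffled[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, test


if __name__ == "__main__":
    print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
    for training, test in n_fold_splits(list(range(10)), n=5):
        print("train:", sorted(training), "test:", sorted(test))
```

With n set to 2, the generator produces the simple case described above, in which the two sets swap roles. Shuffling with a fixed seed keeps the split reproducible, so a model can be rebuilt and validated on the same sets later.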
