# CH11 - STAT1303A Data Management 11 Data Preparation 11...


HKU STAT1303A Data Management (2009-10, Semester 1)

## 11 Data Preparation

Also, we attempt to replace the missing values by the sample mean. This is related to the issue of data imputation, and this chapter discusses more advanced methods of data imputation. Furthermore, in certain applications transformed data are more suitable for analysis, so some transformation methods are also discussed in this chapter. Finally, random sampling is an important technique in statistical analysis. In particular, we may want to know the sampling distribution of a test statistic, and random sampling from the raw data set can be used to derive that distribution. Here, we introduce the method of simple random sampling for this purpose.

### 11.1 Data Imputation

Data imputation replaces missing values by other, non-missing values. In some analysis models, observations with missing values in one or more variables might not be usable. For example, a multiple regression model requires no missing values in any explanatory variable. Another example is principal component analysis, which requires no missing values in the data. To handle missing values, we may have used the following methods:

1. Ignore those observations with missing values. This method is not very effective because the effective number of observations is reduced. It is especially poor for data analysis with many variables. Consider a case of 20 variables, each with 2% missing data: assuming the missing values occur independently across variables, only $(1 - 0.02)^{20} \approx 66.8\%$ of the observations have no missing values.

3. A global constant - replace all missing values by a single constant, e.g. "missing" or "unknown".

4. The variable mean - replace all missing values by the mean of the variable (or another summary statistic, e.g. the median or the mode).

5. The class mean - replace a missing value by the variable mean over all samples belonging to the same class as the observation with the missing value, i.e. divide the population into sub-populations before replacing missing values by the mean.

6. An advanced method - the missing values are determined by statistical modeling tools such as regression, inference-based tools, decision trees, etc.

7. Do nothing. Some data analysis methods, e.g. decision trees, allow missing values, so observations with missing data can still be used in these methods. For them, this may be the most natural way to deal with missing data.

Since some data analysis methods do not allow missing values, the data
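
As a rough illustration of methods 1, 3, 4 and 5 listed above, and of the 20-variable arithmetic in method 1, here is a minimal Python sketch using only the standard library. The toy data set `rows`, the use of `None` to mark a missing value, and all function names are assumptions made for this example; they do not come from the course notes.

```python
from collections import defaultdict
from statistics import mean

# Share of complete observations: 20 independent variables, each 2% missing.
complete_share = (1 - 0.02) ** 20
print(f"complete cases: {complete_share:.1%}")  # prints "complete cases: 66.8%"

# Hypothetical toy data: (class label, value); None marks a missing value.
rows = [("a", 1.0), ("a", None), ("a", 3.0),
        ("b", 10.0), ("b", None), ("b", 14.0)]

def drop_missing(rows):
    """Method 1: ignore observations with missing values."""
    return [(c, v) for c, v in rows if v is not None]

def impute_constant(rows, constant):
    """Method 3: replace every missing value by a global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def impute_mean(rows):
    """Method 4: replace every missing value by the variable mean."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def impute_class_mean(rows):
    """Method 5: replace a missing value by the mean of its own class.

    Raises KeyError if a class has no observed values at all.
    """
    observed = defaultdict(list)
    for c, v in rows:
        if v is not None:
            observed[c].append(v)
    class_means = {c: mean(vs) for c, vs in observed.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]
```

On this toy data the overall mean of the observed values is 7.0, while the class means are 2.0 for class `a` and 12.0 for class `b`. The class-mean method therefore fills each gap with a value closer to the other observations in the same sub-population.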