STAT1303A Data Management
11 Data Preparation
Previously, we attempted to replace missing values by the sample mean. This is
a simple form of data imputation. In this chapter, more advanced methods of
data imputation are discussed. Furthermore, in certain applications, transformed
data are more suitable for data analysis, so some transformation methods will
also be discussed in this chapter. Finally, random sampling is an important technique
in statistical analysis. In particular, we may want to know the sampling distribution
of a test statistic. Random sampling from the raw data set can then be used to
approximate that sampling distribution. Here, we introduce the method of simple random
sampling for this purpose.
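As a minimal sketch of this idea, the following Python code repeatedly draws simple random samples (without replacement) from a raw data set and records a statistic for each draw. The function name `sampling_distribution`, the choice of the sample mean as the statistic, and the data values are all illustrative assumptions, not part of the course material.

```python
import random
import statistics

def sampling_distribution(data, sample_size, n_draws,
                          statistic=statistics.mean, seed=None):
    """Approximate the sampling distribution of `statistic` by repeated
    simple random sampling (without replacement) from `data`."""
    rng = random.Random(seed)  # a seed makes the draws reproducible
    values = []
    for _ in range(n_draws):
        # Every subset of size `sample_size` is equally likely to be chosen.
        sample = rng.sample(data, sample_size)
        values.append(statistic(sample))
    return values

# Made-up raw data, for illustration only.
raw = [2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
draws = sampling_distribution(raw, sample_size=4, n_draws=1000, seed=1)
print(min(draws), statistics.mean(draws), max(draws))
```

The list `draws` can then be summarised or plotted as an empirical approximation to the sampling distribution of the chosen statistic.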
11.1 Data Imputation
Data imputation replaces missing values by some other non-missing values. In many
analysis models, observations with missing values in one or more variables might
not be usable. For example, a multiple regression model requires no missing values in
any explanatory variable. Similarly, a principal component analysis
requires no missing values in the data.
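To illustrate why such observations drop out of an analysis, here is a small Python sketch that keeps only the complete cases of a data set. It assumes that each observation is stored as a dictionary and that a missing value is coded as `None`. The variable names `y`, `x1`, `x2` and the data are hypothetical.

```python
def complete_cases(rows):
    """Keep only the observations (dicts) that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical data: one response y and two explanatory variables x1, x2.
rows = [
    {"y": 10.0, "x1": 1.2, "x2": 3.0},
    {"y": 12.5, "x1": None, "x2": 2.1},   # x1 missing
    {"y": 9.8, "x1": 0.7, "x2": None},    # x2 missing
    {"y": 11.1, "x1": 1.5, "x2": 2.8},
]
usable = complete_cases(rows)
print(len(usable), "of", len(rows), "observations are usable")
```

In this toy example only two of the four observations could enter a regression of `y` on `x1` and `x2`.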
To handle missing values, we may have used the following methods:
1. Ignore those observations with missing values. This method is not very
effective because the effective number of observations is reduced. It
is especially poor for data analysis with many variables. Consider a
case of 20 variables, each with 2% missing data: assuming independence, only
about 0.98^20 ≈ 66.8% of the observations are without missing values.
3. A global constant - replace all missing values by a single constant, e.g.
4. The variable mean - replace all missing values by the mean of the variable (or
another summary statistic, e.g. the median or the mode).
5. The variable mean for all samples belonging to the same class as the
observation with the missing value, i.e. divide the population into
sub-populations before replacing missing values by the mean.
6. A more advanced method. The missing values are determined by statistical modeling
tools such as regression, inference-based tools and decision trees.
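The simpler methods in the list above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of `None` as the missing-value marker, and the example data are assumptions made here. The first function reproduces the arithmetic for the case of 20 independent variables with 2% missing data each.

```python
import statistics

MISSING = None  # assumed marker for a missing value

def complete_fraction(n_vars, p_missing):
    """Expected fraction of observations with no missing value, assuming
    missingness is independent across the n_vars variables."""
    return (1 - p_missing) ** n_vars

def impute_constant(values, constant):
    """Replace every missing value by one global constant."""
    return [constant if v is MISSING else v for v in values]

def impute_mean(values):
    """Replace every missing value by the mean of the observed values."""
    observed = [v for v in values if v is not MISSING]
    fill = statistics.mean(observed)
    return [fill if v is MISSING else v for v in values]

def impute_class_mean(values, classes):
    """Replace each missing value by the mean of the observed values in the
    same class. A class with no observed value at all raises KeyError."""
    by_class = {}
    for v, c in zip(values, classes):
        if v is not MISSING:
            by_class.setdefault(c, []).append(v)
    class_means = {c: statistics.mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is MISSING else v
            for v, c in zip(values, classes)]

# Hypothetical data: heights with two missing values, classified by sex.
heights = [160.0, None, 170.0, 180.0, None, 176.0]
sex = ["F", "F", "F", "M", "M", "M"]

print(round(complete_fraction(20, 0.02), 4))  # 0.6676
print(impute_constant(heights, 0.0))          # [160.0, 0.0, 170.0, 180.0, 0.0, 176.0]
print(impute_mean(heights))                   # [160.0, 171.5, 170.0, 180.0, 171.5, 176.0]
print(impute_class_mean(heights, sex))        # [160.0, 165.0, 170.0, 180.0, 178.0, 176.0]
```

In this example the class mean gives different fill-in values for the two groups (165.0 and 178.0), whereas the overall mean assigns 171.5 to both missing values.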
HKU STAT1303A (2009-10, Semester 1)