STAT1303A Data Management
11. Data Preparation
11 Data Preparation
Also, we have attempted to replace missing values by the sample mean. This is
related to the issue of data imputation. In this chapter, more advanced methods of
data imputation are discussed. Furthermore, in certain applications, transformed
data are more suitable for data analysis, and so some transformation methods will be
discussed in this chapter. Finally, random sampling is an important technique
in statistical analysis. In particular, we may want to know the sampling distribution
of a test statistic. Random sampling from the raw data set can then be used to
derive that sampling distribution. Here, we introduce the method of simple random
sampling for this purpose.
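As a rough illustration of how repeated simple random sampling can approximate the sampling distribution of a statistic, consider the minimal sketch below. It is not the method developed later in the chapter: the function name sampling_distribution, its parameters, and the small data set are all hypothetical, and the statistic is taken to be the sample mean purely as an example.

```python
import random
from statistics import mean


def sampling_distribution(data, sample_size, statistic=mean, n_draws=1000, seed=0):
    """Approximate the sampling distribution of `statistic`.

    Repeatedly draws a simple random sample (without replacement) of
    `sample_size` observations from `data` and evaluates `statistic` on it.
    The name and all parameters are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [statistic(rng.sample(data, sample_size)) for _ in range(n_draws)]


# Made-up raw data, for illustration only.
raw = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]

draws = sampling_distribution(raw, sample_size=5)
print("number of draws:", len(draws))
print("smallest and largest sample mean:", min(draws), max(draws))
```

The list draws can then be summarised (for example by a histogram or by empirical quantiles) to describe how the statistic varies from sample to sample.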
11.1 Data Imputation
Data imputation replaces missing values by some other nonmissing values.
In some data analysis models, observations with missing values in one or more variables
might not be useful. For example, a multiple regression model requires no missing values in
any explanatory variable. Another example is principal component analysis, which
requires no missing values in the data.
To handle missing values, we may have used the following methods:
1. Ignore those observations with missing values. This method is not very
effective because the effective number of observations is reduced. This method
is especially poor for data analysis with many variables. Consider a
case of 20 variables, each with 2% missing data. Assuming the missing values occur
independently across variables, only $(1 - 0.02)^{20} \approx 66.8\%$ of the
observations have no missing values.
3. A global constant: replace all missing values by a single constant, e.g.
"missing" or "unknown".
4. The variable mean: replace all missing values by the mean of the variable (or
another summary statistic, e.g. the median or the mode).
5. The variable mean for all samples belonging to the same class as the
observation with the missing value, i.e. divide the population into
subpopulations before replacing missing values by the subpopulation mean.
6. A more advanced method. The missing values are determined by statistical modeling
tools such as regression, inference-based tools and decision trees.
HKU STAT1303A (2009-10, Semester 1)
7. Do nothing. Some data analysis methods allow missing values, e.g. decision
trees. Observations with missing data may be used in these methods. This
may be the most natural way to deal with missing data for these methods.
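The simpler methods in the list above (ignoring incomplete observations, a global constant, the variable mean, and the class mean) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: missing values are represented by None, each observation is a dictionary, and every function name, column name and data value below is hypothetical rather than taken from the course material. The first function also reproduces the 20-variable calculation from method 1.

```python
from statistics import mean


def complete_case_fraction(missing_rate, n_variables):
    """Expected fraction of observations with no missing value, assuming each
    variable is missing independently at the same rate."""
    return (1 - missing_rate) ** n_variables


def drop_incomplete(rows):
    """Method 1: keep only observations with no missing (None) value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fill_constant(rows, column, constant="missing"):
    """Method 3: replace missing values in `column` by one global constant."""
    return [{**row, column: constant if row[column] is None else row[column]}
            for row in rows]


def fill_mean(rows, column):
    """Method 4: replace missing values by the mean of the observed values."""
    overall = mean(row[column] for row in rows if row[column] is not None)
    return [{**row, column: overall if row[column] is None else row[column]}
            for row in rows]


def fill_class_mean(rows, column, class_column):
    """Method 5: replace missing values by the mean within the same class.

    Assumes every class has at least one observed value in `column`.
    """
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row[class_column], []).append(row[column])
    class_means = {cls: mean(values) for cls, values in observed.items()}
    return [{**row, column: class_means[row[class_column]]
             if row[column] is None else row[column]}
            for row in rows]


# Made-up observations, for illustration only; None marks a missing value.
rows = [
    {"group": "A", "income": 10.0},
    {"group": "A", "income": None},
    {"group": "B", "income": 30.0},
    {"group": "B", "income": 50.0},
    {"group": "B", "income": None},
]

print(round(complete_case_fraction(0.02, 20), 3))  # 0.668, as in method 1
print(len(drop_incomplete(rows)))                  # 3 complete observations
print(fill_mean(rows, "income")[1]["income"])      # 30.0, the overall mean
print(fill_class_mean(rows, "income", "group")[4]["income"])  # 40.0, mean of class B
```

Each function returns a new list and leaves the input untouched, so the different methods can be compared side by side on the same raw data.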
Since some data analysis methods do not allow missing values, the data