Missing Data
This discussion borrows heavily from:
Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, by Jacob and
Patricia Cohen (1975 edition). The 2003 edition of Cohen and Cohen’s book is also used a little.
Paul Allison’s Sage Monograph on Missing Data (Sage paper # 136, 2002).
Newman, Daniel A. 2003. Longitudinal Modeling with Randomly and Systematically Missing
Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques.
Organizational Research Methods, Vol. 6 No. 3, July 2003 pp. 328-362.
Patrick Royston’s series of articles in volumes 4 and 5 of The Stata Journal on multiple
imputation. See especially Royston, Patrick. 2005. Multiple Imputation of Missing Values:
Update. The Stata Journal Vol. 5 No. 2, pp. 188-201.
Also, Stata 11 has its own built-in commands for multiple imputation. If you have Stata 11, the
entire MI manual is available as a PDF file.
Often, part or all of the data are missing for a subject. This handout describes the various
types of missing data and common methods for handling them. The readings can help you with the
more advanced methods.
I. Types of missing data.
There are several useful distinctions we can make.
Dependent versus independent variables. Most methods involve missing values for IVs,
although in recent years methods for dealing with missing data in the dependent variable have
been developed.
Random versus selective loss of data. A researcher must ask why the data are missing. In
some cases the loss is completely at random (MCAR), i.e. the absence of values on an IV is
unrelated to Y or other IVs. Also, as Allison notes (p. 4), “Data on Y are said to be missing at
random (MAR) if the probability of missing data on Y is unrelated to the value of Y, after
controlling for other variables in the analysis… For example, the MAR assumption would be
satisfied if the probability of missing data on income depended on a person’s marital status,
but within each marital status category, the probability of missing income was unrelated to
income.” Unfortunately, in survey research, the loss often is not random. Refusal or inability
to respond may be correlated with such things as education, income, interest in the subject,
geographic location, etc. Selective loss of data is much more problematic than random loss.
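To make the MCAR/MAR distinction concrete, here is a minimal simulation sketch in Python. It is not taken from the readings. It mirrors Allison's income and marital status example, but every number in it (group means, standard deviation, sample size, missingness rates) is a made-up assumption, and the function names are purely illustrative.

```python
import random
import statistics

random.seed(12345)  # fixed seed so the run is repeatable

# Hypothetical population (all numbers are illustrative assumptions):
# about half are married, and married people earn more on average.
people = []
for _ in range(20000):
    married = random.random() < 0.5
    income = random.gauss(60000 if married else 40000, 10000)
    people.append((married, income))


def drop_mcar(rows, p_missing=0.3):
    """MCAR: every case has the same chance of losing its income value."""
    return [(m, y) for m, y in rows if random.random() >= p_missing]


def drop_mar(rows, p_married=0.1, p_single=0.5):
    """MAR: the chance of missing income depends on marital status only,
    not on income itself within a marital status category."""
    return [(m, y) for m, y in rows
            if random.random() >= (p_married if m else p_single)]


def mean_income(rows, married=None):
    """Mean observed income, optionally within one marital status group."""
    values = [y for m, y in rows if married is None or m == married]
    return statistics.mean(values)


mcar = drop_mcar(people)
mar = drop_mar(people)

full_mean = mean_income(people)
mcar_mean = mean_income(mcar)
mar_mean = mean_income(mar)

print("Mean income, all cases:      ", round(full_mean))
print("Mean income, observed (MCAR):", round(mcar_mean))
print("Mean income, observed (MAR): ", round(mar_mean))
for status, label in [(True, "married"), (False, "unmarried")]:
    print(label, "- all cases:", round(mean_income(people, status)),
          " observed (MAR):", round(mean_income(mar, status)))
```

Under these assumptions, the MCAR mean of the observed cases should stay close to the full-sample mean. The MAR mean of the observed cases is pulled toward the married group, because unmarried respondents are missing more often. Within each marital status category, though, the observed means should still be close to the full-sample means, which is what the MAR assumption requires.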
Missing by design; or, not asked or not applicable. These are special cases of random
versus selective loss of data. Sometimes data are missing because the researcher deliberately
did not ask the question of that particular respondent. For economic reasons, some questions
might only be asked of a random subsample of the entire sample. For example, prior to 2010
there was a “short” version of the census (answered by everyone) and a “long” version that
was only answered by 20%. This can be treated the same as a random loss of data, keeping in
mind that the loss may be very high.
Other times,