Data Acquisition
Overview
GIGO (garbage in, garbage out) is a core principle in statistics. No amount of sophisticated
data analysis can compensate for botched data acquisition.
Successful data acquisition consists of three interrelated activities:
1. deciding what quantities should be measured,
2. specifying exactly how these quantities should be measured, and, finally,
3. collecting the data in a manner that supports optimal statistical analysis.
Let’s consider each in turn.
1. Deciding what to measure is outside the scope of this course because it requires
subject-matter expertise. For example, deciding what variables need to be measured to
successfully characterize a chemical process requires the expertise of a chemist and/or
chemical engineer.
2. Similarly, specifying exactly how the quantities of interest should be measured requires
subject-matter expertise. However, since the measurements should be done so as to
maximize accuracy while minimizing variability, and these are statistical considerations,
statistical expertise is also involved. The exact specification of how a quantity should be
measured is called an operational definition. The author discusses these in section 4.1
and provides several examples.
3. Finally, collecting the data in a way that supports optimal statistical analysis requires
knowledge of the statistical methods to be used. Since all the statistical methods in
this course require that the data satisfy, in some form, the IID (independent and
identically distributed) assumption, acquiring data so as to satisfy this assumption is
the primary goal.
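The IID assumption mentioned in item 3 can be written out formally. What follows is a standard textbook statement; the notation (observations X_1, ..., X_n and common distribution function F) is an assumption introduced here and is not taken from the course text.

```latex
X_1, X_2, \dots, X_n \ \text{are IID with common distribution function } F
\iff
P(X_1 \le x_1, \dots, X_n \le x_n) = \prod_{i=1}^{n} F(x_i)
\quad \text{for all } x_1, \dots, x_n .
```

Independence is what lets the joint probability factor into a product; identical distribution is what makes every factor use the same F.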
There are two basic data acquisition scenarios: sampling a population and sampling a
process. We will discuss sampling a population first, the topic of section 4.2.
Sampling Populations
What is the goal of sampling a population? Suppose we want to describe some population,
that is, we want to make quantitative statements about it. For example, suppose we want to
determine what fraction of IU students are not from Indiana, i.e., we want to say something
like “10% of IU students are not from Indiana.” The ideal way to do this would be to examine
each student’s records and determine if their home address is in Indiana. Suppose that, due to
time, money, or other constraints, this is not possible. In this case we are forced to make our
statement about the IU students (the population) based on examining a fraction (sample)
of all the students. Since in most cases the sample is a minuscule fraction of the population,
making statements (inferences) about a population based on a sample is a risky, error-prone
endeavor.
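The risk described above can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the population size (40,000), the true out-of-state fraction (10%), the sample size (100), and the helper name sample_proportion are all hypothetical and not taken from the course text. It draws several simple random samples (without replacement) and reports the estimated fraction from each.

```python
import random


def sample_proportion(population, n, rng):
    """Estimate the fraction of True entries from a simple random sample of size n."""
    sample = rng.sample(population, n)  # simple random sample, without replacement
    return sum(sample) / n


# Hypothetical population: 40,000 students, 4,000 of them (10%) from out of state.
N = 40_000
population = [True] * 4_000 + [False] * 36_000

rng = random.Random(0)  # fixed seed so the run is repeatable

true_fraction = sum(population) / N

# Five independent samples of 100 students each; the estimates will generally differ.
estimates = [sample_proportion(population, 100, rng) for _ in range(5)]

print("true fraction:", true_fraction)
print("sample estimates:", estimates)
```

Each run of sample_proportion examines only 0.25% of the hypothetical population, so the estimates scatter around the true value rather than reproducing it. Sampling the entire population (n equal to N) recovers the true fraction exactly, which corresponds to the ideal of examining every student's records.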
An interesting example of the danger of inferences based on sample data is provided by
recent developments in the dating of the Shroud of Turin, a 4 m linen cloth thought by many
to be the burial shroud of Jesus of Nazareth. In 1988 a sample of cloth was taken from the
Shroud and carbon-dated by three different laboratories. The labs gave consistent results and
the Shroud was dated to about 1260–1390 AD. This showed that the Shroud was not the
burial shroud of Jesus of Nazareth but instead was another medieval forgery. (There were
 Spring '08
 DeVasher
 Statistics, Simple random sample
