3.1 DATA DESCRIPTION
In the previous chapter, we discussed how to gather data intelligently for a designed
experiment or an observational study, which is Step 2 in learning from data. We turn now
to Step 3: classification, summarizing, and presentation of the data.
As noted in Chapter 1, the field of statistics can be divided into two major
branches: descriptive statistics and inferential statistics. In both branches, we work with a
set of measurements.
For situations in which data description is our major objective, the set of measurements
available to us is frequently the entire population. For example, suppose that we wish to
describe the distribution of annual incomes for all families registered in the 2000 census.
Because all these data are recorded and are available on computer tapes, we do not need
to obtain a random sample from the population; the complete set of measurements is at
our disposal. Our major problem is in organizing, summarizing, and describing these
data, that is, in making sense of them. Good descriptive statistics enable us to do so
by reducing a large set of measurements to a few summary measures that provide a good,
rough picture of the original measurements.
In situations in which we are concerned with statistical inference, a sample is usually the
only set of measurements available to us. We use information in the sample to draw
conclusions about the population from which the sample was drawn. Of course, in the
process of making inferences, we also need to organize, summarize, and describe the
sample data. For example, a company is interested in determining the proportion of
packages, out of the total production of a certain drug, that are improperly sealed or have
been damaged in transit. It would be impractical to inspect every package at every store
where the drug is sold, but a random sample of the production could be obtained, and the
proportion defective in the sample could be used to estimate the actual proportion of
improperly sealed or damaged packages.
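The estimation idea in the package example can be sketched in a few lines of Python. This
is only an illustration: the population size, the true defect rate, the sample size, and the
function name sample_proportion are hypothetical choices, not values from the text.

```python
import random

def sample_proportion(population, sample_size, seed=0):
    """Estimate the proportion of defective items from a simple random sample."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size

# Hypothetical production run: 1 marks a defective package, 0 a sound one.
# The assumed true proportion of defective packages is 300 / 10000 = 0.03.
population = [1] * 300 + [0] * 9700

estimate = sample_proportion(population, sample_size=500)
true_proportion = sum(population) / len(population)
print(f"sample estimate: {estimate:.3f}, true proportion: {true_proportion:.3f}")
```

In practice the true proportion is unknown; it is computed here only so that the sample
estimate can be compared with it.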
The objective of data description is to summarize the characteristics of a data set, identify
any patterns in the data, and present that information in a convenient form. When
describing a distribution of data, it is necessary to describe four things: (1) the center of
the distribution, (2) the spread of the distribution, (3) the shape of the distribution, and
(4) any unusual features in the distribution, such as extreme values (outliers and
influential points), ranges of values not represented, and concentrations of data.
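As a rough illustration of the four features listed above, the following Python sketch
computes one simple measure for each. The income figures are made up, the function name
describe is hypothetical, and two choices are assumptions not taken from the text: the
difference between mean and median is used as a crude hint about shape, and values more
than 1.5 interquartile ranges beyond the quartiles are flagged as possible outliers.

```python
import statistics

def describe(values):
    """Return rough measures of center, spread, shape, and unusual values."""
    data = sorted(values)
    mean = statistics.mean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)               # sample standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    # Assumed rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low or x > high]
    return {
        "center": {"mean": mean, "median": median},
        "spread": {"stdev": stdev, "iqr": iqr},
        # Crude shape hint: a positive value suggests a longer right tail.
        "shape": {"mean_minus_median": mean - median},
        "unusual": {"outliers": outliers},
    }

# Hypothetical annual family incomes, in thousands of dollars.
incomes = [32, 35, 38, 41, 44, 47, 52, 58, 61, 250]
summary = describe(incomes)
for feature, measures in summary.items():
    print(feature, measures)
```

With these made-up incomes, the single very large value is the only one flagged, and it
pulls the mean well above the median.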
In this chapter we will show how to construct charts, graphs, and tables that convey the
nature of a data set. The procedure that we will use to accomplish this objective in a
particular situation depends on the type of data, qualitative or quantitative, that we want to
describe, and the number of variables measured. In this chapter we also present numerical