
Springer Texts in Statistics
Advisors: George Casella, Stephen Fienberg, Ingram Olkin
Alfred: Elements of Statistics for the Life and Social Sciences
Berger: Introduction to Probability and Stochastic Processes, Second Edition
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Carmona: Statistical Analysis of Financial Data in S-Plus
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Second Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modeling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Heiberger and Holland: Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
(Continued after index)

Richard M. Heiberger
Burt Holland

Statistical Analysis and Data Display
An Intermediate Course with Examples in S-PLUS, R, and SAS

With 200 Figures

Springer

Richard M. Heiberger
Department of Statistics
Temple University
Philadelphia, PA 19122
USA
[email protected]

Burt Holland
Department of Statistics
Temple University
Philadelphia, PA 19122
USA
[email protected]

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Cover illustration: Cover art is a variation of Figure 14.14d. The data source is (Williams, 2001).
Cygwin: Copyright © 1996, 1998, 2001, 2003 Free Software Foundation, Inc.
EMACS: Copyright © 1989, 1991 Free Software Foundation, Inc.
Excel: Copyright © 1985-1999, Microsoft Corp.
Ghostscript: Copyright © 1994, 1995, 1997, 1998, 1999, 2000 Aladdin Enterprises, Menlo Park, California, U.S.A. All rights reserved.
GSview: Copyright © 1993-2001 Ghostgum Software Pty Ltd.
Internet Explorer: Copyright © 1995-2001 Microsoft Corp.
Linux: Copyright © 2004, Eklektix, Inc.
LogXact: Copyright © Cytel Software Corporation
MathType: Copyright © 1990-1999 Design Science, Inc.
Microsoft Windows: Copyright © 1981-2001 Microsoft Corp.
MiKTeX: Copyright © 1999 Christian Schenk
MS-DOS: Copyright © 1985-2001 Microsoft Corp.
MS-Word: Copyright © 1983-1999, Microsoft Corp.
PostScript: Copyright © Adobe Systems Incorporated
R: Copyright © 2002, The R Development Core Team
SAS: Copyright © 2002 by SAS Institute Inc., Cary, NC, USA.
sas.library/code/ischeffe.sas: copyright holder unknown.
S-Plus: Copyright © 1988, 2002 Insightful Corp.
Stata: Copyright © 1984-2002 Stata Corp.
TeX is a trademark of the American Mathematical Society.
Unix: Copyright © 1998 The Open Group
Windows XP: Copyright © 2001 Microsoft Corporation. All rights reserved.
XLISP-STAT 2.1 Copyright © 1990, by Luke Tierney

ISBN 978-1-4419-2320-2
ISBN 978-1-4757-4284-8 (eBook)
DOI 10.1007/978-1-4757-4284-8

© 2004 Springer Science+Business Media New York
Originally published by Springer Science+Business Media Inc. in 2004.
Softcover reprint of the hardcover 1st edition 2004
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher
Springer Science+Business Media, LLC, except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such,
is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1
SPIN 10935286

In loving memory of Mary Morris Heiberger
To my family: Margaret, Irene, Andrew, and Ben Holland

Preface

1 Audience
Students seeking master's degrees in applied statistics in the late 1960s and
1970s typically took a year-long sequence in statistical methods. Popular
choices of course textbook in that period, before high-speed computing and graphics capability became available, were those authored by Snedecor
and Cochran, and Steel and Torrie.
By 1980, the topical coverage in these classics failed to include a great many
new and important elementary techniques in the data analyst's toolkit. In
order to teach the statistical methods sequence with adequate coverage of
topics, it became necessary to draw material from each of four or five text
sources. Obviously, such a situation makes life difficult for both students
and instructors. In addition, statistics students need to become proficient
with at least one high-quality statistical software package.
This book can serve as a standalone text for a contemporary year-long
course in statistical methods at a level appropriate for statistics majors at
the master's level or other quantitatively oriented disciplines at the doctoral
level. The topics include both concepts and techniques developed many
years ago and a variety of newer tools not commonly found in textbooks.
This text requires some previous study of mathematics and statistics. We
suggest some basic understanding of calculus, including maximization or
minimization of functions of one or two variables, and the ability to undertake definite integrations of elementary functions. We recommend acquired
knowledge from an earlier statistics course, including a basic understanding of statistical measures, probability distributions, interval estimation, and
hypothesis testing.

2 Structure
The book is organized around statistical topics. Each chapter introduces
concepts and terminology, develops the rationale for its methods, presents
the mathematics and calculations for its methods, and gives examples supported by graphics and computer output, culminating in a writeup of
conclusions. Some chapters have greater detail of presentation than others,
based on our personal interests and expertise.
Our emphasis on graphical display of data is a distinguishing characteristic of this book. Many of our graphical displays appear here
for the first time. Appendix G summarizes those new graphs that are
based on Cartesian products. We show graphs, how to construct and
interpret them, and how they relate to the tabular outputs that appear automatically when a statistical program "analyzes" a data set.
The graphs are not automatic and so must be requested. Gaining
an understanding of a data set is always more easily accomplished
by looking at appropriately drawn graphs than by examining tabular summaries. In our opinion, graphs are the heart of most statistical
analyses; the corresponding tabular results are formal confirmations
of our visual impressions. We advanced this point of view in seminars and presentations ((Heiberger, 1998), (Heiberger and Holland, 2002),
(Heiberger and Holland, 2003b), and (Heiberger and Holland, 2003a)) and
so have others, for example (Gelman et al., 2002). A vivid demonstration
of it appears in Section 4.2.
We have chosen to work with both of what we believe are the two leading
statistical languages available today: S (available as both S-PLUS and R),
and SAS. S is an exceptionally well-developed tool for statistical research
and analysis, that is, for exploring and designing new techniques of analysis
as well as for applying them. S is especially strong for statistical graphics, the
output of data analysis through which both the raw data and the results
are displayed for the analyst and the client. SAS is the most widely used
package for serious and extensive statistical analysis and data management.
Because of our heavy use of graphics as an essential part of most analyses,
we make somewhat heavier use of S than SAS. We frequently mention
the package name S-PLUS, rather than the language name S, in situations
where S-PLUS and R could equally well be used.
Although we do not explicitly teach S-PLUS or SAS, we make the reader
aware of their powerful capabilities by using them to perform the data analyses we present. Sections B.5 and C.4 contain our currently recommended
references for learning S-PLUS and SAS. All S-PLUS and SAS code used in
the book appears in the companion online files that readers are expected to
download from the Springer website (see Preface Section 3). We anticipate
that readers will wish to adapt our code to their own data analyses. The
code files used to produce the book's numerous graphs are identified alongside each graph. Readers are encouraged to examine these code files in the
online files in order to gain full understanding of what has been plotted.
We believe that a firm control of the language gives the analyst the tools
to think about the ideal way to detect and display the information in the
data. We focus our presentation on the written command languages, the
most flexible descriptors of the statistical techniques. The written languages
provide the opportunity for growth and understanding of the underlying
techniques. The point-and-click technology is convenient for routine tasks.
However, many interesting data analyses are not routine and therefore cannot be accomplished by pointing and clicking the icons provided by the
program developers.

3 Data and Programs
The data for all examples and exercises in this book, and the sample code in both languages [S (meaning S-PLUS and R) and SAS] for
all examples and figures, are provided on the accompanying online files
(Heiberger and Holland, 2004b). Occasionally we produce listing (output)
files that are too big to include in this text. In such situations we place
the complete file in the online files and only excerpts in the text. (The
collection of directories and files in the online files is distributed from the
Springer web page . com as a downloadable zipped
file. Search for "Heiberger Holland". We recommend that readers burn a
CD of the unzipped directories for reference and copy the entire directory
structure to their hard disk for use. See the file README.HH on the website
for details.)
The filename in the online files is given in the text for every code fragment,
function, and macro presented. The code and the PostScript file for every
figure in the text are in the online files. Transcripts (*.st files for S-PLUS,
and *.lst and occasionally *.log files for SAS) are included for code
fragments that produce printed output.

The directories are structured by chapter, with three subdirectories for
each.
chapter/code/
chapter/transcript/
chapter/figure/
The filename is indicated at the time the example is presented.
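
As a rough illustration of this layout, the following R sketch composes the path to a file from a chapter directory, one of the three subdirectories, and a filename. The function name `online_file_path`, the argument `root`, and the example chapter and file names are hypothetical placeholders, not names taken from the online files.

```r
# Hypothetical helper (not part of the book's libraries): build the path to
# a file stored under the chapter/code, chapter/transcript, or
# chapter/figure layout described above.
online_file_path <- function(chapter,
                             kind = c("code", "transcript", "figure"),
                             filename,
                             root = ".") {
  # Restrict `kind` to the three subdirectories named in the text.
  kind <- match.arg(kind)
  # file.path() joins the components with "/" on every platform.
  file.path(root, chapter, kind, filename)
}

# Example with placeholder names: a figure file for some chapter directory.
online_file_path("mychapter", "figure", "plot.ps", root = "hh")
```

The `root` argument stands in for wherever the unzipped directory structure was copied on the reader's own computer.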
In addition, there are several directories not associated with specific
chapters.
datasets/
splus.library/
sas.library/
software/
All datasets are in the datasets directory. The splus.library and
sas.library directories contain general utilities and new analysis and display functions. All our code and examples assume that these libraries are
attached.
In S-PLUS and R, the libraries are attached by running the .First function
described in Appendix B. The .First function must be customized for the
individual computer.
In SAS, the macros are made available by running the file hh.sas described
in Appendix C. The hh.sas file must be customized for the individual
computer.
Both customizations are simple and these are the only customizations required. All our functions and input statements are defined relative to the
paths defined in these customizations. Once these customizations have been
made, all examples in the book work as written, with no changes.

4 Software

We include in the Software Appendix A and the (sftw/code/url.htm) file
the URLs to the software we recommend:
• S-PLUS, Insightful's implementation of the S language
• R, the GNU-licensed implementation of the S language
• SAS
• Ghostscript/Ghostview for displaying PostScript graphs
• Emacs, the extensible text editor from the Free Software Foundation
• ESS (Emacs Speaks Statistics), an intelligent environment for statistical
analysis
• Springer, the online files for this book are distributed from the Springer
website
• LaTeX. We wrote this book in LaTeX (Lamport, 1986), the best mathematical typesetting software (and the one required by several statistics
journals), so we provide the URL for that as well.

4.1 Microsoft Windows

We include URLs for
• Cygwin, an implementation of the Unix shell and other user tools for
Microsoft Windows
• Standalone utilities (gunzip, gzip, tar) that work in the MS-DOS
prompt window
• gnuserv and ispell, utilities that work with Emacs
• MathType fonts, for improved appearance of mathematics written in
Microsoft Word.

4.2 Unix

Most of the software listed above is distributed as part of Unix systems
and is probably already available on the Unix system you are using. The
statistical programs S-PLUS, R, and SAS, and the ESS interface between
Emacs and the statistical software will be needed.

5 Exercises
Learning requires that the student work a fair selection of the exercises
provided, using, where appropriate, one of the statistical software packages we discuss. Beginning with the exercises in Chapter 5, even when not
specifically asked to do so, the student should routinely plot the data in
a way that illuminates its structure, and state all assumptions made and
discuss their reasonableness.

Acknowledgments
We are indebted to many people for providing us advice, comments and
assistance with this project. Among them are our editor John Kimmel
and the production staff at Springer, our colleagues Francis Hsuan and
Byron Jones, our current and former students (particularly Paolo Teles
who coauthored the paper on which Chapter 18 is based, Kenneth Swartz,
and Yuo Guo), and Sara R. Heiberger. Each of us gratefully acknowledges the
support of a study leave from Temple University. We are also grateful to
Insightful Corp. for providing us with current copies of S-PLUS software
for ourselves and our students, and to the many professionals who reviewed
portions of early drafts of this manuscript.

Contents

Preface
  1 Audience
  2 Structure
  3 Data and Programs
  4 Software
    4.1 Microsoft Windows
    4.2 Unix
  5 Exercises

1 Introduction and Motivation
  1.1 Statistics in Context
  1.2 Examples of Uses of Statistics
    1.2.1 Investigation of Salary Discrimination
    1.2.2 Measuring Body Fat
    1.2.3 Minimizing Film Thickness
    1.2.4 Surveys
    1.2.5 Bringing Pharmaceutical Products to Market
  1.3 The Rest of the Book
    1.3.1 Fundamentals
    1.3.2 Linear Models
    1.3.3 Other Techniques
    1.3.4 New Graphical Display Techniques

2 Data and Statistics
  2.1 Types of Data
  2.2 Data Display and Calculation
    2.2.1 Presentation
    2.2.2 Rounding
  2.3 Importing Data
    2.3.1 S-PLUS
    2.3.2 SAS
    2.3.3 Data Rearrangement
  2.4 Analysis with Missing Data
    2.4.1 Missing Data in S-PLUS
    2.4.2 Missing Data in SAS
  2.5 Tables and Graphs
  2.6 Files for Statistical Analysis and Data Display (HH)
    2.6.1 Datasets
    2.6.2 Code, Transcripts, and Figures
    2.6.3 Functions and Macros
    2.6.4 Software

3 Statistics Concepts
  3.1 A Brief Introduction to Probability
  3.2 Random Variables and Probability Distributions
    3.2.1 Discrete Versus Continuous Probability Distributions
    3.2.2 Displaying Probability Distributions
  3.3 Concepts That Are Used When Discussing Distributions
    3.3.1 Expectation and Variance of Random Variables
    3.3.2 Median of Random Variables
    3.3.3 Symmetric and Skewed Distributions
    3.3.4 Displays of Univariate Data
    3.3.5 Multivariate Distributions: Covariance and Correlation
  3.4 Three Probability Distributions
    3.4.1 The Binomial Distribution
    3.4.2 The Normal Distribution
    3.4.3 The (Student's) t Distribution
  3.5 Sampling Distributions
  3.6 Estimation
    3.6.1 Statistical Models
    3.6.2 Point and Interval Estimators
    3.6.3 Criteria for Point Estimators
    3.6.4 Confidence Interval Estimation
    3.6.5 Example: Confidence Interval on the Mean μ of a Population Having Known Standard Deviation
    3.6.6 Example: One-Sided Confidence Intervals
  3.7 Hypothesis Testing
  3.8 Examples of Statistical Tests
  3.9 Power and Operating Characteristic (O.C.) Curves
  3.10 Sampling
    3.10.1 Simple Random Sampling
    3.10.2 Stratified Random Sampling
    3.10.3 Cluster Random Sampling
    3.10.4 Systematic Random Sampling
    3.10.5 Standard Errors of Sample Means
    3.10.6 Sources of Bias in Samples
  3.11 Exercises

4 Graphs
  4.1 Definition
  4.2 Example: Ecological Correlation
  4.3 Scatterplots
  4.4 Scatterplot Matrix
  4.5 Example: Life Expectancy
  4.6 Scatterplot Matrices, Continued
  4.7 Data Transformations
  4.8 Life Expectancy Example, Continued
  4.9 SAS Graphics
  4.10 Exercises

5 Introductory Inference
  5.1 Normal (z) Intervals and Tests
    5.1.1 Test of a Hypothesis Concerning the Mean of a Population Having Known Standard Deviation
    5.1.2 Confidence Intervals for Unknown Population Proportion p
    5.1.3 Tests on an Unknown Population Proportion p
    5.1.4 Example: One-Sided Hypothesis Test Concerning a Population Proportion
  5.2 t-intervals and Tests for the Mean of a Population Having Unknown Standard Deviation
  5.3 Confidence Interval on the Variance or Standard Deviation of a Normal Population
  5.4 Comparisons of Two Populations Based on Independent Samples
    5.4.1 Confidence Intervals on the Difference Between Two Population Proportions
    5.4.2 Confidence Interval on the Difference Between Two Means
    5.4.3 Tests Comparing Two Population Means When the Samples Are Independent