
Multivariate Statistics
A practical approach

OTHER STATISTICS TEXTS FROM
CHAPMAN AND HALL
Applied Statistics
D.R. Cox and E.J. Snell
The Analysis of Time Series
C. Chatfield
Decision Analysis: A Bayesian Approach
J.Q. Smith
Statistics for Technology
C. Chatfield
Introduction to Multivariate Analysis
C. Chatfield and A.J. Collins
Introduction to Optimization Methods and their Application in Statistics
B.S. Everitt
An Introduction to Statistical Modelling
A.J. Dobson
Multivariate Analysis of Variance and Repeated Measures
D.J. Hand and C.C. Taylor
Multivariate Statistical Methods - a primer
Bryan F. Manly
Statistical Methods in Agriculture and Experimental Biology
R. Mead and R.N. Curnow
Elements of Simulation
B.J.T. Morgan
Essential Statistics
D.G. Rees
Applied Statistics: A Handbook of BMDP Analyses
E.J. Snell
Intermediate Statistical Methods
G.B. Wetherill
Foundations of Statistics
D.G. Rees
Probability: Methods and Measurement
A. O'Hagan
Elementary Applications of Probability Theory
H.C. Tuckwell

Further information on the complete range of Chapman and Hall
statistics books is available from the publishers.

Multivariate Statistics
A practical approach
Bernhard Flury
and Hans Riedwyl

London New York
CHAPMAN AND HALL

First published in 1988 by Chapman and Hall Ltd
11 New Fetter Lane, London EC4P 4EE
Published in the USA by Chapman and Hall
29 West 35th Street, New York NY 10001

© 1988 B. Flury and H. Riedwyl
Softcover reprint of the hardcover 1st edition 1988
ISBN-13:978-94-010-7041-6
This title is available in both hardbound and paperback editions. The
paperback edition is sold subject to the condition that it shall not, by way of
trade or otherwise, be lent, resold, hired out, or otherwise circulated without
the publisher's prior consent in any form of binding or cover other than that in
which it is published and without a similar condition including this condition
being imposed on the subsequent purchaser.
All rights reserved. No part of this book may be reprinted, or reproduced or
utilized in any form or by any electronic, mechanical or other means, now
known or hereafter invented, including photocopying and recording, or in any
information storage and retrieval system, without permission in writing from
the publisher.
British Library Cataloguing in Publication Data
Flury, Bernhard
Multivariate statistics: a practical
approach.
1. Multivariate analysis
I. Title II. Riedwyl, Hans
519.5'35
QA278
ISBN-13:978-94-010-7041-6
DOI: 10.1007/978-94-009-1217-5
e-ISBN-13:978-94-009-1217-5

Library of Congress Cataloging-in-Publication Data
Flury, Bernhard, 1951-
Multivariate statistics.
Rev. translation of: Angewandte multivariate
Statistik.
Bibliography: p.
Includes index.
1. Multivariate analysis. I. Riedwyl, Hans.
II. Title.
QA278.F58813
1988
519.5'35
87-18405
ISBN-13:978-94-010-7041-6

Contents

Preface

1 The data
  Discussion

2 Univariate plots and descriptive statistics
  Discussion
  Further study

3 Scatterplot, correlation and covariance
  Discussion
  Further study

4 Face plots
  Discussion
  Further study

5 Multiple linear regression
  5.1 Introductory remarks
  5.2 The model of multiple linear regression
  5.3 Least squares estimation
  5.4 Residual analysis
  5.5 Model building, analysis of variance
  5.6 The overall test of significance
  5.7 Coefficient of determination and multiple correlation
  5.8 Tests of partial hypotheses
  5.9 Standard errors of the regression coefficients
  5.10 Selection of a subset of regressors
  Discussion
  Further study

6 Linear combinations
  6.1 Introduction
  6.2 A special linear combination
  6.3 Linear combinations of two variables
  6.4 Linear combinations of several variables
  6.5 Mean and standard deviation of linear combinations
  Discussion
  Further study

7 Linear discriminant analysis for two groups
  7.1 Introduction
  7.2 Multivariate standard distance
  7.3 Relationship between discriminant analysis and multiple linear regression
  7.4 Testing hypotheses about the discriminant function
  7.5 Screening a discriminant function
  7.6 Further uses of the coefficient of determination
  7.7 Classification of observations
  Discussion
  Further study
  Examples

8 Identification analysis
  8.1 Introduction
  8.2 Identification analysis as a special case of discriminant analysis
  8.3 More about standard distance
  8.4 Identification of a bank note
  8.5 Analysis of outliers
  Discussion
  Further study
  Examples

9 Specification analysis
  9.1 Standard distance between a sample and a hypothetical mean vector
  9.2 Specification analysis of the bank notes
  9.3 Confidence regions for a mean vector
  9.4 A more general model
  9.5 Specification faces
  Further study
  Examples

10 Principal component analysis
  10.1 Introduction
  10.2 Principal components of two variables
  10.3 Properties of principal components in the multidimensional case
  10.4 Principal component analysis of the genuine bank notes
  10.5 The singular case
  10.6 Principal components, standard distance, and the multivariate normal distribution
  10.7 Standard errors of the principal component coefficients and related problems
  10.8 Principal component analysis of several groups
  Discussion
  Further study
  Examples

11 Comparing the covariance structures of two groups
  11.1 Introduction
  11.2 The bivariate case
  11.3 The multivariate case
  11.4 Comparison of the covariance matrices of the genuine and forged bank notes
  11.5 Partial statistics for the analysis of Ymax and Ymin
  11.6 Stepwise analysis of Ymax and Ymin
  11.7 Relationships to standard distance and principal component analysis
  11.8 Critical values of the distribution of Fmax and Fmin
  Discussion
  Further study
  Examples

12 Exercises
  12.1 Exercises based on the bank note data
  12.2 Additional exercises

13 Mathematical appendix
  13.1 Introduction and preliminaries
  13.2 Data matrix, mean vector, covariance and correlation
  13.3 Multiple linear regression
  13.4 Linear combinations
  13.5 Multivariate standard distance and the linear discriminant function
  13.6 Principal component analysis
  13.7 Comparison of two covariance matrices

References

Index

Preface

During the last twenty years multivariate statistical methods have become
increasingly popular among scientists in various fields. The theory had already
made great progress in previous decades and routine applications of
multivariate methods followed with the advent of fast computers. Nowadays
statistical software packages perform in seconds what used to take weeks of
tedious calculations.
Although this is certainly a welcome development, we find, on the other
hand, that many users of statistical packages are not too sure of what they
are doing, and this is especially true for multivariate statistical methods.
Many researchers have heard about such techniques and feel intuitively that
multivariate methods could be useful for their own work, but they haven't
mastered the usual mathematical prerequisites. This book tries to fill the gap
by explaining - in words and graphs - some basic concepts and selected
methods of multivariate statistical analysis.
Why another book? Are the existing books on applied multivariate statistics
all obsolete? No, some of them are up to date and, indeed, quite good.
However, we think that this introduction distinguishes itself from any existing
text in three ways. First, we illustrate the basic concepts mainly with graphical
tools, hence the large number of figures. Second, all techniques, with just
one exception, are illustrated by the same numerical data. Being familiar
with the data will help the reader to understand what insights multivariate
methods can offer. Third, we have avoided mathematical notation as much
as possible. While we are well aware of the fact that this is a risky business,
since avoiding mathematical language inevitably implies some loss of
accuracy, we feel that it is possible to understand the basic ideas of multivariate
analysis without mastering matrix algebra. Of course, we do not wish to
discourage anybody from learning multivariate statistics the mathematical
way. Indeed, many ideas and concepts can be stated in fewer words using
concise mathematical language, but in order to appreciate the advantages
of a more abstract approach the reader needs to be familiar with matrix
theory, which most non-mathematicians tend to think of as a difficult subject.
This book has grown out of its German predecessor Angewandte
multivariate Statistik, published in 1983 by Gustav Fischer Verlag, Stuttgart.
It is not just a translation, however; many new ideas and didactic tools have
been introduced, one chapter has been removed, and another one has been
added. The chapter that was deleted was the one on factor analysis, which
was criticized by some reviewers of the German text. Although it is probably
impossible to write about factor analysis without provoking criticism, we
felt that we were not sufficiently expert in this field, and that it was probably
better to omit this topic altogether.
Some remarks are in order concerning Chapter 5, which deals with multiple
linear regression. This chapter differs from the others in having a more
compressed presentation. It was indeed conceived as a restatement of linear
regression rather than a first introduction, and assumes that the reader is
familiar at least with simple linear regression. We felt that it wasn't possible
to omit this chapter, since a basic understanding of linear regression is
extremely helpful for understanding the linear discriminant function and
related concepts.
We do not claim that our book treats the various multivariate methods
comprehensively - on the contrary, we purposely limited the presentation to
those ideas and techniques that seemed most important to us, and tried to
explain those as carefully as possible. Our selection of topics is obviously
biased towards our own research areas, but this is what we feel most
competent to write about.
Many chapters are followed by one or several additional examples and
by a discussion. The questions stem mostly from the participants of several
courses on applied multivariate analysis that we taught from 1980 to 1985.
These questions gave us an opportunity to discuss additional important
methods or ideas that didn't fit into the main text. The length of the discussions
essentially reflects the number of times the respective chapter has been taught
so far.
Some readers of the German predecessor of this book have asked us to
supplement it with detailed instructions for the use of existing statistical
software such as BMDP, SAS, or SPSS. We have resisted the temptation to
introduce such instructions for two reasons. First, we would like the
information in this book to remain correct for, say, fifteen years. Most
instructions for the use of currently existing programs will probably be
obsolete by that time. Second, we think that statistical software should do
what textbooks propose, whereas the current situation is often just the
opposite: students think of statistics as the output produced by some computer
program.
Understanding this book requires relatively few prerequisites: the reader
should master basic algebra and be familiar with the basic methods of
univariate statistics. Knowing linear regression, at least for the case of a
single regressor, will be very helpful, however. It may be a good idea to do
some parallel reading in one of the books on regression given at the end of
Chapter 5.
Among the many people who contributed to this book we name just a
few. Rudolf Maibach wrote a preliminary version of Chapter 1. Erika
Gautschi translated parts of the German text into English. George McCabe,
Jean-Pierre Airoldi and three anonymous referees provided helpful comments
on several chapters of the manuscript. Kathi Schlitz and Emilia Bonnemain
typed the manuscript with great enthusiasm, and Thomas Hanni programmed
many graphics. We thank them all warmly. Finally, we would like to thank the
publishers of this book, especially Elizabeth Johnston, and her colleagues at
Chapman and Hall for their encouragement and cooperation.
Bernhard Flury
Hans Riedwyl
Berne, January 1987

1 The data

For the sake of simplicity we shall almost always refer to the same set of data
for illustration. This comes from an inquiry that was conducted into genuine
and forged thousand franc bills, henceforth called bills. How did we get hold of
such unusual data?
To find the answer to this and other questions, let us listen for a while to two
gentlemen sitting comfortably in their armchairs and chatting to each other.
One is a statistician (S), the other an expert (E) in the fight against
counterfeiting.

S: Now I'll tell you the real reason I invited you to my home this evening.
E: I thought the bottle of 1967 Gevrey-Chambertin was the reason, or
am I wrong?
S: That was one reason, but not the only one. I would like to talk to you
about an idea which I have been carrying around in my head for some
time now.
E: Well, then - go ahead!
S: You have a lot of experience in sorting out genuine bills from forged
ones, haven't you? You can tell, at first sight, whether a particular bill
is genuine or not.
E: It's not quite that simple. I can certainly detect a bad forgery right
away - say, if the water mark is missing, the paper is different, or the
print contains gross errors. In sophisticated forgeries, however, one
often has to look very closely to discover some small error. Forgers are
becoming smarter all the time and their technical tools more and more
precise and fine.
S: And so you have to rely on modern science to search for criminals.
E: Sure; but what are you driving at?
S: It occurred to me that statistics could help with this. In psychology
and biology, for example, statistical methods are used for classifying
items into groups and for determining which features characterize
group membership.
E: That sounds very interesting, indeed. Can you be a bit more specific?
S: Alright, alright! I once examined a bill rather closely. What interests
me are attributes that I can easily measure. The quality of the paper,
the water mark, the colours are all attributes which do not satisfy these
conditions. Linear dimensions, on the other hand, can easily be
measured, even by a layman like myself. The forger, I am sure, has a lot
of trouble reproducing all linear measures with perfect precision. So I
could select various distances on the bill and measure them. For
example, the length and height of the print image would be a good
place to start.
E: Yes, sure, but how is this information going to help you?
S: I need it in order to compare genuine bills with forged ones. I could
carry out the measurements on, let's say, 100 genuine and 100 forged
bills. Since the forger surely does not work with absolute precision, I
would find out very quickly that the measurements on the genuine
bills differed from those on the forged ones - and that's precisely what
I want to exploit.
E: Wait a minute! Why would you want to measure 100 bills; wouldn't a
single one be enough?
S: No, because I am interested not only in the actual length of a line, but
also in the variability of the measurements on a collection of bills. The
variability between bills may give me information about the precision
of the production process.
E: I see. One other question: is it at all possible to measure lines
accurately enough in order to be able to detect a difference between
genuine and forged bills?
S: That's a problem that can be solved. Either I can use sophisticated
measurement equipment, or else I can try to project the bills on a
screen and 'blow them up'. Naturally, there will still be some error of
measurement, but it will be small compared with the variability due to
the production process.
E: Alright, let's assume you have made all your measurements. Then you
compare the two mean values for each given line. Correct?
S: It's not quite that simple. I can now carry out many different analyses.
I consider the lengths that I have measured to be quantities subject to
chance, or 'random variables'. I would like to have several of these,
because the more random variables are available, the more
information can be derived from them. To begin with, I would compare the
genuine and the forged bills with regard to mean and standard
deviation of each variable, and also with regard to correlations
between the variables. Then I would represent, graphically, the
various measurements of each bill. Furthermore, I'd like to know
whether the group of forged bills differs markedly from the group of
the genuine bills in the features under consideration. In other words,
can we discriminate between the two groups? If so, I might be able to
tell with high probability whether a given bill is genuine or forged, just
by combining the various measurements in an optimal way. In this
way I could certainly provide you with valuable help in putting the
forgers out of business.
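
The first comparison S describes (mean and standard deviation of each variable, plus the correlation between variables, computed separately for each group) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are invented and are not taken from Table 1.1, the variable names `length` and `height` and the helpers `correlation` and `summarize` are hypothetical, and nothing here is the authors' own procedure.

```python
import math
import statistics

# Hypothetical measurements in millimetres, invented for illustration only;
# they are NOT taken from Table 1.1 or from any real bank notes.
genuine = {
    "length": [214.8, 215.0, 214.9, 215.1, 214.7],
    "height": [130.1, 129.9, 130.0, 130.2, 129.8],
}
forged = {
    "length": [214.6, 215.3, 214.5, 215.4, 214.7],
    "height": [130.4, 130.0, 130.6, 129.9, 130.5],
}


def correlation(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarize(group):
    """Return (mean, sample standard deviation) per variable and the
    correlation between the two variables of one group."""
    summary = {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in group.items()
    }
    summary["correlation"] = correlation(group["length"], group["height"])
    return summary


for label, group in (("genuine", genuine), ("forged", forged)):
    s = summarize(group)
    for name in ("length", "height"):
        mean, sd = s[name]
        print(f"{label:8s} {name:7s} mean={mean:.2f} sd={sd:.3f}")
    print(f"{label:8s} correlation(length, height)={s['correlation']:.2f}")
```

In this made-up sample the two groups have the same mean length but the forged group varies more, which is the kind of difference in variability S says a single bill could never reveal.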
E: All these methods you describe sound very good. But I'll tell you now
how we approach the problem. First of all, you have to realize that I
am not the one who encounters a forgery first. Only in the rarest of
cases is it the police who discover the existence of a forgery. Generally
it is the bank tellers, or other people who handle a lot of paper money,
who first become suspicious of a bill. If the forgery is clumsy, they will
detect it right away. If, however, the bill looks just a little bit unusual,
they may send it for further examination. So, when it finally comes to
me, it has already been established that a forgery is at hand. My first
task consists of informing all banks and other money agencies about
the forgery as quickly as possible. For this purpose a leaflet with a
rough description of the falsification is sent to all interested parties.
After that, we give one or several samples of the forgery to the printer
who produces our bills and their specialists carry out a detailed
examination. They check the bill with regard to printing process, type
of paper, water mark, colours and chemical composition of inks,
and much more. We expect this kind of information to help us find
and eliminate the forgery workshop and the distribution organization.
S: Well, it is precisely in this investigation that I want to help you. On the
basis of the features that I consider, I would like to determine whether
a single bill can be attributed to a group or not. I call this
'identification'. Indeed, even during the production of genuine bills,
one could, with the help of 'identification analysis', discover at an
early stage a change in the manufacturing process, which would cause
defective bills to be produced.
E: You probably assume that the forger manufactures the false notes
under exactly the same homogeneous conditions as we manufacture
our genuine bills. From experience I know that most of them are
produced in batches in some old cellar or in a garage. For this reason I
have my doubts about comparing forged and genuine bills
statistically.
S: I, too, will not necessarily assume conditions of homogeneity. I will
ask whether a collection of bills, for example a set of 100 forged ones,
can be divided into classes in such a way that they look homogeneous
within classes, but differ from one class to another. If that is the case,
then of course I would like again to describe group differences. I hope I
am not confusing you too much with these ideas.
E: Let's put this to a test! In the cause of science, I can certainly get you a
few genuine and forged bills.

Table 1.1 Six variables measured on 100 genuine Swiss bank notes
[Table body not recoverable: only the row numbers 1 to 36 survive in the source.]
