Multivariate Statistics
A practical approach

OTHER STATISTICS TEXTS FROM CHAPMAN AND HALL

Applied Statistics
    D.R. Cox and E.J. Snell
The Analysis of Time Series
    C. Chatfield
Decision Analysis: A Bayesian Approach
    J.Q. Smith
Statistics for Technology
    C. Chatfield
Introduction to Multivariate Analysis
    C. Chatfield and A.J. Collins
Introduction to Optimization Methods and their Application in Statistics
    B.S. Everitt
An Introduction to Statistical Modelling
    A.J. Dobson
Multivariate Analysis of Variance and Repeated Measures
    D.J. Hand and C.C. Taylor
Multivariate Statistical Methods - a primer
    Bryan F. Manly
Statistical Methods in Agriculture and Experimental Biology
    R. Mead and R.N. Curnow
Elements of Simulation
    B.J.T. Morgan
Essential Statistics
    D.G. Rees
Applied Statistics: A Handbook of BMDP Analyses
    E.J. Snell
Intermediate Statistical Methods
    G.B. Wetherill
Foundations of Statistics
    D.G. Rees
Probability: Methods and Measurement
    A. O'Hagan
Elementary Applications of Probability Theory
    H.C. Tuckwell

Further information on the complete range of Chapman and Hall statistics books is available from the publishers.

Multivariate Statistics
A practical approach

Bernhard Flury and Hans Riedwyl

London New York
CHAPMAN AND HALL

First published in 1988 by Chapman and Hall Ltd, 11 New Fetter Lane, London EC4P 4EE

Published in the USA by Chapman and Hall, 29 West 35th Street, New York NY 10001

© 1988 B. Flury and H. Riedwyl

Softcover reprint of the hardcover 1st edition 1988

ISBN-13: 978-94-010-7041-6

This title is available in both hardbound and paperback editions. The paperback edition is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher's prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.

All rights reserved.
No part of this book may be reprinted, or reproduced or utilized in any form or by any electronic, mechanical or other means, now known or hereafter invented, including photocopying and recording, or in any information storage and retrieval system, without permission in writing from the publisher.

British Library Cataloguing in Publication Data

Flury, Bernhard
Multivariate statistics: a practical approach.
1. Multivariate analysis
I. Title II. Riedwyl, Hans
519.5'35 QA278
ISBN-13: 978-94-010-7041-6
DOI: 10.1007/978-94-009-1217-5
e-ISBN-13: 978-94-009-1217-5

Library of Congress Cataloging-in-Publication Data

Flury, Bernhard, 1951-
Multivariate statistics.
Rev. translation of: Angewandte multivariate Statistik.
Bibliography: p.
Includes index.
1. Multivariate analysis. I. Riedwyl, Hans. II. Title.
QA278.F58813 1988 519.5'35 87-18405
ISBN-13: 978-94-010-7041-6

Contents

Preface

1 The data
    Discussion

2 Univariate plots and descriptive statistics
    Discussion
    Further study

3 Scatterplot, correlation and covariance
    Discussion
    Further study

4 Face plots
    Discussion
    Further study

5 Multiple linear regression
    5.1 Introductory remarks
    5.2 The model of multiple linear regression
    5.3 Least squares estimation
    5.4 Residual analysis
    5.5 Model building, analysis of variance
    5.6 The overall test of significance
    5.7 Coefficient of determination and multiple correlation
    5.8 Tests of partial hypotheses
    5.9 Standard errors of the regression coefficients
    5.10 Selection of a subset of regressors
    Discussion
    Further study

6 Linear combinations
    6.1 Introduction
    6.2 A special linear combination
    6.3 Linear combinations of two variables
    6.4 Linear combinations of several variables
    6.5 Mean and standard deviation of linear combinations
    Discussion
    Further study

7 Linear discriminant analysis for two groups
    7.1 Introduction
    7.2 Multivariate standard distance
    7.3 Relationship between discriminant analysis and multiple linear regression
    7.4 Testing hypotheses about the discriminant function
    7.5 Screening a discriminant function
    7.6 Further uses of the coefficient of determination
    7.7 Classification of observations
    Discussion
    Further study
    Examples

8 Identification analysis
    8.1 Introduction
    8.2 Identification analysis as a special case of discriminant analysis
    8.3 More about standard distance
    8.4 Identification of a bank note
    8.5 Analysis of outliers
    Discussion
    Further study
    Examples

9 Specification analysis
    9.1 Standard distance between a sample and a hypothetical mean vector
    9.2 Specification analysis of the bank notes
    9.3 Confidence regions for a mean vector
    9.4 A more general model
    9.5 Specification faces
    Further study
    Examples

10 Principal component analysis
    10.1 Introduction
    10.2 Principal components of two variables
    10.3 Properties of principal components in the multidimensional case
    10.4 Principal component analysis of the genuine bank notes
    10.5 The singular case
    10.6 Principal components, standard distance, and the multivariate normal distribution
    10.7 Standard errors of the principal component coefficients and related problems
    10.8 Principal component analysis of several groups
    Discussion
    Further study
    Examples

11 Comparing the covariance structures of two groups
    11.1 Introduction
    11.2 The bivariate case
    11.3 The multivariate case
    11.4 Comparison of the covariance matrices of the genuine and forged bank notes
    11.5 Partial statistics for the analysis of Ymax and Ymin
    11.6 Stepwise analysis of Ymax and Ymin
    11.7 Relationships to standard distance and principal component analysis
    11.8 Critical values of the distribution of Fmax and Fmin
    Discussion
    Further study
    Examples

12 Exercises
    12.1 Exercises based on the bank note data
    12.2 Additional exercises

13 Mathematical appendix
    13.1 Introduction and preliminaries
    13.2 Data matrix, mean vector, covariance and correlation
    13.3 Multiple linear regression
    13.4 Linear combinations
    13.5 Multivariate standard distance and the linear discriminant function
    13.6 Principal component analysis
    13.7 Comparison of two covariance matrices

References

Index

Preface

During the last twenty years multivariate statistical methods have become increasingly popular among scientists in various fields. The theory had already made great progress in previous decades and routine applications of multivariate methods followed with the advent of fast computers. Nowadays statistical software packages perform in seconds what used to take weeks of tedious calculations. Although this is certainly a welcome development, we find, on the other hand, that many users of statistical packages are not too sure of what they are doing, and this is especially true for multivariate statistical methods. Many researchers have heard about such techniques and feel intuitively that multivariate methods could be useful for their own work, but they haven't mastered the usual mathematical prerequisites. This book tries to fill the gap by explaining - in words and graphs - some basic concepts and selected methods of multivariate statistical analysis.

Why another book? Are the existing books on applied multivariate statistics all obsolete? No, some of them are up to date and, indeed, quite good. However, we think that this introduction distinguishes itself from any existing text in three ways. First, we illustrate the basic concepts mainly with graphical tools, hence the large number of figures. Second, all techniques, with just one exception, are illustrated by the same numerical data.
Being familiar with the data will help the reader to understand what insights multivariate methods can offer. Third, we have avoided mathematical notation as much as possible. While we are well aware of the fact that this is a risky business, since avoiding mathematical language inevitably implies some loss of accuracy, we feel that it is possible to understand the basic ideas of multivariate analysis without mastering matrix algebra. Of course, we do not wish to discourage anybody from learning multivariate statistics the mathematical way. Indeed, many ideas and concepts can be stated in fewer words using concise mathematical language, but in order to appreciate the advantages of a more abstract approach the reader needs to be familiar with matrix theory, which most non-mathematicians tend to think of as a difficult subject.

This book has grown out of its German predecessor Angewandte multivariate Statistik, published in 1983 by Gustav Fischer Verlag, Stuttgart. It is not just a translation, however; many new ideas and didactic tools have been introduced, one chapter has been removed, and another one has been added. The chapter that was deleted was the one on factor analysis, which was criticized by some reviewers of the German text. Although it is probably impossible to write about factor analysis without provoking criticism, we felt that we were not sufficiently expert in this field, and that it was probably better to omit this topic altogether.

Some remarks are in order concerning Chapter 5, which deals with multiple linear regression. This chapter differs from the others in having a more compressed presentation. It was indeed conceived as a restatement of linear regression rather than a first introduction, and assumes that the reader is familiar at least with simple linear regression.
We felt that it wasn't possible to omit this chapter, since a basic understanding of linear regression is extremely helpful for understanding the linear discriminant function and related concepts.

We do not claim that our book treats the various multivariate methods comprehensively - on the contrary, we purposely limited the presentation to those ideas and techniques that seemed most important to us, and tried to explain those as carefully as possible. Our selection of topics is obviously biased towards our own research areas, but this is what we feel most competent to write about.

Many chapters are followed by one or several additional examples and by a discussion. The questions stem mostly from the participants of several courses on applied multivariate analysis that we taught from 1980 to 1985. These questions gave us an opportunity to discuss additional important methods or ideas that didn't fit into the main text. The length of the discussions essentially reflects the number of times the respective chapter has been taught so far.

Some readers of the German predecessor of this book have asked us to supplement it with detailed instructions for the use of existing statistical software such as BMDP, SAS, or SPSS. We have resisted the temptation to introduce such instructions for two reasons. First, we would like the information in this book to remain correct for, say, fifteen years. Most instructions for the use of currently existing programs will probably be obsolete by that time. Second, we think that statistical software should do what textbooks propose, whereas the current situation is often just the opposite: students think of statistics as the output produced by some computer program.

Understanding this book requires relatively few prerequisites: the reader should master basic algebra and be familiar with the basic methods of univariate statistics. Knowing linear regression, at least for the case of a single regressor, will be very helpful, however.
It may be a good idea to do some parallel reading in one of the books on regression given at the end of Chapter 5.

Among the many people who contributed to this book we name just a few. Rudolf Maibach wrote a preliminary version of Chapter 1. Erika Gautschi translated parts of the German text into English. George McCabe, Jean-Pierre Airoldi and three anonymous referees provided helpful comments on several chapters of the manuscript. Kathi Schlitz and Emilia Bonnemain typed the manuscript with great enthusiasm, and Thomas Hanni programmed many graphics. We thank them all warmly. Finally, we would like to thank the publishers of this book, especially Elizabeth Johnston, and her colleagues at Chapman and Hall for their encouragement and cooperation.

Bernhard Flury
Hans Riedwyl
Berne, January 1987

1 The data

For the sake of simplicity we shall almost always refer to the same set of data for illustration. This comes from an inquiry that was conducted into genuine and forged thousand franc bills, henceforth called bills. How did we get hold of such unusual data? To find the answer to this and other questions, let us listen for a while to two gentlemen sitting comfortably in their armchairs and chatting to each other. One is a statistician (S), the other an expert (E) in the fight against counterfeiting.

S: Now I'll tell you the real reason I invited you to my home this evening.

E: I thought the bottle of 1967 Gevrey-Chambertin was the reason, or am I wrong?

S: That was one reason, but not the only one. I would like to talk to you about an idea which I have been carrying around in my head for some time now.

E: Well, then - go ahead!

S: You have a lot of experience in sorting out genuine bills from forged ones, haven't you? You can tell, at first sight, whether a particular bill is genuine or not.

E: It's not quite that simple.
I can certainly detect a bad forgery right away - say, if the water mark is missing, the paper is different, or the print contains gross errors. In sophisticated forgeries, however, one often has to look very closely to discover some small error. Forgers are becoming smarter all the time and their technical tools more and more precise and fine.

S: And so you have to rely on modern science to search for criminals.

E: Sure; but what are you driving at?

S: It occurred to me that statistics could help with this. In psychology and biology, for example, statistical methods are used for classifying items into groups and for determining which features characterize group membership.

E: That sounds very interesting, indeed. Can you be a bit more specific?

S: Alright, alright! I once examined a bill rather closely. What interests me are attributes that I can easily measure. The quality of the paper, the water mark, the colours are all attributes which do not satisfy these conditions. Linear dimensions, on the other hand, can easily be measured, even by a layman like myself. The forger, I am sure, has a lot of trouble reproducing all linear measures with perfect precision. So I could select various distances on the bill and measure them. For example, the length and height of the print image would be a good place to start.

E: Yes, sure, but how is this information going to help you?

S: I need it in order to compare genuine bills with forged ones. I could carry out the measurements on, let's say, 100 genuine and 100 forged bills. Since the forger surely does not work with absolute precision, I would find out very quickly that the measurements on the genuine bills differed from those on the forged ones - and that's precisely what I want to exploit.

E: Wait a minute! Why would you want to measure 100 bills; wouldn't a single one be enough?
S: No, because I am interested not only in the actual length of a line, but also in the variability of the measurements on a collection of bills. The variability between bills may give me information about the precision of the production process.

E: I see. One other question: is it at all possible to measure lines accurately enough in order to be able to detect a difference between genuine and forged bills?

S: That's a problem that can be solved. Either I can use sophisticated measurement equipment, or else I can try to project the bills on a screen and 'blow them up'. Naturally, there will still be some error of measurement, but it will be small compared with the variability due to the production process.

E: Alright, let's assume you have made all your measurements. Then you compare the two mean values for each given line. Correct?

S: It's not quite that simple. I can now carry out many different analyses. I consider the lengths that I have measured to be quantities subject to chance, or 'random variables'. I would like to have several of these, because the more random variables are available, the more information can be derived from them. To begin with, I would compare the genuine and the forged bills with regard to mean and standard deviation of each variable, and also with regard to correlations between the variables. Then I would represent, graphically, the various measurements of each bill. Furthermore, I'd like to know whether the group of forged bills differs markedly from the group of the genuine bills in the features under consideration. In other words, can we discriminate between the two groups? If so, I might be able to tell with high probability whether a given bill is genuine or forged, just by combining the various measurements in an optimal way. In this way I could certainly provide you with valuable help in putting the forgers out of business.

E: All these methods you describe sound very good. But I'll tell you now how we approach the problem.
First of all, you have to realize that I am not the one who encounters a forgery first. Only in the rarest of cases is it the police who discover the existence of a forgery. Generally it is the bank tellers, or other people who handle a lot of paper money, who first become suspicious of a bill. If the forgery is clumsy, they will detect it right away. If, however, the bill looks just a little bit unusual, they may send it for further examination. So, when it finally comes to me, it has already been established that a forgery is at hand. My first task consists of informing all banks and other money agencies about the forgery as quickly as possible. For this purpose a leaflet with a rough description of the falsification is sent to all interested parties. After that, we give one or several samples of the forgery to the printer who produces our bills and their specialists carry out a detailed examination. They check the bill with regard to printing process, type of paper, water mark, colours and chemical composition of inks, and much more. We expect this kind of information to help us find and eliminate the forgery workshop and the distribution organization.

S: Well, it is precisely in this investigation that I want to help you. On the basis of the features that I consider, I would like to determine whether a single bill can be attributed to a group or not. I call this 'identification'. Indeed, even during the production of genuine bills, one could, with the help of 'identification analysis', discover at an early stage a change in the manufacturing process, which would cause defective bills to be produced.

E: You probably assume that the forger manufactures the false notes under exactly the same homogeneous conditions as we manufacture our genuine bills. From experience I know that most of them are produced in batches in some old cellar or in a garage. For this reason I have my doubts about comparing forged and genuine bills statistically.
S: I, too, will not necessarily assume conditions of homogeneity. I will ask whether a collection of bills, for example a set of 100 forged ones, can be divided into classes in such a way that they look homogeneous within classes, but differ from one class to another. If that is the case, then of course I would like again to describe group differences. I hope I am not confusing you too much with these ideas.

E: Let's put this to a test! In the cause of science, I can certainly get you a few genuine and forged bills.

Table 1.1 Six variables measured on 100 genuine Swiss bank notes
...
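
The first comparison S proposes in the dialogue (mean and standard deviation of each variable, and the correlation between variables, computed separately for genuine and forged bills) might be sketched as follows. This is only an illustration under stated assumptions: the two variables are the length and height of the print image that S mentions, the function names `pearson` and `summarize` are hypothetical, and the ten measurements are invented for the sketch and are not taken from Table 1.1.

```python
from statistics import mean, stdev


def pearson(xs, ys):
    """Sample correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(group):
    """Per-variable mean and standard deviation, plus the correlation
    between the two variables, for one group of bills."""
    length = [row[0] for row in group]
    height = [row[1] for row in group]
    return {
        "mean": (mean(length), mean(height)),
        "sd": (stdev(length), stdev(height)),
        "corr": pearson(length, height),
    }


# Hypothetical (length, height) measurements in millimetres.
# These values are made up for illustration; they are NOT the book's data.
genuine = [(214.8, 129.7), (215.0, 129.9), (214.9, 129.7),
           (215.1, 130.0), (214.7, 129.6)]
forged = [(214.6, 130.4), (215.2, 130.1), (214.4, 130.6),
          (215.0, 130.2), (214.8, 130.3)]

for name, group in (("genuine", genuine), ("forged", forged)):
    print(name, summarize(group))
```

Comparing the two printed summaries side by side shows the kind of information S is after: a group can differ from another not only in its means but also in its spread and in how its variables move together, which is why S insists on measuring many bills rather than one.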