Econometrics_Notes_-_University_of_Utah__370_pages_

Class Notes Econ 7800
Fall Semester 2003
Hans G. Ehrbar
Economics Department, University of Utah, 1645 Campus Center Drive, Salt Lake City UT 84112-9300, U.S.A.
URL: www.econ.utah.edu/ehrbar/ecmet.pdf
E-mail address: ehrbar@econ.utah.edu

Abstract. This is an attempt to make a carefully argued set of class notes freely available. The source code for these notes can be downloaded from www.econ.utah.edu/ehrbar/ecmet-sources.zip Copyright Hans G. Ehrbar under the GNU Public License. The present version has those chapters relevant for Econ 7800.

Contents

Chapter 1. Syllabus Econ 7800 Fall 2003

Chapter 2. Probability Fields
2.1. The Concept of Probability
2.2. Events as Sets
2.3. The Axioms of Probability
2.4. Objective and Subjective Interpretation of Probability
2.5. Counting Rules
2.6. Relationships Involving Binomial Coefficients
2.7. Conditional Probability
2.8. Ratio of Probabilities as Strength of Evidence
2.9. Bayes Theorem
2.10. Independence of Events
2.11. How to Plot Frequency Vectors and Probability Vectors

Chapter 3. Random Variables
3.1. Notation
3.2. Digression about Infinitesimals
3.3. Definition of a Random Variable
3.4. Characterization of Random Variables
3.5. Discrete and Absolutely Continuous Probability Measures
3.6. Transformation of a Scalar Density Function
3.7. Example: Binomial Variable
3.8. Pitfalls of Data Reduction: The Ecological Fallacy
3.9. Independence of Random Variables
3.10. Location Parameters and Dispersion Parameters of a Random Variable
3.11. Entropy

Chapter 4. Specific Random Variables
4.1. Binomial
4.2. The Hypergeometric Probability Distribution
4.3. The Poisson Distribution
4.4. The Exponential Distribution
4.5. The Gamma Distribution
4.6. The Uniform Distribution
4.7. The Beta Distribution
4.8. The Normal Distribution
4.9. The Chi-Square Distribution
4.10. The Lognormal Distribution
4.11. The Cauchy Distribution

Chapter 5. Chebyshev Inequality, Weak Law of Large Numbers, and Central Limit Theorem
5.1. Chebyshev Inequality
5.2. The Probability Limit and the Law of Large Numbers
5.3. Central Limit Theorem

Chapter 6. Vector Random Variables
6.1. Expected Value, Variances, Covariances
6.2. Marginal Probability Laws
6.3. Conditional Probability Distribution and Conditional Mean
6.4. The Multinomial Distribution
6.5. Independent Random Vectors
6.6. Conditional Expectation and Variance
6.7. Expected Values as Predictors
6.8. Transformation of Vector Random Variables

Chapter 7. The Multivariate Normal Probability Distribution
7.1. More About the Univariate Case
7.2. Definition of Multivariate Normal
7.3. Special Case: Bivariate Normal
7.4. Multivariate Standard Normal in Higher Dimensions

Chapter 8. The Regression Fallacy

Chapter 9. A Simple Example of Estimation
9.1. Sample Mean as Estimator of the Location Parameter
9.2. Intuition of the Maximum Likelihood Estimator
9.3. Variance Estimation and Degrees of Freedom

Chapter 10. Estimation Principles and Classification of Estimators
10.1. Asymptotic or Large-Sample Properties of Estimators
10.2. Small Sample Properties
10.3. Comparison Unbiasedness Consistency
10.4. The Cramer-Rao Lower Bound
10.5. Best Linear Unbiased Without Distribution Assumptions
10.6. Maximum Likelihood Estimation
10.7. Method of Moments Estimators
10.8. M-Estimators
10.9. Sufficient Statistics and Estimation
10.10. The Likelihood Principle
10.11. Bayesian Inference

Chapter 11. Interval Estimation

Chapter 12. Hypothesis Testing
12.1. Duality between Significance Tests and Confidence Regions
12.2. The Neyman Pearson Lemma and Likelihood Ratio Tests
12.3. The Wald, Likelihood Ratio, and Lagrange Multiplier Tests

Chapter 13. General Principles of Econometric Modelling

Chapter 14. Mean-Variance Analysis in the Linear Model
14.1. Three Versions of the Linear Model
14.2. Ordinary Least Squares
14.3. The Coefficient of Determination
14.4. The Adjusted R-Square

Chapter 15. Digression about Correlation Coefficients
15.1. A Unified Definition of Correlation Coefficients

Chapter 16. Specific Datasets
16.1. Cobb Douglas Aggregate Production Function
16.2. Houthakker’s Data
16.3. Long Term Data about US Economy
16.4. Dougherty Data
16.5. Wage Data

Chapter 17. The Mean Squared Error as an Initial Criterion of Precision
17.1. Comparison of Two Vector Estimators

Chapter 18. Sampling Properties of the Least Squares Estimator
18.1. The Gauss Markov Theorem
18.2. Digression about Minimax Estimators
18.3. Miscellaneous Properties of the BLUE
18.4. Estimation of the Variance
18.5. Mallow’s Cp-Statistic as Estimator of the Mean Squared Error

Chapter 19. Nonspherical Positive Definite Covariance Matrix

Chapter 20. Best Linear Prediction
20.1. Minimum Mean Squared Error, Unbiasedness Not Required
20.2. The Associated Least Squares Problem
20.3. Prediction of Future Observations in the Regression Model

Chapter 21. Updating of Estimates When More Observations become Available

Chapter 22. Constrained Least Squares
22.1. Building the Constraint into the Model
22.2. Conversion of an Arbitrary Constraint into a Zero Constraint
22.3. Lagrange Approach to Constrained Least Squares
22.4. Constrained Least Squares as the Nesting of Two Simpler Models
22.5. Solution by Quadratic Decomposition
22.6. Sampling Properties of Constrained Least Squares
22.7. Estimation of the Variance in Constrained OLS
22.8. Inequality Restrictions
22.9. Application: Biased Estimators and Pre-Test Estimators

Chapter 23. Additional Regressors

Chapter 24. Residuals: Standardized, Predictive, “Studentized”
24.1. Three Decisions about Plotting Residuals
24.2. Relationship between Ordinary and Predictive Residuals
24.3. Standardization

Chapter 25. Regression Diagnostics
25.1. Missing Observations
25.2. Grouped Data
25.3. Influential Observations and Outliers
25.4. Sensitivity of Estimates to Omission of One Observation

Chapter 26. Asymptotic Properties of the OLS Estimator
26.1. Consistency of the OLS estimator
26.2. Asymptotic Normality of the Least Squares Estimator

Chapter 27. Least Squares as the Normal Maximum Likelihood Estimate

Chapter 28. Random Regressors
28.1. Strongest Assumption: Error Term Well Behaved Conditionally on Explanatory Variables
28.2. Contemporaneously Uncorrelated Disturbances
28.3. Disturbances Correlated with Regressors in Same Observation

Chapter 29. The Mahalanobis Distance
29.1. Definition of the Mahalanobis Distance

Chapter 30. Interval Estimation
30.1. A Basic Construction Principle for Confidence Regions
30.2. Coverage Probability of the Confidence Regions
30.3. Conventional Formulas for the Test Statistics
30.4. Interpretation in terms of Studentized Mahalanobis Distance

Chapter 31. Three Principles for Testing a Linear Constraint
31.1. Mathematical Detail of the Three Approaches
31.2. Examples of Tests of Linear Hypotheses
31.3. The F-Test Statistic is a Function of the Likelihood Ratio
31.4. Tests of Nonlinear Hypotheses
31.5. Choosing Between Nonnested Models

Chapter 32. Instrumental Variables

Appendix A. Matrix Formulas
A.1. A Fundamental Matrix Decomposition
A.2. The Spectral Norm of a Matrix
A.3. Inverses and g-Inverses of Matrices
A.4. Deficiency Matrices
A.5. Nonnegative Definite Symmetric Matrices
A.6. Projection Matrices
A.7. Determinants
A.8. More About Inverses
A.9. Eigenvalues and Singular Value Decomposition

Appendix B. Arrays of Higher Rank
B.1. Informal Survey of the Notation
B.2. Axiomatic Development of Array Operations
B.3. An Additional Notational Detail
B.4. Equality of Arrays and Extended Substitution
B.5. Vectorization and Kronecker Product

Appendix C. Matrix Differentiation
C.1. First Derivatives

Bibliography

CHAPTER 1
Syllabus Econ 7800 Fall 2003

The class meets Tuesdays and Thursdays 12:25 to 1:45pm in BUC 207. First class Thursday, August 21, 2003; last class Thursday, December 4.

Instructor: Assoc. Prof. Dr. Dr. Hans G. Ehrbar. Hans’s office is at 319 BUO, Tel. 581 7797, email ehrbar@econ.utah.edu Office hours: Monday 10–10:45 am, Thursday 5–5:45 pm or by appointment.

Textbook: There is no obligatory textbook in the Fall Quarter, but detailed class notes are available at www.econ.utah.edu/ehrbar/ec7800.pdf, and you can purchase a hardcopy containing the assigned chapters only at the University Copy Center, 158 Union Bldg, tel. 581 8569 (ask for the class materials for Econ 7800). Furthermore, the following optional texts will be available at the bookstore: Peter Kennedy, A Guide to Econometrics (fourth edition), MIT Press, 1998, ISBN 0-262-61140-6. The bookstore also has available William H. Greene’s Econometric Analysis, fifth edition, Prentice Hall 2003, ISBN 0-13-066189-9. This is the assigned text for Econ 7801 in the Spring semester 2004, and some of the introductory chapters are already useful for the Fall semester 2003.

The following chapters in the class notes are assigned: 2, 3 (but not section 3.2), 4, 5, 6, 7 (but only until section 7.3), 8, 9, 10, 11, 12, 14, only section 15.1 in chapter 15, in chapter 16, we will perhaps do section 16.1 or 16.4, then in chapter 17 we do section 17.1, then chapter 18 until and including 18.5, and in chapter 22 do sections 22.1, 22.3, 22.6, and 22.7.
In chapter 29 only the first section 29.1, and finally chapter 30 and section 31.2 in chapter 31.

Summary of the Class: This is the first semester in a two-semester Econometrics field, but it should also be useful for students taking the first semester only as part of their methodology requirement. The course description says: Probability, conditional probability, distributions, transformation of probability densities, sufficient statistics, limit theorems, estimation principles, maximum likelihood estimation, interval estimation and hypothesis testing, least squares estimation, linear constraints.

This class has two focal points: maximum likelihood estimation, and the fundamental concepts of the linear model (regression). If advanced mathematical concepts are necessary in these theoretical explorations, they will usually be reviewed very briefly before we use them. The class is structured in such a way that, if you allocate enough time, it should be possible to refresh your math skills as you go along.

Here is an overview of the topics to be covered in the Fall Semester. They may not come exactly in the order in which they are listed here.

1. Probability fields: Events as sets, set operations, probability axioms, subjective vs. frequentist interpretation, finite sample spaces and counting rules (combinatorics), conditional probability, Bayes theorem, independence, conditional independence.
2. Random Variables: Cumulative distribution function, density function; location parameters (expected value, median) and dispersion parameters (variance).
3. Special Issues and Examples: Discussion of the “ecological fallacy”; entropy; moment generating function; examples (Binomial, Poisson, Gamma, Normal, Chi-square); sufficient statistics.
4. Limit Theorems: Chebyshev inequality; law of large numbers; central limit theorems.

The first Midterm will already be on Thursday, September 18, 2003.
It will be closed book, but you are allowed to prepare one sheet with formulas etc. Most of the midterm questions will be similar or identical to the homework questions in the class notes assigned up to that time.

5. Jointly Distributed Random Variables: Joint, marginal, and conditional densities; conditional mean; transformations of random variables; covariance and correlation; sums and linear combinations of random variables; jointly normal variables.
6. Estimation Basics: Descriptive statistics; sample mean and variance; degrees of freedom; classification of estimators.
7. Estimation Methods: Method of moments estimators; least squares estimators. Bayesian inference. Maximum likelihood estimators; large sample properties of MLE; MLE and sufficient statistics; computational aspects of maximum likelihood.
8. Confidence Intervals and Hypothesis Testing: Power functions; Neyman Pearson Lemma; likelihood ratio tests. As examples of tests: the run test, goodness of fit test, contingency tables.

The second in-class Midterm will be on Thursday, October 16, 2003.

9. Basics of the “Linear Model.” We will discuss the case with nonrandom regressors and a spherical covariance matrix: OLS-BLUE duality, Maximum likelihood estimation, linear constraints, hypothesis testing, interval estimation (t-test, F-test, joint confidence intervals).

The third Midterm will be a take-home exam. You will receive the questions on Tuesday, November 25, 2003, and they are due back at the beginning of class on Tuesday, December 2nd, 12:25 pm. The questions will be similar to questions which you might have to answer in the Econometrics Field exam.

The Final Exam will be given according to the campus-wide examination schedule, which is Wednesday December 10, 10:30–12:30 in the usual classroom. Closed book, but again you are allowed to prepare one sheet of notes with the most important concepts and formulas. The exam will cover material after the second Midterm.
Grading: The three midterms and the final exam will be counted equally. Every week certain homework questions from among the questions in the class notes will be assigned. It is recommended that you work through these homework questions conscientiously. The answers provided in the class notes should help you if you get stuck. If you have problems with these homeworks despite the answers in the class notes, please write your answer down as far as you get and submit it to me; I will look at it and help you out. A majority of the questions in the two in-class midterms and the final exam will be identical to these assigned homework questions, but some questions will be different.

Special circumstances: If there are special circumstances requiring an individualized course of study in your case, please see me about it in the first week of classes.

Hans G. Ehrbar

CHAPTER 2
Probability Fields

2.1. The Concept of Probability

Probability theory and statistics are useful in dealing with the following types of situations:
• Games of chance: throwing dice, shuffling cards, drawing balls out of urns.
• Quality control in production: you take a sample from a shipment, count how many defectives.
• Actuarial Problems: the length of life anticipated for a person who has just applied for life insurance.
• Scientific Experiments: you count the number of mice which contract cancer when a group of mice is exposed to cigarette smoke.
• Markets: the total personal income in New York State in a given month.
• Meteorology: the rainfall in a given month.
• Uncertainty: the exact date of Noah’s birth.
• Indeterminacy: The closing of the Dow Jones industrial average or the temperature in New York City at 4 pm on February 28, 2014.
• Chaotic determinacy: the relative frequency of the digit 3 in the decimal representation of π.
• Quantum mechanics: the proportion of photons absorbed by a polarization filter.
• Statistical mechanics: the velocity distribution of molecules in a gas at a given pressure and temperature.

In the probability theoretical literature the situations in which probability theory applies are called “experiments,” see for instance [Rén70, p. 1]. We will not use this terminology here, since probabilistic reasoning applies to several different types of situations, and not all these can be considered “experiments.”

Problem 1. (This question will not be asked on any exams) Rényi says: “Observing how long one has to wait for the departure of an airplane is an experiment.” Comment.

Answer. Rényi commits the epistemic fallacy in order to justify his use of the word “experiment.” Not the observation of the departure but the departure itself is the event which can be theorized probabilistically, and the word “experiment” is not appropriate here.

What does the fact that probability theory is appropriate in the above situations tell us about the world? Let us go through our list one by one:
• Games of chance: Games of chance are based on the sensitivity to initial conditions: you tell someone to roll a pair of dice or shuffle a deck of cards, and despite the fact that this person is doing exactly what he or she is asked to do and produces an outcome which lies within a well-defined universe known beforehand (a number between 1 and 6, or a permutation of the deck of cards), the question of which number or which permutation is beyond their control. The precise location and speed of the die or the precise order of the cards varies, and these small variations in initial conditions give rise, by the “butterfly effect” of chaos theory, to unpredictable final outcomes.
A critical realist recognizes here the openness and stratification of the world: If many different influences come together, each of which is governed by laws, then their sum total is not determinate, as a naive hyperdeterminist would think, but indeterminate. This is not only a condition for the possibility of science (in a hyper-deterministic world, one could not know anything before one knew everything, and science would also not be necessary because one could not do anything), but also for practical human activity: the macro outcomes of human practice are largely independent of micro detail (the postcard arrives whether the address is written in cursive or in printed letters, etc.). Games of chance are situations which deliberately project this micro indeterminacy into the macro world: the micro influences cancel each other out without one enduring influence taking over (as would be the case if the die were not perfectly symmetric and balanced) or deliberate human corrective activity stepping into the void (as a card trickster might do if the cards being shuffled somehow were distinguishable from the backside). The experiment in which one draws balls from urns shows clearly another aspect of this paradigm: the set of different possible outcomes is fixed beforehand, and the probability enters in the choice of one of these predetermined outcomes. This is not the only way probability can arise; it is an extensionalist example, in which the connection between success and failure is external. The world is not a collection of externally related outcomes collected in an urn. Success and failure are not determined by a choice between different spatially separated and individually inert balls (or playing cards or faces on a die), but are the outcome of development and struggle that is internal to the individual unit.
• Quality control in production: you take a sample from a shipment, count how many defectives. Why are statistics and probability useful in production?
Because production is work, it is not spontaneous. Nature does not voluntarily give us things in the form in which we need them. Production is similar to a scientific experiment because it is the attempt to create local closure. Such closure can never be complete; there are always leaks in it, through which irregularity enters.
• Actuarial Problems: the length of life anticipated for a person who has just applied for life insurance. Not only production, but also life itself is a struggle with physical nature; it is emergence. And sometimes it fails: sometimes the living organism is overwhelmed by the forces which it tries to keep at bay and to subject to its own purposes.
• Scientific Experiments: you count the number of mice which contract cancer when a group of mice is exposed to cigarette smoke: There is local closure regarding the conditions under which the mice live, but even if this closure were complete, individual mice would still react differently, because of genetic differences. No two mice are exactly the same, and despite these differences they are still mice. This is again the stratification of reality. Two mice are two different individuals but they are both mice. Their reaction to the smoke is not identical, since they are different individuals, but it is not completely capricious either, since both are mice. It can be predicted probabilistically. Those mechanisms which make them mice react to the smoke. The probabilistic regularity comes from the transfactual efficacy of the mouse organisms.
• Meteorology: the rainfall in a given month. It is very fortunate for the development of life on our planet that we have the chaotic alternation between cloud cover and clear sky, instead of a continuous cloud cover as on Venus or a continuous clear sky.
Butterfly effect all over again, but it is possible to make probabilistic predictions since the fundamentals remain stable: the transfactual efficacy of the energy received from the sun and radiated back out into space.
• Markets: the total personal income in New York State in a given month. Market economies are very much like the weather; planned economies would be more like production or life.
• Uncertainty: the exact date of Noah’s birth. This is epistemic uncertainty: assuming that Noah was a real person, the date exists and we know a time range in which it must have been, but we do not know the details. Probabilistic methods can be used to represent this kind of uncertain knowledge, but other methods to represent this knowledge may be more appropriate.
• Indeterminacy: The closing of the Dow Jones Industrial Average (DJIA) or the temperature in New York City at 4 pm on February 28, 2014: This is ontological uncertainty, not only epistemological uncertainty. Not only do we not know it, but it is objectively not yet decided what these data will be. Probability theory has limited applicability for the DJIA since it cannot be expected that the mechanisms determining the DJIA will be the same at that time; therefore we cannot base ourselves on the transfactual efficacy of some stable mechanisms. It is not known which stocks will be included in the DJIA at that time, or whether the US dollar will still be the world reserve currency and the New York stock exchange the pinnacle of international capital markets. Perhaps a different stock market index located somewhere else will at that time play the role the DJIA is playing today. We would not even be able to ask questions about that alternative index today.
Regarding the temperature, it is more defensible to assign a probability, since the weather mechanisms have probably stayed the same, except for changes in global warming (unless mankind has learned by that time to manipulate the weather locally by cloud seeding etc.).
• Chaotic determinacy: the relative frequency of the digit 3 in the decimal representation of π: The laws by which the number π is defined have very little to do with the procedure by which numbers are expanded as decimals, therefore the former has no systematic influence on the latter. (It has an influence, but not a systematic one; it is the error of actualism to think that every influence must be systematic.) But it is also known that laws can have remote effects: one of the most amazing theorems in mathematics is the formula $\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$, which establishes a connection between the geometry of the circle and some simple arithmetic.
• Quantum mechanics: the proportion of photons absorbed by a polarization filter: If these photons are already polarized (but in a different direction than the filter) then this is not epistemic uncertainty but ontological indeterminacy, since the polarized photons form a pure state, which is atomic in the algebra of events. In this case, the distinction between epistemic uncertainty and ontological indeterminacy is operational: the two alternatives follow different mathematics.
• Statistical mechanics: the velocity distribution of molecules in a gas at a given pressure and temperature. Thermodynamics cannot be reduced to the mechanics of molecules, since mechanics is reversible in time, while thermodynamics is not. An additional element is needed, which can be modeled using probability.

Problem 2. Not every kind of uncertainty can be formulated stochastically. Which other methods are available if stochastic means are inappropriate?

Answer. Dialectics.

Problem 3.
How are the probabilities of rain in weather forecasts to be interpreted?

Answer. Rényi in [Rén70, pp. 33/4]: “By saying that the probability of rain tomorrow is 80% (or, what amounts to the same, 0.8) the meteorologist means that in a situation similar to that observed on the given day, there is usually rain on the next day in about 8 out of 10 cases; thus, while it is not certain that it will rain tomorrow, the degree of certainty of this event is 0.8.”

Pure uncertainty is as hard to generate as pure certainty; it is needed for encryption and numerical methods.

Here is an encryption scheme which leads to a random-looking sequence of numbers (see [Rao97, p. 13]): First a string of binary random digits is generated which is known only to the sender and receiver. The sender converts his message into a string of binary digits. He then places the message string below the key string and obtains a coded string by changing every message bit to its alternative at all places where the key bit is 1 and leaving the others unchanged. The coded string, which appears to be a random binary sequence, is transmitted. The received message is decoded by making the changes in the same way as in encrypting, using the key string which is known to the receiver.

Problem 4. Why is it important in the above encryption scheme that the key string is purely random and does not have any regularities?

Problem 5. [Knu81, pp. 7, 452] Suppose you wish to obtain a decimal digit at random, not using a computer. Which of the following methods would be suitable?
• a. Open a telephone directory to a random place (i.e., stick your finger in it somewhere) and use the unit digit of the first number found on the selected page.

Answer. This will often fail, since users select “round” numbers if possible. In some areas, telephone numbers are perhaps assigned randomly.
But it is a mistake in any case to try to get several successive random numbers from the same page, since many telephone numbers are listed several times in a sequence.
• b. Same as a, but use the units digit of the page number.

Answer. But do you use the left-hand page or the right-hand page? Say, use the left-hand page, divide by 2, and use the units digit.
• c. Roll a die which is in the shape of a regular icosahedron, whose twenty faces have been labeled with the digits 0, 0, 1, 1, ..., 9, 9. Use the digit which appears on top, when the die comes to rest. (A felt table with a hard surface is recommended for rolling dice.)

Answer. The markings on the face will slightly bias the die, but for practical purposes this method is quite satisfactory. See Math. Comp. 15 (1961), 94–95, for further discussion of these dice.
• d. Expose a geiger counter to a source of radioactivity for one minute (shielding yourself) and use the unit digit of the resulting count. (Assume that the geiger counter displays the number of counts in decimal notation, and that the count is initially zero.)

Answer. This is a difficult question thrown in purposely as a surprise. The number is not uniformly distributed! One sees this best if one imagines the source of radioactivity is very low level, so that only a few emissions can be expected during this minute. If the average number of emissions per minute is λ, the probability that the counter registers k is $e^{-\lambda}\lambda^{k}/k!$ (the Poisson distribution). So the digit 0 is selected with probability $e^{-\lambda}\sum_{k=0}^{\infty}\lambda^{10k}/(10k)!$, etc.
• e. Glance at your wristwatch, and if the position of the second-hand is between 6n and 6(n + 1), choose the digit n.

Answer. Okay, provided that the time since the last digit selected in this way is random. A bias may arise if borderline cases are not treated carefully.
A better device seems to be to use a stopwatch which has been started long ago, and which one stops arbitrarily, and then one has all the time necessary to read the display.
• f. Ask a friend to think of a random digit, and use the digit he names.

Answer. No, people usually think of certain digits (like 7) with higher probability.
• g. Assume 10 horses are entered in a race and you know nothing whatever about their qualifications. Assign to these horses the digits 0 to 9, in arbitrary fashion, and after the race use the winner’s digit.

Answer. Okay; your assignment of numbers to the horses had probability 1/10 of assigning a given digit to a winning horse.

2.2. Events as Sets

With every situation with uncertain outcome we associate its sample space U, which represents the set of all possible outcomes (described by the characteristics which we are interested in). Events are associated with subsets of the sample space, i.e., with bundles of outcomes that are observable in the given experimental setup. The set of all events we denote with F. (F is a set of subsets of U.)

Look at the example of rolling a die. U = {1, 2, 3, 4, 5, 6}. The event of getting an even number is associated with the subset {2, 4, 6}; getting a six with {6}; not getting a six with {1, 2, 3, 4, 5}, etc.

Now look at the example of rolling two indistinguishable dice. Observable events may be: getting two ones, getting a one and a two, etc. But we cannot distinguish between the first die getting a one and the second a two, and vice versa. I.e., if we define the sample set to be U = {1, ..., 6} × {1, ..., 6}, i.e., the set of all pairs of numbers between 1 and 6, then certain subsets are not observable. {(1, 5)} is not observable (unless the dice are marked or have different colors etc.), only {(1, 5), (5, 1)} is observable.
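The two-dice example can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `observable` and `smallest_observable` are hypothetical and not part of the notes, and "observable" is modeled here simply as "unchanged when the two dice are swapped."

```python
from itertools import product

# Sample space for two dice: all ordered pairs (first die, second die).
U = set(product(range(1, 7), repeat=2))

def observable(event):
    """Assumed criterion for indistinguishable dice: an event is observable
    only if it is symmetric, i.e. (i, j) in the event implies (j, i) is too."""
    return all((j, i) in event for (i, j) in event)

def smallest_observable(outcome):
    """Smallest observable event containing the given ordered outcome."""
    i, j = outcome
    return {(i, j), (j, i)}

print(len(U))                        # 36 ordered pairs
print(observable({(1, 5)}))          # False: the dice cannot be told apart
print(observable({(1, 5), (5, 1)}))  # True

# The smallest observable events partition U into
# 6 doubles plus 15 unordered mixed pairs.
atoms = {frozenset(smallest_observable(w)) for w in U}
print(len(atoms))                    # 21
```

Every observable event in this model is a union of these 21 smallest observable events, which is why {(1, 5)} alone does not qualify.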
If the experiment is measuring the height of a person in meters, and we make the idealized assumption that the measuring instrument is infinitely accurate, then all possible outcomes are numbers between 0 and 3, say. Sets of outcomes one is usually interested in are whether the height falls within a given interval; therefore all intervals within the given range represent observable events.

If the sample space is finite or countably infinite, very often all subsets are observable events. If the sample set contains an uncountable continuum, it is not desirable to consider all subsets as observable events. Mathematically one can define quite crazy subsets which have no practical significance and which cannot be meaningfully given probabilities. For the purposes of Econ 7800, it is enough to say that all the subsets which we may reasonably define are candidates for observable events.

The “set of all possible outcomes” is well defined in the case of rolling a die and other games; but in social sciences, situations arise in which the outcome is open and the range of possible outcomes cannot be known beforehand. If one uses a probability theory based on the concept of a “set of possible outcomes” in such a situation, one reduces a process which is open and evolutionary to an imaginary predetermined and static “set.” Furthermore, in social theory, the mechanisms by which these uncertain outcomes are generated are often internal to the members of the statistical population. The mathematical framework models these mechanisms as an extraneous “picking an element out of a pre-existing set.”

From given observable events we can derive new observable events by set theoretical operations. (All the operations below involve subsets of the same U.)

Mathematical Note: Notation of sets: there are two ways to denote a set: either by giving a rule, or by listing the elements.
(The order in which the elements are listed, or the fact whether some elements are listed twice or not, is irrelevant.)

Here are the formal definitions of set theoretic operations. The letters A, B, etc. denote subsets of a given set U (events), and I is an arbitrary index set. ω stands for an element, and ω ∈ A means that ω is an element of A.

(2.2.1)  A ⊂ B ⇔ (ω ∈ A ⇒ ω ∈ B)   (A is contained in B)
(2.2.2)  A ∩ B = {ω : ω ∈ A and ω ∈ B}   (intersection of A and B)
(2.2.3)  $\bigcap_{i \in I} A_i = \{\omega : \omega \in A_i \text{ for all } i \in I\}$
(2.2.4)  A ∪ B = {ω : ω ∈ A or ω ∈ B}   (union of A and B)
(2.2.5)  $\bigcup_{i \in I} A_i = \{\omega : \text{there exists an } i \in I \text{ such that } \omega \in A_i\}$
(2.2.6)  U = the universal set: all ω we talk about are ∈ U.
(2.2.7)  A′ = {ω : ω ∉ A but ω ∈ U}
(2.2.8)  ∅ = the empty set: ω ∉ ∅ for all ω.

These definitions can also be visualized by Venn diagrams; and for the purposes of this class, demonstrations with the help of Venn diagrams will be admissible in lieu of mathematical proofs.

Problem 6. For the following set-theoretical exercises it is sufficient that you draw the corresponding Venn diagrams and convince yourself by just looking at them that the statement is true. Those who are interested in a precise mathematical proof derived from the definitions of A ∪ B etc. given above should remember that a proof of the set-theoretical identity A = B usually has the form: first you show that ω ∈ A implies ω ∈ B, and then you show the converse.

• a. Prove that A ∪ B = B ⇔ A ∩ B = A.

Answer. If one draws the Venn diagrams, one can see that either side is true if and only if A ⊂ B. If one wants a more precise proof, the following proof by contradiction seems most illuminating: Assume the lefthand side does not hold, i.e., there exists an ω ∈ A but ω ∉ B. Then ω ∉ A ∩ B, i.e., A ∩ B ≠ A. Now assume the righthand side does not hold, i.e., there is an ω ∈ A with ω ∉ B. This ω lies in A ∪ B but not in B, i.e., the lefthand side does not hold either.

• b.
Prove that A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

Answer. If ω ∈ A then it is clearly always in the righthand side and in the lefthand side. If there is therefore any difference between the righthand and the lefthand side, it must be for the ω ∉ A: If ω ∉ A and it is still in the lefthand side then it must be in B ∩ C, therefore it is also in the righthand side. If ω ∉ A and it is in the righthand side, then it must be both in B and in C, therefore it is in the lefthand side.

• c. Prove that A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

Answer. If ω ∉ A then it is clearly neither in the righthand side nor in the lefthand side. If there is therefore any difference between the righthand and the lefthand side, it must be for the ω ∈ A: If ω ∈ A and it is in the lefthand side then it must be in B ∪ C, i.e., in B or in C or in both, therefore it is also in the righthand side. If ω ∈ A and it is in the righthand side, then it must be in either B or C or both, therefore it is in the lefthand side.

• d. Prove that $A \cap \bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} (A \cap B_i)$.

Answer. Proof: If ω is in the lefthand side, then it is in A and in at least one of the B_i, say it is in B_k. Therefore it is in A ∩ B_k, and therefore it is in the righthand side. Now assume, conversely, that ω is in the righthand side; then it is in at least one of the A ∩ B_i, say it is in A ∩ B_k. Hence it is in A and in B_k, i.e., in A and in $\bigcup_{i=1}^{\infty} B_i$, i.e., it is in the lefthand side.

Problem 7. 3 points Draw a Venn Diagram which shows the validity of de Morgan's laws: (A ∪ B)′ = A′ ∩ B′ and (A ∩ B)′ = A′ ∪ B′. If done right, the same Venn diagram can be used for both proofs.

Answer. There is a proof in [HT83, p. 12]. Draw A and B inside a box which represents U, and shade A′ from the left (blue) and B′ from the right (yellow), so that A′ ∩ B′ is cross shaded (green); then one can see these laws.

Problem 8. 3 points [HT83, Exercise 1.2-13 on p. 14] Evaluate the following unions and intersections of intervals.
Use the notation (a, b) for open and [a, b] for closed intervals, (a, b] or [a, b) for half open intervals, {a} for sets containing one element only, and ∅ for the empty set.

(2.2.9)   $\bigcup_{n=1}^{\infty} \left[\tfrac{1}{n}, 2\right] = \qquad \bigcap_{n=1}^{\infty} \left(0, \tfrac{1}{n}\right) =$
(2.2.10)  $\bigcup_{n=1}^{\infty} \left(\tfrac{1}{n}, 2\right) = \qquad \bigcap_{n=1}^{\infty} \left[0, 1 + \tfrac{1}{n}\right] =$

Answer.

(2.2.11)  $\bigcup_{n=1}^{\infty} \left[\tfrac{1}{n}, 2\right] = (0, 2] \qquad \bigcap_{n=1}^{\infty} \left(0, \tfrac{1}{n}\right) = \emptyset$
(2.2.12)  $\bigcup_{n=1}^{\infty} \left(\tfrac{1}{n}, 2\right) = (0, 2) \qquad \bigcap_{n=1}^{\infty} \left[0, 1 + \tfrac{1}{n}\right] = [0, 1]$

Explanation of $\bigcup_{n=1}^{\infty} \left[\tfrac{1}{n}, 2\right]$: for every α with 0 < α ≤ 2 there is an n with 1/n ≤ α, but 0 itself is in none of the intervals.

The set operations become logical operations if applied to events. Every experiment returns an element ω ∈ U as outcome. Here ω is rendered green in the electronic version of these notes (and in an upright font in the version for black-and-white printouts), because ω does not denote a specific element of U, but it depends on chance which element is picked. I.e., the green color (or the unusual font) indicates that ω is "alive." We will also render the events themselves (as opposed to their set-theoretical counterparts) in green (or in an upright font).

• We say that the event A has occurred when ω ∈ A.
• If A ⊂ B then event A implies event B, and we will write this directly in terms of events as A ⊂ B.
• The set A ∩ B is associated with the event that both A and B occur (e.g. an even number smaller than six), and considered as an event, not a set, the event that both A and B occur will be written A ∩ B.
• Likewise, A ∪ B is the event that either A or B, or both, occur.
• A′ is the event that A does not occur.
• U is the event that always occurs (as long as one performs the experiment).
• The empty set ∅ is associated with the impossible event ∅, because whatever the value ω of the chance outcome ω of the experiment, it is always ω ∉ ∅.
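The correspondence in the list above can be tried out on the die example. This is a minimal sketch using Python's built-in sets; the names `occurred` and `complement` are assumptions made for illustration, not notation from the notes.

```python
# Sample space for one roll of a die.
U = frozenset(range(1, 7))

A = frozenset({2, 4, 6})        # "an even number"
B = frozenset({1, 2, 3, 4, 5})  # "a number smaller than six"

def occurred(event, omega):
    """An event has occurred when the outcome omega lies in it."""
    return omega in event

def complement(event):
    """A' relative to the universal set U."""
    return U - event

omega = 4  # one particular outcome of the experiment

print(occurred(A & B, omega))          # both A and B occur
print(occurred(A | B, omega))          # A or B (or both) occur
print(occurred(complement(A), omega))  # A does not occur
print(frozenset({6}) <= A)             # "a six" implies "an even number"

# De Morgan's laws hold for these two events.
print(complement(A | B) == complement(A) & complement(B))
print(complement(A & B) == complement(A) | complement(B))
```

The operators `&`, `|`, `-` and `<=` are the standard set intersection, union, difference and subset tests, so the logical reading of each bullet maps onto one line of code.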
If A ∩ B = ∅, the set theoretician calls A and B "disjoint," and the probability theoretician calls the events A and B "mutually exclusive." If A ∪ B = U, then A and B are called "collectively exhaustive."

The set F of all observable events must be a σ-algebra, i.e., it must satisfy:

∅ ∈ F
A ∈ F ⇒ A′ ∈ F
A_1, A_2, ... ∈ F ⇒ A_1 ∪ A_2 ∪ ··· ∈ F, which can also be written as $\bigcup_{i=1,2,\ldots} A_i \in F$
A_1, A_2, ... ∈ F ⇒ A_1 ∩ A_2 ∩ ··· ∈ F, which can also be written as $\bigcap_{i=1,2,\ldots} A_i \in F$.

2.3. The Axioms of Probability

A probability measure Pr : F → R is a mapping which assigns to every event a number, the probability of this event. This assignment must be compatible with the set-theoretic operations between events in the following way:

(2.3.1)  Pr[U] = 1
(2.3.2)  Pr[A] ≥ 0 for all events A
(2.3.3)  If A_i ∩ A_j = ∅ for all i, j with i ≠ j, then $\Pr\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \Pr[A_i]$

Here an infinite sum is mathematically defined as the limit of partial sums. These axioms make probability what mathematicians call a measure, like area or weight. In a Venn diagram, one might therefore interpret the probability of the events as the area of the bubble representing the event.

Problem 9. Prove that Pr[A′] = 1 − Pr[A].

Answer. Follows from the fact that A and A′ are disjoint and their union U has probability 1.

Problem 10. 2 points Prove that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].

Answer. For Econ 7800 it is sufficient to argue it out intuitively: if one adds Pr[A] + Pr[B] then one counts Pr[A ∩ B] twice and therefore has to subtract it again.

The brute force mathematical proof guided by this intuition is somewhat verbose: Define D = A ∩ B′, E = A ∩ B, and F = A′ ∩ B. D, E, and F satisfy

(2.3.4)  D ∪ E = (A ∩ B′) ∪ (A ∩ B) = A ∩ (B′ ∪ B) = A ∩ U = A,
(2.3.5)  E ∪ F = B,
(2.3.6)  D ∪ E ∪ F = A ∪ B.

You may need some of the properties of unions and intersections in Problem 6. Next step is to prove that D, E, and F are mutually exclusive.
Therefore it is easy to take probabilities:

(2.3.7)  Pr[A] = Pr[D] + Pr[E];
(2.3.8)  Pr[B] = Pr[E] + Pr[F];
(2.3.9)  Pr[A ∪ B] = Pr[D] + Pr[E] + Pr[F].

Take the sum of (2.3.7) and (2.3.8), and subtract (2.3.9):

(2.3.10)  Pr[A] + Pr[B] − Pr[A ∪ B] = Pr[E] = Pr[A ∩ B].

A shorter but trickier alternative proof is the following. First note that A ∪ B = A ∪ (A′ ∩ B) and that this is a disjoint union, i.e., Pr[A ∪ B] = Pr[A] + Pr[A′ ∩ B]. Then note that B = (A ∩ B) ∪ (A′ ∩ B), and this is a disjoint union, therefore Pr[B] = Pr[A ∩ B] + Pr[A′ ∩ B], or Pr[A′ ∩ B] = Pr[B] − Pr[A ∩ B]. Putting this together gives the result.

Problem 11. 1 point Show that for arbitrary events A and B, Pr[A ∪ B] ≤ Pr[A] + Pr[B].

Answer. From Problem 10 we know that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B], and from axiom (2.3.2) follows Pr[A ∩ B] ≥ 0.

Problem 12. 2 points (Bonferroni inequality) Let A and B be two events. Writing Pr[A] = 1 − α and Pr[B] = 1 − β, show that Pr[A ∩ B] ≥ 1 − (α + β). You are allowed to use that Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] (Problem 10), and that all probabilities are ≤ 1.

Answer.

(2.3.11)  Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B] ≤ 1
(2.3.12)  Pr[A] + Pr[B] ≤ 1 + Pr[A ∩ B]
(2.3.13)  Pr[A] + Pr[B] − 1 ≤ Pr[A ∩ B]
(2.3.14)  1 − α + 1 − β − 1 = 1 − α − β ≤ Pr[A ∩ B]

Problem 13. (Not eligible for in-class exams) Given a rising sequence of events B_1 ⊂ B_2 ⊂ B_3 ···, define $B = \bigcup_{i=1}^{\infty} B_i$. Show that $\Pr[B] = \lim_{i \to \infty} \Pr[B_i]$.

Answer. Define C_1 = B_1, C_2 = B_2 ∩ B_1′, C_3 = B_3 ∩ B_2′, etc. Then C_i ∩ C_j = ∅ for i ≠ j, and $B_n = \bigcup_{i=1}^{n} C_i$ and $B = \bigcup_{i=1}^{\infty} C_i$. In other words, now we have represented every B_n and B as a union of disjoint sets, and can therefore apply the third probability axiom (2.3.3): $\Pr[B] = \sum_{i=1}^{\infty} \Pr[C_i]$. The infinite sum is merely a short way of writing $\Pr[B] = \lim_{n \to \infty} \sum_{i=1}^{n} \Pr[C_i]$, i.e., the infinite sum is the limit of the finite sums.
But since these finite sums are exactly $\sum_{i=1}^{n} \Pr[C_i] = \Pr\left[\bigcup_{i=1}^{n} C_i\right] = \Pr[B_n]$, the assertion follows. This proof, as it stands, is for our purposes entirely acceptable. One can make some steps in this proof still more stringent. For instance, one might use induction to prove $B_n = \bigcup_{i=1}^{n} C_i$. And how does one show that $B = \bigcup_{i=1}^{\infty} C_i$? Well, one knows that C_i ⊂ B_i, therefore $\bigcup_{i=1}^{\infty} C_i \subset \bigcup_{i=1}^{\infty} B_i = B$. Now take an ω ∈ B. Then it lies in at least one of the B_i, but it can be in many of them. Let k be the smallest k for which ω ∈ B_k. If k = 1, then ω ∈ C_1 = B_1 as well. Otherwise, ω ∉ B_{k−1}, and therefore ω ∈ C_k. I.e., any element in B lies in at least one of the C_k, therefore $B \subset \bigcup_{i=1}^{\infty} C_i$.

Problem 14. (Not eligible for in-class exams) From Problem 13 derive also the following: if A_1 ⊃ A_2 ⊃ A_3 ··· is a declining sequence, and $A = \bigcap_i A_i$, then Pr[A] = lim Pr[A_i].

Answer. If the A_i are declining, then their complements B_i = A_i′ are rising: B_1 ⊂ B_2 ⊂ B_3 ···; therefore I know the probability of $B = \bigcup_i B_i$. Since by de Morgan's laws, B = A′, this gives me also the probability of A.

The results regarding the probabilities of rising or declining sequences are equivalent to the third probability axiom. This third axiom can therefore be considered a continuity condition for probabilities.

If U is finite or countably infinite, then the probability measure is uniquely determined if one knows the probability of every one-element set. We will call Pr[{ω}] = p(ω) the probability mass function. Other terms used for it in the literature are probability function, or even probability density function (although it is not a density, more about this below). If U has more than countably infinite elements, the probabilities of one-element sets may not give enough information to define the whole probability measure.

Mathematical Note: Not all infinite sets are countable.
Here is a proof, by contradiction, that the real numbers between 0 and 1 are not countable: assume there is an enumeration, i.e., a sequence a_1, a_2, ... which contains them all. Write them underneath each other in their (possibly infinite) decimal representation, where $0.d_{i1} d_{i2} d_{i3} \ldots$ is the decimal representation of a_i. Then any real number whose decimal representation is such that the first digit is not equal to d_{11}, the second digit is not equal to d_{22}, the third not equal to d_{33}, etc., is a real number which is not contained in this enumeration. That means an enumeration which contains all real numbers cannot exist.

On the real numbers between 0 and 1, the length measure (which assigns to each interval its length, and to sets composed of several intervals the sums of the lengths, etc.) is a probability measure. In this probability field, every one-element subset of the sample set has zero probability.

This shows that events other than ∅ may have zero probability. In other words, if an event has probability 0, this does not mean it is logically impossible. It may well happen, but it happens so infrequently that in repeated experiments the average number of occurrences converges toward zero.

2.4. Objective and Subjective Interpretation of Probability

The mathematical probability axioms apply to both objective and subjective interpretation of probability.

The objective interpretation considers probability a quasi physical property of the experiment. One cannot simply say: Pr[A] is the relative frequency of the occurrence of A, because we know intuitively that this frequency does not necessarily converge. E.g., even with a fair coin it is physically possible that one always gets head, or that one gets some other sequence which does not converge towards 1/2. The above axioms resolve this dilemma, because they allow one to derive the theorem that the relative frequencies converge towards the probability with probability one.
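The statement about relative frequencies can be illustrated, though of course not proved, by a small simulation. The following is a sketch under stated assumptions: the function name `running_frequency`, the checkpoints and the seed are all chosen here for illustration, and the pseudo-random generator merely stands in for a fair coin.

```python
import random

def running_frequency(n_flips, p_head=0.5, seed=0):
    """Flip a simulated coin n_flips times and record the relative
    frequency of heads at a few checkpoints (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        if rng.random() < p_head:
            heads += 1
        if i in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[i] = heads / i
    return checkpoints

freqs = running_frequency(100_000)
for n, f in freqs.items():
    print(f"{n:>7} flips: relative frequency of heads = {f:.4f}")
```

In a typical run the frequencies at the early checkpoints wander noticeably, while the later ones sit close to 1/2. A single run cannot show convergence with probability one; it only shows what that theorem leads one to expect.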
The subjectivist interpretation (de Finetti: "probability does not exist") defines probability in terms of people's ignorance and willingness to take bets. It is interesting for economists because it uses money and utility, as in expected utility. Call "a lottery on A" a lottery which pays $1 if A occurs, and which pays nothing if A does not occur. If a person is willing to pay p dollars for a lottery on A and 1 − p dollars for a lottery on A′, then, according to a subjectivist definition of probability, he assigns subjective probability p to A.

There is the presumption that his willingness to bet does not depend on the size of the payoff (i.e., the payoffs are considered to be small amounts).

Problem 15. Assume A, B, and C are a complete disjunction of events, i.e., they are mutually exclusive and A ∪ B ∪ C = U, the universal set.

• a. 1 point Arnold assigns subjective probability p to A, q to B, and r to C. Explain exactly what this means.

Answer. We know six different bets which Arnold is always willing to make, not only on A, B, and C, but also on their complements.

• b. 1 point Assume that p + q + r > 1. Name three lotteries which Arnold would be willing to buy, the net effect of which would be that he loses with certainty.

Answer. Among those six we have to pick subsets that make him a sure loser. If p + q + r > 1, then we sell him a bet on A, one on B, and one on C. The payoff is always 1, and the cost is p + q + r > 1.

• c. 1 point Now assume that p + q + r < 1. Name three lotteries which Arnold would be willing to buy, the net effect of which would be that he loses with certainty.

Answer. If p + q + r < 1, then we sell him a bet on A′, one on B′, and one on C′. The payoff is 2, and the cost is 1 − p + 1 − q + 1 − r > 2.

• d. 1 point Arnold is therefore only coherent if Pr[A] + Pr[B] + Pr[C] = 1.
Show that the additivity of probability can be derived from coherence, i.e., show that any subjective probability that satisfies the rule: whenever A, B, and C is a complete disjunction of events, then the sum of their probabilities is 1, is additive, i.e., Pr[A ∪ B] = Pr[A] + Pr[B].

Answer. Since r is his subjective probability of C, 1 − r must be his subjective probability of C′ = A ∪ B. Since p + q + r = 1, it follows 1 − r = p + q.

This last problem indicates that the finite additivity axiom follows from the requirement that the bets be consistent or, as subjectivists say, "coherent" with each other. However, it is not possible to derive the additivity for countably infinite sequences of events from such an argument.

2.5. Counting Rules

In this section we will be working in a finite probability space, in which all atomic events have equal probabilities. The acts of rolling dice or drawing balls from urns can be modeled by such spaces. In order to compute the probability of a given event, one must count the elements of the set which this event represents. In other words, we count how many different ways there are to achieve a certain outcome. This can be tricky, and we will develop some general principles how to do it.

Problem 16. You throw two dice.

• a. 1 point What is the probability that the sum of the numbers shown is five or less?

Answer. The favorable outcomes are

11 12 13 14
21 22 23
31 32
41

i.e., 10 out of 36 possibilities, which gives the probability 5/18.

• b. 1 point What is the probability that both of the numbers shown are five or less?

Answer. The favorable outcomes are

11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55

i.e., 25/36.

• c. 2 points What is the probability that the maximum of the two numbers shown is five? (As a clarification: if the first die shows 4 and the second shows 3 then the maximum of the numbers shown is 4.)

Answer. The favorable outcomes are

            15
            25
            35
            45
51 52 53 54 55

i.e., 9 out of 36 possibilities, which gives 1/4.
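The counts in Problem 16 can be reproduced by brute-force enumeration of the 36 equally likely outcomes. This is a minimal sketch; the helper `prob` is an assumption made for illustration. Using `fractions.Fraction` returns each probability already reduced to lowest terms.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on an outcome,
    by counting favorable outcomes (hypothetical helper)."""
    favorable = sum(1 for w in outcomes if event(w))
    return Fraction(favorable, len(outcomes))

p_sum_le_5 = prob(lambda w: w[0] + w[1] <= 5)  # part a
p_both_le_5 = prob(lambda w: max(w) <= 5)      # part b
p_max_is_5 = prob(lambda w: max(w) == 5)       # part c

print(p_sum_le_5, p_both_le_5, p_max_is_5)     # 5/18 25/36 1/4
```

Counting by machine is only feasible for small sample spaces, which is why the general counting principles are still needed.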
In this and in similar questions to follow, the answer should be given as a fully shortened fraction.

The multiplication principle is a basic aid in counting: If the first operation can be done n_1 ways, and the second operation n_2 ways, then the total can be done n_1 n_2 ways.

Definition: A permutation of a set is its arrangement in a certain order. It was mentioned earlier that for a set it does not matter in which order the elements are written down; the number of permutations is therefore the number of ways a given set can be written down without repeating its elements. From the multiplication principle follows: the number of permutations of a set of n elements is n(n − 1)(n − 2) ··· (2)(1) = n! (n factorial). By definition, 0! = 1.

If one does not arrange the whole set, but is interested in the number of k-tuples made up of distinct elements of the set, then the number of possibilities is $n(n-1)(n-2) \cdots (n-k+2)(n-k+1) = \frac{n!}{(n-k)!}$. (Start with n and the number of factors is k.) (k-tuples are sometimes called ordered k-tuples because the order in which the elements are written down matters.) [Ame94, p. 8] uses the notation $P^n_k$ for this.

This leads us to the next question: how many k-element subsets does an n-element set have? We already know how many permutations into k elements it has; but always k! of these permutations represent the same subset; therefore we have to divide by k!. The number of k-element subsets of an n-element set is therefore

(2.5.1)  $\binom{n}{k} = \frac{n(n-1)(n-2) \cdots (n-k+1)}{(1)(2)(3) \cdots k} = \frac{n!}{k!\,(n-k)!}$.

It is pronounced as n choose k, and is also called a "binomial coefficient." It is defined for all 0 ≤ k ≤ n. [Ame94, p. 8] calls this number $C^n_k$.

Problem 17. 5 points Compute the probability of getting two of a kind and three of a kind (a "full house") when five dice are rolled. (It is not necessary to express it as a decimal number; a fraction of integers is just fine.
But please explain what you are doing.)

Answer. See [Ame94, example 2.3.3 on p. 9]. Sample space is all ordered 5-tuples out of 6, which has $6^5$ elements. Number of full houses can be identified with number of all ordered pairs of distinct elements out of 6, the first element in the pair denoting the number which appears twice and the second element that which appears three times, i.e., $P^6_2 = 6 \cdot 5$. Number of arrangements of a given full house over the five dice is $C^5_2 = \frac{5 \cdot 4}{1 \cdot 2}$ (we have to specify the two places taken by the two-of-a-kind outcomes). Solution is therefore $P^6_2 \cdot C^5_2 / 6^5 = 50/6^4 = 0.03858$. This approach uses counting.

Alternative approach, using conditional probability: probability of getting 3 of one kind and then two of a different kind is $1 \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{5}{6} \cdot \frac{1}{6} = \frac{5}{6^4}$. Then multiply by $\binom{5}{2} = 10$, since this is the number of arrangements of the 3 and 2 over the five dice.

Problem 18. What is the probability of drawing the King of Hearts and the Queen of Hearts if one draws two cards out of a 52 card game? Is it $\frac{1}{52^2}$? Is it $\frac{1}{(52)(51)}$? Or is it $1 \big/ \binom{52}{2} = \frac{2}{(52)(51)}$?

Answer. Of course the last; it is the probability of drawing one special subset. There are two ways of drawing this subset: first the King and then the Queen, or first the Queen and then the King.

2.6. Relationships Involving Binomial Coefficients

Problem 19. Show that $\binom{n}{k} = \binom{n}{n-k}$. Give an intuitive argument why this must be so.

Answer. Because $\binom{n}{n-k}$ counts the complements of k-element sets.

Assume U has n elements, one of which is ν ∈ U. How many k-element subsets of U have ν in them? There is a simple trick: Take all (k − 1)-element subsets of the set you get by removing ν from U, and add ν to each of these sets. I.e., the number is $\binom{n-1}{k-1}$. Now how many k-element subsets of U do not have ν in them? Simple; just take the k-element subsets of the set which one gets by removing ν from U; i.e., it is $\binom{n-1}{k}$.
Adding those two kinds of subsets together one gets all k-element subsets of U:

(2.6.1)  $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.

This important formula is the basis of the Pascal triangle, in which row n lists $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$:

(2.6.2)
n = 0:            1
n = 1:           1 1
n = 2:          1 2 1
n = 3:         1 3 3 1
n = 4:        1 4 6 4 1
n = 5:      1 5 10 10 5 1

The binomial coefficients also occur in the Binomial Theorem

(2.6.3)  $(a+b)^n = a^n + \binom{n}{1} a^{n-1} b + \cdots + \binom{n}{n-1} a b^{n-1} + b^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k$

Why? When the n factors a + b are multiplied out, each of the resulting terms selects from each of the n original factors either a or b. The term $a^{n-k} b^k$ occurs therefore $\binom{n}{n-k} = \binom{n}{k}$ times.

As an application: If you set a = 1, b = 1, you simply get a sum of binomial coefficients, i.e., you get the number of subsets in a set with n elements: it is $2^n$ (always count the empty set as one of the subsets). The number of all subsets is easily counted directly. You go through the set element by element and about every element you ask: is it in the subset or not? I.e., for every element you have two possibilities, therefore by the multiplication principle the total number of possibilities is $2^n$.

2.7. Conditional Probability

The concept of conditional probability is arguably more fundamental than probability itself. Every probability is conditional, since we must know that the "experiment" has happened before we can speak of probabilities. [Ame94, p. 10] and [Rén70] give axioms for conditional probability which take the place of the above axioms (2.3.1), (2.3.2) and (2.3.3). However we will follow here the common procedure of defining conditional probabilities in terms of the unconditional probabilities:

(2.7.1)  $\Pr[B|A] = \frac{\Pr[B \cap A]}{\Pr[A]}$

How can we motivate (2.7.1)? If we know that A has occurred, then of course the only way that B occurs is when B ∩ A occurs.
But we want to multiply all probabilities of subsets of A with an appropriate proportionality factor so that the probability of the event A itself becomes 1.

Problem 20. 3 points Let A be an event with nonzero probability. Show that the probability conditionally on A, i.e., the mapping B → Pr[B|A], satisfies all the axioms of a probability measure:

(2.7.2)  Pr[U|A] = 1
(2.7.3)  Pr[B|A] ≥ 0 for all events B
(2.7.4)  $\Pr\left[\bigcup_{i=1}^{\infty} B_i \,\Big|\, A\right] = \sum_{i=1}^{\infty} \Pr[B_i|A]$ if B_i ∩ B_j = ∅ for all i, j with i ≠ j.

Answer. Pr[U|A] = Pr[U ∩ A]/Pr[A] = 1. Pr[B|A] = Pr[B ∩ A]/Pr[A] ≥ 0 because Pr[B ∩ A] ≥ 0 and Pr[A] > 0. Finally,

(2.7.5)  $\Pr\left[\bigcup_{i=1}^{\infty} B_i \,\Big|\, A\right] = \frac{\Pr\left[\left(\bigcup_{i=1}^{\infty} B_i\right) \cap A\right]}{\Pr[A]} = \frac{\Pr\left[\bigcup_{i=1}^{\infty} (B_i \cap A)\right]}{\Pr[A]} = \frac{1}{\Pr[A]} \sum_{i=1}^{\infty} \Pr[B_i \cap A] = \sum_{i=1}^{\infty} \Pr[B_i|A]$

First equal sign is definition of conditional probability, second is distributivity of unions and intersections (Problem 6 d), third because the B_i are disjoint and therefore the B_i ∩ A are even more disjoint: B_i ∩ A ∩ B_j ∩ A = B_i ∩ B_j ∩ A = ∅ ∩ A = ∅ for all i, j with i ≠ j, and the last equal sign again by the definition of conditional probability.

Problem 21. You draw two balls without replacement from an urn which has 7 white and 14 black balls. If both balls are white, you roll a die, and your payoff is the number which the die shows in dollars. If one ball is black and one is white, you flip a coin until you get your first head, and your payoff will be the number of flips it takes you to get a head, in dollars again. If both balls are black, you draw from a deck of 52 cards, and you get the number shown on the card in dollars. (Ace counts as one, J, Q, and K as 11, 12, 13, i.e., basically the deck contains every number between 1 and 13 four times.) Show that the probability that you receive exactly two dollars in this game is 1/6.

Answer.
You know a complete disjunction of events: U = {ww} ∪ {bb} ∪ {wb}, with $\Pr[\{ww\}] = \frac{7}{21} \cdot \frac{6}{20} = \frac{1}{10}$; $\Pr[\{bb\}] = \frac{14}{21} \cdot \frac{13}{20} = \frac{13}{30}$; $\Pr[\{bw\}] = \frac{7}{21} \cdot \frac{14}{20} + \frac{14}{21} \cdot \frac{7}{20} = \frac{7}{15}$. Furthermore you know the conditional probabilities of getting 2 dollars conditionally on each of these events: $\Pr[\{2\}|\{ww\}] = \frac{1}{6}$; $\Pr[\{2\}|\{bb\}] = \frac{1}{13}$; $\Pr[\{2\}|\{wb\}] = \frac{1}{4}$. Now Pr[{2} ∩ {ww}] = Pr[{2}|{ww}] Pr[{ww}] etc., therefore

(2.7.6)  $\Pr[\{2\}] = \Pr[\{2\} \cap \{ww\}] + \Pr[\{2\} \cap \{bw\}] + \Pr[\{2\} \cap \{bb\}]$
(2.7.7)  $= \frac{1}{6} \cdot \frac{7}{21} \cdot \frac{6}{20} + \frac{1}{4} \left( \frac{7}{21} \cdot \frac{14}{20} + \frac{14}{21} \cdot \frac{7}{20} \right) + \frac{1}{13} \cdot \frac{14}{21} \cdot \frac{13}{20}$
(2.7.8)  $= \frac{1}{6} \cdot \frac{1}{10} + \frac{1}{4} \cdot \frac{7}{15} + \frac{1}{13} \cdot \frac{13}{30} = \frac{1}{6}$

Problem 22. 2 points A and B are arbitrary events. Prove that the probability of B can be written as:

(2.7.9)  Pr[B] = Pr[B|A] Pr[A] + Pr[B|A′] Pr[A′]

This is the law of iterated expectations (6.6.2) in the case of discrete random variables: it might be written as Pr[B] = E[Pr[B|A]].

Answer. B = B ∩ U = B ∩ (A ∪ A′) = (B ∩ A) ∪ (B ∩ A′) and this union is disjoint, i.e., (B ∩ A) ∩ (B ∩ A′) = B ∩ (A ∩ A′) = B ∩ ∅ = ∅. Therefore Pr[B] = Pr[B ∩ A] + Pr[B ∩ A′]. Now apply definition of conditional probability to get Pr[B ∩ A] = Pr[B|A] Pr[A] and Pr[B ∩ A′] = Pr[B|A′] Pr[A′].

Problem 23. 2 points Prove the following lemma: If Pr[B|A_1] = Pr[B|A_2] (call it c) and A_1 ∩ A_2 = ∅ (i.e., A_1 and A_2 are disjoint), then also Pr[B|A_1 ∪ A_2] = c.

Answer.

(2.7.10)  $\Pr[B|A_1 \cup A_2] = \frac{\Pr[B \cap (A_1 \cup A_2)]}{\Pr[A_1 \cup A_2]} = \frac{\Pr[(B \cap A_1) \cup (B \cap A_2)]}{\Pr[A_1 \cup A_2]} = \frac{\Pr[B \cap A_1] + \Pr[B \cap A_2]}{\Pr[A_1] + \Pr[A_2]} = \frac{c \Pr[A_1] + c \Pr[A_2]}{\Pr[A_1] + \Pr[A_2]} = c.$

Problem 24. Show by counterexample that the requirement A_1 ∩ A_2 = ∅ is necessary for this result to hold. Hint: use the example in Problem 38 with A_1 = {HH, HT}, A_2 = {HH, TH}, B = {HH, TT}.

Answer. Pr[B|A_1] = 1/2 and Pr[B|A_2] = 1/2, but Pr[B|A_1 ∪ A_2] = 1/3.

The conditional probability can be used for computing probabilities of intersections of events.

Problem 25. [Lar82, exercises 2.5.1 and 2.5.2 on p.
57, solutions on p. 597, but no discussion]. Five white and three red balls are laid out in a row at random.

• a. 3 points What is the probability that both end balls are white? What is the probability that one end ball is red and the other white?

Answer. You can lay the first ball first and the last ball second: for white balls, the probability is $\frac{5}{8} \cdot \frac{4}{7} = \frac{5}{14}$; for one white, one red it is $\frac{5}{8} \cdot \frac{3}{7} + \frac{3}{8} \cdot \frac{5}{7} = \frac{15}{28}$.

• b. 4 points What is the probability that all red balls are together? What is the probability that all white balls are together?

Answer. All red balls together is the same as 3 reds first, multiplied by 6, because you may have between 0 and 5 white balls before the first red: $\frac{3}{8} \cdot \frac{2}{7} \cdot \frac{1}{6} \cdot 6 = \frac{3}{28}$. For the white balls you get $\frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} \cdot \frac{2}{5} \cdot \frac{1}{4} \cdot 4 = \frac{1}{14}$.

BTW, 3 reds first is same probability as 3 reds last, i.e., the 5 whites first: $\frac{5}{8} \cdot \frac{4}{7} \cdot \frac{3}{6} \cdot \frac{2}{5} \cdot \frac{1}{4} = \frac{3}{8} \cdot \frac{2}{7} \cdot \frac{1}{6}$.

Problem 26. The first three questions here are discussed in [Lar82, example 2.6.3 on p. 62]: There is an urn with 4 white and 8 black balls. You take two balls out without replacement.

• a. 1 point What is the probability that the first ball is white?

Answer. 1/3

• b. 1 point What is the probability that both balls are white?

Answer. It is $\Pr[\text{second ball white}|\text{first ball white}] \Pr[\text{first ball white}] = \frac{3}{3+8} \cdot \frac{4}{4+8} = \frac{1}{11}$.

• c. 1 point What is the probability that the second ball is white?

Answer. It is Pr[first ball white and second ball white] + Pr[first ball black and second ball white]

(2.7.11)  $= \frac{3}{3+8} \cdot \frac{4}{4+8} + \frac{4}{7+4} \cdot \frac{8}{8+4} = \frac{1}{3}$.

This is the same as the probability that the first ball is white. The probabilities are not dependent on the order in which one takes the balls out. This property is called "exchangeability." One can see it also in this way: Assume you number the balls at random, from 1 to 12. Then the probability for a white ball to have the number 2 assigned to it is obviously 1/3.

• d. 1 point What is the probability that both of them are black?

Answer.
$\frac{8}{12} \cdot \frac{7}{11} = \frac{2}{3} \cdot \frac{7}{11} = \frac{14}{33}$ (or $\frac{56}{132}$).

• e. 1 point What is the probability that both of them have the same color?

Answer. The sum of the two above, $\frac{14}{33} + \frac{1}{11} = \frac{17}{33}$ (or $\frac{68}{132}$).

Now you take three balls out without replacement.

• f. 2 points Compute the probability that at least two of the three balls are white.

Answer. It is $\frac{13}{55}$. The possibilities are wwb, wbw, bww, and www. Of the first three, each has probability $\frac{4 \cdot 3 \cdot 8}{12 \cdot 11 \cdot 10}$. Therefore the probability for exactly two being white is $\frac{288}{1320} = \frac{12}{55}$. The probability for www is $\frac{4 \cdot 3 \cdot 2}{12 \cdot 11 \cdot 10} = \frac{24}{1320} = \frac{1}{55}$. Add this to get $\frac{312}{1320} = \frac{13}{55}$. More systematically, the answer is $\left[ \binom{4}{2} \binom{8}{1} + \binom{4}{3} \right] \Big/ \binom{12}{3}$.

• g. 1 point Compute the probability that at least two of the three are black.

Answer. It is $\frac{42}{55}$. For exactly two: $\frac{672}{1320} = \frac{28}{55}$. For three it is $\frac{(8)(7)(6)}{(12)(11)(10)} = \frac{336}{1320} = \frac{14}{55}$. Together $\frac{1008}{1320} = \frac{42}{55}$. One can also get it as: it is the complement of the last, or as $\left[ \binom{8}{3} + \binom{8}{2} \binom{4}{1} \right] \Big/ \binom{12}{3}$.

• h. 1 point Compute the probability that two of the three are of the same and the third of a different color.

Answer. It is $\frac{960}{1320} = \frac{40}{55} = \frac{8}{11}$, or $\left[ \binom{4}{1} \binom{8}{2} + \binom{4}{2} \binom{8}{1} \right] \Big/ \binom{12}{3}$.

• i. 1 point Compute the probability that at least two of the three are of the same color.

Answer. This probability is 1. You have 5 black socks and 5 white socks in your drawer. There is a fire at night and you must get out of your apartment in two minutes. There is no light. You fumble in the dark for the drawer. How many socks do you have to take out so that you will have at least 2 of the same color? The answer is 3 socks.

Problem 27. If a poker hand of five cards is drawn from a deck, what is the probability that it will contain three aces? (How can the concept of conditional probability help in answering this question?)

Answer. [Ame94, example 2.3.3 on p. 9] and [Ame94, example 2.5.1 on p. 13] give two alternative ways to do it.
The second answer uses conditional probability: Probability to draw three aces in a row first and then 2 nonaces is $\frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{48}{49} \cdot \frac{47}{48}$. Then multiply this by $\binom{5}{3} = \frac{5 \cdot 4 \cdot 3}{1 \cdot 2 \cdot 3} = 10$. This gives 0.0017, i.e., 0.17%.

Problem 28. 2 points A friend tosses two coins. You ask: "did one of them land heads?" Your friend answers, "yes." What's the probability that the other also landed heads?

Answer. U = {HH, HT, TH, TT}; probability is $\frac{1}{4} \big/ \frac{3}{4} = \frac{1}{3}$.

Problem 29. (Not eligible for in-class exams) [Ame94, p. 5] What is the probability that a person will win a game in tennis if the probability of his or her winning a point is p?

Answer.

(2.7.12)  $p^4 \left( 1 + 4(1-p) + 10(1-p)^2 + \frac{20 p (1-p)^3}{1 - 2p(1-p)} \right)$

How to derive this: {ssss} has probability $p^4$; {sssfs}, {ssfss}, {sfsss}, and {fssss} have together probability $4p^4(1-p)$; {sssffs} etc. (2 f and 3 s in the first 5, and then an s, together $\binom{5}{2} = 10$ possibilities) have probability $10p^4(1-p)^2$. Now {sssfff} and the other arrangements of 3 s and 3 f (together $\binom{6}{3} = 20$ possibilities) give deuce at least once in the game, i.e., the probability of deuce is $20p^3(1-p)^3$. Now $\Pr[\text{win}|\text{deuce}] = p^2 + 2p(1-p)\Pr[\text{win}|\text{deuce}]$, because you win either if you score twice in a row ($p^2$) or if you get deuce again (probability $2p(1-p)$) and then win. Solve this to get $\Pr[\text{win}|\text{deuce}] = p^2 / \left(1 - 2p(1-p)\right)$ and then multiply this conditional probability with the probability of getting deuce at least once: $\Pr[\text{win after at least one deuce}] = 20p^3(1-p)^3 \, p^2 / \left(1 - 2p(1-p)\right)$. This gives the last term in (2.7.12).

Problem 30. (Not eligible for in-class exams) Andy, Bob, and Chris play the following game: each of them draws a card without replacement from a deck of 52 cards. The one who has the highest card wins. If there is a tie (like: two kings and no aces), then, among those who drew this highest card, the person whose name comes first in the alphabet wins. What is the probability for Andy to be the winner? For Bob? For Chris?
Does this probability depend on the order in which they draw their cards out of the stack?

Answer. Let A be the event that Andy wins, B that Bob wins, and C that Chris wins.

One way to approach this problem is to ask: what are the chances for Andy to win when he draws a king?, etc., i.e., compute it for all 13 different cards. Then: what are the chances for Bob to win when he draws a king, and also his chances for the other cards, and then for Chris.

It is computationally easier to make the following partitioning of all outcomes: Either all three cards drawn are different (call this event D), or all three cards are equal (event E), or two of the three cards are equal (T). This third case will have to be split into $T = H \cup L$, according to whether the card that is different is higher or lower.

If all three cards are different, then Andy, Bob, and Chris have equal chances of winning; if all three cards are equal, then Andy wins. What about the case that two cards are the same and the third is different? There are two possibilities. If the card that is different is higher than the two that are the same, then the chances of winning are evenly distributed; but if the two equal cards are higher, then Andy has a $\frac{2}{3}$ chance of winning (when the distribution of the cards Y (lower) and Z (higher) among ABC is ZZY or ZYZ), and Bob has a $\frac{1}{3}$ chance of winning (when the distribution is YZZ).

What we just did was computing the conditional probabilities $\Pr[A|D]$, $\Pr[A|E]$, etc. Now we need the probabilities of D, E, and T. What is the probability that all three cards drawn are the same? The probability that the second card is the same as the first is $\frac{3}{51}$; and the probability that the third is the same too is $\frac{2}{50}$; therefore the total probability is $\frac{(3)(2)}{(51)(50)} = \frac{6}{2550}$. The probability that all three are unequal is $\frac{48}{51}\cdot\frac{44}{50} = \frac{2112}{2550}$. The probability that two are equal and the third is different is $3\cdot\frac{3}{51}\cdot\frac{48}{50} = \frac{432}{2550}$.
Now in half of these cases, the card that is different is higher, and in half of the cases it is lower. Putting this together one gets:

                                Uncond. Prob.   Cond. Prob.         Prob. of intersection
                                                A     B     C       A          B          C
E    all 3 equal                6/2550          1     0     0       6/2550     0          0
H    2 of 3 equal, 3rd higher   216/2550        1/3   1/3   1/3     72/2550    72/2550    72/2550
L    2 of 3 equal, 3rd lower    216/2550        2/3   1/3   0       144/2550   72/2550    0
D    all 3 unequal              2112/2550       1/3   1/3   1/3     704/2550   704/2550   704/2550
Sum                             2550/2550                           926/2550   848/2550   776/2550

I.e., the probability that A wins is 926/2550 = 463/1275 = .363, the probability that B wins is 848/2550 = 424/1275 = .3325, and the probability that C wins is 776/2550 = 388/1275 = .304. Here we are using $\Pr[A] = \Pr[A|E]\Pr[E] + \Pr[A|H]\Pr[H] + \Pr[A|L]\Pr[L] + \Pr[A|D]\Pr[D]$.

Problem 31. 4 points You are the contestant in a game show. There are three closed doors at the back of the stage. Behind one of the doors is a sports car, behind the other two doors are goats. The game master knows which door has the sports car behind it, but you don’t. You have to choose one of the doors; if it is the door with the sports car, the car is yours. After you make your choice, say door A, the game master says: “I want to show you something.” He opens one of the two other doors, let us assume it is door B, and it has a goat behind it. Then the game master asks: “Do you still insist on door A, or do you want to reconsider your choice?” Can you improve your odds of winning by abandoning your previous choice and instead selecting the door which the game master did not open? If so, by how much?

Answer. If you switch, you will lose the car if you had initially picked the right door, but you will get the car if you were wrong before! Therefore you improve your chances of winning from 1/3 to 2/3. This is simulated on the web, see www.stat.sc.edu/~west/javahtml/LetsMakeaDeal.html.

It is counterintuitive.
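The 1/3 versus 2/3 result can also be checked by brute-force enumeration. The following is a minimal sketch in Python, not part of the original notes. It assumes that the contestant initially picks door A, that the car is equally likely to be behind each door, and that the game master opens one of the available goat doors at random when he has a choice between two; the function name `win_probability` is made up for this illustration.

```python
from fractions import Fraction

def win_probability(switch):
    """Exact probability of winning the car, with or without switching.

    Assumptions (not from the notes): the contestant first picks door A,
    the car is uniformly distributed over the three doors, and the game
    master picks uniformly among the goat doors he is allowed to open.
    """
    doors = "ABC"
    pick = "A"
    total = Fraction(0)
    for car in doors:
        # Doors the game master may open: not the pick, not the car.
        goat_doors = [d for d in doors if d != pick and d != car]
        for opened in goat_doors:
            weight = Fraction(1, 3) * Fraction(1, len(goat_doors))
            final = pick
            if switch:
                # Switch to the one door that is neither picked nor opened.
                final = next(d for d in doors if d not in (pick, opened))
            if final == car:
                total += weight
    return total

print("stay:  ", win_probability(False))
print("switch:", win_probability(True))
```

Working with exact fractions instead of a random simulation keeps the check deterministic; a Monte Carlo version would only approximate the same two numbers.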
You may think that one of the two other doors always has a goat behind it, whatever your choice, therefore there is no reason to switch. But the game master not only shows you that there is another door with a goat, he also shows you one of the other doors with a goat behind it, i.e., he restricts your choice if you switch. This is valuable information. It is as if you could bet on both other doors simultaneously, i.e., you get the car if it is behind one of the doors B or C. I.e., it is as if the quiz master had said: I give you the opportunity to switch to the following: you get the car if it is behind B or C. Do you want to switch? The only doubt the contestant may have about this is: had I not picked a door with the car behind it then I would not have been offered this opportunity to switch.

2.8. Ratio of Probabilities as Strength of Evidence

$\Pr_1$ and $\Pr_2$ are two probability measures defined on the same set $\mathcal{F}$ of events. Hypothesis $H_1$ says $\Pr_1$ is the true probability, and $H_2$ says $\Pr_2$ is the true probability. Then the observation of an event A for which $\Pr_1[A] > \Pr_2[A]$ is evidence in favor of $H_1$ as opposed to $H_2$. [Roy97] argues that the ratio of the probabilities (also called “likelihood ratio”) is the right way to measure the strength of this evidence. Among others, the following justification is given [Roy97, p. 7]: If $H_2$ is true, it is usually not impossible to find evidence favoring $H_1$, but it is unlikely; and its probability is bounded by the reciprocal of the ratio of probabilities.

This can be formulated mathematically as follows: Let S be the union of all events A for which $\Pr_1[A] \ge k \Pr_2[A]$. Then it can be shown that $\Pr_2[S] \le 1/k$, i.e., if $H_2$ is true, the probability to find evidence favoring $H_1$ with strength k is never greater than 1/k. Here is a proof in the case that there is only a finite number of possible outcomes $U = \{\omega_1, \ldots, \omega_n\}$: Renumber the outcomes such that for $i = 1, \ldots, m$, $\Pr_1[\{\omega_i\}] < k \Pr_2[\{\omega_i\}]$, and for $j = m+1, \ldots, n$, $\Pr_1[\{\omega_j\}] \ge k \Pr_2[\{\omega_j\}]$. Then $S = \{\omega_{m+1}, \ldots, \omega_n\}$, therefore
\[ \Pr\nolimits_2[S] = \sum_{j=m+1}^{n} \Pr\nolimits_2[\{\omega_j\}] \le \sum_{j=m+1}^{n} \frac{\Pr\nolimits_1[\{\omega_j\}]}{k} = \frac{1}{k}\Pr\nolimits_1[S] \le \frac{1}{k} \]
as claimed. The last inequality holds because $\Pr_1[S] \le 1$, and the equal-sign before this is simply the definition of S.

With more mathematical effort, see [Rob70], one can strengthen this simple inequality in a very satisfactory manner: Assume an unscrupulous researcher attempts to find evidence supporting his favorite but erroneous hypothesis $H_1$ over his rival’s $H_2$ by a factor of at least k. He proceeds as follows: he observes an outcome of the above experiment once, say the outcome is $\omega_{i(1)}$. If $\Pr_1[\{\omega_{i(1)}\}] \ge k \Pr_2[\{\omega_{i(1)}\}]$ he publishes his result; if not, he makes a second independent observation of the experiment, $\omega_{i(2)}$. If $\Pr_1[\{\omega_{i(1)}\}]\Pr_1[\{\omega_{i(2)}\}] \ge k \Pr_2[\{\omega_{i(1)}\}]\Pr_2[\{\omega_{i(2)}\}]$ he publishes his result; if not he makes a third observation and incorporates that in his publication as well, etc. It can be shown that this strategy will not help: if his rival’s hypothesis is true, then the probability that he will ever be able to publish results which seem to show that his own hypothesis is true is still $\le 1/k$. I.e., the sequence of independent observations $\omega_{i(1)}, \omega_{i(2)}, \ldots$ is such that
\[ \Pr\nolimits_2\Bigl[\,\prod_{j=1}^{n} \Pr\nolimits_1[\{\omega_{i(j)}\}] \ge k \prod_{j=1}^{n} \Pr\nolimits_2[\{\omega_{i(j)}\}] \text{ for some } n = 1, 2, \ldots\Bigr] \le \frac{1}{k} \tag{2.8.1} \]
It is not possible to take advantage of the indeterminacy of a random outcome by carrying on until chance places one ahead, and then to quit. If one fully discloses all the evidence one is accumulating, then the probability that this accumulated evidence supports one’s hypothesis cannot rise above 1/k.

Problem 32. It is usually not possible to assign probabilities to the hypotheses $H_1$ and $H_2$, but sometimes it is. Show that in this case, the likelihood ratio of event A is the factor by which the ratio of the probabilities of $H_1$ and $H_2$ is changed by the observation of A, i.e.,
\[ \frac{\Pr[H_1|A]}{\Pr[H_2|A]} = \frac{\Pr[H_1]}{\Pr[H_2]} \cdot \frac{\Pr[A|H_1]}{\Pr[A|H_2]} \tag{2.8.2} \]

Answer. Apply Bayes’s theorem (2.9.1) twice, once for the numerator, once for the denominator.

A world in which probability theory applies is therefore a world in which the transitive dimension must be distinguished from the intransitive dimension. Research results are not determined by the goals of the researcher.

2.9. Bayes Theorem

In its simplest form Bayes’s theorem reads
\[ \Pr[A|B] = \frac{\Pr[B|A]\Pr[A]}{\Pr[B|A]\Pr[A] + \Pr[B|A']\Pr[A']}. \tag{2.9.1} \]

Problem 33. Prove Bayes theorem!

Answer. Obvious since the numerator is $\Pr[B \cap A]$ and the denominator is $\Pr[B \cap A] + \Pr[B \cap A'] = \Pr[B]$.

This theorem has its significance in cases in which A can be interpreted as a cause of B, and B an effect of A. For instance, A is the event that a student who was picked randomly from a class has learned for a certain exam, and B is the event that he passed the exam. Then the righthand side expression contains that information which you would know from the cause-effect relations: the unconditional probability of the event which is the cause, and the conditional probabilities of the effect conditioned on whether or not the cause happened. From this, the formula computes the conditional probability of the cause given that the effect happened. Bayes’s theorem tells us therefore: if we know that the effect happened, how sure can we be that the cause happened? Clearly, Bayes’s theorem has relevance for statistical inference.

Let’s stay with the example with learning for the exam; assume $\Pr[A] = 60\%$, $\Pr[B|A] = .8$, and $\Pr[B|A'] = .5$. Then the probability that a student who passed the exam has learned for it is $\frac{(.8)(.6)}{(.8)(.6) + (.5)(.4)} = \frac{.48}{.68} = .706$. Look at these numbers: The numerator is the average percentage of students who learned and passed, and the denominator is the average percentage of students who passed.

Problem 34. AIDS diagnostic tests are usually over 99.9% accurate on those who do not have AIDS (i.e., only 0.1% false positives) and 100% accurate on those who have AIDS (i.e., no false negatives at all). (A test is called positive if it indicates that the subject has AIDS.)

• a. 3 points Assuming that 0.5% of the population actually have AIDS, compute the probability that a particular individual has AIDS, given that he or she has tested positive.

Answer. A is the event that he or she has AIDS, and T the event that the test is positive.
\[ \Pr[A|T] = \frac{\Pr[T|A]\Pr[A]}{\Pr[T|A]\Pr[A] + \Pr[T|A']\Pr[A']} = \frac{1 \cdot 0.005}{1 \cdot 0.005 + 0.001 \cdot 0.995} \]
\[ = \frac{100 \cdot 0.5}{100 \cdot 0.5 + 0.1 \cdot 99.5} = \frac{1000 \cdot 5}{1000 \cdot 5 + 1 \cdot 995} = \frac{5000}{5995} = \frac{1000}{1199} = 0.834028 \]
Even after testing positive there is still a 16.6% chance that this person does not have AIDS.

• b. 1 point If one is young, healthy and not in one of the risk groups, then the chances of having AIDS are not 0.5% but 0.1% (this is the proportion of the applicants to the military who have AIDS). Re-compute the probability with this alternative number.

Answer.
\[ \frac{1 \cdot 0.001}{1 \cdot 0.001 + 0.001 \cdot 0.999} = \frac{100 \cdot 0.1}{100 \cdot 0.1 + 0.1 \cdot 99.9} = \frac{1000 \cdot 1}{1000 \cdot 1 + 1 \cdot 999} = \frac{1000}{1000 + 999} = \frac{1000}{1999} = 0.50025. \]

2.10. Independence of Events

2.10.1. Definition of Independence. Heuristically, we want to say: event B is independent of event A if $\Pr[B|A] = \Pr[B|A']$. From this follows by Problem 23 that the conditional probability is equal to the unconditional probability $\Pr[B]$, i.e., $\Pr[B] = \Pr[B \cap A]/\Pr[A]$. Therefore we will adopt as definition of independence the so-called multiplication rule:

Definition: B and A are independent, notation $B \perp A$, if $\Pr[B \cap A] = \Pr[B]\Pr[A]$.

This is a symmetric condition, i.e., if B is independent of A, then A is also independent of B.
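The multiplication rule can be checked mechanically on a finite sample space. The following is a minimal sketch in Python, not part of the original notes. It assumes equally likely outcomes; the sample space (two tosses of a fair coin) and the helper names `prob` and `independent` are chosen here purely for illustration.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Probability of an event when all outcomes in `space` are equally likely."""
    return Fraction(len(event & space), len(space))

def independent(a, b, space):
    """Multiplication rule: Pr[A and B] == Pr[A] * Pr[B]."""
    return prob(a & b, space) == prob(a, space) * prob(b, space)

# Illustrative sample space: two tosses of a fair coin.
space = set(product("HT", repeat=2))
A = {w for w in space if w[0] == "H"}   # head in the first toss
B = {w for w in space if w[1] == "H"}   # head in the second toss

print(independent(A, B, space))  # multiplication rule for (A, B)
print(independent(B, A, space))  # same check with the roles exchanged
print(independent(A, A, space))  # an event with 0 < Pr < 1 versus itself
```

Exchanging the two arguments of `independent` never changes its verdict, since both the intersection and the product of the two probabilities are unaffected by the order.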
This symmetry is not immediately obvious from the heuristic definition of independence given above, and it also has the following nontrivial practical implication (this example from [Daw79a, pp. 2/3]): A is the event that one is exposed to some possibly carcinogenic agent, and B the event that one develops a certain kind of cancer. In order to test whether $B \perp A$, i.e., whether the exposure to the agent does not increase the incidence of cancer, one often collects two groups of subjects, one group which has cancer and one control group which does not, and checks whether the exposure in these two groups to the carcinogenic agent is the same. I.e., the experiment checks whether $A \perp B$, although the purpose of the experiment was to determine whether $B \perp A$.

Problem 35. 3 points Given that $\Pr[B \cap A] = \Pr[B] \cdot \Pr[A]$ (i.e., B is independent of A), show that $\Pr[B \cap A'] = \Pr[B] \cdot \Pr[A']$ (i.e., B is also independent of $A'$).

Answer. If one uses our heuristic definition of independence, i.e., B is independent of event A if $\Pr[B|A] = \Pr[B|A']$, then it is immediately obvious since the definition is symmetric in A and $A'$. However if we use the multiplication rule as the definition of independence, as the text of this Problem suggests, we have to do a little more work: Since B is the disjoint union of $(B \cap A)$ and $(B \cap A')$, it follows $\Pr[B] = \Pr[B \cap A] + \Pr[B \cap A']$ or $\Pr[B \cap A'] = \Pr[B] - \Pr[B \cap A] = \Pr[B] - \Pr[B]\Pr[A] = \Pr[B](1 - \Pr[A]) = \Pr[B]\Pr[A']$.

Problem 36. 2 points A and B are two independent events with $\Pr[A] = \frac{1}{3}$ and $\Pr[B] = \frac{1}{4}$. Compute $\Pr[A \cup B]$.

Answer. $\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B] = \Pr[A] + \Pr[B] - \Pr[A]\Pr[B] = \frac{1}{3} + \frac{1}{4} - \frac{1}{12} = \frac{1}{2}$.

Problem 37. 3 points You have an urn with five white and five red balls. You take two balls out without replacement. A is the event that the first ball is white, and B that the second ball is white.
a. What is the probability that the first ball is white?
b. What is the probability that the second ball is white?
c. What is the probability that both have the same color?
d. Are these two events independent, i.e., is $\Pr[B|A] = \Pr[B]$?
e. Are these two events disjoint, i.e., is $A \cap B = \emptyset$?

Answer. Clearly, $\Pr[A] = 1/2$. $\Pr[B] = \Pr[B|A]\Pr[A] + \Pr[B|A']\Pr[A'] = (4/9)(1/2) + (5/9)(1/2) = 1/2$. Both balls have the same color with probability $2 \cdot \frac{5}{10} \cdot \frac{4}{9} = \frac{4}{9}$. The events are not independent: $\Pr[B|A] = 4/9 \ne \Pr[B]$, or $\Pr[A \cap B] = \frac{5}{10} \cdot \frac{4}{9} = 2/9 \ne 1/4$. They would be independent if the first ball had been replaced. The events are also not disjoint: it is possible that both balls are white.

2.10.2. Independence of More than Two Events. If there are more than two events, we must require that all possible intersections of these events, not only the pairwise intersections, follow the above multiplication rule. For instance,
\[ A, B, C \text{ mutually independent} \iff \begin{cases} \Pr[A \cap B] = \Pr[A]\Pr[B]; \\ \Pr[A \cap C] = \Pr[A]\Pr[C]; \\ \Pr[B \cap C] = \Pr[B]\Pr[C]; \\ \Pr[A \cap B \cap C] = \Pr[A]\Pr[B]\Pr[C]. \end{cases} \tag{2.10.1} \]
This last condition is not implied by the other three. Here is an example. Draw a ball at random from an urn containing four balls numbered 1, 2, 3, 4. Define $A = \{1, 4\}$, $B = \{2, 4\}$, and $C = \{3, 4\}$. These events are pairwise independent but not mutually independent.

Problem 38. 2 points Flip a coin two times independently and define the following three events:
\[ \begin{aligned} A &= \text{Head in first flip} \\ B &= \text{Head in second flip} \\ C &= \text{Same face in both flips.} \end{aligned} \tag{2.10.2} \]
Are these three events pairwise independent? Are they mutually independent?

Answer. $U = \{HH, HT, TH, TT\}$. $A = \{HH, HT\}$, $B = \{HH, TH\}$, $C = \{HH, TT\}$. $\Pr[A] = \frac{1}{2}$, $\Pr[B] = \frac{1}{2}$, $\Pr[C] = \frac{1}{2}$. They are pairwise independent, but $\Pr[A \cap B \cap C] = \Pr[\{HH\}] = \frac{1}{4} \ne \Pr[A]\Pr[B]\Pr[C]$, therefore the events cannot be mutually independent.

Problem 39. 3 points A, B, and C are pairwise independent events whose probabilities are greater than zero and smaller than one, and $A \cap B \subset C$. Can those events be mutually independent?

Answer.
No; from $A \cap B \subset C$ follows $A \cap B \cap C = A \cap B$ and therefore $\Pr[A \cap B \cap C] \ne \Pr[A \cap B]\Pr[C]$ since $\Pr[C] < 1$ and $\Pr[A \cap B] > 0$.

If one takes unions, intersections, complements of different mutually independent events, one will still end up with mutually independent events. E.g., if A, B, C are mutually independent, then $A'$, B, C are mutually independent as well, and $A \cap B$ is independent of C, and $A \cup B$ is independent of C, etc. This is not the case if the events are only pairwise independent. In Problem 39, $A \cap B$ is not independent of C.

Figure 1. Generic Venn Diagram for 3 Events (regions labeled R, S, T, U, V, W, X)

2.10.3. Conditional Independence. If A and B are independent in the probability measure conditionally on C, i.e., if $\Pr[A \cap B|C] = \Pr[A|C]\Pr[B|C]$, then they are called conditionally independent given that C occurred, notation $A \perp B \mid C$. In formulas,
\[ \frac{\Pr[A \cap C]}{\Pr[C]} \cdot \frac{\Pr[B \cap C]}{\Pr[C]} = \frac{\Pr[A \cap B \cap C]}{\Pr[C]}. \tag{2.10.3} \]

Problem 40. 5 points Show that $A \perp B \mid C$ is equivalent to $\Pr[A|B \cap C] = \Pr[A|C]$. In other words: independence of A and B conditionally on C means: once we know that C occurred, the additional knowledge whether B occurred or not will not help us to sharpen our knowledge about A.

Literature about conditional independence (of random variables, not of events) includes [Daw79a], [Daw79b], [Daw80].

2.10.4. Independent Repetition of an Experiment. If a given experiment has sample space U, and we perform the experiment n times in a row, then this repetition can be considered a single experiment with the sample space consisting of n-tuples of elements of U. This set is called the product set $U^n = U \times U \times \cdots \times U$ (n terms).

If a probability measure Pr is given on $\mathcal{F}$, then one can define in a unique way a probability measure on the subsets of the product set so that events in different repetitions are always independent of each other.

The Bernoulli experiment is the simplest example of such an independent repetition.
$U = \{s, f\}$ (stands for success and failure). Assume $\Pr[\{s\}] = p$, and that the experimenter has several independent trials. For instance, $U^5$ has, among others, the following possible outcomes (here n = 5):
\[ \begin{array}{ll} \text{If } \omega = (f, f, f, f, f) & \text{then } \Pr[\{\omega\}] = (1-p)^n \\ \phantom{\text{If } \omega = {}} (f, f, f, f, s) & \phantom{\text{then } \Pr[\{\omega\}] = {}} (1-p)^{n-1} p \\ \phantom{\text{If } \omega = {}} (f, f, f, s, f) & \phantom{\text{then } \Pr[\{\omega\}] = {}} (1-p)^{n-1} p \\ \phantom{\text{If } \omega = {}} (f, f, f, s, s) & \phantom{\text{then } \Pr[\{\omega\}] = {}} (1-p)^{n-2} p^2 \\ \phantom{\text{If } \omega = {}} (f, f, s, f, f) & \phantom{\text{then } \Pr[\{\omega\}] = {}} (1-p)^{n-1} p, \text{ etc.} \end{array} \tag{2.10.4} \]
One sees, this is very cumbersome, and usually unnecessarily so. If we toss a coin 5 times, the only thing we usually want to know is how many successes there were. As long as the experiments are independent, the question how the successes were distributed over the n different trials is far less important. This brings us to the definition of a random variable, and to the concept of a sufficient statistic.

2.11. How to Plot Frequency Vectors and Probability Vectors

If there are only 3 possible outcomes, i.e., $U = \{\omega_1, \omega_2, \omega_3\}$, then the set of all probability measures is the set of nonnegative 3-vectors whose components sum up to 1. Graphically, such vectors can be represented as points inside an equilateral triangle with height 1: the three components of the vector are the distances of the point to each of the sides of the triangle. The R/Splus-function triplot in the ecmet package, written by Jim Ramsay ramsay@ramsay2.psych.mcgill.ca, does this, with optional rescaling if the rows of the data matrix do not have unit sums.

Problem 41. In an equilateral triangle, call a = the distance of the sides from the center point, b = half the side length, and c = the distance of the corners from the center point (as in Figure 2). Show that $b = a\sqrt{3}$ and $c = 2a$.

Figure 2. Geometry of an equilateral triangle

Answer. From $(a + c)^2 + b^2 = 4b^2$, i.e., $(a + c)^2 = 3b^2$, follows $a + c = b\sqrt{3}$. But we also have $a^2 + b^2 = c^2$. Therefore $a^2 + 2ac + c^2 = 3b^2 = 3c^2 - 3a^2$, or $4a^2 + 2ac - 2c^2 = 0$ or $2a^2 + ac - c^2 = (2a - c)(a + c) = 0$.
The positive solution is therefore $c = 2a$. This gives $a + c = 3a = b\sqrt{3}$, or $b = a\sqrt{3}$.

And the function quadplot, also written by Jim Ramsay, does quadrilinear plots, meaning that proportions for four categories are plotted within a regular tetrahedron. Quadplot displays the probability tetrahedron and its points using XGobi. Each vertex of the triangle or tetrahedron corresponds to the degenerate probability distribution in which one of the events has probability 1 and the others have probability 0. The labels of these vertices indicate which event has probability 1.

The script kai is an example visualizing data from [Mor65]; it can be run using the command ecmet.script(kai).

Example: Statistical linguistics. In the study of ancient literature, the authorship of texts is a perplexing problem. When books were written and reproduced by hand, the rights of authorship were limited and what would now be considered forgery was common. The names of reputable authors were borrowed in order to sell books or get attention for books, the writings of disciples and collaborators were published under the name of the master, and anonymous old manuscripts were optimistically attributed to famous authors. In the absence of conclusive evidence of authorship, the attribution of ancient texts must be based on the texts themselves, for instance, by statistical analysis of literary style. Here it is necessary to find stylistic criteria which vary from author to author, but are independent of the subject matter of the text. An early suggestion was to use the probability distribution of word length, but this was never acted upon, because it is too dependent on the subject matter. Sentence-length distributions, on the other hand, have proved highly reliable. [Mor65, p. 184] says that sentence-length is “periodic rather than random,” therefore the sample should have at least about 100 sentences.
“Sentence-length distributions are not suited to dialogue, they cannot be used on commentaries written on one author by another, nor are they reliable on such texts as the fragmentary books of the historian Diodorus Siculus.”

Problem 42. According to [Mor65, p. 184], sentence-length is “periodic rather than random.” What does this mean?

Answer. In a text, passages with long sentences alternate with passages with shorter sentences. This is why one needs at least 100 sentences to get a representative distribution of sentences, and this is why fragments and drafts and commentaries on others’ writings do not exhibit an average sentence length distribution: they do not have the melody of the finished text.

Besides the length of sentences, also the number of common words which express a general relation (“and”, “in”, “but”, “I”, “to be”) is random with the same distribution at least among the same genre. By contrast, the occurrence of the definite article “the” cannot be modeled by simple probabilistic laws because the number of nouns with definite article depends on the subject matter.

Table 1 has data about the epistles of St. Paul. Abbreviations: Rom Romans; Co1 1st Corinthians; Co2 2nd Corinthians; Gal Galatians; Phi Philippians; Col Colossians; Th1 1st Thessalonians; Ti1 1st Timothy; Ti2 2nd Timothy; Heb Hebrews. 2nd Thessalonians, Titus, and Philemon were excluded because they were too short to give reliable samples. From an analysis of these and other data, [Mor65, p. 224] concludes that the first 4 epistles (Romans, 1st Corinthians, 2nd Corinthians, and Galatians) form a consistent group, and all the other epistles lie more than 2 standard deviations from the mean of this group (using $\chi^2$ statistics). If Paul is defined as being the author of Galatians, then he also wrote Romans and 1st and 2nd Corinthians. The remaining epistles come from at least six hands.

Table 1.
Number of Sentences in Paul’s Epistles with 0, 1, 2, and ≥ 3 occurrences of kai

            Rom  Co1  Co2  Gal  Phi  Col  Th1  Ti1  Ti2  Heb
no kai      386  424  192  128   42   23   34   49   45  155
one         141  152   86   48   29   32   23   38   28   94
two          34   35   28    5   19   17    8    9   11   37
3 or more    17   16   13    6   12    9   16   10    4   24

Problem 43. Enter the data from Table 1 into xgobi and brush the four epistles which are, according to Morton, written by Paul himself. 3 of those points are almost on top of each other, and one is a little apart. Which one is this?

Answer. In R, issue the commands library(xgobi) then data(PaulKAI) then quadplot(PaulKAI, normalize = TRUE). If you have xgobi but not R, this dataset is one of the default datasets coming with xgobi.

CHAPTER 3

Random Variables

3.1. Notation

Throughout these class notes, lower case bold letters will be used for vectors and upper case bold letters for matrices, and letters that are not bold for scalars. The (i, j) element of the matrix $\boldsymbol{A}$ is $a_{ij}$, and the ith element of a vector $\boldsymbol{b}$ is $b_i$; the arithmetic mean of all elements is $\bar{b}$. All vectors are column vectors; if a row vector is needed, it will be written in the form $\boldsymbol{b}^\top$. Furthermore, the on-line version of these notes uses green symbols for random variables, and the corresponding black symbols for the values taken by these variables. If a black-and-white printout of the on-line version is made, then the symbols used for random variables and those used for specific values taken by these random variables can only be distinguished by their grey scale or cannot be distinguished at all; therefore a special monochrome version is available which should be used for the black-and-white printouts. It uses an upright math font, called “Euler,” for the random variables, and the same letter in the usual slanted italic font for the values of these random variables.
Example: If $\mathbf{y}$ is a random vector, then $\boldsymbol{y}$ denotes a particular value, for instance an observation, of the whole vector; $\mathrm{y}_i$ denotes the ith element of $\mathbf{y}$ (a random scalar), and $y_i$ is a particular value taken by that element (a nonrandom scalar).

With real-valued random variables, the powerful tools of calculus become available to us. Therefore we will begin the chapter about random variables with a digression about infinitesimals.

3.2. Digression about Infinitesimals

In the following pages we will recapitulate some basic facts from calculus. But it will differ in two respects from the usual calculus classes: (1) everything will be given its probability-theoretic interpretation, and (2) we will make explicit use of infinitesimals. This last point bears some explanation.

You may say infinitesimals do not exist. Do you know the story with Achilles and the turtle? They are racing, the turtle starts 1 km ahead of Achilles, and Achilles runs ten times as fast as the turtle. So when Achilles arrives at the place the turtle started, the turtle has run 100 meters; and when Achilles has run those 100 meters, the turtle has run 10 meters, and when Achilles has run the 10 meters, then the turtle has run 1 meter, etc. The Greeks were actually arguing whether Achilles would ever reach the turtle.

This may sound like a joke, but in some respects, modern mathematics never went beyond the level of the Greek philosophers. If a modern mathematician sees something like
\[ \lim_{i \to \infty} \frac{1}{i} = 0, \qquad \text{or} \qquad \lim_{n \to \infty} \sum_{i=0}^{n} \frac{1}{10^i} = \frac{10}{9}, \tag{3.2.1} \]
then he will probably say that the lefthand term in each equation never really reaches the number written on the right, all he will say is that the term on the left comes arbitrarily close to it.

This is like saying: I know that Achilles will get as close as 1 cm or 1 mm to the turtle, he will get closer than any distance, however small, to the turtle, instead of simply saying that Achilles reaches the turtle.
Modern mathematical proofs are full of races between Achilles and the turtle of the kind: give me an ε, and I will prove to you that the thing will come at least as close as ε to its goal (so-called epsilontism), but never speaking about the moment when the thing will reach its goal.

Of course, it “works,” but it makes things terribly cumbersome, and it may have prevented people from seeing connections.

Abraham Robinson in [Rob74] is one of the mathematicians who tried to remedy it. He did it by adding more numbers, infinite numbers and infinitesimal numbers. Robinson showed that one can use infinitesimals without getting into contradictions, and he demonstrated that mathematics becomes much more intuitive this way, not only its elementary proofs, but especially the deeper results. One of the elementary books based on his calculus is [HK79].

The well-known logician Kurt Gödel said about Robinson’s work: “I think, in coming years it will be considered a great oddity in the history of mathematics that the first exact theory of infinitesimals was developed 300 years after the invention of the differential calculus.”

Gödel called Robinson’s theory the first theory. I would like to add here the following speculation: perhaps Robinson shares the following error with the “standard” mathematicians whom he criticizes: they consider numbers only in a static way, without allowing them to move. It would be beneficial to expand on the intuition of the inventors of differential calculus, who talked about “fluxions,” i.e., quantities in flux, in motion. Modern mathematicians even use arrows in their symbol for limits, but they are not calculating with moving quantities, only with static quantities.

This perspective makes the category-theoretical approach to infinitesimals taken in [MR91] especially promising. Category theory considers objects on the same footing with their transformations (and uses lots of arrows).

Maybe a few years from now mathematics will be done right.
We should not let this temporary backwardness of mathematics hold us back in our intuition. The equation $\frac{\Delta y}{\Delta x} = 2x$ does not hold exactly on a parabola for any pair of given (static) $\Delta x$ and $\Delta y$; but if you take a pair $(\Delta x, \Delta y)$ which is moving towards zero then this equation holds in the moment when they reach zero, i.e., when they vanish. Writing $dy$ and $dx$ means therefore: we are looking at magnitudes which are in the process of vanishing. If one applies a function to a moving quantity one again gets a moving quantity, and the derivative of this function compares the speed with which the transformed quantity moves with the speed of the original quantity. Likewise, the equation $\sum_{i=1}^{n} \frac{1}{2^i} = 1$ holds in the moment when n reaches infinity. From this point of view, the axiom of σ-additivity in probability theory (in its equivalent form of rising or declining sequences of events) indicates that the probability of a vanishing event vanishes.

Whenever we talk about infinitesimals, therefore, we really mean magnitudes which are moving, and which are in the process of vanishing. $dV_{x,y}$ is therefore not, as one might think from what will be said below, a static but small volume element located close to the point $(x, y)$, but it is a volume element which is vanishing into the point $(x, y)$. The probability density function therefore signifies the speed with which the probability of a vanishing element vanishes.

3.3. Definition of a Random Variable

The best intuition of a random variable would be to view it as a numerical variable whose values are not determinate but follow a statistical pattern, and call it $\mathrm{x}$, while possible values of $\mathrm{x}$ are called $x$.

In order to make this a mathematically sound definition, one says: A mapping $x : U \to \mathbb{R}$ of the set U of all possible outcomes into the real numbers $\mathbb{R}$ is called a random variable.
(Again, mathematicians are able to construct pathological mappings that cannot be used as random variables, but we let that be their problem, not ours.) The random variable $\mathrm{x}$ (the green x of the on-line version) is then defined as $\mathrm{x} = x(\omega)$. I.e., all the randomness is shunted off into the process of selecting an element of U. Instead of being an indeterminate function, it is defined as a determinate function of the random ω. It is written here with the nonrandom function symbol $x$, as $x(\omega)$, and not with the random-variable symbol $\mathrm{x}$, because the function itself is determinate; only its argument is random.

Whenever one has a mapping $x : U \to \mathbb{R}$ between sets, one can construct from it in a natural way an “inverse image” mapping between subsets of these sets. Let $\mathcal{F}$, as usual, denote the set of subsets of U, and let $\mathcal{B}$ denote the set of subsets of $\mathbb{R}$. We will define a mapping $x^{-1} : \mathcal{B} \to \mathcal{F}$ in the following way: For any $B \subset \mathbb{R}$, we define $x^{-1}(B) = \{\omega \in U : x(\omega) \in B\}$. (This is not the usual inverse of a mapping, which does not always exist. The inverse-image mapping always exists, but the inverse image of a one-element set is no longer necessarily a one-element set; it may have more than one element or may be the empty set.)

This “inverse image” mapping is well behaved with respect to unions and intersections, etc. In other words, we have identities $x^{-1}(A \cap B) = x^{-1}(A) \cap x^{-1}(B)$ and $x^{-1}(A \cup B) = x^{-1}(A) \cup x^{-1}(B)$, etc.

Problem 44. Prove the above two identities.

Answer. These are very subtle proofs. $x^{-1}(A \cap B) = \{\omega \in U : x(\omega) \in A \cap B\} = \{\omega \in U : x(\omega) \in A \text{ and } x(\omega) \in B\} = \{\omega \in U : x(\omega) \in A\} \cap \{\omega \in U : x(\omega) \in B\} = x^{-1}(A) \cap x^{-1}(B)$. The other identity has a similar proof.

Problem 45. Show, on the other hand, by a counterexample, that the “direct image” mapping defined by $x(E) = \{r \in \mathbb{R} : \text{there exists } \omega \in E \text{ with } x(\omega) = r\}$ no longer satisfies $x(E \cap F) = x(E) \cap x(F)$.

By taking inverse images under a random variable x, the probability measure on $\mathcal{F}$ is transplanted into a probability measure on the subsets of $\mathbb{R}$ by the simple prescription $\Pr[B] = \Pr\bigl[x^{-1}(B)\bigr]$.
Here, B is a subset of R and $x^{-1}(B)$ one of U, the Pr on the right side is the given probability measure on U, while the Pr on the left is the new probability measure on R induced by x. This induced probability measure is called the probability law or probability distribution of the random variable. Every random variable induces therefore a probability measure on R, and this probability measure, not the mapping itself, is the most important ingredient of a random variable. That is why Amemiya’s first definition of a random variable (definition 3.1.1 on p. 18) is: “A random variable is a variable that takes values according to a certain distribution.” In other words, it is the outcome of an experiment whose set of possible outcomes is R.

3.4. Characterization of Random Variables

We will begin our systematic investigation of random variables with an overview of all possible probability measures on R. The simplest way to get such an overview is to look at the cumulative distribution functions. Every probability measure on R has a cumulative distribution function, but we will follow the common usage of assigning the cumulative distribution not to a probability measure but to the random variable which induces this probability measure on R.

Given a random variable $x : U \ni \omega \mapsto x(\omega) \in R$, the cumulative distribution function of x is the function $F_x : R \to R$ defined by:

(3.4.1)  $F_x(a) = \Pr[\{\omega \in U : x(\omega) \le a\}] = \Pr[x \le a]$.

This function uniquely defines the probability measure which x induces on R.

Properties of cumulative distribution functions: a function F : R → R is a cumulative distribution function if and only if

(3.4.2)  $a \le b \Rightarrow F(a) \le F(b)$
(3.4.3)  $\lim_{a \to -\infty} F(a) = 0$
(3.4.4)  $\lim_{a \to \infty} F(a) = 1$
(3.4.5)  $\lim_{\varepsilon \to 0,\, \varepsilon > 0} F(a + \varepsilon) = F(a)$

Equation (3.4.5) is the definition of continuity from the right (because the limit holds only for ε ≥ 0). Why is a cumulative distribution function continuous from the right?
For every nonnegative sequence $\varepsilon_1, \varepsilon_2, \ldots \ge 0$ converging to zero which also satisfies $\varepsilon_1 \ge \varepsilon_2 \ge \ldots$ follows $\{x \le a\} = \bigcap_i \{x \le a + \varepsilon_i\}$; for these sequences, therefore, the statement follows from what Problem 14 above said about the probability of the intersection of a declining set sequence. And a converging sequence of nonnegative $\varepsilon_i$ which is not declining has a declining subsequence.

A cumulative distribution function need not be continuous from the left. If $\lim_{\varepsilon \to 0,\, \varepsilon > 0} F(x - \varepsilon) \ne F(x)$, then x is a jump point, and the height of the jump is the probability that x = x.

It is a matter of convention whether we are working with right continuous or left continuous functions here. If the distribution function were defined as Pr[x < a] (some authors do this, compare [Ame94, p. 43]), then it would be continuous from the left but not from the right.

Problem 46. 6 points Assume $F_x(x)$ is the cumulative distribution function of the random variable x (whose distribution is not necessarily continuous). Which of the following formulas are correct? Give proofs or verbal justifications.

(3.4.6)  $\Pr[x = x] = \lim_{\varepsilon > 0;\, \varepsilon \to 0} F_x(x + \varepsilon) - F_x(x)$
(3.4.7)  $\Pr[x = x] = F_x(x) - \lim_{\delta > 0;\, \delta \to 0} F_x(x - \delta)$
(3.4.8)  $\Pr[x = x] = \lim_{\varepsilon > 0;\, \varepsilon \to 0} F_x(x + \varepsilon) - \lim_{\delta > 0;\, \delta \to 0} F_x(x - \delta)$

Answer. (3.4.6) does not hold generally, since its rhs is always = 0; the other two equations always hold.

Problem 47. 4 points Assume the distribution of z is symmetric about zero, i.e., Pr[z < −z] = Pr[z > z] for all z. Call its cumulative distribution function $F_z(z)$. Show that the cumulative distribution function of the random variable $q = z^2$ is $F_q(q) = 2F_z(\sqrt{q}) - 1$ for q ≥ 0, and 0 for q < 0.

Answer. If q ≥ 0 then

(3.4.9)   $F_q(q) = \Pr[z^2 \le q] = \Pr[-\sqrt{q} \le z \le \sqrt{q}]$
(3.4.10)  $= \Pr[z \le \sqrt{q}] - \Pr[z < -\sqrt{q}]$
(3.4.11)  $= \Pr[z \le \sqrt{q}] - \Pr[z > \sqrt{q}]$
(3.4.12)  $= F_z(\sqrt{q}) - (1 - F_z(\sqrt{q}))$
(3.4.13)  $= 2F_z(\sqrt{q}) - 1$.
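The result of Problem 47 can be checked numerically. The following is a minimal sketch, assuming z is standard normal (one particular distribution that is symmetric about zero); the function names and the sample size are illustrative choices, not part of the notes.

```python
import math
import random

def normal_cdf(z):
    """CDF of the standard normal distribution (assumed example of a symmetric z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_of_square(q, cdf_z=normal_cdf):
    """F_q(q) = 2 F_z(sqrt(q)) - 1 for q >= 0, and 0 for q < 0."""
    if q < 0:
        return 0.0
    return 2.0 * cdf_z(math.sqrt(q)) - 1.0

def empirical_cdf_of_square(q, n=200_000, seed=0):
    """Fraction of n simulated values of z**2 that are <= q."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) ** 2 <= q)
    return hits / n

if __name__ == "__main__":
    for q in (0.25, 1.0, 4.0):
        print(q, cdf_of_square(q), empirical_cdf_of_square(q))
```

The formula and the simulated frequency should agree up to sampling error, which shrinks as n grows.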
Instead of the cumulative distribution function $F_y$ one can also use the quantile function $F_y^{-1}$ to characterize a probability measure. As the notation suggests, the quantile function can be considered some kind of “inverse” of the cumulative distribution function. The quantile function is the function (0, 1) → R defined by

(3.4.14)  $F_y^{-1}(p) = \inf\{u : F_y(u) \ge p\}$

or, plugging the definition of $F_y$ into (3.4.14),

(3.4.15)  $F_y^{-1}(p) = \inf\{u : \Pr[y \le u] \ge p\}$.

The quantile function is only defined on the open unit interval, not on the endpoints 0 and 1, because it would often assume the values −∞ and +∞ on these endpoints, and the information given by these values is redundant. The quantile function is continuous from the left, i.e., from the other side than the cumulative distribution function. If F is continuous and strictly increasing, then the quantile function is the inverse of the distribution function in the usual sense, i.e., $F^{-1}(F(t)) = t$ for all t ∈ R, and $F(F^{-1}(p)) = p$ for all p ∈ (0, 1). But even if F is flat on certain intervals, and/or F has jump points, i.e., F does not have an inverse function, the following important identity holds for every y ∈ R and p ∈ (0, 1):

(3.4.16)  $p \le F_y(y)$ iff $F_y^{-1}(p) \le y$

Problem 48. 3 points Prove equation (3.4.16).

Answer. ⇒ is trivial: if F(y) ≥ p then of course y ≥ inf{u : F(u) ≥ p}. ⇐: y ≥ inf{u : F(u) ≥ p} means that every z > y satisfies F(z) ≥ p; therefore, since F is continuous from the right, also F(y) ≥ p. This proof is from [Rei89, p. 318].

Problem 49. You throw a pair of dice and your random variable x is the sum of the points shown.

• a. Draw the cumulative distribution function of x.

Answer. This is Figure 1: the cdf is 0 on (−∞, 2), 1/36 on [2, 3), 3/36 on [3, 4), 6/36 on [4, 5), 10/36 on [5, 6), 15/36 on [6, 7), 21/36 on [7, 8), 26/36 on [8, 9), 30/36 on [9, 10), 33/36 on [10, 11), 35/36 on [11, 12), and 1 on [12, +∞).

• b. Draw the quantile function of x.

Answer. This is Figure 2: the quantile function is 2 on (0, 1/36], 3 on (1/36, 3/36], 4 on (3/36, 6/36], 5 on (6/36, 10/36], 6 on (10/36, 15/36], 7 on (15/36, 21/36], 8 on (21/36, 26/36], 9 on (26/36, 30/36], 10 on (30/36, 33/36], 11 on (33/36, 35/36], and 12 on (35/36, 1).

Problem 50. 1 point Give the formula of the cumulative distribution function of a random variable which is uniformly distributed between 0 and b.

Answer. 0 for x ≤ 0, x/b for 0 ≤ x ≤ b, and 1 for x ≥ b.

Figure 1. Cumulative Distribution Function of Discrete Variable

Empirical Cumulative Distribution Function: Besides the cumulative distribution function of a random variable or of a probability measure, one can also define the empirical cumulative distribution function of a sample. Empirical cumulative distribution functions are zero for all values below the lowest observation, then 1/n for everything below the second lowest, etc. They are step functions. If two observations assume the same value, then the step at that value is twice as high, etc. The empirical cumulative distribution function can be considered an estimate of the cumulative distribution function of the probability distribution underlying the sample. [Rei89, p. 12] writes it as a sum of indicator functions:

(3.4.17)  $F = \frac{1}{n} \sum_i 1_{[x_i, +\infty)}$

3.5. Discrete and Absolutely Continuous Probability Measures

One can define two main classes of probability measures on R: One kind is concentrated in countably many points. Its probability distribution can be defined in terms of the probability mass function.

Problem 51. Show that a distribution function can only have countably many jump points.

Answer. Proof: There are at most two with jump height ≥ 1/2, at most four with jump height ≥ 1/4, etc.

Figure 2. Quantile Function of Discrete Variable

Among the other probability measures we are only interested in those which can be represented by a density function (absolutely continuous). A density function is a nonnegative integrable function which, integrated over the whole line, gives 1. Given such a density function, called $f_x(x)$, the probability $\Pr[x \in (a, b)] = \int_a^b f_x(x)\,dx$. The density function is therefore an alternate way to characterize a probability measure. But not all probability measures have density functions.

Those who are not familiar with integrals should read up on them at this point. Start with derivatives, then: the indefinite integral of a function is a function whose derivative is the given function. Then it is an important theorem that the area under the curve is the difference of the values of the indefinite integral at the end points. This is called the definite integral. (The area is considered negative when the curve is below the x-axis.)

The intuition of a density function comes out more clearly in terms of infinitesimals. If $f_x(x)$ is the value of the density function at the point x, then the probability that the outcome of x lies in an interval of infinitesimal length located near the point x is the length of this interval, multiplied by $f_x(x)$. In formulas, for an infinitesimal dx follows

(3.5.1)  $\Pr[x \in [x, x + dx]] = f_x(x)\,|dx|$.

The name “density function” is therefore appropriate: it indicates how densely the probability is spread out over the line. It is, so to say, the quotient between the probability measure induced by the variable, and the length measure on the real numbers.

If the cumulative distribution function has everywhere a derivative, this derivative is the density function.

3.6. Transformation of a Scalar Density Function

Assume x is a random variable with values in the region A ⊂ R, i.e., $\Pr[x \notin A] = 0$, and t is a one-to-one mapping A → R.
One-to-one (as opposed to many-to-one) means: if a, b ∈ A and t(a) = t(b), then already a = b. We also assume that t has a continuous nonnegative first derivative $t' \ge 0$ everywhere in A. Define the random variable y by y = t(x). We know the density function of y, and we want to get that of x. (I.e., t expresses the old variable, that whose density function we know, in terms of the new variable, whose density function we want to know.)

Since t is one-to-one, it follows for all a, b ∈ A that a = b ⇐⇒ t(a) = t(b). And recall the definition of a derivative in terms of infinitesimals dx: $t'(x) = \frac{t(x + dx) - t(x)}{dx}$. In order to compute $f_x(x)$ we will use the following identities valid for all x ∈ A:

(3.6.1)  $f_x(x)\,|dx| = \Pr[x \in [x, x + dx]] = \Pr[t(x) \in [t(x), t(x + dx)]]$
(3.6.2)  $= \Pr[t(x) \in [t(x), t(x) + t'(x)\,dx]] = f_y(t(x))\,|t'(x)\,dx|$

Absolute values are multiplicative, i.e., $|t'(x)\,dx| = |t'(x)|\,|dx|$; divide by |dx| to get

(3.6.3)  $f_x(x) = f_y(t(x))\,|t'(x)|$.

This is the transformation formula for obtaining the density of x from that of y. This formula is valid for all x ∈ A; the density of x is 0 for all $x \notin A$.

Heuristically one can get this transformation as follows: write $|t'(x)| = \frac{|dy|}{|dx|}$, then one gets it from $f_x(x)\,|dx| = f_y(t(x))\,|dy|$ by just dividing both sides by |dx|.

In other words, this transformation rule consists of 4 steps: (1) Determine A, the range of the new variable; (2) obtain the transformation t which expresses the old variable in terms of the new variable, and check that it is one-to-one on A; (3) plug expression (2) into the old density; (4) multiply this plugged-in density by the absolute value of the derivative of expression (2). This gives the density inside A; it is 0 outside A.

An alternative proof is conceptually simpler but cannot be generalized to the multivariate case: First assume t is monotonically increasing. Then $F_x(x) = \Pr[x \le x] = \Pr[t(x) \le t(x)] = F_y(t(x))$. Now differentiate and use the chain rule.
Then also do the monotonically decreasing case. This is how [Ame94, theorem 3.6.1 on p. 48] does it. [Ame94, pp. 52/3] has an extension of this formula to many-to-one functions.

Problem 52. 4 points [Lar82, example 3.5.4 on p. 148] Suppose y has density function

(3.6.4)  $f_y(y) = 1$ for 0 < y < 1, and $f_y(y) = 0$ otherwise.

Obtain the density $f_x(x)$ of the random variable x = − log y.

Answer. (1) Since y takes values only between 0 and 1, its logarithm takes values between −∞ and 0, the negative logarithm therefore takes values between 0 and +∞, i.e., A = {x : 0 < x}. (2) Express y in terms of x: $y = e^{-x}$. This is one-to-one on the whole line, therefore also on A. (3) Plugging $y = e^{-x}$ into the density function gives the number 1, since the density function does not depend on the precise value of y, as long as we know that 0 < y < 1 (which we do). (4) The derivative of $y = e^{-x}$ is $-e^{-x}$. As a last step one has to multiply the number 1 by the absolute value of the derivative to get the density inside A. Therefore $f_x(x) = e^{-x}$ for x > 0 and 0 otherwise.

Problem 53. 6 points [Dhr86, p. 1574] Assume the random variable z has the exponential distribution with parameter λ, i.e., its density function is $f_z(z) = \lambda \exp(-\lambda z)$ for z > 0 and 0 for z ≤ 0. Define u = − log z. Show that the density function of u is $f_u(u) = \exp(\mu - u - \exp(\mu - u))$ where µ = log λ. This density will be used in Problem 140.

Answer. (1) Since z only has values in (0, ∞), its log is well defined, and A = R. (2) Express old variable in terms of new: −u = log z, therefore $z = e^{-u}$; this is one-to-one everywhere. (3) Plugging in (since $e^{-u} > 0$ for all u, we must plug it into λ exp(−λz)) gives . . . . (4) The derivative of $z = e^{-u}$ is $-e^{-u}$, taking absolute values gives the Jacobian factor $e^{-u}$. Plugging in and multiplying gives the density of u: $f_u(u) = \lambda \exp(-\lambda e^{-u})\,e^{-u} = \lambda e^{-u - \lambda e^{-u}}$, and using λ exp(−u) = exp(µ − u) this simplifies to the formula above.
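The four-step rule applied in Problems 52 and 53 can also be sketched in code. This is only an illustration of formula (3.6.3): the helper names `transformed_density` and `integrate`, the midpoint rule, and the value λ = 2 are assumptions made for the example, not part of the notes.

```python
import math

def transformed_density(f_old, t, t_prime):
    """Return f_new with f_new(x) = f_old(t(x)) * |t'(x)|, as in (3.6.3)."""
    def f_new(x):
        return f_old(t(x)) * abs(t_prime(x))
    return f_new

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Problem 52: y is uniform on (0, 1) and x = -log y, so y = t(x) = exp(-x).
def f_uniform(y):
    return 1.0 if 0.0 < y < 1.0 else 0.0

f_x = transformed_density(f_uniform, lambda x: math.exp(-x), lambda x: -math.exp(-x))

# Problem 53: z is exponential with parameter lam and u = -log z, so z = t(u) = exp(-u).
lam = 2.0            # illustrative parameter value
mu = math.log(lam)

def f_exponential(z):
    return lam * math.exp(-lam * z) if z > 0.0 else 0.0

f_u = transformed_density(f_exponential, lambda u: math.exp(-u), lambda u: -math.exp(-u))

if __name__ == "__main__":
    print(f_x(1.0), math.exp(-1.0))                          # density from Problem 52
    print(f_u(0.3), math.exp(mu - 0.3 - math.exp(mu - 0.3)))  # density from Problem 53
    print(integrate(f_u, -10.0, 30.0))                        # should be close to 1
```

Because the old density is zero outside its support, the plugged-in density automatically vanishes outside A, which is step (1) of the rule in disguise.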
Alternative without transformation rule for densities: $F_u(u) = \Pr[u \le u] = \Pr[-\log z \le u] = \Pr[\log z \ge -u] = \Pr[z \ge e^{-u}] = \int_{e^{-u}}^{+\infty} \lambda e^{-\lambda z}\,dz = -e^{-\lambda z}\big|_{e^{-u}}^{+\infty} = e^{-\lambda e^{-u}}$, now differentiate.

Problem 54. 4 points Assume the random variable z has the exponential distribution with λ = 1, i.e., its density function is $f_z(z) = \exp(-z)$ for z ≥ 0 and 0 for z < 0. Define $u = \sqrt{z}$. Compute the density function of u.

Answer. (1) A = {u : u ≥ 0} since $\sqrt{\ }$ always denotes the nonnegative square root; (2) Express old variable in terms of new: $z = u^2$, this is one-to-one on A (but not one-to-one on all of R); (3) then the derivative is 2u, which is nonnegative as well, no absolute values are necessary; (4) multiplying gives the density of u: $f_u(u) = 2u \exp(-u^2)$ if u ≥ 0 and 0 elsewhere.

3.7. Example: Binomial Variable

Go back to our Bernoulli trial with parameters p and n, and define a random variable x which represents the number of successes. Then the probability mass function of x is

(3.7.1)  $p_x(k) = \Pr[x = k] = \binom{n}{k} p^k (1 - p)^{n - k}$,  k = 0, 1, 2, . . . , n

Proof is simple, every subset of k elements represents one possibility of spreading out the k successes.

We will call any observed random variable a statistic. And we call a statistic t sufficient for a parameter θ if and only if for any event A and for any possible value t of t, the conditional probability Pr[A|t≤t] does not involve θ. This means: after observing t no additional information can be obtained about θ from the outcome of the experiment.

Problem 55. Show that x, the number of successes in the Bernoulli trial with parameters p and n, is a sufficient statistic for the parameter p (the probability of success), with n, the number of trials, a known fixed number.

Answer. Since the distribution of x is discrete, it is sufficient to show that for any given k, Pr[A|x=k] does not involve p whatever the event A in the Bernoulli trial.
Furthermore, since the Bernoulli trial with n tries is finite, we only have to show it if A is an elementary event in $\mathcal{F}$, i.e., an event consisting of one element. Such an elementary event would be that the outcome of the trial has a certain given sequence of successes and failures. A general A is the finite disjoint union of all elementary events contained in it, and if the probability of each of these elementary events does not depend on p, then their sum does not either. Now start with the definition of conditional probability

(3.7.2)  $\Pr[A \mid x = k] = \frac{\Pr[A \cap \{x = k\}]}{\Pr[x = k]}$.

If A is an elementary event whose number of successes is not k, then A ∩ {x=k} = ∅, therefore its probability is 0, which does not involve p. If A is an elementary event which has k successes, then A ∩ {x=k} = A, which has probability $p^k (1 - p)^{n - k}$. Since $\Pr[\{x = k\}] = \binom{n}{k} p^k (1 - p)^{n - k}$, the terms in formula (3.7.2) that depend on p cancel out, one gets $\Pr[A \mid x = k] = 1 / \binom{n}{k}$. Again there is no p in that formula.

Problem 56. You perform a Bernoulli experiment, i.e., an experiment which can only have two outcomes, success s and failure f. The probability of success is p.

• a. 3 points You make 4 independent trials. Show that the probability that the first trial is successful, given that the total number of successes in the 4 trials is 3, is 3/4.

Answer. Let B = {sfff, sffs, sfsf, sfss, ssff, ssfs, sssf, ssss} be the event that the first trial is successful, and let {x=3} = {fsss, sfss, ssfs, sssf} be the event that there are 3 successes, it has $\binom{4}{3} = 4$ elements. Then

(3.7.3)  $\Pr[B \mid x = 3] = \frac{\Pr[B \cap \{x = 3\}]}{\Pr[x = 3]}$

Now B ∩ {x=3} = {sfss, ssfs, sssf}, which has 3 elements. Therefore we get

(3.7.4)  $\Pr[B \mid x = 3] = \frac{3 \cdot p^3 (1 - p)}{4 \cdot p^3 (1 - p)} = \frac{3}{4}$.

• b. 2 points Discuss this result.

Answer. It is significant that this probability is independent of p.
I.e., once we know how many successes there were in the 4 trials, knowing the true p does not help us compute the probability of the event. From this also follows that the outcome of the event has no information about p. The value 3/4 is the same as the unconditional probability if p = 3/4. I.e., whether we know that the true frequency, the one that holds in the long run, is 3/4, or whether we know that the actual frequency in this sample is 3/4, both will lead us to the same predictions regarding the first throw. But not all conditional probabilities are equal to their unconditional counterparts: the conditional probability to get 3 successes in the first 4 trials is 1, but the unconditional probability is of course not 1.

3.8. Pitfalls of Data Reduction: The Ecological Fallacy

The nineteenth-century sociologist Emile Durkheim collected data on the frequency of suicides and the religious makeup of many contiguous provinces in Western Europe. He found that, on average, provinces with greater proportions of Protestants had higher suicide rates and those with greater proportions of Catholics lower suicide rates. Durkheim concluded from this that Protestants are more likely to commit suicide than Catholics. But this is not a compelling conclusion. It may have been that Catholics in predominantly Protestant provinces were taking their own lives. The oversight of this logical possibility is called the “Ecological Fallacy” [Sel58].

This seems like a far-fetched example, but arguments like this have been used to discredit data establishing connections between alcoholism and unemployment etc. as long as the unit of investigation is not the individual but some aggregate.

One study [RZ78] found a positive correlation between driver education and the incidence of fatal automobile accidents involving teenagers.
Closer analysis showed that the net effect of driver education was to put more teenagers on the road and therefore to increase rather than decrease the number of fatal crashes involving teenagers.

Problem 57. 4 points Assume your data show that counties with high rates of unemployment also have high rates of heart attacks. Can one conclude from this that the unemployed have a higher risk of heart attack? Discuss, besides the “ecological fallacy,” also other objections which one might make against such a conclusion.

Answer. The ecological fallacy says that such a conclusion is only legitimate if one has individual data. Perhaps a rise in unemployment is associated with increased pressure and increased workloads among the employed, therefore it is the employed, not the unemployed, who get the heart attacks. Even if one has individual data one can still raise the following objection: perhaps unemployment and heart attacks are both consequences of a third variable (both unemployment and heart attacks depend on age or education, or freezing weather in a farming community causes unemployment for workers and heart attacks for the elderly).

But it is also possible to commit the opposite error and rely too much on individual data and not enough on “neighborhood effects.” In a relationship between health and income, it is much more detrimental for your health if you are poor in a poor neighborhood, than if you are poor in a rich neighborhood; and even wealthy people in a poor neighborhood do not escape some of the health and safety risks associated with this neighborhood.

Another pitfall of data reduction is Simpson’s paradox. According to table 1, the new drug was better than the standard drug both in urban and rural areas. But if you aggregate over urban and rural areas, then it looks like the standard drug was better than the new drug. This is an artificial example from [Spr98, p. 360].

Responses in Urban and Rural Areas to Each of Two Drugs

              Standard Drug       New Drug
              Urban    Rural    Urban    Rural
No Effect       500      350     1050      120
Cure            100      350      350      180

Table 1. Disaggregated Results of a New Drug

Response to Two Drugs

              Standard Drug    New Drug
No Effect          850           1170
Cure               450            530

Table 2. Aggregated Version of Table 1

3.9. Independence of Random Variables

The concept of independence can be extended to random variables: x and y are independent if all events that can be defined in terms of x are independent of all events that can be defined in terms of y, i.e., all events of the form {ω ∈ U : x(ω) ∈ C} are independent of all events of the form {ω ∈ U : y(ω) ∈ D} with arbitrary (measurable) subsets C, D ⊂ R. Equivalent to this is that all events of the sort x≤a are independent of all events of the sort y≤b.

Problem 58. 3 points The simplest random variables are indicator functions, i.e., functions which can only take the values 0 and 1. Assume x is indicator function of the event A and y indicator function of the event B, i.e., x takes the value 1 if A occurs, and the value 0 otherwise, and similarly with y and B. Show that according to the above definition of independence, x and y are independent if and only if the events A and B are independent. (Hint: which are the only two events, other than the certain event U and the null event ∅, that can be defined in terms of x?)

Answer. Only A and A′. Therefore we merely need the fact, shown in Problem 35, that if A and B are independent, then also A and B′ are independent. By the same argument, also A′ and B are independent, and A′ and B′ are independent. This is all one needs, except the observation that every event is independent of the certain event and the null event.

3.10. Location Parameters and Dispersion Parameters of a Random Variable

3.10.1. Measures of Location.
A location parameter of random variables is a parameter which increases by c if one adds the constant c to the random variable.

The expected value is the most important location parameter. To motivate it, assume x is a discrete random variable, i.e., it takes the values $x_1, \ldots, x_r$ with probabilities $p_1, \ldots, p_r$ which sum up to one: $\sum_{i=1}^{r} p_i = 1$. x is observed n times independently. What can we expect the average value of x to be? For this we first need a formula for this average: if $k_i$ is the number of times that x assumed the value $x_i$ (i = 1, . . . , r) then $\sum k_i = n$, and the average is $\frac{k_1}{n} x_1 + \cdots + \frac{k_r}{n} x_r$. With an appropriate definition of convergence, the relative frequencies $\frac{k_i}{n}$ converge towards $p_i$. Therefore the average converges towards $p_1 x_1 + \cdots + p_r x_r$. This limit is the expected value of x, written as

(3.10.1)  $E[x] = p_1 x_1 + \cdots + p_r x_r$.

Problem 59. Why can one not use the usual concept of convergence here?

Answer. Because there is no guarantee that the sample frequencies converge. It is not physically impossible (although it is highly unlikely) that a certain outcome will never be realized.

Note the difference between the sample mean, i.e., the average measured in a given sample, and the “population mean” or expected value. The former is a random variable, the latter is a parameter. I.e., the former takes on a different value every time the experiment is performed, the latter does not.

Note that the expected value of the number of dots on a die is 3.5, which is not one of the possible outcomes when one rolls a die.

Expected value can be visualized as the center of gravity of the probability mass. If one of the tails has its weight so far out that there is no finite balancing point then the expected value is plus or minus infinity. If both tails have their weights so far out that neither one has a finite balancing point, then the expected value does not exist.
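The distinction between the expected value (a parameter) and the sample mean (a random variable) can be made concrete with a short sketch. It assumes a fair six-sided die; the function names and the seeds are illustrative choices.

```python
import random
from fractions import Fraction

def expected_value(values, probs):
    """E[x] = p_1 x_1 + ... + p_r x_r for a discrete random variable, as in (3.10.1)."""
    return sum(p * x for x, p in zip(values, probs))

def sample_mean(values, probs, n, seed=0):
    """Average of n independent simulated observations; changes with the seed."""
    rng = random.Random(seed)
    draws = rng.choices(values, weights=[float(p) for p in probs], k=n)
    return sum(draws) / n

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6  # fair die (assumption of this example)

if __name__ == "__main__":
    print("expected value:", expected_value(faces, probs))
    for n in (10, 1_000, 100_000):
        print(n, sample_mean(faces, probs, n))
```

The expected value is the fixed number 7/2, while the sample means differ from run to run (here, from seed to seed) and only tend to settle near 3.5 as n grows.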
It is trivial to show that for a function g(x) (which only needs to be defined for those values which x can assume with nonzero probability), $E[g(x)] = p_1 g(x_1) + \cdots + p_r g(x_r)$.

Example of a countable probability mass distribution which has an infinite expected value: $\Pr[x = x] = \frac{a}{x^2}$ for x = 1, 2, . . .. (a is the constant $1 / \sum_{i=1}^{\infty} \frac{1}{i^2}$.) The expected value of x would be $\sum_{i=1}^{\infty} \frac{a}{i}$, which is infinite. But if the random variable is bounded, then its expected value exists.

The expected value of a continuous random variable is defined in terms of its density function:

(3.10.2)  $E[x] = \int_{-\infty}^{+\infty} x f_x(x)\,dx$

It can be shown that for any function g(x) defined for all those x for which $f_x(x) \ne 0$ follows:

(3.10.3)  $E[g(x)] = \int_{f_x(x) \ne 0} g(x) f_x(x)\,dx$

Here the integral is taken over all the points which have nonzero density, instead of the whole line, because we did not require that the function g is defined at the points where the density is zero.

Problem 60. Let the random variable x have the Cauchy distribution, i.e., its density function is

(3.10.4)  $f_x(x) = \frac{1}{\pi (1 + x^2)}$

Show that x does not have an expected value.

Answer.

(3.10.5)  $\int \frac{x\,dx}{\pi (1 + x^2)} = \frac{1}{2\pi} \int \frac{2x\,dx}{1 + x^2} = \frac{1}{2\pi} \int \frac{d(x^2)}{1 + x^2} = \frac{1}{2\pi} \ln(1 + x^2)$,

which is unbounded as x → ±∞, so the integral over the whole line does not converge.

Rules about how to calculate with expected values (as long as they exist):

(3.10.6)  E[c] = c if c is a constant
(3.10.7)  E[ch] = c E[h]
(3.10.8)  E[h + j] = E[h] + E[j]

and if the random variables h and j are independent, then also

(3.10.9)  E[hj] = E[h] E[j].

Problem 61. 2 points You make two independent trials of a Bernoulli experiment with success probability θ, and you observe t, the number of successes. Compute the expected value of $t^3$. (Compare also Problem 169.)

Answer. $\Pr[t = 0] = (1 - \theta)^2$; $\Pr[t = 1] = 2\theta(1 - \theta)$; $\Pr[t = 2] = \theta^2$. Therefore an application of (3.10.1) gives $E[t^3] = 0^3 \cdot (1 - \theta)^2 + 1^3 \cdot 2\theta(1 - \theta) + 2^3 \cdot \theta^2 = 2\theta + 6\theta^2$.

Theorem 3.10.1. Jensen’s Inequality: Let g : R → R be a function which is convex on an interval B ⊂ R, which means

(3.10.10)  $g(\lambda a + (1 - \lambda) b) \le \lambda g(a) + (1 - \lambda) g(b)$

for all a, b ∈ B and all λ ∈ [0, 1]. Furthermore let x : U → R be a random variable so that Pr[x ∈ B] = 1. Then g(E[x]) ≤ E[g(x)].

Proof. (1) The Jensen inequality holds with equality if h(x) is a linear function (with a constant term), i.e., in this case, E[h(x)] = h(E[x]). (2) Therefore Jensen’s inequality is proved if we can find a linear function h with the two properties h(E[x]) = g(E[x]), and h(x) ≤ g(x) for all other x, because with such an h, E[g(x)] ≥ E[h(x)] = h(E[x]) = g(E[x]). (3) The existence of such an h follows from convexity. Since g is convex, for every point a ∈ B there is a number β so that g(x) ≥ g(a) + β(x − a). This β is the slope of g if g is differentiable, and otherwise it is some number between the left and the right derivative (which both always exist for a convex function). We need this for a = E[x].

This existence is the deepest part of this proof. We will not prove it here, for a proof see [Rao73, pp. 57, 58]. One can view it as a special case of the separating hyperplane theorem.

Problem 62. Use Jensen’s inequality to show that $(E[x])^2 \le E[x^2]$. You are allowed to use, without proof, the fact that a function is convex on B if the second derivative exists on B and is nonnegative.

Problem 63. Show that the expected value of the empirical distribution of a sample is the sample mean.

Other measures of location: The median is that number m for which there is as much probability mass to the left of m as to the right, i.e.,

(3.10.11)  $\Pr[x \le m] = \frac{1}{2}$ or, equivalently, $F_x(m) = \frac{1}{2}$.

It is much more robust with respect to outliers than the mean. If there is more than one m satisfying (3.10.11), then some authors choose the smallest (in which case the median is a special case of the quantile function $m = F^{-1}(1/2)$), and others the average between the biggest and smallest.
If there is no m with property (3.10.11), i.e., if the cumulative distribution function jumps from a value that is less than 1/2 to a value that is greater than 1/2, then the median is this jump point.

The mode is the point where the probability mass function or the probability density function is highest.

3.10.2. Measures of Dispersion. Here we will discuss variance, standard deviation, and quantiles and percentiles: The variance is defined as

(3.10.12)  $\mathrm{var}[x] = E[(x - E[x])^2]$,

but the formula

(3.10.13)  $\mathrm{var}[x] = E[x^2] - (E[x])^2$

is usually more convenient.

How to calculate with variance?

(3.10.14)  $\mathrm{var}[ax] = a^2\,\mathrm{var}[x]$
(3.10.15)  $\mathrm{var}[x + c] = \mathrm{var}[x]$ if c is a constant
(3.10.16)  $\mathrm{var}[x + y] = \mathrm{var}[x] + \mathrm{var}[y]$ if x and y are independent.

Note that the variance is additive only when x and y are independent; the expected value is always additive.

Problem 64. Here we make the simple step from the definition of the variance to the usually more convenient formula (3.10.13).

• a. 2 points Derive the formula $\mathrm{var}[x] = E[x^2] - (E[x])^2$ from the definition of a variance, which is $\mathrm{var}[x] = E[(x - E[x])^2]$. Hint: it is convenient to define µ = E[x]. Write it down carefully, you will lose points for missing or unbalanced parentheses or brackets.

Answer. Here it is side by side with and without the notation E[x] = µ:

(3.10.17)
  var[x] = E[(x − E[x])²]                    var[x] = E[(x − µ)²]
         = E[x² − 2x(E[x]) + (E[x])²]               = E[x² − 2xµ + µ²]
         = E[x²] − 2(E[x])² + (E[x])²               = E[x²] − 2µ² + µ²
         = E[x²] − (E[x])².                         = E[x²] − µ².

• b. 1 point Assume var[x] = 3, var[y] = 2, x and y are independent. Compute var[−x], var[3y + 5], and var[x − y].

Answer. 3, 18, and 5.

Problem 65. If all $y_i$ are independent with same variance $\sigma^2$, then show that $\bar{y}$ has variance $\sigma^2 / n$.

The standard deviation is the square root of the variance. It is often preferred because it has the same scale as x. The variance, on the other hand, has the advantage of a simple addition rule.
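The rules (3.10.13) to (3.10.16) and the answer to Problem 64 b can be verified exactly on a small example. The sketch below assumes two hypothetical discrete distributions chosen so that var[x] = 3 and var[y] = 2; the helper names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

def expectation(dist, g=lambda v: v):
    """E[g(x)] for a discrete distribution given as {value: probability}."""
    return sum(p * g(v) for v, p in dist.items())

def variance(dist):
    """var[x] = E[(x - E[x])^2], the definition (3.10.12)."""
    mean = expectation(dist)
    return expectation(dist, lambda v: (v - mean) ** 2)

def transform(dist, f):
    """Distribution of f(x)."""
    out = {}
    for v, p in dist.items():
        out[f(v)] = out.get(f(v), 0) + p
    return out

def combine(dist_x, dist_y, f):
    """Distribution of f(x, y) when x and y are independent."""
    out = {}
    for (vx, px), (vy, py) in product(dist_x.items(), dist_y.items()):
        out[f(vx, vy)] = out.get(f(vx, vy), 0) + px * py
    return out

# Hypothetical distributions with var[x] = 3 and var[y] = 2.
x = {-3: Fraction(1, 6), 0: Fraction(2, 3), 3: Fraction(1, 6)}
y = {-2: Fraction(1, 4), 0: Fraction(1, 2), 2: Fraction(1, 4)}

if __name__ == "__main__":
    print(variance(x), variance(y))
    print(variance(transform(x, lambda v: -v)))          # var[-x]
    print(variance(transform(y, lambda v: 3 * v + 5)))   # var[3y + 5]
    print(variance(combine(x, y, lambda a, b: a - b)))   # var[x - y]
```

The function `combine` builds the joint distribution as a product of the marginals, which is exactly the independence assumption needed for (3.10.16).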
Standardization: if the random variable x has expected value µ and standard deviation σ, then $z = \frac{x - \mu}{\sigma}$ has expected value zero and variance one.

An αth quantile or a 100αth percentile of a random variable x was already defined previously to be the smallest number x so that Pr[x≤x] ≥ α.

3.10.3. Mean-Variance Calculations. If one knows mean and variance of a random variable, one does not by any means know the whole distribution, but one has already some information. For instance, one can compute $E[y^2]$ from it, too.

Problem 66. 4 points Consumer M has an expected utility function for money income $u(x) = 12x - x^2$. The meaning of an expected utility function is very simple: if he owns an asset that generates some random income y, then the utility he derives from this asset is the expected value E[u(y)]. He is contemplating acquiring two assets. One asset yields an income of 4 dollars with certainty. The other yields an expected income of 5 dollars with standard deviation 2 dollars. Does he prefer the certain or the uncertain asset?

Answer. $E[u(y)] = 12 E[y] - E[y^2] = 12 E[y] - \mathrm{var}[y] - (E[y])^2$. Therefore the certain asset gives him utility 48 − 0 − 16 = 32, and the uncertain one 60 − 4 − 25 = 31. He prefers the certain asset.

3.10.4. Moment Generating Function and Characteristic Function. Here we will use the exponential function $e^x$, also often written exp(x), which has the two properties: $e^x = \lim_{n \to \infty} (1 + \frac{x}{n})^n$ (Euler’s limit), and $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$.

Many (but not all) random variables x have a moment generating function $m_x(t)$ for certain values of t. If they do for t in an open interval around zero, then their distribution is uniquely determined by it. The definition is

(3.10.18)  $m_x(t) = E[e^{tx}]$

It is a powerful computational device.

The moment generating function is in many cases a more convenient characterization of the random variable than the density function.
The moment generating function has the following uses:

1. One obtains the moments of $x$ by the simple formula

(3.10.19)  $\mathrm{E}[x^k] = \frac{d^k}{dt^k} m_x(t)\Big|_{t=0}$.

Proof:

(3.10.20)  $e^{tx} = 1 + tx + \frac{t^2x^2}{2!} + \frac{t^3x^3}{3!} + \cdots$
(3.10.21)  $m_x(t) = \mathrm{E}[e^{tx}] = 1 + t\,\mathrm{E}[x] + \frac{t^2}{2!}\mathrm{E}[x^2] + \frac{t^3}{3!}\mathrm{E}[x^3] + \cdots$
(3.10.22)  $\frac{d}{dt} m_x(t) = \mathrm{E}[x] + t\,\mathrm{E}[x^2] + \frac{t^2}{2!}\mathrm{E}[x^3] + \cdots$
(3.10.23)  $\frac{d^2}{dt^2} m_x(t) = \mathrm{E}[x^2] + t\,\mathrm{E}[x^3] + \cdots$ etc.

2. The moment generating function is also good for determining the probability distribution of linear combinations of independent random variables.

a. It is easy to get the m.g.f. of $\lambda x$ from that of $x$:

(3.10.24)  $m_{\lambda x}(t) = m_x(\lambda t)$

because both sides are $\mathrm{E}[e^{\lambda t x}]$.

b. If $x$ and $y$ are independent, then

(3.10.25)  $m_{x+y}(t) = m_x(t)\,m_y(t)$.

The proof is simple:

(3.10.26)  $\mathrm{E}[e^{t(x+y)}] = \mathrm{E}[e^{tx}e^{ty}] = \mathrm{E}[e^{tx}]\,\mathrm{E}[e^{ty}]$ due to independence.

The characteristic function is defined as $\psi_x(t) = \mathrm{E}[e^{itx}]$, where $i = \sqrt{-1}$. It has the disadvantage that it involves complex numbers, but it has the advantage that it always exists, since $\exp(ix) = \cos x + i \sin x$. Since cos and sin are both bounded, they always have an expected value. And, as its name says, the characteristic function characterizes the probability distribution. Analytically, many of its properties are similar to those of the moment generating function.

3.11. Entropy

3.11.1. Definition of Information. Entropy is the average information gained by the performance of the experiment. The actual information yielded by an event $A$ with probability $\Pr[A] = p \neq 0$ is defined as follows:

(3.11.1)  $I[A] = \log_2 \frac{1}{\Pr[A]}$

This is simply a transformation of the probability, and it has the dual interpretation of either how unexpected the event was, or the information yielded by the occurrence of event $A$. It is characterized by the following properties [AD75, pp. 3–5]:

• $I[A]$ only depends on the probability of $A$; in other words, the information content of a message is independent of how the information is coded.
• $I[A] \ge 0$ (nonnegativity), i.e., after knowing whether $A$ occurred we are no more ignorant than before.
• If $A$ and $B$ are independent then $I[A \cap B] = I[A] + I[B]$ (additivity for independent events). This is the most important property.
• Finally the (inessential) normalization that if $\Pr[A] = 1/2$ then $I[A] = 1$, i.e., a yes-or-no decision with equal probability (coin flip) is one unit of information.

Note that the information yielded by occurrence of the certain event is 0, and that yielded by occurrence of the impossible event is $\infty$.

But the important information-theoretic results refer to average, not actual, information; therefore let us now define entropy:

3.11.2. Definition of Entropy. The entropy of a probability field (experiment) is a measure of the uncertainty prevailing before the experiment is performed, or of the average information yielded by the performance of this experiment. If the set $U$ of possible outcomes of the experiment has only a finite number of different elements, say their number is $n$, and the probabilities of these outcomes are $p_1, \ldots, p_n$, then the Shannon entropy $H[F]$ of this experiment is defined as

(3.11.2)  $\frac{H[F]}{\text{bits}} = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}$

This formula uses $\log_2$, the logarithm with base 2, which can easily be computed from the natural logarithms, $\log_2 x = \log x / \log 2$. The choice of base 2 is convenient because in this way the most informative Bernoulli experiment, that with success probability $p = 1/2$ (coin flip), has entropy 1. This is why one says: "the entropy is measured in bits." If one goes over to logarithms of a different base, this simply means that one measures entropy in different units. In order to indicate this dependence on the measuring unit, equation (3.11.2) was written as the definition of $H[F]/\text{bits}$ instead of $H[F]$ itself, i.e., this is the number one gets if one measures the entropy in bits.
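Definition (3.11.2) translates directly into a few lines of code. The following Python sketch is an illustration added here, not part of the original notes; the function name `entropy_bits` is hypothetical, and outcomes with probability zero are skipped, which assumes that such outcomes contribute nothing to the sum.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, measured in bits, of a finite list of probabilities."""
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes are skipped (assumed to contribute nothing).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

coin = entropy_bits([0.5, 0.5])                   # fair coin flip
coin_plus_impossible = entropy_bits([0.5, 0.5, 0.0])
uniform32 = entropy_bits([1 / 32] * 32)           # 32 equally likely outcomes
biased = entropy_bits([0.9, 0.1])                 # less informative than a fair coin

print(coin, coin_plus_impossible, uniform32, biased)
```

The fair coin yields one bit, and the biased Bernoulli experiment yields less, in line with the remark that $p = 1/2$ is the most informative case.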
If one uses natural logarithms, then the entropy is measured in "nats."

Entropy can be characterized axiomatically by the following axioms [Khi57]:

• The uncertainty associated with a finite complete scheme takes its largest value if all events are equally likely, i.e., $H(p_1, \ldots, p_n) \le H(1/n, \ldots, 1/n)$.
• The addition of an impossible event to a scheme does not change the amount of uncertainty.
• Composition Law: If the possible outcomes are arbitrarily combined into $m$ groups $W_1 = X_{11} \cup \cdots \cup X_{1k_1}$, $W_2 = X_{21} \cup \cdots \cup X_{2k_2}$, ..., $W_m = X_{m1} \cup \cdots \cup X_{mk_m}$, with corresponding probabilities $w_1 = p_{11} + \cdots + p_{1k_1}$, $w_2 = p_{21} + \cdots + p_{2k_2}$, ..., $w_m = p_{m1} + \cdots + p_{mk_m}$, then

$H(p_1, \ldots, p_n) = H(w_1, \ldots, w_m) + w_1 H(p_{11}/w_1, \ldots, p_{1k_1}/w_1) + w_2 H(p_{21}/w_2, \ldots, p_{2k_2}/w_2) + \cdots + w_m H(p_{m1}/w_m, \ldots, p_{mk_m}/w_m)$.

Since $p_{ij}/w_i = \Pr[X_{ij} \mid W_i]$, the composition law means: if you first learn half the outcome of the experiment, and then the other half, you will on average get as much information as if you had been told the total outcome all at once.

The entropy of a random variable $x$ is simply the entropy of the probability field induced by $x$ on $\mathbb{R}$. It does not depend on the values $x$ takes but only on the probabilities. For discretely distributed random variables it can be obtained by the following "eerily self-referential" prescription: plug the random variable into its own probability mass function and compute the expected value of the negative logarithm of this, i.e.,

$H[x] = \mathrm{E}[-\log_2 p_x(x)]$ bits.

One interpretation of the entropy is: it is the average number of yes-or-no questions necessary to describe the outcome of the experiment. For instance, consider an experiment which has 32 different outcomes occurring with equal probabilities.
The entropy is

(3.11.3)  $\frac{H}{\text{bits}} = \sum_{i=1}^{32} \frac{1}{32} \log_2 32 = \log_2 32 = 5$,

(3.11.4)  i.e., $H = 5$ bits,

which agrees with the number of bits necessary to describe the outcome.

Problem 67. Design a questioning scheme to find out the value of an integer between 1 and 32, and compute the expected number of questions in your scheme if all numbers are equally likely.

Answer. In binary digits one needs a number of length 5 to describe a number between 0 and 31; therefore the 5 questions might be: write down the binary expansion of your number minus 1. Is the first binary digit in this expansion a zero? Then: is the second binary digit in this expansion a zero? Etc. Formulated without the use of binary digits these same questions would be: is the number between 1 and 16? Then: is it between 1 and 8 or 17 and 24? Then: is it between 1 and 4 or 9 and 12 or 17 and 20 or 25 and 28? Etc., the last question being whether it is odd. Of course, you can formulate those questions conditionally: First: between 1 and 16? If no, then second: between 17 and 24? If yes, then second: between 1 and 8? Etc. Each of these questions gives you exactly the entropy of 1 bit.

Problem 68. [CT91, example 1.1.2 on p. 5] Assume there is a horse race with eight horses taking part. The probabilities for winning for the eight horses are $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}$.

• a. 1 point Show that the entropy of the horse race is 2 bits.

Answer.

$\frac{H}{\text{bits}} = \frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{1}{16}\log_2 16 + \frac{4}{64}\log_2 64 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = \frac{4 + 4 + 3 + 2 + 3}{8} = 2$

• b. 1 point Suppose you want to send a binary message to another person indicating which horse won the race. One alternative is to assign the bit strings 000, 001, 010, 011, 100, 101, 110, 111 to the eight horses. This description requires 3 bits for any of the horses.
But since the win probabilities are not uniform, it makes sense to use shorter descriptions for the horses more likely to win, so that we achieve a lower expected value of the description length. For instance, we could use the following set of bit strings for the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. Show that the expected length of the message you send to your friend is 2 bits, as opposed to 3 bits for the uniform code. Note that in this case the expected value of the description length is equal to the entropy.

Answer. The math is the same as in the first part of the question:

$\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{16}\cdot 4 + 4\cdot\frac{1}{64}\cdot 6 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = \frac{4 + 4 + 3 + 2 + 3}{8} = 2$

Problem 69. [CT91, example 2.1.2 on pp. 14/15] The experiment has four possible outcomes; outcome $x=a$ occurs with probability 1/2, $x=b$ with probability 1/4, $x=c$ with probability 1/8, and $x=d$ with probability 1/8.

• a. 2 points The entropy of this experiment (in bits) is one of the following three numbers: 11/8, 7/4, 2. Which is it?

• b. 2 points Suppose we wish to determine the outcome of this experiment with the minimum number of questions. An efficient first question is "Is $x=a$?" This splits the probability in half. If the answer to the first question is no, then the second question can be "Is $x=b$?" The third question, if it is necessary, can then be: "Is $x=c$?" Compute the expected number of binary questions required.

• c. 2 points Show that the entropy gained by each question is 1 bit.

• d. 3 points Assume we know about the first outcome that $x \neq a$. What is the entropy of the remaining experiment (i.e., under the conditional probability)?

• e. 5 points Show in this example that the composition law for entropy holds.

Problem 70. 2 points In terms of natural logarithms equation (3.11.4) defining entropy reads

(3.11.5)  $\frac{H}{\text{bits}} = \frac{1}{\ln 2} \sum_{k=1}^{n} p_k \ln \frac{1}{p_k}$.

Compute the entropy of (i.e., the average information gained by) a roll of an unbiased die.

Answer.
Same as the actual information gained, since each outcome is equally likely:

(3.11.6)  $\frac{H}{\text{bits}} = \frac{1}{\ln 2}\left(\frac{1}{6}\ln 6 + \cdots + \frac{1}{6}\ln 6\right) = \frac{\ln 6}{\ln 2} = 2.585$

• a. 3 points How many questions does one need on average to determine the outcome of the roll of an unbiased die? In other words, pick a certain questioning scheme (try to make it efficient) and compute the average number of questions if this scheme is followed. Note that this average cannot be smaller than the entropy $H/\text{bits}$, and if one chooses the questions optimally, it is smaller than $H/\text{bits} + 1$.

Answer. First question: is it bigger than 3? Second question: is it even? Third question (if necessary): is it a multiple of 3? In this scheme, the numbers of questions for the six faces of the die are 3, 2, 3, 3, 2, 3; therefore the average is $\frac{4}{6}\cdot 3 + \frac{2}{6}\cdot 2 = 2\tfrac{2}{3}$. Also optimal: (1) is it bigger than 2? (2) is it odd? (3) is it bigger than 4? This gives 2, 2, 3, 3, 3, 3. Also optimal: first question: is it 1 or 2? If the answer is no, then the second question is: is it 3 or 4?; otherwise go directly to the third question: is it odd or even? The steamroller approach: Is it 1? Is it 2? etc. gives 1, 2, 3, 4, 5, 5 with expected number $3\tfrac{1}{3}$. Even this is here $< 1 + H/\text{bits}$.

Problem 71.

• a. 1 point Compute the entropy of a roll of two unbiased dice if they are distinguishable.

Answer. Just twice the entropy from Problem 70.

(3.11.7)  $\frac{H}{\text{bits}} = \frac{1}{\ln 2}\left(\frac{1}{36}\ln 36 + \cdots + \frac{1}{36}\ln 36\right) = \frac{\ln 36}{\ln 2} = 5.170$

• b. Would you expect the entropy to be greater or less in the more usual case that the dice are indistinguishable? Check your answer by computing it.

Answer. If the dice are indistinguishable, then one gets less information; therefore the experiment has less entropy. One has six like pairs with probability 1/36 and $6 \cdot 5/2 = 15$ unlike pairs with probability $2/36 = 1/18$ each.
Therefore the average information gained is

(3.11.8)  $\frac{H}{\text{bits}} = \frac{1}{\ln 2}\left(6\cdot\frac{1}{36}\ln 36 + 15\cdot\frac{1}{18}\ln 18\right) = \frac{1}{\ln 2}\left(\frac{1}{6}\ln 36 + \frac{5}{6}\ln 18\right) = 4.337$

• c. 3 points Note that the difference between these two entropies is $5/6 = 0.833$. How can this be explained?

Answer. This is the composition law in action. Assume you roll two dice which you first consider indistinguishable and afterwards someone tells you which is which. How much information do you gain? Well, if the numbers are the same, then telling you which die is which does not give you any information, since the outcomes of the experiment are defined as: which number has the first die, which number has the second die, regardless of where on the table the dice land. But if the numbers are different, then telling you which is which allows you to discriminate between two outcomes both of which have conditional probability 1/2 given the outcome you already know; in this case the information you gain is therefore 1 bit. Since the probability of getting two different numbers is 5/6, the expected value of the information gained explains the difference in entropy.

All these definitions use the convention $0 \log \frac{1}{0} = 0$, which can be justified by the following continuity argument: Define the function, graphed in Figure 3:

(3.11.9)  $\eta(w) = \begin{cases} w \log \frac{1}{w} & \text{if } w > 0 \\ 0 & \text{if } w = 0. \end{cases}$

$\eta$ is continuous for all $w \ge 0$, even at the boundary point $w = 0$. Differentiation gives $\eta'(w) = -(1 + \log w)$, and $\eta''(w) = -w^{-1}$. The function starts out at the origin with a vertical tangent, and since the second derivative is negative, it is strictly concave for all $w > 0$. The definition of strict concavity is $\eta(w) < \eta(v) + (w - v)\eta'(v)$ for $w \neq v$, i.e., the function lies below all its tangents. Substituting $\eta'(v) = -(1 + \log v)$ and simplifying gives $w - w \log w \le v - w \log v$ for $v, w > 0$. One verifies that this inequality also holds for $v, w \ge 0$.

Problem 72.
Make a complete proof, discussing all possible cases, that for $v, w \ge 0$ follows

(3.11.10)  $w - w \log w \le v - w \log v$

Answer. We already know it for $v, w > 0$. Now if $v = 0$ and $w = 0$ then the inequality reads $0 \le 0$; if $v > 0$ and $w = 0$ the inequality reads $0 \le v$, and if $w > 0$ and $v = 0$ then the inequality reads $w - w \log w \le +\infty$.

3.11.3. How to Keep Forecasters Honest. This mathematical result allows an interesting alternative mathematical characterization of entropy. Assume Anita performs a Bernoulli experiment whose success probability she does not know but wants to know. Clarence knows this probability but is not on very good terms with Anita; therefore Anita is unsure that he will tell the truth if she asks him.

Anita knows "how to keep forecasters honest." She proposes the following deal to Clarence: "you tell me the probability $q$, and after performing my experiment I pay you the amount $\log_2(q)$ if the experiment is a success, and $\log_2(1 - q)$ if it is a failure." If Clarence agrees to this deal, then telling Anita that value $q$ which is the true success probability of the Bernoulli experiment maximizes the expected value of his payoff. And the maximum expected value of this payoff is exactly the negative of the entropy of the experiment.

Proof: Assume the correct value of the probability is $p$, and the number Clarence tells Anita is $q$. For every $p, q$ between 0 and 1 we have to show:

(3.11.11)  $p \log p + (1 - p)\log(1 - p) \ge p \log q + (1 - p)\log(1 - q)$.

For this, plug $w = p$ and $v = q$ as well as $w = 1 - p$ and $v = 1 - q$ into equation (3.11.10) and add.

[Figure 3 not reproduced: graph of $w \log\frac{1}{w}$ against $w$, with $1/e$ marked on both axes.]

Figure 3. $\eta : w \mapsto w \log \frac{1}{w}$ is continuous at 0, and concave everywhere

3.11.4. The Inverse Problem. Now let us go over to the inverse problem: computing those probability fields which have maximum entropy subject to the information you have.

If you know that the experiment has $n$ different outcomes, and you do not know the probabilities of these outcomes, then the maximum entropy approach amounts to assigning equal probability $1/n$ to each outcome.

Problem 73. (Not eligible for in-class exams) You are playing a slot machine. Feeding one dollar to this machine leads to one of four different outcomes: $E_1$: machine returns nothing, i.e., you lose \$1. $E_2$: machine returns \$1, i.e., you lose nothing and win nothing. $E_3$: machine returns \$2, i.e., you win \$1. $E_4$: machine returns \$10, i.e., you win \$9. Event $E_i$ occurs with probability $p_i$, but these probabilities are unknown. But due to a new "Truth-in-Gambling Act" you find a sticker on the side of the machine which says that in the long run the machine pays out only \$0.90 for every dollar put in. Show that those values of $p_1$, $p_2$, $p_3$, and $p_4$ which maximize the entropy (and therefore make the machine most interesting) subject to the constraint that the expected payoff per dollar put in is \$0.90, are $p_1 = 0.4473$, $p_2 = 0.3158$, $p_3 = 0.2231$, $p_4 = 0.0138$.

Answer. A solution is derived in [Rie85, pp. 68/9 and 74/5], which refers to [Rie77]. You have to maximize $-\sum p_n \log p_n$ subject to $\sum p_n = 1$ and $\sum c_n p_n = d$. In our case $c_1 = 0$, $c_2 = 1$, $c_3 = 2$, and $c_4 = 10$, and $d = 0.9$, but the treatment below goes through for arbitrary $c_i$ as long as not all of them are equal. This case is discussed in detail in the answer to Problem 74.

• a.
Difficult: Does the maximum entropy approach also give us some guidelines how to select these probabilities if all we know is that the expected value of the payout rate is smaller than 1?

Answer. As shown in [Rie85, pp. 68/9 and 74/5], one can give the minimum value of the entropy for all distributions with payoff smaller than 1: $H < 1.6590$, and one can also give some bounds for the probabilities: $p_1 > 0.4272$, $p_2 < 0.3167$, $p_3 < 0.2347$, $p_4 < 0.0214$.

• b. What if you also know that the entropy of this experiment is 1.5?

Answer. This was the purpose of the paper [Rie85].

Problem 74. (Not eligible for in-class exams) Let $p_1, p_2, \ldots, p_n$ ($\sum p_i = 1$) be the proportions of the population of a city living in $n$ residential colonies. The cost of living in colony $i$, which includes the cost of travel from the colony to the central business district, the cost of the time this travel consumes, the rent or mortgage payments, and other costs associated with living in colony $i$, is represented by the monetary amount $c_i$. Without loss of generality we will assume that the $c_i$ are numbered in such a way that $c_1 \le c_2 \le \cdots \le c_n$. We will also assume that the $c_i$ are not all equal. We assume that the $c_i$ are known and that the average expenditure on travel etc. in the population is also known; its value is $d$. One approach to modelling the population distribution is to maximize the entropy subject to the average expenditures, i.e., to choose $p_1, p_2, \ldots, p_n$ such that $H = \sum p_i \log \frac{1}{p_i}$ is maximized subject to the two constraints $\sum p_i = 1$ and $\sum p_i c_i = d$. This would give the greatest uncertainty about where someone lives.

• a. 3 points Set up the Lagrange function and show that

(3.11.12)  $p_i = \frac{\exp(-\lambda c_i)}{\sum_j \exp(-\lambda c_j)}$

where the Lagrange multiplier $\lambda$ must be chosen such that $\sum p_i c_i = d$.

Answer. The Lagrange function is

(3.11.13)  $L = -\sum p_n \log p_n - \kappa\left(\sum p_n - 1\right) - \lambda\left(\sum c_n p_n - d\right)$

Partial differentiation with respect to $p_i$ gives the first order conditions

(3.11.14)  $-\log p_i - 1 - \kappa - \lambda c_i = 0$.
Therefore $p_i = \exp(-\kappa - 1)\exp(-\lambda c_i)$. Plugging this into the first constraint gives $1 = \sum p_i = \exp(-\kappa - 1)\sum \exp(-\lambda c_i)$, or $\exp(-\kappa - 1) = \frac{1}{\sum \exp(-\lambda c_i)}$. This constraint therefore defines $\kappa$ uniquely, and we can eliminate $\kappa$ from the formula for $p_i$:

(3.11.15)  $p_i = \frac{\exp(-\lambda c_i)}{\sum_j \exp(-\lambda c_j)}$

Now all the $p_i$ depend on the same unknown $\lambda$, and this $\lambda$ must be chosen such that the second constraint holds. This is the Maxwell-Boltzmann distribution if $1/\lambda = kT$, where $k$ is the Boltzmann constant and $T$ the temperature.

• b. 2 points Here is a mathematical lemma needed for the next part: Prove that for $a_i \ge 0$ and $c_i$ arbitrary follows $\sum a_i \sum a_i c_i^2 \ge \left(\sum a_i c_i\right)^2$, and if all $a_i > 0$ and not all $c_i$ are equal, then this inequality is strict.

Answer. By choosing different subscripts in the second sum than in the first we pair elements of the first sum with elements of the second sum:

(3.11.16)  $\sum_i a_i \sum_j c_j^2 a_j - \sum_i c_i a_i \sum_j c_j a_j = \sum_{i,j} (c_j^2 - c_i c_j)\, a_i a_j$

but if we interchange $i$ and $j$ on the right-hand side we get

(3.11.17)  $\sum_{i,j} (c_j^2 - c_i c_j)\, a_i a_j = \sum_{j,i} (c_i^2 - c_j c_i)\, a_j a_i$

Now add the right-hand sides to get

(3.11.18)  $2\left(\sum_i a_i \sum_j c_j^2 a_j - \sum_i c_i a_i \sum_j c_j a_j\right) = \sum_{i,j} (c_i^2 + c_j^2 - 2 c_i c_j)\, a_i a_j = \sum_{i,j} (c_i - c_j)^2 a_i a_j \ge 0$

• c. 3 points It is not possible to solve equations (3.11.12) analytically for $\lambda$, but the following can be shown [Kap89, p. 310/11]: the function $f$ defined by

(3.11.19)  $f(\lambda) = \frac{\sum c_i \exp(-\lambda c_i)}{\sum \exp(-\lambda c_i)}$

is a strictly decreasing function which decreases from $c_n$ to $c_1$ as $\lambda$ goes from $-\infty$ to $\infty$, and $f(0) = \bar{c}$ where $\bar{c} = (1/n)\sum c_i$. We need that $\lambda$ for which $f(\lambda) = d$, and this equation has no real root if $d < c_1$ or $d > c_n$, it has a unique positive root if $c_1 < d < \bar{c}$, it has the unique root 0 for $d = \bar{c}$, and it has a unique negative root for $\bar{c} < d < c_n$. From this follows: as long as $d$ lies between the lowest and highest cost, and as long as the cost numbers are not all equal, the $p_i$ are uniquely determined by the above entropy maximization problem.

Answer.
Here is the derivative; it is negative because of the mathematical lemma just shown:

(3.11.20)  $f'(\lambda) = \frac{u'v - uv'}{v^2} = -\frac{\sum \exp(-\lambda c_i) \sum c_i^2 \exp(-\lambda c_i) - \left(\sum c_i \exp(-\lambda c_i)\right)^2}{\left(\sum \exp(-\lambda c_i)\right)^2} < 0$

Since $c_1 \le c_2 \le \cdots \le c_n$, it follows

(3.11.21)  $c_1 = \frac{\sum c_1 \exp(-\lambda c_i)}{\sum \exp(-\lambda c_i)} \le \frac{\sum c_i \exp(-\lambda c_i)}{\sum \exp(-\lambda c_i)} \le \frac{\sum c_n \exp(-\lambda c_i)}{\sum \exp(-\lambda c_i)} = c_n$

Now the statement about the limit can be shown if not all $c_j$ are equal, say $c_1 < c_{k+1}$ but $c_1 = c_k$. The fraction can be written as

(3.11.22)  $\frac{k c_1 \exp(-\lambda c_1) + \sum_{i=1}^{n-k} c_{k+i} \exp(-\lambda c_{k+i})}{k \exp(-\lambda c_1) + \sum_{i=1}^{n-k} \exp(-\lambda c_{k+i})} = \frac{k c_1 + \sum_{i=1}^{n-k} c_{k+i} \exp(-\lambda (c_{k+i} - c_1))}{k + \sum_{i=1}^{n-k} \exp(-\lambda (c_{k+i} - c_1))}$

Since $c_{k+i} - c_1 > 0$, this converges towards $c_1$ for $\lambda \to \infty$.

• d. 3 points Show that the maximum attained entropy is $H = \lambda d + k(\lambda)$ where

(3.11.23)  $k(\lambda) = \log\left(\sum \exp(-\lambda c_j)\right)$.

Although $\lambda$ depends on $d$, show that $\frac{\partial H}{\partial d} = \lambda$, i.e., it is the same as if $\lambda$ did not depend on $d$. This is an example of the "envelope theorem," and it also gives an interpretation of $\lambda$.

Answer. We have to plug the optimal $p_i = \frac{\exp(-\lambda c_i)}{\sum \exp(-\lambda c_i)}$ into the formula for $H = -\sum p_i \log p_i$. For this note that $-\log p_i = \lambda c_i + k(\lambda)$ where $k(\lambda) = \log\left(\sum \exp(-\lambda c_j)\right)$ does not depend on $i$. Therefore $H = \sum p_i(\lambda c_i + k(\lambda)) = \lambda \sum p_i c_i + k(\lambda)\sum p_i = \lambda d + k(\lambda)$, and $\frac{\partial H}{\partial d} = \lambda + d\,\frac{\partial \lambda}{\partial d} + k'(\lambda)\frac{\partial \lambda}{\partial d}$. Now we need the derivative of $k(\lambda)$, and we discover that $k'(\lambda) = -f(\lambda)$ where $f(\lambda)$ was defined in (3.11.19). Therefore $\frac{\partial H}{\partial d} = \lambda + (d - f(\lambda))\frac{\partial \lambda}{\partial d} = \lambda$.

• e. 5 points Now assume $d$ is not known (but the $c_i$ are still known), i.e., we know that (3.11.12) holds for some $\lambda$ but we don't know which. We want to estimate this $\lambda$ (and therefore all $p_i$) by taking a random sample of $m$ people from that metropolitan area and asking them what their regional living expenditures are and where they live. Assume $x_i$ people in this sample live in colony $i$.
One way to estimate this $\lambda$ would be to use the average consumption expenditure of the sample, $\sum \frac{x_i}{m} c_i$, as an estimate of the missing $d$ in the above procedure, i.e., choose that $\lambda$ which satisfies $f(\lambda) = \sum \frac{x_i}{m} c_i$. Another procedure, which seems to make better use of the information given by the sample, would be to compute the maximum likelihood estimator of $\lambda$ based on all $x_i$. Show that these two estimation procedures are identical.

Answer. The $x_i$ have the multinomial distribution. Therefore, given that the proportion $p_i$ of the population lives in colony $i$, and you are taking a random sample of size $m$ from the whole population, the probability to get the outcome $x_1, \ldots, x_n$ is

(3.11.24)  $L = \frac{m!}{x_1! \cdots x_n!}\, p_1^{x_1} p_2^{x_2} \cdots p_n^{x_n}$

This is what we have to maximize, subject to the condition that the $p_i$ are an entropy maximizing population distribution. Let's take logs for computational simplicity:

(3.11.25)  $\log L = \log m! - \sum_j \log x_j! + \sum_i x_i \log p_i$

All we know about the $p_i$ is that they must be some entropy maximizing probabilities, but we don't know yet which ones, i.e., they depend on the unknown $\lambda$. Therefore we need the formula again: $-\log p_i = \lambda c_i + k(\lambda)$ where $k(\lambda) = \log\left(\sum \exp(-\lambda c_j)\right)$ does not depend on $i$. This gives

(3.11.26)  $\log L = \log m! - \sum_j \log x_j! - \sum_i x_i(\lambda c_i + k(\lambda)) = \log m! - \sum_j \log x_j! - \lambda \sum_i x_i c_i - k(\lambda)\, m$

(for this last term remember that $\sum x_i = m$). Since $k'(\lambda) = -f(\lambda)$, the derivative is

(3.11.27)  $\frac{1}{m}\frac{\partial}{\partial \lambda} \log L = f(\lambda) - \sum_i \frac{x_i}{m} c_i$

Setting this derivative to zero gives $f(\lambda) = \sum \frac{x_i}{m} c_i$, i.e., using the obvious estimate for $d$ is the same as maximum likelihood under the assumption of maximum entropy.

This is a powerful estimation strategy. An article with sensational image reconstitutions using maximum entropy algorithms is [SG85, pp. 111, 112, 115, 116]. And [GJM96] applies maximum entropy methods to ill-posed or underdetermined problems in econometrics!

CHAPTER 4

Specific Random Variables

4.1. Binomial

We will begin with the mean and variance of the binomial variable, i.e., the number of successes in $n$ independent repetitions of a Bernoulli trial (3.7.1). The binomial variable has the two parameters $n$ and $p$. Let us look first at the case $n = 1$, in which the binomial variable is also called indicator variable: If the event $A$ has probability $p$, then its complement has the probability $q = 1 - p$. The indicator variable of $A$, which assumes the value 1 if $A$ occurs, and 0 if it doesn't, has expected value $p$ and variance $pq$. For the binomial variable with $n$ observations, which is the sum of $n$ independent indicator variables, the expected value (mean) is $np$ and the variance is $npq$.

Problem 75. The random variable $x$ assumes the value $a$ with probability $p$ and the value $b$ with probability $q = 1 - p$. Show that $\mathrm{var}[x] = pq(a - b)^2$.

Answer. $\mathrm{E}[x] = pa + qb$; $\mathrm{var}[x] = \mathrm{E}[x^2] - (\mathrm{E}[x])^2 = pa^2 + qb^2 - (pa + qb)^2 = (p - p^2)a^2 - 2pqab + (q - q^2)b^2 = pq(a - b)^2$. For this last equality we need $p - p^2 = p(1 - p) = pq$.

The Negative Binomial Variable is, like the binomial variable, derived from the Bernoulli experiment; but one reverses the question. Instead of asking how many successes one gets in a given number of trials, one asks how many trials one must make to get a given number of successes, say, $r$ successes.

First look at $r = 1$. Let $t$ denote the number of the trial at which the first success occurs. Then

(4.1.1)  $\Pr[t = n] = pq^{n-1}$  $(n = 1, 2, \ldots)$.

This is called the geometric probability.

Is the probability derived in this way $\sigma$-additive? The sum of a geometrically declining sequence is easily computed:

(4.1.2)  $1 + q + q^2 + q^3 + \cdots = s$

Now multiply by $q$:

(4.1.3)  $q + q^2 + q^3 + \cdots = qs$

Now subtract and write $1 - q = p$:

(4.1.4)  $1 = ps$

Equation (4.1.4) means $1 = p + pq + pq^2 + \cdots$, i.e., the sum of all probabilities is indeed 1.

Now what is the expected value of a geometric variable?
Use the definition of the expected value of a discrete variable: $\mathrm{E}[t] = p\sum_{k=1}^{\infty} kq^{k-1}$. To evaluate the infinite sum, solve (4.1.4) for $s$:

(4.1.5)  $s = \frac{1}{p}$  or  $1 + q + q^2 + q^3 + q^4 + \cdots = \sum_{k=0}^{\infty} q^k = \frac{1}{1 - q}$

and differentiate both sides with respect to $q$:

(4.1.6)  $1 + 2q + 3q^2 + 4q^3 + \cdots = \sum_{k=1}^{\infty} kq^{k-1} = \frac{1}{(1 - q)^2} = \frac{1}{p^2}$.

The expected value of the geometric variable is therefore $\mathrm{E}[t] = \frac{p}{p^2} = \frac{1}{p}$.

Problem 76. Assume $t$ is a geometric random variable with parameter $p$, i.e., it has the values $k = 1, 2, \ldots$ with probabilities

(4.1.7)  $p_t(k) = pq^{k-1}$, where $q = 1 - p$.

The geometric variable denotes the number of times one has to perform a Bernoulli experiment with success probability $p$ to get the first success.

• a. 1 point Given a positive integer $n$. What is $\Pr[t > n]$? (Easy with a simple trick!)

Answer. $t > n$ means the first $n$ trials must result in failures, i.e., $\Pr[t > n] = q^n$. Since $\{t > n\} = \{t = n + 1\} \cup \{t = n + 2\} \cup \cdots$, one can also get the same result in a more tedious way: It is $pq^n + pq^{n+1} + pq^{n+2} + \cdots = s$, say. Therefore $qs = pq^{n+1} + pq^{n+2} + \cdots$, and $(1 - q)s = pq^n$; since $p = 1 - q$, it follows $s = q^n$.

• b. 2 points Let $m$ and $n$ be two positive integers with $m < n$. Show that $\Pr[t = n \mid t > m] = \Pr[t = n - m]$.

Answer. $\Pr[t = n \mid t > m] = \frac{\Pr[t = n]}{\Pr[t > m]} = \frac{pq^{n-1}}{q^m} = pq^{n-m-1} = \Pr[t = n - m]$.

• c. 1 point Why is this property called the memory-less property of the geometric random variable?

Answer. If you have already waited for $m$ periods without success, the probability that success will come in the $n$th period is the same as the probability that it comes in $n - m$ periods if you start now. This is obvious if you remember that the geometric random variable is the time you have to wait until the first success in a sequence of Bernoulli trials.

Problem 77. $t$ is a geometric random variable as in the preceding problem. In order to compute $\mathrm{var}[t]$ it is most convenient to make a detour via $\mathrm{E}[t(t - 1)]$. Here are the steps:

• a. Express $\mathrm{E}[t(t - 1)]$ as an infinite sum.
Answer. Just write it down according to the definition of expected values: $\sum_{k=1}^{\infty} k(k - 1)pq^{k-1} = \sum_{k=2}^{\infty} k(k - 1)pq^{k-1}$.

• b. Derive the formula

(4.1.8)  $\sum_{k=2}^{\infty} k(k - 1)q^{k-2} = \frac{2}{(1 - q)^3}$

by the same trick by which we derived a similar formula in class. Note that the sum starts at $k = 2$.

Answer. This is just differentiating the geometric series a second time, i.e., differentiating (4.1.6) once.

• c. Use a. and b. to derive

(4.1.9)  $\mathrm{E}[t(t - 1)] = \frac{2q}{p^2}$

Answer.

(4.1.10)  $\sum_{k=2}^{\infty} k(k - 1)pq^{k-1} = pq\sum_{k=2}^{\infty} k(k - 1)q^{k-2} = pq\,\frac{2}{(1 - q)^3} = \frac{2q}{p^2}$.

• d. Use c. and the fact that $\mathrm{E}[t] = 1/p$ to derive

(4.1.11)  $\mathrm{var}[t] = \frac{q}{p^2}$.

Answer.

(4.1.12)  $\mathrm{var}[t] = \mathrm{E}[t^2] - (\mathrm{E}[t])^2 = \mathrm{E}[t(t - 1)] + \mathrm{E}[t] - (\mathrm{E}[t])^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}$.

Now let us look at the negative binomial with arbitrary $r$. What is the probability that it takes $n$ trials to get $r$ successes? (That means, with $n - 1$ trials we did not yet have $r$ successes.) The probability that the $n$th trial is a success is $p$. The probability that there are $r - 1$ successes in the first $n - 1$ trials is $\binom{n-1}{r-1} p^{r-1} q^{n-r}$. Multiply those to get:

(4.1.13)  $\Pr[t = n] = \binom{n-1}{r-1} p^r q^{n-r}$.

This is the negative binomial, also called the Pascal probability distribution with parameters $r$ and $p$.

One easily gets the mean and variance, because due to the memory-less property it is the sum of $r$ independent geometric variables:

(4.1.14)  $\mathrm{E}[t] = \frac{r}{p}$,  $\mathrm{var}[t] = \frac{rq}{p^2}$

Some authors define the negative binomial as the number of failures before the $r$th success. Their formulas will look slightly different from ours.

Problem 78. 3 points A fair coin is flipped until heads appear 10 times, and $x$ is the number of times tails appear before the 10th appearance of heads. Show that the expected value $\mathrm{E}[x] = 10$.

Answer. Let $t$ be the number of the throw which gives the 10th head. $t$ is a negative binomial with $r = 10$ and $p = 1/2$, therefore $\mathrm{E}[t] = 20$. Since $x = t - 10$, it follows $\mathrm{E}[x] = 10$.

Problem 79.
(Banach's match-box problem) (Not eligible for in-class exams) There are two restaurants in town serving hamburgers. In the morning each of them obtains a shipment of $n$ raw hamburgers. Every time someone in that town wants to eat a hamburger, he or she selects one of the two restaurants at random. What is the probability that the $(n + k)$th customer will have to be turned away because the restaurant selected has run out of hamburgers?

Answer. For each restaurant it is the negative binomial probability distribution in disguise: if a restaurant runs out of hamburgers this is like having $n$ successes in $n + k$ tries.

But one can also reason it out: Assume one of the restaurants must turn customers away after the $(n + k)$th customer. Write down all the $n + k$ decisions made: write a 1 if the customer goes to the first restaurant, and a 2 if he goes to the second. I.e., write down $n + k$ ones and twos. Under what conditions will such a sequence result in the $(n + k)$th move eating the last hamburger of the first restaurant? Exactly if it has $n$ ones and $k$ twos, and the $(n + k)$th move is a one. As in the reasoning for the negative binomial probability distribution, there are $\binom{n+k-1}{n-1}$ possibilities, each of which has probability $2^{-n-k}$. Emptying the second restaurant has the same probability. Together the probability is therefore $\binom{n+k-1}{n-1} 2^{1-n-k}$.

4.2. The Hypergeometric Probability Distribution

Until now we had independent events, such as repeated throwing of coins or dice, sampling with replacement from finite populations, or sampling from infinite populations. If we sample without replacement from a finite population, the probability of the second element of the sample depends on what the first element was. Here the hypergeometric probability distribution applies.

Assume we have an urn with $w$ white and $n - w$ black balls in it, and we take a sample of $m$ balls. What is the probability that $y$ of them are white?
We are not interested in the order in which these balls are taken out; we may therefore assume that they are taken out simultaneously, therefore the set $U$ of outcomes is the set of subsets containing $m$ of the $n$ balls. The total number of such subsets is $\binom{n}{m}$. How many of them have $y$ white balls in them? Imagine you first pick $y$ white balls from the set of all white balls (there are $\binom{w}{y}$ possibilities to do that), and then you pick $m-y$ black balls from the set of all black balls, which can be done in $\binom{n-w}{m-y}$ different ways. Every union of such a set of white balls with a set of black balls gives a set of $m$ elements with exactly $y$ white balls, as desired. There are therefore $\binom{w}{y}\binom{n-w}{m-y}$ different such sets, and the probability of picking such a set is
$$\Pr[\text{Sample of } m \text{ elements has exactly } y \text{ white balls}] = \frac{\binom{w}{y}\binom{n-w}{m-y}}{\binom{n}{m}}. \tag{4.2.1}$$

Problem 80. You have an urn with $w$ white and $n-w$ black balls in it, and you take a sample of $m$ balls with replacement, i.e., after pulling each ball out you put it back in before you pull out the next ball. What is the probability that $y$ of these balls are white? I.e., we are asking here for the counterpart of formula (4.2.1) if sampling is done with replacement.

Answer.
$$\Bigl(\frac{w}{n}\Bigr)^{y}\Bigl(\frac{n-w}{n}\Bigr)^{m-y}\binom{m}{y} \tag{4.2.2}$$

Without proof we will state here that the expected value of $y$, the number of white balls in the sample, is $\operatorname{E}[y] = m\frac{w}{n}$, which is the same as if one would select the balls with replacement. Also without proof, the variance of $y$ is
$$\operatorname{var}[y] = m\,\frac{w}{n}\,\frac{(n-w)}{n}\,\frac{(n-m)}{(n-1)}. \tag{4.2.3}$$
This is smaller than the variance if one would choose with replacement, which is represented by the above formula without the last term $\frac{n-m}{n-1}$. This last term is called the finite population correction. More about all this is in [Lar82, p. 176–183].

4.3. The Poisson Distribution

The Poisson distribution counts the number of events in a given time interval.
This number has the Poisson distribution if each event is the cumulative result of a large number of independent possibilities, each of which has only a small chance of occurring (law of rare events). The expected number of occurrences is proportional to time with a proportionality factor $\lambda$, and in a short time span only zero or one event can occur, i.e., for infinitesimal time intervals it becomes a Bernoulli trial.

Approximate it by dividing the time from $0$ to $t$ into $n$ intervals of length $\frac{t}{n}$; then the occurrences are approximately $n$ independent Bernoulli trials with probability of success $\frac{\lambda t}{n}$. (This is an approximation since some of these intervals may have more than one occurrence; but if the intervals become very short the probability of having two occurrences in the same interval becomes negligible.)

In this discrete approximation, the probability to have $k$ successes in time $t$ is
$$\Pr[x=k] = \binom{n}{k}\Bigl(\frac{\lambda t}{n}\Bigr)^{k}\Bigl(1-\frac{\lambda t}{n}\Bigr)^{n-k} \tag{4.3.1}$$
$$= \frac{1}{k!}\,\frac{n(n-1)\cdots(n-k+1)}{n^k}\,(\lambda t)^k\Bigl(1-\frac{\lambda t}{n}\Bigr)^{n}\Bigl(1-\frac{\lambda t}{n}\Bigr)^{-k} \tag{4.3.2}$$
$$\to \frac{(\lambda t)^k}{k!}e^{-\lambda t} \quad\text{for } n\to\infty \text{ while } k \text{ remains constant.} \tag{4.3.3}$$
(4.3.3) is the limit because the second and the last term in (4.3.2) $\to 1$. The sum of all probabilities is 1 since $\sum_{k=0}^{\infty}\frac{(\lambda t)^k}{k!} = e^{\lambda t}$. The expected value is (note that we can have the sum start at $k = 1$):
$$\operatorname{E}[x] = e^{-\lambda t}\sum_{k=1}^{\infty}k\,\frac{(\lambda t)^k}{k!} = \lambda t\,e^{-\lambda t}\sum_{k=1}^{\infty}\frac{(\lambda t)^{k-1}}{(k-1)!} = \lambda t. \tag{4.3.4}$$
This is the same as the expected value of the discrete approximations.

Problem 81. $x$ follows a Poisson distribution, i.e.,
$$\Pr[x=k] = \frac{(\lambda t)^k}{k!}e^{-\lambda t} \quad\text{for } k = 0, 1, \ldots. \tag{4.3.5}$$

• a. 2 points Show that $\operatorname{E}[x] = \lambda t$.

Answer. See (4.3.4).

• b. 4 points Compute $\operatorname{E}[x(x-1)]$ and show that $\operatorname{var}[x] = \lambda t$.

Answer. For $\operatorname{E}[x(x-1)]$ we can have the sum start at $k = 2$:
$$\operatorname{E}[x(x-1)] = e^{-\lambda t}\sum_{k=2}^{\infty}k(k-1)\frac{(\lambda t)^k}{k!} = (\lambda t)^2e^{-\lambda t}\sum_{k=2}^{\infty}\frac{(\lambda t)^{k-2}}{(k-2)!} = (\lambda t)^2. \tag{4.3.6}$$
From this follows
$$\operatorname{var}[x] = \operatorname{E}[x^2] - (\operatorname{E}[x])^2 = \operatorname{E}[x(x-1)] + \operatorname{E}[x] - (\operatorname{E}[x])^2 = (\lambda t)^2 + \lambda t - (\lambda t)^2 = \lambda t. \tag{4.3.7}$$
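The limit (4.3.1)–(4.3.3) and the moments derived in Problem 81 can be checked numerically. The following is only a minimal sketch in Python: the value $\lambda t = 2$, the grid of $n$ values, and the function names `binomial_pmf` and `poisson_pmf` are illustrative assumptions, not part of the notes. It prints the largest gap between the discrete approximation and its Poisson limit for growing $n$, and then recovers $\operatorname{E}[x]$ and $\operatorname{var}[x]$ from truncated sums.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k successes in n Bernoulli trials, as in (4.3.1)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    """Limit (4.3.3): mean**k / k! * exp(-mean); `mean` stands for lambda*t."""
    return mean**k / math.factorial(k) * math.exp(-mean)


lam_t = 2.0  # assumed illustrative value of lambda*t

# Largest gap between the discrete approximation and its limit over k = 0..10.
gaps = []
for n in (10, 100, 10_000):
    gap = max(
        abs(binomial_pmf(k, n, lam_t / n) - poisson_pmf(k, lam_t))
        for k in range(11)
    )
    gaps.append(gap)
    print(f"n = {n:6d}: largest gap = {gap:.2e}")

# Moments from truncated sums; terms beyond k = 60 are negligible for lam_t = 2.
ks = range(61)
total = sum(poisson_pmf(k, lam_t) for k in ks)
mean_x = sum(k * poisson_pmf(k, lam_t) for k in ks)
factorial_moment = sum(k * (k - 1) * poisson_pmf(k, lam_t) for k in ks)
var_x = factorial_moment + mean_x - mean_x**2  # same detour as in (4.3.7)

print(f"sum of probabilities = {total:.6f}")
print(f"E[x] = {mean_x:.6f}, var[x] = {var_x:.6f}")
```

The gaps should shrink as $n$ grows, and both moments should come out close to the assumed $\lambda t$, in line with (4.3.4) and (4.3.7).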
The Poisson distribution can be used as an approximation to the Binomial distribution when $n$ is large, $p$ small, and $np$ moderate.

Problem 82. Which value of $\lambda$ would one need to approximate a given Binomial with $n$ and $p$?

Answer. That which gives the right expected value, i.e., $\lambda = np$.

Problem 83. Two researchers counted cars coming down a road, which obey a Poisson distribution with unknown parameter $\lambda$. In other words, in an interval of length $t$ one will have $k$ cars with probability
$$\frac{(\lambda t)^k}{k!}e^{-\lambda t}. \tag{4.3.8}$$
Their assignment was to count how many cars came in the first half hour, and how many cars came in the second half hour. However they forgot to keep track of the time when the first half hour was over, and therefore wound up only with one count, namely, they knew that 213 cars had come down the road during this hour. They were afraid they would get fired if they came back with one number only, so they applied the following remedy: they threw a coin 213 times and counted the number of heads. This number, they pretended, was the number of cars in the first half hour.

• a. 6 points Did the probability distribution of the number gained in this way differ from the distribution of actually counting the number of cars in the first half hour?

Answer. First a few definitions: $x$ is the total number of occurrences in the interval $[0, 1]$. $y$ is the number of occurrences in the interval $[0, t]$ (for a fixed $t$; in the problem it was $t = \frac{1}{2}$, but we will do it for general $t$, which will make the notation clearer and more compact). Then we want to compute $\Pr[y=m \mid x=n]$. By definition of conditional probability:
$$\Pr[y=m \mid x=n] = \frac{\Pr[y=m \text{ and } x=n]}{\Pr[x=n]}. \tag{4.3.9}$$
How can we compute the probability of the intersection $\Pr[y=m \text{ and } x=n]$? Use a trick: express this intersection as the intersection of independent events. For this define $z$ as the number of events in the interval $(t, 1]$.
Then $\{y=m \text{ and } x=n\} = \{y=m \text{ and } z=n-m\}$; therefore $\Pr[y=m \text{ and } x=n] = \Pr[y=m]\Pr[z=n-m]$; use this to get
$$\Pr[y=m \mid x=n] = \frac{\Pr[y=m]\Pr[z=n-m]}{\Pr[x=n]} = \frac{\frac{\lambda^m t^m}{m!}e^{-\lambda t}\,\frac{\lambda^{n-m}(1-t)^{n-m}}{(n-m)!}e^{-\lambda(1-t)}}{\frac{\lambda^n}{n!}e^{-\lambda}} = \binom{n}{m}t^m(1-t)^{n-m}. \tag{4.3.10}$$
Here we use the fact that $\Pr[x=k] = \frac{\lambda^k}{k!}e^{-\lambda}$, $\Pr[y=k] = \frac{(\lambda t)^k}{k!}e^{-\lambda t}$, $\Pr[z=k] = \frac{(\lambda(1-t))^k}{k!}e^{-\lambda(1-t)}$.

One sees that a. $\Pr[y=m \mid x=n]$ does not depend on $\lambda$, and b. it is exactly the probability of having $m$ successes and $n-m$ failures in $n$ independent Bernoulli trials with success probability $t$. Therefore the procedure with the coins gave the two researchers a result which had the same probability distribution as if they had counted the number of cars in each half hour separately.

• b. 2 points Explain what it means that the probability distribution of the number for the first half hour gained by throwing the coins does not differ from the one gained by actually counting the cars. Which condition is absolutely necessary for this to hold?

Answer. The supervisor would never be able to find out through statistical analysis of the data they delivered, even if they did it repeatedly. All estimation results based on the faked statistic would be as accurate regarding $\lambda$ as the true statistics. All this is only true under the assumption that the cars really obey a Poisson distribution and that the coin is fair.

The fact that the Poisson as well as the binomial distributions are memoryless has nothing to do with them having a sufficient statistic.

Problem 84. 8 points $x$ is the number of customers arriving at a service counter in one hour. $x$ follows a Poisson distribution with parameter $\lambda = 2$, i.e.,
$$\Pr[x=j] = \frac{2^j}{j!}e^{-2}. \tag{4.3.11}$$

• a. Compute the probability that only one customer shows up at the service counter during the hour, the probability that two show up, and the probability that no one shows up.

• b. Despite the small number of customers, two employees are assigned to the service counter.
They are hiding in the back, and whenever a customer steps up to the counter and rings the bell, they toss a coin. If the coin shows head, Herbert serves the customer, and if it shows tails, Karl does. Compute the probability that Herbert has to serve exactly one customer during the hour. Hint:
$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots. \tag{4.3.12}$$

• c. For any integer $k \ge 0$, compute the probability that Herbert has to serve exactly $k$ customers during the hour.

Problem 85. 3 points Compute the moment generating function of a Poisson variable observed over a unit time interval, i.e., $x$ satisfies $\Pr[x=k] = \frac{\lambda^k}{k!}e^{-\lambda}$ and you want $\operatorname{E}[e^{tx}]$ for all $t$.

Answer. $\operatorname{E}[e^{tx}] = \sum_{k=0}^{\infty}e^{tk}\frac{\lambda^k}{k!}e^{-\lambda} = \sum_{k=0}^{\infty}\frac{(\lambda e^t)^k}{k!}e^{-\lambda} = e^{\lambda e^t}e^{-\lambda} = e^{\lambda(e^t-1)}$.

4.4. The Exponential Distribution

Now we will discuss random variables which are related to the Poisson distribution. At time $t = 0$ you start observing a Poisson process, and the random variable $\mathbf{t}$ (written in bold to distinguish it from a particular value $t$) denotes the time you have to wait until the first occurrence. $\mathbf{t}$ can have any nonnegative real number as value. One can derive its cumulative distribution as follows. $\mathbf{t} > t$ if and only if there are no occurrences in the interval $[0, t]$. Therefore $\Pr[\mathbf{t}>t] = \frac{(\lambda t)^0}{0!}e^{-\lambda t} = e^{-\lambda t}$, and hence the cumulative distribution function $F_{\mathbf{t}}(t) = \Pr[\mathbf{t}\le t] = 1 - e^{-\lambda t}$ when $t \ge 0$, and $F_{\mathbf{t}}(t) = 0$ for $t < 0$. The density function is therefore $f_{\mathbf{t}}(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, and 0 otherwise. This is called the exponential density function (its discrete analog is the geometric random variable). It can also be called a Gamma variable with parameters $r = 1$ and $\lambda$.

Problem 86. 2 points An exponential random variable $\mathbf{t}$ with parameter $\lambda > 0$ has the density $f_{\mathbf{t}}(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, and 0 for $t < 0$. Use this density to compute the expected value of $\mathbf{t}$.

Answer. $\operatorname{E}[\mathbf{t}] = \int_0^\infty \lambda t e^{-\lambda t}\,dt = \int_0^\infty uv'\,dt = uv\big|_0^\infty - \int_0^\infty u'v\,dt$, where $u = t$, $v' = \lambda e^{-\lambda t}$, $u' = 1$, $v = -e^{-\lambda t}$. One can also use the more abbreviated notation $= \int_0^\infty u\,dv = uv\big|_0^\infty - \int_0^\infty v\,du$, where $u = t$, $dv = \lambda e^{-\lambda t}\,dt$, $du = dt$, $v = -e^{-\lambda t}$. Either way one obtains $\operatorname{E}[\mathbf{t}] = -te^{-\lambda t}\big|_0^\infty + \int_0^\infty e^{-\lambda t}\,dt = 0 - \frac{1}{\lambda}e^{-\lambda t}\big|_0^\infty = \frac{1}{\lambda}$.

Problem 87. 4 points An exponential random variable $\mathbf{t}$ with parameter $\lambda > 0$ has the density $f_{\mathbf{t}}(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, and 0 for $t < 0$. Use this density to compute the expected value of $\mathbf{t}^2$.

Answer. One can use that $\Gamma(r) = \int_0^\infty \lambda^r t^{r-1}e^{-\lambda t}\,dt$ for $r = 3$ to get: $\operatorname{E}[\mathbf{t}^2] = (1/\lambda^2)\Gamma(3) = 2/\lambda^2$. Or all from scratch: $\operatorname{E}[\mathbf{t}^2] = \int_0^\infty \lambda t^2e^{-\lambda t}\,dt = \int_0^\infty uv'\,dt = uv\big|_0^\infty - \int_0^\infty u'v\,dt$, where $u = t^2$, $v' = \lambda e^{-\lambda t}$, $u' = 2t$, $v = -e^{-\lambda t}$. Therefore $\operatorname{E}[\mathbf{t}^2] = -t^2e^{-\lambda t}\big|_0^\infty + \int_0^\infty 2te^{-\lambda t}\,dt$. The first term vanishes, for the second do it again: $\int_0^\infty 2te^{-\lambda t}\,dt = 2\int_0^\infty uv'\,dt = 2uv\big|_0^\infty - 2\int_0^\infty u'v\,dt$, where $u = t$, $v' = e^{-\lambda t}$, $u' = 1$, $v = -(1/\lambda)e^{-\lambda t}$. Therefore the second term becomes $-2(t/\lambda)e^{-\lambda t}\big|_0^\infty + 2\int_0^\infty (1/\lambda)e^{-\lambda t}\,dt = 2/\lambda^2$.

Problem 88. 2 points Does the exponential random variable with parameter $\lambda > 0$, whose cumulative distribution function is $F_{\mathbf{t}}(t) = 1 - e^{-\lambda t}$ for $t \ge 0$, and 0 otherwise, have a memory-less property? Compare Problem 76. Formulate this memory-less property and then verify whether it holds or not.

Answer. Here is the formulation: for $s < t$ follows $\Pr[\mathbf{t}>t \mid \mathbf{t}>s] = \Pr[\mathbf{t}>t-s]$. This does indeed hold. Proof: lhs $= \frac{\Pr[\mathbf{t}>t \text{ and } \mathbf{t}>s]}{\Pr[\mathbf{t}>s]} = \frac{\Pr[\mathbf{t}>t]}{\Pr[\mathbf{t}>s]} = \frac{e^{-\lambda t}}{e^{-\lambda s}} = e^{-\lambda(t-s)}$.

Problem 89. The random variable $\mathbf{t}$ denotes the duration of an unemployment spell. It has the exponential distribution, which can be defined by: $\Pr[\mathbf{t}>t] = e^{-\lambda t}$ for $t \ge 0$ ($\mathbf{t}$ cannot assume negative values).

• a. 1 point Use this formula to compute the cumulative distribution function $F_{\mathbf{t}}(t)$ and the density function $f_{\mathbf{t}}(t)$.

Answer. $F_{\mathbf{t}}(t) = \Pr[\mathbf{t}\le t] = 1 - \Pr[\mathbf{t}>t] = 1 - e^{-\lambda t}$ for $t \ge 0$, zero otherwise. Taking the derivative gives $f_{\mathbf{t}}(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, zero otherwise.

• b. 2 points What is the probability that an unemployment spell ends after time $t + h$, given that it has not yet ended at time $t$?
Show that this is the same as the unconditional probability that an unemployment spell ends after time $h$ (memory-less property).

Answer.
$$\Pr[\mathbf{t}>t+h \mid \mathbf{t}>t] = \frac{\Pr[\mathbf{t}>t+h]}{\Pr[\mathbf{t}>t]} = \frac{e^{-\lambda(t+h)}}{e^{-\lambda t}} = e^{-\lambda h}. \tag{4.4.1}$$

• c. 3 points Let $h$ be a small number. What is the probability that an unemployment spell ends at or before $t + h$, given that it has not yet ended at time $t$? Hint: for small $h$, one can write approximately
$$\Pr[t < \mathbf{t} \le t+h] = h f_{\mathbf{t}}(t). \tag{4.4.2}$$

Answer.
$$\Pr[\mathbf{t}\le t+h \mid \mathbf{t}>t] = \frac{\Pr[\mathbf{t}\le t+h \text{ and } \mathbf{t}>t]}{\Pr[\mathbf{t}>t]} = \frac{h f_{\mathbf{t}}(t)}{1-F_{\mathbf{t}}(t)} = \frac{h\lambda e^{-\lambda t}}{e^{-\lambda t}} = h\lambda. \tag{4.4.3}$$

4.5. The Gamma Distribution

The time until the second occurrence of a Poisson event is a random variable which we will call $\mathbf{t}^{(2)}$. Its cumulative distribution function is $F_{\mathbf{t}^{(2)}}(t) = \Pr[\mathbf{t}^{(2)}\le t] = 1 - \Pr[\mathbf{t}^{(2)}>t]$. But $\mathbf{t}^{(2)} > t$ means: there are either zero or one occurrences in the time between 0 and $t$; therefore $\Pr[\mathbf{t}^{(2)}>t] = \Pr[x=0] + \Pr[x=1] = e^{-\lambda t} + \lambda te^{-\lambda t}$. Putting it all together gives $F_{\mathbf{t}^{(2)}}(t) = 1 - e^{-\lambda t} - \lambda te^{-\lambda t}$. In order to differentiate the cumulative distribution function we need the product rule of differentiation: $(uv)' = u'v + uv'$. This gives
$$f_{\mathbf{t}^{(2)}}(t) = \lambda e^{-\lambda t} - \lambda e^{-\lambda t} + \lambda^2te^{-\lambda t} = \lambda^2te^{-\lambda t}. \tag{4.5.1}$$

Problem 90. 3 points Compute the density function of $\mathbf{t}^{(3)}$, the time of the third occurrence of a Poisson variable.

Answer.
$$\Pr[\mathbf{t}^{(3)}>t] = \Pr[x=0] + \Pr[x=1] + \Pr[x=2] \tag{4.5.2}$$
$$F_{\mathbf{t}^{(3)}}(t) = \Pr[\mathbf{t}^{(3)}\le t] = 1 - \Bigl(1 + \lambda t + \frac{\lambda^2}{2}t^2\Bigr)e^{-\lambda t} \tag{4.5.3}$$
$$f_{\mathbf{t}^{(3)}}(t) = \frac{\partial}{\partial t}F_{\mathbf{t}^{(3)}}(t) = -\Bigl(-\lambda\Bigl(1 + \lambda t + \frac{\lambda^2}{2}t^2\Bigr) + (\lambda + \lambda^2t)\Bigr)e^{-\lambda t} = \frac{\lambda^3}{2}t^2e^{-\lambda t}. \tag{4.5.4}$$

If one asks for the $r$th occurrence, again all but the last term cancel in the differentiation, and one gets
$$f_{\mathbf{t}^{(r)}}(t) = \frac{\lambda^r}{(r-1)!}t^{r-1}e^{-\lambda t}. \tag{4.5.5}$$
This density is called the Gamma density with parameters $\lambda$ and $r$.

The following definite integral, which is defined for all $r > 0$ and all $\lambda > 0$, is called the Gamma function:
$$\Gamma(r) = \int_0^\infty \lambda^rt^{r-1}e^{-\lambda t}\,dt. \tag{4.5.6}$$
Although this integral cannot be expressed in a closed form, it is an important function in mathematics. It is a well behaved function interpolating the factorials in the sense that $\Gamma(r) = (r-1)!$ for natural numbers $r$.

Problem 91. Show that $\Gamma(r)$ as defined in (4.5.6) is independent of $\lambda$, i.e., instead of (4.5.6) one can also use the simpler equation
$$\Gamma(r) = \int_0^\infty t^{r-1}e^{-t}\,dt. \tag{4.5.7}$$

Problem 92. 3 points Show by partial integration that the Gamma function satisfies $\Gamma(r+1) = r\Gamma(r)$.

Answer. Start with
$$\Gamma(r+1) = \int_0^\infty \lambda^{r+1}t^re^{-\lambda t}\,dt \tag{4.5.8}$$
and integrate by parts: $\int u'v\,dt = uv - \int uv'\,dt$ with $u' = \lambda e^{-\lambda t}$ and $v = \lambda^rt^r$, therefore $u = -e^{-\lambda t}$ and $v' = r\lambda^rt^{r-1}$:
$$\Gamma(r+1) = -\lambda^rt^re^{-\lambda t}\Big|_0^\infty + \int_0^\infty r\lambda^rt^{r-1}e^{-\lambda t}\,dt = 0 + r\Gamma(r). \tag{4.5.9}$$

Problem 93. Show that $\Gamma(r) = (r-1)!$ for all natural numbers $r = 1, 2, \ldots$.

Answer. Proof by induction. First verify that it holds for $r = 1$, i.e., that $\Gamma(1) = 1$:
$$\Gamma(1) = \int_0^\infty \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\Big|_0^\infty = 1 \tag{4.5.10}$$
and then, assuming that $\Gamma(r) = (r-1)!$, Problem 92 says that $\Gamma(r+1) = r\Gamma(r) = r(r-1)! = r!$.

Without proof: $\Gamma(\frac{1}{2}) = \sqrt{\pi}$. This will be shown in Problem 141.

Therefore the following defines a density function, called the Gamma density with parameter $r$ and $\lambda$, for all $r > 0$ and $\lambda > 0$:
$$f(x) = \frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x} \quad\text{for } x \ge 0,\ 0 \text{ otherwise.} \tag{4.5.11}$$
The only application we have for it right now is: this is the distribution of the time one has to wait until the $r$th occurrence of a Poisson distribution with intensity $\lambda$. Later we will have other applications in which $r$ is not an integer.

Problem 94. 4 points Compute the moment generating function of the Gamma distribution.

Answer. For $t < \lambda$,
$$m_x(t) = \operatorname{E}[e^{tx}] = \int_0^\infty e^{tx}\frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x}\,dx \tag{4.5.12}$$
$$= \frac{\lambda^r}{(\lambda-t)^r}\int_0^\infty \frac{(\lambda-t)^rx^{r-1}}{\Gamma(r)}e^{-(\lambda-t)x}\,dx \tag{4.5.13}$$
$$= \Bigl(\frac{\lambda}{\lambda-t}\Bigr)^r \tag{4.5.14}$$
since the integrand in (4.5.13) is the density function of a Gamma distribution with parameters $r$ and $\lambda - t$.

Problem 95.
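2 points The density and moment generating functions of a Gamma variable $x$ with parameters $r > 0$ and $\lambda > 0$ are
$$f_x(x) = \frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x} \quad\text{for } x \ge 0,\ 0 \text{ otherwise.} \tag{4.5.15}$$
$$m_x(t) = \Bigl(\frac{\lambda}{\lambda-t}\Bigr)^r. \tag{4.5.16}$$
Show the following: If $x$ has a Gamma distribution with parameters $r$ and 1, then $v = x/\lambda$ has a Gamma distribution with parameters $r$ and $\lambda$. You can prove this either using the transformation theorem for densities, or the moment-generating function.

Answer. Solution using density function: The random variable whose density we know is $x$; its density is $\frac{1}{\Gamma(r)}x^{r-1}e^{-x}$. If $x = \lambda v$, then $\frac{dx}{dv} = \lambda$, and the absolute value is also $\lambda$. Therefore the density of $v$ is
$$\frac{\lambda^r}{\Gamma(r)}v^{r-1}e^{-\lambda v}.$$
Solution using the mgf:
$$m_x(t) = \operatorname{E}[e^{tx}] = \Bigl(\frac{1}{1-t}\Bigr)^r \tag{4.5.17}$$
$$m_v(t) = \operatorname{E}[e^{tv}] = \operatorname{E}[e^{(t/\lambda)x}] = \Bigl(\frac{1}{1-(t/\lambda)}\Bigr)^r = \Bigl(\frac{\lambda}{\lambda-t}\Bigr)^r \tag{4.5.18}$$
but this last expression can be recognized to be the mgf of a Gamma with $r$ and $\lambda$.

Problem 96. 2 points If $x$ has a Gamma distribution with parameters $r$ and $\lambda$, and $y$ one with parameters $p$ and $\lambda$, and both are independent, show that $x + y$ has a Gamma distribution with parameters $r + p$ and $\lambda$ (reproductive property of the Gamma distribution). You may use equation (4.5.14) without proof.

Answer.
$$\Bigl(\frac{\lambda}{\lambda-t}\Bigr)^r\Bigl(\frac{\lambda}{\lambda-t}\Bigr)^p = \Bigl(\frac{\lambda}{\lambda-t}\Bigr)^{r+p}. \tag{4.5.19}$$

Problem 97. Show that a Gamma variable $x$ with parameters $r$ and $\lambda$ has expected value $\operatorname{E}[x] = r/\lambda$ and variance $\operatorname{var}[x] = r/\lambda^2$.

Answer. Proof with moment generating function:
$$\frac{d}{dt}\Bigl(\frac{\lambda}{\lambda-t}\Bigr)^r = \frac{r}{\lambda}\Bigl(\frac{\lambda}{\lambda-t}\Bigr)^{r+1}, \tag{4.5.20}$$
therefore $\operatorname{E}[x] = \frac{r}{\lambda}$, and by differentiating twice (apply the same formula again), $\operatorname{E}[x^2] = \frac{r(r+1)}{\lambda^2}$, therefore $\operatorname{var}[x] = \frac{r}{\lambda^2}$.

Proof using density function: For the expected value one gets $\operatorname{E}[t] = \int_0^\infty t\cdot\frac{\lambda^r}{\Gamma(r)}t^{r-1}e^{-\lambda t}\,dt = \frac{r}{\lambda}\cdot\frac{1}{\Gamma(r+1)}\int_0^\infty t^r\lambda^{r+1}e^{-\lambda t}\,dt = \frac{r}{\lambda}\cdot\frac{\Gamma(r+1)}{\Gamma(r+1)} = \frac{r}{\lambda}$. Using the same tricks $\operatorname{E}[t^2] = \int_0^\infty t^2\cdot\frac{\lambda^r}{\Gamma(r)}t^{r-1}e^{-\lambda t}\,dt = \frac{r(r+1)}{\lambda^2}\int_0^\infty \frac{\lambda^{r+2}}{\Gamma(r+2)}t^{r+1}e^{-\lambda t}\,dt = \frac{r(r+1)}{\lambda^2}$. Therefore $\operatorname{var}[t] = \operatorname{E}[t^2] - (\operatorname{E}[t])^2 = r/\lambda^2$.

4.6. The Uniform Distribution

Problem 98.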
Let $x$ be uniformly distributed in the interval $[a, b]$, i.e., the density function of $x$ is a constant for $a \le x \le b$, and zero otherwise.

• a. 1 point What is the value of this constant?

Answer. It is $\frac{1}{b-a}$.

• b. 2 points Compute $\operatorname{E}[x]$.

Answer. $\operatorname{E}[x] = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\,\frac{b^2-a^2}{2} = \frac{a+b}{2}$ since $b^2 - a^2 = (b+a)(b-a)$.

• c. 2 points Show that $\operatorname{E}[x^2] = \frac{a^2+ab+b^2}{3}$.

Answer. $\operatorname{E}[x^2] = \int_a^b \frac{x^2}{b-a}\,dx = \frac{1}{b-a}\,\frac{b^3-a^3}{3}$. Now use the identity $b^3 - a^3 = (b-a)(b^2+ab+a^2)$ (check it by multiplying out).

• d. 2 points Show that $\operatorname{var}[x] = \frac{(b-a)^2}{12}$.

Answer. $\operatorname{var}[x] = \operatorname{E}[x^2] - (\operatorname{E}[x])^2 = \frac{a^2+ab+b^2}{3} - \frac{(a+b)^2}{4} = \frac{4a^2+4ab+4b^2}{12} - \frac{3a^2+6ab+3b^2}{12} = \frac{(b-a)^2}{12}$.

4.7. The Beta Distribution

Assume you have two independent variables, both distributed uniformly over the interval $[0, 1]$, and you want to know the distribution of their maximum. Or of their minimum. Or you have three and you want the distribution of the one in the middle. Then the densities have their maximum to the right, or to the left, or in the middle. The distribution of the $r$th highest out of $n$ independent uniform variables is an example of the Beta density function. This can also be done, and is probability-theoretically meaningful, for arbitrary real $r$ and $n$.

Problem 99. $x$ and $y$ are two independent random variables distributed uniformly over the interval $[0, 1]$. Let $u$ be their minimum $u = \min(x, y)$ (i.e., $u$ takes the value of $x$ when $x$ is smaller, and the value of $y$ when $y$ is smaller), and $v = \max(x, y)$.

• a. 2 points Given two numbers $q$ and $r$ between 0 and 1. Draw the events $u \le q$ and $v \le r$ into the unit square and compute their probabilities.

• b. 2 points Compute the density functions $f_u(u)$ and $f_v(v)$.

• c. 2 points Compute the expected values of $u$ and $v$.

Answer. For $u$: $\Pr[u \le q] = 1 - \Pr[u > q] = 1 - (1-q)^2 = 2q - q^2$. Therefore $f_u(u) = 2 - 2u$ for $0 \le u \le 1$ and 0 elsewhere.
$$\operatorname{E}[u] = \int_0^1 (2-2u)u\,du = \Bigl(u^2 - \frac{2u^3}{3}\Bigr)\Big|_0^1 = \frac{1}{3}. \tag{4.7.1}$$
For $v$ it is: $\Pr[v \le r] = r^2$; this is at the same time the cumulative distribution function.
Therefore the density function is $f_v(v) = 2v$ for $0 \le v \le 1$ and 0 elsewhere.
$$\operatorname{E}[v] = \int_0^1 v\,2v\,dv = \frac{2v^3}{3}\Big|_0^1 = \frac{2}{3}. \tag{4.7.2}$$

4.8. The Normal Distribution

By definition, $y$ is normally distributed with mean $\mu$ and variance $\sigma^2$, in symbols, $y \sim N(\mu, \sigma^2)$, if it has the density function
$$f_y(y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y-\mu)^2}{2\sigma^2}}. \tag{4.8.1}$$
It will be shown a little later that this is indeed a density function. This distribution has the highest entropy among all distributions with a given mean and variance [Kap89, p. 47].

If $y \sim N(\mu, \sigma^2)$, then $z = (y-\mu)/\sigma \sim N(0, 1)$, which is called the standard Normal distribution.

Problem 100. 2 points Compare [Gre97, p. 68]: Assume $x \sim N(3, 4)$ (mean is 3 and variance 4). Determine with the help of a table of the Standard Normal Distribution function $\Pr[2 < x \le 5]$.

Answer. $\Pr[2 < x \le 5] = \Pr[2-3 < x-3 \le 5-3] = \Pr[\frac{2-3}{2} < \frac{x-3}{2} \le \frac{5-3}{2}] = \Pr[-\frac{1}{2} < \frac{x-3}{2} \le 1] = \Phi(1) - \Phi(-\frac{1}{2}) = \Phi(1) - (1 - \Phi(\frac{1}{2})) = \Phi(1) + \Phi(\frac{1}{2}) - 1 = 0.8413 + 0.6915 - 1 = 0.5328$. Some tables (Greene) give the area between 0 and all positive values; in this case it is $0.3413 + 0.1915$.

The moment generating function of a standard normal $z \sim N(0, 1)$ is the following integral:
$$m_z(t) = \operatorname{E}[e^{tz}] = \int_{-\infty}^{+\infty}e^{tz}\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\,dz. \tag{4.8.2}$$
To solve this integral, complete the square in the exponent:
$$tz - \frac{z^2}{2} = \frac{t^2}{2} - \frac{1}{2}(z-t)^2; \tag{4.8.3}$$
Note that the first summand, $\frac{t^2}{2}$, no longer depends on $z$; therefore the factor $e^{\frac{t^2}{2}}$ can be written in front of the integral:
$$m_z(t) = e^{\frac{t^2}{2}}\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(z-t)^2}\,dz = e^{\frac{t^2}{2}}, \tag{4.8.4}$$
because now the integrand is simply the density function of a $N(t, 1)$.

A general univariate normal $x \sim N(\mu, \sigma^2)$ can be written as $x = \mu + \sigma z$ with $z \sim N(0, 1)$, therefore
$$m_x(t) = \operatorname{E}[e^{(\mu+\sigma z)t}] = e^{\mu t}\operatorname{E}[e^{\sigma zt}] = e^{\mu t+\sigma^2t^2/2}. \tag{4.8.5}$$

Problem 101. Given two independent normal variables $x \sim N(\mu_x, \sigma_x^2)$ and $y \sim N(\mu_y, \sigma_y^2)$.
Using the moment generating function, show that
$$\alpha x + \beta y \sim N(\alpha\mu_x + \beta\mu_y,\ \alpha^2\sigma_x^2 + \beta^2\sigma_y^2). \tag{4.8.6}$$

Answer. Because of independence, the moment generating function of $\alpha x + \beta y$ is the product of the m.g.f. of $\alpha x$ and the one of $\beta y$:
$$m_{\alpha x+\beta y}(t) = e^{\mu_x\alpha t+\sigma_x^2\alpha^2t^2/2}\,e^{\mu_y\beta t+\sigma_y^2\beta^2t^2/2} = e^{(\mu_x\alpha+\mu_y\beta)t+(\sigma_x^2\alpha^2+\sigma_y^2\beta^2)t^2/2}, \tag{4.8.7}$$
which is the moment generating function of a $N(\alpha\mu_x + \beta\mu_y,\ \alpha^2\sigma_x^2 + \beta^2\sigma_y^2)$.

We will say more about the univariate normal later when we discuss the multivariate normal distribution.

Sometimes it is also necessary to use the truncated normal distributions. If $\mathbf{z}$ is standard normal (the bold $\mathbf{z}$ is the random variable, the plain $z$ the truncation point), then
$$\operatorname{E}[\mathbf{z} \mid \mathbf{z}>z] = \frac{f_{\mathbf{z}}(z)}{1-F_{\mathbf{z}}(z)}, \qquad \operatorname{var}[\mathbf{z} \mid \mathbf{z}>z] = 1 - \mu(\mu-z), \quad\text{where } \mu = \operatorname{E}[\mathbf{z} \mid \mathbf{z}>z]. \tag{4.8.8}$$
This expected value is therefore the ordinate of the density function at point $z$ divided by the tail area of the tail over which $\mathbf{z}$ is known to vary. (This rule is only valid for the normal density function, not in general!) These kinds of results can be found in [JK70, pp. 81–83] or in the original paper [Coh50].

Problem 102. Every customer entering a car dealership in a certain location can be thought of as having a reservation price $y$ in his or her mind: if the car will be offered at or below this reservation price, then he or she will buy the car, otherwise there will be no sale. (Assume for the sake of the argument all cars are equal.) Assume this reservation price is Normally distributed with mean \$6000 and standard deviation \$1000 (if you randomly pick a customer and ask his or her reservation price). If a sale is made, a person’s consumer surplus is the difference between the reservation price and the price actually paid, otherwise it is zero. For this question you will need the table for the standard normal cumulative distribution function.

• a. 2 points A customer is offered a car at a price of \$5800. The probability that he or she will take the car is ______.

Answer. We need $\Pr[y \ge 5800]$.
If $y = 5800$ then $z = \frac{y-6000}{1000} = -0.2$; $\Pr[z \ge -0.2] = 1 - \Pr[z \le -0.2] = 1 - 0.4207 = 0.5793$.

• b. 3 points Since it is the 63rd birthday of the owner of the dealership, all cars in the dealership are sold for the price of \$6300. You pick at random one of the people coming out of the dealership. The probability that this person bought a car and his or her consumer surplus was more than \$500 is ______.

Answer. This is the unconditional probability that the reservation price was higher than \$6300 + \$500 = \$6800, i.e., $\Pr[y \ge 6800]$. Define $z = (y - 6000)/1000$. It is a standard normal, and $y \le 6800 \iff z \le .8$. Therefore $p = 1 - \Pr[z \le .8] = .2119$.

• c. 4 points Here is an alternative scenario: Since it is the 63rd birthday of the owner of the dealership, all cars in the dealership are sold for the “birthday special” price of \$6300. You pick at random one of the people who bought one of these “birthday specials” priced \$6300. The probability that this person’s consumer surplus was more than \$500 is ______.

The important part of this question is: it depends on the outcome of the experiment whether or not someone is included in the sample (sample selection bias).

Answer. Here we need the conditional probability:
$$p = \Pr[y>6800 \mid y>6300] = \frac{\Pr[y>6800]}{\Pr[y>6300]} = \frac{1-\Pr[y\le 6800]}{1-\Pr[y\le 6300]}. \tag{4.8.9}$$
Again use the standard normal $z = (y - 6000)/1000$. As before, $y \le 6800 \iff z \le .8$, and $y \le 6300 \iff z \le .3$. Therefore
$$p = \frac{1-\Pr[z\le .8]}{1-\Pr[z\le .3]} = \frac{.2119}{.3821} = .5546. \tag{4.8.10}$$
It depends on the layout of the normal distribution table how this should be looked up.

• d. 5 points We are still picking out customers that have bought the birthday specials. Compute the median value $m$ of such a customer’s consumer surplus. It is defined by
$$\Pr[y>6300+m \mid y>6300] = \Pr[y\le 6300+m \mid y>6300] = 1/2. \tag{4.8.11}$$

Answer. Obviously, $m \ge 0$.
Therefore
$$\Pr[y>6300+m \mid y>6300] = \frac{\Pr[y>6300+m]}{\Pr[y>6300]} = \frac{1}{2}, \tag{4.8.12}$$
or $\Pr[y>6300+m] = (1/2)\Pr[y>6300] = (1/2)(.3821) = .1910$. I.e., $\Pr[\frac{y-6000}{1000} > \frac{6300-6000+m}{1000} = \frac{300}{1000} + \frac{m}{1000}] = .1910$. For this we find in the table $\frac{300}{1000} + \frac{m}{1000} = 0.875$, therefore $300 + m = 875$, or $m = \$575$.

• e. 3 points Is the expected value of the consumer surplus of all customers that have bought a birthday special larger or smaller than the median? Fill in your answer here: ______. Proof is not required, as long as the answer is correct.

Answer. The mean is larger because it is more heavily influenced by outliers.
$$\operatorname{E}[y-6300 \mid y\ge 6300] = \operatorname{E}[6000+1000z-6300 \mid 6000+1000z\ge 6300] \tag{4.8.13}$$
$$= \operatorname{E}[1000z-300 \mid 1000z\ge 300] \tag{4.8.14}$$
$$= \operatorname{E}[1000z \mid z\ge 0.3] - 300 \tag{4.8.15}$$
$$= 1000\operatorname{E}[z \mid z\ge 0.3] - 300 \tag{4.8.16}$$
$$= 1000\,\frac{f(0.3)}{1-\Phi(0.3)} - 300 = 698 > 575. \tag{4.8.17}$$

4.9. The Chi-Square Distribution

A $\chi^2$ with one degree of freedom is defined to be the distribution of the square $q = z^2$ of a univariate standard normal variable.

Call the cumulative distribution function of a standard normal $F_z(z)$. Then the cumulative distribution function of the $\chi^2$ variable $q = z^2$ is, according to Problem 47, $F_q(q) = 2F_z(\sqrt{q}) - 1$. To get the density of $q$ take the derivative of $F_q(q)$ with respect to $q$. For this we need the chain rule, first taking the derivative with respect to $z = \sqrt{q}$ and multiplying by $\frac{dz}{dq}$:
$$f_q(q) = \frac{d}{dq}\bigl(2F_z(\sqrt{q}) - 1\bigr) = \frac{d}{dq}\bigl(2F_z(z) - 1\bigr) \tag{4.9.1}$$
$$= 2\frac{dF_z}{dz}(z)\frac{dz}{dq} = \frac{2}{\sqrt{2\pi}}e^{-z^2/2}\frac{1}{2\sqrt{q}} \tag{4.9.2}$$
$$= \frac{1}{\sqrt{2\pi q}}e^{-q/2}. \tag{4.9.3}$$
Now remember the Gamma function. Since $\Gamma(1/2) = \sqrt{\pi}$ (Proof in Problem 141), one can rewrite (4.9.3) as
$$f_q(q) = \frac{(1/2)^{1/2}q^{-1/2}e^{-q/2}}{\Gamma(1/2)}, \tag{4.9.4}$$
i.e., it is a Gamma density with parameters $r = 1/2$, $\lambda = 1/2$.

A $\chi^2$ with $p$ degrees of freedom is defined as the sum of $p$ independent univariate $\chi^2$ variables. By the reproductive property of the Gamma distribution (Problem 96) this gives a Gamma variable with parameters $r = p/2$ and $\lambda = 1/2$.
$$\text{If } q \sim \chi^2_p \text{ then } \operatorname{E}[q] = p \text{ and } \operatorname{var}[q] = 2p. \tag{4.9.5}$$
We will say that a random variable $q$ is distributed as a $\sigma^2\chi^2_p$ iff $q/\sigma^2$ is a $\chi^2_p$. This is the distribution of a sum of the squares of $p$ independent $N(0, \sigma^2)$ variables.

4.10. The Lognormal Distribution

This is a random variable whose log has a normal distribution. See [Gre97, p. 71]. Parametrized by the $\mu$ and $\sigma^2$ of its log. Density is
$$\frac{1}{x\sqrt{2\pi\sigma^2}}e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}. \tag{4.10.1}$$
[Cow77, pp. 82–87] has an excellent discussion of the properties of the lognormal for income distributions.

4.11. The Cauchy Distribution

Problem 103. 6 points [JK70, pp. 155/6] An example of a distribution without mean and variance is the Cauchy distribution, whose density looks much like the normal density, but has much thicker tails. The density and characteristic functions are (I am not asking you to compute the characteristic function)
$$f_x(x) = \frac{1}{\pi(1+x^2)}, \qquad \operatorname{E}[e^{itx}] = \exp(-|t|). \tag{4.11.1}$$
Here $i = \sqrt{-1}$, but you should not be afraid of it; in most respects, $i$ behaves like any real number. The characteristic function has properties very similar to the moment generating function, with the added advantage that it always exists. Using the characteristic functions show that if $x$ and $y$ are independent Cauchy distributions, then $(x+y)/2$ has the same distribution as $x$ or $y$.

Answer.
$$\operatorname{E}\Bigl[\exp\Bigl(it\frac{x+y}{2}\Bigr)\Bigr] = \operatorname{E}\Bigl[\exp\Bigl(i\frac{t}{2}x\Bigr)\exp\Bigl(i\frac{t}{2}y\Bigr)\Bigr] = \exp\Bigl(-\Bigl|\frac{t}{2}\Bigr|\Bigr)\exp\Bigl(-\Bigl|\frac{t}{2}\Bigr|\Bigr) = \exp(-|t|). \tag{4.11.2}$$

It has taken a historical learning process to distinguish significant from insignificant events. The order in which the birds sit down on a tree is insignificant, but the constellation of stars on the night sky is highly significant for the seasons etc.
The confusion between significant and insignificant events can explain how astrology arose: after it was discovered that the constellation of stars was significant, but without knowledge of the mechanism through which the constellation of stars was significant, people experimented to find evidence of causality between those aspects of the night sky that were changing, like the locations of the planets, and events on earth, like the births of babies. Romans thought the constellation of birds in the sky was significant.

Freud discovered that human error may be significant. Modern political consciousness still underestimates the extent to which the actions of states are significant: If a welfare recipient is faced with an intractable labyrinth of regulations and a multitude of agencies, then this is not the unintended result of bureaucracy gone wild, but it is deliberate: this bureaucratic nightmare deters people from using welfare, but it creates the illusion that welfare exists and it does give relief in some blatant cases. Also “mistakes” like the bombing of the Chinese embassy are not mistakes but are significant.

In statistics the common consensus is that the averages are significant and the deviations from the averages are insignificant. By taking averages one distills the significant, systematic part of the data from the insignificant part. Usually this is justified by the “law of large numbers.” I.e., people think that this is something about reality which can be derived and proved mathematically. However this is an irrealist position: how can math tell us which events are significant?

Here the Cauchy distribution is an interesting counterexample: it is a probability distribution for which it does not make sense to take averages. If one takes the average of $n$ observations, then this average does not have less randomness than each individual observation, but it has exactly the same distribution as one single observation.
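The claim that averaging Cauchy observations does not reduce randomness can be illustrated by simulation. The following is only a sketch under stated assumptions: the seed, the number of replications, the sample size $n = 100$, and the helper names `cauchy_draw` and `interquartile_range` are arbitrary illustrative choices. Spread is measured by the interquartile range because the variance does not exist; for a standard Cauchy the quartiles are $\pm 1$, so the interquartile range is 2. A standard normal is included for contrast.

```python
import math
import random
import statistics


def cauchy_draw(rng):
    """One standard Cauchy draw via the inverse c.d.f.: tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def interquartile_range(values):
    """Distance between the upper and lower sample quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


rng = random.Random(0)   # fixed seed (arbitrary) so the run is repeatable
replications = 4000      # how many single draws / averages are compared
n = 100                  # observations per average

cauchy_single = [cauchy_draw(rng) for _ in range(replications)]
cauchy_means = [
    sum(cauchy_draw(rng) for _ in range(n)) / n for _ in range(replications)
]
normal_single = [rng.gauss(0.0, 1.0) for _ in range(replications)]
normal_means = [
    sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(replications)
]

iqr_cauchy_single = interquartile_range(cauchy_single)
iqr_cauchy_means = interquartile_range(cauchy_means)
iqr_normal_single = interquartile_range(normal_single)
iqr_normal_means = interquartile_range(normal_means)

print(f"Cauchy: single draw IQR = {iqr_cauchy_single:.3f}, "
      f"mean of {n} IQR = {iqr_cauchy_means:.3f}")
print(f"Normal: single draw IQR = {iqr_normal_single:.3f}, "
      f"mean of {n} IQR = {iqr_normal_means:.3f}")
```

Under these assumptions the two Cauchy interquartile ranges should both be near 2, while the normal one should shrink by roughly the factor $\sqrt{n} = 10$.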
(The law of large numbers does not apply here because the Cauchy distribution does not have an expected value.) In a world in which random outcomes are Cauchy-distributed, taking averages would not be a good way to learn from one’s experiences. People who try to keep track of things by taking averages (or by running regressions, which is a natural extension of taking averages) would have the same status in that world as astrologers have in our world. Taking medians and other quantiles would be considered scientific, but taking averages would be considered superstition.

The lesson of this is: even a scientific procedure as innocuous as that of taking averages cannot be justified on purely epistemological grounds. Although it is widely assumed that the law of large numbers is such a justification, it is not. The law of large numbers does not always hold; it only holds if the random variable under consideration has an expected value.

The transcendental realist can therefore say: since it apparently does make sense to take averages in our world, we can deduce transcendentally that many random variables which we are dealing with do have finite expected values.

This is perhaps the simplest case of a transcendental conclusion. But this simplest case also vindicates another one of Bhaskar’s assumptions: these transcendental conclusions cannot be arrived at in a non-transcendental way, by staying in the science itself. It is impossible to decide, using statistical means alone, whether one’s data come from a distribution which has finite expected values or not. The reason is that one always has only finite datasets, and the empirical distribution of a finite sample always has finite expected values, even if the sample comes from a population which does not have finite expected values.

CHAPTER 5

Chebyshev Inequality, Weak Law of Large Numbers, and Central Limit Theorem

5.1. Chebyshev Inequality

If the random variable $y$ has finite expected value $\mu$ and standard deviation $\sigma$, and $k$ is some positive number, then the Chebyshev Inequality says
$$\Pr\bigl[|y-\mu| \ge k\sigma\bigr] \le \frac{1}{k^2}. \tag{5.1.1}$$
In words, the probability that a given random variable $y$ differs from its expected value by more than $k$ standard deviations is less than $1/k^2$. (Here “more than” and “less than” are short forms for “more than or equal to” and “less than or equal to.”) One does not need to know the full distribution of $y$ for that, only its expected value and standard deviation. We will give here a proof only if $y$ has a discrete distribution, but the inequality is valid in general. Going over to the standardized variable $z = \frac{y-\mu}{\sigma}$ we have to show $\Pr[|z| \ge k] \le \frac{1}{k^2}$. Assuming $z$ assumes the values $z_1, z_2, \ldots$ with probabilities $p(z_1), p(z_2), \ldots$, then
$$\Pr[|z| \ge k] = \sum_{i\,:\,|z_i|\ge k}p(z_i). \tag{5.1.2}$$
Now multiply by $k^2$:
$$k^2\Pr[|z| \ge k] = \sum_{i\,:\,|z_i|\ge k}k^2p(z_i) \tag{5.1.3}$$
$$\le \sum_{i\,:\,|z_i|\ge k}z_i^2p(z_i) \tag{5.1.4}$$
$$\le \sum_{\text{all } i}z_i^2p(z_i) = \operatorname{var}[z] = 1. \tag{5.1.5}$$
The Chebyshev inequality is sharp for all $k \ge 1$. Proof: the random variable which takes the value $-k$ with probability $\frac{1}{2k^2}$ and the value $+k$ with probability $\frac{1}{2k^2}$, and 0 with probability $1 - \frac{1}{k^2}$, has expected value 0 and variance 1 and the $\le$-sign in (5.1.1) becomes an equal sign.

Problem 104. [HT83, p. 316] Let $y$ be the number of successes in $n$ trials of a Bernoulli experiment with success probability $p$. Show that
$$\Pr\Bigl[\Bigl|\frac{y}{n} - p\Bigr| < \varepsilon\Bigr] \ge 1 - \frac{1}{4n\varepsilon^2}. \tag{5.1.6}$$
Hint: first compute what Chebyshev will tell you about the lefthand side, and then you will need still another inequality.

Answer. $\operatorname{E}[y/n] = p$ and $\operatorname{var}[y/n] = pq/n$ (where $q = 1 - p$). Chebyshev says therefore
$$\Pr\Bigl[\Bigl|\frac{y}{n} - p\Bigr| \ge k\sqrt{\frac{pq}{n}}\Bigr] \le \frac{1}{k^2}. \tag{5.1.7}$$
Setting $\varepsilon = k\sqrt{pq/n}$, therefore $1/k^2 = pq/n\varepsilon^2$, one can rewrite (5.1.7) as
$$\Pr\Bigl[\Bigl|\frac{y}{n} - p\Bigr| \ge \varepsilon\Bigr] \le \frac{pq}{n\varepsilon^2}. \tag{5.1.8}$$
Now note that $pq \le 1/4$ whatever their values are.

Problem 105.
2 points For a standard normal variable, $\Pr[|z| \ge 1]$ is approximately 1/3; please look up the precise value in a table. What does the Chebyshev inequality say about this probability? Also, $\Pr[|z| \ge 2]$ is approximately 5%; again look up the precise value. What does Chebyshev say?

Answer. $\Pr[|z| \ge 1] = 0.3174$, while the Chebyshev inequality only says that $\Pr[|z| \ge 1] \le 1$. $\Pr[|z| \ge 2] = 0.0456$, while Chebyshev says it is $\le 0.25$.

5.2. The Probability Limit and the Law of Large Numbers

Let $y_1, y_2, y_3, \ldots$ be a sequence of independent random variables all of which have the same expected value $\mu$ and variance $\sigma^2$. Then $\bar y_n = \frac{1}{n}\sum_{i=1}^{n} y_i$ has expected value $\mu$ and variance $\frac{\sigma^2}{n}$. I.e., its probability mass is clustered much more closely around the value $\mu$ than the individual $y_i$. To make this statement more precise we need a concept of convergence of random variables. It is not possible to define it in the “obvious” way that the sequence of random variables $y_n$ converges toward $y$ if every realization of them converges, since it is possible, although extremely unlikely, that e.g. all throws of a coin show heads ad infinitum, or follow another sequence for which the average number of heads does not converge towards 1/2. Therefore we will use the following definition:

The sequence of random variables $y_1, y_2, \ldots$ converges in probability to another random variable $y$ if and only if for every $\delta > 0$

(5.2.1)  $\lim_{n\to\infty} \Pr\bigl[\,|y_n - y| \ge \delta\,\bigr] = 0.$

One can also say that the probability limit of $y_n$ is $y$, in formulas

(5.2.2)  $\operatorname*{plim}_{n\to\infty} y_n = y.$

In many applications, the limiting variable $y$ is a degenerate random variable, i.e., it is a constant.

The Weak Law of Large Numbers says that, if the expected value exists, then the probability limit of the sample means of an ever increasing sample is the expected value, i.e., $\operatorname{plim}_{n\to\infty} \bar y_n = \mu$.

Problem 106.
5 points Assuming that not only the expected value but also the variance exists, derive the Weak Law of Large Numbers, which can be written as

(5.2.3)  $\lim_{n\to\infty} \Pr\bigl[\,|\bar y_n - E[y]| \ge \delta\,\bigr] = 0$ for all $\delta > 0$,

from the Chebyshev inequality

(5.2.4)  $\Pr[\,|x-\mu| \ge k\sigma\,] \le \dfrac{1}{k^2}$ where $\mu = E[x]$ and $\sigma^2 = \operatorname{var}[x]$.

Answer. From nonnegativity of probability and the Chebyshev inequality for $x = \bar y_n$ follows $0 \le \Pr\bigl[\,|\bar y_n - \mu| \ge \frac{k\sigma}{\sqrt n}\,\bigr] \le \frac{1}{k^2}$ for all $k$. Set $k = \frac{\delta\sqrt n}{\sigma}$ to get $0 \le \Pr[\,|\bar y_n - \mu| \ge \delta\,] \le \frac{\sigma^2}{n\delta^2}$. For any fixed $\delta > 0$, the upper bound converges towards zero as $n \to \infty$, and the lower bound is zero, therefore the probability itself also converges towards zero.

Problem 107. 4 points Let $y_1, \ldots, y_n$ be a sample from some unknown probability distribution, with sample mean $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and sample variance $s^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar y)^2$. Show that the data satisfy the following “sample equivalent” of the Chebyshev inequality: if $k$ is any fixed positive number, and $m$ is the number of observations $y_j$ which satisfy $|y_j - \bar y| \ge ks$, then $m \le n/k^2$. In symbols,

(5.2.5)  $\#\{y_i : |y_i - \bar y| \ge ks\} \le \dfrac{n}{k^2}.$

Hint: apply the usual Chebyshev inequality to the so-called empirical distribution of the sample. The empirical distribution is a discrete probability distribution defined by $\Pr[y = y_i] = k/n$, when the number $y_i$ appears $k$ times in the sample. (If all $y_i$ are different, then all probabilities are $1/n$.) The empirical distribution corresponds to the experiment of randomly picking one observation out of the given sample.

Answer. The only thing to note is: the sample mean is the expected value in that empirical distribution, the sample variance is the variance, and the relative number $m/n$ is the probability.

(5.2.6)  $\#\{y_i : y_i \in S\} = n \Pr[S]$

• a. 3 points What happens to this result when the distribution from which the $y_i$ are taken does not have an expected value or a variance?

Answer.
The result still holds but $\bar y$ and $s^2$ do not converge as the number of observations increases.

5.3. Central Limit Theorem

Assume all $y_i$ are independent and have the same distribution with mean $\mu$, variance $\sigma^2$, and also a moment generating function. Again, let $\bar y_n$ be the sample mean of the first $n$ observations. The central limit theorem says that the probability distribution for

(5.3.1)  $\dfrac{\bar y_n - \mu}{\sigma/\sqrt n}$

converges to a $N(0,1)$. This is a different concept of convergence than the probability limit; it is convergence in distribution.

Problem 108. 1 point Construct a sequence of random variables $y_1, y_2, \ldots$ with the following property: their cumulative distribution functions converge to the cumulative distribution function of a standard normal, but the random variables themselves do not converge in probability. (This is easy!)

Answer. One example would be: all $y_i$ are independent standard normal variables.

Why do we have the funny expression $\frac{\bar y_n - \mu}{\sigma/\sqrt n}$? Because this is the standardized version of $\bar y_n$. We know from the law of large numbers that the distribution of $\bar y_n$ becomes more and more concentrated around $\mu$. If we standardize the sample averages $\bar y_n$, we compensate for this concentration. The central limit theorem tells us therefore what happens to the shape of the cumulative distribution function of $\bar y_n$. If we disregard the fact that it becomes more and more concentrated (by multiplying it by a factor which is chosen such that the variance remains constant), then we see that its geometric shape comes closer and closer to a normal distribution.

Proof of the Central Limit Theorem: By Problem 109,

(5.3.2)  $\dfrac{\bar y_n - \mu}{\sigma/\sqrt n} = \dfrac{1}{\sqrt n}\sum_{i=1}^{n} \dfrac{y_i - \mu}{\sigma} = \dfrac{1}{\sqrt n}\sum_{i=1}^{n} z_i$ where $z_i = \dfrac{y_i - \mu}{\sigma}.$

Let $m_3$, $m_4$, etc., be the third, fourth, etc., moments of $z_i$; then the m.g.f. of $z_i$ is

(5.3.3)  $m_{z_i}(t) = 1 + \dfrac{t^2}{2!} + \dfrac{m_3 t^3}{3!} + \dfrac{m_4 t^4}{4!} + \cdots$

Therefore the m.g.f. of $\frac{1}{\sqrt n}\sum_{i=1}^{n} z_i$ is (multiply and substitute $t/\sqrt n$ for $t$):

(5.3.4)  $\Bigl(1 + \dfrac{t^2}{2!\,n} + \dfrac{m_3 t^3}{3!\sqrt{n^3}} + \dfrac{m_4 t^4}{4!\,n^2} + \cdots\Bigr)^{n} = \Bigl(1 + \dfrac{w_n}{n}\Bigr)^{n}$

where

(5.3.5)  $w_n = \dfrac{t^2}{2!} + \dfrac{m_3 t^3}{3!\sqrt n} + \dfrac{m_4 t^4}{4!\,n} + \cdots.$

Now use Euler’s limit, this time in the form: if $w_n \to w$ for $n \to \infty$, then $\bigl(1 + \frac{w_n}{n}\bigr)^n \to e^w$. Since our $w_n \to \frac{t^2}{2}$, the m.g.f. of the standardized $\bar y_n$ converges toward $e^{t^2/2}$, which is that of a standard normal distribution.

The Central Limit theorem is an example of emergence: independently of the distributions of the individual summands, the distribution of the sum has a very specific shape, the Gaussian bell curve. The signals turn into white noise. Here emergence is the emergence of homogeneity and indeterminacy. In capitalism, much more specific outcomes emerge: whether one quits the job or not, whether one sells the stock or not, whether one gets a divorce or not, the outcome for society is to perpetuate the system. Not many activities don’t have this outcome.

Problem 109. Show in detail that $\dfrac{\bar y_n - \mu}{\sigma/\sqrt n} = \dfrac{1}{\sqrt n}\sum_{i=1}^{n} \dfrac{y_i - \mu}{\sigma}$.

Answer. Lhs $= \dfrac{\sqrt n}{\sigma}\Bigl(\dfrac{1}{n}\sum_{i=1}^{n} y_i - \mu\Bigr) = \dfrac{\sqrt n}{\sigma}\Bigl(\dfrac{1}{n}\sum_{i=1}^{n} y_i - \dfrac{1}{n}\sum_{i=1}^{n} \mu\Bigr) = \dfrac{\sqrt n}{\sigma}\,\dfrac{1}{n}\sum_{i=1}^{n} (y_i - \mu) =$ rhs.

Problem 110. 3 points Explain verbally clearly what the law of large numbers means, what the Central Limit Theorem means, and what their difference is.

Problem 111. (For this problem, a table is needed.) [Lar82, exercise 5.6.1, p. 301] If you roll a pair of dice 180 times, what is the approximate probability that the sum seven appears 25 or more times? Hint: use the Central Limit Theorem (but don’t worry about the continuity correction, which is beyond the scope of this class).

Answer. Let $x_i$ be the random variable that equals one if the $i$-th roll is a seven, and zero otherwise. Since 7 can be obtained in six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), the probability to get a 7 (which is at the same time the expected value of $x_i$) is 6/36 = 1/6. Since $x_i^2 = x_i$, $\operatorname{var}[x_i] = E[x_i] - (E[x_i])^2 = \frac16 - \frac1{36} = \frac5{36}$. Define $x = \sum_{i=1}^{180} x_i$. We need $\Pr[x \ge 25]$. Since $x$ is the sum of many independent identically distributed random variables, the CLT says that $x$ is asymptotically normal. Which normal? That which has the same expected value and variance as $x$. $E[x] = 180 \cdot (1/6) = 30$ and $\operatorname{var}[x] = 180 \cdot (5/36) = 25$. Therefore define $y \sim N(30, 25)$. The CLT says that $\Pr[x \ge 25] \approx \Pr[y \ge 25]$. Now $y \ge 25 \iff y - 30 \ge -5$, and since the normal distribution is symmetric around its mean, $\Pr[y - 30 \ge -5] = \Pr[y - 30 \le +5] = \Pr[(y-30)/5 \le 1]$. But $z = (y-30)/5$ is a standard Normal, therefore $\Pr[(y-30)/5 \le 1] = F_z(1)$, i.e., the cumulative distribution of the standard Normal evaluated at +1. One can look this up in a table; the probability asked for is .8413. Larson uses the continuity correction: $x$ is discrete, and $\Pr[x \ge 25] = \Pr[x > 24]$. Therefore $\Pr[y \ge 25]$ and $\Pr[y > 24]$ are two alternative good approximations; but the best is $\Pr[y \ge 24.5] = .8643$. This is the continuity correction.

CHAPTER 6
Vector Random Variables

In this chapter we will look at two random variables $x$ and $y$ defined on the same sample space $U$, i.e.,

(6.0.6)  $x\colon U \ni \omega \mapsto x(\omega) \in \mathbb{R}$ and $y\colon U \ni \omega \mapsto y(\omega) \in \mathbb{R}.$

As we said before, $x$ and $y$ are called independent if all events of the form $x \le x$ are independent of any event of the form $y \le y$. But now let us assume they are not independent. In this case, we do not have all the information about them if we merely know the distribution of each.

The following example from [Lar82, example 5.1.7. on p. 233] illustrates the issues involved. This example involves two random variables that have only two possible outcomes each. Suppose you are told that a coin is to be flipped two times and that the probability of a head is .5 for each flip. This information is not enough to determine the probability of the second flip giving a head conditionally on the first flip giving a head. For instance, the above two probabilities can be achieved by the following experimental setup: a person has one fair coin and flips it twice in a row. Then the two flips are independent.
But the probabilities of 1/2 for heads and 1/2 for tails can also be achieved as follows: The person has two coins in his or her pocket. One has two heads, and one has two tails. If at random one of these two coins is picked and flipped twice, then the second flip has the same outcome as the first flip.

What do we need to get the full picture? We must consider the two variables not separately but jointly, as a totality. In order to do this, we combine $x$ and $y$ into one entity, a vector $\begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^2$. Consequently we need to know the probability measure induced by the mapping $U \ni \omega \mapsto \begin{bmatrix} x(\omega) \\ y(\omega) \end{bmatrix} \in \mathbb{R}^2$.

It is not sufficient to look at random variables individually; one must look at them as a totality. Therefore let us first get an overview over all possible probability measures on the plane $\mathbb{R}^2$. In strict analogy with the one-dimensional case, these probability measures can be represented by the joint cumulative distribution function. It is defined as

(6.0.7)  $F_{x,y}(x,y) = \Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} \le \begin{bmatrix} x \\ y \end{bmatrix}\Bigr] = \Pr[x \le x \text{ and } y \le y].$

For discrete random variables, for which the cumulative distribution function is a step function, the joint probability mass function provides the same information:

(6.0.8)  $p_{x,y}(x,y) = \Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}\Bigr] = \Pr[x = x \text{ and } y = y].$

Problem 112. Write down the joint probability mass functions for the two versions of the two coin flips discussed above.

Answer. Here are the probability mass functions for these two cases:

(6.0.9)
One fair coin flipped twice:

                  Second Flip
  First Flip     H      T     sum
      H         .25    .25    .50
      T         .25    .25    .50
     sum        .50    .50   1.00

Two-headed or two-tailed coin picked at random:

                  Second Flip
  First Flip     H      T     sum
      H         .50    .00    .50
      T         .00    .50    .50
     sum        .50    .50   1.00

The most important case is that with a differentiable cumulative distribution function. Then the joint density function $f_{x,y}(x,y)$ can be used to define the probability measure. One obtains it from the cumulative distribution function by taking derivatives:

(6.0.10)  $f_{x,y}(x,y) = \dfrac{\partial^2}{\partial x\,\partial y} F_{x,y}(x,y).$

Probabilities can be obtained back from the density function either by the integral condition, or by the infinitesimal condition. I.e., either one says for a subset $B \subset \mathbb{R}^2$:

(6.0.11)  $\Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} \in B\Bigr] = \iint_B f(x,y)\,dx\,dy,$

or one says, for an infinitesimal two-dimensional volume element $dV_{x,y}$ located at $\begin{bmatrix} x \\ y \end{bmatrix}$, which has the two-dimensional volume (i.e., area) $|dV|$,

(6.0.12)  $\Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_{x,y}\Bigr] = f(x,y)\,|dV|.$

The vertical bars here do not mean the absolute value but the volume of the argument inside.

6.1. Expected Value, Variances, Covariances

To get the expected value of a function of $x$ and $y$, one simply has to put this function together with the density function into the integral, i.e., the formula is

(6.1.1)  $E[g(x,y)] = \iint_{\mathbb{R}^2} g(x,y)\, f_{x,y}(x,y)\,dx\,dy.$

Problem 113. Assume there are two transportation choices available: bus and car. If you pick at random a neoclassical individual $\omega$ and ask which utility this person derives from using bus or car, the answer will be two numbers that can be written as a vector $\begin{bmatrix} u(\omega) \\ v(\omega) \end{bmatrix}$ ($u$ for bus and $v$ for car).

• a. 3 points Assuming $\begin{bmatrix} u \\ v \end{bmatrix}$ has a uniform density in the rectangle with corners $\begin{bmatrix} 66 \\ 68 \end{bmatrix}$, $\begin{bmatrix} 66 \\ 72 \end{bmatrix}$, $\begin{bmatrix} 71 \\ 68 \end{bmatrix}$, and $\begin{bmatrix} 71 \\ 72 \end{bmatrix}$, compute the probability that the bus will be preferred.

Answer. The probability is 9/40. $u$ and $v$ have a joint density function that is uniform in the rectangle below and zero outside ($u$, the preference for buses, is on the horizontal, and $v$, the preference for cars, on the vertical axis). The probability is the fraction of this rectangle below the diagonal.

[Figure: the rectangle $66 \le u \le 71$, $68 \le v \le 72$, cut by the diagonal $u = v$.]

• b. 2 points How would you criticize an econometric study which argued along the above lines?

Answer. The preferences are not for a bus or a car, but for a whole transportation system.
And these preferences are not formed independently and individualistically, but they depend on which other infrastructures are in place, whether there is suburban sprawl or concentrated walkable cities, etc. This is again the error of detotalization (which favors the status quo).

Jointly distributed random variables should be written as random vectors. Instead of $\begin{bmatrix} y \\ z \end{bmatrix}$ we will also write $\mathbf{x}$ (bold face). Vectors are always considered to be column vectors. The expected value of a random vector is a vector of constants, notation

(6.1.2)  $\mathcal{E}[\mathbf{x}] = \begin{bmatrix} E[x_1] \\ \vdots \\ E[x_n] \end{bmatrix}.$

For two random variables $x$ and $y$, their covariance is defined as

(6.1.3)  $\operatorname{cov}[x,y] = E\bigl[(x - E[x])(y - E[y])\bigr].$

Computation rules with covariances are

(6.1.4)  $\operatorname{cov}[x,z] = \operatorname{cov}[z,x]$,  $\operatorname{cov}[x,x] = \operatorname{var}[x]$,  $\operatorname{cov}[x,\alpha] = 0$,
(6.1.5)  $\operatorname{cov}[x+y,z] = \operatorname{cov}[x,z] + \operatorname{cov}[y,z]$,  $\operatorname{cov}[\alpha x,y] = \alpha\operatorname{cov}[x,y].$

Problem 114. 3 points Using definition (6.1.3) prove the following formula:

(6.1.6)  $\operatorname{cov}[x,y] = E[xy] - E[x]\,E[y].$

Write it down carefully; you will lose points for unbalanced or missing parentheses and brackets.

Answer. Here it is side by side, first without and then with the notation $E[x] = \mu$ and $E[y] = \nu$:

$\operatorname{cov}[x,y] = E\bigl[(x - E[x])(y - E[y])\bigr]$
$\quad = E\bigl[xy - x\,E[y] - E[x]\,y + E[x]\,E[y]\bigr]$
$\quad = E[xy] - E[x]\,E[y] - E[x]\,E[y] + E[x]\,E[y]$
(6.1.7)  $\quad = E[xy] - E[x]\,E[y].$

$\operatorname{cov}[x,y] = E[(x-\mu)(y-\nu)]$
$\quad = E[xy - x\nu - \mu y + \mu\nu]$
$\quad = E[xy] - \mu\nu - \mu\nu + \mu\nu$
$\quad = E[xy] - \mu\nu.$

Problem 115. 1 point Using (6.1.6) prove the five computation rules with covariances (6.1.4) and (6.1.5).

Problem 116. Using the computation rules with covariances, show that

(6.1.8)  $\operatorname{var}[x+y] = \operatorname{var}[x] + 2\operatorname{cov}[x,y] + \operatorname{var}[y].$

If one deals with random vectors, the expected value becomes a vector, and the variance becomes a matrix, which is called dispersion matrix or variance-covariance matrix or simply covariance matrix. We will write it $\mathcal{V}[\mathbf{x}]$. Its formal definition is

(6.1.9)  $\mathcal{V}[\mathbf{x}] = \mathcal{E}\bigl[(\mathbf{x} - \mathcal{E}[\mathbf{x}])(\mathbf{x} - \mathcal{E}[\mathbf{x}])^\top\bigr],$

but we can look at it simply as the matrix of all variances and covariances, for example

(6.1.10)  $\mathcal{V}\Bigl[\begin{bmatrix} x \\ y \end{bmatrix}\Bigr] = \begin{bmatrix} \operatorname{var}[x] & \operatorname{cov}[x,y] \\ \operatorname{cov}[y,x] & \operatorname{var}[y] \end{bmatrix}.$

An important computation rule for the covariance matrix is

(6.1.11)  $\mathcal{V}[\mathbf{x}] = \Psi \;\Rightarrow\; \mathcal{V}[A\mathbf{x}] = A\Psi A^\top.$

Problem 117. 4 points Let $\mathbf{x} = \begin{bmatrix} y \\ z \end{bmatrix}$ be a vector consisting of two random variables, with covariance matrix $\mathcal{V}[\mathbf{x}] = \Psi$, and let $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ be an arbitrary $2 \times 2$ matrix. Prove that

(6.1.12)  $\mathcal{V}[A\mathbf{x}] = A\Psi A^\top.$

Hint: You need to multiply matrices, and to use the following computation rules for covariances:

(6.1.13)  $\operatorname{cov}[x+y,z] = \operatorname{cov}[x,z] + \operatorname{cov}[y,z]$,  $\operatorname{cov}[\alpha x,y] = \alpha\operatorname{cov}[x,y]$,  $\operatorname{cov}[x,x] = \operatorname{var}[x].$

Answer.

$\mathcal{V}[A\mathbf{x}] = \mathcal{V}\Bigl[\begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} y \\ z \end{bmatrix}\Bigr] = \mathcal{V}\Bigl[\begin{bmatrix} ay+bz \\ cy+dz \end{bmatrix}\Bigr] = \begin{bmatrix} \operatorname{var}[ay+bz] & \operatorname{cov}[ay+bz,\,cy+dz] \\ \operatorname{cov}[cy+dz,\,ay+bz] & \operatorname{var}[cy+dz] \end{bmatrix}$

On the other hand,

$A\Psi A^\top = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} \operatorname{var}[y] & \operatorname{cov}[y,z] \\ \operatorname{cov}[y,z] & \operatorname{var}[z] \end{bmatrix}\begin{bmatrix} a & c \\ b & d \end{bmatrix} = \begin{bmatrix} a\operatorname{var}[y] + b\operatorname{cov}[y,z] & a\operatorname{cov}[y,z] + b\operatorname{var}[z] \\ c\operatorname{var}[y] + d\operatorname{cov}[y,z] & c\operatorname{cov}[y,z] + d\operatorname{var}[z] \end{bmatrix}\begin{bmatrix} a & c \\ b & d \end{bmatrix}$

Multiply out and show that it is the same thing.

Since the variances are nonnegative, one can see from equation (6.1.11) that covariance matrices are nonnegative definite (which in econometrics is often also called positive semidefinite). By definition, a symmetric matrix $\Sigma$ is nonnegative definite if for all vectors $a$ follows $a^\top \Sigma a \ge 0$. It is positive definite if it is nonnegative definite, and $a^\top \Sigma a = 0$ holds only if $a = o$.

Problem 118. 1 point A symmetric matrix $\Omega$ is nonnegative definite if and only if $a^\top \Omega a \ge 0$ for every vector $a$. Using this criterion, show that if $\Sigma$ is symmetric and nonnegative definite, and if $R$ is an arbitrary matrix, then $R^\top \Sigma R$ is also nonnegative definite.

One can also define a covariance matrix between different vectors, $\mathcal{C}[\mathbf{x},\mathbf{y}]$; its $i,j$ element is $\operatorname{cov}[x_i, y_j]$.

The correlation coefficient of two scalar random variables is defined as

(6.1.14)  $\operatorname{corr}[x,y] = \dfrac{\operatorname{cov}[x,y]}{\sqrt{\operatorname{var}[x]\operatorname{var}[y]}}.$
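Rule (6.1.11) and definition (6.1.14) can be checked on a small discrete example. The sketch below is only an illustration under stated assumptions: the joint probability mass function `pmf` of (y, z) and the matrix `A` are made-up values, and all helper names are hypothetical. It computes the covariance matrix of the transformed vector directly from the distribution and compares it with AΨAᵀ, using exact fractions so that the comparison involves no rounding.

```python
from fractions import Fraction as F
import math

# Hypothetical joint pmf of (y, z): outcome -> probability.
pmf = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 2): F(1, 2)}


def expect(g, pmf):
    """E[g(y, z)] under the joint pmf."""
    return sum(p * g(y, z) for (y, z), p in pmf.items())


def cov(g, h, pmf):
    """cov[g, h] = E[gh] - E[g] E[h], i.e. formula (6.1.6)."""
    return expect(lambda y, z: g(y, z) * h(y, z), pmf) - expect(g, pmf) * expect(h, pmf)


def cov_matrix(g, h, pmf):
    """2 x 2 covariance matrix of the vector (g, h)."""
    return [[cov(g, g, pmf), cov(g, h, pmf)],
            [cov(h, g, pmf), cov(h, h, pmf)]]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(P):
    return [[P[j][i] for j in range(2)] for i in range(2)]


A = [[F(2), F(-1)], [F(1), F(3)]]  # arbitrary 2 x 2 matrix
(a, b), (c, d) = A

psi = cov_matrix(lambda y, z: y, lambda y, z: z, pmf)
direct = cov_matrix(lambda y, z: a * y + b * z, lambda y, z: c * y + d * z, pmf)
by_rule = matmul(matmul(A, psi), transpose(A))
corr = float(psi[0][1]) / math.sqrt(float(psi[0][0] * psi[1][1]))

print(direct == by_rule)  # both routes should give the same matrix
print(corr)
```

Working with `Fraction` rather than floating point is a design choice made here so that the two matrices can be compared with `==`.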
The advantage of the correlation coefficient over the covariance is that it is always between −1 and +1. This follows from the Cauchy-Schwarz inequality

(6.1.15)  $(\operatorname{cov}[x,y])^2 \le \operatorname{var}[x]\operatorname{var}[y].$

Problem 119. 4 points Given two random variables $y$ and $z$ with $\operatorname{var}[y] \ne 0$, compute that constant $a$ for which $\operatorname{var}[ay - z]$ is the minimum. Then derive the Cauchy-Schwarz inequality from the fact that the minimum variance is nonnegative.

Answer.

(6.1.16)  $\operatorname{var}[ay - z] = a^2\operatorname{var}[y] - 2a\operatorname{cov}[y,z] + \operatorname{var}[z]$
(6.1.17)  First order condition: $0 = 2a\operatorname{var}[y] - 2\operatorname{cov}[y,z]$

Therefore the minimum is attained at $a^* = \operatorname{cov}[y,z]/\operatorname{var}[y]$, for which the cross product term is −2 times the first item:

(6.1.18)  $0 \le \operatorname{var}[a^* y - z] = \dfrac{(\operatorname{cov}[y,z])^2}{\operatorname{var}[y]} - \dfrac{2(\operatorname{cov}[y,z])^2}{\operatorname{var}[y]} + \operatorname{var}[z]$
(6.1.19)  $0 \le -(\operatorname{cov}[y,z])^2 + \operatorname{var}[y]\operatorname{var}[z].$

This proves (6.1.15) for the case $\operatorname{var}[y] \ne 0$. If $\operatorname{var}[y] = 0$, then $y$ is a constant, therefore $\operatorname{cov}[y,z] = 0$ and (6.1.15) holds trivially.

6.2. Marginal Probability Laws

The marginal probability distribution of $x$ (or $y$) is simply the probability distribution of $x$ (or $y$). The word “marginal” merely indicates that it is derived from the joint probability distribution of $x$ and $y$.

If the probability distribution is characterized by a probability mass function, we can compute the marginal probability mass functions by writing down the joint probability mass function in a rectangular scheme and summing up the rows or columns:

(6.2.1)  $p_x(x) = \sum_{y\,:\,p(x,y) \ne 0} p_{x,y}(x,y).$

For density functions, the following argument can be given:

(6.2.2)  $\Pr[x \in dV_x] = \Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_x \times \mathbb{R}\Bigr].$

By the definition of a product set: $\begin{bmatrix} x \\ y \end{bmatrix} \in A \times B \iff x \in A$ and $y \in B$. Split $\mathbb{R}$ into many small disjoint intervals, $\mathbb{R} = \bigcup_i dV_{y_i}$, then

(6.2.3)  $\Pr[x \in dV_x] = \sum_i \Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_x \times dV_{y_i}\Bigr]$
(6.2.4)  $\qquad = \sum_i f_{x,y}(x,y_i)\,|dV_x|\,|dV_{y_i}|$
(6.2.5)  $\qquad = |dV_x| \sum_i f_{x,y}(x,y_i)\,|dV_{y_i}|.$

Therefore $\sum_i f_{x,y}(x,y_i)\,|dV_{y_i}|$ is the density function we are looking for.
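The sum over small intervals can be evaluated numerically. The following sketch is a hedged illustration, not the notes' own procedure: it borrows the joint density of Problem 121 (uniform with value 2 on the triangle 0 ≤ u ≤ v ≤ 1, whose marginal in u is 2 − 2u), splits the range of the second variable into equally wide pieces, and evaluates the density at the midpoint of each piece. The function names and the number of pieces are assumptions made for the example.

```python
def joint_density(u, v):
    """Uniform density 2 on the triangle 0 <= u <= v <= 1, zero elsewhere."""
    return 2.0 if 0.0 <= u <= v <= 1.0 else 0.0


def marginal_by_sum(f, x, lo=0.0, hi=1.0, pieces=1000):
    """Approximate sum_i f(x, y_i) |dV_{y_i}| with equally wide intervals.

    Each y_i is the midpoint of its interval; [lo, hi] must cover the
    region where f(x, .) is nonzero.
    """
    width = (hi - lo) / pieces
    midpoints = (lo + (i + 0.5) * width for i in range(pieces))
    return sum(f(x, y_i) * width for y_i in midpoints)


for u in (0.1, 0.3, 0.8):
    # second column: finite sum; third column: exact marginal 2 - 2u
    print(u, marginal_by_sum(joint_density, u), 2 - 2 * u)
```

With finer pieces the finite sum approaches the exact marginal, which is the sense in which the sum is replaced by an integral.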
Now the $|dV_{y_i}|$ are usually written as $dy$, and the sum is usually written as an integral (i.e., an infinite sum each summand of which is infinitesimal), therefore we get

(6.2.6)  $f_x(x) = \int_{y=-\infty}^{y=+\infty} f_{x,y}(x,y)\,dy.$

In other words, one has to “integrate out” the variable which one is not interested in.

6.3. Conditional Probability Distribution and Conditional Mean

The conditional probability distribution of $y$ given $x = x$ is the probability distribution of $y$ if we count only those experiments in which the outcome of $x$ is $x$. If the distribution is defined by a probability mass function, then this is no problem:

(6.3.1)  $p_{y|x}(y,x) = \Pr[y = y \mid x = x] = \dfrac{\Pr[y = y \text{ and } x = x]}{\Pr[x = x]} = \dfrac{p_{x,y}(x,y)}{p_x(x)}.$

For a density function there is the problem that $\Pr[x = x] = 0$, i.e., the conditional probability is strictly speaking not defined. Therefore take an infinitesimal volume element $dV_x$ located at $x$ and condition on $x \in dV_x$:

(6.3.2)  $\Pr[y \in dV_y \mid x \in dV_x] = \dfrac{\Pr[y \in dV_y \text{ and } x \in dV_x]}{\Pr[x \in dV_x]}$
(6.3.3)  $\qquad = \dfrac{f_{x,y}(x,y)\,|dV_x|\,|dV_y|}{f_x(x)\,|dV_x|}$
(6.3.4)  $\qquad = \dfrac{f_{x,y}(x,y)}{f_x(x)}\,|dV_y|.$

This no longer depends on $dV_x$, only on its location $x$. The conditional density is therefore

(6.3.5)  $f_{y|x}(y,x) = \dfrac{f_{x,y}(x,y)}{f_x(x)}.$

As $y$ varies, the conditional density is proportional to the joint density function, but for every given value of $x$ the joint density is multiplied by an appropriate factor so that its integral with respect to $y$ is 1. From (6.3.5) follows also that the joint density function is the product of the conditional times the marginal density functions.

Problem 120. 2 points The conditional density is the joint divided by the marginal:

(6.3.6)  $f_{y|x}(y,x) = \dfrac{f_{x,y}(x,y)}{f_x(x)}.$

Show that this density integrates out to 1.

Answer. The conditional is a density in $y$ with $x$ as parameter. Therefore its integral with respect to $y$ must be = 1.
Indeed,

(6.3.7)  $\int_{y=-\infty}^{+\infty} f_{y|x=x}(y,x)\,dy = \dfrac{\int_{y=-\infty}^{+\infty} f_{x,y}(x,y)\,dy}{f_x(x)} = \dfrac{f_x(x)}{f_x(x)} = 1$

because of the formula for the marginal:

(6.3.8)  $f_x(x) = \int_{y=-\infty}^{+\infty} f_{x,y}(x,y)\,dy.$

You see that formula (6.3.6) divides the joint density exactly by the right number which makes the integral equal to 1.

Problem 121. [BD77, example 1.1.4 on p. 7]. $x$ and $y$ are two independent random variables uniformly distributed over $[0,1]$. Define $u = \min(x,y)$ and $v = \max(x,y)$.

• a. Draw in the $x,y$ plane the event $\{\max(x,y) \le 0.5 \text{ and } \min(x,y) > 0.4\}$ and compute its probability.

Answer. The event is the square between 0.4 and 0.5, and its probability is 0.01.

• b. Compute the probability of the event $\{\max(x,y) \le 0.5 \text{ and } \min(x,y) \le 0.4\}$.

Answer. It is $\Pr[\max(x,y) \le 0.5] - \Pr[\max(x,y) \le 0.5 \text{ and } \min(x,y) > 0.4]$, i.e., the area of the square from 0 to 0.5 minus the square we just had, i.e., 0.24.

• c. Compute $\Pr[\max(x,y) \le 0.5 \mid \min(x,y) \le 0.4]$.

Answer.

(6.3.9)  $\dfrac{\Pr[\max(x,y) \le 0.5 \text{ and } \min(x,y) \le 0.4]}{\Pr[\min(x,y) \le 0.4]} = \dfrac{0.24}{1 - 0.36} = \dfrac{0.24}{0.64} = \dfrac{3}{8}.$

• d. Compute the joint cumulative distribution function of $u$ and $v$.

Answer. One good way is to do it geometrically: for arbitrary $0 \le u, v \le 1$ draw the area $\{u \le u \text{ and } v \le v\}$ and then derive its size. If $u \le v$ then $\Pr[u \le u \text{ and } v \le v] = \Pr[v \le v] - \Pr[u > u \text{ and } v \le v] = v^2 - (v-u)^2 = 2uv - u^2$. If $u \ge v$ then $\Pr[u \le u \text{ and } v \le v] = \Pr[v \le v] = v^2$.

• e. Compute the joint density function of $u$ and $v$. Note: this joint density is discontinuous. The values at the breakpoints themselves do not matter, but it is very important to give the limits within which this is a nontrivial function and where it is zero.

Answer. One can see from the way the cumulative distribution function was constructed that the density function must be

(6.3.10)  $f_{u,v}(u,v) = \begin{cases} 2 & \text{if } 0 \le u \le v \le 1 \\ 0 & \text{otherwise} \end{cases}$

I.e., it is uniform in the above-diagonal part of the square.
This is also what one gets from differentiating $2vu - u^2$ once with respect to $u$ and once with respect to $v$.

• f. Compute the marginal density function of $u$.

Answer. Integrate $v$ out: the marginal density of $u$ is

(6.3.11)  $f_u(u) = \int_{v=u}^{1} 2\,dv = 2v\Big|_{u}^{1} = 2 - 2u$ if $0 \le u \le 1$, and 0 otherwise.

• g. Compute the conditional density of $v$ given $u = u$.

Answer. Conditional density is easy to get too; it is the joint divided by the marginal, i.e., it is uniform:

(6.3.12)  $f_{v|u=u}(v) = \begin{cases} \dfrac{1}{1-u} & \text{for } 0 \le u \le v \le 1 \\ 0 & \text{otherwise.} \end{cases}$

6.4. The Multinomial Distribution

Assume you have an experiment with $r$ different possible outcomes, with outcome $i$ having probability $p_i$ ($i = 1, \ldots, r$). You are repeating the experiment $n$ different times, and you count how many times the $i$th outcome occurred. Therefore you get a random vector with $r$ different components $x_i$, indicating how often the $i$th event occurred. The probability to get the frequencies $x_1, \ldots, x_r$ is

(6.4.1)  $\Pr[x_1 = x_1, \ldots, x_r = x_r] = \dfrac{n!}{x_1! \cdots x_r!}\, p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}.$

This can be explained as follows: The probability that the first $x_1$ experiments yield outcome 1, the next $x_2$ outcome 2, etc., is $p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}$. Now every other sequence of experiments which yields the same number of outcomes of the different categories is simply a permutation of this. But multiplying this probability by $n!$ may count certain sequences of outcomes more than once. Therefore we have to divide by the number of permutations of the whole $n$ element set which yield the same original sequence. This is $x_1! \cdots x_r!$, because this must be a permutation which permutes the first $x_1$ elements amongst themselves, etc. Therefore the relevant count of permutations is $\frac{n!}{x_1! \cdots x_r!}$.

Problem 122. You have an experiment with $r$ different outcomes, the $i$th outcome occurring with probability $p_i$. You make $n$ independent trials, and the $i$th outcome occurred $x_i$ times. The joint distribution of the $x_1, \ldots, x_r$ is called a multinomial distribution with parameters $n$ and $p_1, \ldots, p_r$.

• a. 3 points Prove that their mean vector and covariance matrix are

(6.4.2)  $\mu = \mathcal{E}\Bigl[\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_r \end{bmatrix}\Bigr] = n\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_r \end{bmatrix}$ and $\Psi = \mathcal{V}\Bigl[\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_r \end{bmatrix}\Bigr] = n\begin{bmatrix} p_1 - p_1^2 & -p_1 p_2 & \cdots & -p_1 p_r \\ -p_2 p_1 & p_2 - p_2^2 & \cdots & -p_2 p_r \\ \vdots & \vdots & \ddots & \vdots \\ -p_r p_1 & -p_r p_2 & \cdots & p_r - p_r^2 \end{bmatrix}.$

Hint: use the fact that the multinomial distribution with parameters $n$ and $p_1, \ldots, p_r$ is the independent sum of $n$ multinomial distributions with parameters 1 and $p_1, \ldots, p_r$.

Answer. In one trial, $x_i^2 = x_i$, from which follows the formula for the variance, and for $i \ne j$, $x_i x_j = 0$, since only one of them can occur. Therefore $\operatorname{cov}[x_i, x_j] = 0 - E[x_i]\,E[x_j]$. For several independent trials, just add this.

• b. 1 point How can you show that this covariance matrix is singular?

Answer. Since $x_1 + \cdots + x_r = n$ with zero variance, we should expect

(6.4.3)  $n\begin{bmatrix} p_1 - p_1^2 & -p_1 p_2 & \cdots & -p_1 p_r \\ -p_2 p_1 & p_2 - p_2^2 & \cdots & -p_2 p_r \\ \vdots & \vdots & \ddots & \vdots \\ -p_r p_1 & -p_r p_2 & \cdots & p_r - p_r^2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$

6.5. Independent Random Vectors

The same definition of independence, which we already encountered with scalar random variables, also applies to vector random variables: the vector random variables $\mathbf{x}\colon U \to \mathbb{R}^m$ and $\mathbf{y}\colon U \to \mathbb{R}^n$ are called independent if all events that can be defined in terms of $\mathbf{x}$ are independent of all events that can be defined in terms of $\mathbf{y}$, i.e., all events of the form $\{\mathbf{x}(\omega) \in C\}$ are independent of all events of the form $\{\mathbf{y}(\omega) \in D\}$ with arbitrary (measurable) subsets $C \subset \mathbb{R}^m$ and $D \subset \mathbb{R}^n$. For this it is sufficient that for all $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, the event $\{\mathbf{x} \le x\}$ is independent of the event $\{\mathbf{y} \le y\}$, i.e., that the joint cumulative distribution function is the product of the marginal ones.
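In the discrete case the product rule can be checked cell by cell on the probability mass function. The sketch below is an illustration only: it encodes the two coin experiments of Problem 112 as dictionaries and tests whether each joint probability equals the product of the two marginal probabilities. The function names and the dictionary encoding are assumptions made for this example.

```python
from fractions import Fraction as F


def marginals(joint):
    """Row and column sums of a joint pmf given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py


def is_independent(joint):
    """True if p(x, y) == p_x(x) * p_y(y) for every pair of marginal values."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)


# One fair coin flipped twice: every cell has probability 1/4.
one_fair_coin = {(first, second): F(1, 4) for first in "HT" for second in "HT"}
# Two-headed or two-tailed coin picked at random: the flips always agree.
two_trick_coins = {("H", "H"): F(1, 2), ("T", "T"): F(1, 2)}

print(marginals(one_fair_coin) == marginals(two_trick_coins))  # same marginals
print(is_independent(one_fair_coin), is_independent(two_trick_coins))
```

Both experiments have identical marginal distributions, so the difference between them is visible only in the joint distribution.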
Since the joint cumulative distribution function of independent variables is equal to the product of the univariate cumulative distribution functions, the same is true for the joint density function and the joint probability mass function.

Only under this strong definition of independence is it true that any functions of independent random variables are independent.

Problem 123. 4 points Prove that, if $x$ and $y$ are independent, then $E[xy] = E[x]\,E[y]$ and therefore $\operatorname{cov}[x,y] = 0$. (You may assume $x$ and $y$ have density functions.) Give a counterexample where the covariance is zero but the variables are nevertheless dependent.

Answer. Just use that the joint density function is the product of the marginals. It can also be done as follows: $E[xy] = E\bigl[E[xy \mid x]\bigr] = E\bigl[x\,E[y \mid x]\bigr] =$ (now independence is needed) $= E\bigl[x\,E[y]\bigr] = E[x]\,E[y]$. A counterexample is given in Problem 139.

Problem 124. 3 points Prove the following: If the scalar random variables $x$ and $y$ are indicator variables (i.e., if each of them can only assume the values 0 and 1), and if $\operatorname{cov}[x,y] = 0$, then $x$ and $y$ are independent. (I.e., in this respect indicator variables have similar properties as jointly normal random variables.)

Answer. Define the events $A = \{\omega \in U : x(\omega) = 1\}$ and $B = \{\omega \in U : y(\omega) = 1\}$, i.e., $x = i_A$ (the indicator variable of the event $A$) and $y = i_B$. Then $xy = i_{A \cap B}$. If $\operatorname{cov}[x,y] = E[xy] - E[x]\,E[y] = \Pr[A \cap B] - \Pr[A]\Pr[B] = 0$, then $A$ and $B$ are independent.

Problem 125. If the vector random variables $\mathbf{x}$ and $\mathbf{y}$ have the property that $x_i$ is independent of every $y_j$ for all $i$ and $j$, does that make $\mathbf{x}$ and $\mathbf{y}$ independent random vectors? Interestingly, the answer is no. Give a counterexample that this fact does not even hold for indicator variables.
I.e., construct two random vectors $\mathbf{x}$ and $\mathbf{y}$, consisting of indicator variables, with the property that each component of $\mathbf{x}$ is independent of each component of $\mathbf{y}$, but $\mathbf{x}$ and $\mathbf{y}$ are not independent as vector random variables. Hint: Such an example can be constructed in the simplest possible case that $\mathbf{x}$ has two components and $\mathbf{y}$ has one component; i.e., you merely have to find three indicator variables $x_1$, $x_2$, and $y$ with the property that $x_1$ is independent of $y$, and $x_2$ is independent of $y$, but the vector $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ is not independent of $y$. For these three variables, you should use three events which are pairwise independent but not mutually independent.

Answer. Go back to throwing a coin twice independently and define $A = \{HH, HT\}$, $B = \{TH, HH\}$, and $C = \{HH, TT\}$, and $x_1 = I_A$, $x_2 = I_B$, and $y = I_C$. They are pairwise independent, but $A \cap B \cap C = A \cap B$, i.e., $x_1 x_2 y = x_1 x_2$, therefore $E[x_1 x_2 y] = E[x_1 x_2] = \frac14 \ne \frac18 = E[x_1 x_2]\,E[y]$, therefore they are not independent.

Problem 126. 4 points Prove that, if $x$ and $y$ are independent, then $\operatorname{var}[xy] = (E[x])^2 \operatorname{var}[y] + (E[y])^2 \operatorname{var}[x] + \operatorname{var}[x]\operatorname{var}[y]$.

Answer. Start with the result and replace all occurrences of $\operatorname{var}[z]$ with $E[z^2] - E[z]^2$, then multiply out: $E[x]^2\bigl(E[y^2] - E[y]^2\bigr) + E[y]^2\bigl(E[x^2] - E[x]^2\bigr) + \bigl(E[x^2] - E[x]^2\bigr)\bigl(E[y^2] - E[y]^2\bigr) = E[x^2]\,E[y^2] - E[x]^2 E[y]^2 = E[(xy)^2] - E[xy]^2$.

6.6. Conditional Expectation and Variance

The conditional expectation of $y$ is the expected value of $y$ under the conditional density. If joint densities exist, it follows

(6.6.1)  $E[y \mid x = x] = \dfrac{\int y\, f_{x,y}(x,y)\,dy}{f_x(x)} =: g(x).$

This is not a random variable but a constant which depends on $x$, i.e., a function of $x$, which is called here $g(x)$. But often one uses the term $E[y \mid x]$ without specifying $x$. This is, by definition, the random variable $g(x)$ which one gets by plugging $x$ into $g$; it assigns to every outcome $\omega \in U$ the conditional expectation of $y$ given $x = x(\omega)$.

Since $E[y \mid x]$ is a random variable, it is possible to take its expected value.
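The distinction between the function g and the random variable obtained by plugging x into g can be made concrete in the discrete case. The following sketch rests on assumptions that are not in the notes: it replaces densities by a small made-up joint probability mass function `joint` (so the integral in (6.6.1) becomes a sum), and the helper names are hypothetical. It tabulates g at each possible value of x and then averages g over the distribution of x.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (x, y): outcome -> probability.
joint = {(0, 1): F(1, 8), (0, 3): F(3, 8), (1, 1): F(1, 4), (1, 5): F(1, 4)}


def marginal_x(joint):
    """Marginal pmf of x, obtained by summing the joint pmf over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px


def g(x0, joint):
    """E[y | x = x0]: mean of y under the conditional pmf p(x0, y) / p_x(x0)."""
    px0 = marginal_x(joint)[x0]
    return sum(y * p / px0 for (x, y), p in joint.items() if x == x0)


px = marginal_x(joint)
table = {x0: g(x0, joint) for x0 in px}              # the function g, value by value
mean_of_g = sum(px[x0] * table[x0] for x0 in px)     # expected value of g(x)
mean_of_y = sum(y * p for (_, y), p in joint.items())  # unconditional E[y]

print(table)
print(mean_of_g, mean_of_y)
```

The dictionary `table` is g as an ordinary function; weighting its values by the marginal probabilities of x treats g(x) as a random variable.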
The law of iterated expectations is extremely important here. It says that you will get the same result as if you had taken the expected value of y:

(6.6.2) $E\bigl[E[y|x]\bigr] = E[y]$.

Proof (for the case that the densities exist):

(6.6.3) $E\bigl[E[y|x]\bigr] = E[g(x)] = \int \frac{\int y\, f_{x,y}(x,y)\,dy}{f_x(x)}\, f_x(x)\,dx = \iint y\, f_{x,y}(x,y)\,dy\,dx = E[y]$.

Problem 127. Let x and y be two jointly distributed variables. For every fixed value x, var[y|x = x] is the variance of y under the conditional distribution, and var[y|x] is this variance as a random variable, namely, as a function of x.

• a. 1 point Prove that

(6.6.4) $\operatorname{var}[y|x] = E[y^2|x] - (E[y|x])^2$.

This is a very simple proof. Explain exactly what, if anything, needs to be done to prove it.

Answer. For every fixed value x, it is an instance of the law

(6.6.5) $\operatorname{var}[y] = E[y^2] - (E[y])^2$

applied to the conditional density given x = x. And since it is true for every fixed x, it is also true after plugging in the random variable x.

• b. 3 points Prove that

(6.6.6) $\operatorname{var}[y] = \operatorname{var}\bigl[E[y|x]\bigr] + E\bigl[\operatorname{var}[y|x]\bigr]$,

i.e., the variance consists of two components: the variance of the conditional mean and the mean of the conditional variances. This decomposition of the variance is given e.g. in [Rao73, p. 97] or [Ame94, theorem 4.4.2 on p. 78].

Answer. The first term on the rhs is $E[(E[y|x])^2] - (E[E[y|x]])^2$, and the second term, due to (6.6.4), becomes $E[E[y^2|x]] - E[(E[y|x])^2]$. If one adds, the two $E[(E[y|x])^2]$ cancel out, and the other two terms can be simplified by the law of iterated expectations to give $E[y^2] - (E[y])^2$.

• c. 2 points [Coo98, p. 23] The conditional expected value is sometimes called the population regression function. In graphical data analysis, the sample equivalent of the variance ratio

(6.6.7) $\frac{E\bigl[\operatorname{var}[y|x]\bigr]}{\operatorname{var}\bigl[E[y|x]\bigr]}$

can be used to determine whether the regression function E[y|x] appears to be visually well-determined or not.
Does a small or a big variance ratio indicate a well-determined regression function?

Answer. For a well-determined regression function the variance ratio should be small. [Coo98, p. 23] writes: “This ratio is reminiscent of a one-way analysis of variance, with the numerator representing the average within group (slice) variance, and the denominator representing the variance between group (slice) means.”

Now some general questions:

Problem 128. The figure on page 79 shows 250 independent observations of the random vector $\begin{bmatrix} x \\ y \end{bmatrix}$.

• a. 2 points Draw in by hand the approximate location of $E\Bigl[\begin{bmatrix} x \\ y \end{bmatrix}\Bigr]$ and the graph of E[y|x]. Draw into the second diagram the approximate marginal density of x.

• b. 2 points Is there a law that the graph of the conditional expectation E[y|x] always goes through the point $E\Bigl[\begin{bmatrix} x \\ y \end{bmatrix}\Bigr]$, either for arbitrary probability distributions for which these expectations exist, or perhaps for an important special case? Indicate how this could be proved or otherwise give (maybe geometrically) a simple counterexample.

Answer. This is not the law of iterated expectations. It is true for jointly normal variables, not in general. It is also true if x and y are independent; then the graph of E[y|x] is a horizontal line at the height of the unconditional expectation E[y]. A distribution whose probability mass follows a U-shaped curve has its unconditional mean in the center of the U, i.e., here the unconditional mean does not lie on the curve drawn out by the conditional mean.

• c. 2 points Do you have any ideas how the strange-looking cluster of points in the figure on page 79 was generated?

Problem 129. 2 points Given two independent random variables x and y with density functions $f_x(x)$ and $g_y(y)$. Write down their joint, marginal, and conditional densities.

Answer. Joint density: $f_{x,y}(x,y) = f_x(x)\,g_y(y)$.

Marginal density of x is $\int_{-\infty}^{\infty} f_x(x)\,g_y(y)\,dy = f_x(x)\int_{-\infty}^{\infty} g_y(y)\,dy = f_x(x)$, and that of y is $g_y(y)$.
The text of the question should have been: “Given two independent random variables x and y with marginal density functions $f_x(x)$ and $g_y(y)$”; by just calling them “density functions” without specifying “marginal” it committed the error of de-totalization, i.e., it treated elements of a totality, i.e., of an ensemble in which each depends on everything else, as if they could be defined independently of each other.

Conditional density functions: $f_{x|y=y}(x; y) = f_x(x)$ (i.e., it does not depend on y); and $g_{y|x=x}(y; x) = g_y(y)$. You can see this by dividing the joint by the marginal.

[Figure on page 79: scatter plot of 250 independent observations of the random vector $\begin{bmatrix} x \\ y \end{bmatrix}$, referred to in Problem 128.]

6.7. Expected Values as Predictors

Expected values and conditional expected values have optimal properties as predictors.

Problem 130. 3 points What is the best predictor of a random variable y by a constant a, if the loss function is the “mean squared error” (MSE) $E[(y - a)^2]$?

Answer. Write E[y] = µ; then

$(y - a)^2 = \bigl((y - \mu) - (a - \mu)\bigr)^2 = (y - \mu)^2 - 2(y - \mu)(a - \mu) + (a - \mu)^2$;

(6.7.1) therefore $E[(y - a)^2] = E[(y - \mu)^2] - 0 + (a - \mu)^2$.

This is minimized by a = µ.

The expected value of y is therefore that constant which, as predictor of y, has smallest MSE. What if we want to predict y not by a constant but by a function of the random vector x, call it h(x)?

Problem 131. 2 points Assume the vector $x = [x_1, \ldots, x_j]^\top$ and the scalar y are jointly distributed random variables, and assume conditional means exist. x is observed, but y is not observed. The joint distribution of x and y is known.
Show that the conditional expectation E[y|x] is the minimum MSE predictor of y given x, i.e., show that for any other function of x, call it h(x), the following inequality holds:

(6.7.2) $E\bigl[(y - h(x))^2\bigr] \ge E\bigl[(y - E[y|x])^2\bigr]$.

For this proof and the proofs required in Problems 132 and 133, you may use (1) the theorem of iterated expectations $E\bigl[E[y|x]\bigr] = E[y]$, (2) the additivity $E[g(y) + h(y)|x] = E[g(y)|x] + E[h(y)|x]$, and (3) the fact that $E[g(x)h(y)|x] = g(x)\,E[h(y)|x]$. Be very specific about which rules you are applying at every step. You must show that you understand what you are writing down.

Answer.

(6.7.3) $E\bigl[(y - h(x))^2\bigr] = E\Bigl[\bigl((y - E[y|x]) - (h(x) - E[y|x])\bigr)^2\Bigr] = E[(y - E[y|x])^2] - 2\,E[(y - E[y|x])(h(x) - E[y|x])] + E[(h(x) - E[y|x])^2]$.

Here the cross product term $E[(y - E[y|x])(h(x) - E[y|x])]$ is zero. In order to see this, first use the law of iterated expectations

(6.7.4) $E[(y - E[y|x])(h(x) - E[y|x])] = E\bigl[E[(y - E[y|x])(h(x) - E[y|x])\,|\,x]\bigr]$

and then look at the inner term, not yet doing the outer expectation:

$E[(y - E[y|x])(h(x) - E[y|x])\,|\,x] = (h(x) - E[y|x])\,E[(y - E[y|x])\,|\,x] = (h(x) - E[y|x])(E[y|x] - E[y|x]) = (h(x) - E[y|x]) \cdot 0 = 0$.

Plugging this into (6.7.4) gives $E[(y - E[y|x])(h(x) - E[y|x])] = E[0] = 0$. With the cross product term gone, (6.7.3) reduces to $E[(y - h(x))^2] = E[(y - E[y|x])^2] + E[(h(x) - E[y|x])^2]$, and since the last term is nonnegative, (6.7.2) follows.

This is one of the few clear-cut results in probability theory where a best estimator/predictor exists. In this case, however, all parameters of the distribution are known; the only uncertainty comes from the fact that some random variables are unobserved.

Problem 132. Assume the vector $x = [x_1, \ldots, x_j]^\top$ and the scalar y are jointly distributed random variables, and assume conditional means exist. Define $\varepsilon = y - E[y|x]$.

• a. 5 points Demonstrate the following identities:

(6.7.5) $E[\varepsilon|x] = 0$
(6.7.6) $E[\varepsilon] = 0$
(6.7.7) $E[x_i \varepsilon|x] = 0$ for all i, 1 ≤ i ≤ j
(6.7.8) $E[x_i \varepsilon] = 0$ for all i, 1 ≤ i ≤ j
(6.7.9) $\operatorname{cov}[x_i, \varepsilon] = 0$ for all i, 1 ≤ i ≤ j.
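As a numerical sanity check (not a proof), identities (6.7.6), (6.7.8), and (6.7.9) and the inequality (6.7.2) can be verified exactly on a small discrete distribution with scalar x. Everything specific in this sketch is an assumption made for illustration: the joint probability mass function `pmf` and the competing predictor h(x) = 1 + x are invented, not taken from the notes.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of a scalar x and y; illustrative numbers only.
pmf = {
    (0, 1): F(1, 8), (0, 3): F(1, 8),
    (1, 0): F(1, 4), (1, 2): F(1, 4),
    (2, 5): F(1, 4),
}

def expect(fn):
    """Expected value of fn(x, y) under the joint pmf."""
    return sum(p * fn(x, y) for (x, y), p in pmf.items())

def cond_mean_y(x0):
    """E[y | x = x0] for a value x0 that has positive probability."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px

def eps(x, y):
    """Prediction error of the conditional mean: y - E[y|x]."""
    return y - cond_mean_y(x)

e_eps = expect(eps)                                      # (6.7.6)
e_x_eps = expect(lambda x, y: x * eps(x, y))             # (6.7.8)
cov_x_eps = e_x_eps - expect(lambda x, y: x) * e_eps     # (6.7.9)

# MSE of the conditional mean versus an arbitrary other predictor h(x) = 1 + x.
mse_best = expect(lambda x, y: eps(x, y) ** 2)
mse_other = expect(lambda x, y: (y - (1 + x)) ** 2)

print(e_eps, e_x_eps, cov_x_eps, mse_best, mse_other)
```

Any other choice of h(x) can be substituted in the last line of the computation; by (6.7.2) its MSE cannot fall below `mse_best`.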
Interpretation of (6.7.9): ε is the error in the best prediction of y based on x. If this error were correlated with one of the components $x_i$, then this correlation could be used to construct a better prediction of y.

Answer. (6.7.5): $E[\varepsilon|x] = E[y|x] - E\bigl[E[y|x]\,|\,x\bigr] = 0$ since E[y|x] is a function of x and therefore equal to its own expectation conditionally on x. (This is not the law of iterated expectations but the law that the expected value of a constant is a constant.)

(6.7.6) follows from (6.7.5) (i.e., (6.7.5) is stronger than (6.7.6)): if an expectation is zero conditionally on every possible outcome of x then it is zero altogether. In formulas, $E[\varepsilon] = E\bigl[E[\varepsilon|x]\bigr] = E[0] = 0$. It is also easy to show it in one swoop, without using (6.7.5): $E[\varepsilon] = E[y - E[y|x]] = 0$. Either way you need the law of iterated expectations for this.

(6.7.7): $E[x_i \varepsilon|x] = x_i\,E[\varepsilon|x] = 0$.

(6.7.8): $E[x_i \varepsilon] = E\bigl[E[x_i \varepsilon|x]\bigr] = E[0] = 0$; or in one swoop: $E[x_i \varepsilon] = E\bigl[x_i y - x_i E[y|x]\bigr] = E\bigl[x_i y - E[x_i y|x]\bigr] = E[x_i y] - E[x_i y] = 0$. The following “proof” is not correct: $E[x_i \varepsilon] = E[x_i]\,E[\varepsilon] = E[x_i] \cdot 0 = 0$. $x_i$ and ε are generally not independent, therefore the multiplication rule $E[x_i \varepsilon] = E[x_i]\,E[\varepsilon]$ cannot be used. Of course, the following “proof” does not work either: $E[x_i \varepsilon] = x_i\,E[\varepsilon] = x_i \cdot 0 = 0$. $x_i$ is a random variable and $E[x_i \varepsilon]$ is a constant; therefore $E[x_i \varepsilon] = x_i\,E[\varepsilon]$ cannot hold.

(6.7.9): $\operatorname{cov}[x_i, \varepsilon] = E[x_i \varepsilon] - E[x_i]\,E[\varepsilon] = 0 - E[x_i] \cdot 0 = 0$.

• b. 2 points This part can only be done after discussing the multivariate normal distribution: If x and y are jointly normal, show that x and ε are independent, and that the variance of ε does not depend on x. (This is why one can consider it an error term.)

Answer. If x and y are jointly normal, then x and ε are jointly normal as well, and independence follows from the fact that their covariance is zero.
The variance is constant because in the Normal case, the conditional variance is constant, i.e., $E[\varepsilon^2] = E\bigl[E[\varepsilon^2|x]\bigr]$ = constant (does not depend on x).

Problem 133. 5 points Under the permanent income hypothesis, the assumption is made that consumers’ lifetime utility is highest if the same amount is consumed every year. The utility-maximizing level of consumption c for a given consumer depends on the actual state of the economy in each of the n years of the consumer’s life, $c = f(y_1, \ldots, y_n)$. Since c depends on future states of the economy, which are not known, it is impossible for the consumer to know this optimal c in advance; but it is assumed that the function f and the joint distribution of $y_1, \ldots, y_n$ are known to him. Therefore in period t, when he only knows the values of $y_1, \ldots, y_t$, but not yet the future values, the consumer decides to consume the amount $c_t = E[c|y_1, \ldots, y_t]$, which is the best possible prediction of c given the information available to him. Show that in this situation, $c_{t+1} - c_t$ is uncorrelated with all $y_1, \ldots, y_t$. This implication of the permanent income hypothesis can be tested empirically, see [Hal78]. Hint: you are allowed to use without proof the following extension of the theorem of iterated expectations:

(6.7.10) $E\bigl[E[x|y, z]\,\big|\,y\bigr] = E[x|y]$.

Here is an explanation of (6.7.10): E[x|y] is the best predictor of x based on information set y. E[x|y, z] is the best predictor of x based on the extended information set consisting of y and z. $E\bigl[E[x|y, z]\,\big|\,y\bigr]$ is therefore my prediction, based on y only, how I will refine my prediction when z becomes available as well. Its equality with E[x|y], i.e., (6.7.10), says therefore that I cannot predict how I will change my mind after better information becomes available.

Answer. In (6.7.10) set $x = c = f(y_1, \ldots, y_t, y_{t+1}, \ldots, y_n)$, $y = [y_1, \ldots, y_t]^\top$, and $z = y_{t+1}$ to get

(6.7.11) $E\bigl[E[c|y_1, \ldots, y_{t+1}]\,\big|\,y_1, \ldots, y_t\bigr] = E[c|y_1, \ldots, y_t]$.

Writing $c_t$ for $E[c|y_1, \ldots, y_t]$, this becomes $E[c_{t+1}|y_1, \ldots, y_t] = c_t$, i.e., $c_t$ is not only the best predictor of c, but also that of $c_{t+1}$. The change in consumption $c_{t+1} - c_t$ is therefore the prediction error, which is uncorrelated with the conditioning variables, as shown in Problem 132.

Problem 134. 3 points Show that for any two random variables x and y whose covariance exists, the following equation holds:

(6.7.12) $\operatorname{cov}[x, y] = \operatorname{cov}\bigl[x, E[y|x]\bigr]$.

Note: Since E[y|x] is the best predictor of y based on the observation of x, (6.7.12) can also be written as

(6.7.13) $\operatorname{cov}\bigl[x, (y - E[y|x])\bigr] = 0$,

i.e., x is uncorrelated with the prediction error of the best prediction of y given x. (Nothing to prove for this Note.)

Answer. Apply (6.1.6) to the righthand side of (6.7.12):

(6.7.14) $\operatorname{cov}\bigl[x, E[y|x]\bigr] = E\bigl[x\,E[y|x]\bigr] - E[x]\,E\bigl[E[y|x]\bigr] = E\bigl[E[xy|x]\bigr] - E[x]\,E[y] = E[xy] - E[x]\,E[y] = \operatorname{cov}[x, y]$.

The tricky part here is to see that $x\,E[y|x] = E[xy|x]$.

Problem 135. Assume x and y have a joint density function $f_{x,y}(x, y)$ which is symmetric about the x-axis, i.e., $f_{x,y}(x, y) = f_{x,y}(x, -y)$. Also assume that variances and covariances exist. Show that cov[x, y] = 0. Hint: one way to do it is to look at E[y|x].

Answer. We know that $\operatorname{cov}[x, y] = \operatorname{cov}\bigl[x, E[y|x]\bigr]$. Furthermore, from symmetry follows E[y|x] = 0. Therefore cov[x, y] = cov[x, 0] = 0. Here is a detailed proof of E[y|x] = 0: $E[y|x=x] = \int_{-\infty}^{\infty} y\,\frac{f_{x,y}(x,y)}{f_x(x)}\,dy$. Now substitute z = −y, then also dz = −dy, and the boundaries of integration are reversed:

(6.7.15) $E[y|x=x] = \int_{\infty}^{-\infty} z\,\frac{f_{x,y}(x,-z)}{f_x(x)}\,dz = \int_{\infty}^{-\infty} z\,\frac{f_{x,y}(x,z)}{f_x(x)}\,dz = -E[y|x=x]$,

so E[y|x=x] must be zero. One can also prove directly under this presupposition cov[x, y] = cov[x, −y] and therefore it must be zero.

Problem 136. [Wit85, footnote on p.
241] Let p be the logarithm of the price level, m the logarithm of the money supply, and x a variable representing real influences on the price level (for instance productivity). We will work in a model of the economy in which p = m + γx, where γ is a nonrandom parameter, and m and x are independent normal with expected values $\mu_m$, $\mu_x$, and variances $\sigma_m^2$, $\sigma_x^2$. According to the rational expectations assumption, the economic agents know the probability distribution of the economy they live in, i.e., they know the expected values and variances of m and x and the value of γ. But they are unable to observe m and x, they can only observe p. Then the best predictor of x using p is the conditional expectation E[x|p].

• a. Assume you are one of these agents and you observe p = p. How great would you predict x to be, i.e., what is the value of E[x|p = p]?

Answer. It is, according to formula (7.3.18), $E[x|p = p] = \mu_x + \frac{\operatorname{cov}(x,p)}{\operatorname{var}(p)}\,(p - E[p])$. Now $E[p] = \mu_m + \gamma\mu_x$, $\operatorname{cov}[x, p] = \operatorname{cov}[x, m] + \gamma \operatorname{cov}[x, x] = \gamma\sigma_x^2$, and $\operatorname{var}(p) = \sigma_m^2 + \gamma^2\sigma_x^2$. Therefore

(6.7.16) $E[x|p = p] = \mu_x + \frac{\gamma\sigma_x^2}{\sigma_m^2 + \gamma^2\sigma_x^2}\,(p - \mu_m - \gamma\mu_x)$.

• b. Define the prediction error ε = x − E[x|p]. Compute expected value and variance of ε.

Answer.

(6.7.17) $\varepsilon = x - \mu_x - \frac{\gamma\sigma_x^2}{\sigma_m^2 + \gamma^2\sigma_x^2}\,(p - \mu_m - \gamma\mu_x)$.

This has zero expected value, and its variance is

(6.7.18) $\operatorname{var}[\varepsilon] = \operatorname{var}[x] + \Bigl(\frac{\gamma\sigma_x^2}{\sigma_m^2 + \gamma^2\sigma_x^2}\Bigr)^2 \operatorname{var}[p] - 2\,\frac{\gamma\sigma_x^2}{\sigma_m^2 + \gamma^2\sigma_x^2}\,\operatorname{cov}[x, p]$

(6.7.19) $= \sigma_x^2 + \frac{\gamma^2(\sigma_x^2)^2}{\sigma_m^2 + \gamma^2\sigma_x^2} - 2\,\frac{\gamma^2(\sigma_x^2)^2}{\sigma_m^2 + \gamma^2\sigma_x^2}$

(6.7.20) $= \frac{\sigma_x^2\,\sigma_m^2}{\sigma_m^2 + \gamma^2\sigma_x^2} = \frac{\sigma_x^2}{1 + \gamma^2\sigma_x^2/\sigma_m^2}$.

• c. In an attempt to fine tune the economy, the central bank increases $\sigma_m^2$. Does that increase or decrease var(ε)?

Answer. From (6.7.20) follows that it increases the variance.

6.8.
Transformation of Vector Random Variables

In order to obtain the density or probability mass function of a one-to-one transformation of random variables, we have to follow the same 4 steps described in Section 3.6 for a scalar random variable. (1) Determine A, the range of the new variable, whose density we want to compute; (2) express the old variable, the one whose density/mass function is known, in terms of the new variable, the one whose density or mass function is needed. If that of $\begin{bmatrix} x \\ y \end{bmatrix}$ is known, set $\begin{bmatrix} x \\ y \end{bmatrix} = t(u, v)$. Here t is a vector-valued function (i.e., it could be written $t(u, v) = \begin{bmatrix} q(u, v) \\ r(u, v) \end{bmatrix}$, but we will use one symbol t for this whole transformation), and you have to check that it is one-to-one on A, i.e., $t(u, v) = t(u_1, v_1)$ implies $u = u_1$ and $v = v_1$ for all $(u, v)$ and $(u_1, v_1)$ in A. (A function for which two different arguments $(u, v)$ and $(u_1, v_1)$ give the same function value is called many-to-one.)

If the joint probability distribution of x and y is described by a probability mass function, then the joint probability mass function of u and v can simply be obtained by substituting t into the joint probability mass function of x and y (and it is zero for any values which are not in A):

(6.8.1) $p_{u,v}(u, v) = \Pr\Bigl[\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}\Bigr] = \Pr\bigl[t(u, v) = t(u, v)\bigr] = \Pr\Bigl[\begin{bmatrix} x \\ y \end{bmatrix} = t(u, v)\Bigr] = p_{x,y}\bigl(t(u, v)\bigr)$.

Inside each Pr[·], the lefthand side of the equal sign is the random vector and the righthand side is the value it takes. The second equal sign is where the condition enters that $t\colon \mathbb{R}^2 \to \mathbb{R}^2$ is one-to-one.

If one works with the density function instead of a mass function, one must perform an additional step besides substituting t. Since t is one-to-one, it follows

(6.8.2) $\Bigl\{\begin{bmatrix} u \\ v \end{bmatrix} \in dV_{u,v}\Bigr\} = \bigl\{t(u, v) \in t(dV)_{x,y}\bigr\}$.

Therefore

(6.8.3) $f_{u,v}(u, v)\,|dV_{u,v}| = \Pr\Bigl[\begin{bmatrix} u \\ v \end{bmatrix} \in dV_{u,v}\Bigr] = \Pr\bigl[t(u, v) \in t(dV)_{x,y}\bigr] = f_{x,y}\bigl(t(u, v)\bigr)\,|t(dV)_{x,y}|$

(6.8.4) $= f_{x,y}\bigl(t(u, v)\bigr)\,\frac{|t(dV)_{x,y}|}{|dV_{u,v}|}\,|dV_{u,v}|$.

The term $\frac{|t(dV)_{x,y}|}{|dV_{u,v}|}$ is the local magnification factor of the transformation t; analytically it is the absolute value |J| of the Jacobian determinant

(6.8.5) $J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} \frac{\partial q}{\partial u}(u, v) & \frac{\partial q}{\partial v}(u, v) \\ \frac{\partial r}{\partial u}(u, v) & \frac{\partial r}{\partial v}(u, v) \end{vmatrix}$.

Remember, u, v are the new and x, y the old variables. To compute J one has to express the old in terms of the new variables. If one expresses the new in terms of the old, one has to take the inverse of the corresponding determinant! The transformation rule for density functions can therefore be summarized as:

$(x, y) = t(u, v)$ one-to-one $\;\Rightarrow\; f_{u,v}(u, v) = f_{x,y}\bigl(t(u, v)\bigr)\,|J|$ where $J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix}$.

Problem 137. Let x and y be two random variables with joint density function $f_{x,y}(x, y)$.

• a. 3 points Define u = x + y. Derive the joint density function of u and y.

Answer. You have to express the “old” x and y as functions of the “new” u and y:

$x = u - y$, $y = y$, or $\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u \\ y \end{bmatrix}$, therefore $J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial y} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial y} \end{vmatrix} = \begin{vmatrix} 1 & -1 \\ 0 & 1 \end{vmatrix} = 1$.

Therefore

(6.8.6) $f_{u,y}(u, y) = f_{x,y}(u - y, y)$.

• b. 1 point Derive from this the following formula computing the density function $f_u(u)$ of the sum u = x + y from the joint density function $f_{x,y}(x, y)$ of x and y:

(6.8.7) $f_u(u) = \int_{y=-\infty}^{y=\infty} f_{x,y}(u - y, y)\,dy$.

Answer. Write down the joint density of u and y and then integrate y out, i.e., take its integral over y from −∞ to +∞:

(6.8.8) $f_u(u) = \int_{y=-\infty}^{y=\infty} f_{u,y}(u, y)\,dy = \int_{y=-\infty}^{y=\infty} f_{x,y}(u - y, y)\,dy$,

i.e., one integrates over all $\begin{bmatrix} x \\ y \end{bmatrix}$ with x + y = u.

Problem 138. 6 points Let x and y be independent and uniformly distributed over the interval [0, 1]. Compute the density function of u = x + y and draw its graph. Hint: you may use formula (6.8.7) for the density of the sum of two jointly distributed random variables. An alternative approach would be to first compute the cumulative distribution function Pr[x + y ≤ u] for all u.

Answer.
Using equation (6.8.7):

(6.8.9) $f_{x+y}(u) = \int_{-\infty}^{\infty} f_{x,y}(u - y, y)\,dy = \begin{cases} u & \text{for } 0 \le u \le 1 \\ 2 - u & \text{for } 1 \le u \le 2 \\ 0 & \text{otherwise.} \end{cases}$

[Figure: graph of the triangular density $f_{x+y}(u)$.]

To help evaluate this integral, here is the area in the u, y-plane (u = x + y on the horizontal and y on the vertical axis) in which $f_{x,y}(u - y, y)$ has the value 1:

[Figure: parallelogram in the u, y-plane.]

This is the area between (0,0), (1,1), (2,1), and (1,0).

One can also show it this way: $f_{x,y}(x, y) = 1$ iff 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Now take any fixed u. It must be between 0 and 2. First assume 0 ≤ u ≤ 1: then $f_{x,y}(u - y, y) = 1$ iff 0 ≤ u − y ≤ 1 and 0 ≤ y ≤ 1 iff 0 ≤ y ≤ u. Now assume 1 ≤ u ≤ 2: then $f_{x,y}(u - y, y) = 1$ iff u − 1 ≤ y ≤ 1.

Problem 139. Assume $\begin{bmatrix} x \\ y \end{bmatrix}$ is uniformly distributed on a round disk around the origin with radius 10.

• a. 4 points Derive the joint density, the marginal density of x, and the conditional density of y given x = x.

• b. 3 points Now let us go over to polar coordinates r and φ, which satisfy

(6.8.10) $x = r\cos\varphi$, $y = r\sin\varphi$, i.e., the vector transformation t is $t\Bigl(\begin{bmatrix} r \\ \varphi \end{bmatrix}\Bigr) = \begin{bmatrix} r\cos\varphi \\ r\sin\varphi \end{bmatrix}$.

Which region in $\begin{bmatrix} r \\ \varphi \end{bmatrix}$-space is necessary to cover $\begin{bmatrix} x \\ y \end{bmatrix}$-space? Compute the Jacobian determinant of this transformation. Give an intuitive explanation in terms of local magnification factor of the formula you get. Finally compute the transformed density function.

• c. 1 point Compute cov[x, y].

• d. 2 points Compute the conditional variance var[y|x = x].

• e. 2 points Are x and y independent?

Problem 140. [Ame85, pp. 296–7] Assume three transportation choices are available: bus, train, and car. If you pick at random a neoclassical individual ω and ask him or her which utility this person derives from using bus, train, and car, the answer will be three numbers $u_1(\omega)$, $u_2(\omega)$, $u_3(\omega)$. Here $u_1$, $u_2$, and $u_3$ are assumed to be independent random variables with the following cumulative distribution functions:

(6.8.11) $\Pr[u_i \le u] = F_i(u) = \exp\bigl(-\exp(\mu_i - u)\bigr)$, i = 1, 2, 3.
I.e., the functional form is the same for all three transportation choices (exp indicates the exponential function); the $F_i$ only differ by the parameters $\mu_i$. These probability distributions are called Type I extreme value distributions, or log Weibull distributions.

Often these kinds of models are set up in such a way that the $\mu_i$ depend on the income etc. of the individual, but we assume for this exercise that this distribution applies to the population as a whole.

• a. 1 point Show that the $F_i$ are indeed cumulative distribution functions, and derive the density functions $f_i(u)$.

Individual ω likes cars best if and only if his utilities satisfy $u_3(\omega) \ge u_1(\omega)$ and $u_3(\omega) \ge u_2(\omega)$. Let I be a function of three arguments such that $I(u_1, u_2, u_3)$ is the indicator function of the event that one randomly chooses an individual ω who likes cars best, i.e.,

(6.8.12) $I(u_1, u_2, u_3) = \begin{cases} 1 & \text{if } u_1 \le u_3 \text{ and } u_2 \le u_3 \\ 0 & \text{otherwise.} \end{cases}$

Then $\Pr[\text{car}] = E[I(u_1, u_2, u_3)]$. The following steps serve to compute this probability:

• b. 2 points For any fixed number u, define $g(u) = E[I(u_1, u_2, u_3)|u_3 = u]$. Show that

(6.8.13) $g(u) = \exp\bigl(-\exp(\mu_1 - u) - \exp(\mu_2 - u)\bigr)$.

• c. 2 points This here is merely the evaluation of an integral. Show that

$\int_{-\infty}^{+\infty} \exp\bigl(-\exp(\mu_1 - u) - \exp(\mu_2 - u) - \exp(\mu_3 - u)\bigr)\,\exp(\mu_3 - u)\,du = \frac{\exp\mu_3}{\exp\mu_1 + \exp\mu_2 + \exp\mu_3}$.

Hint: use substitution rule with $y = -\exp(\mu_1 - u) - \exp(\mu_2 - u) - \exp(\mu_3 - u)$.

• d. 1 point Use b and c to show that

(6.8.14) $\Pr[\text{car}] = \frac{\exp\mu_3}{\exp\mu_1 + \exp\mu_2 + \exp\mu_3}$.

CHAPTER 7

The Multivariate Normal Probability Distribution

7.1. More About the Univariate Case

By definition, z is a standard normal variable, in symbols, z ∼ N(0, 1), if it has the density function

(7.1.1) $f_z(z) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}$.

To verify that this is a density function we have to check two conditions. (1) It is everywhere nonnegative. (2) Its integral from −∞ to ∞ is 1.
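Condition (2) can be made plausible numerically before it is established analytically. The following is a rough sketch, not a proof: the helper names, the truncation of the integration range to [−10, 10], and the number of trapezoid panels are arbitrary choices made here for illustration.

```python
import math

def standard_normal_density(z):
    """Density (7.1.1) of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def trapezoid(f, a, b, n):
    """Composite trapezoid rule for the integral of f over [a, b] with n panels."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h

# The density is negligible outside [-10, 10], so the truncated integral
# should be very close to the integral over the whole real line.
integral = trapezoid(standard_normal_density, -10.0, 10.0, 20_000)
print(integral)
```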
In order to evaluate this integral, it is easier to work with the independent product of two standard normal variables x and y; their joint density function is $f_{x,y}(x, y) = \frac{1}{2\pi}\,e^{-\frac{x^2+y^2}{2}}$. In order to see that this joint density integrates to 1, go over to polar coordinates $x = r\cos\varphi$, $y = r\sin\varphi$, i.e., compute the joint distribution of r and φ from that of x and y: the absolute value of the Jacobian determinant is r, i.e., $dx\,dy = r\,dr\,d\varphi$, therefore

(7.1.2) $\int_{y=-\infty}^{y=\infty}\int_{x=-\infty}^{x=\infty} \frac{1}{2\pi}\,e^{-\frac{x^2+y^2}{2}}\,dx\,dy = \int_{\varphi=0}^{2\pi}\int_{r=0}^{\infty} \frac{1}{2\pi}\,e^{-\frac{r^2}{2}}\,r\,dr\,d\varphi$.

By substituting $t = r^2/2$, therefore $dt = r\,dr$, the inner integral becomes $-\frac{1}{2\pi}\,e^{-t}\Big|_0^\infty = \frac{1}{2\pi}$; therefore the whole integral is 1. Therefore the product of the integrals of the marginal densities is 1, and since each such marginal integral is positive and they are equal, each of the marginal integrals is 1 too.

Problem 141. 6 points The Gamma function can be defined as $\Gamma(r) = \int_0^\infty x^{r-1} e^{-x}\,dx$. Show that $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$. (Hint: after substituting r = 1/2, apply the variable transformation $x = z^2/2$ for nonnegative x and z only, and then reduce the resulting integral to the integral over the normal density function.)

Answer. Then $dx = z\,dz$, $\frac{dx}{\sqrt{x}} = dz\,\sqrt{2}$. Therefore one can reduce it to the integral over the normal density:

(7.1.3) $\int_0^\infty \frac{1}{\sqrt{x}}\,e^{-x}\,dx = \sqrt{2}\int_0^\infty e^{-z^2/2}\,dz = \frac{1}{\sqrt{2}}\int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \frac{\sqrt{2\pi}}{\sqrt{2}} = \sqrt{\pi}$.

A univariate normal variable with mean µ and variance $\sigma^2$ is a variable x whose standardized version $z = \frac{x-\mu}{\sigma} \sim N(0, 1)$. In this transformation from x to z, the Jacobian determinant is $\frac{dz}{dx} = \frac{1}{\sigma}$; therefore the density function of $x \sim N(\mu, \sigma^2)$ is (two notations, the second is perhaps more modern):

(7.1.4) $f_x(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-1/2} \exp\bigl(-(x - \mu)^2/2\sigma^2\bigr)$.

Problem 142. 3 points Given n independent observations of a Normally distributed variable y ∼ N(µ, 1). Show that the sample mean $\bar{y}$ is a sufficient statistic for µ.
Here is a formulation of the factorization theorem for sufficient statistics, which you will need for this question: Given a family of probability densities $f_y(y_1, \ldots, y_n; \theta)$ defined on $\mathbb{R}^n$, which depend on a parameter θ ∈ Θ. The statistic $T\colon \mathbb{R}^n \to \mathbb{R}$, $y_1, \ldots, y_n \mapsto T(y_1, \ldots, y_n)$ is sufficient for parameter θ if and only if there exists a function of two variables $g\colon \mathbb{R} \times \Theta \to \mathbb{R}$, $t, \theta \mapsto g(t; \theta)$, and a function of n variables $h\colon \mathbb{R}^n \to \mathbb{R}$, $y_1, \ldots, y_n \mapsto h(y_1, \ldots, y_n)$ so that

(7.1.5) $f_y(y_1, \ldots, y_n; \theta) = g\bigl(T(y_1, \ldots, y_n); \theta\bigr) \cdot h(y_1, \ldots, y_n)$.

Answer. The joint density function can be written (factorization indicated by ·):

(7.1.6) $(2\pi)^{-n/2} \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\Bigr) = (2\pi)^{-n/2} \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n}(y_i - \bar{y})^2\Bigr) \cdot \exp\Bigl(-\frac{n}{2}(\bar{y} - \mu)^2\Bigr) = h(y_1, \ldots, y_n) \cdot g(\bar{y}; \mu)$.

7.2. Definition of Multivariate Normal

The multivariate normal distribution is an important family of distributions with very nice properties. But one must be a little careful how to define it. One might naively think a multivariate Normal is a vector random variable each component of which is univariate Normal. But this is not the right definition. Normality of the components is a necessary but not sufficient condition for a multivariate normal vector. If $u = \begin{bmatrix} x \\ y \end{bmatrix}$ with both x and y multivariate normal, u is not necessarily multivariate normal.

Here is a recursive definition from which one gets all multivariate normal distributions:

(1) The univariate standard normal z, considered as a vector with one component, is multivariate normal.
(2) If x and y are multivariate normal and they are independent, then $u = \begin{bmatrix} x \\ y \end{bmatrix}$ is multivariate normal.
(3) If y is multivariate normal, and A a matrix of constants (which need not be square and is allowed to be singular), and b a vector of constants, then Ay + b is multivariate normal. In words: A vector consisting of linear combinations of the same set of multivariate normal variables is again multivariate normal.
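The recursive definition can be turned into a small simulation: rules (1) and (2) give a vector z of independent standard normals, and rule (3) maps it to Az + b. The sketch below is only an illustration under assumed inputs: the matrix `A` (deliberately not square), the vector `b`, the seed, and the sample size are arbitrary, and the helper names are hypothetical. It relies on the standard fact that Az + b has mean b and covariance matrix A Aᵀ when z has identity covariance.

```python
import random

def affine_normal_sample(A, b, rng):
    """Draw one vector A z + b, where z has independent N(0, 1) components."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(A[0]))]
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(A, b)]

def sample_mean_and_cov(draws):
    """Sample mean vector and sample covariance matrix of a list of vectors."""
    n, d = len(draws), len(draws[0])
    mean = [sum(v[i] for v in draws) / n for i in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in draws) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

rng = random.Random(0)           # fixed seed so the run is reproducible
A = [[1.0, 0.0, 2.0],
     [0.5, 1.0, -1.0]]           # 2 x 3: A need not be square
b = [1.0, -2.0]

draws = [affine_normal_sample(A, b, rng) for _ in range(100_000)]
mean, cov = sample_mean_and_cov(draws)

# Theoretical covariance matrix A A^T of the transformed vector.
expected_cov = [[sum(p * q for p, q in zip(r1, r2)) for r2 in A] for r1 in A]

print(mean, cov, expected_cov)
```

With a sample this large, the sample mean and covariance should lie close to b and A Aᵀ, although they will not match exactly.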
For simplicity we will now go over to the bivariate Normal distribution.

7.3. Special Case: Bivariate Normal

The following two simple rules allow one to obtain all bivariate Normal random variables:

(1) If x and y are independent and each of them has a (univariate) normal distribution with mean 0 and the same variance $\sigma^2$, then they are bivariate normal. (They would be bivariate normal even if their variances were different and their means not zero, but for the calculations below we will use only this special case, which together with principle (2) is sufficient to get all bivariate normal distributions.)
(2) If $x = \begin{bmatrix} x \\ y \end{bmatrix}$ is bivariate normal and P is a 2 × 2 nonrandom matrix and µ a nonrandom column vector with two elements, then Px + µ is bivariate normal as well.

All other properties of bivariate Normal variables can be derived from this.

First let us derive the density function of a bivariate Normal distribution. Write $x = \begin{bmatrix} x \\ y \end{bmatrix}$. x and y are independent $N(0, \sigma^2)$. Therefore by principle (1) above the vector x is bivariate normal. Take any nonsingular 2 × 2 matrix P and a 2-vector $\mu = \begin{bmatrix} \mu \\ \nu \end{bmatrix}$, and define $\begin{bmatrix} u \\ v \end{bmatrix} = u = Px + \mu$. We need nonsingularity because otherwise the resulting variable would not have a bivariate density; its probability mass would be concentrated on one straight line in the two-dimensional plane. What is the joint density function of u? Since P is nonsingular, the transformation is one-to-one, therefore we can apply the transformation theorem for densities. Let us first write down the density function of x which we know:

(7.3.1) $f_{x,y}(x, y) = \frac{1}{2\pi\sigma^2} \exp\Bigl(-\frac{1}{2\sigma^2}(x^2 + y^2)\Bigr)$.

For the next step, remember that we have to express the old variable in terms of the new one: $x = P^{-1}(u - \mu)$. The Jacobian determinant is therefore $J = \det(P^{-1})$. Also notice that, after the substitution $\begin{bmatrix} x \\ y \end{bmatrix} = P^{-1}\begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}$, the exponent in the joint density function of x and y is $-\frac{1}{2\sigma^2}(x^2 + y^2) = -\frac{1}{2\sigma^2}\begin{bmatrix} x \\ y \end{bmatrix}^\top \begin{bmatrix} x \\ y \end{bmatrix} = -\frac{1}{2\sigma^2}\begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}^\top (P^{-1})^\top P^{-1} \begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}$. Therefore the transformation theorem of density functions gives

(7.3.2) $f_{u,v}(u, v) = \frac{1}{2\pi\sigma^2}\,\bigl|\det(P^{-1})\bigr|\,\exp\Bigl(-\frac{1}{2\sigma^2}\begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}^\top (P^{-1})^\top P^{-1} \begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}\Bigr)$.

This expression can be made nicer. Note that the covariance matrix of the transformed variables is $V\Bigl[\begin{bmatrix} u \\ v \end{bmatrix}\Bigr] = \sigma^2 P P^\top = \sigma^2 \Psi$, say. Since $(P^{-1})^\top P^{-1} P P^\top = I$, it follows $(P^{-1})^\top P^{-1} = \Psi^{-1}$ and $\bigl|\det(P^{-1})\bigr| = 1/\sqrt{\det(\Psi)}$, therefore

(7.3.3) $f_{u,v}(u, v) = \frac{1}{2\pi\sigma^2}\,\frac{1}{\sqrt{\det(\Psi)}}\,\exp\Bigl(-\frac{1}{2\sigma^2}\begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}^\top \Psi^{-1} \begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}\Bigr)$.

This is the general formula for the density function of a bivariate normal with nonsingular covariance matrix $\sigma^2\Psi$ and mean vector µ. One can also use the following notation which is valid for the multivariate Normal variable with n dimensions, with mean vector µ and nonsingular covariance matrix $\sigma^2\Psi$:

(7.3.4) $f_x(x) = (2\pi\sigma^2)^{-n/2} (\det\Psi)^{-1/2} \exp\Bigl(-\frac{1}{2\sigma^2}(x - \mu)^\top \Psi^{-1} (x - \mu)\Bigr)$.

Problem 143. 1 point Show that the matrix product of $(P^{-1})^\top P^{-1}$ and $P P^\top$ is the identity matrix.

Problem 144. 3 points All vectors in this question are n × 1 column vectors. Let y = α + ε, where α is a vector of constants and ε is jointly normal with E[ε] = o. Often, the covariance matrix V[ε] is not given directly, but a n × n nonsingular matrix T is known which has the property that the covariance matrix of Tε is $\sigma^2$ times the n × n unit matrix, i.e.,

(7.3.5) $V[T\varepsilon] = \sigma^2 I_n$.

Show that in this case the density function of y is

(7.3.6) $f_y(y) = (2\pi\sigma^2)^{-n/2}\,|\det(T)|\,\exp\Bigl(-\frac{1}{2\sigma^2}\bigl(T(y - \alpha)\bigr)^\top T(y - \alpha)\Bigr)$.

Hint: define z = Tε, write down the density function of z, and make a transformation between z and y.

Answer. Since E[z] = o and $V[z] = \sigma^2 I_n$, its density function is $(2\pi\sigma^2)^{-n/2} \exp(-z^\top z/2\sigma^2)$. Now express z, whose density we know, as a function of y, whose density function we want to know. z = T(y − α) or

(7.3.7) $z_1 = t_{11}(y_1 - \alpha_1) + t_{12}(y_2 - \alpha_2) + \cdots + t_{1n}(y_n - \alpha_n)$
(7.3.8) $\;\vdots$
(7.3.9) $z_n = t_{n1}(y_1 - \alpha_1) + t_{n2}(y_2 - \alpha_2) + \cdots + t_{nn}(y_n - \alpha_n)$

therefore the Jacobian determinant is det(T). This gives the result.

7.3.1. Most Natural Form of Bivariate Normal Density.

Problem 145. In this exercise we will write the bivariate normal density in its most natural form. For this we set the multiplicative “nuisance parameter” $\sigma^2 = 1$, i.e., write the covariance matrix as Ψ instead of $\sigma^2\Psi$.

• a. 1 point Write the covariance matrix $\Psi = V\Bigl[\begin{bmatrix} u \\ v \end{bmatrix}\Bigr]$ in terms of the standard deviations $\sigma_u$ and $\sigma_v$ and the correlation coefficient ρ.

• b. 1 point Show that the inverse of a 2 × 2 matrix has the following form:

(7.3.10) $\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

• c. 2 points Show that

(7.3.11) $q^2 = \begin{bmatrix} u - \mu & v - \nu \end{bmatrix} \Psi^{-1} \begin{bmatrix} u - \mu \\ v - \nu \end{bmatrix}$
(7.3.12) $= \frac{1}{1 - \rho^2}\Bigl(\frac{(u - \mu)^2}{\sigma_u^2} - 2\rho\,\frac{u - \mu}{\sigma_u}\,\frac{v - \nu}{\sigma_v} + \frac{(v - \nu)^2}{\sigma_v^2}\Bigr)$.

• d. 2 points Show the following quadratic decomposition:

(7.3.13) $q^2 = \frac{(u - \mu)^2}{\sigma_u^2} + \frac{1}{(1 - \rho^2)\sigma_v^2}\Bigl(v - \nu - \rho\,\frac{\sigma_v}{\sigma_u}(u - \mu)\Bigr)^2$.

• e. 1 point Show that (7.3.13) can also be written in the form

(7.3.14) $q^2 = \frac{(u - \mu)^2}{\sigma_u^2} + \frac{\sigma_u^2}{\sigma_u^2\sigma_v^2 - (\sigma_{uv})^2}\Bigl(v - \nu - \frac{\sigma_{uv}}{\sigma_u^2}(u - \mu)\Bigr)^2$.

• f. 1 point Show that $d = \sqrt{\det\Psi}$ can be split up, not additively but multiplicatively, as follows: $d = \sigma_u \cdot \sigma_v\sqrt{1 - \rho^2}$.

• g. 1 point Using these decompositions of d and $q^2$, show that the density function $f_{u,v}(u, v)$ reads

(7.3.15) $\frac{1}{\sqrt{2\pi\sigma_u^2}} \exp\Bigl(-\frac{(u - \mu)^2}{2\sigma_u^2}\Bigr) \cdot \frac{1}{\sqrt{2\pi\sigma_v^2}\,\sqrt{1 - \rho^2}} \exp\Bigl(-\frac{\bigl((v - \nu) - \rho\,\frac{\sigma_v}{\sigma_u}(u - \mu)\bigr)^2}{2(1 - \rho^2)\sigma_v^2}\Bigr)$.

The second factor in (7.3.15) is the density of a $N\bigl(\nu + \rho\,\frac{\sigma_v}{\sigma_u}(u - \mu),\ (1 - \rho^2)\sigma_v^2\bigr)$ evaluated at v (for µ = ν = 0 this is $N\bigl(\rho\,\frac{\sigma_v}{\sigma_u}u,\ (1 - \rho^2)\sigma_v^2\bigr)$), and the first factor does not depend on v. Therefore if I integrate v out to get the marginal density of u, this simply gives me the first factor. The conditional density of v given u = u is the joint divided by the marginal, i.e., it is the second factor.
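The factorization in (7.3.15) can be checked numerically by evaluating both sides at a few points. This is a minimal sketch with σ² = 1 as in Problem 145; the function names, the parameter values, and the test points are arbitrary choices made for illustration, not part of the notes.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint_pdf(u, v, mu, nu, su, sv, rho):
    """Bivariate normal density written with q^2 from (7.3.12) and d = sqrt(det Psi)."""
    zu, zv = (u - mu) / su, (v - nu) / sv
    q2 = (zu * zu - 2 * rho * zu * zv + zv * zv) / (1 - rho * rho)
    d = su * sv * math.sqrt(1 - rho * rho)
    return math.exp(-q2 / 2) / (2 * math.pi * d)

def factored_pdf(u, v, mu, nu, su, sv, rho):
    """Marginal density of u times conditional density of v given u, as in (7.3.15)."""
    marginal = normal_pdf(u, mu, su ** 2)
    cond_mean = nu + rho * sv / su * (u - mu)
    cond_var = (1 - rho ** 2) * sv ** 2
    return marginal * normal_pdf(v, cond_mean, cond_var)

params = dict(mu=1.0, nu=-0.5, su=2.0, sv=0.7, rho=-0.6)   # illustrative values
points = [(0.0, 0.0), (1.0, -0.5), (3.2, 0.4), (-2.0, 1.5)]
pairs = [(joint_pdf(u, v, **params), factored_pdf(u, v, **params)) for u, v in points]

for joint, factored in pairs:
    print(joint, factored)
```

The two numbers in each printed pair should agree up to floating-point rounding.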
In other words, by completing the square we wrote the joint density function in its natural form as the product of a marginal and a conditional density function: $f_{u,v}(u,v) = f_u(u)\cdot f_{v|u}(v;u)$.

From this decomposition one can draw the following conclusions (stated here for zero means):

• $u \sim N(0,\sigma_u^2)$ is normal and, by symmetry, $v$ is normal as well. Note that $u$ (or $v$) can be chosen to be any nonzero linear combination of $x$ and $y$. Any nonzero linear combination of independent standard normal variables is therefore univariate normal.

• If $\rho = 0$ then the joint density function is the product of two independent univariate normal density functions. In other words, if the variables are jointly normal, then they are independent whenever they are uncorrelated. For general distributions only the reverse is true.

• The conditional density of $\boldsymbol{v}$ conditionally on $\boldsymbol{u} = u$ is the second term on the rhs of (7.3.15), i.e., it is normal too.

• The conditional mean is

(7.3.16) $E[\boldsymbol{v}\mid\boldsymbol{u}=u] = \rho\frac{\sigma_v}{\sigma_u}u,$

i.e., it is a linear function of $u$. If the (unconditional) means are not zero, then the conditional mean is

(7.3.17) $E[\boldsymbol{v}\mid\boldsymbol{u}=u] = \mu_v + \rho\frac{\sigma_v}{\sigma_u}(u-\mu_u).$

Since $\rho = \frac{\operatorname{cov}[u,v]}{\sigma_u\sigma_v}$, (7.3.17) can also be written as follows:

(7.3.18) $E[\boldsymbol{v}\mid\boldsymbol{u}=u] = E[v] + \frac{\operatorname{cov}[u,v]}{\operatorname{var}[u]}(u-E[u]).$

• The conditional variance is the same whatever value of $u$ was chosen: its value is

(7.3.19) $\operatorname{var}[\boldsymbol{v}\mid\boldsymbol{u}=u] = \sigma_v^2(1-\rho^2),$

which can also be written as

(7.3.20) $\operatorname{var}[\boldsymbol{v}\mid\boldsymbol{u}=u] = \operatorname{var}[v] - \frac{(\operatorname{cov}[u,v])^2}{\operatorname{var}[u]}.$

We did this in such detail because any bivariate normal with zero mean has this form. A multivariate normal distribution is determined by its means and variances and covariances (or correlation coefficients). If the means are not zero, then the densities merely differ from the above by an additive constant in the arguments, i.e., if one needs formulas for nonzero mean, one has to replace $u$ and $v$ in the above equations by $u-\mu_u$ and $v-\mu_v$.
$du$ and $dv$ remain the same, because the Jacobian of the translation $u \mapsto u-\mu_u$, $v \mapsto v-\mu_v$ is 1. While the univariate normal was determined by mean and standard deviation, the bivariate normal is determined by the two means $\mu_u$ and $\mu_v$, the two standard deviations $\sigma_u$ and $\sigma_v$, and the correlation coefficient $\rho$.

7.3.2. Level Lines of the Normal Density.

Problem 146. 8 points Define the angle $\delta = \arccos(\rho)$, i.e., $\rho = \cos\delta$. In terms of $\delta$, the covariance matrix (??) has the form

(7.3.21) $\Psi = \begin{bmatrix}\sigma_u^2 & \sigma_u\sigma_v\cos\delta\\ \sigma_u\sigma_v\cos\delta & \sigma_v^2\end{bmatrix}.$

Show that for all $\varphi$, the vector

(7.3.22) $x = \begin{bmatrix}r\,\sigma_u\cos\varphi\\ r\,\sigma_v\cos(\varphi+\delta)\end{bmatrix}$

satisfies $x^\top\Psi^{-1}x = r^2$. The opposite holds too: all vectors $x$ satisfying $x^\top\Psi^{-1}x = r^2$ can be written in the form (7.3.22) for some $\varphi$, but I am not asking to prove this. This formula can be used to draw level lines of the bivariate Normal density and confidence ellipses, more details in (??).

Problem 147. The ellipse in Figure 1 contains all the points $x, y$ for which

(7.3.23) $\begin{bmatrix}x-1 & y-1\end{bmatrix}\begin{bmatrix}0.5 & -0.25\\ -0.25 & 1\end{bmatrix}^{-1}\begin{bmatrix}x-1\\ y-1\end{bmatrix} \le 6.$

• a. 3 points Compute the probability that a random variable

(7.3.24) $\begin{bmatrix}x\\ y\end{bmatrix} \sim N\Bigl(\begin{bmatrix}1\\ 1\end{bmatrix}, \begin{bmatrix}0.5 & -0.25\\ -0.25 & 1\end{bmatrix}\Bigr)$

falls into this ellipse. Hint: you should apply equation (7.4.9). Then you will have to look up the values of a $\chi^2$ distribution in a table, or use your statistics software to get it.

• b. 1 point Compute the standard deviations of $x$ and $y$, and the correlation coefficient $\operatorname{corr}(x,y)$.

• c. 2 points The vertical tangents to the ellipse in Figure 1 are at the locations $x = 1 \pm \sqrt{3}$. What is the probability that $\bigl[\begin{smallmatrix}x\\ y\end{smallmatrix}\bigr]$ falls between these two vertical tangents?

• d. 1 point The horizontal tangents are at the locations $y = 1 \pm \sqrt{6}$. What is the probability that $\bigl[\begin{smallmatrix}x\\ y\end{smallmatrix}\bigr]$ falls between the horizontal tangents?

• e. 1 point Now take an arbitrary linear combination $u = ax + by$. Write down its mean and its standard deviation.

• f.
1 point Show that the set of realizations $x, y$ for which $u$ lies less than $\sqrt{6}$ standard deviations away from its mean is

(7.3.25) $|a(x-1) + b(y-1)| \le \sqrt{6}\,\sqrt{a^2\operatorname{var}[x] + 2ab\operatorname{cov}[x,y] + b^2\operatorname{var}[y]}.$

The set of all these points forms a band limited by two parallel lines. What is the probability that $\bigl[\begin{smallmatrix}x\\ y\end{smallmatrix}\bigr]$ falls between these two lines?

• g. 1 point It is our purpose to show that this band is again tangent to the ellipse. This is easiest if we use matrix notation. Define

(7.3.26) $\mathbf{x} = \begin{bmatrix}x\\ y\end{bmatrix},\quad \boldsymbol{\mu} = \begin{bmatrix}1\\ 1\end{bmatrix},\quad \Psi = \begin{bmatrix}0.5 & -0.25\\ -0.25 & 1\end{bmatrix},\quad \mathbf{a} = \begin{bmatrix}a\\ b\end{bmatrix}.$

Equation (7.3.23) in matrix notation says: the ellipse contains all the points for which

(7.3.27) $(\mathbf{x}-\boldsymbol{\mu})^\top\Psi^{-1}(\mathbf{x}-\boldsymbol{\mu}) \le 6.$
[Figure 1. Level Line for Normal Density. Horizontal axis $x$ and vertical axis $y$, each running from −2 to 4.]

Show that the band defined by inequality (7.3.25) contains all the points for which

(7.3.28) $\frac{\bigl(\mathbf{a}^\top(\mathbf{x}-\boldsymbol{\mu})\bigr)^2}{\mathbf{a}^\top\Psi\mathbf{a}} \le 6.$

• h. 2 points Inequality (7.3.28) can also be written as:

(7.3.29) $(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{a}(\mathbf{a}^\top\Psi\mathbf{a})^{-1}\mathbf{a}^\top(\mathbf{x}-\boldsymbol{\mu}) \le 6$

or alternatively

(7.3.30) $\begin{bmatrix}x-1 & y-1\end{bmatrix}\begin{bmatrix}a\\ b\end{bmatrix}\Bigl(\begin{bmatrix}a & b\end{bmatrix}\Psi\begin{bmatrix}a\\ b\end{bmatrix}\Bigr)^{-1}\begin{bmatrix}a & b\end{bmatrix}\begin{bmatrix}x-1\\ y-1\end{bmatrix} \le 6.$

Show that the matrix

(7.3.31) $\Omega = \Psi^{-1} - \mathbf{a}(\mathbf{a}^\top\Psi\mathbf{a})^{-1}\mathbf{a}^\top$

satisfies $\Omega\Psi\Omega = \Omega$. Derive from this that $\Omega$ is nonnegative definite. Hint: you may use, without proof, that any symmetric matrix is nonnegative definite if and only if it can be written in the form $RR^\top$.

• i. 1 point As an aside: Show that $\Omega\Psi\mathbf{a} = o$ and derive from this that $\Omega$ is not positive definite but only nonnegative definite.

• j. 1 point Show that the following inequality holds for all $\mathbf{x}-\boldsymbol{\mu}$,

(7.3.32) $(\mathbf{x}-\boldsymbol{\mu})^\top\Psi^{-1}(\mathbf{x}-\boldsymbol{\mu}) \ge (\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{a}(\mathbf{a}^\top\Psi\mathbf{a})^{-1}\mathbf{a}^\top(\mathbf{x}-\boldsymbol{\mu}).$

In other words, if $\mathbf{x}$ lies in the ellipse then it also lies in each band. I.e., the ellipse is contained in the intersection of all the bands.

• k. 1 point Show: If $\mathbf{x}-\boldsymbol{\mu} = \Psi\mathbf{a}\alpha$ with some arbitrary scalar $\alpha$, then (7.3.32) is an equality, and if $\alpha = \pm\sqrt{6/\mathbf{a}^\top\Psi\mathbf{a}}$, then both sides in (7.3.32) have the value 6. I.e., the boundary of the ellipse and the boundary lines of the band intersect. Since the ellipse is completely inside the band, this can only be the case if the boundary lines of the band are tangent to the ellipse.

• l. 2 points The vertical lines in Figure 1 which are not tangent to the ellipse delimit a band which, if extended to infinity, has as much probability mass as the ellipse itself. Compute the $x$-coordinates of these two lines.

7.3.3. Miscellaneous Exercises.

Problem 148. Figure 2 shows the level line for a bivariate Normal density which contains 95% of the probability mass.
[Figure 2. Level Line of Bivariate Normal Density, see Problem 148. Both axes run from −1 to 3.]

• a. 3 points One of the following matrices is the covariance matrix of $\bigl[\begin{smallmatrix}x\\ y\end{smallmatrix}\bigr]$.

$\Psi_1 = \begin{bmatrix}0.62 & -0.56\\ -0.56 & 1.04\end{bmatrix}$, $\Psi_2 = \begin{bmatrix}1.85 & 1.67\\ 1.67 & 3.12\end{bmatrix}$, $\Psi_3 = \begin{bmatrix}0.62 & 0.56\\ 0.56 & 1.04\end{bmatrix}$, $\Psi_4 = \begin{bmatrix}1.85 & -1.67\\ 1.67 & 3.12\end{bmatrix}$,

$\Psi_5 = \begin{bmatrix}3.12 & -1.67\\ -1.67 & 1.85\end{bmatrix}$, $\Psi_6 = \begin{bmatrix}1.04 & 0.56\\ 0.56 & 0.62\end{bmatrix}$, $\Psi_7 = \begin{bmatrix}3.12 & 1.67\\ 1.67 & 1.85\end{bmatrix}$, $\Psi_8 = \begin{bmatrix}0.62 & 0.81\\ 0.81 & 1.04\end{bmatrix}$,

$\Psi_9 = \begin{bmatrix}3.12 & 1.67\\ 2.67 & 1.85\end{bmatrix}$, $\Psi_{10} = \begin{bmatrix}0.56 & 0.62\\ 0.62 & -1.04\end{bmatrix}$.

Which is it? Remember that for a univariate Normal, 95% of the probability mass lie within ±2 standard deviations from the mean. If you are not sure, cross out as many of these covariance matrices as possible and write down why you think they should be crossed out.

Answer. Covariance matrix must be symmetric, therefore we can cross out 4 and 9.
It must also be nonnegative definite (in particular, it must have nonnegative elements in the diagonal), therefore cross out 10, and a nonnegative determinant, therefore cross out 8. Covariance must be positive, so cross out 1 and 5. Variance in $x$-direction is smaller than in $y$-direction, therefore cross out 6 and 7. Remains 2 and 3. Of these it is number 3. By comparison with Figure 1 one can say that the vertical band between −0.6 and 2.6 and the horizontal band between 3 and −1 roughly have the same probability as the ellipse, namely 95%. Since a univariate Normal has 95% of its probability mass in an interval centered around the mean which is 4 standard deviations long, standard deviations must be approximately 0.8 in the horizontal and 1 in the vertical directions.

$\Psi_1$ is negatively correlated; $\Psi_2$ has the right correlation but is scaled too big; $\Psi_3$ this is it; $\Psi_4$ not symmetric; $\Psi_5$ negatively correlated, and $x$ has larger variance than $y$; $\Psi_6$ $x$ has larger variance than $y$; $\Psi_7$ too large, $x$ has larger variance than $y$; $\Psi_8$ not positive definite; $\Psi_9$ not symmetric; $\Psi_{10}$ not positive definite.

The next Problem constructs a counterexample which shows that a bivariate distribution, which is not bivariate Normal, can nevertheless have two marginal densities which are univariate Normal.

Problem 149. Let $x$ and $y$ be two independent standard normal random variables, and let $u$ and $v$ be bivariate normal with mean zero, variances $\sigma_u^2 = \sigma_v^2 = 1$, and correlation coefficient $\rho \ne 0$. Let $f_{x,y}$ and $f_{u,v}$ be the corresponding density functions, i.e.,

$f_{x,y}(a,b) = \frac{1}{2\pi}\exp\Bigl(-\frac{a^2+b^2}{2}\Bigr), \qquad f_{u,v}(a,b) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\Bigl(-\frac{a^2+b^2-2\rho ab}{2(1-\rho^2)}\Bigr).$

Assume the random variables $\boldsymbol{a}$ and $\boldsymbol{b}$ are defined by the following experiment: You flip a fair coin; if it shows head, then you observe $x$ and $y$ and give $\boldsymbol{a}$ the value observed on $x$, and $\boldsymbol{b}$ the value observed on $y$. If the coin shows tails, then you observe $u$ and $v$ and give $\boldsymbol{a}$ the value of $u$, and $\boldsymbol{b}$ the value of $v$.

• a.
Prove that the joint density of $\boldsymbol{a}$ and $\boldsymbol{b}$ is

(7.3.33) $f_{a,b}(a,b) = \frac{1}{2}f_{x,y}(a,b) + \frac{1}{2}f_{u,v}(a,b).$

Hint: first show the corresponding equation for the cumulative distribution functions.

Answer. Following this hint:

(7.3.34) $F_{a,b}(a,b) = \Pr[\boldsymbol{a}\le a \text{ and } \boldsymbol{b}\le b]$

(7.3.35) $\phantom{F_{a,b}(a,b)} = \Pr[\boldsymbol{a}\le a \text{ and } \boldsymbol{b}\le b \mid \text{head}]\Pr[\text{head}] + \Pr[\boldsymbol{a}\le a \text{ and } \boldsymbol{b}\le b \mid \text{tail}]\Pr[\text{tail}]$

(7.3.36) $\phantom{F_{a,b}(a,b)} = F_{x,y}(a,b)\,\frac{1}{2} + F_{u,v}(a,b)\,\frac{1}{2}.$

The density function is the function which, if integrated, gives the above cumulative distribution function.

• b. Show that the marginal distribution of $\boldsymbol{a}$ and $\boldsymbol{b}$ each is normal.

Answer. You can either argue it out: each of the above marginal distributions is standard normal, but you can also say integrate $b$ out; for this it is better to use form (7.3.15) for $f_{u,v}$, i.e., write

(7.3.37) $f_{u,v}(a,b) = \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{a^2}{2}\Bigr)\cdot\frac{1}{\sqrt{2\pi}\sqrt{1-\rho^2}}\exp\Bigl(-\frac{(b-\rho a)^2}{2(1-\rho^2)}\Bigr).$

Then you can see that the marginal is standard normal. Therefore you get a mixture of two distributions each of which is standard normal, therefore it is not really a mixture any more.

• c. Compute the density of $\boldsymbol{b}$ conditionally on $\boldsymbol{a} = 0$. What are its mean and variance? Is it a normal density?

Answer. $f_{b|a}(b;a) = \frac{f_{a,b}(a,b)}{f_a(a)}$. We don’t need it for every $a$, only for $a = 0$. Since $f_a(0) = 1/\sqrt{2\pi}$, therefore

(7.3.38) $f_{b|a=0}(b) = \sqrt{2\pi}\,f_{a,b}(0,b) = \frac{1}{2}\,\frac{1}{\sqrt{2\pi}}\exp\Bigl(\frac{-b^2}{2}\Bigr) + \frac{1}{2}\,\frac{1}{\sqrt{2\pi}\sqrt{1-\rho^2}}\exp\Bigl(\frac{-b^2}{2(1-\rho^2)}\Bigr).$

It is not normal, it is a mixture of normals with different variances. This has mean zero and variance $\frac{1}{2}\bigl(1 + (1-\rho^2)\bigr) = 1 - \frac{1}{2}\rho^2$.

• d. Are $\boldsymbol{a}$ and $\boldsymbol{b}$ jointly normal?

Answer. Since the conditional distribution is not normal, they cannot be jointly normal.

Problem 150. This is [HT83, 4.8-6 on p. 263] with variance $\sigma^2$ instead of 1: Let $x$ and $y$ be independent normal with mean 0 and variance $\sigma^2$. Go over to polar coordinates $r$ and $\varphi$, which satisfy

(7.3.39) $x = r\cos\varphi, \qquad y = r\sin\varphi.$

• a. 1 point Compute the Jacobian determinant.

Answer.
Express the variables whose density you know in terms of those whose density you want to know. The Jacobian determinant is

(7.3.40) $J = \begin{vmatrix}\frac{\partial x}{\partial r} & \frac{\partial x}{\partial\varphi}\\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial\varphi}\end{vmatrix} = \begin{vmatrix}\cos\varphi & -r\sin\varphi\\ \sin\varphi & r\cos\varphi\end{vmatrix} = \bigl((\cos\varphi)^2 + (\sin\varphi)^2\bigr)r = r.$

• b. 2 points Find the joint probability density function of $r$ and $\varphi$. Also indicate the area in $(r,\varphi)$ space in which it is nonzero.

Answer. $f_{x,y}(x,y) = \frac{1}{2\pi\sigma^2}e^{-(x^2+y^2)/2\sigma^2}$; therefore $f_{r,\varphi}(r,\varphi) = \frac{1}{2\pi\sigma^2}\,r\,e^{-r^2/2\sigma^2}$ for $0 \le r < \infty$ and $0 \le \varphi < 2\pi$.

• c. 3 points Find the marginal distributions of $r$ and $\varphi$. Hint: for one of the integrals it is convenient to make the substitution $q = r^2/2\sigma^2$.

Answer. $f_r(r) = \frac{1}{\sigma^2}\,r\,e^{-r^2/2\sigma^2}$ for $0 \le r < \infty$, and $f_\varphi(\varphi) = \frac{1}{2\pi}$ for $0 \le \varphi < 2\pi$. For the latter we need $\frac{1}{2\pi\sigma^2}\int_0^\infty r\,e^{-r^2/2\sigma^2}\,dr = \frac{1}{2\pi}$; set $q = r^2/2\sigma^2$, then $dq = \frac{1}{\sigma^2}\,r\,dr$, and the integral becomes $\frac{1}{2\pi}\int_0^\infty e^{-q}\,dq$.

• d. 1 point Are $r$ and $\varphi$ independent?

Answer. Yes, because joint density function is the product of the marginals.

7.4. Multivariate Standard Normal in Higher Dimensions

Here is an important fact about the multivariate normal, which one cannot see in two dimensions: if the partitioned vector $\bigl[\begin{smallmatrix}x\\ y\end{smallmatrix}\bigr]$ is jointly normal, and every component of $x$ is independent of every component of $y$, then the vectors $x$ and $y$ are already independent. Not surprised? You should be, see Problem 125.

Let’s go back to the construction scheme at the beginning of this chapter. First we will introduce the multivariate standard normal, which one obtains by applying only operations (1) and (2), i.e., it is a vector composed of independent univariate standard normals, and give some properties of it. Then we will go over to the multivariate normal with arbitrary covariance matrix, which is simply an arbitrary linear transformation of the multivariate standard normal. We will always carry the “nuisance parameter” $\sigma^2$ along.

Definition 7.4.1.
The random vector $z$ is said to have a multivariate standard normal distribution with variance $\sigma^2$, written as $z \sim N(o, \sigma^2 I)$, if each element $z_i$ is a standard normal with the same variance $\sigma^2$, and all elements are mutually independent of each other. (Note that this definition of the standard normal is a little broader than the usual one; the usual one requires that $\sigma^2 = 1$.)

The density function of a multivariate standard normal $z$ is therefore the product of the univariate densities, which gives $f_z(z) = (2\pi\sigma^2)^{-n/2}\exp(-z^\top z/2\sigma^2)$.

The following property of the multivariate standard normal distributions is basic:

Theorem 7.4.2. Let $z$ be a multivariate standard normal $p$-vector with variance $\sigma^2$, and let $P$ be an $m\times p$ matrix with $PP^\top = I$. Then $x = Pz$ is a multivariate standard normal $m$-vector with the same variance $\sigma^2$, and $z^\top z - x^\top x \sim \sigma^2\chi^2_{p-m}$ independent of $x$.

Proof. $PP^\top = I$ means all rows are orthonormal. If $P$ is not square, it must therefore have more columns than rows, and one can add more rows to get an orthogonal square matrix, call it $T = \bigl[\begin{smallmatrix}P\\ Q\end{smallmatrix}\bigr]$. Define $y = Tz$, i.e., $z = T^\top y$. Then $z^\top z = y^\top TT^\top y = y^\top y$, and the Jacobian of the transformation from $y$ to $z$ has absolute value one. Therefore the density function of $y$ is $(2\pi\sigma^2)^{-p/2}\exp(-y^\top y/2\sigma^2)$, which means $y$ is standard normal as well. In other words, every $y_i$ is univariate standard normal with the same variance $\sigma^2$ and $y_i$ is independent of $y_j$ for $i \ne j$. Therefore also any subvector of $y$, such as $x$, is standard normal. Since $z^\top z - x^\top x = y^\top y - x^\top x$ is the sum of the squares of those elements of $y$ which are not in $x$, it follows that it is an independent $\sigma^2\chi^2_{p-m}$.

Problem 151. Show that the moment generating function of a multivariate standard normal with variance $\sigma^2$ is $m_z(t) = E[\exp(t^\top z)] = \exp(\sigma^2 t^\top t/2)$.

Answer.
Proof: The moment generating function is defined as

(7.4.1) $m_z(t) = E[\exp(t^\top z)]$

(7.4.2) $\phantom{m_z(t)} = (2\pi\sigma^2)^{-n/2}\int\cdots\int\exp\Bigl(-\frac{1}{2\sigma^2}z^\top z\Bigr)\exp(t^\top z)\,dz_1\cdots dz_n$

(7.4.3) $\phantom{m_z(t)} = (2\pi\sigma^2)^{-n/2}\int\cdots\int\exp\Bigl(-\frac{1}{2\sigma^2}(z-\sigma^2 t)^\top(z-\sigma^2 t) + \frac{\sigma^2}{2}t^\top t\Bigr)\,dz_1\cdots dz_n$

(7.4.4) $\phantom{m_z(t)} = \exp\Bigl(\frac{\sigma^2}{2}t^\top t\Bigr)$ since the first part of the integrand is a density function.

Theorem 7.4.3. Let $z \sim N(o, \sigma^2 I)$, and $P$ symmetric and of rank $r$. A necessary and sufficient condition for $q = z^\top Pz$ to have a $\sigma^2\chi^2$ distribution is $P^2 = P$. In this case, the $\chi^2$ has $r$ degrees of freedom.

Proof of sufficiency: If $P^2 = P$ with rank $r$, then a matrix $T$ exists with $P = T^\top T$ and $TT^\top = I$. Define $x = Tz$; it is standard normal by theorem 7.4.2. Therefore $q = z^\top T^\top Tz = \sum_{i=1}^r x_i^2$.

Proof of necessity by construction of the moment generating function of $q = z^\top Pz$ for arbitrary symmetric $P$ with rank $r$. Since $P$ is symmetric, there exists a $T$ with $TT^\top = I_r$ and $P = T^\top\Lambda T$ where $\Lambda$ is a nonsingular diagonal matrix, write it $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_r)$. Therefore $q = z^\top T^\top\Lambda Tz = x^\top\Lambda x = \sum_{i=1}^r\lambda_i x_i^2$ where $x = Tz \sim N(o, \sigma^2 I_r)$. Therefore the moment generating function

(7.4.5) $E[\exp(qt)] = E\bigl[\exp\bigl(t\sum_{i=1}^r\lambda_i x_i^2\bigr)\bigr]$

(7.4.6) $\phantom{E[\exp(qt)]} = E[\exp(t\lambda_1 x_1^2)]\cdots E[\exp(t\lambda_r x_r^2)]$

(7.4.7) $\phantom{E[\exp(qt)]} = (1-2\lambda_1\sigma^2 t)^{-1/2}\cdots(1-2\lambda_r\sigma^2 t)^{-1/2}.$

By assumption this is equal to $(1-2\sigma^2 t)^{-k/2}$ with some integer $k \ge 1$. Taking squares and inverses one obtains

(7.4.8) $(1-2\lambda_1\sigma^2 t)\cdots(1-2\lambda_r\sigma^2 t) = (1-2\sigma^2 t)^k.$

Since the $\lambda_i \ne 0$, one obtains $\lambda_i = 1$ by uniqueness of the polynomial roots. Furthermore, this also implies $r = k$.

From Theorem 7.4.3 one can derive a characterization of all the quadratic forms of multivariate normal variables with arbitrary covariance matrices that are $\chi^2$’s. Assume $y$ is a multivariate normal vector random variable with mean vector $\mu$ and covariance matrix $\sigma^2\Psi$, and $\Omega$ is a symmetric nonnegative definite matrix. Then $(y-\mu)^\top\Omega(y-\mu) \sim \sigma^2\chi^2_k$ iff

(7.4.9) $\Psi\Omega\Psi\Omega\Psi = \Psi\Omega\Psi,$

and $k$ is the rank of $\Psi\Omega$.
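Theorem 7.4.3 lends itself to a quick Monte Carlo check. The sketch below is an illustration under assumptions of my own choosing, not part of the notes: it takes $P$ to be the centering matrix $I - \frac{1}{n}\iota\iota^\top$ (symmetric, idempotent, rank $n-1$), draws $z \sim N(o, \sigma^2 I)$ with the standard library generator, and compares the sample mean and variance of $q = z^\top Pz$ with the values $\sigma^2 r$ and $2\sigma^4 r$ implied by a $\sigma^2\chi^2_r$ distribution. All function names are hypothetical.

```python
import random


def centering_matrix(n):
    """Symmetric idempotent matrix I - (1/n) * ones * ones', which has rank n - 1."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def quadratic_form(z, p):
    """Compute z' P z."""
    n = len(z)
    return sum(z[i] * p[i][j] * z[j] for i in range(n) for j in range(n))


def simulate_quadratic_form(n, sigma, draws, seed=0):
    """Return sample mean and sample variance of q = z' P z over many draws of z."""
    rng = random.Random(seed)
    p = centering_matrix(n)
    qs = []
    for _ in range(draws):
        z = [rng.gauss(0.0, sigma) for _ in range(n)]
        qs.append(quadratic_form(z, p))
    mean = sum(qs) / draws
    var = sum((q - mean) ** 2 for q in qs) / (draws - 1)
    return mean, var


if __name__ == "__main__":
    n, sigma = 5, 2.0
    r = n - 1  # rank of the centering matrix equals its trace
    mean, var = simulate_quadratic_form(n, sigma, draws=20000, seed=1)
    print("sample mean", mean, "theory", sigma ** 2 * r)
    print("sample variance", var, "theory", 2 * sigma ** 4 * r)
```

Matching the first two moments does not prove the distributional claim, of course; it is only a sanity check that an idempotent $P$ of rank $r$ behaves as the theorem says.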
Here are the three best known special cases (with examples):

• $\Psi = I$ (the identity matrix) and $\Omega^2 = \Omega$, i.e., the case of theorem 7.4.3. This is the reason why the minimum value of the SSE has a $\sigma^2\chi^2$ distribution, see (27.0.10).

• $\Psi$ nonsingular and $\Omega = \Psi^{-1}$. The quadratic form in the exponent of the normal density function is therefore a $\chi^2$; one needs therefore the $\chi^2$ to compute the probability that the realization of a Normal is in a given equidensity-ellipse (Problem 147).

• $\Psi$ singular and $\Omega = \Psi^-$, its g-inverse. The multinomial distribution has a singular covariance matrix, and equation (??) gives a convenient g-inverse which enters the equation for Pearson’s goodness of fit test.

Here are, without proof, two more useful theorems about the standard normal:

Theorem 7.4.4. Let $x$ be a multivariate standard normal. Then $x^\top Px$ is independent of $x^\top Qx$ if and only if $PQ = O$.

This is called Craig’s theorem, although Craig’s proof in [Cra43] is incorrect. Kshirsagar [Ksh19, p. 41] describes the correct proof; he and Seber [Seb77] give Lancaster’s book [Lan69] as basic reference. Seber [Seb77] gives a proof which is only valid if the two quadratic forms are $\chi^2$.

The next theorem is known as James’s theorem; it is a stronger version of Cochran’s theorem. It is from Kshirsagar [Ksh19, p. 41].

Theorem 7.4.5. Let $x$ be $p$-variate standard normal with variance $\sigma^2$, and $x^\top x = \sum_{i=1}^k x^\top P_i x$. Then for the quadratic forms $x^\top P_i x$ to be independently distributed as $\sigma^2\chi^2$, any one of the following three equivalent conditions is necessary and sufficient:

(7.4.10) $P_i^2 = P_i$ for all $i$

(7.4.11) $P_iP_j = O$ for $i \ne j$

(7.4.12) $\sum_{i=1}^k\operatorname{rank}(P_i) = p$

CHAPTER 8

The Regression Fallacy

Only for the sake of this exercise we will assume that “intelligence” is an innate property of individuals and can be represented by a real number $z$.
If one picks at random a student entering the U of U, the intelligence of this student is a random variable which we assume to be normally distributed with mean $\mu$ and standard deviation $\tau$. Also assume every student has to take two intelligence tests, the first at the beginning of his or her studies, the other half a year later. The outcomes of these tests are $x$ and $y$. $x$ and $y$ measure the intelligence $z$ (which is assumed to be the same in both tests) plus a random error $\varepsilon$ and $\delta$, i.e.,

(8.0.13) $x = z + \varepsilon$

(8.0.14) $y = z + \delta$

Here $z \sim N(\mu, \tau^2)$, $\varepsilon \sim N(0, \sigma^2)$, and $\delta \sim N(0, \sigma^2)$ (i.e., we assume that both errors have the same variance). The three variables $\varepsilon$, $\delta$, and $z$ are independent of each other. Therefore $x$ and $y$ are jointly normal. $\operatorname{var}[x] = \tau^2 + \sigma^2$, $\operatorname{var}[y] = \tau^2 + \sigma^2$, $\operatorname{cov}[x,y] = \operatorname{cov}[z+\varepsilon, z+\delta] = \tau^2 + 0 + 0 + 0 = \tau^2$. Therefore $\rho = \frac{\tau^2}{\tau^2+\sigma^2}$. The contour lines of the joint density are ellipses with center $(\mu,\mu)$ whose main axes lie along the line $y = x$ and the line perpendicular to it through the center.

Now what is the conditional mean? Since $\operatorname{var}[x] = \operatorname{var}[y]$, (7.3.17) gives the line $E[\boldsymbol{y}\mid\boldsymbol{x}=x] = \mu + \rho(x-\mu)$, i.e., it is a line which goes through the center of the ellipses but which is flatter than the line $x = y$ representing the real underlying linear relationship if there are no errors. Geometrically one can get it as the line which intersects each ellipse exactly where the ellipse is vertical.

Therefore, the parameters of the best prediction of $y$ on the basis of $x$ are not the parameters of the underlying relationship. Why not? Because not only $y$ but also $x$ is subject to errors. Assume you pick an individual at random, and it turns out that his or her first test result is very much higher than the average. Then it is more likely that this is an individual who was lucky in the first exam, and his or her true IQ is lower than the one measured, than that the individual is an Einstein who had a bad day.
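A small simulation makes the flattening of the prediction line concrete. This is a hedged sketch, not taken from the notes: the parameter values ($\mu = 100$, $\tau = 4$, $\sigma = 3$), the cutoff of 110 for "high scorers", and all function names are illustrative assumptions. It generates pairs $(x, y)$ from (8.0.13) and (8.0.14), estimates the slope of the regression of $y$ on $x$, and compares it with $\rho = \tau^2/(\tau^2+\sigma^2)$.

```python
import random


def simulate_tests(mu, tau, sigma, n, seed=0):
    """Draw n pairs (x, y) with x = z + eps and y = z + delta, all normal and independent."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = rng.gauss(mu, tau)            # true intelligence
        x = z + rng.gauss(0.0, sigma)     # first test score
        y = z + rng.gauss(0.0, sigma)     # second test score
        pairs.append((x, y))
    return pairs


def ols_slope(pairs):
    """Slope of the least squares regression of y on x (sample cov over sample var)."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sxy / sxx


def mean_scores_of_high_scorers(pairs, threshold):
    """Average first and second score among those whose first score exceeds threshold."""
    high = [(x, y) for x, y in pairs if x > threshold]
    mean_x = sum(x for x, _ in high) / len(high)
    mean_y = sum(y for _, y in high) / len(high)
    return mean_x, mean_y


if __name__ == "__main__":
    mu, tau, sigma = 100.0, 4.0, 3.0
    pairs = simulate_tests(mu, tau, sigma, n=50000, seed=2)
    rho = tau ** 2 / (tau ** 2 + sigma ** 2)
    print("estimated slope", ols_slope(pairs), "theoretical rho", rho)
    print("high scorers (first, second):", mean_scores_of_high_scorers(pairs, 110.0))
```

Although both tests measure the same $z$, the estimated slope comes out near $\rho < 1$, and students selected for a high first score have a lower average second score.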
This is simply because $z$ is normally distributed, i.e., among the students entering a given University, there are more individuals with lower IQ’s than Einsteins. In order to make a good prediction of the result of the second test one must make allowance for the fact that the individual’s IQ is most likely lower than his first score indicated, therefore one will predict the second score to be lower than the first score. The converse is true for individuals who scored lower than average, i.e., in your prediction you will do as if a “regression towards the mean” had taken place.

The next important point to note here is: the “true regression line,” i.e., the prediction line, is uniquely determined by the joint distribution of $x$ and $y$. However the line representing the underlying relationship can only be determined if one has information in addition to the joint density, i.e., in addition to the observations. E.g., assume the two tests have different standard deviations, which may be the case simply because the second test has more questions and is therefore more accurate. Then the underlying 45° line is no longer one of the main axes of the ellipse! To be more precise, the underlying line can only be identified if one knows the ratio of the variances, or if one knows one of the two variances. Without any knowledge of the variances, the only thing one can say about the underlying line is that it lies between the line predicting $y$ on the basis of $x$ and the line predicting $x$ on the basis of $y$.

The name “regression” stems from a confusion between the prediction line and the real underlying relationship. Francis Galton, the cousin of the famous Darwin, measured the height of fathers and sons, and concluded from his evidence that the heights of sons tended to be closer to the average height than the height of the fathers, a purported law of “regression towards the mean.” Problem 152 illustrates this:

Problem 152.
The evaluation of two intelligence tests, one at the beginning of the semester, one at the end, gives the following disturbing outcome: While the underlying intelligence during the first test was $z \sim N(100, 20)$, it changed between the first and second test due to the learning experience at the university. If $w$ is the intelligence of each student at the second test, it is connected to his intelligence $z$ at the first test by the formula $w = 0.5z + 50$, i.e., those students with intelligence below 100 gained, but those students with intelligence above 100 lost. (The errors of both intelligence tests are normally distributed with expected value zero, and the variance of the first intelligence test was 5, and that of the second test, which had more questions, was 4. As usual, the errors are independent of each other and of the actual intelligence.)

[Figure 1. Ellipse containing 95% of the probability mass of test results $x$ and $y$.]

• a. 3 points If $x$ and $y$ are the outcomes of the first and second intelligence test, compute $E[x]$, $E[y]$, $\operatorname{var}[x]$, $\operatorname{var}[y]$, and the correlation coefficient $\rho = \operatorname{corr}[x,y]$. Figure 1 shows an equi-density line of their joint distribution; 95% of the probability mass of the test results are inside this ellipse.
Draw the line $w = 0.5z + 50$ into Figure 1.

Answer. We know $z \sim N(100, 20)$; $w = 0.5z + 50$; $x = z + \varepsilon$; $\varepsilon \sim N(0, 5)$; $y = w + \delta$; $\delta \sim N(0, 4)$; therefore $E[x] = 100$; $E[y] = 100$; $\operatorname{var}[x] = 20 + 5 = 25$; $\operatorname{var}[y] = 5 + 4 = 9$; $\operatorname{cov}[x,y] = 10$; $\operatorname{corr}[x,y] = 10/15 = 2/3$. In matrix notation

(8.0.15) $\begin{bmatrix}x\\ y\end{bmatrix} \sim N\Bigl(\begin{bmatrix}100\\ 100\end{bmatrix}, \begin{bmatrix}25 & 10\\ 10 & 9\end{bmatrix}\Bigr)$

The line $y = 50 + 0.5x$ goes through the points (80, 90) and (120, 110).

• b. 4 points Compute $E[\boldsymbol{y}\mid\boldsymbol{x}=x]$ and $E[\boldsymbol{x}\mid\boldsymbol{y}=y]$. The first is a linear function of $x$ and the second a linear function of $y$. Draw the two lines representing these linear functions into Figure 1. Use (7.3.18) for this.

Answer.

(8.0.16) $E[\boldsymbol{y}\mid\boldsymbol{x}=x] = 100 + \frac{10}{25}(x-100) = 60 + \frac{2}{5}x$

(8.0.17) $E[\boldsymbol{x}\mid\boldsymbol{y}=y] = 100 + \frac{10}{9}(y-100) = -\frac{100}{9} + \frac{10}{9}y.$

The line $y = E[\boldsymbol{y}\mid\boldsymbol{x}=x]$ goes through the points (80, 92) and (120, 108) at the edge of Figure 1; it intersects the ellipse where it is vertical. The line $x = E[\boldsymbol{x}\mid\boldsymbol{y}=y]$ goes through the points (80, 82) and (120, 118), which are the corner points of Figure 1; it intersects the ellipse where it is horizontal. The two lines intersect in the center of the ellipse, i.e., at the point (100, 100).

• c. 2 points Another researcher says that $w = \frac{6}{10}z + 40$, $z \sim N(100, \frac{100}{6})$, $\varepsilon \sim N(0, \frac{50}{6})$, $\delta \sim N(0, 3)$. Is this compatible with the data?

Answer. Yes, it is compatible: $E[x] = E[z] + E[\varepsilon] = 100$; $E[y] = E[w] + E[\delta] = \frac{6}{10}100 + 40 = 100$; $\operatorname{var}[x] = \frac{100}{6} + \frac{50}{6} = 25$; $\operatorname{var}[y] = \bigl(\frac{6}{10}\bigr)^2\operatorname{var}[z] + \operatorname{var}[\delta] = \frac{36}{100}\cdot\frac{100}{6} + 3 = 9$; $\operatorname{cov}[x,y] = \frac{6}{10}\operatorname{var}[z] = 10$.

• d. 4 points A third researcher asserts that the IQ of the students really did not change. He says $w = z$, $z \sim N(100, 5)$, $\varepsilon \sim N(0, 20)$, $\delta \sim N(0, 4)$. Is this compatible with the data? Is there unambiguous evidence in the data that the IQ declined?

Answer. This is not compatible. This scenario gets everything right except the covariance: $E[x] = E[z] + E[\varepsilon] = 100$; $E[y] = E[z] + E[\delta] = 100$; $\operatorname{var}[x] = 5 + 20 = 25$; $\operatorname{var}[y] = 5 + 4 = 9$; $\operatorname{cov}[x,y] = 5$.
A scenario in which both tests have the same underlying intelligence cannot be found. Since the two conditional expectations are on the same side of the diagonal, the hypothesis that the intelligence did not change between the two tests is not consistent with the joint distribution of $x$ and $y$. The diagonal goes through the points (82, 82) and (118, 118), i.e., it intersects the two horizontal boundaries of Figure 1.

We just showed that the parameters of the true underlying relationship cannot be inferred from the data alone if there are errors in both variables. We also showed that this lack of identification is not complete, because one can specify an interval which in the plim contains the true parameter value. Chapter ?? has a much more detailed discussion of all this. There we will see that this lack of identification can be removed if more information is available, i.e., if one knows that the two error variances are equal, or if one knows that the regression has zero intercept, etc. Question 153 shows that in this latter case, the OLS estimate is not consistent, but other estimates exist that are consistent.

Problem 153. [Fri57, chapter 3] According to Friedman’s permanent income hypothesis, drawing at random families in a given country and asking them about their income $y$ and consumption $c$ can be modeled as the independent observations of two random variables which satisfy

(8.0.18) $y = y^p + y^t,$

(8.0.19) $c = c^p + c^t,$

(8.0.20) $c^p = \beta y^p.$

Here $y^p$ and $c^p$ are the permanent and $y^t$ and $c^t$ the transitory components of income and consumption. These components are not observed separately, only their sums $y$ and $c$ are observed. We assume that the permanent income $y^p$ is random, with $E[y^p] = \mu \ne 0$ and $\operatorname{var}[y^p] = \tau_y^2$. The transitory components $y^t$ and $c^t$ are assumed to be independent of each other and of $y^p$, and $E[y^t] = 0$, $\operatorname{var}[y^t] = \sigma_y^2$, $E[c^t] = 0$, and $\operatorname{var}[c^t] = \sigma_c^2$.
Finally, it is assumed that all variables are normally distributed.

• a. 2 points Given the above information, write down the vector of expected values $\mathcal{E}\bigl[\begin{smallmatrix}y\\ c\end{smallmatrix}\bigr]$ and the covariance matrix $\mathcal{V}\bigl[\begin{smallmatrix}y\\ c\end{smallmatrix}\bigr]$ in terms of the five unknown parameters of the model $\mu$, $\beta$, $\tau_y^2$, $\sigma_y^2$, and $\sigma_c^2$.

Answer.

(8.0.21) $\mathcal{E}\begin{bmatrix}y\\ c\end{bmatrix} = \begin{bmatrix}\mu\\ \beta\mu\end{bmatrix}$ and $\mathcal{V}\begin{bmatrix}y\\ c\end{bmatrix} = \begin{bmatrix}\tau_y^2+\sigma_y^2 & \beta\tau_y^2\\ \beta\tau_y^2 & \beta^2\tau_y^2+\sigma_c^2\end{bmatrix}.$

• b. 3 points Assume that you know the true parameter values and you observe a family’s actual income $y$. Show that your best guess (minimum mean squared error) of this family’s permanent income $y^p$ is

(8.0.22) $y^{p*} = \frac{\sigma_y^2}{\tau_y^2+\sigma_y^2}\,\mu + \frac{\tau_y^2}{\tau_y^2+\sigma_y^2}\,y.$

Note: here we are guessing income, not yet consumption! Use (7.3.17) for this!

Answer. This answer also does the math for part c. The best guess is the conditional mean

$E[y^p\mid y = 22{,}000] = E[y^p] + \frac{\operatorname{cov}[y^p,y]}{\operatorname{var}[y]}(22{,}000 - E[y]) = 12{,}000 + \frac{16{,}000{,}000}{20{,}000{,}000}(22{,}000 - 12{,}000) = 20{,}000$

or equivalently

$E[y^p\mid y = 22{,}000] = \mu + \frac{\tau_y^2}{\tau_y^2+\sigma_y^2}(22{,}000 - \mu) = \frac{\sigma_y^2}{\tau_y^2+\sigma_y^2}\,\mu + \frac{\tau_y^2}{\tau_y^2+\sigma_y^2}\,22{,}000 = (0.2)(12{,}000) + (0.8)(22{,}000) = 20{,}000.$

• c. 3 points To make things more concrete, assume the parameters are

(8.0.23) $\beta = 0.7$

(8.0.24) $\sigma_y = 2{,}000$

(8.0.25) $\sigma_c = 1{,}000$

(8.0.26) $\mu = 12{,}000$

(8.0.27) $\tau_y = 4{,}000.$

If a family’s income is $y = 22{,}000$, what is your best guess of this family’s permanent income $y^p$? Give an intuitive explanation why this best guess is smaller than 22,000.

Answer. Since the observed income of 22,000 is above the average of 12,000, chances are greater that it is someone with a positive transitory income than someone with a negative one.

• d. 2 points If a family’s income is $y$, show that your best guess about this family’s consumption is

(8.0.28) $c^* = \beta\Bigl(\frac{\sigma_y^2}{\tau_y^2+\sigma_y^2}\,\mu + \frac{\tau_y^2}{\tau_y^2+\sigma_y^2}\,y\Bigr).$

Instead of an exact mathematical proof you may also reason out how it can be obtained from (8.0.22). Give the numbers for a family whose actual income is 22,000.

Answer.
This is 0.7 times the best guess about the family's permanent income, since the transitory consumption is uncorrelated with everything else and therefore must be predicted by 0. This is an acceptable answer, but one can also derive it from scratch:
\[ E[c \mid y = 22{,}000] = E[c] + \frac{\operatorname{cov}[c, y]}{\operatorname{var}[y]}\,(22{,}000 - E[y]) \tag{8.0.29} \]
\[ = \beta\mu + \frac{\beta\tau_y^2}{\tau_y^2 + \sigma_y^2}\,(22{,}000 - \mu) = 8{,}400 + 0.7\,\frac{16{,}000{,}000}{20{,}000{,}000}\,(22{,}000 - 12{,}000) = 14{,}000 \tag{8.0.30} \]
or
\[ = \beta\left( \frac{\sigma_y^2}{\tau_y^2 + \sigma_y^2}\,\mu + \frac{\tau_y^2}{\tau_y^2 + \sigma_y^2}\,22{,}000 \right) \tag{8.0.31} \]
\[ = 0.7\bigl( (0.2)(12{,}000) + (0.8)(22{,}000) \bigr) = (0.7)(20{,}000) = 14{,}000. \tag{8.0.32} \]

The remainder of this Problem uses material that comes later in these Notes:

• e. 4 points From now on we will assume that the true values of the parameters are not known, but two vectors $y$ and $c$ of independent observations are available. We will show that it is not correct in this situation to estimate $\beta$ by regressing $c$ on $y$ with the intercept suppressed. This would give the estimator
\[ \hat\beta = \frac{\sum c_i y_i}{\sum y_i^2}. \tag{8.0.33} \]
Show that the plim of this estimator is
\[ \operatorname{plim}[\hat\beta] = \frac{E[cy]}{E[y^2]}. \tag{8.0.34} \]
Which theorems do you need for this proof? Show that $\hat\beta$ is an inconsistent estimator of $\beta$, which yields too small values for $\beta$.

Answer. First rewrite the formula for $\hat\beta$ in such a way that numerator and denominator each has a plim: by the weak law of large numbers the plim of the average is the expected value, therefore we have to divide both numerator and denominator by $n$. Then we can use the Slutsky theorem that the plim of the fraction is the fraction of the plims.
\[ \hat\beta = \frac{\frac{1}{n}\sum c_i y_i}{\frac{1}{n}\sum y_i^2}; \qquad \operatorname{plim}[\hat\beta] = \frac{E[cy]}{E[y^2]} = \frac{E[c]\,E[y] + \operatorname{cov}[c, y]}{(E[y])^2 + \operatorname{var}[y]} = \frac{\mu\beta\mu + \beta\tau_y^2}{\mu^2 + \tau_y^2 + \sigma_y^2} = \beta\,\frac{\mu^2 + \tau_y^2}{\mu^2 + \tau_y^2 + \sigma_y^2}. \]

• f. 4 points Give the formulas of the method of moments estimators of the five parameters of this model: $\mu$, $\beta$, $\tau_y^2$, $\sigma_y^2$, and $\sigma_c^2$.
(For this you have to express these five parameters in terms of the five moments $E[y]$, $E[c]$, $\operatorname{var}[y]$, $\operatorname{var}[c]$, and $\operatorname{cov}[y, c]$, and then simply replace the population moments by the sample moments.) Are these consistent estimators?

Answer. From (8.0.21) follows $E[c] = \beta E[y]$, therefore $\beta = \frac{E[c]}{E[y]}$. This together with $\operatorname{cov}[y, c] = \beta\tau_y^2$ gives $\tau_y^2 = \frac{\operatorname{cov}[y, c]}{\beta} = \frac{\operatorname{cov}[y, c]\,E[y]}{E[c]}$. This together with $\operatorname{var}[y] = \tau_y^2 + \sigma_y^2$ gives $\sigma_y^2 = \operatorname{var}[y] - \tau_y^2 = \operatorname{var}[y] - \frac{\operatorname{cov}[y, c]\,E[y]}{E[c]}$. And from the last equation $\operatorname{var}[c] = \beta^2\tau_y^2 + \sigma_c^2$ one gets $\sigma_c^2 = \operatorname{var}[c] - \frac{\operatorname{cov}[y, c]\,E[c]}{E[y]}$. All these are consistent estimators, as long as $E[y] \neq 0$ and $\beta \neq 0$.

• g. 4 points Now assume you are not interested in estimating $\beta$ itself, but in addition to the two $n$-vectors $y$ and $c$ you have an observation of $y_{n+1}$ and you want to predict the corresponding $c_{n+1}$. One obvious way to do this would be to plug the method-of-moments estimators of the unknown parameters into formula (8.0.28) for the best linear predictor. Show that this is equivalent to using the ordinary least squares predictor $c^* = \hat\alpha + \hat\beta y_{n+1}$ where $\hat\alpha$ and $\hat\beta$ are intercept and slope in the simple regression of $c$ on $y$, i.e.,
\[ \hat\beta = \frac{\sum (y_i - \bar y)(c_i - \bar c)}{\sum (y_i - \bar y)^2} \tag{8.0.35} \]
\[ \hat\alpha = \bar c - \hat\beta\,\bar y \tag{8.0.36} \]
Note that we are regressing $c$ on $y$ with an intercept, although the original model does not have an intercept.

Answer. Here I am writing population moments where I should be writing sample moments. First substitute the method of moments estimators in the denominator in (8.0.28): $\tau_y^2 + \sigma_y^2 = \operatorname{var}[y]$. Therefore the first summand becomes
\[ \beta\sigma_y^2\mu\,\frac{1}{\operatorname{var}[y]} = \frac{E[c]}{E[y]}\left( \operatorname{var}[y] - \frac{\operatorname{cov}[y, c]\,E[y]}{E[c]} \right) E[y]\,\frac{1}{\operatorname{var}[y]} = E[c]\left( 1 - \frac{\operatorname{cov}[y, c]\,E[y]}{\operatorname{var}[y]\,E[c]} \right) = E[c] - \frac{\operatorname{cov}[y, c]}{\operatorname{var}[y]}\,E[y]. \]
But since $\frac{\operatorname{cov}[y, c]}{\operatorname{var}[y]} = \hat\beta$ and $\hat\alpha + \hat\beta E[y] = E[c]$ this expression is simply $\hat\alpha$. The second term is easier to show:
\[ \beta\,\frac{\tau_y^2}{\operatorname{var}[y]}\,y = \frac{\operatorname{cov}[y, c]}{\operatorname{var}[y]}\,y = \hat\beta y. \]

• h.
2 points What is the “Iron Law of Econometrics,” and how does the above relate to it?

Answer. The Iron Law says that all effects are underestimated because of errors in the independent variable. Friedman says Keynesians obtain their low marginal propensity to consume due to the “Iron Law of Econometrics”: they ignore that actual income is a measurement with error of the true underlying variable, permanent income.

Problem 154. This question follows the original article [SW76] much more closely than [HVdP02] does. Sargent and Wallace first reproduce the usual argument why “activist” policy rules, in which the Fed “looks at many things” and “leans against the wind,” are superior to policy rules without feedback as promoted by the monetarists. They work with a very stylized model in which national income is represented by the following time series:
\[ y_t = \alpha + \lambda y_{t-1} + \beta m_t + u_t \tag{8.0.37} \]
Here $y_t$ is GNP, measured as its deviation from “potential” GNP or as unemployment rate, and $m_t$ is the rate of growth of the money supply. The random disturbance $u_t$ is assumed independent of $y_{t-1}$, it has zero expected value, and its variance $\operatorname{var}[u_t]$ is constant over time; we will call it $\operatorname{var}[u]$ (no time subscript).

• a. 4 points First assume that the Fed tries to maintain a constant money supply, i.e., $m_t = g_0 + \varepsilon_t$ where $g_0$ is a constant, and $\varepsilon_t$ is a random disturbance since the Fed does not have full control over the money supply. The $\varepsilon_t$ have zero expected value; they are serially uncorrelated, and they are independent of the $u_t$. This constant money supply rule does not necessarily make $y_t$ a stationary time series (i.e., a time series where mean, variance, and covariances do not depend on $t$), but if $|\lambda| < 1$ then $y_t$ converges towards a stationary time series, i.e., any initial deviations from the “steady state” die out over time.
You are not required here to prove that the time series converges towards a stationary time series, but you are asked to compute $E[y_t]$ in this stationary time series.

• b. 8 points Now assume the policy makers want to steer the economy towards a desired steady state, call it $y^*$, which they think makes the best tradeoff between unemployment and inflation, by setting $m_t$ according to a rule with feedback:
\[ m_t = g_0 + g_1 y_{t-1} + \varepsilon_t \tag{8.0.38} \]
Show that the following values of $g_0$ and $g_1$
\[ g_0 = (y^* - \alpha)/\beta \qquad g_1 = -\lambda/\beta \tag{8.0.39} \]
represent an optimal monetary policy, since they bring the expected value of the steady state $E[y_t]$ to $y^*$ and minimize the steady state variance $\operatorname{var}[y_t]$.

• c. 3 points This is the conventional reasoning which comes to the result that a policy rule with feedback, i.e., a policy rule in which $g_1 \neq 0$, is better than a policy rule without feedback. Sargent and Wallace argue that there is a flaw in this reasoning. Which flaw?

• d. 5 points A possible system of structural equations from which (8.0.37) can be derived are equations (8.0.40)–(8.0.42) below. Equation (8.0.40) indicates that unanticipated increases in the growth rate of the money supply increase output, while anticipated ones do not. This is a typical assumption of the rational expectations school (Lucas supply curve).
\[ y_t = \xi_0 + \xi_1 (m_t - E_{t-1} m_t) + \xi_2 y_{t-1} + u_t \tag{8.0.40} \]
The Fed uses the policy rule
\[ m_t = g_0 + g_1 y_{t-1} + \varepsilon_t \tag{8.0.41} \]
and the agents know this policy rule, therefore
\[ E_{t-1} m_t = g_0 + g_1 y_{t-1}. \tag{8.0.42} \]
Show that in this system, the parameters $g_0$ and $g_1$ have no influence on the time path of $y$.

• e. 4 points On the other hand, the econometric estimations which the policy makers are running seem to show that these coefficients have an impact. During a certain period during which a constant policy rule $g_0$, $g_1$ is followed, the econometricians regress $y_t$ on $y_{t-1}$ and $m_t$ in order to estimate the coefficients in (8.0.37). Which values of $\alpha$, $\lambda$, and $\beta$ will such a regression yield?
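
The statement in part a of Problem 154, that initial deviations from the steady state die out when $|\lambda| < 1$, can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the parameter values, the shock standard deviations, and the helper name `simulate_path` are hypothetical and not taken from [SW76]. Two paths of (8.0.37) under the constant money supply rule are started far apart but driven by the same shocks, so their gap shrinks by the factor $\lambda$ every period.

```python
import random

def simulate_path(y0, alpha, lam, beta, g0, shocks):
    """Iterate y_t = alpha + lam*y_{t-1} + beta*m_t + u_t with m_t = g0 + eps_t."""
    path = [y0]
    for u, eps in shocks:
        m = g0 + eps  # constant money supply rule with control error eps
        path.append(alpha + lam * path[-1] + beta * m + u)
    return path

rng = random.Random(0)
T = 60
# One common sequence of disturbances (u_t, eps_t) for both paths.
shocks = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.5)) for _ in range(T)]

# Hypothetical parameter values with |lam| < 1.
alpha, lam, beta, g0 = 1.0, 0.8, 0.5, 2.0
path_a = simulate_path(0.0, alpha, lam, beta, g0, shocks)
path_b = simulate_path(50.0, alpha, lam, beta, g0, shocks)

# The gap between the two paths equals lam**t times the initial gap.
gaps = [abs(a - b) for a, b in zip(path_a, path_b)]
print(gaps[0], gaps[10], gaps[T])
```

Because both paths share the same disturbances, the gap isolates the effect of the starting value; choosing a hypothetical `lam` with absolute value above 1 would make the gap grow instead.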
CHAPTER 9

A Simple Example of Estimation

We will discuss here a simple estimation problem, which can be considered the prototype of all least squares estimation. Assume we have $n$ independent observations $y_1, \ldots, y_n$ of a Normally distributed random variable $y \sim N(\mu, \sigma^2)$ with unknown location parameter $\mu$ and dispersion parameter $\sigma^2$. Our goal is to estimate the location parameter and also estimate some measure of the precision of this estimator.

9.1. Sample Mean as Estimator of the Location Parameter

The obvious (and in many cases also the best) estimate of the location parameter of a distribution is the sample mean $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$. Why is this a reasonable estimate?

1. The location parameter of the Normal distribution is its expected value, and by the weak law of large numbers, the probability limit for $n \to \infty$ of the sample mean is the expected value.

2. The expected value $\mu$ is sometimes called the “population mean,” while $\bar y$ is the sample mean. This terminology indicates that there is a correspondence between population quantities and sample quantities, which is often used for estimation. This is the principle of estimating the unknown distribution of the population by the empirical distribution of the sample. Compare Problem 63.

3. This estimator is also unbiased. By definition, an estimator $t$ of the parameter $\theta$ is unbiased if $E[t] = \theta$. $\bar y$ is an unbiased estimator of $\mu$, since $E[\bar y] = \mu$.

4. Given $n$ observations $y_1, \ldots, y_n$, the sample mean is the number $a = \bar y$ which minimizes $(y_1 - a)^2 + (y_2 - a)^2 + \cdots + (y_n - a)^2$. One can say it is the number whose squared distance to the given sample numbers is smallest. This idea is generalized in the least squares principle of estimation. It follows from the frequently used fact (9.1.1) stated in Problem 155.

5. In the case of normality the sample mean is also the maximum likelihood estimate.

Problem 155. 4 points Let $y_1, \ldots, y_n$ be an arbitrary vector and $\alpha$ an arbitrary number. As usual, $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$.
Show that
\[ \sum_{i=1}^n (y_i - \alpha)^2 = \sum_{i=1}^n (y_i - \bar y)^2 + n(\bar y - \alpha)^2 \tag{9.1.1} \]

Answer.
\[ \sum_{i=1}^n (y_i - \alpha)^2 = \sum_{i=1}^n \bigl( (y_i - \bar y) + (\bar y - \alpha) \bigr)^2 \tag{9.1.2} \]
\[ = \sum_{i=1}^n (y_i - \bar y)^2 + 2\sum_{i=1}^n (y_i - \bar y)(\bar y - \alpha) + \sum_{i=1}^n (\bar y - \alpha)^2 \tag{9.1.3} \]
\[ = \sum_{i=1}^n (y_i - \bar y)^2 + 2(\bar y - \alpha)\sum_{i=1}^n (y_i - \bar y) + n(\bar y - \alpha)^2 \tag{9.1.4} \]
Since the middle term is zero, (9.1.1) follows.

Problem 156. 2 points Let $y$ be an $n$-vector. (It may be a vector of observations of a random variable $y$, but it does not matter how the $y_i$ were obtained.) Prove that the scalar $\alpha$ which minimizes the sum
\[ (y_1 - \alpha)^2 + (y_2 - \alpha)^2 + \cdots + (y_n - \alpha)^2 = \sum (y_i - \alpha)^2 \tag{9.1.5} \]
is the arithmetic mean $\alpha = \bar y$.

Answer. Use (9.1.1).

Problem 157. Give an example of a distribution in which the sample mean is not a good estimate of the location parameter. Which other estimate (or estimates) would be preferable in that situation?

[Figure 1. Possible Density Functions for $y$ (curves centered at $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$)]

9.2.
Intuition of the Maximum Likelihood Estimator

In order to make intuitively clear what is involved in maximum likelihood estimation, look at the simplest case $y = \mu + \varepsilon$, $\varepsilon \sim N(0, 1)$, where $\mu$ is an unknown parameter. In other words: we know that one of the functions shown in Figure 1 is the density function of $y$, but we do not know which:

1) Assume we have only one observation $y$. What is then the MLE of $\mu$? It is that $\tilde\mu$ for which the value of the likelihood function, evaluated at $y$, is greatest. I.e., you look at all possible density functions and pick the one which is highest at point $y$, and use the $\mu$ which belongs to this density as your estimate.

2) Now assume two independent observations of $y$ are given, $y_1$ and $y_2$. The family of density functions is still the same. Which of these density functions do we choose now? The one for which the product of the ordinates over $y_1$ and $y_2$ gives the highest value. For this the peak of the density function must be exactly in the middle between the two observations.

3) Assume again that we made two independent observations $y_1$ and $y_2$ of $y$, but this time not only the expected value but also the variance of $y$ is unknown, call it $\sigma^2$. This gives a larger family of density functions to choose from: they do not only differ by location, but some are low and fat and others tall and skinny.
[Figure 2. Two observations, $\sigma^2 = 1$ (curves centered at $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$)]

[Figure 3. Two observations, $\sigma^2$ unknown]
[Figure 4. Only those centered over the two observations need to be considered]

[Figure 5. Many Observations]

For which density function is the product of the ordinates over $y_1$ and $y_2$ the largest again? Before even knowing our estimate of $\sigma^2$ we can already tell what $\tilde\mu$ is: it must again be $(y_1 + y_2)/2$. Then among those density functions which are centered over $(y_1 + y_2)/2$, there is one which is highest over $y_1$ and $y_2$. Figure 4 shows the densities for standard deviations 0.01, 0.05, 0.1, 0.5, 1, and 5. All curves, except the last one, are truncated at the point where the resolution of TeX can no longer distinguish between their level and zero. For the last curve this point would only be reached at the coordinates $\pm 25$.

4) If we have many observations, then the density pattern of the observations, as indicated by the histogram below, approximates the actual density function of $y$ itself.
That likelihood function must be chosen which has a high value where the points are dense, and which has a low value where the points are not so dense.

9.2.1. Precision of the Estimator. How good is $\bar y$ as estimate of $\mu$? To answer this question we need some criterion how to measure “goodness.” Assume your business depends on the precision of the estimate $\hat\mu$ of $\mu$. It incurs a penalty (extra cost) amounting to $(\hat\mu - \mu)^2$. You don't know what this error will be beforehand, but the expected value of this “loss function” may be an indication how good the estimate is. Generally, the expected value of a loss function is called the “risk,” and for the quadratic loss function $E[(\hat\mu - \mu)^2]$ it has the name “mean squared error of $\hat\mu$ as an estimate of $\mu$,” write it $\operatorname{MSE}[\hat\mu; \mu]$. What is the mean squared error of $\bar y$? Since $E[\bar y] = \mu$, it is $E[(\bar y - E[\bar y])^2] = \operatorname{var}[\bar y] = \frac{\sigma^2}{n}$.

Note that the MSE of $\bar y$ as an estimate of $\mu$ does not depend on $\mu$. This is convenient, since usually the MSE depends on unknown parameters, and therefore one usually does not know how good the estimator is. But it has more important advantages. For any estimator $\tilde y$ of $\mu$ follows $\operatorname{MSE}[\tilde y; \mu] = \operatorname{var}[\tilde y] + (E[\tilde y] - \mu)^2$. If $\tilde y$ is linear (perhaps with a constant term), then $\operatorname{var}[\tilde y]$ is a constant which does not depend on $\mu$, therefore the MSE is a constant if $\tilde y$ is unbiased and a quadratic function of $\mu$ (parabola) if $\tilde y$ is biased. Since a parabola is an unbounded function, a biased linear estimator has therefore the disadvantage that for certain values of $\mu$ its MSE may be very high. Some estimators are very good when $\mu$ is in one area, and very bad when $\mu$ is in another area. Since our unbiased estimator $\bar y$ has bounded MSE, it will not let us down, wherever nature has hidden the $\mu$.

On the other hand, the MSE does depend on the unknown $\sigma^2$. So we have to estimate $\sigma^2$.

9.3.
Variance Estimation and Degrees of Freedom

It is not so clear what the best estimator of $\sigma^2$ is. At least two possibilities are in common use:
\[ s_m^2 = \frac{1}{n}\sum (y_i - \bar y)^2 \tag{9.3.1} \]
or
\[ s_u^2 = \frac{1}{n-1}\sum (y_i - \bar y)^2. \tag{9.3.2} \]
Let us compute the expected value of our two estimators. Equation (9.1.1) with $\alpha = E[y]$ allows us to simplify the sum of squared errors so that it becomes easy to take expected values:
\[ E\Bigl[\sum_{i=1}^n (y_i - \bar y)^2\Bigr] = \sum_{i=1}^n E[(y_i - \mu)^2] - n\,E[(\bar y - \mu)^2] \tag{9.3.3} \]
\[ = \sum_{i=1}^n \sigma^2 - n\,\frac{\sigma^2}{n} = (n-1)\sigma^2, \tag{9.3.4} \]
because $E[(y_i - \mu)^2] = \operatorname{var}[y_i] = \sigma^2$ and $E[(\bar y - \mu)^2] = \operatorname{var}[\bar y] = \frac{\sigma^2}{n}$. Therefore, if we use as estimator of $\sigma^2$ the quantity
\[ s_u^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 \tag{9.3.5} \]
then this is an unbiased estimate.

Problem 158. 4 points Show that
\[ s_u^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2 \tag{9.3.6} \]
is an unbiased estimator of the variance. List the assumptions which have to be made about $y_i$ so that this proof goes through. Do you need Normality of the individual observations $y_i$ to prove this?

Answer. Use equation (9.1.1) with $\alpha = E[y]$:
\[ E\Bigl[\sum_{i=1}^n (y_i - \bar y)^2\Bigr] = \sum_{i=1}^n E[(y_i - \mu)^2] - n\,E[(\bar y - \mu)^2] \tag{9.3.7} \]
\[ = \sum_{i=1}^n \sigma^2 - n\,\frac{\sigma^2}{n} = (n-1)\sigma^2. \tag{9.3.8} \]
You do not need Normality for this.

For testing, confidence intervals, etc., one also needs to know the probability distribution of $s_u^2$. For this look up once more Section 4.9 about the Chi-Square distribution. There we introduced the terminology that a random variable $q$ is distributed as a $\sigma^2\chi^2$ iff $q/\sigma^2$ is a $\chi^2$. In our model with $n$ independent normal variables $y_i$ with same mean and variance, the variable $\sum (y_i - \bar y)^2$ is a $\sigma^2\chi^2_{n-1}$. Problem 159 gives a proof of this in the simplest case $n = 2$, and Problem 160 looks at the case $n = 3$. But it is valid for higher $n$ too. Therefore $s_u^2$ is a $\frac{\sigma^2}{n-1}\chi^2_{n-1}$. This is remarkable: the distribution of $s_u^2$ does not depend on $\mu$. Now use (4.9.5) to get the variance of $s_u^2$: it is $\frac{2\sigma^4}{n-1}$.

Problem 159.
Let $y_1$ and $y_2$ be two independent Normally distributed variables with mean $\mu$ and variance $\sigma^2$, and let $\bar y$ be their arithmetic mean.

• a. 2 points Show that
\[ \mathit{SSE} = \sum_{i=1}^2 (y_i - \bar y)^2 \sim \sigma^2\chi^2_1 \tag{9.3.9} \]
Hint: Find a Normally distributed random variable $z$ with expected value 0 and variance 1 such that $\mathit{SSE} = \sigma^2 z^2$.

Answer.
\[ \bar y = \frac{y_1 + y_2}{2} \tag{9.3.10} \]
\[ y_1 - \bar y = \frac{y_1 - y_2}{2} \tag{9.3.11} \]
\[ y_2 - \bar y = -\frac{y_1 - y_2}{2} \tag{9.3.12} \]
\[ (y_1 - \bar y)^2 + (y_2 - \bar y)^2 = \frac{(y_1 - y_2)^2}{4} + \frac{(y_1 - y_2)^2}{4} = \frac{(y_1 - y_2)^2}{2} \tag{9.3.13} \]
\[ = \sigma^2\left( \frac{y_1 - y_2}{\sqrt{2\sigma^2}} \right)^2, \tag{9.3.14} \]
and since $z = (y_1 - y_2)/\sqrt{2\sigma^2} \sim N(0, 1)$, its square is a $\chi^2_1$.

• b. 4 points Write down the covariance matrix of the vector
\[ \begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \end{bmatrix} \tag{9.3.15} \]
and show that it is singular.

Answer. (9.3.11) and (9.3.12) give
\[ \begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \end{bmatrix} = \begin{bmatrix} \tfrac12 & -\tfrac12 \\ -\tfrac12 & \tfrac12 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = D y \tag{9.3.16} \]
and $V[Dy] = D\,V[y]\,D^\top = \sigma^2 D$ because $V[y] = \sigma^2 I$ and $D = \begin{bmatrix} \tfrac12 & -\tfrac12 \\ -\tfrac12 & \tfrac12 \end{bmatrix}$ is symmetric and idempotent. $D$ is singular because its determinant is zero.

• c. 1 point The joint distribution of $y_1$ and $y_2$ is bivariate normal, why did we then get a $\chi^2$ with one, instead of two, degrees of freedom?

Answer. Because $y_1 - \bar y$ and $y_2 - \bar y$ are not independent; one is exactly the negative of the other; therefore the sum of their squares is really only a multiple of the square of one univariate normal.

Problem 160. Assume $y_1$, $y_2$, and $y_3$ are independent $N(\mu, \sigma^2)$. Define three new variables $z_1$, $z_2$, and $z_3$ as follows: $z_1$ is that multiple of $\bar y$ which has variance $\sigma^2$. $z_2$ is that linear combination of $z_1$ and $y_2$ which has zero covariance with $z_1$ and has variance $\sigma^2$. $z_3$ is that linear combination of $z_1$, $z_2$, and $y_3$ which has zero covariance with both $z_1$ and $z_2$ and has again variance $\sigma^2$. These properties define $z_1$, $z_2$, and $z_3$ uniquely up to factors $\pm 1$, i.e., if $z_1$ satisfies the above conditions, then $-z_1$ does too, and these are the only two solutions.

• a.
2 points Write $z_1$ and $z_2$ (not yet $z_3$) as linear combinations of $y_1$, $y_2$, and $y_3$.

• b. 1 point To make the computation of $z_3$ less tedious, first show the following: if $z_3$ has zero covariance with $z_1$ and $z_2$, it also has zero covariance with $y_2$.

• c. 1 point Therefore $z_3$ is a linear combination of $y_1$ and $y_3$ only. Compute its coefficients.

• d. 1 point How does the joint distribution of $z_1$, $z_2$, and $z_3$ differ from that of $y_1$, $y_2$, and $y_3$? Since they are jointly normal, you merely have to look at the expected values, variances, and covariances.

• e. 2 points Show that $z_1^2 + z_2^2 + z_3^2 = y_1^2 + y_2^2 + y_3^2$. Is this a surprise?

• f. 1 point Show further that $s_u^2 = \frac12\sum_{i=1}^3 (y_i - \bar y)^2 = \frac12 (z_2^2 + z_3^2)$. (There is a simple trick!) Conclude from this that $s_u^2 \sim \frac{\sigma^2}{2}\chi^2_2$, independent of $\bar y$.

For a matrix-interpretation of what is happening, see equation (7.4.9) together with Problem 161.

Problem 161. 3 points Verify that the matrix $D = I - \frac{1}{n}\iota\iota^\top$ is symmetric and idempotent, and that the sample covariance of two vectors of observations $x$ and $y$ can be written in matrix notation as
\[ \text{sample covariance}(x, y) = \frac{1}{n}\sum (x_i - \bar x)(y_i - \bar y) = \frac{1}{n}\,x^\top D y \tag{9.3.17} \]

In general, one can always find $n - 1$ normal variables with variance $\sigma^2$, independent of each other and of $\bar y$, whose sum of squares is equal to $\sum (y_i - \bar y)^2$. Simply start with $\bar y\sqrt{n}$ and generate $n - 1$ linear combinations of the $y_i$ which are pairwise uncorrelated and have variances $\sigma^2$. You are simply building an orthonormal coordinate system with $\bar y\sqrt{n}$ as its first vector; there are many different ways to do this.

Next let us show that $\bar y$ and $s_u^2$ are statistically independent. This is an advantage. Assume, hypothetically, $\bar y$ and $s_u^2$ were negatively correlated. Then, if the observed value of $\bar y$ is too high, chances are that the one of $s_u^2$ is too low, and a look at $s_u^2$ will not reveal how far off the mark $\bar y$ may be.
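
Before turning to the proof, the claim can be made plausible numerically. The following Monte Carlo sketch is only an illustration under stated assumptions: the sample size, the number of replications, the seeds, and the helper names (`mean_and_s2u`, `correlation`, `simulated_corr`) are hypothetical choices. It estimates the correlation between the sample mean and the unbiased variance estimator, once for Normal data and once, for contrast, for exponential data. A correlation near zero does not by itself establish independence; it is merely consistent with it.

```python
import random

def mean_and_s2u(sample):
    """Return the sample mean and the unbiased variance estimate s_u^2."""
    n = len(sample)
    ybar = sum(sample) / n
    s2u = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    return ybar, s2u

def correlation(xs, ys):
    """Plain sample correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulated_corr(draw, n=5, reps=20000, seed=1):
    """Correlation between ybar and s_u^2 over many simulated samples of size n."""
    rng = random.Random(seed)
    pairs = [mean_and_s2u([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return correlation([p[0] for p in pairs], [p[1] for p in pairs])

# Normal observations (hypothetical mu = 3, sigma = 2) versus exponential ones.
corr_normal = simulated_corr(lambda rng: rng.gauss(3.0, 2.0))
corr_expon = simulated_corr(lambda rng: rng.expovariate(1.0))
print(round(corr_normal, 3), round(corr_expon, 3))
```

In this sketch the Normal case should give a correlation close to zero, while the skewed exponential case gives a clearly positive one, which suggests that the independence of $\bar y$ and $s_u^2$ is tied to normality.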
To prove independence, we will first show that $\bar y$ and $y_i - \bar y$ are uncorrelated:
\[ \operatorname{cov}[\bar y, y_i - \bar y] = \operatorname{cov}[\bar y, y_i] - \operatorname{var}[\bar y] \tag{9.3.18} \]
\[ = \operatorname{cov}\Bigl[\frac{1}{n}(y_1 + \cdots + y_i + \cdots + y_n),\, y_i\Bigr] - \frac{\sigma^2}{n} = 0 \tag{9.3.19} \]
By normality, $\bar y$ is therefore independent of $y_i - \bar y$ for all $i$. Since all variables involved are jointly normal, it follows from this that $\bar y$ is independent of the vector $\begin{bmatrix} y_1 - \bar y & \cdots & y_n - \bar y \end{bmatrix}^\top$; therefore it is also independent of any function of this vector, such as $s_u^2$.

The above calculations explain why the parameter of the $\chi^2$ distribution has the colorful name “degrees of freedom.” This term is sometimes used in a very broad sense, referring to estimation in general, and sometimes in a narrower sense, in conjunction with the linear model. Here is first an interpretation of the general use of the term. A “statistic” is defined to be a function of the observations and of other known parameters of the problem, but not of the unknown parameters. Estimators are statistics. If one has $n$ observations, then one can find at most $n$ mathematically independent statistics; any other statistic is then a function of these $n$. If therefore a model has $k$ independent unknown parameters, then one must have at least $k$ observations to be able to estimate all parameters of the model. The number $n - k$, i.e., the number of observations not “used up” for estimation, is called the number of “degrees of freedom.”

There are at least three reasons why one does not want to make the model such that it uses up too many degrees of freedom.
(1) the estimators become too inaccurate if one does; (2) if there are no degrees of freedom left, it is no longer possible to make any “diagnostic” tests whether the model really fits the data, because it always gives a perfect fit whatever the given set of data; (3) if there are no degrees of freedom left, then one can usually also no longer make estimates of the precision of the estimates.

Specifically in our linear estimation problem, the number of degrees of freedom is $n - 1$, since one observation has been used up for estimating the mean. If one runs a regression, the number of degrees of freedom is $n - k$, where $k$ is the number of regression coefficients. In the linear model, the number of degrees of freedom becomes immediately relevant for the estimation of $\sigma^2$. If $k$ observations are used up for estimating the slope parameters, then the other $n - k$ observations can be combined into an $(n - k)$-variate Normal whose expected value does not depend on the slope parameter at all but is zero, which allows one to estimate the variance.

If we assume that the original observations are normally distributed, i.e., $y_i \sim \operatorname{NID}(\mu, \sigma^2)$, then we know that $s_u^2 \sim \frac{\sigma^2}{n-1}\chi^2_{n-1}$. Therefore $E[s_u^2] = \sigma^2$ and $\operatorname{var}[s_u^2] = 2\sigma^4/(n-1)$. This estimate of $\sigma^2$ therefore not only gives us an estimate of the precision of $\bar y$, but it has an estimate of its own precision built in.

Interestingly, the MSE of the alternative estimator $s_m^2 = \frac{\sum (y_i - \bar y)^2}{n}$ is smaller than that of $s_u^2$, although $s_m^2$ is a biased estimator and $s_u^2$ an unbiased estimator of $\sigma^2$. For every estimator $t$, $\operatorname{MSE}[t; \theta] = \operatorname{var}[t] + (E[t - \theta])^2$, i.e., it is variance plus squared bias. The MSE of $s_u^2$ is therefore equal to its variance, which is $\frac{2\sigma^4}{n-1}$. The alternative $s_m^2 = \frac{n-1}{n}s_u^2$ has bias $-\frac{\sigma^2}{n}$ and variance $\frac{2\sigma^4(n-1)}{n^2}$. Its MSE is $\frac{(2 - 1/n)\sigma^4}{n}$. Comparing that with the formula for the MSE of $s_u^2$ one sees that the numerator is smaller and the denominator is bigger, therefore $s_m^2$ has smaller MSE.
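
The comparison of the two mean squared errors can be checked numerically. The following sketch evaluates the two formulas from the text, $2\sigma^4/(n-1)$ for $s_u^2$ and $(2 - 1/n)\sigma^4/n = (2n-1)\sigma^4/n^2$ for $s_m^2$, and compares them with Monte Carlo estimates. The values $n = 10$, $\mu = 5$, $\sigma^2 = 2$, the number of replications, the seed, and the function names are hypothetical choices made only for this illustration.

```python
import random

def mse_formulas(n, sigma2):
    """MSE of s_u^2 and s_m^2 under normality, from the formulas in the text."""
    s4 = sigma2 ** 2
    mse_u = 2 * s4 / (n - 1)            # variance of the unbiased estimator
    mse_m = (2 * n - 1) * s4 / n ** 2   # variance plus squared bias of s_m^2
    return mse_u, mse_m

def mse_simulated(n, mu, sigma2, reps=40000, seed=7):
    """Monte Carlo estimates of the same two mean squared errors."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    err_u = err_m = 0.0
    for _ in range(reps):
        ys = [rng.gauss(mu, sd) for _ in range(n)]
        ybar = sum(ys) / n
        sse = sum((y - ybar) ** 2 for y in ys)
        err_u += (sse / (n - 1) - sigma2) ** 2
        err_m += (sse / n - sigma2) ** 2
    return err_u / reps, err_m / reps

n, mu, sigma2 = 10, 5.0, 2.0  # hypothetical illustration values
exact_u, exact_m = mse_formulas(n, sigma2)
sim_u, sim_m = mse_simulated(n, mu, sigma2)
print(exact_u, exact_m)
print(sim_u, sim_m)
```

With these values the formulas give $8/9$ for $s_u^2$ and $0.76$ for $s_m^2$; the simulated values should lie close to these and preserve the ordering.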
Problem 162. 4 points Assume $y_i \sim \operatorname{NID}(\mu, \sigma^2)$. Show that the so-called Theil-Schweitzer estimator [TS61]
\[ s_t^2 = \frac{1}{n+1}\sum (y_i - \bar y)^2 \tag{9.3.20} \]
has even smaller MSE than $s_u^2$ and $s_m^2$ as an estimator of $\sigma^2$.

[Figure 6. Densities of Unbiased and Theil-Schweitzer Estimators (horizontal axis from 0 to 6)]

Answer. $s_t^2 = \frac{n-1}{n+1}s_u^2$; therefore its bias is $-\frac{2\sigma^2}{n+1}$ and its variance is $\frac{2(n-1)\sigma^4}{(n+1)^2}$, and the MSE is $\frac{2\sigma^4}{n+1}$. That this is smaller than the MSE of $s_m^2$ means $\frac{2n-1}{n^2} \geq \frac{2}{n+1}$, which follows from $(2n-1)(n+1) = 2n^2 + n - 1 > 2n^2$ for $n > 1$.

Problem 163. 3 points Computer assignment: Given 20 independent observations of a random variable $y \sim N(\mu, \sigma^2)$. Assume you know that $\sigma^2 = 2$. Plot the density function of $s_u^2$. Hint: In R, the command dchisq(x,df=25) returns the density of a Chi-square distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example, this is not the number of degrees of freedom you need here. You also do not need the density of a Chi-Square but that of a certain multiple of a Chi-square. (Use the transformation theorem for density functions!)

Answer. $s_u^2 \sim \frac{2}{19}\chi^2_{19}$. To express the density of the variable whose density is known by that whose density one wants to know, say $\frac{19}{2}s_u^2 \sim \chi^2_{19}$. Therefore
\[ f_{s_u^2}(x) = \frac{19}{2}\,f_{\chi^2_{19}}\Bigl(\frac{19}{2}\,x\Bigr). \tag{9.3.21} \]

• a. 2 points In the same plot, plot the density function of the Theil-Schweitzer estimate $s_t^2$ defined in equation (9.3.20). This gives a plot as in Figure 6.
Can one see from the comparison of these density functions that the Theil-Schweitzer estimator has a better MSE?

Answer. Start with plotting the Theil-Schweitzer plot, because it is higher, and therefore it will give the right dimensions of the plot. You can run this by giving the command ecmetscript(theilsch). The two areas between the densities have equal size, but the area where the Theil-Schweitzer density is higher is overall closer to the true value than the area where the unbiased density is higher.

Problem 164. 4 points The following problem illustrates the general fact that if one starts with an unbiased estimator and “shrinks” it a little, one will end up with a better MSE. Assume $E[y] = \mu$, $\operatorname{var}(y) = \sigma^2$, and you make $n$ independent observations $y_i$. The best linear unbiased estimator of $\mu$ on the basis of these observations is the sample mean $\bar y$. Show that, whenever $\alpha$ satisfies
\[ \frac{n\mu^2 - \sigma^2}{n\mu^2 + \sigma^2} < \alpha < 1 \tag{9.3.22} \]
then $\operatorname{MSE}[\alpha\bar y; \mu] < \operatorname{MSE}[\bar y; \mu]$. Unfortunately, this condition depends on $\mu$ and $\sigma^2$ and can therefore not be used to improve the estimate.

Answer. Here is the mathematical relationship:
\[ \operatorname{MSE}[\alpha\bar y; \mu] = E\bigl[(\alpha\bar y - \mu)^2\bigr] = E\bigl[(\alpha\bar y - \alpha\mu + \alpha\mu - \mu)^2\bigr] < \operatorname{MSE}[\bar y; \mu] = \operatorname{var}[\bar y] \tag{9.3.23} \]
\[ \alpha^2\sigma^2/n + (1 - \alpha)^2\mu^2 < \sigma^2/n \tag{9.3.24} \]
Now simplify it:
\[ (1 - \alpha)^2\mu^2 < (1 - \alpha^2)\sigma^2/n = (1 - \alpha)(1 + \alpha)\sigma^2/n \tag{9.3.25} \]
This cannot be true for $\alpha \geq 1$, because for $\alpha = 1$ one has equality, and for $\alpha > 1$, the righthand side is negative. Therefore we are allowed to assume $\alpha < 1$, and can divide by $1 - \alpha$ without disturbing the inequality:
\[ (1 - \alpha)\mu^2 < (1 + \alpha)\sigma^2/n \tag{9.3.26} \]
\[ \mu^2 - \sigma^2/n < \alpha(\mu^2 + \sigma^2/n) \tag{9.3.27} \]
The answer is therefore
\[ \frac{n\mu^2 - \sigma^2}{n\mu^2 + \sigma^2} < \alpha < 1. \tag{9.3.28} \]
This is the range. Note that $n\mu^2 - \sigma^2$ may be negative. The best value is in the middle of this range, see Problem 165.

Problem 165. [KS79, example 17.14 on p. 22] The mathematics in the following problem is easier than it looks.
If you can't prove a., assume it and derive b. from it, etc.

• a. 2 points Let $t$ be an estimator of the nonrandom scalar parameter $\theta$. $\mathrm{E}[t - \theta]$ is called the bias of $t$, and $\mathrm{E}\bigl[(t-\theta)^2\bigr]$ is called the mean squared error of $t$ as an estimator of $\theta$, written $\mathrm{MSE}[t; \theta]$. Show that the MSE is the variance plus the squared bias, i.e., that

$$\mathrm{MSE}[t; \theta] = \mathrm{var}[t] + \bigl(\mathrm{E}[t - \theta]\bigr)^2. \tag{9.3.29}$$

Answer. The most elegant proof, which also indicates what to do when $\theta$ is random, is:

$$\mathrm{MSE}[t; \theta] = \mathrm{E}\bigl[(t-\theta)^2\bigr] = \mathrm{var}[t - \theta] + \bigl(\mathrm{E}[t-\theta]\bigr)^2 = \mathrm{var}[t] + \bigl(\mathrm{E}[t-\theta]\bigr)^2. \tag{9.3.30}$$

• b. 2 points For the rest of this problem assume that $t$ is an unbiased estimator of $\theta$ with $\mathrm{var}[t] > 0$. We will investigate whether one can get a better MSE if one estimates $\theta$ by a constant multiple $at$ instead of $t$. Show that

$$\mathrm{MSE}[at; \theta] = a^2\,\mathrm{var}[t] + (a-1)^2\theta^2. \tag{9.3.31}$$

Answer. $\mathrm{var}[at] = a^2\,\mathrm{var}[t]$ and the bias of $at$ is $\mathrm{E}[at - \theta] = (a-1)\theta$. Now apply (9.3.30).

• c. 1 point Show that, whenever $a > 1$, then $\mathrm{MSE}[at; \theta] > \mathrm{MSE}[t; \theta]$. If one wants to decrease the MSE, one should therefore not choose $a > 1$.

Answer. $\mathrm{MSE}[at; \theta] - \mathrm{MSE}[t; \theta] = (a^2 - 1)\,\mathrm{var}[t] + (a-1)^2\theta^2 > 0$ since $a > 1$ and $\mathrm{var}[t] > 0$.

• d. 2 points Show that

$$\frac{d}{da}\,\mathrm{MSE}[at; \theta]\Big|_{a=1} > 0. \tag{9.3.32}$$

From this follows that the MSE of $at$ is smaller than the MSE of $t$, as long as $a < 1$ and close enough to 1.

Answer. The derivative of (9.3.31) is

$$\frac{d}{da}\,\mathrm{MSE}[at; \theta] = 2a\,\mathrm{var}[t] + 2(a-1)\theta^2 \tag{9.3.33}$$

Plug $a = 1$ into this to get $2\,\mathrm{var}[t] > 0$.

• e. 2 points By solving the first order condition show that the factor $a$ which gives smallest MSE is

$$a = \frac{\theta^2}{\mathrm{var}[t] + \theta^2}. \tag{9.3.34}$$

Answer. Rewrite (9.3.33) as $2a(\mathrm{var}[t] + \theta^2) - 2\theta^2$ and set it zero.

• f. 1 point Assume $t$ has an exponential distribution with parameter $\lambda > 0$, i.e.,

$$f_t(t) = \lambda\exp(-\lambda t), \quad t \ge 0, \qquad\text{and } f_t(t) = 0 \text{ otherwise.} \tag{9.3.35}$$

Check that $f_t(t)$ is indeed a density function.

Answer. Since $\lambda > 0$, $f_t(t) > 0$ for all $t \ge 0$.
To evaluate $\int_0^\infty \lambda\exp(-\lambda t)\,dt$, substitute $s = -\lambda t$, therefore $ds = -\lambda\,dt$, and the upper integration limit changes from $+\infty$ to $-\infty$; therefore the integral is $-\int_0^{-\infty}\exp(s)\,ds = 1$.

• g. 4 points Using this density function (and no other knowledge about the exponential distribution) prove that $t$ is an unbiased estimator of $1/\lambda$, with $\mathrm{var}[t] = 1/\lambda^2$.

Answer. To evaluate $\int_0^\infty \lambda t\exp(-\lambda t)\,dt$, use partial integration $\int u v'\,dt = uv - \int u' v\,dt$ with $u = t$, $u' = 1$, $v = -\exp(-\lambda t)$, $v' = \lambda\exp(-\lambda t)$. Therefore the integral is

$$-t\exp(-\lambda t)\Big|_0^\infty + \int_0^\infty \exp(-\lambda t)\,dt = 1/\lambda,$$

since we just saw that $\int_0^\infty \lambda\exp(-\lambda t)\,dt = 1$.

To evaluate $\int_0^\infty \lambda t^2\exp(-\lambda t)\,dt$, use partial integration with $u = t^2$, $u' = 2t$, $v = -\exp(-\lambda t)$, $v' = \lambda\exp(-\lambda t)$. Therefore the integral is

$$-t^2\exp(-\lambda t)\Big|_0^\infty + 2\int_0^\infty t\exp(-\lambda t)\,dt = \frac{2}{\lambda}\int_0^\infty \lambda t\exp(-\lambda t)\,dt = 2/\lambda^2.$$

Therefore $\mathrm{var}[t] = \mathrm{E}[t^2] - (\mathrm{E}[t])^2 = 2/\lambda^2 - 1/\lambda^2 = 1/\lambda^2$.

• h. 2 points Which multiple of $t$ has the lowest MSE as an estimator of $1/\lambda$?

Answer. It is $t/2$. Just plug $\theta = 1/\lambda$ into (9.3.34):

$$a = \frac{1/\lambda^2}{\mathrm{var}[t] + 1/\lambda^2} = \frac{1/\lambda^2}{1/\lambda^2 + 1/\lambda^2} = \frac{1}{2}. \tag{9.3.36}$$

• i. 2 points Assume $t_1, \ldots, t_n$ are independently distributed, and each of them has the exponential distribution with the same parameter $\lambda$. Which multiple of the sample mean $\bar t = \frac{1}{n}\sum_{i=1}^n t_i$ has best MSE as estimator of $1/\lambda$?

Answer. $\bar t$ has expected value $1/\lambda$ and variance $1/(n\lambda^2)$. Therefore

$$a = \frac{1/\lambda^2}{\mathrm{var}[\bar t] + 1/\lambda^2} = \frac{1/\lambda^2}{1/(n\lambda^2) + 1/\lambda^2} = \frac{n}{n+1}, \tag{9.3.37}$$

i.e., for the best estimator $\tilde t = \frac{1}{n+1}\sum t_i$ divide the sum by $n + 1$ instead of $n$.

• j. 3 points Assume $q \sim \sigma^2\chi^2_m$ (in other words, $\frac{1}{\sigma^2}q \sim \chi^2_m$, a Chi-square distribution with $m$ degrees of freedom). Using the fact that $\mathrm{E}[\chi^2_m] = m$ and $\mathrm{var}[\chi^2_m] = 2m$, compute that multiple of $q$ that has minimum MSE as estimator of $\sigma^2$.

Answer. This is a trick question since $q$ itself is not an unbiased estimator of $\sigma^2$. $\mathrm{E}[q] = m\sigma^2$, therefore $q/m$ is the unbiased estimator.
Since $\mathrm{var}[q/m] = 2\sigma^4/m$, it follows from (9.3.34) that $a = m/(m+2)$, therefore the minimum MSE multiple of $q$ is $\frac{q}{m}\cdot\frac{m}{m+2} = \frac{q}{m+2}$. I.e., divide $q$ by $m + 2$ instead of $m$.

• k. 3 points Assume you have $n$ independent observations of a Normally distributed random variable $y$ with unknown mean $\mu$ and variance $\sigma^2$. The best unbiased estimator of $\sigma^2$ is $\frac{1}{n-1}\sum (y_i - \bar y)^2$, and the maximum likelihood estimator is $\frac{1}{n}\sum (y_i - \bar y)^2$. What are the implications of the above for the question whether one should use the first or the second or still some other multiple of $\sum (y_i - \bar y)^2$?

Answer. Taking that multiple of the sum of squared errors which makes the estimator unbiased is not necessarily a good choice. In terms of MSE, the best multiple of $\sum (y_i - \bar y)^2$ is $\frac{1}{n+1}\sum (y_i - \bar y)^2$.

• l. 3 points We are still in the model defined in k. Which multiple of the sample mean $\bar y$ has smallest MSE as estimator of $\mu$? How does this example differ from the ones given above? Can this formula have practical significance?

Answer. Here the optimal $a = \frac{\mu^2}{\mu^2 + \sigma^2/n}$. Unlike in the earlier examples, this $a$ depends on the unknown parameters. One can "operationalize" it by estimating the parameters from the data, but the noise introduced by this estimation can easily make the estimator worse than the simple $\bar y$. Indeed, $\bar y$ is admissible, i.e., it cannot be uniformly improved upon. On the other hand, the Stein rule, which can be considered an operationalization of a very similar formula (the only difference being that one estimates the mean vector of a vector with at least 3 elements) by estimating $\mu^2$ and $\mu^2 + \frac{1}{n}\sigma^2$ from the data, shows that such an operationalization is sometimes successful.

We will discuss here one more property of $\bar y$ and $s_u^2$: together they form sufficient statistics for $\mu$ and $\sigma^2$. I.e., any estimator of $\mu$ and $\sigma^2$ which is not a function of $\bar y$ and $s_u^2$ is less efficient than it could be.
Since the factorization theorem for sufficient statistics holds even if the parameter $\theta$ and its estimate $t$ are vectors, we have to write the joint density of the observation vector $y$ as a product of two functions, one depending on the parameters and the sufficient statistics, and the other depending on the value taken by $y$, but not on the parameters. Indeed, it will turn out that this second function can just be taken to be $h(y) = 1$, since the density function can be rearranged as

$$f_y(y_1, \ldots, y_n; \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\sum_{i=1}^n (y_i - \mu)^2/2\sigma^2\Bigr) = \tag{9.3.38}$$

$$= (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\Bigl(\sum_{i=1}^n (y_i - \bar y)^2 + n(\bar y - \mu)^2\Bigr)/2\sigma^2\Bigr) = \tag{9.3.39}$$

$$= (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\frac{(n-1)s_u^2 + n(\bar y - \mu)^2}{2\sigma^2}\Bigr). \tag{9.3.40}$$

CHAPTER 10

Estimation Principles and Classification of Estimators

10.1. Asymptotic or Large-Sample Properties of Estimators

We will discuss asymptotic properties first, because the idea of estimation is to get more certainty by increasing the sample size. Strictly speaking, asymptotic properties do not refer to individual estimators but to sequences of estimators, one for each sample size $n$. And strictly speaking, if one alters the first 10 estimators or the first million estimators and leaves the others unchanged, one still gets a sequence with the same asymptotic properties. The results that follow should therefore be used with caution. The asymptotic properties may say very little about the concrete estimator at hand.

The most basic asymptotic property is (weak) consistency. An estimator $t_n$ (where $n$ is the sample size) of the parameter $\theta$ is consistent iff

$$\operatorname*{plim}_{n\to\infty} t_n = \theta. \tag{10.1.1}$$

Roughly, a consistent estimation procedure is one which gives the correct parameter values if the sample is large enough. There are only very few exceptional situations in which an estimator is acceptable which is not consistent, i.e., which does not converge in the plim to the true parameter value.

Problem 166.
Can you think of a situation where an estimator which is not consistent is acceptable?

Answer. If additional data no longer give information, as when estimating the initial state of a time series, or in prediction. Or if there is no identification but the value can be confined to an interval. This is also inconsistency.

The following is an important property of consistent estimators:

Slutsky theorem: If $t$ is a consistent estimator for $\theta$, and the function $g$ is continuous at the true value of $\theta$, then $g(t)$ is consistent for $g(\theta)$.

For the proof of the Slutsky theorem remember the definition of a continuous function. $g$ is continuous at $\theta$ iff for all $\varepsilon > 0$ there exists a $\delta > 0$ with the property that for all $\theta_1$ with $|\theta_1 - \theta| < \delta$ follows $|g(\theta_1) - g(\theta)| < \varepsilon$. To prove consistency of $g(t)$ we have to show that for all $\varepsilon > 0$, $\Pr[|g(t) - g(\theta)| \ge \varepsilon] \to 0$. Choose for the given $\varepsilon$ a $\delta$ as above; then $|g(t) - g(\theta)| \ge \varepsilon$ implies $|t - \theta| \ge \delta$, because all those values of $t$ for which $|t - \theta| < \delta$ lead to a $g(t)$ with $|g(t) - g(\theta)| < \varepsilon$. This logical implication means that

$$\Pr[|g(t) - g(\theta)| \ge \varepsilon] \le \Pr[|t - \theta| \ge \delta]. \tag{10.1.2}$$

Since the probability on the righthand side converges to zero, the one on the lefthand side converges too.

Different consistent estimators can have quite different speeds of convergence. Are there estimators which have optimal asymptotic properties among all consistent estimators? Yes, if one limits oneself to a fairly reasonable subclass of consistent estimators.

Here are the details: Most consistent estimators we will encounter are asymptotically normal, i.e., the "shape" of their distribution function converges towards the normal distribution, as we had it for the sample mean in the central limit theorem.
In order to be able to use this asymptotic distribution for significance tests and confidence intervals, however, one needs more than asymptotic normality (and many textbooks are not aware of this): one needs the convergence to normality to be uniform in compact intervals [Rao73, p. 346–351]. Such estimators are called consistent uniformly asymptotically normal estimators (CUAN estimators).

If one limits oneself to CUAN estimators it can be shown that there are asymptotically "best" CUAN estimators. Since the distribution is asymptotically normal, there is no problem to define what it means to be asymptotically best: those estimators are asymptotically best whose asymptotic MSE = asymptotic variance is smallest. CUAN estimators whose MSE is asymptotically no larger than that of any other CUAN estimator are called asymptotically efficient. Rao has shown that for CUAN estimators the lower bound for this asymptotic variance is the asymptotic limit of the Cramer-Rao lower bound (CRLB). (More about the CRLB below.) Maximum likelihood estimators are therefore usually efficient CUAN estimators. In this sense one can think of maximum likelihood estimators as something like asymptotically best consistent estimators; compare a statement to this effect in [Ame94, p. 144]. And one can think of asymptotically efficient CUAN estimators as estimators which are in large samples as good as maximum likelihood estimators.

All these are large sample properties. Among the asymptotically efficient estimators there are still wide differences regarding the small sample properties. Asymptotic efficiency should therefore again be considered a minimum requirement: there must be very good reasons not to be working with an asymptotically efficient estimator.

Problem 167. Can you think of situations in which an estimator is acceptable which is not asymptotically efficient?

Answer. If robustness matters then the median may be preferable to the mean, although it is less efficient.

10.2. Small Sample Properties

In order to judge how good an estimator is for small samples, one has two dilemmas: (1) there are many different criteria for an estimator to be "good"; (2) even if one has decided on one criterion, a given estimator may be good for some values of the unknown parameters and not so good for others.

If $x$ and $y$ are two estimators of the parameter $\theta$, then each of the following conditions can be interpreted to mean that $x$ is better than $y$:

$$\Pr[|x - \theta| \le |y - \theta|] = 1 \tag{10.2.1}$$

$$\mathrm{E}[g(x - \theta)] \le \mathrm{E}[g(y - \theta)] \tag{10.2.2}$$

for every continuous function $g$ which is nonincreasing for negative arguments and nondecreasing for positive arguments;

$$\mathrm{E}[g(|x - \theta|)] \le \mathrm{E}[g(|y - \theta|)] \tag{10.2.3}$$

for every continuous and nondecreasing function $g$;

$$\Pr[\{|x - \theta| > \varepsilon\}] \le \Pr[\{|y - \theta| > \varepsilon\}] \quad\text{for every } \varepsilon \tag{10.2.4}$$

$$\mathrm{E}[(x - \theta)^2] \le \mathrm{E}[(y - \theta)^2] \tag{10.2.5}$$

$$\Pr[|x - \theta| < |y - \theta|] \ge \Pr[|x - \theta| > |y - \theta|] \tag{10.2.6}$$

This list is from [Ame94, pp. 118–122]. But we will simply use the MSE.

Therefore we are left with dilemma (2). There is no single estimator that has uniformly the smallest MSE in the sense that its MSE is better than the MSE of any other estimator whatever the value of the parameter. To see this, simply think of the following estimator $t$ of $\theta$: $t = 10$; i.e., whatever the outcome of the experiments, $t$ always takes the value 10. This estimator has zero MSE when $\theta$ happens to be 10, but is a bad estimator when $\theta$ is far away from 10. If an estimator existed which had uniformly best MSE, then it would have to be at least as good as every constant estimator, i.e., have zero MSE whatever the value of the parameter, and this is only possible if the parameter itself is observed.

Although the MSE criterion cannot be used to pick one best estimator, it can be used to rule out estimators which are unnecessarily bad in the sense that other estimators exist which are never worse but sometimes better in terms of MSE whatever the true parameter values.
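Both halves of this argument can be checked with a little arithmetic. The following is an illustrative sketch only, not part of the original notes: the function names are hypothetical, the comparison estimator for the constant $t = 10$ is assumed to be the sample mean of $n$ observations with variance `sigma2`, and the variance example assumes normal data, so that the sum of squared deviations is $\sigma^2\chi^2_{n-1}$ as in Problem 162.

```python
def mse_constant(theta, c=10.0):
    """MSE of the estimator that always returns c: zero variance, bias c - theta."""
    return (c - theta) ** 2


def mse_sample_mean(sigma2, n):
    """MSE of the sample mean of n observations: unbiased, variance sigma2 / n."""
    return sigma2 / n


def mse_scaled_sum_of_squares(sigma2, n, divisor):
    """MSE of sum((y_i - ybar)^2) / divisor as an estimator of sigma2.

    Assumes normal data, so the sum of squares is sigma2 times a
    chi-square variable with n - 1 degrees of freedom.
    """
    mean = (n - 1) * sigma2 / divisor
    variance = 2 * (n - 1) * sigma2 ** 2 / divisor ** 2
    return variance + (mean - sigma2) ** 2


if __name__ == "__main__":
    n = 25
    # Constant estimator versus sample mean: the ranking flips with theta.
    for theta in (10.0, 10.1, 12.0):
        print(theta, mse_constant(theta), mse_sample_mean(1.0, n))
    # Divisor n - 1 versus n + 1: the ranking is the same for every sigma2.
    for sigma2 in (0.5, 1.0, 4.0):
        print(sigma2,
              mse_scaled_sum_of_squares(sigma2, n, n - 1),
              mse_scaled_sum_of_squares(sigma2, n, n + 1))
```

Near $\theta = 10$ the constant estimator has the smaller MSE and far from 10 the sample mean does, so neither rules out the other. In the variance example, by contrast, the divisor $n+1$ gives a smaller MSE than $n-1$ at every value of $\sigma^2$ tried, which is the "never worse but sometimes better" situation.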
Estimators which are dominated in this sense are called inadmissible.

But how can one choose between two admissible estimators? [Ame94, p. 124] gives two reasonable strategies. One is to integrate the MSE out over a distribution of the likely values of the parameter. This is in the spirit of the Bayesians, although Bayesians would still do it differently. The other strategy is to choose a minimax strategy. Amemiya seems to consider this an acceptable strategy, but it is really too defensive.

Here is a third strategy, which is often used but less well founded theoretically: Since there are no estimators which have minimum MSE among all estimators, one often looks for estimators which have minimum MSE among all estimators with a certain property. And the "certain property" which is most often used is unbiasedness. The MSE of an unbiased estimator is its variance; and an estimator which has minimum variance in the class of all unbiased estimators is called "efficient."

The class of unbiased estimators has a high-sounding name, and the results related to Cramer-Rao and Least Squares seem to confirm that it is an important class of estimators. However I will argue in these class notes that unbiasedness itself is not a desirable property.

10.3. Comparison Unbiasedness Consistency

Let us compare consistency with unbiasedness. If the estimator is unbiased, then its expected value for any sample size, whether large or small, is equal to the true parameter value. By the law of large numbers this can be translated into a statement about large samples: The mean of many independent replications of the estimate, even if each replication only uses a small number of observations, gives the true parameter value. Unbiasedness says therefore something about the small sample properties of the estimator, while consistency does not.

The following thought experiment may clarify the difference between unbiasedness and consistency.
Imagine you are conducting an experiment which gives you every ten seconds an independent measurement, i.e., a measurement whose value is not influenced by the outcome of previous measurements. Imagine further that the experimental setup is connected to a computer which estimates certain parameters of that experiment, re-calculating its estimate every time twenty new observations have become available, and which displays the current values of the estimate on a screen. And assume that the estimation procedure used by the computer is consistent, but biased for any finite number of observations.

Consistency means: after a sufficiently long time, the digits of the parameter estimate displayed by the computer will be correct. That the estimator is biased means: if the computer were to use every batch of 20 observations to form a new estimate of the parameter, without utilizing prior observations, and then would use the average of all these independent estimates as its updated estimate, it would end up displaying a wrong parameter value on the screen.

A biased estimator gives, even in the limit, an incorrect result as long as one's updating procedure is simply taking the average of all previous estimates. If an estimator is biased but consistent, then a better updating method is available, which will end up in the correct parameter value. A biased estimator therefore is not necessarily one which gives incorrect information about the parameter value; but it is one which one cannot update by simply taking averages. But there is no reason to limit oneself to such a crude method of updating. Obviously the question whether the estimate is biased is of little relevance, as long as it is consistent. The moral of the story is: If one looks for desirable estimators, by no means should one restrict one's search to unbiased estimators! The high-sounding name "unbiased" for the technical property $\mathrm{E}[t] = \theta$ has created a lot of confusion.
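The thought experiment can be imitated on a small scale. The sketch below is a simulation under stated assumptions, not the setup of the original notes: the measurements are taken to be standard normal draws, the "consistent but biased" procedure is assumed to be the variance estimate with divisor $n$, and all function names and default values are hypothetical.

```python
import random


def ml_variance(values):
    """Variance estimate with divisor n: biased for finite n, but consistent."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def compare_updating_rules(num_batches=5000, batch_size=20, sigma=1.0, seed=0):
    """Return (average of per-batch estimates, estimate from the pooled data)."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, sigma) for _ in range(num_batches * batch_size)]
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    # Crude updating: estimate from each batch separately, then average.
    averaged = sum(ml_variance(b) for b in batches) / num_batches
    # Better updating: re-estimate from all observations collected so far.
    pooled = ml_variance(data)
    return averaged, pooled


if __name__ == "__main__":
    averaged, pooled = compare_updating_rules()
    print("average of batch estimates:", averaged)
    print("estimate from pooled data: ", pooled)
```

With batches of 20, each batch estimate has expected value $\frac{19}{20}\sigma^2$, so the average of the batch estimates settles near $0.95$ when $\sigma^2 = 1$, while the estimate recomputed from all observations settles near the true value 1.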
Besides having no advantages, the category of unbiasedness even has some inconvenient properties: In some cases in which consistent estimators exist, there are no unbiased estimators. And if an estimator $t$ is an unbiased estimator of the parameter $\theta$, then the estimator $g(t)$ is usually no longer an unbiased estimator of $g(\theta)$. It depends on the way a certain quantity is measured whether the estimator is unbiased or not. Consistency, however, carries over.

Unbiasedness is not the only possible criterion which ensures that the values of the estimator are centered over the value it estimates. Here is another plausible definition:

Definition 10.3.1. An estimator $\hat\theta$ of the scalar $\theta$ is called median unbiased iff, for all $\theta \in \Theta$,

$$\Pr[\hat\theta < \theta] = \Pr[\hat\theta > \theta] = \frac{1}{2}. \tag{10.3.1}$$

This concept is always applicable, even for estimators whose expected value does not exist.

Problem 168. 6 points (Not eligible for in-class exams) The purpose of the following problem is to show how restrictive the requirement of unbiasedness is. Sometimes no unbiased estimators exist, and sometimes, as in the example here, unbiasedness leads to absurd estimators. Assume the random variable $x$ has the geometric distribution with parameter $p$, where $0 \le p \le 1$. In other words, it can only assume the integer values 1, 2, 3, ..., with probabilities

$$\Pr[x = r] = (1-p)^{r-1}p. \tag{10.3.2}$$

Show that the unique unbiased estimator of $p$ on the basis of one observation of $x$ is the random variable $f(x)$ defined by $f(x) = 1$ if $x = 1$ and 0 otherwise. Hint: Use the mathematical fact that a function $\phi(q)$ that can be expressed as a power series $\phi(q) = \sum_{j=0}^\infty a_j q^j$, and which takes the values $\phi(q) = 1$ for all $q$ in some interval of nonzero length, is the power series with $a_0 = 1$ and $a_j = 0$ for $j \ne 0$. (You will need the hint at the end of your answer, don't try to start with the hint!)

Answer.
Unbiasedness means that $\mathrm{E}[f(x)] = \sum_{r=1}^\infty f(r)(1-p)^{r-1}p = p$ for all $p$ in the unit interval, therefore $\sum_{r=1}^\infty f(r)(1-p)^{r-1} = 1$. This is a power series in $q = 1 - p$, which must be identically equal to 1 for all values of $q$ between 0 and 1. An application of the hint with $q = 1 - p$, $j = r - 1$, and $a_j = f(j+1)$ shows that the constant term in this power series, corresponding to the value $r - 1 = 0$, must be 1, and all other coefficients must vanish; i.e., $f(1) = 1$ and all other $f(r) = 0$. This estimator is absurd since it always lies on the boundary of the range of possible values for $p$.

Problem 169. As in Question 61, you make two independent trials of a Bernoulli experiment with success probability $\theta$, and you observe $t$, the number of successes.

• a. Give an unbiased estimator of $\theta$ based on $t$ (i.e., which is a function of $t$).

• b. Give an unbiased estimator of $\theta^2$.

• c. Show that there is no unbiased estimator of $\theta^3$.

Hint: Since $t$ can only take the three values 0, 1, and 2, any estimator $u$ which is a function of $t$ is determined by the values it takes when $t$ is 0, 1, or 2; call them $u_0$, $u_1$, and $u_2$. Express $\mathrm{E}[u]$ as a function of $u_0$, $u_1$, and $u_2$.

Answer. $\mathrm{E}[u] = u_0(1-\theta)^2 + 2u_1\theta(1-\theta) + u_2\theta^2 = u_0 + (2u_1 - 2u_0)\theta + (u_0 - 2u_1 + u_2)\theta^2$. This is always a polynomial of degree at most two in $\theta$, therefore whatever is not such a polynomial in $\theta$ cannot be the expected value of any function of $t$.

For $\mathrm{E}[u] = \theta$ we need $u_0 = 0$, $2u_1 - 2u_0 = 2u_1 = 1$, therefore $u_1 = 0.5$, and $u_0 - 2u_1 + u_2 = -1 + u_2 = 0$, i.e., $u_2 = 1$. This is, in other words, $u = t/2$.

For $\mathrm{E}[u] = \theta^2$ we need $u_0 = 0$, $2u_1 - 2u_0 = 2u_1 = 0$, therefore $u_1 = 0$, and $u_0 - 2u_1 + u_2 = u_2 = 1$. This is, in other words, $u = t(t-1)/2$.

From this equation one also sees that $\theta^3$ and higher powers, or things like $1/\theta$, cannot be the expected values of any estimators.

• d. Compute the moment generating function of $t$.

Answer.
$$\mathrm{E}[e^{\lambda t}] = e^0\cdot(1-\theta)^2 + e^{\lambda}\cdot 2\theta(1-\theta) + e^{2\lambda}\cdot\theta^2 = \bigl(1 - \theta + \theta e^{\lambda}\bigr)^2 \tag{10.3.3}$$

Problem 170. This is [KS79, Question 17.11 on p. 34], originally [Fis, p. 700].

• a. 1 point Assume $t$ and $u$ are two unbiased estimators of the same unknown scalar nonrandom parameter $\theta$. $t$ and $u$ have finite variances and satisfy $\mathrm{var}[u - t] \ne 0$. Show that a linear combination of $t$ and $u$, i.e., an estimator of $\theta$ which can be written in the form $\alpha t + \beta u$, is unbiased if and only if $\alpha = 1 - \beta$. In other words, any unbiased estimator which is a linear combination of $t$ and $u$ can be written in the form

$$t + \beta(u - t). \tag{10.3.4}$$

• b. 2 points By solving the first order condition show that the unbiased linear combination of $t$ and $u$ which has lowest MSE is

$$\hat\theta = t - \frac{\mathrm{cov}[t, u - t]}{\mathrm{var}[u - t]}\,(u - t). \tag{10.3.5}$$

Hint: your arithmetic will be simplest if you start with (10.3.4).

• c. 1 point If $\rho^2$ is the squared correlation coefficient between $t$ and $u - t$, i.e.,

$$\rho^2 = \frac{(\mathrm{cov}[t, u - t])^2}{\mathrm{var}[t]\,\mathrm{var}[u - t]}, \tag{10.3.6}$$

show that $\mathrm{var}[\hat\theta] = \mathrm{var}[t](1 - \rho^2)$.

• d. 1 point Show that $\mathrm{cov}[t, u - t] \ne 0$ implies $\mathrm{var}[u - t] \ne 0$.

• e. 2 points Use (10.3.5) to show that if $t$ is the minimum MSE unbiased estimator of $\theta$, and $u$ another unbiased estimator of $\theta$, then

$$\mathrm{cov}[t, u - t] = 0. \tag{10.3.7}$$

• f. 1 point Use (10.3.5) to show also the opposite: if $t$ is an unbiased estimator of $\theta$ with the property that $\mathrm{cov}[t, u - t] = 0$ for every other unbiased estimator $u$ of $\theta$, then $t$ has minimum MSE among all unbiased estimators of $\theta$.

There are estimators which are consistent but whose bias does not converge to zero:

$$\hat\theta_n = \begin{cases} \theta & \text{with probability } 1 - \frac{1}{n} \\ n & \text{with probability } \frac{1}{n} \end{cases} \tag{10.3.8}$$

Then $\Pr(|\hat\theta_n - \theta| \ge \varepsilon) \le \frac{1}{n}$, i.e., the estimator is consistent, but $\mathrm{E}[\hat\theta_n] = \theta\frac{n-1}{n} + 1 \to \theta + 1$, so that the bias converges to $1 \ne 0$.

Problem 171. 4 points Is it possible to have a consistent estimator whose bias becomes unbounded as the sample size increases? Either prove that it is not possible or give an example.

Answer.
Yes, this can be achieved by making the rare outliers even wilder than in (10.3.8), say

$$\hat\theta_n = \begin{cases} \theta & \text{with probability } 1 - \frac{1}{n} \\ n^2 & \text{with probability } \frac{1}{n} \end{cases} \tag{10.3.9}$$

Here $\Pr(|\hat\theta_n - \theta| \ge \varepsilon) \le \frac{1}{n}$, i.e., the estimator is consistent, but $\mathrm{E}[\hat\theta_n] = \theta\frac{n-1}{n} + n$, so that the bias $n - \theta/n$ grows without bound.

And of course there are estimators which are unbiased but not consistent: simply take the first observation $x_1$ as an estimator of $\mathrm{E}[x]$ and ignore all the other observations.

10.4. The Cramer-Rao Lower Bound

Take a scalar random variable $y$ with density function $f_y$. The entropy of $y$, if it exists, is $H[y] = -\mathrm{E}[\log(f_y(y))]$. This is the continuous equivalent of (3.11.2). The entropy is the measure of the amount of randomness in this variable. If there is little information and much noise in this variable, the entropy is high.

Now let $y \mapsto g(y)$ be the density function of a different random variable $x$. In other words, $g$ is some function which satisfies $g(y) \ge 0$ for all $y$, and $\int_{-\infty}^{+\infty} g(y)\,dy = 1$. Equation (3.11.10) with $v = g(y)$ and $w = f_y(y)$ gives

$$f_y(y) - f_y(y)\log f_y(y) \le g(y) - f_y(y)\log g(y). \tag{10.4.1}$$

This holds for every value $y$, and integrating over $y$ gives $1 - \mathrm{E}[\log f_y(y)] \le 1 - \mathrm{E}[\log g(y)]$ or

$$\mathrm{E}[\log f_y(y)] \ge \mathrm{E}[\log g(y)]. \tag{10.4.2}$$

This is an important extremal value property which distinguishes the density function $f_y(y)$ of $y$ from all other density functions: That density function $g$ which maximizes $\mathrm{E}[\log g(y)]$ is $g = f_y$, the true density function of $y$.

This optimality property lies at the basis of the Cramer-Rao inequality, and it is also the reason why maximum likelihood estimation is so good. The difference between the left and right hand side in (10.4.2) is called the Kullback-Leibler discrepancy between the random variables $y$ and $x$ (where $x$ is a random variable whose density is $g$).
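Inequality (10.4.2) can be checked numerically for a concrete pair of densities. The sketch below rests on assumptions that are not in the notes: the true density $f_y$ is taken to be standard normal, the competing density $g$ is normal with a different mean or standard deviation, the expectations are approximated by the trapezoidal rule, and all function names are hypothetical.

```python
import math


def normal_log_pdf(y, mean, sd):
    """Logarithm of the N(mean, sd^2) density at y."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (y - mean) ** 2 / (2.0 * sd * sd)


def expected_log_density(g_mean, g_sd, lo=-12.0, hi=12.0, steps=24000):
    """E[log g(y)] for y ~ N(0, 1), with g the N(g_mean, g_sd^2) density.

    Uses the trapezoidal rule on [lo, hi]; the N(0, 1) density is
    negligible outside that interval.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        true_density = math.exp(normal_log_pdf(y, 0.0, 1.0))
        total += weight * true_density * normal_log_pdf(y, g_mean, g_sd)
    return total * h


def kl_discrepancy(g_mean, g_sd):
    """Left side minus right side of (10.4.2), with f_y the N(0, 1) density."""
    return expected_log_density(0.0, 1.0) - expected_log_density(g_mean, g_sd)


if __name__ == "__main__":
    for g_mean, g_sd in ((0.0, 1.0), (1.0, 1.0), (0.0, 2.0), (0.0, 0.5)):
        print(g_mean, g_sd, kl_discrepancy(g_mean, g_sd))
```

The discrepancy is zero (up to numerical error) only when $g$ coincides with the true density and positive for the other choices, in line with the extremal property stated above.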
The Cramer-Rao inequality gives a lower bound for the MSE of an unbiased estimator of the parameter of a probability distribution (which has to satisfy certain regularity conditions). This allows one to determine whether a given unbiased estimator has a MSE as low as any other unbiased estimator (i.e., whether it is "efficient.")

Problem 172. Assume the density function of $y$ depends on a parameter $\theta$, write it $f_y(y; \theta)$, and $\theta^\circ$ is the true value of $\theta$. In this problem we will compare the expected value of $y$ and of functions of $y$ with what would be their expected value if the true parameter value were not $\theta^\circ$ but would take some other value $\theta$. If the random variable $t$ is a function of $y$, we write $\mathrm{E}_\theta[t]$ for what would be the expected value of $t$ if the true value of the parameter were $\theta$ instead of $\theta^\circ$. Occasionally, we will use the subscript $\circ$ as in $\mathrm{E}_\circ$ to indicate that we are dealing here with the usual case in which the expected value is taken with respect to the true parameter value $\theta^\circ$. Instead of $\mathrm{E}_\circ$ one usually simply writes $\mathrm{E}$, since it is usually self-understood that one has to plug the right parameter values into the density function if one takes expected values. The subscript $\circ$ is necessary here only because in the present problem, we sometimes take expected values with respect to the "wrong" parameter values. The same notational convention also applies to variances, covariances, and the MSE.

Throughout this problem we assume that the following regularity conditions hold: (a) the range of $y$ is independent of $\theta$, and (b) the derivative of the density function with respect to $\theta$ is a continuously differentiable function of $\theta$. These regularity conditions ensure that one can differentiate under the integral sign, i.e., for every function $t(y)$,

$$\int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f_y(y; \theta)\,t(y)\,dy = \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f_y(y; \theta)\,t(y)\,dy = \frac{\partial}{\partial\theta}\,\mathrm{E}_\theta[t(y)] \tag{10.4.3}$$

$$\int_{-\infty}^{\infty} \frac{\partial^2}{\partial\theta^2} f_y(y; \theta)\,t(y)\,dy = \frac{\partial^2}{\partial\theta^2}\int_{-\infty}^{\infty} f_y(y; \theta)\,t(y)\,dy = \frac{\partial^2}{\partial\theta^2}\,\mathrm{E}_\theta[t(y)]. \tag{10.4.4}$$

• a.
1 point The score is defined as the random variable

$$q(y; \theta) = \frac{\partial}{\partial\theta}\log f_y(y; \theta). \tag{10.4.5}$$

In other words, we do three things to the density function: take its logarithm, then take the derivative of this logarithm with respect to the parameter, and then plug the random variable into it. This gives us a random variable which also depends on the nonrandom parameter $\theta$. Show that the score can also be written as

$$q(y; \theta) = \frac{1}{f_y(y; \theta)}\,\frac{\partial f_y(y; \theta)}{\partial\theta}. \tag{10.4.6}$$

Answer. This is the chain rule for differentiation: for any differentiable function $g(\theta)$, $\frac{\partial}{\partial\theta}\log g(\theta) = \frac{1}{g(\theta)}\frac{\partial g(\theta)}{\partial\theta}$.

• b. 1 point If the density function is member of an exponential dispersion family (??), show that the score function has the form

$$q(y; \theta) = \frac{y - \frac{\partial b(\theta)}{\partial\theta}}{a(\psi)}. \tag{10.4.7}$$

Answer. This is a simple substitution: if

$$f_y(y; \theta, \psi) = \exp\Bigl(\frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi)\Bigr), \tag{10.4.8}$$

then

$$\frac{\partial\log f_y(y; \theta, \psi)}{\partial\theta} = \frac{y - \frac{\partial b(\theta)}{\partial\theta}}{a(\psi)}. \tag{10.4.9}$$

• c. 3 points If $f_y(y; \theta^\circ)$ is the true density function of $y$, then we know from (10.4.2) that $\mathrm{E}_\circ[\log f_y(y; \theta^\circ)] \ge \mathrm{E}_\circ[\log f_y(y; \theta)]$ for all $\theta$. This explains why the score is so important: it is the derivative of that function whose expected value is maximized if the true parameter is plugged into the density function. The first-order conditions in this situation read: the expected value of this derivative must be zero for the true parameter value. This is the next thing you are asked to show: If $\theta^\circ$ is the true parameter value, show that $\mathrm{E}_\circ[q(y; \theta^\circ)] = 0$.

Answer. First write for general $\theta$

$$\mathrm{E}_\circ[q(y; \theta)] = \int_{-\infty}^{\infty} q(y; \theta)\,f_y(y; \theta^\circ)\,dy = \int_{-\infty}^{\infty} \frac{1}{f_y(y; \theta)}\,\frac{\partial f_y(y; \theta)}{\partial\theta}\,f_y(y; \theta^\circ)\,dy. \tag{10.4.10}$$

For $\theta = \theta^\circ$ this simplifies:

$$\mathrm{E}_\circ[q(y; \theta^\circ)] = \int_{-\infty}^{\infty} \frac{\partial f_y(y; \theta)}{\partial\theta}\Big|_{\theta=\theta^\circ}\,dy = \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f_y(y; \theta)\,dy\,\Big|_{\theta=\theta^\circ} = \frac{\partial}{\partial\theta}\,1 = 0. \tag{10.4.11}$$

Here I am writing $\frac{\partial f_y(y;\theta)}{\partial\theta}\big|_{\theta=\theta^\circ}$ instead of the simpler notation $\frac{\partial f_y(y;\theta^\circ)}{\partial\theta^\circ}$, in order to emphasize that one first has to take a derivative with respect to $\theta$ and then one plugs $\theta^\circ$ into that derivative.

• d. Show that, in the case of the exponential dispersion family,

$$\mathrm{E}_\circ[y] = \frac{\partial b(\theta)}{\partial\theta}\Big|_{\theta=\theta^\circ}. \tag{10.4.12}$$

Answer. Follows from the fact that the score function of the exponential family (10.4.7) has zero expected value.

• e. 5 points If we differentiate the score, we obtain the Hessian

$$h(\theta) = \frac{\partial^2}{\partial\theta^2}\log f_y(y; \theta). \tag{10.4.13}$$

From now on we will write the score function as $q(\theta)$ instead of $q(y; \theta)$; i.e., we will no longer make it explicit that $q$ is a function of $y$ but write it as a random variable which depends on the parameter $\theta$. We also suppress the dependence of $h$ on $y$; our notation $h(\theta)$ is short for $h(y; \theta)$. Since there is only one parameter in the density function, score and Hessian are scalars; but in the general case, the score is a vector and the Hessian a matrix. Show that, for the true parameter value $\theta^\circ$, the negative of the expected value of the Hessian equals the variance of the score, i.e., the expected value of the square of the score:

$$\mathrm{E}_\circ[h(\theta^\circ)] = -\mathrm{E}_\circ[q^2(\theta^\circ)]. \tag{10.4.14}$$

Answer. Start with the definition of the score

$$q(y; \theta) = \frac{\partial}{\partial\theta}\log f_y(y; \theta) = \frac{1}{f_y(y; \theta)}\,\frac{\partial}{\partial\theta} f_y(y; \theta), \tag{10.4.15}$$

and differentiate the rightmost expression one more time:

$$h(y; \theta) = \frac{\partial}{\partial\theta}\,q(y; \theta) = -\frac{1}{f_y^2(y; \theta)}\Bigl(\frac{\partial}{\partial\theta} f_y(y; \theta)\Bigr)^2 + \frac{1}{f_y(y; \theta)}\,\frac{\partial^2}{\partial\theta^2} f_y(y; \theta) \tag{10.4.16}$$

$$= -q^2(y; \theta) + \frac{1}{f_y(y; \theta)}\,\frac{\partial^2}{\partial\theta^2} f_y(y; \theta). \tag{10.4.17}$$

Taking expectations we get

$$\mathrm{E}_\circ[h(y; \theta)] = -\mathrm{E}_\circ[q^2(y; \theta)] + \int_{-\infty}^{+\infty} \frac{1}{f_y(y; \theta)}\Bigl(\frac{\partial^2}{\partial\theta^2} f_y(y; \theta)\Bigr) f_y(y; \theta^\circ)\,dy. \tag{10.4.18}$$

Again, for $\theta = \theta^\circ$, we can simplify the integrand and differentiate under the integral sign:

$$\int_{-\infty}^{+\infty} \frac{\partial^2}{\partial\theta^2} f_y(y; \theta)\,dy = \frac{\partial^2}{\partial\theta^2}\int_{-\infty}^{+\infty} f_y(y; \theta)\,dy = \frac{\partial^2}{\partial\theta^2}\,1 = 0. \tag{10.4.19}$$

• f. Derive from (10.4.14) that, for the exponential dispersion family (??),

$$\mathrm{var}_\circ[y] = \frac{\partial^2 b(\theta)}{\partial\theta^2}\Big|_{\theta=\theta^\circ}\,a(\psi). \tag{10.4.20}$$

Answer.
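Differentiation of (10.4.7) gives $h(\theta) = -\frac{\partial^2 b(\theta)}{\partial\theta^2}\,\frac{1}{a(\psi)}$. This is constant and therefore equal to its own expected value. (10.4.14) says therefore

$$\frac{\partial^2 b(\theta)}{\partial\theta^2}\Big|_{\theta=\theta^\circ}\,\frac{1}{a(\psi)} = \mathrm{E}_\circ[q^2(\theta^\circ)] = \frac{1}{a(\psi)^2}\,\mathrm{var}_\circ[y], \tag{10.4.21}$$

from which (10.4.20) follows.

Problem 173.

• a. Use the results from question 172 to derive the following strange and interesting result: for any random variable $t$ which is a function of $y$, i.e., $t = t(y)$, follows $\mathrm{cov}_\circ[q(\theta^\circ), t] = \frac{\partial}{\partial\theta}\mathrm{E}_\theta[t]\big|_{\theta=\theta^\circ}$.

Answer. The following equation holds for all $\theta$:

$$\mathrm{E}_\circ[q(\theta)t] = \int_{-\infty}^{\infty} \frac{1}{f_y(y; \theta)}\,\frac{\partial f_y(y; \theta)}{\partial\theta}\,t(y)\,f_y(y; \theta^\circ)\,dy. \tag{10.4.22}$$

If the $\theta$ in $q(\theta)$ is the right parameter value $\theta^\circ$ one can simplify:

$$\mathrm{E}_\circ[q(\theta^\circ)t] = \int_{-\infty}^{\infty} \frac{\partial f_y(y; \theta)}{\partial\theta}\Big|_{\theta=\theta^\circ}\,t(y)\,dy \tag{10.4.23}$$

$$= \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f_y(y; \theta)\,t(y)\,dy\,\Big|_{\theta=\theta^\circ} \tag{10.4.24}$$

$$= \frac{\partial}{\partial\theta}\,\mathrm{E}_\theta[t]\Big|_{\theta=\theta^\circ}. \tag{10.4.25}$$

This is at the same time the covariance: $\mathrm{cov}_\circ[q(\theta^\circ), t] = \mathrm{E}_\circ[q(\theta^\circ)t] - \mathrm{E}_\circ[q(\theta^\circ)]\,\mathrm{E}_\circ[t] = \mathrm{E}_\circ[q(\theta^\circ)t]$, since $\mathrm{E}_\circ[q(\theta^\circ)] = 0$.

Explanation, nothing to prove here: Now if $t$ is an unbiased estimator of $\theta$, whatever the value of $\theta$, then it follows $\mathrm{cov}_\circ[q(\theta^\circ), t] = \frac{\partial}{\partial\theta}\theta = 1$. From this follows by Cauchy-Schwarz $\mathrm{var}_\circ[t]\,\mathrm{var}_\circ[q(\theta^\circ)] \ge 1$, or $\mathrm{var}_\circ[t] \ge 1/\mathrm{var}_\circ[q(\theta^\circ)]$. Since $\mathrm{E}_\circ[q(\theta^\circ)] = 0$, we know $\mathrm{var}_\circ[q(\theta^\circ)] = \mathrm{E}_\circ[q^2(\theta^\circ)]$, and since $t$ is unbiased, we know $\mathrm{var}_\circ[t] = \mathrm{MSE}_\circ[t; \theta^\circ]$. Therefore the Cauchy-Schwarz inequality reads

$$\mathrm{MSE}_\circ[t; \theta^\circ] \ge 1/\mathrm{E}_\circ[q^2(\theta^\circ)]. \tag{10.4.26}$$

This is the Cramer-Rao inequality. The variance of the score, $\mathrm{var}_\circ[q(\theta^\circ)] = \mathrm{E}_\circ[q^2(\theta^\circ)]$, is called the Fisher information, written $I(\theta^\circ)$. Its inverse is a lower bound for the MSE of any unbiased estimator of $\theta$. Because of (10.4.14), the Cramer-Rao inequality can also be written in the form

$$\mathrm{MSE}_\circ[t; \theta^\circ] \ge -1/\mathrm{E}_\circ[h(\theta^\circ)]. \tag{10.4.27}$$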
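The identities behind this bound, $\mathrm{E}_\circ[q(\theta^\circ)] = 0$ and (10.4.14), can be verified numerically for a concrete density. The sketch below makes assumptions beyond the text: it uses the exponential density $\lambda\exp(-\lambda y)$ from Problem 165 as the example, with the rate $\lambda$ playing the role of $\theta$; it approximates the score and Hessian by finite differences and the expectations by the trapezoidal rule; and all function names and step sizes are hypothetical.

```python
import math


def log_density(y, lam):
    """Log of the exponential density lam * exp(-lam * y), for y >= 0."""
    return math.log(lam) - lam * y


def score(y, lam, eps=1e-4):
    """Central-difference approximation to d/d(lam) of log_density."""
    return (log_density(y, lam + eps) - log_density(y, lam - eps)) / (2.0 * eps)


def hessian(y, lam, eps=1e-4):
    """Second-difference approximation to d^2/d(lam)^2 of log_density."""
    return (log_density(y, lam + eps) - 2.0 * log_density(y, lam)
            + log_density(y, lam - eps)) / (eps * eps)


def expectation(func, lam, steps=40000):
    """E[func(y)] under the exponential density, by the trapezoidal rule."""
    upper = 40.0 / lam  # the density is negligible beyond this point
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        y = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * func(y) * math.exp(log_density(y, lam))
    return total * h


if __name__ == "__main__":
    lam = 2.0
    mean_score = expectation(lambda y: score(y, lam), lam)
    score_variance = expectation(lambda y: score(y, lam) ** 2, lam)
    minus_mean_hessian = -expectation(lambda y: hessian(y, lam), lam)
    print("E[q]   :", mean_score)
    print("E[q^2] :", score_variance)
    print("-E[h]  :", minus_mean_hessian)
    print("bound  :", 1.0 / score_variance)
```

For $\lambda = 2$ the expected score is numerically zero and both $\mathrm{E}[q^2]$ and $-\mathrm{E}[h]$ come out close to $1/\lambda^2 = 0.25$, so the bound for an unbiased estimator of $\lambda$ based on one observation is about $\lambda^2 = 4$.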
(10.4.26) and (10.4.27) are usually written in the following form: Assume $y$ has density function $f_y(y;\theta)$ which depends on the unknown parameter $\theta$, and let $t(y)$ be any unbiased estimator of $\theta$. Then
\[ \operatorname{var}[t] \ge \frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\log f_y(y;\theta)\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2}{\partial\theta^2}\log f_y(y;\theta)\right]}. \tag{10.4.28} \]
(Sometimes the first and sometimes the second expression is easier to evaluate.)

If one has a whole vector of observations $\mathbf{y}$ then the Cramer-Rao inequality involves the joint density function:
\[ \operatorname{var}[t] \ge \frac{1}{E\left[\left(\frac{\partial}{\partial\theta}\log f_{\mathbf{y}}(\mathbf{y};\theta)\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2}{\partial\theta^2}\log f_{\mathbf{y}}(\mathbf{y};\theta)\right]}. \tag{10.4.29} \]
This inequality also holds if $y$ is discrete and one uses its probability mass function instead of the density function. In small samples, this lower bound is not always attainable; in some cases there is no unbiased estimator with a variance as low as the Cramer-Rao lower bound.

Problem 174. 4 points Assume $n$ independent observations of a variable $y \sim N(\mu,\sigma^2)$ are available, where $\sigma^2$ is known. Show that the sample mean $\bar y$ attains the Cramer-Rao lower bound for $\mu$.

Answer. The density function of each $y_i$ is
\[ f_{y_i}(y) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \tag{10.4.30} \]
therefore the log likelihood function of the whole vector is
\begin{align}
\ell(\mathbf{y};\mu) = \sum_{i=1}^n \log f_{y_i}(y_i) &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2 \tag{10.4.31}\\
\frac{\partial}{\partial\mu}\ell(\mathbf{y};\mu) &= \frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu) \tag{10.4.32}
\end{align}
In order to apply (10.4.29) you can either square this and take the expected value (the cross terms vanish because the observations are independent)
\[ E\left[\left(\frac{\partial}{\partial\mu}\ell(\mathbf{y};\mu)\right)^2\right] = \frac{1}{\sigma^4}\sum_{i=1}^n E[(y_i-\mu)^2] = n/\sigma^2 \tag{10.4.33} \]
or alternatively take one more derivative of (10.4.32) to get
\[ \frac{\partial^2}{\partial\mu^2}\ell(\mathbf{y};\mu) = -\frac{n}{\sigma^2} \tag{10.4.34} \]
This is constant, therefore equal to its expected value. Therefore the Cramer-Rao lower bound says that $\operatorname{var}[\bar y] \ge \sigma^2/n$. This holds with equality.

Problem 175. Assume $y_i \sim \mathrm{NID}(0,\sigma^2)$ (i.e., normally independently distributed) with unknown $\sigma^2$. The obvious estimate of $\sigma^2$ is $s^2 = \frac{1}{n}\sum y_i^2$.
• a.
2 points Show that $s^2$ is an unbiased estimator of $\sigma^2$, is distributed $\sim \frac{\sigma^2}{n}\chi^2_n$, and has variance $2\sigma^4/n$. You are allowed to use the fact that a $\chi^2_n$ has variance $2n$, which is equation (4.9.5).

Answer.
\begin{align}
E[y_i^2] &= \operatorname{var}[y_i] + (E[y_i])^2 = \sigma^2 + 0 = \sigma^2 \tag{10.4.35}\\
z_i &= \frac{y_i}{\sigma} \sim \mathrm{NID}(0,1) \tag{10.4.36}\\
y_i &= \sigma z_i \tag{10.4.37}\\
y_i^2 &= \sigma^2 z_i^2 \tag{10.4.38}\\
\sum_{i=1}^n y_i^2 &= \sigma^2\sum_{i=1}^n z_i^2 \sim \sigma^2\chi^2_n \tag{10.4.39}\\
\frac{1}{n}\sum_{i=1}^n y_i^2 &= \frac{\sigma^2}{n}\sum_{i=1}^n z_i^2 \sim \frac{\sigma^2}{n}\chi^2_n \tag{10.4.40}\\
\operatorname{var}\left[\frac{1}{n}\sum_{i=1}^n y_i^2\right] &= \frac{\sigma^4}{n^2}\operatorname{var}[\chi^2_n] = \frac{\sigma^4}{n^2}\,2n = \frac{2\sigma^4}{n} \tag{10.4.41}
\end{align}

• b. 4 points Show that this variance is at the same time the Cramer-Rao lower bound.

Answer.
\begin{align}
\ell(y,\sigma^2) = \log f_y(y;\sigma^2) &= -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^2 - \frac{y^2}{2\sigma^2} \tag{10.4.42}\\
\frac{\partial\log f_y}{\partial\sigma^2}(y;\sigma^2) &= -\frac{1}{2\sigma^2} + \frac{y^2}{2\sigma^4} = \frac{y^2-\sigma^2}{2\sigma^4} \tag{10.4.43}
\end{align}
Since $\frac{y^2-\sigma^2}{2\sigma^4}$ has zero mean, it follows that
\[ E\left[\left(\frac{\partial\log f_y}{\partial\sigma^2}(y;\sigma^2)\right)^2\right] = \frac{\operatorname{var}[y^2]}{4\sigma^8} = \frac{1}{2\sigma^4}. \tag{10.4.44} \]
Alternatively, one can differentiate one more time:
\begin{align}
\frac{\partial^2\log f_y}{(\partial\sigma^2)^2}(y;\sigma^2) &= -\frac{y^2}{\sigma^6} + \frac{1}{2\sigma^4} \tag{10.4.45}\\
E\left[\frac{\partial^2\log f_y}{(\partial\sigma^2)^2}(y;\sigma^2)\right] &= -\frac{\sigma^2}{\sigma^6} + \frac{1}{2\sigma^4} \tag{10.4.46}\\
&= -\frac{1}{2\sigma^4} \tag{10.4.47}
\end{align}
With $n$ independent observations, this makes the Cramer-Rao lower bound $2\sigma^4/n$.

Problem 176. 4 points Assume $x_1,\ldots,x_n$ is a random sample of independent observations of a Poisson distribution with parameter $\lambda$, i.e., each of the $x_i$ has probability mass function
\[ p_{x_i}(x) = \Pr[x_i = x] = \frac{\lambda^x}{x!}e^{-\lambda}, \qquad x = 0, 1, 2, \ldots. \tag{10.4.48} \]
A Poisson variable with parameter $\lambda$ has expected value $\lambda$ and variance $\lambda$. (You are not required to prove this here.) Is there an unbiased estimator of $\lambda$ with lower variance than the sample mean $\bar x$?

Here is a formulation of the Cramer-Rao inequality for probability mass functions, as you need it for Question 176. Assume $y_1,\ldots,y_n$ are $n$ independent observations of a random variable $y$ whose probability mass function depends on the unknown parameter $\theta$ and satisfies certain regularity conditions.
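Write the univariate probability mass function of each of the $y_i$ as $p_y(y;\theta)$ and let $t$ be any unbiased estimator of $\theta$. Then
\[ \operatorname{var}[t] \ge \frac{1}{n\,E\left[\left(\frac{\partial}{\partial\theta}\ln p_y(y;\theta)\right)^2\right]} = \frac{-1}{n\,E\left[\frac{\partial^2}{\partial\theta^2}\ln p_y(y;\theta)\right]}. \tag{10.4.49} \]

Answer. The Cramer-Rao lower bound says no.
\begin{align}
\log p_x(x;\lambda) &= x\log\lambda - \log x! - \lambda \tag{10.4.50}\\
\frac{\partial\log p_x}{\partial\lambda}(x;\lambda) &= \frac{x}{\lambda} - 1 = \frac{x-\lambda}{\lambda} \tag{10.4.51}\\
E\left[\left(\frac{\partial\log p_x}{\partial\lambda}(x;\lambda)\right)^2\right] &= E\left[\frac{(x-\lambda)^2}{\lambda^2}\right] = \frac{\operatorname{var}[x]}{\lambda^2} = \frac{1}{\lambda}. \tag{10.4.52}
\end{align}
Or alternatively, after (10.4.51) do
\begin{align}
\frac{\partial^2\log p_x}{\partial\lambda^2}(x;\lambda) &= -\frac{x}{\lambda^2} \tag{10.4.53}\\
-E\left[\frac{\partial^2\log p_x}{\partial\lambda^2}(x;\lambda)\right] &= \frac{E[x]}{\lambda^2} = \frac{1}{\lambda}. \tag{10.4.54}
\end{align}
Therefore the Cramer-Rao lower bound is $\lambda/n$, which is the variance of the sample mean.

If the density function depends on more than one unknown parameter, i.e., if it has the form $f_y(y;\theta_1,\ldots,\theta_k)$, the Cramer-Rao inequality involves the following steps: (1) define $\ell(y;\theta_1,\ldots,\theta_k) = \log f_y(y;\theta_1,\ldots,\theta_k)$, (2) form the following matrix which is called the information matrix:
\[
I = \begin{bmatrix}
-n\,E\left[\frac{\partial^2\ell}{\partial\theta_1^2}\right] & \cdots & -n\,E\left[\frac{\partial^2\ell}{\partial\theta_1\partial\theta_k}\right]\\
\vdots & \ddots & \vdots\\
-n\,E\left[\frac{\partial^2\ell}{\partial\theta_k\partial\theta_1}\right] & \cdots & -n\,E\left[\frac{\partial^2\ell}{\partial\theta_k^2}\right]
\end{bmatrix}
=
\begin{bmatrix}
n\,E\left[\left(\frac{\partial\ell}{\partial\theta_1}\right)^2\right] & \cdots & n\,E\left[\frac{\partial\ell}{\partial\theta_1}\frac{\partial\ell}{\partial\theta_k}\right]\\
\vdots & \ddots & \vdots\\
n\,E\left[\frac{\partial\ell}{\partial\theta_k}\frac{\partial\ell}{\partial\theta_1}\right] & \cdots & n\,E\left[\left(\frac{\partial\ell}{\partial\theta_k}\right)^2\right]
\end{bmatrix},
\tag{10.4.55}
\]
and (3) form the matrix inverse $I^{-1}$. If the vector random variable $t = (t_1,\ldots,t_k)^\top$ is an unbiased estimator of the parameter vector $\theta = (\theta_1,\ldots,\theta_k)^\top$, then the inverse of the information matrix $I^{-1}$ is a lower bound for the covariance matrix $V[t]$ in the following sense: the difference matrix $V[t] - I^{-1}$ is always nonnegative definite. From this follows in particular: if $i^{ii}$ is the $i$th diagonal element of $I^{-1}$, then $\operatorname{var}[t_i] \ge i^{ii}$.

10.5. Best Linear Unbiased Without Distribution Assumptions

If the $x_i$ are Normal with unknown expected value and variance, their sample mean has lowest MSE among all unbiased estimators of $\mu$. If one does not assume Normality, then the sample mean has lowest MSE in the class of all linear unbiased estimators of $\mu$.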
This is true not only for the sample mean but also for all least squares estimates. This result needs remarkably weak assumptions: nothing is assumed about the distribution of the $x_i$ other than the existence of mean and variance. Problem 177 shows that in some situations one can even dispense with the independence of the observations.

Problem 177. 5 points [Lar82, example 5.4.1 on p 266] Let $y_1$ and $y_2$ be two random variables with same mean $\mu$ and variance $\sigma^2$, but we do not assume that they are uncorrelated; their correlation coefficient is $\rho$, which can take any value $|\rho| \le 1$. Show that $\bar y = (y_1+y_2)/2$ has lowest mean squared error among all linear unbiased estimators of $\mu$, and compute its MSE. (An estimator $\tilde\mu$ of $\mu$ is linear iff it can be written in the form $\tilde\mu = \alpha_1 y_1 + \alpha_2 y_2$ with some constant numbers $\alpha_1$ and $\alpha_2$.)

Answer.
\begin{align}
\tilde y &= \alpha_1 y_1 + \alpha_2 y_2 \tag{10.5.1}\\
\operatorname{var}[\tilde y] &= \alpha_1^2\operatorname{var}[y_1] + \alpha_2^2\operatorname{var}[y_2] + 2\alpha_1\alpha_2\operatorname{cov}[y_1,y_2] \tag{10.5.2}\\
&= \sigma^2(\alpha_1^2 + \alpha_2^2 + 2\alpha_1\alpha_2\rho). \tag{10.5.3}
\end{align}
Here we used (6.1.14). Unbiasedness means $\alpha_2 = 1-\alpha_1$, therefore we call $\alpha_1 = \alpha$ and $\alpha_2 = 1-\alpha$:
\[ \operatorname{var}[\tilde y]/\sigma^2 = \alpha^2 + (1-\alpha)^2 + 2\alpha(1-\alpha)\rho \tag{10.5.4} \]
Now sort by the powers of $\alpha$:
\begin{align}
&= 2\alpha^2(1-\rho) - 2\alpha(1-\rho) + 1 \tag{10.5.5}\\
&= 2(\alpha^2-\alpha)(1-\rho) + 1. \tag{10.5.6}
\end{align}
This takes its minimum value where the derivative $\frac{\partial}{\partial\alpha}(\alpha^2-\alpha) = 2\alpha - 1 = 0$. For the MSE plug $\alpha_1 = \alpha_2 = 1/2$ into (10.5.3) to get $\frac{\sigma^2}{2}(1+\rho)$.

Problem 178. You have two unbiased measurements with errors of the same quantity $\mu$ (which may or may not be random). The first measurement $y_1$ has mean squared error $E[(y_1-\mu)^2] = \sigma^2$, the other measurement $y_2$ has $E[(y_2-\mu)^2] = \tau^2$. The measurement errors $y_1-\mu$ and $y_2-\mu$ have zero expected values (i.e., the measurements are unbiased) and are independent of each other.
• a.
2 points Show that the linear unbiased estimators of $\mu$ based on these two measurements are simply the weighted averages of these measurements, i.e., they can be written in the form $\tilde\mu = \alpha y_1 + (1-\alpha)y_2$, and that the MSE of such an estimator is $\alpha^2\sigma^2 + (1-\alpha)^2\tau^2$. Note: we are using the word "estimator" here even if $\mu$ is random. An estimator or predictor $\tilde\mu$ is unbiased if $E[\tilde\mu - \mu] = 0$. Since we allow $\mu$ to be random, the proof in the class notes has to be modified.

Answer. The estimator $\tilde\mu$ is linear (more precisely: affine) if it can be written in the form
\[ \tilde\mu = \alpha_1 y_1 + \alpha_2 y_2 + \gamma \tag{10.5.7} \]
The measurements themselves are unbiased, i.e., $E[y_i - \mu] = 0$, therefore
\[ E[\tilde\mu - \mu] = (\alpha_1 + \alpha_2 - 1)\,E[\mu] + \gamma = 0 \tag{10.5.8} \]
for all possible values of $E[\mu]$; therefore $\gamma = 0$ and $\alpha_2 = 1-\alpha_1$. To simplify notation, we will call from now on $\alpha_1 = \alpha$, $\alpha_2 = 1-\alpha$. Due to unbiasedness, the MSE is the variance of the estimation error. Since $\tilde\mu - \mu = \alpha(y_1-\mu) + (1-\alpha)(y_2-\mu)$ and the two measurement errors are independent,
\[ \operatorname{var}[\tilde\mu - \mu] = \alpha^2\sigma^2 + (1-\alpha)^2\tau^2 \tag{10.5.9} \]

• b. 4 points Define $\omega^2$ by
\[ \frac{1}{\omega^2} = \frac{1}{\sigma^2} + \frac{1}{\tau^2} \quad\text{which can be solved to give}\quad \omega^2 = \frac{\sigma^2\tau^2}{\sigma^2+\tau^2}. \tag{10.5.10} \]
Show that the Best (i.e., minimum MSE) linear unbiased estimator (BLUE) of $\mu$ based on these two measurements is
\[ \hat y = \frac{\omega^2}{\sigma^2}y_1 + \frac{\omega^2}{\tau^2}y_2 \tag{10.5.11} \]
i.e., it is the weighted average of $y_1$ and $y_2$ where the weights are proportional to the inverses of the variances.

Answer. The variance (10.5.9) takes its minimum value where its derivative with respect to $\alpha$ is zero, i.e., where
\begin{align}
\frac{\partial}{\partial\alpha}\left(\alpha^2\sigma^2 + (1-\alpha)^2\tau^2\right) = 2\alpha\sigma^2 - 2(1-\alpha)\tau^2 &= 0 \tag{10.5.12}\\
\alpha\sigma^2 &= \tau^2 - \alpha\tau^2 \tag{10.5.13}\\
\alpha &= \frac{\tau^2}{\sigma^2+\tau^2} \tag{10.5.14}
\end{align}
In terms of $\omega$ one can write
\[ \alpha = \frac{\tau^2}{\sigma^2+\tau^2} = \frac{\omega^2}{\sigma^2} \quad\text{and}\quad 1-\alpha = \frac{\sigma^2}{\sigma^2+\tau^2} = \frac{\omega^2}{\tau^2}. \tag{10.5.15} \]

• c. 2 points Show: the MSE of the BLUE, $\omega^2$, satisfies the following equation:
\[ \frac{1}{\omega^2} = \frac{1}{\sigma^2} + \frac{1}{\tau^2} \tag{10.5.16} \]

Answer.
We already have introduced the notation $\omega^2$ for the quantity defined by (10.5.16); therefore all we have to show is that the MSE or, equivalently, the variance of the estimation error is equal to this $\omega^2$:
\[ \operatorname{var}[\tilde\mu - \mu] = \left(\frac{\omega^2}{\sigma^2}\right)^2\sigma^2 + \left(\frac{\omega^2}{\tau^2}\right)^2\tau^2 = \omega^4\left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right) = \omega^4\frac{1}{\omega^2} = \omega^2 \tag{10.5.17} \]

Examples of other classes of estimators for which a best estimator exists are: if one requires the estimator to be translation invariant, then the least squares estimators are best in the class of all translation invariant estimators. But there is no best linear estimator in the linear model. (Theil)

10.6. Maximum Likelihood Estimation

This is an excellent and very widely applicable estimation principle. Its main drawback is its computational complexity, but with modern computing power it becomes more and more manageable. Another drawback is that it requires a full specification of the distribution.

Problem 179. 2 points What are the two greatest disadvantages of Maximum Likelihood Estimation?

Answer. Its high information requirements (the functional form of the density function must be known), and computational complexity.

In our discussion of entropy in Section 3.11 we derived an extremal value property which distinguishes the actual density function $f_y(y)$ of a given random variable $y$ from all other possible density functions of $y$, i.e., from all other functions $g \ge 0$ with $\int_{-\infty}^{+\infty} g(y)\,dy = 1$. The true density function of $y$ is the one which maximizes $E[\log g(y)]$. We showed that this principle can be used to design a payoff scheme by which it is in the best interest of a forecaster to tell the truth. Now we will see that this principle can also be used to design a good estimator. Say you have $n$ independent observations of $y$. You know the density of $y$ belongs to a given family $F$ of density functions, but you don't know which member of $F$ it is. Then form the arithmetic mean of $\log f(y_i)$ for all $f \in F$. It converges towards $E[\log f(y)]$.
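The averaging of log densities described above can be tried out numerically. The following is a minimal sketch under stated assumptions, not part of the original notes: the family $F$ is taken to be the densities $N(\mu, 1)$ with $\mu$ on a coarse grid, the true density is $N(1, 1)$, and the names `mean_log_density` and `best_candidate`, the seed and the sample size are hypothetical choices.

```python
import math
import random


def mean_log_density(ys, mu):
    """Sample mean of log f(y_i) for the candidate density N(mu, 1)."""
    const = -0.5 * math.log(2 * math.pi)
    return sum(const - 0.5 * (y - mu) ** 2 for y in ys) / len(ys)


def best_candidate(ys, candidates):
    """Candidate mean whose density has the largest average log density."""
    return max(candidates, key=lambda mu: mean_log_density(ys, mu))


rng = random.Random(1)
TRUE_MU = 1.0  # assumed true mean, for illustration only
sample = [rng.gauss(TRUE_MU, 1.0) for _ in range(5000)]
candidates = [-1.0 + 0.25 * k for k in range(17)]  # -1.0, -0.75, ..., 3.0

if __name__ == "__main__":
    for mu in candidates:
        print(f"mu = {mu:5.2f}   mean log density = {mean_log_density(sample, mu):.4f}")
    print("candidate with the largest mean log density:", best_candidate(sample, candidates))
```

For the true density the sample mean of the log density is close to its expected value $-\tfrac12\log(2\pi) - \tfrac12$, and among the candidates the largest average is obtained at the grid point nearest to the sample mean.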
For the true density function, the expected value $E[\log f(y)]$ is higher than for all the other density functions. If one does not know which the true density function is, then it is a good strategy to select that density function $f$ for which the sample mean of the $\log f(y_i)$ is largest. This is the maximum likelihood estimator.

Let us interject here a short note about the definitional difference between density function and likelihood function. If we know $\mu = \mu_0$, we can write down the density function as
\[ f_y(y;\mu_0) = \frac{1}{\sqrt{2\pi}}e^{-\frac{(y-\mu_0)^2}{2}}. \tag{10.6.1} \]
It is a function of $y$, the possible values assumed by $y$, and the letter $\mu_0$ symbolizes a constant, the true parameter value. The same function considered as a function of the variable $\mu$, representing all possible values assumable by the true mean, with $y$ being fixed at the actually observed value, becomes the likelihood function. In the same way one can also turn probability mass functions $p_x(x)$ into likelihood functions.

Now let us compute some examples of the MLE. You make $n$ independent observations $y_1,\ldots,y_n$ from a $N(\mu,\sigma^2)$ distribution. Write the likelihood function as
\[ L(\mu,\sigma^2;y_1,\ldots,y_n) = \prod_{i=1}^n f_y(y_i) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n e^{-\frac{1}{2\sigma^2}\sum (y_i-\mu)^2}. \tag{10.6.2} \]
Its logarithm is more convenient to maximize:
\[ \ell = \ln L(\mu,\sigma^2;y_1,\ldots,y_n) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum (y_i-\mu)^2. \tag{10.6.3} \]
To compute the maximum we need the partial derivatives:
\begin{align}
\frac{\partial\ell}{\partial\mu} &= \frac{1}{\sigma^2}\sum (y_i-\mu) \tag{10.6.4}\\
\frac{\partial\ell}{\partial\sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum (y_i-\mu)^2. \tag{10.6.5}
\end{align}
The maximum likelihood estimators are those values $\hat\mu$ and $\hat\sigma^2$ which set these two partials zero. I.e., at the same time at which we set the partials zero we must put the hats on $\mu$ and $\sigma^2$. As long as $\hat\sigma^2 \ne 0$ (which is the case with probability one), the first equation determines $\hat\mu$: $\sum y_i - n\hat\mu = 0$, i.e., $\hat\mu = \frac{1}{n}\sum y_i = \bar y$. (This would be the MLE of $\mu$ even if $\sigma^2$ were known.)
Now plug this $\hat\mu$ into the second equation to get $\frac{n}{2} = \frac{1}{2\hat\sigma^2}\sum (y_i-\bar y)^2$, or $\hat\sigma^2 = \frac{1}{n}\sum (y_i-\bar y)^2$.

Here is another example: $t_1,\ldots,t_n$ are independent and follow an exponential distribution, i.e.,
\begin{align}
f_t(t;\lambda) &= \lambda e^{-\lambda t} \qquad (t>0) \tag{10.6.6}\\
L(t_1,\ldots,t_n;\lambda) &= \lambda^n e^{-\lambda(t_1+\cdots+t_n)} \tag{10.6.7}\\
\ell(t_1,\ldots,t_n;\lambda) &= n\ln\lambda - \lambda(t_1+\cdots+t_n) \tag{10.6.8}\\
\frac{\partial\ell}{\partial\lambda} &= \frac{n}{\lambda} - (t_1+\cdots+t_n). \tag{10.6.9}
\end{align}
Set this zero, and write $\hat\lambda$ instead of $\lambda$ to get $\hat\lambda = \frac{n}{t_1+\cdots+t_n} = 1/\bar t$.

Usually the MLE is asymptotically unbiased and asymptotically normal. Therefore it is important to have an estimate of its asymptotic variance. Here we can use the fact that asymptotically the Cramer-Rao lower bound is not merely a lower bound for this variance but is equal to its variance. (From this follows that the maximum likelihood estimator is asymptotically efficient.) The Cramer-Rao lower bound itself depends on unknown parameters. In order to get a consistent estimate of the Cramer-Rao lower bound, do the following: (1) Replace the unknown parameters in the second derivative of the log likelihood function by their maximum likelihood estimates. (2) Instead of taking expected values over the observed values $x_i$ you may simply insert the sample values of the $x_i$ into this second derivative, evaluated at the maximum likelihood estimates, and (3) then invert this estimate of the information matrix.

MLE obeys an important functional invariance principle: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. E.g., $\mu = \frac{1}{\lambda}$ is the expected value of the exponential variable, and its MLE is $\bar x$.

Problem 180. $x_1,\ldots,x_m$ is a sample from a $N(\mu_x,\sigma^2)$, and $y_1,\ldots,y_n$ from a $N(\mu_y,\sigma^2)$ with different mean but same $\sigma^2$. All observations are independent of each other.

• a. 2 points Show that the MLE of $\mu_x$, based on the combined sample, is $\bar x$. (By symmetry it follows that the MLE of $\mu_y$ is $\bar y$.)

Answer.
\begin{align}
\ell(\mu_x,\mu_y,\sigma^2) = {} & -\frac{m}{2}\ln 2\pi - \frac{m}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^m (x_i-\mu_x)^2 \notag\\
& -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{j=1}^n (y_j-\mu_y)^2 \tag{10.6.10}\\
\frac{\partial\ell}{\partial\mu_x} = {} & -\frac{1}{2\sigma^2}\sum -2(x_i-\mu_x) = 0 \quad\text{for } \mu_x = \bar x \tag{10.6.11}
\end{align}

• b. 2 points Derive the MLE of $\sigma^2$, based on the combined samples.

Answer.
\begin{align}
\frac{\partial\ell}{\partial\sigma^2} &= -\frac{m+n}{2\sigma^2} + \frac{1}{2\sigma^4}\left(\sum_{i=1}^m (x_i-\mu_x)^2 + \sum_{j=1}^n (y_j-\mu_y)^2\right) \tag{10.6.12}\\
\hat\sigma^2 &= \frac{1}{m+n}\left(\sum_{i=1}^m (x_i-\bar x)^2 + \sum_{j=1}^n (y_j-\bar y)^2\right). \tag{10.6.13}
\end{align}

10.7. Method of Moments Estimators

Method of moments estimators use the sample moments as estimates of the population moments. I.e., the estimate of $\mu$ is $\bar x$, the estimate of the variance $\sigma^2$ is $\frac{1}{n}\sum (x_i-\bar x)^2$, etc. If the parameters are a given function of the population moments, use the same function of the sample moments (using the lowest moments which do the job).

The advantage of method of moments estimators is their computational simplicity. Many of the estimators discussed above are method of moments estimators. However if the moments do not exist, then method of moments estimators are inconsistent, and in general method of moments estimators are not as good as maximum likelihood estimators.

10.8. M-Estimators

The class of M-estimators maximizes something other than a likelihood function: it includes nonlinear least squares, generalized method of moments, minimum distance and minimum chi-squared estimators. The purpose is to get a "robust" estimator which is good for a wide variety of likelihood functions. Many of these are asymptotically efficient; but their small-sample properties may vary greatly.

10.9. Sufficient Statistics and Estimation

Weak Sufficiency Principle: If $x$ has a p.d.f. $f_x(x;\theta)$ and if a sufficient statistic $s(x)$ exists for $\theta$, then identical conclusions should be drawn from data $x_1$ and $x_2$ which have same value $s(x_1) = s(x_2)$. Why?
Sufficiency means: after knowing $s(x)$, the rest of the data $x$ can be regarded as generated by a random mechanism not dependent on $\theta$, and is therefore uninformative about $\theta$.

This principle can be used to improve on given estimators. Without proof we will state here the

Rao-Blackwell Theorem: Let $t(x)$ be an estimator of $\theta$ and $s(x)$ a sufficient statistic for $\theta$. Then one can get an estimator $t^*(x)$ of $\theta$ which has no worse a MSE than $t(x)$ by taking expectations conditionally on the sufficient statistic, i.e., $t^*(x) = E[t(x)|s(x)]$.

To recapitulate: $t^*(x)$ is obtained by the following two steps: (1) Compute the conditional expectation $t^{**}(s) = E[t(x)|s(x) = s]$, and (2) plug $s(x)$ into $t^{**}$, i.e., $t^*(x) = t^{**}(s(x))$.

A statistic $s$ is said to be complete, if the only real-valued function $g$ defined on the range of $s$, which satisfies $E[g(s)] = 0$ whatever the value of $\theta$, is the function which is identically zero. If a statistic $s$ is complete and sufficient, then every function $g(s)$ is the minimum MSE unbiased estimator of its expected value $E[g(s)]$.

If a complete and sufficient statistic exists, this gives a systematic approach to minimum MSE unbiased estimators (Lehmann-Scheffé Theorem): if $t$ is an unbiased estimator of $\theta$ and $s$ is complete and sufficient, then $t^*(x) = E[t(x)|s(x)]$ has lowest MSE in the class of all unbiased estimators of $\theta$. Problem 181 steps you through the proof.

Problem 181. [BD77, Problem 4.2.6 on p. 144] If a statistic $s$ is complete and sufficient, then every function $g(s)$ is the minimum MSE unbiased estimator of $E[g(s)]$ (Lehmann-Scheffé theorem). This gives a systematic approach to finding minimum MSE unbiased estimators. Here are the definitions: $s$ is sufficient for $\theta$ if for any event $E$ and any value $s_0$, the conditional probability $\Pr[E \mid s \le s_0]$ does not involve $\theta$. $s$ is complete for $\theta$ if the only function $g(s)$ of $s$, which has zero expected value whatever the value of $\theta$, is the function which is identically zero, i.e., $g(s) = 0$ for all $s$.
• a.
3 points Given an unknown parameter $\theta$, and a complete sufficient statistic $s$, how can one find that function of $s$ whose expected value is $\theta$? There is an easy trick: start with any statistic $p$ with $E[p] = \theta$, and use the conditional expectation $E[p|s]$. Argue why this conditional expectation does not depend on the unknown parameter $\theta$, is an unbiased estimator of $\theta$, and why this leads to the same estimate regardless of which $p$ one starts with.

Answer. You need sufficiency for the first part of the problem, the law of iterated expectations for the second, and completeness for the third.

Set $E = \{p \le p_0\}$, for an arbitrary value $p_0$, in the definition of sufficiency given at the beginning of the Problem to see that the cdf of $p$ conditionally on $s$ being in any interval does not involve $\theta$, therefore also $E[p|s]$ does not involve $\theta$.

Unbiasedness follows from the theorem of iterated expectations $E\big[E[p|s]\big] = E[p] = \theta$.

The independence of the choice of $p$ can be shown as follows: Since the conditional expectation conditionally on $s$ is a function of $s$, we can use the notation $E[p|s] = g_1(s)$ and $E[q|s] = g_2(s)$, where $q$ is any other statistic with $E[q] = \theta$. From $E[p] = E[q]$ follows by the law of iterated expectations $E[g_1(s) - g_2(s)] = 0$, therefore by completeness $g_1(s) - g_2(s) \equiv 0$.

• b. 2 points Assume $y_i \sim \mathrm{NID}(\mu, 1)$ ($i = 1,\ldots,n$), i.e., they are independent and normally distributed with mean $\mu$ and variance 1. Without proof you are allowed to use the fact that in this case, the sample mean $\bar y$ is a complete sufficient statistic for $\mu$. What is the minimum MSE unbiased estimate of $\mu$, and what is that of $\mu^2$?

Answer. We have to find functions of $\bar y$ with the desired parameters as expected values. Clearly, $\bar y$ is that of $\mu$, and $\bar y^2 - 1/n$ is that of $\mu^2$.

• c. 1 point For a given $j$, let $\pi$ be the probability that the $j$th observation is nonnegative, i.e., $\pi = \Pr[y_j \ge 0]$. Show that $\pi = \Phi(\mu)$ where $\Phi$ is the cumulative distribution function of the standard normal.
The purpose of the remainder of this Problem is to find a minimum MSE unbiased estimator of $\pi$.

Answer.
\[ \pi = \Pr[y_j \ge 0] = \Pr[y_j - \mu \ge -\mu] = \Pr[y_j - \mu \le \mu] = \Phi(\mu) \tag{10.9.1} \]
because $y_j - \mu \sim N(0,1)$. We needed symmetry of the distribution to flip the sign.

• d. 1 point As a first step we have to find an unbiased estimator of $\pi$. It does not have to be a good one, any unbiased estimator will do. And such an estimator is indeed implicit in the definition of $\pi$. Let $q$ be the "indicator function" for nonnegative values, satisfying $q(y) = 1$ if $y \ge 0$ and 0 otherwise. We will be working with the random variable which one obtains by inserting the $j$th observation $y_j$ into $q$, i.e., with $q = q(y_j)$. Show that $q$ is an unbiased estimator of $\pi$.

Answer. $q(y_j)$ has a discrete distribution and $\Pr[q(y_j) = 1] = \Pr[y_j \ge 0] = \pi$ by (10.9.1) and therefore $\Pr[q(y_j) = 0] = 1 - \pi$. The expected value is $E[q(y_j)] = (1-\pi)\cdot 0 + \pi\cdot 1 = \pi$.

• e. 2 points Given $q$ we can apply the Lehmann-Scheffé theorem: $E[q(y_j)|\bar y]$ is the best unbiased estimator of $\pi$. We will compute $E[q(y_j)|\bar y]$ in four steps which build on each other. First step: since for every indicator function follows $E[q(y_j)|\bar y] = \Pr[y_j \ge 0|\bar y]$, we need for every given value $\bar y_0$, the conditional distribution of $y_j$ conditionally on $\bar y = \bar y_0$. (Not just the conditional mean but the whole conditional distribution.) In order to construct this, we first have to specify exactly the joint distribution of $y_j$ and $\bar y$:

Answer. They are jointly normal:
\[ \begin{bmatrix} y_j \\ \bar y \end{bmatrix} \sim N\left(\begin{bmatrix} \mu \\ \mu \end{bmatrix}, \begin{bmatrix} 1 & 1/n \\ 1/n & 1/n \end{bmatrix}\right) \tag{10.9.2} \]

• f. 2 points Second step: From this joint distribution derive the conditional distribution of $y_j$ conditionally on $\bar y = \bar y_0$. (Not just the conditional mean but the whole conditional distribution.) For this you will need formulas (7.3.18) and (7.3.20).

Answer.
Here are these two formulas: if $u$ and $v$ are jointly normal, then the conditional distribution of $v$ conditionally on $u = u_0$ is Normal with mean
\[ E[v \mid u = u_0] = E[v] + \frac{\operatorname{cov}[u,v]}{\operatorname{var}[u]}(u_0 - E[u]) \tag{10.9.3} \]
and variance
\[ \operatorname{var}[v \mid u = u_0] = \operatorname{var}[v] - \frac{(\operatorname{cov}[u,v])^2}{\operatorname{var}[u]}. \tag{10.9.4} \]
Plugging $u = \bar y$ and $v = y_j$ into (7.3.18) and (7.3.20) gives: the conditional distribution of $y_j$ conditionally on $\bar y = \bar y_0$ has mean
\begin{align}
E[y_j \mid \bar y = \bar y_0] &= E[y_j] + \frac{\operatorname{cov}[\bar y, y_j]}{\operatorname{var}[\bar y]}(\bar y_0 - E[\bar y]) \tag{10.9.5}\\
&= \mu + \frac{1/n}{1/n}(\bar y_0 - \mu) = \bar y_0 \tag{10.9.6}
\end{align}
and variance
\begin{align}
\operatorname{var}[y_j \mid \bar y = \bar y_0] &= \operatorname{var}[y_j] - \frac{(\operatorname{cov}[\bar y, y_j])^2}{\operatorname{var}[\bar y]} \tag{10.9.7}\\
&= 1 - \frac{(1/n)^2}{1/n} = 1 - \frac{1}{n}. \tag{10.9.8}
\end{align}
Therefore the conditional distribution of $y_j$ conditional on $\bar y = \bar y_0$ is $N(\bar y_0, (n-1)/n)$. How can this be motivated? If we know the actual arithmetic mean of the variables, then our best estimate is that each variable is equal to this arithmetic mean. And this additional knowledge cuts down the variance by $1/n$.

• g. 2 points The variance decomposition (6.6.6) gives a decomposition of $\operatorname{var}[y_j]$: give it here:

Answer.
\begin{align}
\operatorname{var}[y_j] &= \operatorname{var}\big[E[y_j|\bar y]\big] + E\big[\operatorname{var}[y_j|\bar y]\big] \tag{10.9.9}\\
&= \operatorname{var}[\bar y] + E\left[\frac{n-1}{n}\right] = \frac{1}{n} + \frac{n-1}{n} = 1 \tag{10.9.10}
\end{align}

• h. Compare the conditional with the unconditional distribution.

Answer. Conditional distribution does not depend on unknown parameters, and it has smaller variance!

• i. 2 points Third step: Compute the probability, conditionally on $\bar y = \bar y_0$, that $y_j \ge 0$.

Answer. If $x \sim N(\bar y_0, (n-1)/n)$ (I call it $x$ here instead of $y_j$ since we use it not with its familiar unconditional distribution $N(\mu, 1)$ but with a conditional distribution), then
\[ \Pr[x \ge 0] = \Pr[x - \bar y_0 \ge -\bar y_0] = \Pr[x - \bar y_0 \le \bar y_0] = \Pr\left[(x - \bar y_0)\sqrt{n/(n-1)} \le \bar y_0\sqrt{n/(n-1)}\right] = \Phi\left(\bar y_0\sqrt{n/(n-1)}\right) \]
because $(x - \bar y_0)\sqrt{n/(n-1)} \sim N(0,1)$ conditionally on $\bar y = \bar y_0$. Again we needed symmetry of the distribution to flip the sign.

• j.
1 point Finally, put all the pieces together and write down $E[q(y_j)|\bar y]$, the conditional expectation of $q(y_j)$ conditionally on $\bar y$, which by the Lehmann-Scheffé theorem is the minimum MSE unbiased estimator of $\pi$. The formula you should come up with is
\[ \hat\pi = \Phi\left(\bar y\sqrt{n/(n-1)}\right), \tag{10.9.11} \]
where $\Phi$ is the standard normal cumulative distribution function.

Answer. The conditional expectation of $q(y_j)$ conditionally on $\bar y = \bar y_0$ is, by part d, simply the probability that $y_j \ge 0$ under this conditional distribution. In part i this was computed as $\Phi\left(\bar y_0\sqrt{n/(n-1)}\right)$. Therefore all we have to do is replace the value $\bar y_0$ by the random variable $\bar y$ to get the minimum MSE unbiased estimator of $\pi$ as $\Phi\left(\bar y\sqrt{n/(n-1)}\right)$.

Remark: this particular example did not give any brand new estimators, but it can rather be considered a proof that certain obvious estimators are unbiased and efficient. But often this same procedure gives new estimators which one would not have been able to guess. Already when the variance is unknown, the above example becomes quite a bit more complicated, see [Rao73, p. 322, example 2]. When the variables have an exponential distribution then this example (probability of early failure) is discussed in [BD77, example 4.2.4 on pp. 124/5].

10.10. The Likelihood Principle

Consider two experiments whose likelihood functions depend on the same parameter vector $\theta$. Suppose that for particular realizations of the data $y_1$ and $y_2$, the respective likelihood functions are proportional to each other, i.e., $L_1(\theta; y_1) = \alpha L_2(\theta; y_2)$ where $\alpha$ does not depend on $\theta$ although it may depend on $y_1$ and $y_2$. Then the likelihood principle states that identical conclusions should be drawn from these two experiments about $\theta$.

The likelihood principle is equivalent to the combination of two simpler principles: the weak sufficiency principle, and the following principle, which seems very plausible:

Weak Conditionality Principle: Given two possible experiments A and B.
A mixed experiment is one in which one throws a coin and performs A if the coin shows heads and B if it shows tails. The weak conditionality principle states: suppose it is known that the coin shows tails. Then the evidence of the mixed experiment is equivalent to the evidence gained had one not thrown the coin but performed B without the possible alternative of A. This principle says therefore that an experiment which one did not do but which one could have performed does not alter the information gained from the experiment actually performed.

As an application of the likelihood principle look at the following situation:

Problem 182. 3 points You have a Bernoulli experiment with unknown parameter $\theta$, $0 \le \theta \le 1$. Person A was originally planning to perform this experiment 12 times, which she does. She obtains 9 successes and 3 failures. Person B was originally planning to perform the experiment until he has reached 9 successes, and it took him 12 trials to do this. Should both experimenters draw identical conclusions from these two experiments or not?

Answer. The probability mass function in the first is by (3.7.1) $\binom{12}{9}\theta^9(1-\theta)^3$, and in the second it is by (4.1.13) $\binom{11}{8}\theta^9(1-\theta)^3$. They are proportional, the stopping rule therefore does not matter!

10.11. Bayesian Inference

Real-life estimation usually implies the choice between competing estimation methods all of which have their advantages and disadvantages. Bayesian inference removes some of this arbitrariness.

Bayesians claim that "any inferential or decision process that does not follow from some likelihood function and some set of priors has objectively verifiable deficiencies" [Cor69, p. 617]. The "prior information" used by Bayesians is a formalization of the notion that the information about the parameter values never comes from the experiment alone.
The Bayesian approach to estimation forces the researcher to cast his or her prior knowledge (and also the loss function for estimation errors) in a mathematical form, because in this way, unambiguous mathematical prescriptions can be derived as to how the information of an experiment should be evaluated.

To the objection that these are large information requirements which are often not satisfied, one might answer that it is less important whether these assumptions are actually the right ones. The formulation of a prior density merely ensures that the researcher proceeds from a coherent set of beliefs.

The mathematics which the Bayesians do is based on a "final" instead of an "initial" criterion of precision. In other words, the procedure sought is not one which will be good on average in hypothetical repetitions of the experiment, but one which is good for the given set of data and the given set of priors. Data which could have been observed but were not observed are not taken into consideration.

Both Bayesians and non-Bayesians define the probabilistic properties of an experiment by the density function (likelihood function) of the observations, which may depend on one or several unknown parameters. The non-Bayesian considers these parameters fixed but unknown, while the Bayesian considers the parameters random, i.e., he symbolizes his prior information about the parameters by a prior probability distribution.

An excellent example in which this prior probability distribution is discrete is given in [Ame94, pp. 168–172]. In the more usual case that the prior distribution has a density function, a Bayesian is working with the joint density function of the parameter values and the data. Like every joint density function, it can be written as the product of a marginal and a conditional density.
The marginal density of the parameter value represents the beliefs the experimenter holds about the parameters before the experiment (prior density), and the likelihood function of the experiment is the conditional density of the data given the parameters. After the experiment has been conducted, the experimenter's belief about the parameter values is represented by their conditional density given the data, called the posterior density.

Let $y$ denote the observations, $\theta$ the unknown parameters, and $f(y,\theta)$ their joint density. Then
\begin{align}
f(y,\theta) &= f(\theta)f(y|\theta) \tag{10.11.1}\\
&= f(y)f(\theta|y). \tag{10.11.2}
\end{align}
Therefore
\[ f(\theta|y) = \frac{f(\theta)f(y|\theta)}{f(y)}. \tag{10.11.3} \]
In this formula, the value of $f(y)$ is irrelevant. It only depends on $y$ but not on $\theta$, but $y$ is fixed, i.e., it is a constant. If one knows the posterior density function of $\theta$ up to a constant, one knows it altogether, since the constant is determined by the requirement that the area under the density function is 1. Therefore (10.11.3) is usually written as ($\propto$ means "proportional to")
\[ f(\theta|y) \propto f(\theta)f(y|\theta); \tag{10.11.4} \]
here the lefthand side contains the posterior density function of the parameter, the righthand side the prior density function and the likelihood function representing the probability distribution of the experimental data.

The Bayesian procedure does not yield a point estimate or an interval estimate, but a whole probability distribution for the unknown parameters (which represents our information about these parameters) containing the "prior" information "updated" by the information yielded by the sample outcome.

Of course, such probability distributions can be summarized by various measures of location (mean, median), which can then be considered Bayesian point estimates. Such summary measures for a whole probability distribution are rather arbitrary. But if a loss function is given, then this process of distilling point estimates from the posterior distribution can once more be systematized.
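The proportionality (10.11.4) can be illustrated with a small grid computation. The following is a minimal sketch under stated assumptions, not part of the original notes: it takes the data of Problem 182 (9 successes and 3 failures in 12 Bernoulli trials), assumes a uniform prior on $\theta$ purely for illustration, and normalizes prior times likelihood numerically. All function names are hypothetical. It also shows that the binomial and the negative binomial likelihood, which differ only by a constant factor, lead to the same posterior.

```python
import math


def grid_posterior(prior, likelihood, n_grid=1000):
    """Normalize prior(theta) * likelihood(theta) on a midpoint grid over (0, 1)."""
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    weights = [prior(t) * likelihood(t) for t in thetas]
    total = sum(weights)  # plays the role of the irrelevant constant f(y)
    return thetas, [w / total for w in weights]


def posterior_mean(thetas, probs):
    """Mean of the discretized posterior, one possible location summary."""
    return sum(t * p for t, p in zip(thetas, probs))


def uniform_prior(theta):
    # Assumed prior for illustration only.
    return 1.0


def binomial_likelihood(theta):
    # Person A: 9 successes in a fixed number of 12 trials.
    return math.comb(12, 9) * theta ** 9 * (1 - theta) ** 3


def negative_binomial_likelihood(theta):
    # Person B: 12 trials were needed to reach the 9th success.
    return math.comb(11, 8) * theta ** 9 * (1 - theta) ** 3


mean_a = posterior_mean(*grid_posterior(uniform_prior, binomial_likelihood))
mean_b = posterior_mean(*grid_posterior(uniform_prior, negative_binomial_likelihood))

if __name__ == "__main__":
    print("posterior mean, binomial likelihood:         ", mean_a)
    print("posterior mean, negative binomial likelihood:", mean_b)
```

Under the assumed uniform prior the exact posterior is a Beta(10, 4) distribution with mean 10/14, which the grid approximation reproduces closely; the constant factor in front of the likelihood drops out in the normalization.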
For a concrete decision, a given loss function singles out that parameter value which minimizes the expected loss under the posterior density function, the so-called “Bayes risk.” This can be considered the Bayesian analog of a point estimate. For instance, if the loss function is quadratic, then the posterior mean is the parameter value which minimizes expected loss.

There is a difference between Bayes risk and the notion of risk we applied previously. The frequentist minimizes expected loss in a large number of repetitions of the trial. This risk is dependent on the unknown parameters, and therefore usually no estimators exist which give minimum risk in all situations. The Bayesian conditions on the data (final criterion!) and minimizes the expected loss where the expectation is taken over the posterior density of the parameter vector.

The irreducibility of absence to presences: the absence of knowledge (or also the absence of regularity itself) cannot be represented by a probability distribution. Proof: if I give a certain random variable a neutral prior, then functions of this random variable have non-neutral priors. This argument is made in [Roy97, p. 174].

Many good Bayesians drift away from the subjective point of view and talk about a stratified world: their center of attention is no longer the world out there versus our knowledge of it, but the empirical world versus the underlying systematic forces that shape it.

Bayesians say that frequentists use subjective elements too; their outcomes depend on what the experimenter planned to do, even if he never did it. This again comes from [Roy97, p. ??]. Nature does not know about the experimenter’s plans, and any evidence should be evaluated in a way independent of this.

CHAPTER 11

Interval Estimation

Look at our simplest example of an estimator, the sample mean of an independent sample from a normally distributed variable.
Since the population mean of a normal variable is at the same time its median, the sample mean will in 50 percent of the cases be larger than the population mean, and in 50 percent of the cases it will be smaller. This is a statement about the procedure by which the sample mean was obtained, not about any given observed value of the sample mean. Say in one particular sample the observed sample mean was 3.5. This number 3.5 is either larger or smaller than the true mean; there is no probability involved. But if one were to compute sample means of many different independent samples, then these means would in 50% of the cases lie above and in 50% of the cases below the population mean. This is why one can, from knowing how this one given number was obtained, derive the “confidence” of 50% that the actual mean lies above 3.5, and the same with below. The sample mean can therefore be considered a one-sided confidence bound, although one usually wants higher confidence levels than 50%. (I am 95% confident that $\phi$ is greater than or equal to a certain value computed from the sample.) The concept of “confidence” is nothing but the usual concept of probability if one uses an initial criterion of precision.

The following thought experiment illustrates what is involved. Assume you bought a widget and want to know whether it is defective or not. The obvious way (which would correspond to a “final” criterion of precision) would be to open it up and look whether it is defective or not. Now assume we cannot do it: there is no way of telling by just looking at it whether it will work. Then another strategy would be to go by an “initial” criterion of precision: we visit the widget factory and look at how they make them, how much quality control there is and such. And if we find out that 95% of all widgets coming out of the same factory have no defects, then we have the “confidence” of 95% that our particular widget is not defective either.
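The claim that the sample mean is a one-sided confidence bound with confidence level 50% is a claim about the procedure, and it can be illustrated by repeating the procedure many times. The following R lines are only a sketch; the population values mu and sigma, the sample size n, the number of repetitions, and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
mu <- 3              # hypothetical true mean
sigma <- 2           # hypothetical true standard deviation
n <- 10              # observations per sample
reps <- 10000        # number of independent samples

# One sample mean per repetition of the experiment.
means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Fraction of repetitions in which the sample mean lies above the true mean.
frac.above <- mean(means > mu)
print(frac.above)
```

The printed fraction should lie close to one half. Any single entry of means is simply above or below mu, just like the number 3.5 in the text; the 50% refers to the whole collection of repetitions.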
The matter becomes only slightly more mystified if one talks about intervals. Again, one should not forget that confidence intervals are random intervals. Besides confidence intervals and one-sided confidence bounds one can, if one regards several parameters simultaneously, also construct confidence rectangles, ellipsoids and more complicated shapes. Therefore we will define in all generality:

Let $y$ be a random vector whose distribution depends on some vector of unknown parameters $\phi \in \Omega$. A confidence region is a prescription which assigns to every possible observed value of $y$ a subset $R(y) \subset \Omega$ of parameter space, so that the probability that this subset covers the true value of $\phi$ is at least a given confidence level $1 - \alpha$, i.e.,

(11.0.5)  $\Pr[R(y) \ni \phi_0 \mid \phi = \phi_0] \ge 1 - \alpha$ for all $\phi_0 \in \Omega$.

The important thing to remember about this definition is that these regions $R(y)$ are random regions; every time one performs the experiment one obtains a different region.

Now let us go to the specific case of constructing an interval estimate for the parameter $\mu$ when we have $n$ independent observations from a normally distributed population $\sim N(\mu, \sigma^2)$ in which neither $\mu$ nor $\sigma^2$ is known. The vector of observations is therefore distributed as $y \sim N(\iota\mu, \sigma^2 I)$, where $\iota\mu$ is the vector every component of which is $\mu$.

I will give you now what I consider to be the cleanest argument deriving the so-called $t$-interval. It generalizes directly to the $F$-test in linear regression. It is not the same derivation which you will usually find, and I will give the usual derivation below for comparison. Recall the observation made earlier, based on (9.1.1), that the sample mean $\bar y$ is that number $a = \bar y$ which minimizes the sum of squared deviations $\sum (y_i - a)^2$. (In other words, $\bar y$ is the “least squares estimate” in this situation.)
This least squares principle also naturally leads to interval estimates for $\mu$: we will say that $a$ lies in the interval for $\mu$ if and only if

(11.0.6)  $\dfrac{\sum (y_i - a)^2}{\sum (y_i - \bar y)^2} \le c$

for some number $c \ge 1$. Of course, the value of $c$ depends on the confidence level, but the beauty of this criterion here is that the value of $c$ can be determined by the confidence level alone without knowledge of the true values of $\mu$ or $\sigma^2$.

To show this, note first that (11.0.6) is equivalent to

(11.0.7)  $\dfrac{\sum (y_i - a)^2 - \sum (y_i - \bar y)^2}{\sum (y_i - \bar y)^2} \le c - 1$

and then apply the identity $\sum (y_i - a)^2 = \sum (y_i - \bar y)^2 + n(\bar y - a)^2$ to the numerator to get the following equivalent formulation of (11.0.6):

(11.0.8)  $\dfrac{n(\bar y - a)^2}{\sum (y_i - \bar y)^2} \le c - 1$

The confidence level of this interval is the probability that the true $\mu$ lies in an interval randomly generated using this principle. In other words, it is

(11.0.9)  $\Pr\left[\dfrac{n(\bar y - \mu)^2}{\sum (y_i - \bar y)^2} \le c - 1\right]$

Although for every known $a$, the probability that $a$ lies in the confidence interval depends on the unknown $\mu$ and $\sigma^2$, we will show now that the probability that the unknown $\mu$ lies in the confidence interval does not depend on any unknown parameters. First look at the distribution of the numerator: Since $\bar y \sim N(\mu, \sigma^2/n)$, it follows $(\bar y - \mu)^2 \sim (\sigma^2/n)\chi^2_1$. We also know the distribution of the denominator. Earlier we have shown that the variable $\sum (y_i - \bar y)^2$ is a $\sigma^2\chi^2_{n-1}$. It is not enough to know the distribution of numerator and denominator separately, we also need their joint distribution. For this go back to our earlier discussion of variance estimation again; there we also showed that $\bar y$ is independent of the vector $[\,y_1 - \bar y \;\cdots\; y_n - \bar y\,]^\top$; therefore any function of $\bar y$ is also independent of any function of this vector, from which follows that numerator and denominator in our fraction are independent.
Therefore this fraction is distributed as a $\sigma^2\chi^2_1$ over an independent $\sigma^2\chi^2_{n-1}$, and since the $\sigma^2$’s cancel out, this is the same as a $\chi^2_1$ over an independent $\chi^2_{n-1}$. In other words, this distribution does not depend on any unknown parameters!

The definition of an $F$-distribution with $k$ and $m$ degrees of freedom is the distribution of a ratio of a $\chi^2_k/k$ divided by a $\chi^2_m/m$; therefore if we divide the sum of squares in the denominator by $n - 1$ we get an $F$ distribution with 1 and $n - 1$ d.f.:

(11.0.10)  $\dfrac{(\bar y - \mu)^2}{\frac{1}{n}\,\frac{1}{n-1}\sum (y_i - \bar y)^2} \sim F_{1,n-1}$

If one does not take the square in the numerator, i.e., works with $\bar y - \mu$ instead of $(\bar y - \mu)^2$, and takes the square root in the denominator, one obtains a $t$-distribution:

(11.0.11)  $\dfrac{\bar y - \mu}{\sqrt{\frac{1}{n}\,\frac{1}{n-1}\sum (y_i - \bar y)^2}} \sim t_{n-1}$

The left hand side of this last formula has a suggestive form. It can be written as $(\bar y - \mu)/s_{\bar y}$, where $s_{\bar y}$ is an estimate of the standard deviation of $\bar y$ (it is the square root of the unbiased estimate of the variance of $\bar y$). In other words, this $t$-statistic can be considered an estimate of the number of standard deviations the observed value of $\bar y$ is away from $\mu$.

Now we will give, as promised, the usual derivation of the $t$-confidence intervals, which is based on this interpretation. This usual derivation involves the following two steps:

(1) First assume that $\sigma^2$ is known. Then it is obvious what to do; for every observed value $y$ construct the following interval:

(11.0.12)  $R(y) = \{u \in \mathbb{R} : |u - \bar y| \le N_{(\alpha/2)}\,\sigma_{\bar y}\}.$

What do these symbols mean? The interval $R$ (as in region) has $y$ as an argument, i.e., it is denoted $R(y)$, because it depends on the observed value $y$. $\mathbb{R}$ is the set of real numbers. $N_{(\alpha/2)}$ is the upper $\alpha/2$-quantile of the Normal distribution, i.e., it is that number $c$ for which a standard Normal random variable $z$ satisfies $\Pr[z \ge c] = \alpha/2$.
Since, by the symmetry of the Normal distribution, $\Pr[z \le -c] = \alpha/2$ as well, one obtains for a two-sided test:

(11.0.13)  $\Pr[|z| \ge N_{(\alpha/2)}] = \alpha.$

From this follows the coverage probability:

(11.0.14)  $\Pr[R(y) \ni \mu] = \Pr[|\mu - \bar y| \le N_{(\alpha/2)}\,\sigma_{\bar y}]$
(11.0.15)  $\phantom{\Pr[R(y) \ni \mu]} = \Pr[|(\mu - \bar y)/\sigma_{\bar y}| \le N_{(\alpha/2)}] = \Pr[|{-z}| \le N_{(\alpha/2)}] = 1 - \alpha$

since $z = (\bar y - \mu)/\sigma_{\bar y}$ is a standard Normal. I.e., $R(y)$ is a confidence interval for $\mu$ with confidence level $1 - \alpha$.

(2) Second part: what if $\sigma^2$ is not known? Here a seemingly ad-hoc way out would be to replace $\sigma^2$ by its unbiased estimate $s^2$. Of course, then the Normal distribution no longer applies. However if one replaces the normal critical values by those of the $t_{n-1}$ distribution, one still gets, by miraculous coincidence, a confidence level which is independent of any unknown parameters.

Problem 183. If $y_i \sim \mathrm{NID}(\mu, \sigma^2)$ (normally independently distributed) with $\mu$ and $\sigma^2$ unknown, then the confidence interval for $\mu$ has the form

(11.0.16)  $R(y) = \{u \in \mathbb{R} : |u - \bar y| \le t_{(n-1;\alpha/2)}\, s_{\bar y}\}.$

Here $t_{(n-1;\alpha/2)}$ is the upper $\alpha/2$-quantile of the $t$ distribution with $n - 1$ degrees of freedom, i.e., it is that number $c$ for which a random variable $t$ which has a $t$ distribution with $n - 1$ degrees of freedom satisfies $\Pr[t \ge c] = \alpha/2$. And $s_{\bar y}$ is obtained as follows: write down the standard deviation of $\bar y$ and replace $\sigma$ by $s$. One can also say $s_{\bar y} = \sigma_{\bar y}\,\frac{s}{\sigma}$ where $\sigma_{\bar y}$ is an abbreviated notation for $\operatorname{std.dev}[\bar y] = \sqrt{\operatorname{var}[\bar y]}$.

Table 1. Percentiles of Student’s t Distribution. Table entry $x$ satisfies $\Pr[t_n \le x] = p$.

         p = .750    .900    .950    .975    .990    .995
  n = 1     1.000   3.078   6.314  12.706  31.821  63.657
  n = 2     0.817   1.886   2.920   4.303   6.965   9.925
  n = 3     0.765   1.638   2.354   3.182   4.541   5.841
  n = 4     0.741   1.533   2.132   2.776   3.747   4.604
  n = 5     0.727   1.476   2.015   2.571   3.365   4.032

• a. 1 point Write down the formula for $s_{\bar y}$.

Answer. Start with $\sigma^2_{\bar y} = \operatorname{var}[\bar y] = \frac{\sigma^2}{n}$, therefore $\sigma_{\bar y} = \sigma/\sqrt{n}$, and

(11.0.17)  $s_{\bar y} = s/\sqrt{n} = \sqrt{\dfrac{\sum (y_i - \bar y)^2}{n(n-1)}}$

• b.
2 points Compute the coverage probability of the interval (11.0.16).

Answer. The coverage probability is

(11.0.18)  $\Pr[R(y) \ni \mu] = \Pr\big[\,|\mu - \bar y| \le t_{(n-1;\alpha/2)}\, s_{\bar y}\big]$
(11.0.19)  $\phantom{\Pr[R(y) \ni \mu]} = \Pr\left[\left|\dfrac{\mu - \bar y}{s_{\bar y}}\right| \le t_{(n-1;\alpha/2)}\right]$
(11.0.20)  $\phantom{\Pr[R(y) \ni \mu]} = \Pr\left[\left|\dfrac{(\mu - \bar y)/\sigma_{\bar y}}{s_{\bar y}/\sigma_{\bar y}}\right| \le t_{(n-1;\alpha/2)}\right]$
(11.0.21)  $\phantom{\Pr[R(y) \ni \mu]} = \Pr\left[\left|\dfrac{(\bar y - \mu)/\sigma_{\bar y}}{s/\sigma}\right| \le t_{(n-1;\alpha/2)}\right]$
(11.0.22)  $\phantom{\Pr[R(y) \ni \mu]} = 1 - \alpha,$

because the expression in the numerator is a standard normal, and the expression in the denominator is the square root of an independent $\chi^2_{n-1}$ divided by $n - 1$. The random variable between the absolute signs has therefore a $t$-distribution, and (11.0.22) follows from (30.4.8).

• c. 2 points Four independent observations are available of a normal random variable with unknown mean $\mu$ and variance $\sigma^2$: the values are $-2$, $-\sqrt{2}$, $+\sqrt{2}$, and $+2$. (These are not the kind of numbers you are usually reading off a measurement instrument, but they make the calculation easy.) Give a 95% confidence interval for $\mu$. Table 1 gives the percentiles of the $t$-distribution.

Answer. In our situation

(11.0.23)  $\dfrac{\bar x - \mu}{s/\sqrt{n}} \sim t_3$

According to table 1, for $b = 3.182$ follows

(11.0.24)  $\Pr[t_3 \le b] = 0.975$

therefore

(11.0.25)  $\Pr[t_3 > b] = 0.025$

and by symmetry of the $t$-distribution

(11.0.26)  $\Pr[t_3 < -b] = 0.025$

Now subtract (11.0.26) from (11.0.24) to get

(11.0.27)  $\Pr[-b \le t_3 \le b] = 0.95$

or

(11.0.28)  $\Pr[|t_3| \le b] = 0.95$

or, plugging in the formula for $t_3$,

(11.0.29)  $\Pr\left[\left|\dfrac{\bar x - \mu}{s/\sqrt{n}}\right| \le b\right] = .95$
(11.0.30)  $\Pr[|\bar x - \mu| \le bs/\sqrt{n}] = .95$
(11.0.31)  $\Pr[-bs/\sqrt{n} \le \mu - \bar x \le bs/\sqrt{n}] = .95$
(11.0.32)  $\Pr[\bar x - bs/\sqrt{n} \le \mu \le \bar x + bs/\sqrt{n}] = .95$

the confidence interval is therefore $[\bar x - bs/\sqrt{n},\ \bar x + bs/\sqrt{n}]$. In our sample, $\bar x = 0$, $s^2 = \frac{12}{3} = 4$, $n = 4$, therefore $s^2/n = 1$, therefore also $s/\sqrt{n} = 1$. So the sample value of the confidence interval is $[-3.182, +3.182]$.

Problem 184.
Using R, construct 20 samples of 12 observations each from a $N(0, 1)$ distribution, construct the 95% confidence $t$-intervals for the mean based on these 20 samples, plot these intervals, and count how many intervals contain the true mean.

Here are the commands: stdnorms <- matrix(rnorm(240), nrow=12, ncol=20) gives a 12 × 20 matrix containing 240 independent random normals. You get the vector containing the midpoints of the confidence intervals by the assignment midpts <- apply(stdnorms, 2, mean). About apply see [BCW96, p. 130]. The vector containing the half width of each confidence interval can be obtained by another use of apply: halfwidth <- (qt(0.975,11)/sqrt(12)) * sqrt(apply(stdnorms,2,var)). To print the values on the screen you may simply issue the command cbind(midpts-halfwidth, midpts+halfwidth). But it is much better to plot them. Since such a plot does not have one of the usual formats, we have to put it together with some low-level commands. See [BCW96, page 325]. At the very minimum we need the following: frame() starts a new plot. par(usr = c(1, 20, range(c(midpts-halfwidth, midpts+halfwidth)))) sets a coordinate system which accommodates all intervals. The 20 confidence intervals are constructed by segments(1:20, midpts-halfwidth, 1:20, midpts+halfwidth). Finally, abline(0,0) adds a horizontal line, so that you can see how many intervals contain the true mean.

The ecmet package has a function confint.segments which draws such plots automatically. Choose how many observations in each experiment (the argument n), and how many confidence intervals (the argument rep), and the confidence level level (the default is here 95%), and then issue, e.g., the command confint.segments(n=50, rep=100, level=.9).
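The individual commands of Problem 184 can be collected into one script. The following is a sketch which only puts the commands given above together; the seed and the names lower, upper, and covered are additions made here for illustration, and the last lines do the counting by computer instead of by eye.

```r
set.seed(184)   # arbitrary seed, only so that the run can be repeated

# 12 x 20 matrix: each column is one sample of 12 standard normals.
stdnorms <- matrix(rnorm(240), nrow = 12, ncol = 20)

# Midpoint and half width of the 95% t-interval for each of the 20 samples.
midpts <- apply(stdnorms, 2, mean)
halfwidth <- (qt(0.975, 11) / sqrt(12)) * sqrt(apply(stdnorms, 2, var))
lower <- midpts - halfwidth
upper <- midpts + halfwidth

# Low-level plot: one vertical segment per interval, horizontal line at the true mean 0.
frame()
par(usr = c(1, 20, range(c(lower, upper))))
segments(1:20, lower, 1:20, upper)
abline(0, 0)

# Number of intervals which contain the true mean 0.
covered <- sum(lower <= 0 & upper >= 0)
print(covered)
```

Each run without the set.seed line gives a different set of intervals and possibly a different count, which is the sense in which confidence intervals are random intervals. Over many runs, about 95% of all intervals should cover zero.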
Here is the transcript of the function confint.segments:

    confint.segments <- function(n, rep, level = 95/100) {
      # n x rep matrix: each column is one sample of n standard normals
      stdnormals <- matrix(rnorm(n * rep), nrow = n, ncol = rep)
      midpts <- apply(stdnormals, 2, mean)
      halfwidth <- qt(p = (1 + level)/2, df = n - 1) * sqrt(1/n) *
        sqrt(apply(stdnormals, 2, var))
      frame()
      # coordinates of the upper and lower endpoints of all intervals
      x <- c(1:rep, 1:rep)
      y <- c(midpts + halfwidth, midpts - halfwidth)
      par(usr = c(1, rep, range(y)))
      segments(1:rep, midpts - halfwidth, 1:rep, midpts + halfwidth)
      abline(0, 0)
      invisible(cbind(x, y))
    }

This function draws the plot as a “side effect,” but it also returns a matrix with the coordinates of the endpoints of the plots (without printing them on the screen). This matrix can be used as input for the identify function. If you do for instance iddata <- confint.segments(12, 20) and then identify(iddata, labels=iddata[,2]), then the following happens: if you move the mouse cursor on the graph near one of the endpoints of one of the intervals, and click the left button, then it will print on the graph the coordinate of the boundary of this interval. Clicking any other button of the mouse gets you out of the identify function.

CHAPTER 12

Hypothesis Testing

Imagine you are a business person considering a major investment in order to launch a new product. The sales prospects of this product are not known with certainty. You have to rely on the outcome of $n$ marketing surveys that measure the demand for the product once it is offered. If $\mu$ is the actual (unknown) rate of return on the investment, each of these surveys here will be modeled as a random variable, which has a Normal distribution with this mean $\mu$ and known variance 1. Let $y_1, y_2, \ldots, y_n$ be the observed survey results. How would you decide whether to build the plant?

The intuitively reasonable thing to do is to go ahead with the investment if the sample mean of the observations is greater than a given value $c$, and not to do it otherwise.
This is indeed an optimal decision rule, and we will discuss in what respect it is, and how $c$ should be picked.

Your decision can be the wrong decision in two different ways: either you decide to go ahead with the investment although there will be no demand for the product, or you fail to invest although there would have been demand. There is no decision rule which eliminates both errors at once; the first error would be minimized by the rule never to produce, and the second by the rule always to produce. In order to determine the right tradeoff between these errors, it is important to be aware of their asymmetry. The error to go ahead with production although there is no demand has potentially disastrous consequences (loss of a lot of money), while the other error may cause you to miss a profit opportunity, but there is no actual loss involved, and presumably you can find other opportunities to invest your money.

To express this asymmetry, the error with the potentially disastrous consequences is called “error of type one,” and the other “error of type two.” The distinction between type one and type two errors can also be made in other cases. Locking up an innocent person is an error of type one, while letting a criminal go unpunished is an error of type two; publishing a paper with false results is an error of type one, while foregoing an opportunity to publish is an error of type two (at least this is what it ought to be).

Such an asymmetric situation calls for an asymmetric decision rule. One needs strict safeguards against committing an error of type one, and if there are several decision rules which are equally safe with respect to errors of type one, then one will select among those that decision rule which minimizes the error of type two.

Let us look here at decision rules of the form: make the investment if $\bar y > c$. An error of type one occurs if the decision rule advises you to make the investment while there is no demand for the product.
This will be the case if $\bar y > c$ but $\mu \le 0$. The probability of this error depends on the unknown parameter $\mu$, but it is at most $\alpha = \Pr[\bar y > c \mid \mu = 0]$. This maximum value of the type one error probability is called the significance level, and you, as the director of the firm, will have to decide on $\alpha$ depending on how tolerable it is to lose money on this venture, which presumably depends on the chances to lose money on alternative investments. It is a serious shortcoming of the classical theory of hypothesis testing that it does not provide good guidelines how $\alpha$ should be chosen, and how it should change with sample size. Instead, there is the tradition to choose $\alpha$ to be either 5% or 1% or 0.1%. Given $\alpha$, a table of the cumulative standard normal distribution function allows you to find that $c$ for which $\Pr[\bar y > c \mid \mu = 0] = \alpha$.

[Figure 1. Eventually this Figure will show the Power function of a one-sided normal test, i.e., the probability of error of type one as a function of $\mu$; right now this is simply the cdf of a Standard Normal. Horizontal axis from −3 to 3.]

Problem 185. 2 points Assume each $y_i \sim N(\mu, 1)$, $n = 400$ and $\alpha = 0.05$, and different $y_i$ are independent. Compute the value $c$ which satisfies $\Pr[\bar y > c \mid \mu = 0] = \alpha$. You should either look it up in a table and include a xerox copy of the table with the entry circled and the complete bibliographic reference written on the xerox copy, or do it on a computer, writing exactly which commands you used. In R, the function qnorm does what you need; find out about it by typing help(qnorm).

Answer. In the case $n = 400$, $\bar y$ has variance 1/400 and therefore standard deviation 1/20 = 0.05.
Therefore $20\bar y$ is a standard normal: from $\Pr[\bar y > c \mid \mu = 0] = 0.05$ follows $\Pr[20\bar y > 20c \mid \mu = 0] = 0.05$. Therefore $20c = 1.645$ can be looked up in a table, perhaps use [JHG+88, p. 986], the row for ∞ d.f.

Let us do this in R. The $p$-“quantile” of the distribution of the random variable $y$ is defined as that value $q$ for which $\Pr[y \le q] = p$. If $y$ is normally distributed, this quantile is computed by the R-function qnorm(p, mean=0, sd=1, lower.tail=TRUE). In the present case we need either qnorm(p=1-0.05, mean=0, sd=0.05) or qnorm(p=0.05, mean=0, sd=0.05, lower.tail=FALSE) which gives the value 0.08224268.

Choosing a decision which makes a loss unlikely is not enough; your decision must also give you a chance of success. E.g., the decision rule to build the plant if $-0.06 \le \bar y \le -0.05$ and not to build it otherwise is completely perverse, although the significance level of this decision rule is approximately 4% (if $n = 100$). In other words, the significance level is not enough information for evaluating the performance of the test. You also need the “power function,” which gives you the probability with which the test advises you to make the “critical” decision, as a function of the true parameter values. (Here the “critical” decision is that decision which might potentially lead to an error of type one.) By the definition of the significance level, the power function does not exceed the significance level for those parameter values for which going ahead would lead to a type 1 error. But only those tests are “powerful” whose power function is high for those parameter values for which it would be correct to go ahead. In our case, the power function must be below 0.05 when $\mu \le 0$, and we want it as high as possible when $\mu > 0$. Figure 1 shows the power function for the decision rule to go ahead whenever $\bar y \ge c$, where $c$ is chosen in such a way that the significance level is 5%, for $n = 100$.
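Since Figure 1 is so far only a placeholder, the power function described in the text can be computed directly. The following R sketch assumes, as in the text, $n = 100$ independent observations with variance 1 and a significance level of 5%; the function names power.fn and perverse.fn and the grid mu.grid are chosen here for illustration and do not come from the ecmet package.

```r
n <- 100
alpha <- 0.05
sd.mean <- 1 / sqrt(n)     # standard deviation of the sample mean (variance 1/n)

# Critical value c with Pr[ybar > c | mu = 0] = alpha.
crit <- qnorm(1 - alpha, mean = 0, sd = sd.mean)

# Power function: probability of going ahead (ybar >= c) if the true mean is mu.
power.fn <- function(mu) pnorm(crit, mean = mu, sd = sd.mean, lower.tail = FALSE)

mu.grid <- seq(-0.3, 0.5, by = 0.1)
print(cbind(mu = mu.grid, power = power.fn(mu.grid)))

# The "perverse" rule: build if -0.06 <= ybar <= -0.05.
perverse.fn <- function(mu) {
  pnorm(-0.05, mean = mu, sd = sd.mean) - pnorm(-0.06, mean = mu, sd = sd.mean)
}
print(perverse.fn(c(-0.055, 0, 0.3)))
```

The first table shows a power which stays at or below 0.05 for $\mu \le 0$ and rises towards 1 for positive $\mu$. The second rule never advises to build with a probability of more than about 4%, whatever $\mu$ is, so it is safe against errors of type one but gives almost no chance of success.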
The hypothesis whose rejection, although it is true, constitutes an error of type one, is called the null hypothesis, and its alternative the alternative hypothesis. (In the examples the null hypotheses were: the return on the investment is zero or negative, the defendant is innocent, or the results about which one wants to publish a research paper are wrong.) The null hypothesis is therefore the hypothesis that nothing is the case. The test examines whether this hypothesis should be rejected; the null hypothesis is the hypothesis one wants to reject but is afraid to reject erroneously, and the test safeguards against such an erroneous rejection. If you reject the null hypothesis, you don’t want to regret it.

Mathematically, every test can be identified with its null hypothesis, which is a region in parameter space (often consisting of one point only), and its “critical region,” which is the event that the test comes out in favor of the “critical decision,” i.e., rejects the null hypothesis. The critical region is usually an event of the form that the value of a certain random variable, the “test statistic,” is within a given range, usually that it is too high. The power function of the test is the probability of the critical region as a function of the unknown parameters, and the significance level is the maximum (or, if this maximum depends on unknown parameters, any upper bound) of the power function over the null hypothesis.

Problem 186. Mr. Jones is on trial for counterfeiting Picasso paintings, and you are an expert witness who has developed fool-proof statistical significance tests for identifying the painter of a given painting.

• a. 2 points There are two ways you can set up your test.
a: You can either say: The null hypothesis is that the painting was done by Picasso, and the alternative hypothesis that it was done by Mr. Jones.
b: Alternatively, you might say: The null hypothesis is that the painting was done by Mr.
Jones, and the alternative hypothesis that it was done by Picasso.
Does it matter which way you do the test, and if so, which way is the correct one? Give a reason for your answer, i.e., say what would be the consequences of testing in the incorrect way.

Answer. The determination of what the null and what the alternative hypothesis is depends on what is considered to be the catastrophic error which is to be guarded against. In a trial, Mr. Jones is considered innocent until proven guilty. Mr. Jones should not be convicted unless he can be proven guilty beyond “reasonable doubt.” Therefore the test must be set up in such a way that the hypothesis that the painting is by Picasso will only be rejected if the chance that it is actually by Picasso is very small. The error of type one is that the painting is considered counterfeited although it is really by Picasso. Since the error of type one is always the error to reject the null hypothesis although it is true, solution a. is the correct one. You are not proving, you are testing.

• b. 2 points After the trial a customer calls you who is in the process of acquiring a very expensive alleged Picasso painting, and who wants to be sure that this painting is not one of Jones’s falsifications. Would you now set up your test in the same way as in the trial or in the opposite way?

Answer. It is worse to spend money on a counterfeit painting than to forego purchasing a true Picasso. Therefore the null hypothesis would be that the painting was done by Mr. Jones, i.e., it is the opposite way.

12.1. Duality between Significance Tests and Confidence Regions

There is a duality between confidence regions with confidence level $1 - \alpha$ and certain significance tests. Let us look at a family of significance tests, which all have a significance level $\le \alpha$, and which define for every possible value of the parameter $\phi_0 \in \Omega$ a critical region $C(\phi_0)$ for rejecting the simple null hypothesis that the true parameter is equal to $\phi_0$.
The condition that all significance levels are $\le \alpha$ means mathematically

(12.1.1)  $\Pr[C(\phi_0) \mid \phi = \phi_0] \le \alpha$ for all $\phi_0 \in \Omega$.

Mathematically, confidence regions and such families of tests are one and the same thing: if one has a confidence region $R(y)$, one can define a test of the null hypothesis $\phi = \phi_0$ as follows: for an observed outcome $y$ reject the null hypothesis if and only if $\phi_0$ is not contained in $R(y)$. On the other hand, given a family of tests, one can build a confidence region by the prescription: $R(y)$ is the set of all those parameter values which would not be rejected by a test based on observation $y$.

Problem 187. Show that with these definitions, equations (11.0.5) and (12.1.1) are equivalent.

Answer. Since $\phi_0 \in R(y)$ iff $y \notin C(\phi_0)$ (i.e., $y$ lies in the complement of the critical region rejecting that the parameter value is $\phi_0$), it follows $\Pr[R(y) \ni \phi_0 \mid \phi = \phi_0] = 1 - \Pr[C(\phi_0) \mid \phi = \phi_0] \ge 1 - \alpha$.

This duality is discussed in [BD77, pp. 177–182].

12.2. The Neyman Pearson Lemma and Likelihood Ratio Tests

Look one more time at the example with the fertilizer. Why are we considering only regions of the form $\bar y \ge \mu_0$, why not one of the form $\mu_1 \le \bar y \le \mu_2$, or maybe not use the mean but decide to build if $y_1 \ge \mu_3$? Here the $\mu_1$, $\mu_2$, and $\mu_3$ can be chosen such that the probability of committing an error of type one is still $\alpha$.

It seems intuitively clear that these alternative decision rules are not reasonable. The Neyman Pearson lemma proves this intuition right. It says that the critical regions of the form $\bar y \ge \mu_0$ are uniformly most powerful, in the sense that every other critical region with the same probability of type one error has equal or higher probability of committing an error of type two, regardless of the true value of $\mu$.

Here are formulation and proof of the Neyman Pearson lemma, first for the case that both null hypothesis and alternative hypothesis are simple: $H_0: \theta = \theta_0$, $H_A: \theta = \theta_1$.
In other words, we want to determine on the basis of the observations of the random variables $y_1, \ldots, y_n$ whether the true $\theta$ was $\theta_0$ or $\theta_1$, and a determination $\theta = \theta_1$ when in fact $\theta = \theta_0$ is an error of type one. The critical region $C$ is the set of all outcomes that lead us to conclude that the parameter has value $\theta_1$.

The Neyman Pearson lemma says that a uniformly most powerful test exists in this situation. It is a so-called likelihood-ratio test, which has the following critical region:

(12.2.1)  $C = \{y_1, \ldots, y_n : L(y_1, \ldots, y_n; \theta_1) \ge k\,L(y_1, \ldots, y_n; \theta_0)\}.$

$C$ consists of those outcomes for which $\theta_1$ is at least $k$ times as likely as $\theta_0$ (where $k$ is chosen such that $\Pr[C \mid \theta_0] = \alpha$).

To prove that this decision rule is uniformly most powerful, assume $D$ is the critical region of a different test with the same significance level $\alpha$, i.e., if the null hypothesis is correct, then $C$ and $D$ reject (and therefore commit an error of type one) with equally low probabilities $\alpha$. In formulas, $\Pr[C \mid \theta_0] = \Pr[D \mid \theta_0] = \alpha$. Look at figure 2 with $C = U \cup V$ and $D = V \cup W$, i.e., $V$ is the part which $C$ and $D$ have in common. Since $C$ and $D$ have the same significance level, it follows

(12.2.2)  $\Pr[U \mid \theta_0] = \Pr[W \mid \theta_0].$

Also

(12.2.3)  $\Pr[U \mid \theta_1] \ge k \Pr[U \mid \theta_0],$

since $U \subset C$ and $C$ was chosen such that the likelihood (density) function of the alternative hypothesis is high relative to that of the null hypothesis. Since $W$ lies outside $C$, the same argument gives

(12.2.4)  $\Pr[W \mid \theta_1] \le k \Pr[W \mid \theta_0].$

[Figure 2. Venn Diagram for Proof of Neyman Pearson Lemma]

Linking those two inequalities and the equality gives

(12.2.5)  $\Pr[W \mid \theta_1] \le k \Pr[W \mid \theta_0] = k \Pr[U \mid \theta_0] \le \Pr[U \mid \theta_1],$

hence, adding $\Pr[V \mid \theta_1]$ on both sides, $\Pr[D \mid \theta_1] \le \Pr[C \mid \theta_1]$. In other words, if $\theta_1$ is the correct parameter value, then $C$ will discover this and reject at least as often as $D$. Therefore $C$ is at least as powerful as $D$, or the type two error probability of $C$ is at least as small as that of $D$.

Back to our fertilizer example.
To make both null and alternative hypotheses simple, assume that either $\mu=0$ (fertilizer is ineffective) or $\mu=t$ for some fixed $t>0$. Then the likelihood ratio critical region has the form

$$C=\Bigl\{y_1,\dots,y_n:\Bigl(\tfrac{1}{\sqrt{2\pi}}\Bigr)^{n}e^{-\frac12((y_1-t)^2+\cdots+(y_n-t)^2)}\ge k\Bigl(\tfrac{1}{\sqrt{2\pi}}\Bigr)^{n}e^{-\frac12(y_1^2+\cdots+y_n^2)}\Bigr\} \tag{12.2.6}$$

$$=\Bigl\{y_1,\dots,y_n:-\tfrac12\bigl((y_1-t)^2+\cdots+(y_n-t)^2\bigr)\ge\ln k-\tfrac12\bigl(y_1^2+\cdots+y_n^2\bigr)\Bigr\} \tag{12.2.7}$$

$$=\Bigl\{y_1,\dots,y_n:t(y_1+\cdots+y_n)-\frac{t^2n}{2}\ge\ln k\Bigr\} \tag{12.2.8}$$

$$=\Bigl\{y_1,\dots,y_n:\bar y\ge\frac{\ln k}{nt}+\frac t2\Bigr\} \tag{12.2.9}$$

i.e., $C$ has the form $\bar y\ge$ some constant. The dependence of this constant on $k$ is not relevant, since this constant is usually chosen such that the maximum probability of error of type one is equal to the given significance level.

Problem 188. 8 points You have four independent observations $y_1,\dots,y_4$ from an $N(\mu,1)$, and you are testing the null hypothesis $\mu=0$ against the alternative hypothesis $\mu=1$. For your test you are using the likelihood ratio test with critical region

$$C=\{y_1,\dots,y_4:L(y_1,\dots,y_4;\mu=1)\ge3.633\cdot L(y_1,\dots,y_4;\mu=0)\}. \tag{12.2.10}$$

Compute the significance level of this test. (According to the Neyman-Pearson lemma, this is the uniformly most powerful test for this significance level.) Hints: In order to show this you need to know that $\ln3.633=1.29$; everything else can be done without a calculator. Along the way you may want to show that $C$ can also be written in the form $C=\{y_1,\dots,y_4:y_1+\cdots+y_4\ge3.290\}$.

Answer. Here is the equation which determines when $y_1,\dots,y_4$ lie in $C$:

$$(2\pi)^{-2}\exp\Bigl(-\tfrac12\bigl((y_1-1)^2+\cdots+(y_4-1)^2\bigr)\Bigr)\ge3.633\cdot(2\pi)^{-2}\exp\Bigl(-\tfrac12\bigl(y_1^2+\cdots+y_4^2\bigr)\Bigr) \tag{12.2.11}$$

$$-\tfrac12\bigl((y_1-1)^2+\cdots+(y_4-1)^2\bigr)\ge\ln(3.633)-\tfrac12\bigl(y_1^2+\cdots+y_4^2\bigr) \tag{12.2.12}$$

$$y_1+\cdots+y_4-2\ge1.290 \tag{12.2.13}$$

Since $\Pr[y_1+\cdots+y_4\ge3.290]=\Pr[z=(y_1+\cdots+y_4)/2\ge1.645]$ and $z$ is a standard normal, one obtains the significance level of 5% from the standard normal table or the t-table.

Note that due to the properties of the Normal distribution, this critical region, for a given significance level, does not depend at all on the value of $t$. Therefore this test is uniformly most powerful against the composite hypothesis $\mu>0$. One can also write the null hypothesis as the composite hypothesis $\mu\le0$, because the highest probability of type one error will still be attained when $\mu=0$. This completes the proof that the test given in the original fertilizer example is uniformly most powerful.

Most other distributions discussed here are equally well behaved, therefore uniformly most powerful one-sided tests exist not only for the mean of a normal with known variance, but also for the variance of a normal with known mean, or the parameters of a Bernoulli and Poisson distribution.

However the given one-sided hypothesis is the only situation in which a uniformly most powerful test exists. In other situations, the generalized likelihood ratio test has good properties even though it is no longer uniformly most powerful. Many known tests (e.g., the F test) are generalized likelihood ratio tests. Assume you want to test the composite null hypothesis $H_0:\theta\in\omega$, where $\omega$ is a subset of the parameter space, against the alternative $H_A:\theta\in\Omega$, where $\Omega\supset\omega$ is a more comprehensive subset of the parameter space. $\omega$ and $\Omega$ are defined by functions with continuous first-order derivatives. The generalized likelihood ratio critical region has the form

$$C=\Bigl\{x_1,\dots,x_n:\frac{\sup_{\theta\in\Omega}L(x_1,\dots,x_n;\theta)}{\sup_{\theta\in\omega}L(x_1,\dots,x_n;\theta)}\ge k\Bigr\} \tag{12.2.14}$$

where $k$ is chosen such that the probability of the critical region when the null hypothesis is true has as its maximum the desired significance level. It can be shown that twice the log of this quotient is asymptotically distributed as a $\chi^2_{q-s}$, where $q$ is the dimension of $\Omega$ and $s$ the dimension of $\omega$. (Sometimes the likelihood ratio is defined as the inverse of this ratio, but whenever possible we will define our test statistics so that the null hypothesis is rejected if the value of the test statistic is too large.)

In order to perform a likelihood ratio test, the following steps are necessary: First construct the MLE's for $\theta\in\Omega$ and $\theta\in\omega$, then take twice the difference of the attained levels of the log likelihood functions, and compare with the $\chi^2$ tables.

12.3. The Wald, Likelihood Ratio, and Lagrange Multiplier Tests

Let us start with the generalized Wald test. Assume $\tilde\theta$ is an asymptotically normal estimator of $\theta$, whose asymptotic distribution is $N(\theta,\Psi)$. Assume furthermore that $\hat\Psi$ is a consistent estimate of $\Psi$. Then the following statistic is called the generalized Wald statistic. It can be used for an asymptotic test of the hypothesis $h(\theta)=o$, where $h$ is a $q$-vector-valued differentiable function:

$$\text{G.W.}=h(\tilde\theta)^\top\Bigl\{\frac{\partial h}{\partial\theta^\top}\Big|_{\tilde\theta}\,\hat\Psi\,\frac{\partial h^\top}{\partial\theta}\Big|_{\tilde\theta}\Bigr\}^{-1}h(\tilde\theta) \tag{12.3.1}$$

Under the null hypothesis, this test statistic is asymptotically distributed as a $\chi^2_q$. To understand this, note that for all $\theta$ close to $\tilde\theta$, $h(\theta)\approx h(\tilde\theta)+\frac{\partial h}{\partial\theta^\top}\big|_{\tilde\theta}(\theta-\tilde\theta)$. Taking covariances, $\frac{\partial h}{\partial\theta^\top}\big|_{\tilde\theta}\,\hat\Psi\,\frac{\partial h^\top}{\partial\theta}\big|_{\tilde\theta}$ is an estimate of the covariance matrix of $h(\tilde\theta)$. I.e., one takes $h(\tilde\theta)$ twice and "divides" it by its covariance matrix.

Now let us make more stringent assumptions. Assume the density $f_x(x;\theta)$ of $x$ depends on the parameter vector $\theta$.
We are assuming that the conditions are satisfied which ensure asymptotic normality of the maximum likelihood estimator $\hat\theta$ and also of $\bar\theta$, the constrained maximum likelihood estimator subject to the constraint $h(\theta)=o$.

There are three famous tests to test this hypothesis, which asymptotically are all distributed like $\chi^2_q$. The likelihood-ratio test is

$$\text{LRT}=-2\log\frac{\max_{h(\theta)=o}f_y(y;\theta)}{\max_{\theta}f_y(y;\theta)} \tag{12.3.2}$$

$$=2\bigl(\log f_y(y,\hat\theta)-\log f_y(y,\bar\theta)\bigr) \tag{12.3.3}$$

It rejects if imposing the constraint reduces the attained level of the likelihood function too much.

The Wald test has the form

$$\text{Wald}=-h(\hat\theta)^\top\Bigl\{\frac{\partial h}{\partial\theta^\top}\Big|_{\hat\theta}\Bigl(\frac{\partial^2\log f(y;\theta)}{\partial\theta\,\partial\theta^\top}\Big|_{\hat\theta}\Bigr)^{-1}\frac{\partial h^\top}{\partial\theta}\Big|_{\hat\theta}\Bigr\}^{-1}h(\hat\theta) \tag{12.3.4}$$

To understand this formula, note that $-\bigl(\operatorname{E}\bigl[\frac{\partial^2\log f(y;\theta)}{\partial\theta\,\partial\theta^\top}\bigr]\bigr)^{-1}$ is the Cramer Rao lower bound, and since all maximum likelihood estimators asymptotically attain the CRLB, it is the asymptotic covariance matrix of $\hat\theta$. If one does not take the expected value but plugs $\hat\theta$ into these partial derivatives of the log likelihood function, one gets a consistent estimate of the asymptotic covariance matrix. Therefore the Wald test is a special case of the generalized Wald test.

Finally the score test has the form

$$\text{Score}=-\frac{\partial\log f(y;\theta)}{\partial\theta^\top}\Big|_{\bar\theta}\Bigl(\frac{\partial^2\log f(y;\theta)}{\partial\theta\,\partial\theta^\top}\Big|_{\bar\theta}\Bigr)^{-1}\frac{\partial\log f(y;\theta)}{\partial\theta}\Big|_{\bar\theta} \tag{12.3.5}$$

This test tests whether the score, i.e., the gradient of the unconstrained log likelihood function, evaluated at the constrained maximum likelihood estimator, is too far away from zero. To understand this formula, remember that we showed in the proof of the Cramer-Rao lower bound that the negative of the expected value of the Hessian, $-\operatorname{E}\bigl[\frac{\partial^2\log f(y;\theta)}{\partial\theta\,\partial\theta^\top}\bigr]$, is the covariance matrix of the score, i.e., here we take the score twice and divide it by its estimated covariance matrix.

CHAPTER 13
General Principles of Econometric Modelling

[Gre97, 6.1 on p. 220] says: “An econometric study begins with a set of propositions about some aspect of the economy.
The theory specifies a set of precise, deterministic relationships among variables. Familiar examples are demand equations, production functions, and macroeconomic models. The empirical investigation provides estimates of unknown parameters in the model, such as elasticities or the marginal propensity to consume, and usually attempts to measure the validity of the theory against the behavior of the observable data.”

[Hen95, p. 6] distinguishes between two extremes: “‘Theory-driven’ approaches, in which the model is derived from a priori theory and calibrated from data evidence. They suffer from theory dependence in that their credibility depends on the credibility of the theory from which they arose – when that theory is discarded, so is the associated evidence.” The other extreme is “‘Data-driven’ approaches, where models are developed to closely describe the data . . . These suffer from sample dependence in that accidental and transient data features are embodied as tightly in the model as permanent aspects, so that extension of the data set often reveal predictive failure.”

Hendry proposes the following useful distinction of 4 levels of knowledge:

A Consider the situation where we know the complete structure of the process which generates economic data and the values of all its parameters. This is the equivalent of a probability theory course (example: rolling a perfect die), but involves economic theory and econometric concepts.

B Consider a known economic structure with unknown values of the parameters. Equivalent to an estimation and inference course in statistics (example: independent rolls of an imperfect die and estimating the probabilities of the different faces) but focusing on econometrically relevant aspects.

C is “the empirically relevant situation where neither the form of the data-generating process nor its parameter values are known.
(Here one does not know whether the rolls of the die are independent, or whether the probabilities of the different faces remain constant.) Model discovery, evaluation, data mining, model-search procedures, and associated methodological issues.

D Forecasting the future when the data outcomes are unknown. (Model of money demand under financial innovation).

The example of Keynes’s consumption function in [Gre97, pp. 221/22] sounds at the beginning as if it was close to B, but in the further discussion Greene goes more and more over to C. It is remarkable here that economic theory usually does not yield functional forms. Greene then says: the most common functional form is the linear one $c=\alpha+\beta x$ with $\alpha>0$ and $0<\beta<1$. He does not mention the aggregation problem hidden in this. Then he says: “But the linear function is only approximate; in fact, it is unlikely that consumption and income can be connected by any simple relationship. The deterministic relationship is clearly inadequate.” Here Greene uses a random relationship to model a relationship which is quantitatively “fuzzy.” This is an interesting and relevant application of randomness.

A sentence later Greene backtracks from this insight and says: “We are not so ambitious as to attempt to capture every influence in the relationship, but only those that are substantial enough to model directly.” The “fuzziness” is not due to a lack of ambition of the researcher, but the world is inherently quantitatively fuzzy. It is not that we don’t know the law, but there is no law; not everything that happens in an economy is driven by economic laws. Greene’s own example, in Figure 6.2, that during the war years consumption was below the trend line, shows this.

Greene’s next example is the relationship between income and education.
This illustrates multiple instead of simple regression: one must also include age, and then also the square of age, even if one is not interested in the effect which age has, but in order to “control” for this effect, so that the effects of education and age will not be confounded.

Problem 189. Why should a regression of income on education include not only age but also the square of age?

Answer. Because the effect of age becomes smaller with increases in age.

Critical Realist approaches are [Ron02] and [Mor02].

CHAPTER 14
Mean-Variance Analysis in the Linear Model

In the present chapter, the only distributional assumptions are that means and variances exist. (From this follows that also the covariances exist).

14.1. Three Versions of the Linear Model

As background reading please read [CD97, Chapter 1].

Following [JHG+88, Chapter 5], we will start with three different linear statistical models. Model 1 is the simplest estimation problem already familiar from chapter 9, with $n$ independent observations from the same distribution, call them $y_1,\dots,y_n$. The only thing known about the distribution is that mean and variance exist, call them $\mu$ and $\sigma^2$. In order to write this as a special case of the “linear model,” define $\varepsilon_i=y_i-\mu$, and define the vectors $y=\begin{bmatrix}y_1&y_2&\cdots&y_n\end{bmatrix}^\top$, $\varepsilon=\begin{bmatrix}\varepsilon_1&\varepsilon_2&\cdots&\varepsilon_n\end{bmatrix}^\top$, and $\iota=\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^\top$. Then one can write the model in the form

$$y=\iota\mu+\varepsilon,\qquad\varepsilon\sim(o,\sigma^2I) \tag{14.1.1}$$

The notation $\varepsilon\sim(o,\sigma^2I)$ is shorthand for $\operatorname{E}[\varepsilon]=o$ (the null vector) and $\operatorname{V}[\varepsilon]=\sigma^2I$ ($\sigma^2$ times the identity matrix, which has 1’s in the diagonal and 0’s elsewhere). $\mu$ is the deterministic part of all the $y_i$, and $\varepsilon_i$ is the random part.

Model 2 is “simple regression” in which the deterministic part $\mu$ is not constant but is a function of the nonrandom variable $x$. The assumption here is that this function is differentiable and can, in the range of the variation of the data, be approximated by a linear function [Tin51, pp. 19–20].
I.e., each element of $y$ is a constant $\alpha$ plus a constant multiple of the corresponding element of the nonrandom vector $x$ plus a random error term: $y_t=\alpha+x_t\beta+\varepsilon_t$, $t=1,\dots,n$. This can be written as

$$\begin{bmatrix}y_1\\\vdots\\y_n\end{bmatrix}=\begin{bmatrix}1\\\vdots\\1\end{bmatrix}\alpha+\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\beta+\begin{bmatrix}\varepsilon_1\\\vdots\\\varepsilon_n\end{bmatrix}=\begin{bmatrix}1&x_1\\\vdots&\vdots\\1&x_n\end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}+\begin{bmatrix}\varepsilon_1\\\vdots\\\varepsilon_n\end{bmatrix} \tag{14.1.2}$$

or

$$y=X\beta+\varepsilon,\qquad\varepsilon\sim(o,\sigma^2I) \tag{14.1.3}$$

Problem 190. 1 point Compute the matrix product

$$\begin{bmatrix}1&2&5\\0&3&1\end{bmatrix}\begin{bmatrix}4&0\\2&1\\3&8\end{bmatrix}$$

Answer.

$$\begin{bmatrix}1&2&5\\0&3&1\end{bmatrix}\begin{bmatrix}4&0\\2&1\\3&8\end{bmatrix}=\begin{bmatrix}1\cdot4+2\cdot2+5\cdot3&1\cdot0+2\cdot1+5\cdot8\\0\cdot4+3\cdot2+1\cdot3&0\cdot0+3\cdot1+1\cdot8\end{bmatrix}=\begin{bmatrix}23&42\\9&11\end{bmatrix}$$

If the systematic part of $y$ depends on more than one variable, then one needs multiple regression, model 3. Mathematically, multiple regression has the same form (14.1.3), but this time $X$ is arbitrary (except for the restriction that all its columns are linearly independent). Model 3 has Models 1 and 2 as special cases.

Multiple regression is also used to “correct for” disturbing influences. Let me explain. A functional relationship, which makes the systematic part of $y$ dependent on some other variable $x$, will usually only hold if other relevant influences are kept constant. If those other influences vary, then they may affect the form of this functional relation. For instance, the marginal propensity to consume may be affected by the interest rate, or the unemployment rate. This is why some econometricians (Hendry) advocate that one should start with an “encompassing” model with many explanatory variables and then narrow the specification down by hypothesis tests. Milton Friedman, by contrast, is very suspicious about multiple regressions, and argues in [FS91, pp. 48/9] against the encompassing approach.

Friedman does not give a theoretical argument but argues by an example from Chemistry.
Perhaps one can say that the variations in the other influences may have more serious implications than just modifying the form of the functional relation: they may destroy this functional relation altogether, i.e., prevent any systematic or predictable behavior.

            observed    unobserved
random      $y$         $\varepsilon$
nonrandom   $X$         $\beta$, $\sigma^2$

14.2. Ordinary Least Squares

In the model $y=X\beta+\varepsilon$, where $\varepsilon\sim(o,\sigma^2I)$, the OLS-estimate $\hat\beta$ is defined to be that value $\beta=\hat\beta$ which minimizes

$$SSE=(y-X\beta)^\top(y-X\beta)=y^\top y-2y^\top X\beta+\beta^\top X^\top X\beta. \tag{14.2.1}$$

Problem 156 shows that in model 1, this principle yields the arithmetic mean.

Problem 191. 2 points Prove that, if one predicts a random variable $y$ by a constant $a$, the constant which gives the best MSE is $a=\operatorname{E}[y]$, and the best MSE one can get is $\operatorname{var}[y]$.

Answer. $\operatorname{E}[(y-a)^2]=\operatorname{E}[y^2]-2a\operatorname{E}[y]+a^2$. Differentiate with respect to $a$ and set zero to get $a=\operatorname{E}[y]$. One can also differentiate first and then take expected value: $\operatorname{E}[2(y-a)]=0$. Plugging $a=\operatorname{E}[y]$ back in gives the MSE $\operatorname{E}[(y-\operatorname{E}[y])^2]=\operatorname{var}[y]$.

We will solve this minimization problem using the first-order conditions in vector notation. As a preparation, you should read the beginning of Appendix C about matrix differentiation and the connection between matrix differentiation and the Jacobian matrix of a vector function. All you need at this point is the two equations (C.1.6) and (C.1.7). The chain rule (C.1.23) is enlightening but not strictly necessary for the present derivation.

The matrix differentiation rules (C.1.6) and (C.1.7) allow us to differentiate (14.2.1) to get

$$\partial SSE/\partial\beta^\top=-2y^\top X+2\beta^\top X^\top X. \tag{14.2.2}$$

Transpose it (because it is notationally simpler to have a relationship between column vectors), set it zero while at the same time replacing $\beta$ by $\hat\beta$, and divide by 2, to get the “normal equation”

$$X^\top y=X^\top X\hat\beta. \tag{14.2.3}$$

Due to our assumption that all columns of $X$ are linearly independent, $X^\top X$ has an inverse and one can premultiply both sides of (14.2.3) by $(X^\top X)^{-1}$:

$$\hat\beta=(X^\top X)^{-1}X^\top y. \tag{14.2.4}$$

If the columns of $X$ are not linearly independent, then (14.2.3) has more than one solution, and the normal equation is also in this case a necessary and sufficient condition for $\hat\beta$ to minimize the SSE (proof in Problem 194).

Problem 192. 4 points Using the matrix differentiation rules

$$\partial w^\top x/\partial x^\top=w^\top \tag{14.2.5}$$

$$\partial x^\top Mx/\partial x^\top=2x^\top M \tag{14.2.6}$$

for symmetric $M$, compute the least-squares estimate $\hat\beta$ which minimizes

$$SSE=(y-X\beta)^\top(y-X\beta) \tag{14.2.7}$$

You are allowed to assume that $X^\top X$ has an inverse.

Answer. First you have to multiply out

$$(y-X\beta)^\top(y-X\beta)=y^\top y-2y^\top X\beta+\beta^\top X^\top X\beta. \tag{14.2.8}$$

The matrix differentiation rules (14.2.5) and (14.2.6) allow us to differentiate (14.2.8) to get

$$\partial SSE/\partial\beta^\top=-2y^\top X+2\beta^\top X^\top X. \tag{14.2.9}$$

Transpose it (because it is notationally simpler to have a relationship between column vectors), set it zero while at the same time replacing $\beta$ by $\hat\beta$, and divide by 2, to get the “normal equation”

$$X^\top y=X^\top X\hat\beta. \tag{14.2.10}$$

Since $X^\top X$ has an inverse, one can premultiply both sides of (14.2.10) by $(X^\top X)^{-1}$:

$$\hat\beta=(X^\top X)^{-1}X^\top y. \tag{14.2.11}$$

Problem 193. 2 points Show the following: if the columns of $X$ are linearly independent, then $X^\top X$ has an inverse. ($X$ itself is not necessarily square.) In your proof you may use the following criteria: the columns of $X$ are linearly independent (this is also called: $X$ has full column rank) if and only if $Xa=o$ implies $a=o$. And a square matrix has an inverse if and only if its columns are linearly independent.

Answer. We have to show that any $a$ which satisfies $X^\top Xa=o$ is itself the null vector. From $X^\top Xa=o$ follows $a^\top X^\top Xa=0$, which can also be written $\|Xa\|^2=0$. Therefore $Xa=o$, and since the columns of $X$ are linearly independent, this implies $a=o$.

Problem 194. 3 points In this Problem we do not assume that $X$ has full column rank, it may be arbitrary.

• a. The normal equation (14.2.3) has always at least one solution.
Hint: you are allowed to use, without proof, equation (A.3.3) in the mathematical appendix.

Answer. With this hint it is easy: $\hat\beta=(X^\top X)^{-}X^\top y$ is a solution.

• b. If $\hat\beta$ satisfies the normal equation and $\beta$ is an arbitrary vector, then

$$(y-X\beta)^\top(y-X\beta)=(y-X\hat\beta)^\top(y-X\hat\beta)+(\beta-\hat\beta)^\top X^\top X(\beta-\hat\beta). \tag{14.2.12}$$

Answer. This is true even if $X$ has deficient rank, and it will be shown here in this general case. To prove (14.2.12), write (14.2.1) as $SSE=\bigl((y-X\hat\beta)-X(\beta-\hat\beta)\bigr)^\top\bigl((y-X\hat\beta)-X(\beta-\hat\beta)\bigr)$; since $\hat\beta$ satisfies (14.2.3), the cross product terms disappear.

• c. Conclude from this that the normal equation is a necessary and sufficient condition characterizing the values $\hat\beta$ minimizing the sum of squared errors (14.2.12).

Answer. (14.2.12) shows that the normal equations are sufficient. For necessity of the normal equations let $\hat\beta$ be an arbitrary solution of the normal equation; we have seen that there is always at least one. Given $\hat\beta$, it follows from (14.2.12) that for any solution $\beta^*$ of the minimization the last term of (14.2.12) must vanish, i.e., $X(\beta^*-\hat\beta)=o$, and therefore $X^\top X(\beta^*-\hat\beta)=o$. Use (14.2.3) to replace $(X^\top X)\hat\beta$ by $X^\top y$ to get $X^\top X\beta^*=X^\top y$.

It is customary to use the notation $X\hat\beta=\hat y$ for the so-called fitted values, which are the estimates of the vector of means $\eta=X\beta$. Geometrically, $\hat y$ is the orthogonal projection of $y$ on the space spanned by the columns of $X$. See Theorem A.6.1 about projection matrices.

The vector of differences between the actual and the fitted values is called the vector of “residuals” $\hat\varepsilon=y-\hat y$. The residuals are “predictors” of the actual (but unobserved) values of the disturbance vector $\varepsilon$. An estimator of a random magnitude is usually called a “predictor,” but in the linear model estimation and prediction are treated on the same footing, therefore it is not necessary to distinguish between the two.
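The normal equation, the fitted values, the residuals, and identity (14.2.12) can be checked numerically. The following is only a minimal sketch in Python with NumPy; the simulated data, the seed, the "true" coefficients, and all variable names are assumptions made for illustration and do not come from these notes.

```python
import numpy as np

# Simulated data; sample size, seed and coefficients are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])        # "true" parameters, unknown in practice
eps = rng.normal(scale=0.3, size=n)      # disturbances, unobserved in practice
y = X @ beta + eps

# Normal equation (14.2.3): X'X beta_hat = X'y, solved without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                     # fitted values
resid = y - y_hat                        # residuals, the "predictors" of eps


def sse(b):
    """Sum of squared errors (14.2.1) at an arbitrary coefficient vector b."""
    e = y - X @ b
    return e @ e


# Identity (14.2.12): SSE(b) = SSE(beta_hat) + (b - beta_hat)' X'X (b - beta_hat).
b = rng.normal(size=3)
d = b - beta_hat
gap = sse(b) - (sse(beta_hat) + d @ X.T @ X @ d)

print("beta_hat:", beta_hat)
print("X'resid (should be close to 0):", X.T @ resid)
print("gap in (14.2.12) (should be close to 0):", gap)
```

In this sketch the residuals `resid` are computable from the data, whereas the disturbances `eps` are available only because the data were simulated; the two vectors differ, which is the distinction drawn in the text.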
You should understand the difference between disturbances and residuals, and between the two decompositions

$$y=X\beta+\varepsilon=X\hat\beta+\hat\varepsilon \tag{14.2.13}$$

Problem 195. 2 points Assume that $X$ has full column rank. Show that $\hat\varepsilon=My$ where $M=I-X(X^\top X)^{-1}X^\top$. Show that $M$ is symmetric and idempotent.

Answer. By definition, $\hat\varepsilon=y-X\hat\beta=y-X(X^\top X)^{-1}X^\top y=\bigl(I-X(X^\top X)^{-1}X^\top\bigr)y$. Symmetry follows because $X(X^\top X)^{-1}X^\top$ is equal to its own transpose. Idempotent, i.e. $MM=M$:

$$MM=\bigl(I-X(X^\top X)^{-1}X^\top\bigr)\bigl(I-X(X^\top X)^{-1}X^\top\bigr)=I-X(X^\top X)^{-1}X^\top-X(X^\top X)^{-1}X^\top+X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top=I-X(X^\top X)^{-1}X^\top=M \tag{14.2.14}$$

Problem 196. Assume $X$ has full column rank. Define $M=I-X(X^\top X)^{-1}X^\top$.

• a. 1 point Show that the space $M$ projects on is the space orthogonal to all columns in $X$, i.e., $Mq=q$ if and only if $X^\top q=o$.

Answer. $X^\top q=o$ clearly implies $Mq=q$. Conversely, $Mq=q$ implies $X(X^\top X)^{-1}X^\top q=o$. Premultiply this by $X^\top$ to get $X^\top q=o$.

• b. 1 point Show that a vector $q$ lies in the range space of $X$, i.e., the space spanned by the columns of $X$, if and only if $Mq=o$. In other words, $\{q:q=Xa\text{ for some }a\}=\{q:Mq=o\}$.

Answer. First assume $Mq=o$. This means $q=X(X^\top X)^{-1}X^\top q=Xa$ with $a=(X^\top X)^{-1}X^\top q$. Conversely, if $q=Xa$ then $Mq=MXa=Oa=o$.

Problem 197. In 2-dimensional space, write down the projection matrix on the diagonal line $y=x$ (call it $E$), and compute $Ez$ for the three vectors $a=\begin{bmatrix}2\\1\end{bmatrix}$, $b=\begin{bmatrix}2\\2\end{bmatrix}$, and $c=\begin{bmatrix}3\\2\end{bmatrix}$. Draw these vectors and their projections.

Assume we have a dependent variable $y$ and two regressors $x_1$ and $x_2$, each with 15 observations. Then one can visualize the data either as 15 points in 3-dimensional space (a 3-dimensional scatter plot), or 3 points in 15-dimensional space. In the first case, each point corresponds to an observation, in the second case, each point corresponds to a variable. In this latter case the points are usually represented as vectors. You only have 3 vectors, but each of these vectors is a vector in 15-dimensional space.
But you do not have to draw a 15-dimensional space to draw these vectors; these 3 vectors span a 3-dimensional subspace, and $\hat y$ is the projection of the vector $y$ on the space spanned by the two regressors not only in the original 15-dimensional space, but already in this 3-dimensional subspace. In other words, [DM93, Figure 1.3] is valid in all dimensions! In the 15-dimensional space, each dimension represents one observation. In the 3-dimensional subspace, this is no longer true.

Problem 198. “Simple regression” is regression with an intercept and one explanatory variable only, i.e.,

$$y_t=\alpha+\beta x_t+\varepsilon_t \tag{14.2.15}$$

Here $X=\begin{bmatrix}\iota&x\end{bmatrix}$ and $\boldsymbol\beta=\begin{bmatrix}\alpha&\beta\end{bmatrix}^\top$. Evaluate (14.2.4) to get the following formulas for $\hat{\boldsymbol\beta}=\begin{bmatrix}\hat\alpha&\hat\beta\end{bmatrix}^\top$:

$$\hat\alpha=\frac{\sum x_t^2\sum y_t-\sum x_t\sum x_ty_t}{n\sum x_t^2-(\sum x_t)^2} \tag{14.2.16}$$

$$\hat\beta=\frac{n\sum x_ty_t-\sum x_t\sum y_t}{n\sum x_t^2-(\sum x_t)^2} \tag{14.2.17}$$

Answer.

$$X^\top X=\begin{bmatrix}\iota^\top\\x^\top\end{bmatrix}\begin{bmatrix}\iota&x\end{bmatrix}=\begin{bmatrix}\iota^\top\iota&\iota^\top x\\x^\top\iota&x^\top x\end{bmatrix}=\begin{bmatrix}n&\sum x_t\\\sum x_t&\sum x_t^2\end{bmatrix} \tag{14.2.18}$$

$$(X^\top X)^{-1}=\frac{1}{n\sum x_t^2-(\sum x_t)^2}\begin{bmatrix}\sum x_t^2&-\sum x_t\\-\sum x_t&n\end{bmatrix} \tag{14.2.19}$$

$$X^\top y=\begin{bmatrix}\iota^\top y\\x^\top y\end{bmatrix}=\begin{bmatrix}\sum y_t\\\sum x_ty_t\end{bmatrix} \tag{14.2.20}$$

Therefore $(X^\top X)^{-1}X^\top y$ gives equations (14.2.16) and (14.2.17).

Problem 199. Show that

$$\sum_{t=1}^n(x_t-\bar x)(y_t-\bar y)=\sum_{t=1}^nx_ty_t-n\bar x\bar y \tag{14.2.21}$$

(Note, as explained in [DM93, pp. 27/8] or [Gre97, Section 5.4.1], that the left hand side is computationally much more stable than the right.)

Answer. Simply multiply out.

Problem 200. Show that (14.2.17) and (14.2.16) can also be written as follows:

$$\hat\beta=\frac{\sum(x_t-\bar x)(y_t-\bar y)}{\sum(x_t-\bar x)^2} \tag{14.2.22}$$

$$\hat\alpha=\bar y-\hat\beta\bar x \tag{14.2.23}$$

Answer. Using $\sum x_i=n\bar x$ and $\sum y_i=n\bar y$ in (14.2.17), it can be written as

$$\hat\beta=\frac{\sum x_ty_t-n\bar x\bar y}{\sum x_t^2-n\bar x^2} \tag{14.2.24}$$

Now apply Problem 199 to the numerator of (14.2.24), and Problem 199 with $y=x$ to the denominator, to get (14.2.22).

To prove equation (14.2.23) for $\hat\alpha$, let us work backwards and plug (14.2.24) into the righthand side of (14.2.23):

$$\bar y-\bar x\hat\beta=\frac{\bar y\sum x_t^2-\bar yn\bar x^2-\bar x\sum x_ty_t+n\bar x\bar x\bar y}{\sum x_t^2-n\bar x^2} \tag{14.2.25}$$

The second and the fourth term in the numerator cancel out, and what remains can be shown to be equal to (14.2.16).

Problem 201. 3 points Show that in the simple regression model, the fitted regression line can be written in the form

$$\hat y_t=\bar y+\hat\beta(x_t-\bar x). \tag{14.2.26}$$

From this follows in particular that the fitted regression line always goes through the point $(\bar x,\bar y)$.

Answer. Follows immediately if one plugs (14.2.23) into the defining equation $\hat y_t=\hat\alpha+\hat\beta x_t$.

Formulas (14.2.22) and (14.2.23) are interesting because they express the regression coefficients in terms of the sample means and covariances. Problem 202 derives the properties of the population equivalents of these formulas:

Problem 202. Given two random variables $x$ and $y$ with finite variances, and $\operatorname{var}[x]>0$. You know the expected values, variances and covariance of $x$ and $y$, and you observe $x$, but $y$ is unobserved. This question explores the properties of the Best Linear Unbiased Predictor (BLUP) of $y$ in this situation.

• a. 4 points Give a direct proof of the following, which is a special case of theorem 20.1.1: If you want to predict $y$ by an affine expression of the form $a+bx$, you will get the lowest mean squared error MSE with $b=\operatorname{cov}[x,y]/\operatorname{var}[x]$ and $a=\operatorname{E}[y]-b\operatorname{E}[x]$.

Answer. The MSE is variance plus squared bias (see e.g. problem 165), therefore

$$\operatorname{MSE}[a+bx;y]=\operatorname{var}[a+bx-y]+(\operatorname{E}[a+bx-y])^2=\operatorname{var}[bx-y]+(a-\operatorname{E}[y]+b\operatorname{E}[x])^2. \tag{14.2.27}$$

Therefore we choose $a$ so that the second term is zero, and then you only have to minimize the first term with respect to $b$. Since

$$\operatorname{var}[bx-y]=b^2\operatorname{var}[x]-2b\operatorname{cov}[x,y]+\operatorname{var}[y] \tag{14.2.28}$$

the first order condition is

$$2b\operatorname{var}[x]-2\operatorname{cov}[x,y]=0 \tag{14.2.29}$$

• b. 2 points For the first-order conditions you needed the partial derivatives $\frac{\partial}{\partial a}\operatorname{E}[(y-a-bx)^2]$ and $\frac{\partial}{\partial b}\operatorname{E}[(y-a-bx)^2]$.
It is also possible, and probably shorter, to interchange taking expected value and partial derivative, i.e., to compute $\operatorname{E}\bigl[\frac{\partial}{\partial a}(y-a-bx)^2\bigr]$ and $\operatorname{E}\bigl[\frac{\partial}{\partial b}(y-a-bx)^2\bigr]$ and set those zero. Do the above proof in this alternative fashion.

Answer. $\operatorname{E}\bigl[\frac{\partial}{\partial a}(y-a-bx)^2\bigr]=-2\operatorname{E}[y-a-bx]=-2(\operatorname{E}[y]-a-b\operatorname{E}[x])$. Setting this zero gives the formula for $a$. Now $\operatorname{E}\bigl[\frac{\partial}{\partial b}(y-a-bx)^2\bigr]=-2\operatorname{E}[x(y-a-bx)]=-2(\operatorname{E}[xy]-a\operatorname{E}[x]-b\operatorname{E}[x^2])$. Setting this zero gives $\operatorname{E}[xy]-a\operatorname{E}[x]-b\operatorname{E}[x^2]=0$. Plug in the formula for $a$ and solve for $b$:

$$b=\frac{\operatorname{E}[xy]-\operatorname{E}[x]\operatorname{E}[y]}{\operatorname{E}[x^2]-(\operatorname{E}[x])^2}=\frac{\operatorname{cov}[x,y]}{\operatorname{var}[x]}. \tag{14.2.30}$$

• c. 2 points Compute the MSE of this predictor.

Answer. If one plugs the optimal $a$ into (14.2.27), this just annuls the last term of (14.2.27) so that the MSE is given by (14.2.28). If one plugs the optimal $b=\operatorname{cov}[x,y]/\operatorname{var}[x]$ into (14.2.28), one gets

$$\operatorname{MSE}=\Bigl(\frac{\operatorname{cov}[x,y]}{\operatorname{var}[x]}\Bigr)^2\operatorname{var}[x]-2\frac{\operatorname{cov}[x,y]}{\operatorname{var}[x]}\operatorname{cov}[x,y]+\operatorname{var}[y] \tag{14.2.31}$$

$$=\operatorname{var}[y]-\frac{(\operatorname{cov}[x,y])^2}{\operatorname{var}[x]}. \tag{14.2.32}$$

• d. 2 points Show that the prediction error is uncorrelated with the observed $x$.

Answer.

$$\operatorname{cov}[x,y-a-bx]=\operatorname{cov}[x,y]-b\operatorname{cov}[x,x]=0 \tag{14.2.33}$$

because $b=\operatorname{cov}[x,y]/\operatorname{var}[x]$ and the constant $a$ does not contribute to the covariance.

• e. 4 points If $\operatorname{var}[x]=0$, the quotient $\operatorname{cov}[x,y]/\operatorname{var}[x]$ can no longer be formed, but if you replace the inverse by the g-inverse, so that the above formula becomes

$$b=\operatorname{cov}[x,y](\operatorname{var}[x])^{-} \tag{14.2.34}$$

then it always gives the minimum MSE predictor, whether or not $\operatorname{var}[x]=0$, and regardless of which g-inverse you use (in case there are more than one). To prove this, you need to answer the following four questions: (a) what is the BLUP if $\operatorname{var}[x]=0$? (b) what is the g-inverse of a nonzero scalar? (c) what is the g-inverse of the scalar number 0? (d) if $\operatorname{var}[x]=0$, what do we know about $\operatorname{cov}[x,y]$?

Answer. (a) If $\operatorname{var}[x]=0$ then $x=\mu$ almost surely, therefore the observation of $x$ does not give us any new information. The BLUP of $y$ is $\nu$ in this case, i.e., the above formula holds with $b=0$.
(b) The g-inverse of a nonzero scalar is simply its inverse. (c) Every scalar is a g-inverse of the scalar 0. (d) If $\operatorname{var}[x]=0$, then $\operatorname{cov}[x,y]=0$. Therefore pick a g-inverse of 0; an arbitrary number will do, call it $c$. Then formula (14.2.34) says $b=0\cdot c=0$.

Problem 203. 3 points Carefully state the specifications of the random variables involved in the linear regression model. How does the model in Problem 202 differ from the linear regression model? What do they have in common?

Answer. In the regression model, you have several observations, in the other model only one. In the regression model, the $x_i$ are nonrandom, only the $y_i$ are random, in the other model both $x$ and $y$ are random. In the regression model, the expected values of the $y_i$ are not fully known, in the other model the expected values of both $x$ and $y$ are fully known. Both models have in common that the second moments are known only up to an unknown factor. Both models have in common that only first and second moments need to be known, and that they restrict themselves to linear estimators, and that the criterion function is the MSE (the regression model minimaxes it, but the other model minimizes it since there is no unknown parameter whose value one has to minimax over. But this I cannot say right now, for this we need the Gauss-Markov theorem. Also the Gauss-Markov is valid in both cases!)

Problem 204. 2 points We are in the multiple regression model $y=X\beta+\varepsilon$ with intercept, i.e., $X$ is such that there is a vector $a$ with $\iota=Xa$. Define the row vector $\bar x^\top=\frac1n\iota^\top X$, i.e., it has as its $j$th component the sample mean of the $j$th independent variable. Using the normal equations $X^\top y=X^\top X\hat\beta$, show that $\bar y=\bar x^\top\hat\beta$ (i.e., the regression plane goes through the center of gravity of all data points).

Answer. Premultiply the normal equation by $a^\top$ to get $\iota^\top y-\iota^\top X\hat\beta=0$. Premultiply by $1/n$ to get the result.

Problem 205.
The fitted values $\hat y$ and the residuals $\hat\varepsilon$ are “orthogonal” in two different ways.

• a. 2 points Show that the inner product $\hat y^\top\hat\varepsilon=0$. Why should you expect this from the geometric intuition of Least Squares?

Answer. Use $\hat\varepsilon=My$ and $\hat y=(I-M)y$: $\hat y^\top\hat\varepsilon=y^\top(I-M)My=0$ because $M(I-M)=O$. This is a consequence of the more general result given in problem ??.

• b. 2 points Sometimes two random variables are called “orthogonal” if their covariance is zero. Show that $\hat y$ and $\hat\varepsilon$ are orthogonal also in this sense, i.e., show that for every $i$ and $j$, $\operatorname{cov}[\hat y_i,\hat\varepsilon_j]=0$. In matrix notation this can also be written $\operatorname{C}[\hat y,\hat\varepsilon]=O$.

Answer. $\operatorname{C}[\hat y,\hat\varepsilon]=\operatorname{C}[(I-M)y,My]=(I-M)\operatorname{V}[y]M^\top=(I-M)(\sigma^2I)M=\sigma^2(I-M)M=O$. This is a consequence of the more general result given in question 246.

14.3. The Coefficient of Determination

Among the criteria which are often used to judge whether the model is appropriate, we will look at the “coefficient of determination” $R^2$, the “adjusted” $\bar R^2$, and later also at Mallows' $C_p$ statistic. Mallows' $C_p$ comes later because it is not a final but an initial criterion, i.e., it does not measure the fit of the model to the given data, but it estimates its MSE. Let us first look at $R^2$.

A value of $R^2$ always is based (explicitly or implicitly) on a comparison of two models, usually nested in the sense that the model with fewer parameters can be viewed as a specialization of the model with more parameters. The value of $R^2$ is then 1 minus the ratio of the smaller to the larger sum of squared residuals.

Thus, there is no such thing as the $R^2$ from a single fitted model; one must always think about what model (perhaps an implicit “null” model) is held out as a standard of comparison. Once that is determined, the calculation is straightforward, based on the sums of squared residuals from the two models. This is particularly appropriate for nls(), which minimizes a sum of squares.
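The view of $R^2$ as a comparison of two nested models can be made concrete with a small computation. The sketch below uses Python with NumPy rather than the R function nls() mentioned above; the helper names `sse` and `r_squared`, the simulated data, and the two choices of null model are hypothetical and serve only as an illustration.

```python
import numpy as np


def sse(X, y):
    """Minimized sum of squared residuals from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    return float(e @ e)


def r_squared(X_big, X_small, y):
    """1 minus the ratio of the smaller SSE (bigger model) to the larger SSE (nested model)."""
    return 1.0 - sse(X_big, y) / sse(X_small, y)


# Simulated data; seed, sample size and coefficients are arbitrary assumptions.
rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=n)

iota = np.ones((n, 1))                       # null model: constant term only
X_full = np.column_stack([np.ones(n), x])    # model with constant and x

# Usual R^2: the implicit null model is "y equals a constant plus noise".
r2_vs_mean = r_squared(X_full, iota, y)

# A different null model ("y is pure noise around zero", no parameters at all)
# has SSE = y'y and leads to a different number for the same fitted model.
r2_uncentered = 1.0 - sse(X_full, y) / float(y @ y)

sst = float(np.sum((y - y.mean()) ** 2))     # SSE of the constant-only model
print(r2_vs_mean, r2_uncentered)
```

Both numbers describe the same fitted regression; they differ only in the model held out as the standard of comparison, which is the point made in the text.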
The treatment which follows here is a little more complete than most. Some textbooks, such as [DM93], never even give the leftmost term in formula (14.3.6) according to which $R^2$ is the squared sample correlation coefficient. Other textbooks, such as [JHG+88] and [Gre97], do give this formula, but it remains a surprise: there is no explanation why the same quantity $R^2$ can be expressed mathematically in two quite different ways, each of which has a different interpretation. The present treatment explains this.

If the regression has a constant term, then the OLS estimate $\hat\beta$ has a third optimality property (in addition to minimizing the SSE and being the BLUE): no other linear combination of the explanatory variables has a higher squared sample correlation with $y$ than $\hat y=X\hat\beta$.

In the proof of this optimality property we will use the symmetric and idempotent projection matrix $D=I-\frac1n\iota\iota^\top$. Applied to any vector $z$, $D$ gives $Dz=z-\iota\bar z$, which is $z$ with the mean taken out. Taking out the mean is therefore a projection, on the space orthogonal to $\iota$. See Problem 161.

Problem 206. In the reggeom visualization, see Problem 293, in which $x_1$ is the vector of ones, which are the vectors $Dx_2$ and $Dy$?

Answer. $Dx_2$ is og, the dark blue line starting at the origin, and $Dy$ is cy, the red line starting on $x_1$ and going up to the peak.

As an additional mathematical tool we will need the Cauchy-Schwarz inequality for the vector product:

$$(u^\top v)^2\le(u^\top u)(v^\top v) \tag{14.3.1}$$

Problem 207. If $Q$ is any nonnegative definite matrix, show that also

$$(u^\top Qv)^2\le(u^\top Qu)(v^\top Qv). \tag{14.3.2}$$

Answer. This follows from the fact that any nnd matrix $Q$ can be written in the form $Q=R^\top R$.

In order to prove that $\hat y$ has the highest squared sample correlation, take any vector $c$ and look at $\tilde y=Xc$. We will show that the sample correlation of $y$ with $\tilde y$ cannot be higher than that of $y$ with $\hat y$. For this let us first compute the sample covariance.
By (9.3.17), $n$ times the sample covariance between $\tilde{y}$ and $y$ is

(14.3.3)  $n \cdot \text{sample covariance}(\tilde{y}, y) = \tilde{y}^\top D y = c^\top X^\top D(\hat{y} + \hat{\varepsilon})$.

By Problem 208, $D\hat{\varepsilon} = \hat{\varepsilon}$, hence $X^\top D\hat{\varepsilon} = X^\top\hat{\varepsilon} = o$ (this last equality is equivalent to the Normal Equation (14.2.3)), therefore (14.3.3) becomes $\tilde{y}^\top D y = \tilde{y}^\top D\hat{y}$. Together with (14.3.2) this gives

(14.3.4)  $\bigl(n \cdot \text{sample covariance}(\tilde{y}, y)\bigr)^2 = (\tilde{y}^\top D\hat{y})^2 \le (\tilde{y}^\top D\tilde{y})(\hat{y}^\top D\hat{y})$.

In order to get from $n^2$ times the squared sample covariance to the squared sample correlation coefficient we have to divide it by $n^2$ times the sample variances of $\tilde{y}$ and of $y$:

(14.3.5)  $\bigl(\text{sample correlation}(\tilde{y}, y)\bigr)^2 = \dfrac{(\tilde{y}^\top D y)^2}{(\tilde{y}^\top D\tilde{y})(y^\top D y)} \le \dfrac{\hat{y}^\top D\hat{y}}{y^\top D y} = \dfrac{\sum_j (\hat{y}_j - \bar{\hat{y}})^2}{\sum_j (y_j - \bar{y})^2} = \dfrac{\sum_j (\hat{y}_j - \bar{y})^2}{\sum_j (y_j - \bar{y})^2}$.

For the rightmost equal sign in (14.3.5) we need Problem 209. If $\tilde{y} = \hat{y}$, inequality (14.3.4) becomes an equality, and therefore also (14.3.5) becomes an equality throughout. This completes the proof that $\hat{y}$ has the highest possible squared sample correlation with $y$, and gives at the same time two different formulas for the same entity

(14.3.6)  $R^2 = \dfrac{\bigl(\sum_j (\hat{y}_j - \bar{\hat{y}})(y_j - \bar{y})\bigr)^2}{\sum_j (\hat{y}_j - \bar{\hat{y}})^2 \sum_j (y_j - \bar{y})^2} = \dfrac{\sum_j (\hat{y}_j - \bar{y})^2}{\sum_j (y_j - \bar{y})^2}$.

Problem 208. 1 point Show that, if $X$ contains a constant term, then $D\hat{\varepsilon} = \hat{\varepsilon}$. You are allowed to use the fact that $X^\top\hat{\varepsilon} = o$, which is equivalent to the normal equation (14.2.3).

Answer. Since $X$ has a constant term, a vector $a$ exists such that $Xa = \iota$, therefore $\iota^\top\hat{\varepsilon} = a^\top X^\top\hat{\varepsilon} = a^\top o = 0$. From $\iota^\top\hat{\varepsilon} = 0$ follows $D\hat{\varepsilon} = \hat{\varepsilon}$.

Problem 209. 1 point Show that, if $X$ has a constant term, then $\bar{\hat{y}} = \bar{y}$.

Answer. Follows from $0 = \iota^\top\hat{\varepsilon} = \iota^\top y - \iota^\top\hat{y}$. In the visualization, this is equivalent to the fact that both ocb and ocy are right angles.

Problem 210. Instead of (14.3.6) one often sees the formula

(14.3.7)  $\dfrac{\bigl(\sum_j (\hat{y}_j - \bar{y})(y_j - \bar{y})\bigr)^2}{\sum_j (\hat{y}_j - \bar{y})^2 \sum_j (y_j - \bar{y})^2} = \dfrac{\sum_j (\hat{y}_j - \bar{y})^2}{\sum_j (y_j - \bar{y})^2}$.

Prove that they are equivalent. Which equation is better?
The denominator in the righthand side expression of (14.3.6), $\sum_j (y_j - \bar{y})^2$, is usually called “SST,” the total (corrected) sum of squares. The numerator $\sum_j (\hat{y}_j - \bar{y})^2$ is usually called “SSR,” the sum of squares “explained” by the regression. In order to understand SSR better, we will show next the famous “Analysis of Variance” identity SST = SSR + SSE.

Problem 211. In the reggeom visualization, again with $x_1$ representing the vector of ones, show that SST = SSR + SSE, and show that $R^2 = \cos^2\alpha$ where $\alpha$ is the angle between two lines in this visualization. Which lines?

Answer. $\hat{\varepsilon}$ is by, the green line going up to the peak, and SSE is its squared length. SST is the squared length of $y - \iota\bar{y}$. Since $\iota\bar{y}$ is the projection of $y$ on $x_1$, i.e., it is oc, the part of $x_1$ that is red, one sees that SST is the squared length of cy. SSR is the squared length of cb. The analysis of variance identity follows because cby is a right angle. $R^2 = \cos^2\alpha$ where $\alpha$ is the angle bcy in this same triangle.

Since the regression has a constant term, the decomposition

(14.3.8)  $y = (y - \hat{y}) + (\hat{y} - \iota\bar{y}) + \iota\bar{y}$

is an orthogonal decomposition (all three vectors on the righthand side are orthogonal to each other), therefore in particular

(14.3.9)  $(y - \hat{y})^\top(\hat{y} - \iota\bar{y}) = 0$.

Geometrically this follows from the fact that $y - \hat{y}$ is orthogonal to the column space of $X$, while $\hat{y} - \iota\bar{y}$ lies in that column space.

Problem 212. Show the decomposition (14.3.8) in the reggeom visualization.

Answer. From y take the green line down to b, then the light blue line to c, then the red line to the origin.

This orthogonality can also be explained in terms of sequential projections: instead of projecting $y$ on $x_1$ directly I can first project it on the plane spanned by $x_1$ and $x_2$, and then project this projection on $x_1$.
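As a numerical check of the orthogonal decomposition (14.3.8), the following R sketch computes the three component vectors for a small regression with a constant term and prints their pairwise inner products. The data are hypothetical and the variable names (part1, part2, part3) are chosen only for this illustration.

```r
# Hypothetical illustration data (not taken from these notes).
x2 <- c(1, 3, 2, 5, 4, 7, 6, 8)
y  <- c(1.2, 2.9, 2.7, 5.1, 3.8, 7.4, 5.9, 8.3)
n  <- length(y)

iota <- rep(1, n)
X    <- cbind(iota, x2)                            # regressors incl. constant

beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves the normal equations
y_hat    <- drop(X %*% beta_hat)
y_bar    <- mean(y)

part1 <- y - y_hat             # residual vector
part2 <- y_hat - iota * y_bar  # fitted values with the mean taken out
part3 <- iota * y_bar          # projection of y on the vector of ones

# Pairwise inner products; all three should vanish up to rounding error.
ips <- c(sum(part1 * part2), sum(part1 * part3), sum(part2 * part3))
print(ips)

# Squared lengths give the analysis of variance identity.
SSE <- sum(part1^2)
SSR <- sum(part2^2)
SST <- sum((y - y_bar)^2)
print(c(SST = SST, SSE_plus_SSR = SSE + SSR))
```

The sequential-projection argument shows up here as the fact that mean(y_hat) equals mean(y), so projecting either vector on the vector of ones gives the same part3.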
From (14.3.9) follows (now the same identity written in three different notations):

(14.3.10)  $(y - \iota\bar{y})^\top(y - \iota\bar{y}) = (y - \hat{y})^\top(y - \hat{y}) + (\hat{y} - \iota\bar{y})^\top(\hat{y} - \iota\bar{y})$

(14.3.11)  $\sum_t (y_t - \bar{y})^2 = \sum_t (y_t - \hat{y}_t)^2 + \sum_t (\hat{y}_t - \bar{y})^2$

(14.3.12)  SST = SSE + SSR

Problem 213. 5 points Show that the “analysis of variance” identity SST = SSE + SSR holds in a regression with intercept, i.e., prove one of the two following equations:

(14.3.13)  $(y - \iota\bar{y})^\top(y - \iota\bar{y}) = (y - \hat{y})^\top(y - \hat{y}) + (\hat{y} - \iota\bar{y})^\top(\hat{y} - \iota\bar{y})$

(14.3.14)  $\sum_t (y_t - \bar{y})^2 = \sum_t (y_t - \hat{y}_t)^2 + \sum_t (\hat{y}_t - \bar{y})^2$

Answer. Start with

(14.3.15)  $SST = \sum_t (y_t - \bar{y})^2 = \sum_t (y_t - \hat{y}_t + \hat{y}_t - \bar{y})^2$

and then show that the cross product term $\sum_t (y_t - \hat{y}_t)(\hat{y}_t - \bar{y}) = \sum_t \hat{\varepsilon}_t(\hat{y}_t - \bar{y}) = \hat{\varepsilon}^\top(X\hat{\beta} - \iota\tfrac{1}{n}\iota^\top y) = 0$ since $\hat{\varepsilon}^\top X = o^\top$ and in particular, since a constant term is included, $\hat{\varepsilon}^\top\iota = 0$.

From the so-called “analysis of variance” identity (14.3.12), together with (14.3.6), one obtains the following three alternative expressions for the maximum possible correlation, which is called $R^2$ and which is routinely used as a measure of the “fit” of the regression:

(14.3.16)  $R^2 = \dfrac{\bigl(\sum_j (\hat{y}_j - \bar{\hat{y}})(y_j - \bar{y})\bigr)^2}{\sum_j (\hat{y}_j - \bar{\hat{y}})^2 \sum_j (y_j - \bar{y})^2} = \dfrac{SSR}{SST} = \dfrac{SST - SSE}{SST}$

The first of these three expressions is the squared sample correlation coefficient between $\hat{y}$ and $y$, hence the notation $R^2$. The usual interpretation of the middle expression is the following: SST can be decomposed into a part SSR which is “explained” by the regression, and a part SSE which remains “unexplained,” and $R^2$ measures that fraction of SST which can be “explained” by the regression. [Gre97, pp. 250–253] and also [JHG+88, pp. 211/212] try to make this notion plausible.
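The three expressions in (14.3.16), and the optimality property they come from, can be checked numerically. The R sketch below uses hypothetical data; the 1000 randomly drawn coefficient vectors are merely one convenient way to probe “other linear combinations of the explanatory variables” and are an assumption of this illustration, not part of the proof.

```r
# Hypothetical illustration data (not taken from these notes).
x2 <- c(1, 3, 2, 5, 4, 7, 6, 8)
x3 <- c(2, 1, 4, 3, 6, 5, 8, 6)
y  <- c(1.2, 2.9, 2.7, 5.1, 3.8, 7.4, 5.9, 8.3)

fit   <- lm(y ~ x2 + x3)          # regression with a constant term
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SSR <- sum((y_hat - mean(y))^2)

# The three expressions of (14.3.16)
r2 <- c(corr = cor(y_hat, y)^2, ssr = SSR / SST, sse = (SST - SSE) / SST)
print(r2)

# Squared sample correlations of y with other linear combinations X c;
# none of them should exceed the squared correlation with the fitted values.
set.seed(1)
X     <- cbind(1, x2, x3)
other <- replicate(1000, cor(drop(X %*% rnorm(3)), y)^2)
print(max(other) <= r2["corr"] + 1e-12)
```

If the constant term is dropped from both the regression and X, the three entries of r2 will in general no longer coincide.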
Instead of using the vague notions “explained” and “unexplained,” I prefer the following reading, which is based on the third expression for $R^2$ in (14.3.16): $\iota\bar{y}$ is the vector of fitted values if one regresses $y$ on a constant term only, and SST is the SSE in this “restricted” regression. $R^2$ measures therefore the proportionate reduction in the SSE if one adds the nonconstant regressors to the regression. From this latter formula one can also see that $R^2 = \cos^2\alpha$ where $\alpha$ is the angle between $y - \iota\bar{y}$ and $\hat{y} - \iota\bar{y}$.

Problem 214. Given two data series $x$ and $y$. Show that the regression of $y$ on $x$ has the same $R^2$ as the regression of $x$ on $y$. (Both regressions are assumed to include a constant term.) Easy, but you have to think!

Answer. The symmetry comes from the fact that, in this particular case, $R^2$ is the squared sample correlation coefficient between $x$ and $y$. Proof: $\hat{y}$ is an affine transformation of $x$, and correlation coefficients are invariant under affine transformations (compare Problem 216).

Problem 215. This Problem derives some relationships which are valid in simple regression $y_t = \alpha + \beta x_t + \varepsilon_t$ but their generalization to multiple regression is not obvious.

• a. 2 points Show that

(14.3.17)  $R^2 = \hat{\beta}^2\,\dfrac{\sum_t (x_t - \bar{x})^2}{\sum_t (y_t - \bar{y})^2}$

Hint: show first that $\hat{y}_t - \bar{y} = \hat{\beta}(x_t - \bar{x})$.

Answer. From $\hat{y}_t = \hat{\alpha} + \hat{\beta}x_t$ and $\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x}$ follows $\hat{y}_t - \bar{y} = \hat{\beta}(x_t - \bar{x})$. Therefore

(14.3.18)  $R^2 = \dfrac{\sum_t (\hat{y}_t - \bar{y})^2}{\sum_t (y_t - \bar{y})^2} = \hat{\beta}^2\,\dfrac{\sum_t (x_t - \bar{x})^2}{\sum_t (y_t - \bar{y})^2}$

• b. 2 points Furthermore show that $R^2$ is the squared sample correlation coefficient between $y$ and $x$, i.e.,

(14.3.19)  $R^2 = \dfrac{\bigl(\sum_t (x_t - \bar{x})(y_t - \bar{y})\bigr)^2}{\sum_t (x_t - \bar{x})^2 \sum_t (y_t - \bar{y})^2}$.

Hint: you are allowed to use (14.2.22).

Answer.

(14.3.20)  $R^2 = \hat{\beta}^2\,\dfrac{\sum_t (x_t - \bar{x})^2}{\sum_t (y_t - \bar{y})^2} = \dfrac{\bigl(\sum_t (x_t - \bar{x})(y_t - \bar{y})\bigr)^2 \sum_t (x_t - \bar{x})^2}{\bigl(\sum_t (x_t - \bar{x})^2\bigr)^2 \sum_t (y_t - \bar{y})^2}$

which simplifies to (14.3.19).

• c.
1 point Finally show that $R^2 = \hat{\beta}_{xy}\hat{\beta}_{yx}$, i.e., it is the product of the two slope coefficients one gets if one regresses $y$ on $x$ and $x$ on $y$.

If the regression does not have a constant term, but a vector $a$ exists with $\iota = Xa$, then the above mathematics remains valid. If $a$ does not exist, then the identity SST = SSR + SSE no longer holds, and (14.3.16) is no longer valid. The fraction $(SST - SSE)/SST$ can assume negative values. Also the sample correlation coefficient between $y$ and $\hat{y}$ loses its motivation, since there will usually be other linear combinations of the columns of $X$ that have higher sample correlation with $y$ than the fitted values $\hat{y}$.

Equation (14.3.16) is still puzzling at this point: why do two quite different simple concepts, the sample correlation and the proportionate reduction of the SSE, give the same numerical result? To explain this, we will take a short digression about correlation coefficients, in which it will be shown that correlation coefficients always denote proportionate reductions in the MSE. Since the SSE is (up to a constant factor) the sample equivalent of the MSE of the prediction of $y$ by $\hat{y}$, this shows that (14.3.16) is simply the sample equivalent of a general fact about correlation coefficients.

But first let us take a brief look at the Adjusted $R^2$.

14.4. The Adjusted R-Square

The coefficient of determination $R^2$ is often used as a criterion for the selection of regressors. There are several drawbacks to this. [KA69, Chapter 8] shows that the distribution function of $R^2$ depends on both the unknown error variance and the values taken by the explanatory variables; therefore the $R^2$ belonging to different regressions cannot be compared.

A further drawback is that inclusion of more regressors never decreases the $R^2$ (and in practice almost always increases it). The adjusted $\bar{R}^2$ is designed to remedy this.
Starting from the formula $R^2 = 1 - SSE/SST$, the “adjustment” consists in dividing both SSE and SST by their degrees of freedom:

(14.4.1)  $\bar{R}^2 = 1 - \dfrac{SSE/(n - k)}{SST/(n - 1)} = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k}$.

For given SST, i.e., when one looks at alternative regressions with the same dependent variable, $\bar{R}^2$ is therefore a declining function of $s^2$, the unbiased estimator of $\sigma^2$. Choosing the regression with the highest $\bar{R}^2$ amounts therefore to selecting that regression which yields the lowest value for $s^2$.

$\bar{R}^2$ has the following interesting property (which we note here only for reference, because we have not yet discussed the F-test): Assume one adds $i$ more regressors: then $\bar{R}^2$ increases only if the F statistic for these additional regressors has a value greater than one. One can also say: $s^2$ decreases only if $F > 1$. To see this, write this F statistic as

(14.4.2)  $F = \dfrac{(SSE_k - SSE_{k+i})/i}{SSE_{k+i}/(n - k - i)} = \dfrac{n - k - i}{i}\left(\dfrac{SSE_k}{SSE_{k+i}} - 1\right)$

(14.4.3)  $\phantom{F} = \dfrac{n - k - i}{i}\left(\dfrac{(n - k)s_k^2}{(n - k - i)s_{k+i}^2} - 1\right)$

(14.4.4)  $\phantom{F} = \dfrac{(n - k)s_k^2}{i\,s_{k+i}^2} - \dfrac{n - k}{i} + 1$

(14.4.5)  $\phantom{F} = \dfrac{n - k}{i}\left(\dfrac{s_k^2}{s_{k+i}^2} - 1\right) + 1$

From this the statement follows.

Maximizing the adjusted $\bar{R}^2$ is equivalent to minimizing the unbiased variance estimator $s^2$; it still does not penalize the loss of degrees of freedom heavily enough, i.e., it still admits too many variables into the model.

Alternatives are Amemiya's prediction criterion and Akaike's information criterion, which minimize functions of the estimated variances and of $n$ and $k$. Akaike's information criterion minimizes an estimate of the Kullback-Leibler discrepancy, which was discussed on p. 126.

CHAPTER 15
Digression about Correlation Coefficients

15.1. A Unified Definition of Correlation Coefficients

Correlation coefficients measure linear association.
The usual definition of the simple correlation coefficient between two variables $\rho_{xy}$ (sometimes we also use the notation $\operatorname{corr}[x, y]$) is their standardized covariance

(15.1.1)  $\rho_{xy} = \dfrac{\operatorname{cov}[x, y]}{\sqrt{\operatorname{var}[x]}\,\sqrt{\operatorname{var}[y]}}$.

Because of Cauchy-Schwarz, its value lies between $-1$ and $1$.

Problem 216. Given the constant scalars $a \neq 0$ and $c \neq 0$ and $b$ and $d$ arbitrary. Show that $\operatorname{corr}[x, y] = \pm\operatorname{corr}[ax + b, cy + d]$, with the $+$ sign being valid if $a$ and $c$ have the same sign, and the $-$ sign otherwise.

Answer. Start with $\operatorname{cov}[ax + b, cy + d] = ac\operatorname{cov}[x, y]$ and go from there.

Besides the simple correlation coefficient $\rho_{xy}$ between two scalar variables $y$ and $x$, one can also define the squared multiple correlation coefficient $\rho^2_{y(x)}$ between one scalar variable $y$ and a whole vector of variables $x$, and the partial correlation coefficient $\rho_{12.x}$ between two scalar variables $y_1$ and $y_2$, with a vector of other variables $x$ “partialled out.” The multiple correlation coefficient measures the strength of a linear association between $y$ and all components of $x$ together, and the partial correlation coefficient measures the strength of that part of the linear association between $y_1$ and $y_2$ which cannot be attributed to their joint association with $x$. One can also define partial multiple correlation coefficients. If one wants to measure the linear association between two vectors, then one number is no longer enough, but one needs several numbers, the “canonical correlations.”

The multiple or partial correlation coefficients are usually defined as simple correlation coefficients involving the best linear predictor or its residual. But all these correlation coefficients share the property that they indicate a proportionate reduction in the MSE. See e.g. [Rao73, pp. 268–70]. Problem 217 makes this point for the simple correlation coefficient:

Problem 217.
4 points Show that the proportionate reduction in the MSE of the best predictor of $y$, if one goes from predictors of the form $y^* = a$ to predictors of the form $y^* = a + bx$, is equal to the squared correlation coefficient between $y$ and $x$. You are allowed to use the results of Problems 191 and 202. To set notation, call the minimum MSE in the first prediction (Problem 191) MSE[constant term; $y$], and the minimum MSE in the second prediction (Problem 202) MSE[constant term and $x$; $y$]. Show that

(15.1.2)  $\dfrac{\text{MSE[constant term; } y] - \text{MSE[constant term and } x;\, y]}{\text{MSE[constant term; } y]} = \dfrac{(\operatorname{cov}[y, x])^2}{\operatorname{var}[y]\operatorname{var}[x]} = \rho^2_{yx}$.

Answer. The minimum MSE with only a constant is $\operatorname{var}[y]$ and (14.2.32) says that MSE[constant term and $x$; $y$] $= \operatorname{var}[y] - (\operatorname{cov}[x, y])^2/\operatorname{var}[x]$. Therefore the difference in MSE's is $(\operatorname{cov}[x, y])^2/\operatorname{var}[x]$, and if one divides by $\operatorname{var}[y]$ to get the relative difference, one gets exactly the squared correlation coefficient.

Multiple Correlation Coefficients. Now assume $x$ is a vector while $y$ remains a scalar. Their joint mean vector and dispersion matrix are

(15.1.3)  $\begin{bmatrix} x \\ y \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu \end{bmatrix},\ \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^\top & \omega_{yy} \end{bmatrix}$.

By theorem ??, the best linear predictor of $y$ based on $x$ has the formula

(15.1.4)  $y^* = \nu + \omega_{xy}^\top \Omega_{xx}^{-}(x - \mu)$

$y^*$ has the following additional extremal value property: no linear combination $b^\top x$ has a higher squared correlation with $y$ than $y^*$. This maximal value of the squared correlation is called the squared multiple correlation coefficient

(15.1.5)  $\rho^2_{y(x)} = \dfrac{\omega_{xy}^\top \Omega_{xx}^{-}\,\omega_{xy}}{\omega_{yy}}$

The multiple correlation coefficient itself is the positive square root, i.e., it is always nonnegative, while some other correlation coefficients may take on negative values.

The squared multiple correlation coefficient can also be defined in terms of proportionate reduction in MSE.
It is equal to the proportionate reduction in the MSE of the best predictor of $y$ if one goes from predictors of the form $y^* = a$ to predictors of the form $y^* = a + b^\top x$, i.e.,

(15.1.6)  $\rho^2_{y(x)} = \dfrac{\text{MSE[constant term; } y] - \text{MSE[constant term and } x;\, y]}{\text{MSE[constant term; } y]}$

There are therefore two natural definitions of the multiple correlation coefficient. These two definitions correspond to the two formulas for $R^2$ in (14.3.6).

Partial Correlation Coefficients. Now assume $y = \begin{bmatrix} y_1 & y_2 \end{bmatrix}^\top$ is a vector with two elements and write

(15.1.7)  $\begin{bmatrix} x \\ y_1 \\ y_2 \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu_1 \\ \nu_2 \end{bmatrix},\ \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{y1} & \omega_{y2} \\ \omega_{y1}^\top & \omega_{11} & \omega_{12} \\ \omega_{y2}^\top & \omega_{21} & \omega_{22} \end{bmatrix}$.

Let $y^*$ be the best linear predictor of $y$ based on $x$. The partial correlation coefficient $\rho_{12.x}$ is defined to be the simple correlation between the residuals $\operatorname{corr}[(y_1 - y_1^*), (y_2 - y_2^*)]$. This measures the correlation between $y_1$ and $y_2$ which is “local,” i.e., which does not follow from their association with $x$. Assume for instance that both $y_1$ and $y_2$ are highly correlated with $x$. Then they will also have a high correlation with each other. Subtracting $y_i^*$ from $y_i$ eliminates this dependency on $x$, therefore any remaining correlation is “local.” Compare [Krz88, p. 475].

The squared partial correlation coefficient can be defined as the relative reduction in the MSE if one adds $y_1$ to $x$ as a predictor of $y_2$:

(15.1.8)  $\rho^2_{12.x} = \dfrac{\text{MSE[constant term and } x;\, y_2] - \text{MSE[constant term, } x, \text{ and } y_1;\, y_2]}{\text{MSE[constant term and } x;\, y_2]}$.

Problem 218. Using the definitions in terms of MSE's, show that the following relationship holds between the squares of multiple and partial correlation coefficients:

(15.1.9)  $1 - \rho^2_{2(x,1)} = (1 - \rho^2_{21.x})(1 - \rho^2_{2(x)})$

Answer. In terms of the MSE, (15.1.9) reads

(15.1.10)  $\dfrac{\text{MSE[constant term, } x, \text{ and } y_1;\, y_2]}{\text{MSE[constant term; } y_2]} = \dfrac{\text{MSE[constant term, } x, \text{ and } y_1;\, y_2]}{\text{MSE[constant term and } x;\, y_2]} \cdot \dfrac{\text{MSE[constant term and } x;\, y_2]}{\text{MSE[constant term; } y_2]}$.

From (15.1.9) follows the following weighted average formula:

(15.1.11)  $\rho^2_{2(x,1)} = \rho^2_{2(x)} + (1 - \rho^2_{2(x)})\,\rho^2_{21.x}$

An alternative proof of (15.1.11) is given in [Gra76, pp. 116/17].

Mixed cases: One can also form multiple correlation coefficients with some of the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The notation, definition, and formula for the squared correlation coefficient is

(15.1.12)  $\rho^2_{y(x).z} = \dfrac{\text{MSE[constant term and } z;\, y] - \text{MSE[constant term, } z, \text{ and } x;\, y]}{\text{MSE[constant term and } z;\, y]}$

(15.1.13)  $\phantom{\rho^2_{y(x).z}} = \dfrac{\omega_{xy.z}^\top\,\Omega_{xx.z}^{-}\,\omega_{xy.z}}{\omega_{yy.z}}$

CHAPTER 16
Specific Datasets

16.1. Cobb Douglas Aggregate Production Function

Problem 219. 2 points The Cobb-Douglas production function postulates the following relationship between annual output $q_t$ and the inputs of labor $\ell_t$ and capital $k_t$:

(16.1.1)  $q_t = \mu\,\ell_t^{\beta} k_t^{\gamma} \exp(\varepsilon_t)$.

$q_t$, $\ell_t$, and $k_t$ are observed, and $\mu$, $\beta$, $\gamma$, and the $\varepsilon_t$ are to be estimated. By the variable transformation $x_t = \log q_t$, $y_t = \log \ell_t$, $z_t = \log k_t$, and $\alpha = \log\mu$, one obtains the linear regression

(16.1.2)  $x_t = \alpha + \beta y_t + \gamma z_t + \varepsilon_t$

Sometimes the following alternative variable transformation is made: $u_t = \log(q_t/\ell_t)$, $v_t = \log(k_t/\ell_t)$, and the regression

(16.1.3)  $u_t = \alpha + \gamma v_t + \varepsilon_t$

is estimated. How are the regressions (16.1.2) and (16.1.3) related to each other?

Answer. Write (16.1.3) as

(16.1.4)  $x_t - y_t = \alpha + \gamma(z_t - y_t) + \varepsilon_t$

and collect terms to get

(16.1.5)  $x_t = \alpha + (1 - \gamma)y_t + \gamma z_t + \varepsilon_t$

From this follows that running the regression (16.1.3) is equivalent to running the regression (16.1.2) with the constraint $\beta + \gamma = 1$ imposed.

The assumption here is that output is the only random variable. The regression model is based on the assumption that the dependent variables have more noise in them than the independent variables.
One can justify this by the argument that any noise in the independent variables will be transferred to the dependent variable, and also that variables which affect other variables have more steadiness in them than variables which depend on others. This justification often has merit, but in the specific case, there is much more measurement error in the labor and capital inputs than in the outputs. Therefore the assumption that only the output has an error term is clearly wrong, and problem 221 below will look for possible alternatives.

Problem 220. Table 1 shows the data used by Cobb and Douglas in their original article [CD28] introducing the production function which would bear their name. output is “Day's index of the physical volume of production (1899 = 100)” described in [DP20], capital is the capital stock in manufacturing in millions of 1880 dollars [CD28, p. 145], labor is the “probable average number of wage earners employed in manufacturing” [CD28, p. 148], and wage is an index of the real wage (1899–1908 = 100).

year  output  capital  labor  wage
1899     100     4449   4713    99
1900     101     4746   4968    98
1901     112     5061   5184   101
1902     122     5444   5554   102
1903     124     5806   5784   100
1904     122     6132   5468    99
1905     143     6626   5906   103
1906     152     7234   6251   101
1907     151     7832   6483    99
1908     126     8229   5714    94
1909     155     8820   6615   102
1910     159     9240   6807   104
1911     153     9624   6855    97
1912     177    10067   7167    99
1913     184    10520   7277   100
1914     169    10873   7026    99
1915     189    11840   7269    99
1916     225    13242   8601   104
1917     227    14915   9218   103
1918     223    16265   9446   107
1919     218    17234   9096   111
1920     231    18118   9110   114
1921     179    18542   6947   115
1922     240    19192   7602   119

Table 1. Cobb Douglas Original Data

• a. A text file with the data is available on the web at www.econ.utah.edu/ehrbar/data/cobbdoug.txt, and an SDML file (XML for statistical data which can be read by R, Matlab, and perhaps also SPSS) is available at www.econ.utah.edu/ehrbar/data/cobbdoug.sdml.
Load these data into your favorite statistics package.

Answer. In R, you can simply issue the command

    cobbdoug <- read.table("http://www.econ.utah.edu/ehrbar/data/cobbdoug.txt", header=TRUE)

If you run R on unix, you can also do the following: download cobbdoug.sdml from the www, and then first issue the command library(StatDataML) and then readSDML("cobbdoug.sdml"). When I tried this last, the XML package necessary for StatDataML was not available on windows, but chances are it will be when you read this.

In SAS, you must issue the commands

    data cobbdoug;
    infile 'cobbdoug.txt';
    input year output capital labor;
    run;

But for this to work you must delete the first line in the file cobbdoug.txt which contains the variable names. (Is it possible to tell SAS to skip the first line?) And you may have to tell SAS the full pathname of the text file with the data. If you want a permanent instead of a temporary dataset, give it a two-part name, such as ecmet.cobbdoug.

Here are the instructions for SPSS: 1) Begin SPSS with a blank spreadsheet. 2) Open up a file with the following commands and run:

    SET BLANKS=SYSMIS UNDEFINED=WARN.
    DATA LIST FILE='A:\Cbbunst.dat' FIXED RECORDS=1 TABLE
      /1 year 1-4 output 5-9 capital 10-16 labor 17-22 wage 23-27 .
    EXECUTE.

These commands assume the data file to be in the location named in the FILE= argument, and again the first line in the data file with the variable names must be deleted. Once the data are entered into SPSS the procedures (regression, etc.) are best run from the point and click environment.

• b. The next step is to look at the data. On [CD28, p. 150], Cobb and Douglas plot capital, labor, and output on a logarithmic scale against time, all 3 series normalized such that they start in 1899 at the same level = 100. Reproduce this graph using a modern statistics package.

• c. Run both regressions (16.1.2) and (16.1.3) on Cobb and Douglas's original dataset.
Compute 95% confidence intervals for the coefficients of capital and labor in the unconstrained and the constrained models.

Answer. SAS does not allow you to transform the data on the fly; it insists that you first go through a data step creating the transformed data, before you can run a regression on them. Therefore the next set of commands creates a temporary dataset cdtmp. The data step data cdtmp includes all the data from cobbdoug into cdtmp and then creates some transformed data as well. Then one can run the regressions. Here are the commands; they are in the file cbbrgrss.sas in your data disk:

    data cdtmp;
    set cobbdoug;
    logcap = log(capital);
    loglab = log(labor);
    logout = log(output);
    logcl = logcap-loglab;
    logol = logout-loglab;
    run;
    proc reg data = cdtmp;
    model logout = logcap loglab;
    run;
    proc reg data = cdtmp;
    model logol = logcl;
    run;

Careful! In R, the command lm(log(output)-log(labor) ~ log(capital)-log(labor), data=cobbdoug) does not give the right results. It does not complain but the result is wrong nevertheless, because on the righthand side of a model formula the minus sign removes a term from the model instead of subtracting two variables. The right way to write this command is lm(I(log(output)-log(labor)) ~ I(log(capital)-log(labor)), data=cobbdoug).

• d. The regression results are graphically represented in Figure 1. The big ellipse is a joint 95% confidence region for $\beta$ and $\gamma$. This ellipse is a level line of the SSE. The vertical and horizontal bands represent univariate 95% confidence regions for $\beta$ and $\gamma$ separately. The diagonal line is the set of all $\beta$ and $\gamma$ with $\beta + \gamma = 1$, representing the constraint of constant returns to scale. The small ellipse is that level line of the SSE which is tangent to the constraint. The point of tangency represents the constrained estimator. Reproduce this graph (or as much of this graph as you can) using your statistics package.
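For part c, the R side of the computation might look like the following minimal sketch. It types in the output, capital, and labor columns of Table 1 directly instead of reading the web file, so it is self-contained; the object names unres and constr are chosen only for this illustration. Because these are the unscaled data, the slope estimates need not agree to all digits with results based on rounded index numbers, and the intercept depends on the units.

```r
# Data transcribed from Table 1 of these notes (1899-1922).
output  <- c(100, 101, 112, 122, 124, 122, 143, 152, 151, 126, 155, 159,
             153, 177, 184, 169, 189, 225, 227, 223, 218, 231, 179, 240)
capital <- c(4449, 4746, 5061, 5444, 5806, 6132, 6626, 7234, 7832, 8229,
             8820, 9240, 9624, 10067, 10520, 10873, 11840, 13242, 14915,
             16265, 17234, 18118, 18542, 19192)
labor   <- c(4713, 4968, 5184, 5554, 5784, 5468, 5906, 6251, 6483, 5714,
             6615, 6807, 6855, 7167, 7277, 7026, 7269, 8601, 9218, 9446,
             9096, 9110, 6947, 7602)
cobbdoug <- data.frame(year = 1899:1922, output, capital, labor)

# Unconstrained regression (16.1.2)
unres <- lm(log(output) ~ log(labor) + log(capital), data = cobbdoug)

# Constrained regression (16.1.3); I() makes R subtract the logs
# instead of interpreting "-" as removal of a model term.
constr <- lm(I(log(output) - log(labor)) ~ I(log(capital) - log(labor)),
             data = cobbdoug)

# 95% confidence intervals, unconstrained model
print(confint(unres, level = 0.95))

# Constrained model: interval for gamma (capital) comes from the fit,
# the one for beta = 1 - gamma (labor) is its mirror image.
ci_gamma <- confint(constr, level = 0.95)[2, ]
ci_beta  <- rev(1 - unname(ci_gamma))
print(rbind(capital = ci_gamma, labor = ci_beta))
```

The interval for the labor coefficient in the constrained model is obtained here from the identity $\beta = 1 - \gamma$, which is why its endpoints are reversed.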
Remark: In order to make the hand computations easier, Cobb and Douglas reduced the data for capital and labor to index numbers (1899=100) which were rounded to integers, before running the regressions, and Figure 1 was constructed using these rounded data. Since you are using the nonstandardized data, you may get slightly different results.

Answer. lines(ellipse.lm(cbbfit, which=c(2, 3)))

Problem 221. In this problem we will treat the Cobb-Douglas data as a dataset with errors in all three variables. See chapter ?? and problem ?? about that.

• a. Run the three elementary regressions for the whole period, then choose at least two subperiods and run it for those. Plot all regression coefficients as points in a plane, using different colors for the different subperiods (you have to normalize them in a special way that they all fit on the same plot).

Answer. Here are the results in R:

    > outputlm<-lm(log(output)~log(capital)+log(labor),data=cobbdoug)
    > capitallm<-lm(log(capital)~log(labor)+log(output),data=cobbdoug)
    > laborlm<-lm(log(labor)~log(output)+log(capital),data=cobbdoug)

[Figure 1. Coefficients of capital (vertical) and labor (horizontal), dependent variable output, unconstrained and constrained, 1899–1922]

    > coefficients(outputlm)
     (Intercept) log(capital)   log(labor)
      -0.1773097    0.2330535    0.8072782
    > coefficients(capitallm)
     (Intercept)   log(labor)  log(output)
     -2.72052726  -0.08695944   1.67579357
    > coefficients(laborlm)
     (Intercept)  log(output) log(capital)
      1.27424214   0.73812541  -0.01105754

    #Here is the information for the confidence ellipse:
    > summary(outputlm,correlation=T)

    Call:
    lm(formula = log(output) ~ log(capital) + log(labor), data = cobbdoug)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.075282 -0.035234 -0.006439  0.038782  0.142114

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -0.17731    0.43429  -0.408  0.68721
    log(capital)  0.23305    0.06353   3.668  0.00143 **
    log(labor)    0.80728    0.14508   5.565  1.6e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.05814 on 21 degrees of freedom
    Multiple R-Squared: 0.9574, Adjusted R-squared: 0.9534
    F-statistic: 236.1 on 2 and 21 degrees of freedom, p-value: 3.997e-15

    Correlation of Coefficients:
                 (Intercept) log(capital)
    log(capital)  0.7243
    log(labor)   -0.9451     -0.9096

    #Quantile of the F-distribution:
    > qf(p=0.95, df1=2, df2=21)
    [1] 3.4668

• b. The elementary regressions will give you three fitted equations of the form

(16.1.6)  output $= \hat{\alpha}_1 + \hat{\beta}_{12}$ labor $+ \hat{\beta}_{13}$ capital $+$ residual$_1$

(16.1.7)  labor $= \hat{\alpha}_2 + \hat{\beta}_{21}$ output $+ \hat{\beta}_{23}$ capital $+$ residual$_2$

(16.1.8)  capital $= \hat{\alpha}_3 + \hat{\beta}_{31}$ output $+ \hat{\beta}_{32}$ labor $+$ residual$_3$.
In order to compare the slope parameters in these regressions, first rearrange them in the form

(16.1.9)  $-$output $+ \hat{\beta}_{12}$ labor $+ \hat{\beta}_{13}$ capital $+ \hat{\alpha}_1 +$ residual$_1 = 0$

(16.1.10)  $\hat{\beta}_{21}$ output $-$ labor $+ \hat{\beta}_{23}$ capital $+ \hat{\alpha}_2 +$ residual$_2 = 0$

(16.1.11)  $\hat{\beta}_{31}$ output $+ \hat{\beta}_{32}$ labor $-$ capital $+ \hat{\alpha}_3 +$ residual$_3 = 0$

This gives the following table of coefficients:

    output        labor          capital        intercept
    −1            0.8072782      0.2330535      −0.1773097
    0.73812541    −1             −0.01105754    1.27424214
    1.67579357    −0.08695944    −1             −2.72052726

Now divide the second and third rows by the negative of their first coefficient, so that the coefficient of output becomes −1:

    output   labor                    capital                  intercept
    −1       0.8072782                0.2330535                −0.1773097
    −1       1/0.73812541             0.01105754/0.73812541    −1.27424214/0.73812541
    −1       0.08695944/1.67579357    1/1.67579357             2.72052726/1.67579357

After performing the divisions the following numbers are obtained:

    output   labor         capital        intercept
    −1       0.8072782     0.2330535      −0.1773097
    −1       1.3547833     0.014980570    −1.726322
    −1       0.05189149    0.59673221     1.6234262

These results can also be re-written in the form given by Table 2.

                                                 Intercept   Slope of output   Slope of output
                                                             wrt labor         wrt capital
    Regression of output on labor and capital
    Regression of labor on output and capital
    Regression of capital on output and labor

Table 2. Comparison of coefficients in elementary regressions

Fill in the values for the whole period and also for several sample subperiods. Make a scatter plot of the contents of this table, i.e., represent each regression result as a point in a plane, using different colors for different sample periods.

[Figure 2. Coefficients of capital (vertical) and labor (horizontal), dependent variable output, 1899–1922. Labeled points: capital; output no error, crs; Cobb Douglas's original result; output; labor]

[Figure 3. Coefficient of capital (vertical) and labor (horizontal) in the elementary regressions, dependent variable output, 1899–1922. Labeled points: capital all errors; output no error, crs; output all errors; labor all errors]

Problem 222. Given a univariate problem with three variables all of which have zero mean, and a linear constraint that the coefficients of all variables sum to 0. (This is the model apparently appropriate to the Cobb-Douglas data, with the assumption of constant returns to scale, after taking out the means.) Call the observed variables $x$, $y$, and $z$, with underlying systematic variables $x^*$, $y^*$, and $z^*$, and errors $u$, $v$, and $w$.

• a. Write this model in the form (??).

Answer.

(16.1.12)  $\begin{bmatrix} x^* & y^* & z^* \end{bmatrix} \begin{bmatrix} -1 \\ \beta \\ 1 - \beta \end{bmatrix} = 0, \qquad \begin{bmatrix} x & y & z \end{bmatrix} = \begin{bmatrix} x^* & y^* & z^* \end{bmatrix} + \begin{bmatrix} u & v & w \end{bmatrix}$

or

$x^* = \beta y^* + (1 - \beta)z^*$,  $x = x^* + u$,  $y = y^* + v$,  $z = z^* + w$.

• b. The moment matrix of the systematic variables can be written fully in terms of $\sigma^2_{y^*}$, $\sigma^2_{z^*}$, $\sigma_{y^*z^*}$, and the unknown parameter $\beta$. Write out the moment matrix and therefore the Frisch decomposition.

Answer. The moment matrix is the middle matrix in the following Frisch decomposition:

(16.1.13)  $\begin{bmatrix} \sigma_x^2 & \sigma_{xy} & \sigma_{xz} \\ \sigma_{xy} & \sigma_y^2 & \sigma_{yz} \\ \sigma_{xz} & \sigma_{yz} & \sigma_z^2 \end{bmatrix} =$

(16.1.14)  $= \begin{bmatrix} \beta^2\sigma^2_{y^*} + 2\beta(1 - \beta)\sigma_{y^*z^*} + (1 - \beta)^2\sigma^2_{z^*} & \beta\sigma^2_{y^*} + (1 - \beta)\sigma_{y^*z^*} & \beta\sigma_{y^*z^*} + (1 - \beta)\sigma^2_{z^*} \\ \beta\sigma^2_{y^*} + (1 - \beta)\sigma_{y^*z^*} & \sigma^2_{y^*} & \sigma_{y^*z^*} \\ \beta\sigma_{y^*z^*} + (1 - \beta)\sigma^2_{z^*} & \sigma_{y^*z^*} & \sigma^2_{z^*} \end{bmatrix} + \begin{bmatrix} \sigma_u^2 & 0 & 0 \\ 0 & \sigma_v^2 & 0 \\ 0 & 0 & \sigma_w^2 \end{bmatrix}$.

• c. Show that the unknown parameters are not yet identified. However, if one makes the additional assumption that one of the three error variances $\sigma_u^2$, $\sigma_v^2$, or $\sigma_w^2$ is zero, then the equations are identified. Since the quantity of output presumably has less error than the other two variables, assume $\sigma_u^2 = 0$. Under this assumption, show that

(16.1.15)  $\beta = \dfrac{\sigma_x^2 - \sigma_{xz}}{\sigma_{xy} - \sigma_{xz}}$

and this can be estimated by replacing the variances and covariances by their sample counterparts. In a similar way, derive estimates of all other parameters of the model.

Answer. Solving (16.1.14) one gets from the $yz$ element of the covariance matrix

(16.1.16)  $\sigma_{y^*z^*} = \sigma_{yz}$

and from the $xz$ element

(16.1.17)  $\sigma^2_{z^*} = \dfrac{\sigma_{xz} - \beta\sigma_{yz}}{1 - \beta}$

Similarly, one gets from the $xy$ element:

(16.1.18)  $\sigma^2_{y^*} = \dfrac{\sigma_{xy} - (1 - \beta)\sigma_{yz}}{\beta}$

Now plug (16.1.16), (16.1.17), and (16.1.18) into the equation for the $xx$ element:

(16.1.19)  $\sigma_x^2 = \beta(\sigma_{xy} - (1 - \beta)\sigma_{yz}) + 2\beta(1 - \beta)\sigma_{yz} + (1 - \beta)(\sigma_{xz} - \beta\sigma_{yz}) + \sigma_u^2$

(16.1.20)  $\phantom{\sigma_x^2} = \beta\sigma_{xy} + (1 - \beta)\sigma_{xz} + \sigma_u^2$

Since we are assuming $\sigma_u^2 = 0$ this last equation can be solved for $\beta$:

(16.1.21)  $\beta = \dfrac{\sigma_x^2 - \sigma_{xz}}{\sigma_{xy} - \sigma_{xz}}$

If we replace the variances and covariances by the sample variances and covariances, this gives an estimate of $\beta$.

• d. Evaluate these formulas numerically. In order to get the sample means and the sample covariance matrix of the data, you may issue the SAS commands

    proc corr cov nocorr data=cdtmp;
    var logout loglab logcap;
    run;

These commands are in the file cbbcovma.sas on the disk.

Answer. Mean vector and covariance matrix are

(16.1.22)  $\begin{bmatrix} \text{LOGOUT} \\ \text{LOGLAB} \\ \text{LOGCAP} \end{bmatrix} \sim \begin{bmatrix} 5.07734 \\ 4.96272 \\ 5.35648 \end{bmatrix},\ \begin{bmatrix} 0.0724870714 & 0.0522115563 & 0.1169330807 \\ 0.0522115563 & 0.0404318579 & 0.0839798588 \\ 0.1169330807 & 0.0839798588 & 0.2108441826 \end{bmatrix}$

Therefore equation (16.1.15) gives

(16.1.23)  $\hat{\beta} = \dfrac{0.0724870714 - 0.1169330807}{0.0522115563 - 0.1169330807} = 0.686726861149148$

In Figure 3, the point $(\hat{\beta}, 1 - \hat{\beta})$ is exactly the intersection of the long dotted line with the constraint.

• e.
• e. The fact that all 3 points lie almost on the same line indicates that there may be 2 linear relations: log labor is a certain coefficient times log output, and log capital is a different coefficient times log output. I.e., $y^* = \delta_1 + \gamma_1 x^*$ and $z^* = \delta_2 + \gamma_2 x^*$. In other words, there is no substitution. What would be the two coefficients $\gamma_1$ and $\gamma_2$ if this were the case?

Answer. Now the Frisch decomposition is
\[
\begin{bmatrix}
\sigma^2_x & \sigma_{xy} & \sigma_{xz} \\
\sigma_{xy} & \sigma^2_y & \sigma_{yz} \\
\sigma_{xz} & \sigma_{yz} & \sigma^2_z
\end{bmatrix}
= \sigma^2_{x^*}
\begin{bmatrix}
1 & \gamma_1 & \gamma_2 \\
\gamma_1 & \gamma_1^2 & \gamma_1\gamma_2 \\
\gamma_2 & \gamma_1\gamma_2 & \gamma_2^2
\end{bmatrix}
+
\begin{bmatrix}
\sigma^2_u & 0 & 0 \\
0 & \sigma^2_v & 0 \\
0 & 0 & \sigma^2_w
\end{bmatrix}.
\tag{16.1.24}
\]
Solving this gives the following (obtain $\gamma_1$ by dividing the 32-element by the 31-element, $\gamma_2$ by dividing the 32-element by the 12-element, $\sigma^2_{x^*}$ by dividing the 21-element by $\gamma_1$, etc.):
\[
\begin{aligned}
\gamma_1 &= \frac{\sigma_{yz}}{\sigma_{xz}} = \frac{0.0839798588}{0.1169330807} = 0.7181873452513939 \\
\gamma_2 &= \frac{\sigma_{yz}}{\sigma_{xy}} = \frac{0.0839798588}{0.0522115563} = 1.608453467992104 \\
\sigma^2_{x^*} &= \frac{\sigma_{xy}\sigma_{xz}}{\sigma_{yz}} = \frac{0.0522115563 \cdot 0.1169330807}{0.0839798588} = 0.0726990758 \\
\sigma^2_u &= \sigma^2_x - \frac{\sigma_{xy}\sigma_{xz}}{\sigma_{yz}} = 0.0724870714 - 0.0726990758 = -0.000212 \\
\sigma^2_v &= \sigma^2_y - \frac{\sigma_{xy}\sigma_{yz}}{\sigma_{xz}} \\
\sigma^2_w &= \sigma^2_z - \frac{\sigma_{xz}\sigma_{yz}}{\sigma_{xy}}
\end{aligned}
\tag{16.1.25}
\]
This model is just barely rejected by the data since it leads to a slightly negative variance for $u$.

• f. The assumption that there are two linear relations is represented as the light-blue line in Figure 3. What is the equation of this line?

Answer. If $y = \gamma_1 x$ and $z = \gamma_2 x$ then the equation $x = \beta_1 y + \beta_2 z$ holds whenever $\beta_1\gamma_1 + \beta_2\gamma_2 = 1$. This is a straight line in the $\beta_1,\beta_2$-plane, going through the points
\[
(0, 1/\gamma_2) = \Bigl(0, \tfrac{0.0522115563}{0.0839798588}\Bigr) = (0,\ 0.6217152189353289)
\quad\text{and}\quad
(1/\gamma_1, 0) = \Bigl(\tfrac{0.1169330807}{0.0839798588}, 0\Bigr) = (1.3923943475361023,\ 0).
\]
This line is in the figure, and it is just a tiny bit on the wrong side of the dotted line connecting the two estimates.

16.2. Houthakker’s Data

For this example we will use Berndt’s textbook [Ber91], which discusses some of the classic studies in the econometric literature.
One example described there is the estimation of a demand function for electricity [Hou51], which is the first multiple regression with several variables run on a computer. In this exercise you are asked to do all steps in exercises 1 and 3 in chapter 7 of Berndt, and to use the additional facilities of R to perform other steps of data analysis which Berndt did not ask for: for instance, explore the best subset of regressors using leaps and the best nonlinear transformation using avas, do some diagnostics, search for outliers or influential observations, and check the normality of residuals by a probability plot.

Problem 223. 4 points The electricity demand data from [Hou51] are available on the web at www.econ.utah.edu/ehrbar/data/ukelec.txt. Import these data into your favorite statistics package. For R you need the command ukelec <- read.table("http://www.econ.utah.edu/ehrbar/data/ukelec.txt"). Make a scatterplot matrix of these data using e.g. pairs(ukelec) and describe what you see.

Answer. inc and cap are negatively correlated. cap is the capacity of rented equipment, not of equipment owned. Apparently customers with higher income buy their equipment instead of renting it. gas6 and gas8 are very highly correlated. mc4, mc6, and mc8 are less highly correlated; the correlation between mc6 and mc8 is higher than that between mc4 and mc6. It seems electricity prices have been coming down. kwh, inc, and expen are strongly positively correlated. The stripes in all the plots which have mc4, mc6, or mc8 in them come from the fact that the marginal cost of electricity is a round number. Electricity prices and kwh are negatively correlated. There is no obvious positive correlation between kwh and cap or expen and cap. Prices of electricity and gas are somewhat positively correlated, but not much. When looking at the correlations of inc with the other variables, there are several outliers which could have a strong “leverage” effect.
In 1934, those with high income had lower electricity prices than those with low income. This effect dissipated by 1938. No strong negative correlations anywhere. cust is negatively correlated with inc; is this because rich people live in smaller cities?

If you simply type ukelec in R, it will print the data on the screen. The variables have the following meanings:

cust   Average number of consumers with two-part tariffs for electricity in 1937–38, in thousands. Two-part tariff means: they pay a fixed monthly sum plus a certain “running charge” times the number of kilowatt hours they use.
inc    Average income of two-part consumers, in pounds per year. (Note that one pound had 240 pence at that time.)
mc4    The running charge (marginal cost) on domestic two-part tariffs in 1933–34, in pence per KWH. (The marginal costs are the costs that depend on the number of kilowatt hours only; it is the cost of one additional kilowatt hour.)
mc6    The running charge (marginal cost) on domestic two-part tariffs in 1935–36, in pence per KWH.
mc8    The running charge (marginal cost) on domestic two-part tariffs in 1937–38, in pence per KWH.
gas6   The marginal price of gas in 1935–36, in pence per therm.
gas8   The marginal price of gas in 1937–38, in pence per therm.
kwh    Consumption on domestic two-part tariffs per consumer in 1937–38, in kilowatt hours.
cap    The average holdings (capacity) of heavy electric equipment bought on hire purchase (leased) by domestic two-part consumers in 1937–38, in kilowatts.
expen  The average total expenditure on electricity by two-part consumers in 1937–38, in pounds.

The function summary(ukelec) displays summary statistics about every variable.

Since every data frame in R is a list, it is possible to access the variables in ukelec by typing ukelec$mc4 etc. Try this; if you type this and then a return, you will get a listing of mc4.
In order to have all variables available as separate objects and save typing ukelec$ all the time, one has to “mount” the data frame by the command attach(ukelec). After this, the individual data series can simply be printed on the screen by typing the name of the variable, for instance mc4, and then the return key.

Problem 224. 2 points Make boxplots of mc4, mc6, and mc8 in the same graph next to each other, and the same with gas6 and gas8.

Problem 225. 2 points How would you answer the question whether marginal prices of gas vary more or less than those of electricity (say in the year 1936)?

Answer. Marginal gas prices vary a little more than electricity prices, although electricity was the newer technology, and although gas prices are much more stable over time than the electricity prices. Compare sqrt(var(mc6))/mean(mc6) with sqrt(var(gas6))/mean(gas6). You get 0.176 versus 0.203. Another way would be to compute max(mc6)/min(mc6) and compare with max(gas6)/min(gas6): you get 2.27 versus 2.62. In any case this is a lot of variation.

Problem 226. 2 points Make a plot of the (empirical) density function of mc6 and gas6 and interpret the results.

Problem 227. 2 points Is electricity a big share of total income? Which command is better: mean(expen/inc) or mean(expen)/mean(inc)? What other options are there? Actually, there is a command which is clearly better than at least one of the above; can you figure out what it is?

Answer. The proportion is small, less than 1 percent. The two above commands give 0.89% and 0.84%. The command sum(cust*expen) / sum(cust*inc) is better than mean(expen) / mean(inc), because each component in expen and inc is the mean over many households, the number of households given by cust. mean(expen) is therefore an average over averages over different population sizes, which is not a good idea. sum(cust*expen) is total expenditure in all households involved, and sum(cust*inc) is total income in all households involved.
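To make the difference between these ratios concrete, here is a small sketch in R with three made-up towns. The numbers are hypothetical and are not the ukelec data; the point is only that the pooled share weights each town by its number of customers, and that it equals a weighted mean of the town-level shares with weights cust*inc.

```r
# Hypothetical town-level averages (NOT the ukelec data).
cust  <- c(10, 200, 50)      # number of customers, in thousands
inc   <- c(1000, 400, 600)   # average income per customer, pounds per year
expen <- c(8, 4, 5.4)        # average electricity expenditure per customer, pounds

mean_of_shares <- mean(expen / inc)          # every town counts equally
share_of_means <- mean(expen) / mean(inc)    # averages of averages, unweighted

# Pooled share: total expenditure over total income across all customers.
pooled_share <- sum(cust * expen) / sum(cust * inc)

# The same number written as a weighted mean of the town-level shares.
weighted_share <- weighted.mean(expen / inc, w = cust * inc)

print(c(mean_of_shares = mean_of_shares,
        share_of_means = share_of_means,
        pooled_share   = pooled_share,
        weighted_share = weighted_share))
```

In this toy example the large town dominates the pooled share, which is what one wants if the question is about the share of electricity in the income of all households taken together.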
For the ukelec data, sum(cust*expen) / sum(cust*inc) gives the value 0.92%. Another option is median(expen/inc), which gives 0.91%. A good way to answer this question is to plot it: plot(inc,expen). You get the line where expenditure is 1 percent of income by abline(0,0.01). For higher incomes, expenditure for electricity levels off and becomes a lower share of income.

Problem 228. Have your computer compute the sample correlation matrix of the data. The R command is cor(ukelec).

• a. 4 points Are there surprises if one looks at the correlation matrix?

Answer. Electricity consumption kwh is slightly negatively correlated with gas prices and with the capacity. If one takes the correlation matrix of the logarithmic data, one gets the expected positive signs. Marginal prices of gas and electricity are positively correlated, in the order of 0.3 to 0.45. There is a higher correlation between mc6 and mc8 than between mc4 and mc6. The correlation between expen and cap is negative and low in both matrices, while one should expect positive correlation. But in the logarithmic matrix, mc6 has negative correlation with expen, i.e., the elasticity of electricity demand is less than 1. In the logarithmic data, cust has higher correlations than in the non-logarithmic data, and it is also more nearly normally distributed. inc has negative correlation with mc4 but positive correlation with mc6 and mc8. (If one looks at the scatterplot matrix, this seems to be just random variation in an essentially zero correlation.) mc6 and expen are positively correlated, and so are mc8 and expen. This is due to the one outlier with high expen and high income and also high electricity prices. The marginal prices of electricity are not strongly correlated with expen, and in 1934 they are negatively correlated with income. From the scatter plot of kwh versus cap it seems there are two datapoints whose removal might turn the sign around. To find out which they are, do plot(kwh,cap) and then use the identify
function: identify(kwh,cap,labels=row.names(ukelec)). The two outlying datapoints are Halifax and Wallase. Wallase has the highest income of all towns, namely 1422, while Halifax’s income of 352 is close to the minimum, which is 279. High-income customers do not lease their equipment but buy it.

• b. 3 points The correlation matrix says that kwh is negatively related with cap, but the correlation of the logarithms gives the expected positive sign. Can you explain this behavior?

Answer. If one plots the data using plot(cap,kwh), one sees that the negative correlation comes from the two outliers. On a logarithmic scale, these two are no longer such strong outliers.

Problem 229. Berndt on p. 338 defines the intramarginal expenditure f <- expen-mc8*kwh/240. What is this, and what do you find out looking at it?

After this preliminary look at the data, let us run the regressions.

Problem 230. 6 points Write up the main results from the regressions which in R are run by the commands

    houth.olsfit <- lm(formula = kwh ~ inc+I(1/mc6)+gas6+cap)
    houth.glsfit <- lm(kwh ~ inc+I(1/mc6)+gas6+cap, weights=cust)
    houth.olsloglogfit <- lm(log(kwh) ~ log(inc)+log(mc6)+log(gas6)+log(cap))

Instead of 1/mc6 you had to type I(1/mc6) because the slash has a special meaning in formulas, creating a nested design; therefore it had to be “protected” by applying the function I() to it. If you then type houth.olsfit, a short summary of the regression results will be displayed on the screen. There is also the command summary(houth.olsfit), which gives you a more detailed summary. If you type plot(houth.olsfit) you will get a series of graphics relevant for this regression.

Answer. All the expected signs. Gas prices do not play a great role in determining electricity consumption, despite the “cookers” Berndt talks about on p. 337. Especially the logarithmic regression makes gas prices highly insignificant! The weighted estimation has a higher $R^2$.
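The two formula features used in these commands, I() and weights, can be tried out without the ukelec file. The following is a sketch on simulated data; all variable names and numbers are made up for illustration and the simulated model is an assumption, not Houthakker's specification. It shows that the protected term appears under the name I(1/mc) in the coefficient vector, and that a weighted fit is the same as an ordinary least squares fit after every row, including the constant, has been multiplied by the square root of its weight.

```r
# Simulated data; names and numbers are hypothetical, not the ukelec data.
set.seed(1)
n    <- 40
inc  <- runif(n, 300, 1400)               # "income"
mc   <- runif(n, 0.4, 1.2)                # "marginal cost"
cust <- sample(5:200, n, replace = TRUE)  # group sizes used as weights
# Each observation is an average over cust units, so its variance is
# taken to be proportional to 1/cust.
kwh  <- 200 + 0.5 * inc + 400 / mc + rnorm(n, sd = 300 / sqrt(cust))

# I() protects the arithmetic 1/mc from the formula meaning of "/".
fit_ols <- lm(kwh ~ inc + I(1/mc))
fit_wls <- lm(kwh ~ inc + I(1/mc), weights = cust)

# Weighted least squares as OLS on rows scaled by sqrt(weight);
# the intercept column becomes s itself, hence the "0 +".
s <- sqrt(cust)
fit_scaled <- lm(I(s * kwh) ~ 0 + s + I(s * inc) + I(s / mc))

print(names(coef(fit_ols)))
print(cbind(wls = unname(coef(fit_wls)), scaled = unname(coef(fit_scaled))))
```

The two columns printed at the end agree, which is one way to see why weighting by cust is the natural correction when each observation is an average over cust households.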
Problem 231. 2 points The output of the OLS regression gives as standard error of inc the value of 0.18, while in the GLS regression it is 0.20. For the other variables, the standard error as given in the GLS regression is lower than that in the OLS regression. Does this mean that one should use for inc the OLS estimate and for the other variables the GLS estimates?

Problem 232. 5 points Show, using the leaps procedure in R or some other selection of regressors, that the variables Houthakker used in his GLS-regression are the “best” among the following: inc, mc4, mc6, mc8, gas6, gas8, cap, using either the Cp statistic or the adjusted $R^2$. (At this stage, do not transform the variables but just enter them into the regression untransformed, but do use the weights, which are theoretically well justified.)

To download the leaps package, use install.packages("leaps", lib="C:/Documents and Settings/420lab.420LAB/My Documents") and to call it up, use library(leaps, lib.loc="C:/Documents and Settings/420lab.420LAB/My Documents"). If the library ecmet is available, the command ecmet.script(houthsel) runs the following script:
    library(leaps)
    data(ukelec)
    attach(ukelec)
    houth.glsleaps <- leaps(x=cbind(inc,mc4,mc6,mc8,gas6,gas8,cap), y=kwh,
        wt=cust, method="Cp", nbest=5, strictly.compatible=FALSE)
    ecmet.prompt("Plot Mallows' Cp against number of regressors:")
    plot(houth.glsleaps$size, houth.glsleaps$Cp)
    ecmet.prompt("Throw out all regressions with a Cp > 50 (big gap)")
    plot(houth.glsleaps$size[houth.glsleaps$Cp<50],
        houth.glsleaps$Cp[houth.glsleaps$Cp<50])
    ecmet.prompt("Cp should be roughly equal the number of regressors")
    abline(0,1)
    cat("Does this mean the best regression is overfitted?")
    ecmet.prompt("Click at the points to identify them, left click to quit")
    ## First construct the labels
    lngth <- dim(houth.glsleaps$which)[1]
    included <- as.list(1:lngth)
    for (ii in 1:lngth)
        included[[ii]] <- paste(
            colnames(houth.glsleaps$which)[houth.glsleaps$which[ii,]],
            collapse=",")
    identify(x=houth.glsleaps$size, y=houth.glsleaps$Cp, labels=included)
    ecmet.prompt("Now use regsubsets instead of leaps")
    ## regsubsets(), summary() and plot() dispatch to the leaps methods
    houth.glsrss <- regsubsets(x=cbind(inc,mc4,mc6,mc8,gas6,gas8,cap),
        y=kwh, weights=cust, method="exhaustive")
    print(summary(houth.glsrss))
    plot(houth.glsrss, scale="Cp")
    ecmet.prompt("Now order the variables")
    houth.glsrsord <- regsubsets(x=cbind(inc,mc6,cap,gas6,gas8,mc8,mc4),
        y=kwh, weights=cust, method="exhaustive")
    print(summary(houth.glsrsord))
    plot(houth.glsrsord, scale="Cp")

Problem 233. Use avas to determine the “best” nonlinear transformations of the explanatory and the response variable. Since the weights are theoretically well justified, one should do it for the weighted regression. Which functions do you think one should use for the different regressors?

Problem 234. 3 points Then, as a check whether the transformation interfered with data selection, redo leaps, but now with the transformed variables.
Show that the GLS-regression Houthakker actually ran is the “best” regression among the following variables: inc, 1/mc4, 1/mc6, 1/mc8, gas6, gas8, cap, using either the Cp statistic or the adjusted $R^2$.

Problem 235. Diagnostics, i.e., the identification of outliers or influential observations, is something which we can do easily with R, although Berndt did not ask for it. The command houth.glsinf<-lm.influence(houth.glsfit) gives you the building blocks for many of the regression diagnostics statistics. Its output is a list of three objects: a matrix whose rows are all the least squares estimates $\hat\beta(i)$ when the $i$th observation is dropped, a vector with all the $s(i)$, and a vector with all the $h_{ii}$. A more extensive function is influence.measures(houth.glsfit); it has Cook’s distance and others.

In order to look at the residuals, use the command plot(resid(houth.glsfit), type="h") or plot(rstandard(houth.glsfit), type="h") or plot(rstudent(houth.glsfit), type="h"). To add the axis do abline(0,0). If you wanted to check the residuals for normality, you would use qqnorm(rstandard(houth.glsfit)).

Problem 236. Which commands do you need to plot the predictive residuals?

Problem 237. 4 points Although there is good theoretical justification for using cust as weights, one might wonder if the data bear this out. How can you check this?

Answer. Do plot(cust, rstandard(houth.olsfit)) and plot(cust, rstandard(houth.glsfit)). In the first plot, smaller numbers of customers have larger residuals; in the second plot this is mitigated. Also the OLS plot has two terrible outliers, which are brought more into range with GLS.

Problem 238. The variable cap does not measure the capacity of all electrical equipment owned by the households, but only those appliances which were leased from the Electric Utility company. A plot shows that people with higher income do not lease as much but presumably purchase their appliances outright.
Does this mean the variable should not be in the regression?

16.3. Long Term Data about US Economy

The dataset uslt is described in [DL91]. Home page of the authors is www.cepremap.cnrs.fr/~levy/. uslt has the variables kn, kg (net and gross capital stock in current $), kn2, kg2 (the same in 1982$), hours (hours worked), wage (hourly wage in current dollars), gnp, gnp2, nnp, inv2 (investment in 1982 dollars), r (profit rate (nnp − wage × hours)/kn), u (capacity utilization), kne, kge, kne2, kge2, inve2 (capital stock and investment data for equipment), kns, kgs, kns2, kgs2, invs2 (the same for structures).

Capital stock data were estimated separately for structures and equipment and then added up, i.e., kn2 = kne2 + kns2 etc. Capital stock since 1925 has been constructed from annual investment data, and prior to 1925 the authors of the series apparently went the other direction: they took someone’s published capital stock estimates and constructed investment from it. In the 1800s, only a few observations were available, which were then interpolated. The capacity utilization ratio is equal to the ratio of gnp2 to its trend, i.e., it may be negative.

Here are some possible commands for your R session: data(uslt) makes the data available; uslt.clean<-na.omit(uslt) removes missing values; this dataset starts in 1869 (instead of 1805). attach(uslt.clean) makes the variables in this dataset available. Now you can plot various series: for instance plot((nnp-hours*wage)/nnp, type="l") plots the profit share, or plot(gnp/gnp2, kg/kg2, type="l") gives you a scatter plot of the price level for capital goods versus that for gnp. The command plot(r, kn2/hours, type="b") gives both points and lines; type="o" overlays the points on the line. After the plot you may issue the command identify(r, kn2/hours, labels=1869:1989) and then click with the left mouse button on those data points in the plot for which you want to have the years printed.
If you want more than one time series on the same plot, you may do matplot(1869:1989, cbind(kn2,kns2), type="l"). If you want the y-axis logarithmic, say matplot(1869:1989, cbind(gnp/gnp2,kns/kns2,kne/kne2), type="l", log="y").

Problem 239. Computer assignment: Make a number of such plots on the screen, and import the most interesting ones into your word processor. Each class participant should write a short paper which shows the three most interesting plots, together with a written explanation why these plots seem interesting.

To use pairs or xgobi, you should carefully select the variables you want to include, and then you need the following preparations:

    usltsplom <- cbind(gnp2=gnp2, kn2=kn2, inv2=inv2, hours=hours, year=1869:1989)
    dimnames(usltsplom)[[1]] <- paste(1869:1989)

The dimnames function adds the row labels to the matrix, so that you can see which year it is. Then do pairs(usltsplom), or library(xgobi) and then xgobi(usltsplom).

You can also run regressions with commands of the following sort: lm.fit <- lm(formula = gnp2 ~ hours + kne2 + kns2). You can also fit a “generalized additive model” with the formula gam.fit <- gam(formula = gnp2 ~ s(hours) + s(kne2) + s(kns2)). This is related to the avas command we talked about in class. It is discussed in [CH93].

16.4. Dougherty Data

We have a new dataset, in both SAS and Splus, namely the data described in [Dou92]. There are more data than in the tables at the end of the book; prelcosm for instance is the relative price of cosmetics, it is 100*pcosm/ptpe, but apparently truncated at 5 digits.

16.5. Wage Data

The two datasets used in [Ber91, pp. 191–209] are available in R as the data frames cps78 and cps85. In R on unix, the data can be downloaded by cps78 <- readSDML("http://www.econ.utah.edu/ehrbar/data/cps78.sdml"), and correspondingly for cps85. The original data provided by Berndt contain many dummy variables.
The data frames in R have the same data coded as “factor” variables instead of dummies. These “factor” variables automatically generate dummies when included in the model statement.

cps78 consists of 550 randomly selected employed workers from the May 1978 current population survey, and cps85 consists of 534 randomly selected employed workers from the May 1985 current population survey. These are surveys of 50,000 households conducted monthly by the U.S. Department of Commerce. They serve as the basis for the national employment and unemployment statistics. Data are collected on a number of individual characteristics as well as employment status. The present extracts were performed by Leslie Sundt of the University of Arizona.

ed = years of education
ex = years of labor market experience (= age − ed − 6, or 0 if this is a negative number)
lnwage = natural logarithm of average hourly earnings
age = age in years
ndep = number of dependent children under 18 in household (only in cps78)
region has levels North, South
race has levels Other, Nonwhite, Hispanic. Nonwhite is mainly the Blacks, and Other is mainly the Non-Hispanic Whites.
gender has levels Male, Female
marr has levels Single, Married
union has levels Nonunion, Union
industry has levels Other, Manuf, and Constr
occupation has levels Other, Manag, Sales, Cler, Serv, and Prof

Here is a log of my commands for exercises 1 and 2 in [Ber91, pp. 194–197].

> cps78 <- readSDML("http://www.econ.utah.edu/ehrbar/data/cps78.sdml")
> attach(cps78)
> ###Exercise 1a (2 points) in chapter V of Berndt, p. 194
> #Here is the arithmetic mean of hourly wages:
> mean(exp(lnwage))
[1] 6.062766
> #Here is the geometric mean of hourly wages:
> #(Berndt’s instructions are apparently mis-formulated):
> exp(mean(lnwage))
[1] 5.370935
> #Geometric mean is lower than arithmetic, due to Jensen’s inequality
> #if the year has 2000 hours, this gives an annual wage of
> 2000*exp(mean(lnwage))
[1] 10741.87
> #What are arithmetic mean and standard deviation of years of schooling
> #and years of potential experience?
> mean(ed)
[1] 12.53636
> sqrt(var(ed))
[1] 2.772087
> mean(ex)
[1] 18.71818
> sqrt(var(ex))
[1] 13.34653
> #experience has much higher standard deviation than education, not surprising.
> ##Exercise 1b (1 point) can be answered with the two commands
> table(race)
 Hisp Nonwh Other
   36    57   457
> table(race, gender)
       gender
race    Female Male
  Hisp      12   24
  Nonwh     28   29
  Other    167  290
> #Berndt also asked for the sample means of certain dummy variables;
> #This has no interest in its own right but was an intermediate
> #step in order to compute the numbers of cases as above.
> ##Exercise 1c (2 points) can be answered using tapply
> tapply(ed,gender,mean)
  Female     Male
12.76329 12.39942
> #now the standard deviation:
> sqrt(tapply(ed,gender,var))
  Female     Male
2.220165 3.052312
> #Women do not have less education than men; it is about equal,
> #but their standard deviation is smaller
> #Now the geometric mean of the wage rate:
> exp(tapply(lnwage,gender,mean))
  Female     Male
4.316358 6.128320
> #Now do the same with race
> ##Exercise 1d (4 points)
> detach()
> ##This used to be my old command:
> cps85 <- read.table("~/dpkg/ecmet/usr/share/ecmet/usr/lib/R/library/ecmet/data/cps85.txt", h
> #But this should work for everyone (perhaps only on linux):
> cps85 <- readSDML("http://www.econ.utah.edu/ehrbar/data/cps85.sdml")
> attach(cps85)
> mean(exp(lnwage))
[1] 9.023947
> sqrt(var(lnwage))
[1] 0.5277335
> exp(mean(lnwage))
[1] 7.83955
> 2000*exp(mean(lnwage))
[1] 15679.1
> 2000*exp(mean(lnwage))/1.649
[1] 9508.248
> #real wage has fallen
> tapply(exp(lnwage), gender, mean)
  Female     Male
7.878743 9.994794
> tapply(exp(lnwage), gender, mean)/1.649
  Female     Male
4.777891 6.061125
> #Compare that with 4.791237 6.830132 in 1979:
> #Male real wages dropped much more than female wages
> ##Exercise 1e (3 points)
> #using cps85
> w <- mean(lnwage); w
[1] 2.059181
> s <- sqrt(var(lnwage)); s
[1] 0.5277335
> lnwagef <- factor(cut(lnwage, breaks = w+s*c(-4,-2,-1,0,1,2,4)))
> table(lnwagef)
lnwagef
(-0.0518,1]    (1,1.53] (1.53,2.06] (2.06,2.59] (2.59,3.11] (3.11,4.17]
          3          93         174         180          72          12
> ks.test(lnwage, "pnorm")

        One-sample Kolmogorov-Smirnov test

data:  lnwage
D = 0.8754, p-value = < 2.2e-16
alternative hypothesis: two.sided

> ks.test(lnwage, "pnorm", mean=w, sd =s)

        One-sample Kolmogorov-Smirnov test

data:  lnwage
D = 0.0426, p-value = 0.2879
alternative hypothesis: two.sided

> #Normal distribution not rejected
> #If we do the same thing with
> wage <- exp(lnwage)
> ks.test(wage, "pnorm", mean=mean(wage), sd =sqrt(var(wage)))

        One-sample Kolmogorov-Smirnov test

data:  wage
D = 0.1235, p-value = 1.668e-07
alternative hypothesis: two.sided

> #This is highly significant, therefore normality rejected
> #An alternative, simpler way to answer question 1e is by using qqnorm
> qqnorm(lnwage)
> qqnorm(exp(lnwage))
> #Note that the SAS proc univariate rejects that wage is normally distributed
> #but does not reject that lnwage is normally distributed.
> ###Exercise 2a (3 points), p. 196
> summary(lm(lnwage ~ ed, data = cps78))

Call:
lm(formula = lnwage ~ ed, data = cps78)

Residuals:
      Min        1Q    Median        3Q       Max
-2.123168 -0.331368 -0.007296  0.319713  1.594445

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.030445   0.092704  11.115  < 2e-16 ***
ed          0.051894   0.007221   7.187 2.18e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.469 on 548 degrees of freedom
Multiple R-Squared: 0.08613, Adjusted R-squared: 0.08447
F-statistic: 51.65 on 1 and 548 degrees of freedom, p-value: 2.181e-12

> #One year of education increases wages by 5 percent, but low R^2.
> #Mincer (5.18) had 7 percent for 1959
> #Now we need a 95 percent confidence interval for this coefficient
> 0.051894 + 0.007221*qt(0.975, 548)
[1] 0.06607823
> 0.051894 - 0.007221*qt(0.975, 548)
[1] 0.03770977
> ##Exercise 2b (3 points): Include union participation
> summary(lm(lnwage ~ union + ed, data=cps78))

Call:
lm(formula = lnwage ~ union + ed, data = cps78)

Residuals:
      Min        1Q    Median        3Q       Max
-2.331754 -0.294114  0.001475  0.263843  1.678532

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.859166   0.091630   9.376  < 2e-16 ***
unionUnion  0.305129   0.041800   7.300 1.02e-12 ***
ed          0.058122   0.006952   8.361 4.44e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4481 on 547 degrees of freedom
Multiple R-Squared: 0.1673, Adjusted R-squared: 0.1642
F-statistic: 54.93 on 2 and 547 degrees of freedom, p-value: 0

> exp(0.058)
[1] 1.059715
> exp(0.305129)
[1] 1.3568
> # Union members have 36 percent higher wages
> # The test whether union and nonunion members have the same intercept
> # is the same as the test whether the union dummy is 0.
> # t-value = 7.300 which is highly significant,
> # i.e., they are different.
> #The union variable is labeled unionUnion, because
> #it is labeled 1 for Union and 0 for Nonun. Check with the command
> contrasts(cps78$union)
      Union
Nonun     0
Union     1
> #One sees it also if one runs
> model.matrix(lnwage ~ union + ed, data=cps78)
  (Intercept) union ed
1           1     0 12
2           1     1 12
3           1     1 16
4           1     1 12
5           1     0 12
> #etc, rest of output flushed
> #and compares this with
> cps78$union[1:5]
[1] Nonun Union Union Union Nonun
Levels: Nonun Union
> #Consequently, the intercept for nonunion is 0.8592
> #and the intercept for union is 0.8592+0.3051=1.1643.
> #Can I have a different set of dummies constructed from this factor?
> #We will first do
> ##Exercise 2e (2 points)
> contrasts(union)<-matrix(c(1,0),nrow=2,ncol=1)
> #This generates a new contrast matrix
> #which covers up that in cps78
> #Note that I do not say "data=cps78" in the next command:
> summary(lm(lnwage ~ union + ed))

Call:
lm(formula = lnwage ~ union + ed)

Residuals:
      Min        1Q    Median        3Q       Max
-2.331754 -0.294114  0.001475  0.263843  1.678532

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.164295   0.090453  12.872  < 2e-16 ***
union1      -0.305129   0.041800  -7.300 1.02e-12 ***
ed           0.058122   0.006952   8.361 4.44e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4481 on 547 degrees of freedom
Multiple R-Squared: 0.1673, Adjusted R-squared: 0.1642
F-statistic: 54.93 on 2 and 547 degrees of freedom, p-value: 0

> #Here the coefficients are different,
> #but it is consistent with the above result.
> ##Exercise 2c (2 points): If I want to have two contrasts from this one dummy, I have to do
> contrasts(union,2)<-matrix(c(1,0,0,1),nrow=2,ncol=2)
> #The additional argument 2
> #specifies different number of contrasts than it expects
> #Now I have to suppress the intercept in the regression
> summary(lm(lnwage ~ union + ed - 1))

Call:
lm(formula = lnwage ~ union + ed - 1)

Residuals:
      Min        1Q    Median        3Q       Max
-2.331754 -0.294114  0.001475  0.263843  1.678532

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
union1 0.859166   0.091630   9.376  < 2e-16 ***
union2 1.164295   0.090453  12.872  < 2e-16 ***
ed     0.058122   0.006952   8.361 4.44e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4481 on 547 degrees of freedom
Multiple R-Squared: 0.9349, Adjusted R-squared: 0.9345
F-statistic: 2617 on 3 and 547 degrees of freedom, p-value: 0

> #actually it was unnecessary to construct the contrast matrix.
> #If we regress with a categorical variable without
> #an intercept, R will automatically use dummies for all levels:
> lm(lnwage ~ union + ed - 1, data=cps85)

Call:
lm(formula = lnwage ~ union + ed - 1, data = cps85)

Coefficients:
unionNonunion     unionUnion             ed
       0.9926         1.2909         0.0778

> ##Exercise 2d (1 point) Why is it not possible to include two dummies plus
> # an intercept? Because the two dummies add to 1,
> # you have perfect collinearity
> ###Exercise 3a (2 points):
> summary(lm(lnwage ~ ed + ex + I(ex^2), data=cps78))
> #All coefficients are highly significant, but the R^2 is only 0.2402
> #Returns to experience are positive and decline with increase in experience
> ##Exercise 3b (2 points):
> summary(lm(lnwage ~ gender + ed + ex + I(ex^2), data=cps78))
> contrasts(cps78$gender)
> #We see here that gender is coded 0 for female and 1 for male;
> #by default, the levels in a factor variable occur in alphabetical order.
> #Intercept in our regression = 0.1909203 (this is for female),
> #genderMale has coefficient = 0.3351771,
> #i.e., the intercept for men is 0.5260974
> #Gender is highly significant
> ##Exercise 3c (2 points):
> summary(lm(lnwage ~ gender + marr + ed + ex + I(ex^2), data=cps78))
> #Coefficient of marr in this is insignificant
> ##Exercise 3d (1 point) asks to construct a variable which we do
> #not need when we use factor variables
> ##Exercise 3e (3 points): For interaction term do
> summary(lm(lnwage ~ gender * marr + ed + ex + I(ex^2), data=cps78))

Call:
lm(formula = lnwage ~ gender * marr + ed + ex + I(ex^2), data = cps78)

Residuals:
     Min       1Q   Median       3Q      Max
-2.45524 -0.24566  0.01969  0.23102  1.42437

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            0.1893919  0.1042613   1.817  0.06984 .
genderMale             0.3908782  0.0467018   8.370 4.44e-16 ***
marrSingle             0.0507811  0.0557198   0.911  0.36251
ed                     0.0738640  0.0066154  11.165  < 2e-16 ***
ex                     0.0265297  0.0049741   5.334 1.42e-07 ***
I(ex^2)               -0.0003161  0.0001057  -2.990  0.00291 **
genderMale:marrSingle -0.1586452  0.0750830  -2.113  0.03506 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3959 on 543 degrees of freedom
Multiple R-Squared: 0.3547, Adjusted R-squared: 0.3476
F-statistic: 49.75 on 6 and 543 degrees of freedom, p-value: 0

> #Being married raises the wage for men by 13% but lowers it for women by 3%
> ###Exercise 4a (5 points):
> summary(lm(lnwage ~ union + gender + race + ed + ex + I(ex^2), data=cps78))

Call:
lm(formula = lnwage ~ union + gender + race + ed + ex + I(ex^2), data = cps78)

Residuals:
     Min       1Q   Median       3Q      Max
-2.41914 -0.23674  0.01682  0.21821  1.31584

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.1549723  0.1068589   1.450  0.14757
unionUnion   0.2071429  0.0368503   5.621 3.04e-08 ***
genderMale   0.3060477  0.0344415   8.886  < 2e-16 ***
raceNonwh   -0.1301175  0.0830156  -1.567  0.11761
raceOther    0.0271477  0.0688277   0.394  0.69342
ed           0.0746097  0.0066521  11.216  < 2e-16 ***
ex           0.0261914  0.0047174   5.552 4.43e-08 ***
I(ex^2)     -0.0003082  0.0001015  -3.035  0.00252 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3845 on 542 degrees of freedom
Multiple R-Squared: 0.3924, Adjusted R-squared: 0.3846
F-statistic: 50.01 on 7 and 542 degrees of freedom, p-value: 0

> exp(-0.1301175)
[1] 0.8779923
> #Being Hispanic lowers wages by 2.7%, but being black lowers them
> #by 12.2 %
> #At what level of ex is lnwage maximized?
> #exeffect = 0.0261914 * ex -0.0003082 * ex^2
> #derivative = 0.0261914 - 2 * 0.0003082 * ex
> #derivative = 0 for ex=0.0261914/(2*0.0003082)
> 0.0261914/(2*0.0003082)
[1] 42.49091
> # age - ed - 6 = 42.49091
> # age = ed + 48.49091
> # for 8, 12, and 16 years of schooling the max earnings
> # are at ages 56.5, 60.5, and 64.5 years
> ##Exercise 4b (4 points) is a graph, not done here
> ##Exercise 4c (5 points)
> summary(lm(lnwage ~ gender + union + race + ed + ex + I(ex^2) + I(ed*ex), data=cps78))

Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2) + I(ed * ex), data = cps78)

Residuals:
     Min       1Q   Median       3Q      Max
-2.41207 -0.23922  0.01463  0.21645  1.32051

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0396495  0.1789073   0.222 0.824693
genderMale   0.3042639  0.0345241   8.813  < 2e-16 ***
unionUnion   0.2074045  0.0368638   5.626 2.96e-08 ***
raceNonwh   -0.1323898  0.0830908  -1.593 0.111673
raceOther    0.0319829  0.0691124   0.463 0.643718
ed           0.0824154  0.0117716   7.001 7.55e-12 ***
ex           0.0328854  0.0095716   3.436 0.000636 ***
I(ex^2)     -0.0003574  0.0001186  -3.013 0.002704 **
I(ed * ex)  -0.0003813  0.0004744  -0.804 0.421835
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3846 on 541 degrees of freedom
Multiple R-Squared: 0.3932,     Adjusted R-squared: 0.3842
F-statistic: 43.81 on 8 and 541 degrees of freedom,     p-value: 0

> #Maximum earnings ages must be computed as before
> ##Exercise 4d (4 points) not done here
> ##Exercise 4e (6 points) not done here
> ###Exercise 5a (3 points):
> #Naive approach to estimate impact of unionization on wages:
> summary(lm(lnwage ~ gender + union + race + ed + ex + I(ex^2), data=cps78))

Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2), data = cps78)

Residuals:
     Min       1Q   Median       3Q      Max
-2.41914 -0.23674  0.01682  0.21821  1.31584

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.1549723  0.1068589   1.450  0.14757
genderMale   0.3060477  0.0344415   8.886  < 2e-16 ***
unionUnion   0.2071429  0.0368503   5.621 3.04e-08 ***
raceNonwh   -0.1301175  0.0830156  -1.567  0.11761
raceOther    0.0271477  0.0688277   0.394  0.69342
ed           0.0746097  0.0066521  11.216  < 2e-16 ***
ex           0.0261914  0.0047174   5.552 4.43e-08 ***
I(ex^2)     -0.0003082  0.0001015  -3.035  0.00252 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3845 on 542 degrees of freedom
Multiple R-Squared: 0.3924,     Adjusted R-squared: 0.3846
F-statistic: 50.01 on 7 and 542 degrees of freedom,     p-value: 0

> # What is wrong with the above? It assumes that unions
> # only affect the intercept, everything else is the same
> ##Exercise 5b (2 points)
> tapply(lnwage, union, mean)
   Nonun    Union
1.600901 1.863137
> tapply(ed, union, mean)
   Nonun    Union
12.76178 12.02381
> table(gender, union)
        union
gender   Nonun Union
  Female   159    48
  Male     223   120
> table(race, union)
       union
race    Nonun Union
  Hisp     29     7
  Nonwh    35    22
  Other   318   139
> 7/(7+29)
[1] 0.1944444
> 22/(22+35)
[1] 0.3859649
> 139/(318+139)
[1] 0.3041575
> #19% of Hispanic, 39% of Nonwhite, and 30% of other (white) workers
> #in the sample are in unions
> ##Exercise 5c (3 points)
> summary(lm(lnwage ~ gender + race + ed + ex + I(ex^2), data=cps78, subset=union == "Union"))

Call:
lm(formula = lnwage ~ gender + race + ed + ex + I(ex^2), data = cps78, subset = union == "Union")

Residuals:
    Min      1Q  Median      3Q     Max
-2.3307 -0.1853  0.0160  0.2199  1.1992

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.9261456  0.2321964   3.989 0.000101 ***
genderMale   0.2239370  0.0684894   3.270 0.001317 **
raceNonwh   -0.3066717  0.1742287  -1.760 0.080278 .
raceOther   -0.0741660  0.1562131  -0.475 0.635591
ed           0.0399500  0.0138311   2.888 0.004405 **
ex           0.0313820  0.0098938   3.172 0.001814 **
I(ex^2)     -0.0004526  0.0002022  -2.239 0.026535 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3928 on 161 degrees of freedom
Multiple R-Squared: 0.2019,     Adjusted R-squared: 0.1721
F-statistic: 6.787 on 6 and 161 degrees of freedom,     p-value: 1.975e-06

> summary(lm(lnwage ~ gender + race + ed + ex + I(ex^2), data=cps78, subset=union == "Nonun"))

Call:
lm(formula = lnwage ~ gender + race + ed + ex + I(ex^2), data = cps78, subset = union == "Nonun")

Residuals:
     Min       1Q   Median       3Q      Max
-1.39107 -0.23775  0.01040  0.23337  1.29073

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0095668  0.1193399  -0.080   0.9361
genderMale   0.3257661  0.0397961   8.186 4.22e-15 ***
raceNonwh   -0.0652018  0.0960570  -0.679   0.4977
raceOther    0.0444133  0.0761628   0.583   0.5602
ed           0.0852212  0.0075554  11.279  < 2e-16 ***
ex           0.0253813  0.0053710   4.726 3.25e-06 ***
I(ex^2)     -0.0002841  0.0001187  -2.392   0.0172 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3778 on 375 degrees of freedom
Multiple R-Squared: 0.4229,     Adjusted R-squared: 0.4137
F-statistic: 45.8 on 6 and 375 degrees of freedom,     p-value: 0

> #Are union-nonunion differences larger for females than males?
> #For this look at the intercepts for males and females in
> #the two regressions. Say for white males and females:
> 0.9261456-0.0741660+0.2239370
[1] 1.075917
> 0.9261456-0.0741660
[1] 0.8519796
> -0.0095668+0.0444133+0.3257661
[1] 0.3606126
> -0.0095668+0.0444133
[1] 0.0348465
> 1.075917-0.3606126
[1] 0.7153044
> 0.8519796-0.0348465
[1] 0.8171331
> #             White Males  White Females
> #Union        1.075917     0.8519796
> #Nonunion     0.3606126    0.0348465
> #Difference   0.7153044    0.8171331
> #Difference is greater for women
> ###Exercise 6a (5 points)
> summary(lm(lnwage ~ gender + union + race + ed + ex + I(ex^2)))

Call:
lm(formula = lnwage ~ gender + union + race + ed + ex + I(ex^2))

Residuals:
     Min       1Q   Median       3Q      Max
-2.41914 -0.23674  0.01682  0.21821  1.31584

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.1549723  0.1068589   1.450  0.14757
genderMale   0.3060477  0.0344415   8.886  < 2e-16 ***
unionUnion   0.2071429  0.0368503   5.621 3.04e-08 ***
raceNonwh   -0.1301175  0.0830156  -1.567  0.11761
raceOther    0.0271477  0.0688277   0.394  0.69342
ed           0.0746097  0.0066521  11.216  < 2e-16 ***
ex           0.0261914  0.0047174   5.552 4.43e-08 ***
I(ex^2)     -0.0003082  0.0001015  -3.035  0.00252 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3845 on 542 degrees of freedom
Multiple R-Squared: 0.3924,     Adjusted R-squared: 0.3846
F-statistic: 50.01 on 7 and 542 degrees of freedom,     p-value: 0

> #To test whether Nonwh and Hisp have same intercept
> #one might generate a contrast matrix which collapses those
> #two and then run it and make an F-test
> #or make a contrast matrix which has this difference as one of
> #the dummies and use the t-test for that dummy
> ##Exercise 6b (2 points)
> table(race)
race
 Hisp Nonwh Other
   36    57   457
> tapply(lnwage, race, mean)
    Hisp    Nonwh    Other
1.529647 1.513404 1.713829
> tapply(lnwage, race, ed)
Error in get(x, envir, mode, inherits) : variable "ed" was not found
> tapply(ed, race, mean)
    Hisp    Nonwh    Other
10.30556 11.71930 12.81400
> table(gender, race)
        race
gender   Hisp Nonwh Other
  Female   12    28   167
  Male     24    29   290
> #Blacks, almost as many women as men, hispanic twice as many men,
> #Whites in between
> #Additional stuff:
> #There are two outliers in cps78 with wages of less than $1 per hour,
> #Both service workers, perhaps waitresses who did not report their tips?
> #What are the commands for extracting certain observations
> #by certain criteria and just print them? The split command.
> #Interesting to do
> loess(lnwage ~ ed + ex, data=cps78)
> #loess is appropriate here because there are strong interaction terms
> #How can one do loess after taking out the effects of gender for instance?
> #Try the following, but I did not try it out yet:
> gam(lnwage ~ lo(ed,ex) + gender, data=cps78)
> #I should put more plotting commands in!

CHAPTER 17

The Mean Squared Error as an Initial Criterion of Precision

The question of how “close” two random variables are to each other is a central concern in statistics. The goal of statistics is to find observed random variables which are “close” to the unobserved parameters or random outcomes of interest. These observed random variables are usually called “estimators” if the unobserved magnitude is nonrandom, and “predictors” if it is random. For scalar random variables we will use the mean squared error as a criterion for closeness. Its definition is $\operatorname{MSE}[\hat\phi;\phi]$ (read it: mean squared error of $\hat\phi$ as an estimator or predictor, whatever the case may be, of $\phi$):

(17.0.1)  $\operatorname{MSE}[\hat\phi;\phi] = \operatorname{E}[(\hat\phi-\phi)^2]$

For our purposes, therefore, the estimator (or predictor) $\hat\phi$ of the unknown parameter (or unobserved random variable) $\phi$ is no worse than the alternative $\tilde\phi$ if $\operatorname{MSE}[\hat\phi;\phi] \le \operatorname{MSE}[\tilde\phi;\phi]$. This is a criterion which can be applied before any observations are collected and actual estimations are made; it is an “initial” criterion regarding the expected average performance in a series of future trials (even though, in economics, usually only one trial is made).

17.1. Comparison of Two Vector Estimators

If one wants to compare two vector estimators, say $\hat\phi$ and $\tilde\phi$, it is often impossible to say which of two estimators is better. It may be the case that $\hat\phi_1$ is better than $\tilde\phi_1$ (in terms of MSE or some other criterion), but $\hat\phi_2$ is worse than $\tilde\phi_2$. And even if every component $\phi_i$ is estimated better by $\hat\phi_i$ than by $\tilde\phi_i$, certain linear combinations $t^\top\phi$ of the components of $\phi$ may be estimated better by $t^\top\tilde\phi$ than by $t^\top\hat\phi$.

Problem 240. 2 points Construct an example of two vector estimators $\hat\phi$ and $\tilde\phi$ of the same random vector $\phi = [\phi_1\ \phi_2]^\top$, so that $\operatorname{MSE}[\hat\phi_i;\phi_i] < \operatorname{MSE}[\tilde\phi_i;\phi_i]$ for $i = 1, 2$ but $\operatorname{MSE}[\hat\phi_1+\hat\phi_2;\phi_1+\phi_2] > \operatorname{MSE}[\tilde\phi_1+\tilde\phi_2;\phi_1+\phi_2]$. Hint: it is easiest to use an example in which all random variables are constants. Another hint: the geometric analog would be to find two vectors in a plane $\hat\phi$ and $\tilde\phi$. In each component (i.e., projection on the axes), $\hat\phi$ is closer to the origin than $\tilde\phi$. But in the projection on the diagonal, $\tilde\phi$ is closer to the origin than $\hat\phi$.

Answer. In the simplest counterexample, all variables involved are constants: $\phi = [0\ 0]^\top$, $\hat\phi = [1\ 1]^\top$, and $\tilde\phi = [-2\ 2]^\top$.

One can only then say unambiguously that the vector $\hat\phi$ is a no worse estimator than $\tilde\phi$ if its MSE is smaller or equal for every linear combination. Theorem 17.1.1 will show that this is the case if and only if the MSE-matrix of $\hat\phi$ is smaller, by a nonnegative definite matrix, than that of $\tilde\phi$. If this is so, then theorem 17.1.1 says that not only the MSE of all linear transformations, but also all other nonnegative definite quadratic loss functions involving these vectors (such as the trace of the MSE-matrix, which is an often-used criterion) are minimized. In order to formulate and prove this, we first need a formal definition of the MSE-matrix. We write $\mathcal{MSE}$ for the matrix and $\operatorname{MSE}$ for the scalar mean squared error. The MSE-matrix of $\hat\phi$ as an estimator of $\phi$ is defined as

(17.1.1)  $\mathcal{MSE}[\hat\phi;\phi] = \mathcal{E}[(\hat\phi-\phi)(\hat\phi-\phi)^\top]$.

Problem 241. 2 points Let $\theta$ be a vector of possibly random parameters, and $\hat\theta$ an estimator of $\theta$. Show that

(17.1.2)  $\mathcal{MSE}[\hat\theta;\theta] = \mathcal{V}[\hat\theta-\theta] + (\mathcal{E}[\hat\theta-\theta])(\mathcal{E}[\hat\theta-\theta])^\top$.

Don’t assume the scalar result but make a proof that is good for vectors and scalars.

Answer. For any random vector $x$ follows
$\mathcal{E}[xx^\top] = \mathcal{E}[(x-\mathcal{E}[x]+\mathcal{E}[x])(x-\mathcal{E}[x]+\mathcal{E}[x])^\top]$
$= \mathcal{E}[(x-\mathcal{E}[x])(x-\mathcal{E}[x])^\top] + \mathcal{E}[(x-\mathcal{E}[x])\,\mathcal{E}[x]^\top] + \mathcal{E}[\mathcal{E}[x](x-\mathcal{E}[x])^\top] + \mathcal{E}[\mathcal{E}[x]\,\mathcal{E}[x]^\top]$
$= \mathcal{V}[x] + O + O + \mathcal{E}[x]\,\mathcal{E}[x]^\top$.
Setting $x = \hat\theta-\theta$ the statement follows.

If $\theta$ is nonrandom, formula (17.1.2) simplifies slightly, since in this case $\mathcal{V}[\hat\theta-\theta] = \mathcal{V}[\hat\theta]$. In this case, the MSE-matrix is the covariance matrix plus the squared bias matrix. If $\theta$ is nonrandom and in addition $\hat\theta$ is unbiased, then the MSE-matrix coincides with the covariance matrix.

Theorem 17.1.1. Assume $\hat\phi$ and $\tilde\phi$ are two estimators of the parameter $\phi$ (which is allowed to be random itself). Then conditions (17.1.3), (17.1.4), and (17.1.5) are equivalent:

(17.1.3)  For every constant vector $t$, $\operatorname{MSE}[t^\top\hat\phi;t^\top\phi] \le \operatorname{MSE}[t^\top\tilde\phi;t^\top\phi]$
(17.1.4)  $\mathcal{MSE}[\tilde\phi;\phi] - \mathcal{MSE}[\hat\phi;\phi]$ is a nonnegative definite matrix
(17.1.5)  For every nnd $\Theta$, $\operatorname{E}[(\hat\phi-\phi)^\top\Theta(\hat\phi-\phi)] \le \operatorname{E}[(\tilde\phi-\phi)^\top\Theta(\tilde\phi-\phi)]$.

Proof. Call $\mathcal{MSE}[\tilde\phi;\phi] = \sigma^2\Xi$ and $\mathcal{MSE}[\hat\phi;\phi] = \sigma^2\Omega$. To show that (17.1.3) implies (17.1.4), simply note that $\operatorname{MSE}[t^\top\hat\phi;t^\top\phi] = \sigma^2 t^\top\Omega t$ and likewise $\operatorname{MSE}[t^\top\tilde\phi;t^\top\phi] = \sigma^2 t^\top\Xi t$. Therefore (17.1.3) is equivalent to $t^\top(\Xi-\Omega)t \ge 0$ for all $t$, which is the defining property making $\Xi-\Omega$ nonnegative definite.

Here is the proof that (17.1.4) implies (17.1.5):
$\operatorname{E}[(\hat\phi-\phi)^\top\Theta(\hat\phi-\phi)] = \operatorname{E}[\operatorname{tr}((\hat\phi-\phi)^\top\Theta(\hat\phi-\phi))]$
$= \operatorname{E}[\operatorname{tr}(\Theta(\hat\phi-\phi)(\hat\phi-\phi)^\top)]$
$= \operatorname{tr}(\Theta\,\mathcal{E}[(\hat\phi-\phi)(\hat\phi-\phi)^\top]) = \sigma^2\operatorname{tr}(\Theta\Omega)$
and in the same way
$\operatorname{E}[(\tilde\phi-\phi)^\top\Theta(\tilde\phi-\phi)] = \sigma^2\operatorname{tr}(\Theta\Xi)$.
The difference in the expected quadratic forms is therefore $\sigma^2\operatorname{tr}(\Theta(\Xi-\Omega))$. By assumption, $\Xi-\Omega$ is nonnegative definite. Therefore, by theorem A.5.6 in the Mathematical Appendix, or by Problem 242 below, this trace is nonnegative.

To complete the proof, (17.1.5) has (17.1.3) as a special case if one sets $\Theta = tt^\top$.

Problem 242. Show that if $\Theta$ and $\Sigma$ are symmetric and nonnegative definite, then $\operatorname{tr}(\Theta\Sigma) \ge 0$.
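You are allowed to use that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, that the trace of a nonnegative definite matrix is $\ge 0$, and Problem 118 (which is trivial).

Answer. Write $\Theta = RR^\top$; then $\operatorname{tr}(\Theta\Sigma) = \operatorname{tr}(RR^\top\Sigma) = \operatorname{tr}(R^\top\Sigma R) \ge 0$.

Problem 243. Consider two very simple-minded estimators of the unknown nonrandom parameter vector $\phi = [\phi_1\ \phi_2]^\top$. Neither of these estimators depends on any observations, they are constants. The first estimator is $\hat\phi = [11\ 11]^\top$, and the second is $\tilde\phi = [12\ 8]^\top$.

• a. 2 points Compute the MSE-matrices of these two estimators if the true value of the parameter vector is $\phi = [10\ 10]^\top$. For which estimator is the trace of the MSE-matrix smaller?

Answer. $\hat\phi$ has smaller trace of the MSE-matrix.
$\hat\phi - \phi = \begin{bmatrix}1\\1\end{bmatrix}$
$\mathcal{MSE}[\hat\phi;\phi] = \mathcal{E}[(\hat\phi-\phi)(\hat\phi-\phi)^\top] = \mathcal{E}\Bigl[\begin{bmatrix}1\\1\end{bmatrix}\begin{bmatrix}1&1\end{bmatrix}\Bigr] = \begin{bmatrix}1&1\\1&1\end{bmatrix}$
$\tilde\phi - \phi = \begin{bmatrix}2\\-2\end{bmatrix}$
$\mathcal{MSE}[\tilde\phi;\phi] = \begin{bmatrix}4&-4\\-4&4\end{bmatrix}$
Note that both MSE-matrices are singular, i.e., both estimators allow an error-free look at certain linear combinations of the parameter vector.

• b. 1 point Give two vectors $g = [g_1\ g_2]^\top$ and $h = [h_1\ h_2]^\top$ satisfying $\operatorname{MSE}[g^\top\hat\phi;g^\top\phi] < \operatorname{MSE}[g^\top\tilde\phi;g^\top\phi]$ and $\operatorname{MSE}[h^\top\hat\phi;h^\top\phi] > \operatorname{MSE}[h^\top\tilde\phi;h^\top\phi]$ ($g$ and $h$ are not unique; there are many possibilities).

Answer. With $g = [1\ {-1}]^\top$ and $h = [1\ 1]^\top$ for instance we get $g^\top\hat\phi - g^\top\phi = 0$, $g^\top\tilde\phi - g^\top\phi = 4$, $h^\top\hat\phi - h^\top\phi = 2$, $h^\top\tilde\phi - h^\top\phi = 0$, therefore $\operatorname{MSE}[g^\top\hat\phi;g^\top\phi] = 0$, $\operatorname{MSE}[g^\top\tilde\phi;g^\top\phi] = 16$, $\operatorname{MSE}[h^\top\hat\phi;h^\top\phi] = 4$, $\operatorname{MSE}[h^\top\tilde\phi;h^\top\phi] = 0$. An alternative way to compute this is e.g.
$\operatorname{MSE}[g^\top\tilde\phi;g^\top\phi] = \begin{bmatrix}1&-1\end{bmatrix}\begin{bmatrix}4&-4\\-4&4\end{bmatrix}\begin{bmatrix}1\\-1\end{bmatrix} = 16$

• c. 1 point Show that neither $\mathcal{MSE}[\hat\phi;\phi] - \mathcal{MSE}[\tilde\phi;\phi]$ nor $\mathcal{MSE}[\tilde\phi;\phi] - \mathcal{MSE}[\hat\phi;\phi]$ is a nonnegative definite matrix. Hint: you are allowed to use the mathematical fact that if a matrix is nonnegative definite, then its determinant is nonnegative.

Answer.
(17.1.6)  $\mathcal{MSE}[\tilde\phi;\phi] - \mathcal{MSE}[\hat\phi;\phi] = \begin{bmatrix}3&-5\\-5&3\end{bmatrix}$
Its determinant is $9 - 25 = -16$, which is negative, and the determinant of its negative is also negative.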
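The arithmetic of Problem 243 can be checked mechanically. The following is only a minimal sketch in plain Python (the helper names `outer`, `quad_form` and `det2` are illustrative, not part of the notes); because both estimators are constants, each MSE-matrix is simply the outer product of the error vector with itself.

```python
# Numerical check of Problem 243. Both "estimators" are constants, so the
# MSE-matrix is just the outer product of the error vector with itself.

def outer(u, v):
    """Outer product u v' of two vectors given as lists."""
    return [[ui * vj for vj in v] for ui in u]

def quad_form(t, m):
    """Quadratic form t' M t, i.e. the scalar MSE of the combination t'phi."""
    n = len(t)
    return sum(t[i] * m[i][j] * t[j] for i in range(n) for j in range(n))

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

phi = [10, 10]        # true parameter vector
phi_hat = [11, 11]    # first constant estimator
phi_tilde = [12, 8]   # second constant estimator

err_hat = [a - b for a, b in zip(phi_hat, phi)]
err_tilde = [a - b for a, b in zip(phi_tilde, phi)]
mse_hat = outer(err_hat, err_hat)
mse_tilde = outer(err_tilde, err_tilde)

# Part a: traces of the two MSE-matrices
trace_hat = mse_hat[0][0] + mse_hat[1][1]
trace_tilde = mse_tilde[0][0] + mse_tilde[1][1]

# Part b: scalar MSEs of the linear combinations g'phi and h'phi
g, h = [1, -1], [1, 1]
mse_g = (quad_form(g, mse_hat), quad_form(g, mse_tilde))
mse_h = (quad_form(h, mse_hat), quad_form(h, mse_tilde))

# Part c: the difference matrix and the determinants of it and its negative
diff = [[mse_tilde[i][j] - mse_hat[i][j] for j in range(2)] for i in range(2)]
neg_diff = [[-v for v in row] for row in diff]

print(mse_hat, mse_tilde, trace_hat, trace_tilde)
print(mse_g, mse_h)
print(diff, det2(diff), det2(neg_diff))
```

The combination `g` favors the first estimator and `h` the second, which is exactly why neither difference of the MSE-matrices can be nonnegative definite.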
CHAPTER 18

Sampling Properties of the Least Squares Estimator

The estimator $\hat\beta$ was derived from a geometric argument, and everything which we showed so far are what [DM93, p. 3] calls its numerical as opposed to its statistical properties. But $\hat\beta$ has also nice statistical or sampling properties. We are assuming right now the specification given in (14.1.3), in which $X$ is an arbitrary matrix of full column rank, and we are not assuming that the errors must be Normally distributed. The assumption that $X$ is nonrandom means that repeated samples are taken with the same $X$-matrix. This is often true for experimental data, but not in econometrics. The sampling properties which we are really interested in are those where also the $X$-matrix is random; we will derive those later. For this later derivation, the properties with fixed $X$-matrix, which we are going to discuss presently, will be needed as an intermediate step. The assumption of fixed $X$ is therefore a preliminary technical assumption, to be dropped later.

In order to know how good the estimator $\hat\beta$ is, one needs the statistical properties of its “sampling error” $\hat\beta-\beta$. This sampling error has the following formula:

(18.0.7)  $\hat\beta-\beta = (X^\top X)^{-1}X^\top y - (X^\top X)^{-1}X^\top X\beta = (X^\top X)^{-1}X^\top(y-X\beta) = (X^\top X)^{-1}X^\top\varepsilon$

From (18.0.7) follows immediately that $\hat\beta$ is unbiased, since $\mathcal{E}[(X^\top X)^{-1}X^\top\varepsilon] = o$. Unbiasedness does not make an estimator better, but many good estimators are unbiased, and it simplifies the math.

We will use the MSE-matrix as a criterion for how good an estimator of a vector of unobserved parameters is. Chapter 17 gave some reasons why this is a sensible criterion (compare [DM93, Chapter 5.5]).

18.1. The Gauss Markov Theorem

Returning to the least squares estimator $\hat\beta$, one obtains, using (18.0.7), that

(18.1.1)  $\mathcal{MSE}[\hat\beta;\beta] = \mathcal{E}[(\hat\beta-\beta)(\hat\beta-\beta)^\top] = (X^\top X)^{-1}X^\top\,\mathcal{E}[\varepsilon\varepsilon^\top]\,X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}$.

This is a very simple formula.
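The matrix algebra behind (18.1.1) can be verified exactly on a small example. The sketch below assumes a hypothetical 4×2 design matrix (an intercept column plus one regressor) and a hypothetical value σ² = 3; it uses exact rational arithmetic from the Python standard library, and the helper names are illustrative only.

```python
from fractions import Fraction

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

# Hypothetical fixed design: intercept column and one regressor (assumed values)
X = [[Fraction(1), Fraction(x)] for x in (1, 2, 4, 7)]
sigma2 = Fraction(3)  # assumed disturbance variance
n = len(X)

Xt = transpose(X)
XtX_inv = inv2(matmul(Xt, X))
A = matmul(XtX_inv, Xt)  # sampling error is A * eps, see (18.0.7)

# E[eps eps'] = sigma^2 I under the basic assumptions
E_ee = [[sigma2 if i == j else Fraction(0) for j in range(n)] for i in range(n)]

# Middle expression of (18.1.1): (X'X)^{-1} X' E[eps eps'] X (X'X)^{-1}
mse_matrix = matmul(matmul(A, E_ee), transpose(A))

# Right-hand side of (18.1.1): sigma^2 (X'X)^{-1}
expected = [[sigma2 * v for v in row] for row in XtX_inv]

print(mse_matrix == expected)
```

Note that no value of β enters the computation at any point, which is the property the text turns to next.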
Its most interesting aspect is that this MSE-matrix does not depend on the value of the true $\beta$. In particular this means that it is bounded with respect to $\beta$, which is important for someone who wants to be assured of a certain accuracy even in the worst possible situation.

Problem 244. 2 points Compute the MSE-matrix $\mathcal{MSE}[\hat\varepsilon;\varepsilon] = \mathcal{E}[(\hat\varepsilon-\varepsilon)(\hat\varepsilon-\varepsilon)^\top]$ of the residuals as predictors of the disturbances.

Answer. Write $\hat\varepsilon-\varepsilon = M\varepsilon-\varepsilon = (M-I)\varepsilon = -X(X^\top X)^{-1}X^\top\varepsilon$; therefore $\mathcal{MSE}[\hat\varepsilon;\varepsilon] = \mathcal{E}[X(X^\top X)^{-1}X^\top\varepsilon\varepsilon^\top X(X^\top X)^{-1}X^\top] = \sigma^2X(X^\top X)^{-1}X^\top$. Alternatively, start with $\hat\varepsilon-\varepsilon = y-\hat y-\varepsilon = X\beta-\hat y = X(\beta-\hat\beta)$. This allows one to use $\mathcal{MSE}[\hat\varepsilon;\varepsilon] = X\,\mathcal{MSE}[\hat\beta;\beta]\,X^\top = \sigma^2X(X^\top X)^{-1}X^\top$.

Problem 245. 2 points Let $v$ be a random vector that is a linear transformation of $y$, i.e., $v = Ty$ for some constant matrix $T$. Furthermore $v$ satisfies $\mathcal{E}[v] = o$. Show that from this follows $v = T\hat\varepsilon$. (In other words, no other transformation of $y$ with zero expected value is more “comprehensive” than $\hat\varepsilon$. However there are many other transformations of $y$ with zero expected value which are as “comprehensive” as $\hat\varepsilon$.)

Answer. $\mathcal{E}[v] = TX\beta$ must be $o$ whatever the value of $\beta$. Therefore $TX = O$, from which follows $TM = T$. Since $\hat\varepsilon = My$, this gives immediately $v = T\hat\varepsilon$. (This is the statistical implication of the mathematical fact that $M$ is a deficiency matrix of $X$.)

Problem 246. 2 points Show that $\hat\beta$ and $\hat\varepsilon$ are uncorrelated, i.e., $\operatorname{cov}[\hat\beta_i,\hat\varepsilon_j] = 0$ for all $i$, $j$. Defining the covariance matrix $\mathcal{C}[\hat\beta,\hat\varepsilon]$ as that matrix whose $(i,j)$ element is $\operatorname{cov}[\hat\beta_i,\hat\varepsilon_j]$, this can also be written as $\mathcal{C}[\hat\beta,\hat\varepsilon] = O$. Hint: The covariance matrix satisfies the rules $\mathcal{C}[Ay,Bz] = A\,\mathcal{C}[y,z]\,B^\top$ and $\mathcal{C}[y,y] = \mathcal{V}[y]$.
(Other rules for the covariance matrix, which will not be needed here, are $\mathcal{C}[z,y] = (\mathcal{C}[y,z])^\top$, $\mathcal{C}[x+y,z] = \mathcal{C}[x,z]+\mathcal{C}[y,z]$, $\mathcal{C}[x,y+z] = \mathcal{C}[x,y]+\mathcal{C}[x,z]$, and $\mathcal{C}[y,c] = O$ if $c$ is a vector of constants.)

Answer. $A = (X^\top X)^{-1}X^\top$ and $B = I-X(X^\top X)^{-1}X^\top$, therefore $\mathcal{C}[\hat\beta,\hat\varepsilon] = \sigma^2(X^\top X)^{-1}X^\top(I-X(X^\top X)^{-1}X^\top) = O$.

Problem 247. 4 points Let $y = X\beta+\varepsilon$ be a regression model with intercept, in which the first column of $X$ is the vector $\iota$, and let $\hat\beta$ be the least squares estimator of $\beta$. Show that the covariance matrix between $\bar y$ and $\hat\beta$, which is defined as the matrix (here consisting of one row only) that contains all the covariances

(18.1.2)  $\mathcal{C}[\bar y,\hat\beta] \equiv \begin{bmatrix}\operatorname{cov}[\bar y,\hat\beta_1] & \operatorname{cov}[\bar y,\hat\beta_2] & \cdots & \operatorname{cov}[\bar y,\hat\beta_k]\end{bmatrix}$

has the following form: $\mathcal{C}[\bar y,\hat\beta] = \frac{\sigma^2}{n}\begin{bmatrix}1 & 0 & \cdots & 0\end{bmatrix}$ where $n$ is the number of observations. Hint: That the regression has an intercept term as first column of the $X$-matrix means that $Xe^{(1)} = \iota$, where $e^{(1)}$ is the unit vector having 1 in the first place and zeros elsewhere, and $\iota$ is the vector which has ones everywhere.

Answer. Write both $\bar y$ and $\hat\beta$ in terms of $y$, i.e., $\bar y = \frac{1}{n}\iota^\top y$ and $\hat\beta = (X^\top X)^{-1}X^\top y$. Therefore

(18.1.3)  $\mathcal{C}[\bar y,\hat\beta] = \frac{1}{n}\iota^\top\,\mathcal{V}[y]\,X(X^\top X)^{-1} = \frac{\sigma^2}{n}\iota^\top X(X^\top X)^{-1} = \frac{\sigma^2}{n}e^{(1)\top}X^\top X(X^\top X)^{-1} = \frac{\sigma^2}{n}e^{(1)\top}$.

Theorem 18.1.1. Gauss-Markov Theorem: $\hat\beta$ is the BLUE (Best Linear Unbiased Estimator) of $\beta$ in the following vector sense: for every nonrandom coefficient vector $t$, $t^\top\hat\beta$ is the scalar BLUE of $t^\top\beta$, i.e., every other linear unbiased estimator $\tilde\phi = a^\top y$ of $\phi = t^\top\beta$ has a bigger MSE than $t^\top\hat\beta$.

Proof. Write the alternative linear estimator $\tilde\phi = a^\top y$ in the form

(18.1.4)  $\tilde\phi = \bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)y$

then the sampling error is

(18.1.5)  $\tilde\phi-\phi = \bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)(X\beta+\varepsilon) - t^\top\beta = \bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)\varepsilon + c^\top X\beta$.

By assumption, the alternative estimator is unbiased, i.e., the expected value of this sampling error is zero regardless of the value of $\beta$.
This is only possible if $c^\top X = o^\top$. But then it follows

$\operatorname{MSE}[\tilde\phi;\phi] = \operatorname{E}[(\tilde\phi-\phi)^2] = \operatorname{E}\bigl[\bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)\varepsilon\varepsilon^\top\bigl(X(X^\top X)^{-1}t + c\bigr)\bigr]$
$= \sigma^2\bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)\bigl(X(X^\top X)^{-1}t + c\bigr) = \sigma^2t^\top(X^\top X)^{-1}t + \sigma^2c^\top c$.

Here we needed again $c^\top X = o^\top$. Clearly, this is minimized if $c = o$, in which case $\tilde\phi = t^\top\hat\beta$.

Problem 248. 4 points Show: If $\tilde\beta$ is a linear unbiased estimator of $\beta$ and $\hat\beta$ is the OLS estimator, then the difference of the MSE-matrices $\mathcal{MSE}[\tilde\beta;\beta] - \mathcal{MSE}[\hat\beta;\beta]$ is nonnegative definite.

Answer. (Compare [DM93, p. 159].) Any other linear estimator $\tilde\beta$ of $\beta$ can be written as $\tilde\beta = \bigl((X^\top X)^{-1}X^\top + C\bigr)y$. Its expected value is $\mathcal{E}[\tilde\beta] = (X^\top X)^{-1}X^\top X\beta + CX\beta$. For $\tilde\beta$ to be unbiased, regardless of the value of $\beta$, $C$ must satisfy $CX = O$. But then it follows $\mathcal{MSE}[\tilde\beta;\beta] = \mathcal{V}[\tilde\beta] = \sigma^2\bigl((X^\top X)^{-1}X^\top + C\bigr)\bigl(X(X^\top X)^{-1} + C^\top\bigr) = \sigma^2(X^\top X)^{-1} + \sigma^2CC^\top$, i.e., it exceeds the MSE-matrix of $\hat\beta$ by a nonnegative definite matrix.

18.2. Digression about Minimax Estimators

Theorem 18.1.1 is a somewhat puzzling property of the least squares estimator, since there is no reason in the world to restrict one’s search for good estimators to unbiased estimators. An alternative and more enlightening characterization of $\hat\beta$ does not use the concept of unbiasedness but that of a minimax estimator with respect to the MSE. For this I am proposing the following definition:

Definition 18.2.1. $\hat\phi$ is the linear minimax estimator of the scalar parameter $\phi$ with respect to the MSE if and only if for every other linear estimator $\tilde\phi$ there exists a value of the parameter vector $\beta_0$ such that for all $\beta_1$

(18.2.1)  $\operatorname{MSE}[\tilde\phi;\phi\,|\,\beta=\beta_0] \ge \operatorname{MSE}[\hat\phi;\phi\,|\,\beta=\beta_1]$

In other words, the worst that can happen if one uses any other $\tilde\phi$ is worse than the worst that can happen if one uses $\hat\phi$. Using this concept one can prove the following:

Theorem 18.2.2. $\hat\beta$ is a linear minimax estimator of the parameter vector $\beta$ in the following sense: for every nonrandom coefficient vector $t$, $t^\top\hat\beta$ is the linear minimax estimator of the scalar $\phi = t^\top\beta$ with respect to the MSE. I.e., for every other linear estimator $\tilde\phi = a^\top y$ of $\phi$ one can find a value $\beta = \beta_0$ for which $\tilde\phi$ has a larger MSE than the largest possible MSE of $t^\top\hat\beta$.

Proof: as in the proof of Theorem 18.1.1, write the alternative linear estimator $\tilde\phi$ in the form $\tilde\phi = \bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)y$, so that the sampling error is given by (18.1.5). Then it follows

(18.2.2)  $\operatorname{MSE}[\tilde\phi;\phi] = \operatorname{E}[(\tilde\phi-\phi)^2] = \operatorname{E}\bigl[\bigl(\bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)\varepsilon + c^\top X\beta\bigr)\bigl(\varepsilon^\top\bigl(X(X^\top X)^{-1}t + c\bigr) + \beta^\top X^\top c\bigr)\bigr]$
(18.2.3)  $= \sigma^2\bigl(t^\top(X^\top X)^{-1}X^\top + c^\top\bigr)\bigl(X(X^\top X)^{-1}t + c\bigr) + c^\top X\beta\beta^\top X^\top c$

Now there are two cases: if $c^\top X = o^\top$, then $\operatorname{MSE}[\tilde\phi;\phi] = \sigma^2t^\top(X^\top X)^{-1}t + \sigma^2c^\top c$. This does not depend on $\beta$ and if $c \ne o$ then this MSE is larger than that for $c = o$. If $c^\top X \ne o^\top$, then $\operatorname{MSE}[\tilde\phi;\phi]$ is unbounded, i.e., for any finite number $\omega$ one can always find a $\beta_0$ for which $\operatorname{MSE}[\tilde\phi;\phi] > \omega$. Since $\operatorname{MSE}[\hat\phi;\phi]$ is bounded, a $\beta_0$ can be found that satisfies (18.2.1).

If we characterize the BLUE as a minimax estimator, we are using a consistent and unified principle. It is based on the concept of the MSE alone, not on a mixture between the concepts of unbiasedness and the MSE. This explains why the mathematical theory of the least squares estimator is so rich.

On the other hand, a minimax strategy is not a good estimation strategy. Nature is not the adversary of the researcher; it does not maliciously choose $\beta$ in such a way that the researcher will be misled. This explains why the least squares principle, despite the beauty of its mathematical theory, does not give terribly good estimators (in fact, they are inadmissible, see the Section about the Stein rule below).

$\hat\beta$ is therefore simultaneously the solution to two very different minimization problems.
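The two cases in the proof of Theorem 18.2.2 can be made concrete in the simplest setting. The sketch below assumes a hypothetical regression through the origin, $y_i = \beta x_i + \varepsilon_i$ with three assumed data points and σ² = 1, and compares OLS with a shrunken estimator (0.9 times OLS, chosen purely for illustration). For a linear estimator $a^\top y$ of $\beta$, formula (18.2.3) with $t = 1$ reduces to $\sigma^2 a^\top a + ((a^\top x - 1)\beta)^2$.

```python
from fractions import Fraction

# Hypothetical data for the model y_i = beta * x_i + eps_i (assumed values)
x = [Fraction(1), Fraction(2), Fraction(3)]
sigma2 = Fraction(1)

def mse_linear(a, beta):
    """MSE of a'y as an estimator of beta:
    variance sigma2 * a'a plus squared bias ((a'x - 1) * beta)^2."""
    aa = sum(ai * ai for ai in a)
    ax = sum(ai * xi for ai, xi in zip(a, x))
    return sigma2 * aa + ((ax - 1) * beta) ** 2

xx = sum(xi * xi for xi in x)
a_ols = [xi / xx for xi in x]                   # OLS weights: a'x = 1, so c'x = 0
a_shrunk = [Fraction(9, 10) * ai for ai in a_ols]  # shrinkage: a'x = 0.9, so c'x != 0

for beta in (0, 1, 10):
    print(beta, mse_linear(a_ols, beta), mse_linear(a_shrunk, beta))
```

In this example the shrunken estimator beats OLS when β is near zero, but its MSE grows without bound in β, while the MSE of OLS stays at σ²/Σx² for every β. This is the sense in which only the least squares estimator passes the minimax test.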
We will refer to it as the OLS estimate if we refer to its property of minimizing the sum of squared errors, and as the BLUE estimator if we think of it as the best linear unbiased estimator. Note that even if $\sigma^2$ were known, one could not get a better linear unbiased estimator of $\beta$.

18.3. Miscellaneous Properties of the BLUE

Problem 249.

• a. 1 point Instead of (14.2.22) one sometimes sees the formula

(18.3.1)  $\hat\beta = \dfrac{\sum(x_t-\bar x)y_t}{\sum(x_t-\bar x)^2}$

for the slope parameter in the simple regression. Show that these formulas are mathematically equivalent.

Answer. Equivalence of (18.3.1) and (14.2.22) follows from $\sum(x_t-\bar x) = 0$ and therefore also $\bar y\sum(x_t-\bar x) = 0$. Alternative proof, using matrix notation and the matrix $D$ defined in Problem 161: (14.2.22) is $\dfrac{x^\top D^\top Dy}{x^\top D^\top Dx}$ and (18.3.1) is $\dfrac{x^\top Dy}{x^\top D^\top Dx}$. They are equal because $D$ is symmetric and idempotent.

• b. 1 point Show that

(18.3.2)  $\operatorname{var}[\hat\beta] = \dfrac{\sigma^2}{\sum(x_i-\bar x)^2}$

Answer. Write (18.3.1) as

(18.3.3)  $\hat\beta = \dfrac{1}{\sum(x_t-\bar x)^2}\sum(x_t-\bar x)y_t \quad\Rightarrow\quad \operatorname{var}[\hat\beta] = \Bigl(\dfrac{1}{\sum(x_t-\bar x)^2}\Bigr)^2\sum(x_t-\bar x)^2\sigma^2$

• c. 2 points Show that $\operatorname{cov}[\hat\beta,\bar y] = 0$.

Answer. This is a special case of problem 247, but it can be easily shown here separately:

$\operatorname{cov}[\hat\beta,\bar y] = \operatorname{cov}\Bigl[\dfrac{\sum_s(x_s-\bar x)y_s}{\sum_t(x_t-\bar x)^2},\ \dfrac{1}{n}\sum_j y_j\Bigr] = \dfrac{1}{n\sum_t(x_t-\bar x)^2}\operatorname{cov}\Bigl[\sum_s(x_s-\bar x)y_s,\ \sum_j y_j\Bigr] = \dfrac{1}{n\sum_t(x_t-\bar x)^2}\sum_s(x_s-\bar x)\sigma^2 = 0$.

• d. 2 points Using (14.2.23) show that

(18.3.4)  $\operatorname{var}[\hat\alpha] = \sigma^2\Bigl(\dfrac{1}{n} + \dfrac{\bar x^2}{\sum(x_i-\bar x)^2}\Bigr)$

Problem 250. You have two data vectors $x_i$ and $y_i$ ($i = 1,\ldots,n$), and the true model is

(18.3.5)  $y_i = \beta x_i + \varepsilon_i$

where $x_i$ and $\varepsilon_i$ satisfy the basic assumptions of the linear regression model. The least squares estimator for this model is

(18.3.6)  $\tilde\beta = (x^\top x)^{-1}x^\top y = \dfrac{\sum x_iy_i}{\sum x_i^2}$

• a. 1 point Is $\tilde\beta$ an unbiased estimator of $\beta$? (Proof is required.)

Answer.
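First derive a nice expression for $\tilde\beta-\beta$:

$\tilde\beta-\beta = \dfrac{\sum x_iy_i}{\sum x_i^2} - \dfrac{\sum x_i^2\beta}{\sum x_i^2} = \dfrac{\sum x_i(y_i-x_i\beta)}{\sum x_i^2} = \dfrac{\sum x_i\varepsilon_i}{\sum x_i^2}$  since $y_i = \beta x_i+\varepsilon_i$

$\operatorname{E}[\tilde\beta-\beta] = \operatorname{E}\Bigl[\dfrac{\sum x_i\varepsilon_i}{\sum x_i^2}\Bigr] = \dfrac{\sum\operatorname{E}[x_i\varepsilon_i]}{\sum x_i^2} = \dfrac{\sum x_i\operatorname{E}[\varepsilon_i]}{\sum x_i^2} = 0$  since $\operatorname{E}\varepsilon_i = 0$.

• b. 2 points Derive the variance of $\tilde\beta$. (Show your work.)

Answer.

$\operatorname{var}\tilde\beta = \operatorname{E}[(\tilde\beta-\beta)^2] = \operatorname{E}\Bigl[\Bigl(\dfrac{\sum x_i\varepsilon_i}{\sum x_i^2}\Bigr)^2\Bigr]$
$= \dfrac{1}{(\sum x_i^2)^2}\operatorname{E}\bigl[\bigl(\sum x_i\varepsilon_i\bigr)^2\bigr]$
$= \dfrac{1}{(\sum x_i^2)^2}\operatorname{E}\Bigl[\sum(x_i\varepsilon_i)^2 + 2\sum_{i<j}(x_i\varepsilon_i)(x_j\varepsilon_j)\Bigr]$
$= \dfrac{1}{(\sum x_i^2)^2}\sum\operatorname{E}[(x_i\varepsilon_i)^2]$  since the $\varepsilon_i$’s are uncorrelated, i.e., $\operatorname{cov}[\varepsilon_i,\varepsilon_j] = 0$ for $i \ne j$
$= \dfrac{1}{(\sum x_i^2)^2}\sigma^2\sum x_i^2$  since all $\varepsilon_i$ have equal variance $\sigma^2$
$= \dfrac{\sigma^2}{\sum x_i^2}$.

Problem 251. We still assume (18.3.5) is the true model. Consider an alternative estimator:

(18.3.7)  $\hat\beta = \dfrac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}$

i.e., the estimator which would be the best linear unbiased estimator if the true model were (14.2.15).

• a. 2 points Is $\hat\beta$ still an unbiased estimator of $\beta$ if (18.3.5) is the true model? (A short but rigorous argument may save you a lot of algebra here.)

Answer. One can argue it: $\hat\beta$ is unbiased for model (14.2.15) whatever the value of $\alpha$ or $\beta$, therefore also when $\alpha = 0$, i.e., when the model is (18.3.5). But here is the pedestrian way:

$\hat\beta = \dfrac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2} = \dfrac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2}$  since $\sum(x_i-\bar x)\bar y = 0$
$= \dfrac{\sum(x_i-\bar x)(\beta x_i+\varepsilon_i)}{\sum(x_i-\bar x)^2}$  since $y_i = \beta x_i+\varepsilon_i$
$= \beta\dfrac{\sum(x_i-\bar x)x_i}{\sum(x_i-\bar x)^2} + \dfrac{\sum(x_i-\bar x)\varepsilon_i}{\sum(x_i-\bar x)^2}$
$= \beta + \dfrac{\sum(x_i-\bar x)\varepsilon_i}{\sum(x_i-\bar x)^2}$  since $\sum(x_i-\bar x)x_i = \sum(x_i-\bar x)^2$

$\operatorname{E}\hat\beta = \operatorname{E}\beta + \operatorname{E}\Bigl[\dfrac{\sum(x_i-\bar x)\varepsilon_i}{\sum(x_i-\bar x)^2}\Bigr] = \beta + \dfrac{\sum(x_i-\bar x)\operatorname{E}\varepsilon_i}{\sum(x_i-\bar x)^2} = \beta$  since $\operatorname{E}\varepsilon_i = 0$ for all $i$, i.e., $\hat\beta$ is unbiased.

• b. 2 points Derive the variance of $\hat\beta$ if (18.3.5) is the true model.

Answer. One can again argue it: since the formula for $\operatorname{var}\hat\beta$ does not depend on what the true value of $\alpha$ is, it is the same formula.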
(18.3.8)  $\operatorname{var}\hat\beta = \operatorname{var}\Bigl[\beta + \dfrac{\sum(x_i-\bar x)\varepsilon_i}{\sum(x_i-\bar x)^2}\Bigr]$
(18.3.9)  $= \operatorname{var}\Bigl[\dfrac{\sum(x_i-\bar x)\varepsilon_i}{\sum(x_i-\bar x)^2}\Bigr]$
(18.3.10)  $= \dfrac{\sum(x_i-\bar x)^2\operatorname{var}\varepsilon_i}{(\sum(x_i-\bar x)^2)^2}$  since $\operatorname{cov}[\varepsilon_i,\varepsilon_j] = 0$
(18.3.11)  $= \dfrac{\sigma^2}{\sum(x_i-\bar x)^2}$.

• c. 1 point Still assuming (18.3.5) is the true model, would you prefer $\hat\beta$ or the $\tilde\beta$ from Problem 250 as an estimator of $\beta$?

Answer. Since $\tilde\beta$ and $\hat\beta$ are both unbiased estimators, if (18.3.5) is the true model, the preferred estimator is the one with the smaller variance. As I will show, $\operatorname{var}\tilde\beta \le \operatorname{var}\hat\beta$ and, therefore, $\tilde\beta$ is preferred to $\hat\beta$. To show

(18.3.12)  $\operatorname{var}\hat\beta = \dfrac{\sigma^2}{\sum(x_i-\bar x)^2} \ge \dfrac{\sigma^2}{\sum x_i^2} = \operatorname{var}\tilde\beta$

one must show

(18.3.13)  $\sum(x_i-\bar x)^2 \le \sum x_i^2$

which is a simple consequence of (9.1.1). Thus $\operatorname{var}\hat\beta \ge \operatorname{var}\tilde\beta$; the variances are equal only if $\bar x = 0$, i.e., if $\tilde\beta = \hat\beta$.

Problem 252. Suppose the true model is (14.2.15) and the basic assumptions are satisfied.

• a. 2 points In this situation, $\tilde\beta = \dfrac{\sum x_iy_i}{\sum x_i^2}$ is generally a biased estimator of $\beta$. Show that its bias is

(18.3.14)  $\operatorname{E}[\tilde\beta-\beta] = \alpha\dfrac{n\bar x}{\sum x_i^2}$

Answer. In situations like this it is always worth while to get a nice simple expression for the sampling error:

(18.3.15)  $\tilde\beta-\beta = \dfrac{\sum x_iy_i}{\sum x_i^2} - \beta$
(18.3.16)  $= \dfrac{\sum x_i(\alpha+\beta x_i+\varepsilon_i)}{\sum x_i^2} - \beta$  since $y_i = \alpha+\beta x_i+\varepsilon_i$
(18.3.17)  $= \alpha\dfrac{\sum x_i}{\sum x_i^2} + \beta\dfrac{\sum x_i^2}{\sum x_i^2} + \dfrac{\sum x_i\varepsilon_i}{\sum x_i^2} - \beta$
(18.3.18)  $= \alpha\dfrac{\sum x_i}{\sum x_i^2} + \dfrac{\sum x_i\varepsilon_i}{\sum x_i^2}$
(18.3.19)  $\operatorname{E}[\tilde\beta-\beta] = \operatorname{E}\Bigl[\alpha\dfrac{\sum x_i}{\sum x_i^2}\Bigr] + \operatorname{E}\Bigl[\dfrac{\sum x_i\varepsilon_i}{\sum x_i^2}\Bigr]$
(18.3.20)  $= \alpha\dfrac{\sum x_i}{\sum x_i^2} + \dfrac{\sum x_i\operatorname{E}\varepsilon_i}{\sum x_i^2}$
(18.3.21)  $= \alpha\dfrac{\sum x_i}{\sum x_i^2} + 0 = \alpha\dfrac{n\bar x}{\sum x_i^2}$

This is $\ne 0$ unless $\bar x = 0$ or $\alpha = 0$.

• b. 2 points Compute $\operatorname{var}[\tilde\beta]$. Is it greater or smaller than

(18.3.22)  $\dfrac{\sigma^2}{\sum(x_i-\bar x)^2}$

which is the variance of the OLS estimator in this model?

Answer.

(18.3.23)  $\operatorname{var}\tilde\beta = \operatorname{var}\Bigl[\dfrac{\sum x_iy_i}{\sum x_i^2}\Bigr]$
(18.3.24)  $= \dfrac{1}{(\sum x_i^2)^2}\operatorname{var}\bigl[\sum x_iy_i\bigr]$
(18.3.25)  $= \dfrac{1}{(\sum x_i^2)^2}\sum x_i^2\operatorname{var}[y_i]$
(18.3.26)  $= \dfrac{\sigma^2}{(\sum x_i^2)^2}\sum x_i^2$  since all $y_i$ are uncorrelated and have equal variance $\sigma^2$
(18.3.27)  $= \dfrac{\sigma^2}{\sum x_i^2}$.

This variance is smaller or equal because $\sum x_i^2 \ge \sum(x_i-\bar x)^2$.

• c. 5 points Show that the MSE of $\tilde\beta$ is smaller than that of the OLS estimator if and only if the unknown true parameters $\alpha$ and $\sigma^2$ satisfy the equation

(18.3.28)  $\dfrac{\alpha^2}{\sigma^2\Bigl(\dfrac{1}{n} + \dfrac{\bar x^2}{\sum(x_i-\bar x)^2}\Bigr)} < 1$

Answer. This implies some tedious algebra. Here it is important to set it up right.

$\operatorname{MSE}[\tilde\beta;\beta] = \dfrac{\sigma^2}{\sum x_i^2} + \Bigl(\dfrac{\alpha n\bar x}{\sum x_i^2}\Bigr)^2 \le \dfrac{\sigma^2}{\sum(x_i-\bar x)^2}$

$\Bigl(\dfrac{\alpha n\bar x}{\sum x_i^2}\Bigr)^2 \le \dfrac{\sigma^2}{\sum(x_i-\bar x)^2} - \dfrac{\sigma^2}{\sum x_i^2} = \dfrac{\sigma^2\bigl(\sum x_i^2 - \sum(x_i-\bar x)^2\bigr)}{\sum(x_i-\bar x)^2\sum x_i^2} = \dfrac{\sigma^2n\bar x^2}{\sum(x_i-\bar x)^2\sum x_i^2}$

$\dfrac{\alpha^2n}{\sum x_i^2} = \dfrac{\alpha^2}{\frac{1}{n}\sum(x_i-\bar x)^2 + \bar x^2} \le \dfrac{\sigma^2}{\sum(x_i-\bar x)^2}$

$\dfrac{\alpha^2}{\sigma^2\Bigl(\dfrac{1}{n} + \dfrac{\bar x^2}{\sum(x_i-\bar x)^2}\Bigr)} \le 1$

Now look at this lefthand side; it is amazing and surprising that it is exactly the population equivalent of the F-test for testing $\alpha = 0$ in the regression with intercept. It can be estimated by replacing $\alpha^2$ with $\hat\alpha^2$ and $\sigma^2$ with $s^2$ (in the regression with intercept). Let’s look at this statistic. If $\alpha = 0$ it has a F-distribution with 1 and $n-2$ degrees of freedom. If $\alpha \ne 0$ it has what is called a noncentral distribution, and the only thing we needed to know so far was that it was likely to assume larger values than with $\alpha = 0$. This is why a small value of that statistic supported the hypothesis that $\alpha = 0$. But in the present case we are not testing whether $\alpha = 0$ but whether the constrained MSE is better than the unconstrained. This is the case if the above inequality holds, the limiting case being that it is an equality. If it is an equality, then the above statistic has a F distribution with noncentrality parameter 1/2. (Here all we need to know is that: if $z \sim N(\mu,1)$ then $z^2 \sim \chi^2_1$ with noncentrality parameter $\mu^2/2$. A noncentral F has a noncentral $\chi^2$ in numerator and a central one in denominator.) The testing principle is therefore: compare the observed value with the upper $\alpha$ point of a F distribution with noncentrality parameter 1/2.
This gives higher critical values than testing for $\alpha = 0$; i.e., one may reject that $\alpha = 0$ but not reject that the MSE of the constrained estimator is larger. This is as it should be. Compare [Gre97, 8.5.1 pp. 405–408] on this.

From the Gauss-Markov theorem follows that for every nonrandom matrix $R$, the BLUE of $\phi = R\beta$ is $\hat\phi = R\hat\beta$. Furthermore, the best linear unbiased predictor (BLUP) of $\varepsilon = y - X\beta$ is the vector of residuals $\hat\varepsilon = y - X\hat\beta$.

Problem 253. Let $\tilde\varepsilon = Ay$ be a linear predictor of the disturbance vector $\varepsilon$ in the model $y = X\beta + \varepsilon$ with $\varepsilon \sim (o, \sigma^2 I)$.

• a. 2 points Show that $\tilde\varepsilon$ is unbiased, i.e., $\operatorname{E}[\tilde\varepsilon - \varepsilon] = o$, regardless of the value of $\beta$, if and only if $A$ satisfies $AX = O$.

Answer. $\operatorname{E}[Ay - \varepsilon] = \operatorname{E}[AX\beta + A\varepsilon - \varepsilon] = AX\beta + o - o$. This is $= o$ for all $\beta$ if and only if $AX = O$.

• b. 2 points Which unbiased linear predictor $\tilde\varepsilon = Ay$ of $\varepsilon$ minimizes the MSE matrix $\operatorname{E}[(\tilde\varepsilon - \varepsilon)(\tilde\varepsilon - \varepsilon)^\top]$? Hint: Write $A = I - X(X^\top X)^{-1}X^\top + C$. What is the minimum value of this MSE-matrix?

Answer. Since $AX = O$, the prediction error is $Ay - \varepsilon = AX\beta + A\varepsilon - \varepsilon = (A - I)\varepsilon$; therefore one minimizes $\sigma^2 (A - I)(A - I)^\top$ s. t. $AX = O$. Using the hint, $C$ must also satisfy $CX = O$, and $(A - I)(A - I)^\top = (C - X(X^\top X)^{-1}X^\top)(C^\top - X(X^\top X)^{-1}X^\top) = X(X^\top X)^{-1}X^\top + CC^\top$, therefore one must set $C = O$. Minimum value is $\sigma^2 X(X^\top X)^{-1}X^\top$.

• c. How does this best predictor relate to the OLS estimator $\hat\beta$?

Answer. It is equal to the residual vector $\hat\varepsilon = y - X\hat\beta$.

Problem 254. This is a vector generalization of problem 170. Let $\hat\beta$ be the BLUE of $\beta$ and $\tilde\beta$ an arbitrary linear unbiased estimator of $\beta$.

• a. 2 points Show that $\mathcal{C}[\hat\beta - \tilde\beta, \hat\beta] = O$.

Answer. Say $\tilde\beta = \tilde B y$; unbiasedness means $\tilde B X = I$. Therefore
\begin{align*}
\mathcal{C}[\hat\beta - \tilde\beta, \hat\beta] &= \mathcal{C}\bigl[\bigl((X^\top X)^{-1}X^\top - \tilde B\bigr)y,\ (X^\top X)^{-1}X^\top y\bigr]\\
&= \bigl((X^\top X)^{-1}X^\top - \tilde B\bigr)\,\mathcal{V}[y]\,X(X^\top X)^{-1}\\
&= \sigma^2\bigl((X^\top X)^{-1}X^\top - \tilde B\bigr)X(X^\top X)^{-1}\\
&= \sigma^2\bigl((X^\top X)^{-1} - (X^\top X)^{-1}\bigr) = O.
\end{align*}

• b.
2 points Show that $\operatorname{MSE}[\tilde\beta;\beta] = \operatorname{MSE}[\hat\beta;\beta] + \mathcal{V}[\tilde\beta - \hat\beta]$.

Answer. Due to unbiasedness, MSE $= \mathcal{V}$, and the decomposition $\tilde\beta = \hat\beta + (\tilde\beta - \hat\beta)$ is an uncorrelated sum. Here is more detail: $\operatorname{MSE}[\tilde\beta;\beta] = \mathcal{V}[\tilde\beta] = \mathcal{V}[\hat\beta + \tilde\beta - \hat\beta] = \mathcal{V}[\hat\beta] + \mathcal{C}[\hat\beta, \tilde\beta - \hat\beta] + \mathcal{C}[\tilde\beta - \hat\beta, \hat\beta] + \mathcal{V}[\tilde\beta - \hat\beta]$, but the two $\mathcal{C}$-terms are null matrices.

Problem 255. 3 points Given a simple regression $y_t = \alpha + \beta x_t + \varepsilon_t$, where the $\varepsilon_t$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$. Is it possible to consistently estimate all four parameters $\alpha$, $\beta$, $\sigma^2$, and $\mu$? If yes, explain how you would estimate them, and if no, what is the best you can do?

Answer. Call $\tilde\varepsilon_t = \varepsilon_t - \mu$, then the equation reads $y_t = \alpha + \mu + \beta x_t + \tilde\varepsilon_t$, with well behaved disturbances. Therefore one can estimate $\alpha + \mu$, $\beta$, and $\sigma^2$. This is also the best one can do; if the sums $\alpha + \mu$ are equal, the $y_t$ have the same joint distribution.

Problem 256. 3 points The model is $y = X\beta + \varepsilon$ but all rows of the $X$-matrix are exactly equal. What can you do? Can you estimate $\beta$? If not, are there any linear combinations of the components of $\beta$ which you can estimate? Can you estimate $\sigma^2$?

Answer. If all rows are equal, then each column is a multiple of $\iota$. Therefore, if there is more than one column, none of the individual components of $\beta$ can be estimated. But you can estimate $x^\top\beta$ (if $x^\top$ is one of the row vectors of $X$) and you can estimate $\sigma^2$.

Problem 257. This is [JHG+88, 5.3.32]: Consider the log-linear statistical model
$$y_t = \alpha x_t^\beta \exp\varepsilon_t = z_t \exp\varepsilon_t \tag{18.3.29}$$
with “well-behaved” disturbances $\varepsilon_t$. Here $z_t = \alpha x_t^\beta$ is the systematic portion of $y_t$, which depends on $x_t$. (This functional form is often used in models of demand and production.)

• a. 1 point Can this be estimated with the regression formalism?

Answer. Yes, simply take logs:
$$\log y_t = \log\alpha + \beta\log x_t + \varepsilon_t \tag{18.3.30}$$

• b.
1 point Show that the elasticity of the functional relationship between $x_t$ and $z_t$
$$\eta = \frac{\partial z_t / z_t}{\partial x_t / x_t} \tag{18.3.31}$$
does not depend on $t$, i.e., it is the same for all observations. Many authors talk about the elasticity of $y_t$ with respect to $x_t$, but one should really only talk about the elasticity of $z_t$ with respect to $x_t$, where $z_t$ is the systematic part of $y_t$ which can be estimated by $\hat y_t$.

Answer. The systematic functional relationship is $\log z_t = \log\alpha + \beta\log x_t$; therefore
$$\frac{\partial\log z_t}{\partial z_t} = \frac{1}{z_t} \tag{18.3.32}$$
which can be rewritten as
$$\frac{\partial z_t}{z_t} = \partial\log z_t; \tag{18.3.33}$$
The same can be done with $x_t$; therefore
$$\frac{\partial z_t / z_t}{\partial x_t / x_t} = \frac{\partial\log z_t}{\partial\log x_t} = \beta \tag{18.3.34}$$
What we just did was a tricky way to take a derivative. A less tricky way is:
$$\frac{\partial z_t}{\partial x_t} = \alpha\beta x_t^{\beta-1} = \beta z_t / x_t \tag{18.3.35}$$
Therefore
$$\frac{\partial z_t}{\partial x_t}\,\frac{x_t}{z_t} = \beta \tag{18.3.36}$$

Problem 258.

• a. 2 points What is the elasticity in the simple regression $y_t = \alpha + \beta x_t + \varepsilon_t$?

Answer.
$$\eta_t = \frac{\partial z_t / z_t}{\partial x_t / x_t} = \frac{\partial z_t}{\partial x_t}\,\frac{x_t}{z_t} = \frac{\beta x_t}{z_t} = \frac{\beta x_t}{\alpha + \beta x_t} \tag{18.3.37}$$
This depends on the observation, and if one wants one number, a good way is to evaluate it at $\bar x$.

• b. Show that an estimate of this elasticity evaluated at $\bar x$ is $h = \frac{\hat\beta\bar x}{\bar y}$.

Answer. This comes from the fact that the fitted regression line goes through the point $(\bar x, \bar y)$. If one uses the other definition of elasticity, which Greene uses on p. 227 but no longer on p. 280, and which I think does not make much sense, one gets the same formula:
$$\eta_t = \frac{\partial y_t / y_t}{\partial x_t / x_t} = \frac{\partial y_t}{\partial x_t}\,\frac{x_t}{y_t} = \frac{\beta x_t}{y_t} \tag{18.3.38}$$
This is different from (18.3.37), but if one evaluates it at the sample mean, both formulas give the same result $\frac{\hat\beta\bar x}{\bar y}$.

• c. Show by the delta method that the estimator
$$h = \frac{\hat\beta\bar x}{\bar y} \tag{18.3.39}$$
of the elasticity in the simple regression model has the estimated asymptotic variance
$$s^2 \begin{bmatrix} \dfrac{-h}{\bar y} & \dfrac{\bar x(1-h)}{\bar y} \end{bmatrix} \begin{bmatrix} 1 & \bar x\\ \bar x & \overline{x^2} \end{bmatrix}^{-1} \begin{bmatrix} \dfrac{-h}{\bar y}\\[1ex] \dfrac{\bar x(1-h)}{\bar y} \end{bmatrix} \tag{18.3.40}$$

• d.
Compare [Gre97, example 6.20 on p. 280]. Assume
$$\frac{1}{n}(X^\top X) = \begin{bmatrix} 1 & \bar x\\ \bar x & \overline{x^2} \end{bmatrix} \to Q = \begin{bmatrix} 1 & q\\ q & r \end{bmatrix} \tag{18.3.41}$$
where we assume for the sake of the argument that $q$ is known. The true elasticity of the underlying functional relationship, evaluated at $\lim\bar x$, is
$$\eta = \frac{q\beta}{\alpha + q\beta} \tag{18.3.42}$$
Then
$$h = \frac{q\hat\beta}{\hat\alpha + q\hat\beta} \tag{18.3.43}$$
is a consistent estimate for $\eta$.

A generalization of the log-linear model is the translog model, which is a second-order approximation to an unknown functional form, and which allows one to model second-order effects such as elasticities of substitution. It is used to model production, cost, and utility functions. Start with any function $v = f(u_1, \ldots, u_n)$ and make a second-order Taylor development around $u = o$:
$$v = f(o) + \sum_i u_i \frac{\partial f}{\partial u_i}\Big|_{u=o} + \frac{1}{2}\sum_{i,j} u_i u_j \frac{\partial^2 f}{\partial u_i\,\partial u_j}\Big|_{u=o} \tag{18.3.44}$$
Now say $v = \log(y)$ and $u_i = \log(x_i)$, and the values of $f$ and its derivatives at $o$ are the coefficients to be estimated:
$$\log(y) = \alpha + \sum_i \beta_i\log x_i + \frac{1}{2}\sum_{i,j}\gamma_{ij}\log x_i\log x_j + \varepsilon \tag{18.3.45}$$
Note that by Young's theorem it must be true that $\gamma_{kl} = \gamma_{lk}$.

The semi-log model is often used to model growth rates:
$$\log y_t = x_t^\top\beta + \varepsilon_t \tag{18.3.46}$$
Here usually one of the columns of $X$ is the time subscript $t$ itself; [Gre97, p. 227] writes it as
$$\log y_t = x_t^\top\beta + t\delta + \varepsilon_t \tag{18.3.47}$$
where $\delta$ is the autonomous growth rate. The logistic functional form is appropriate for adoption rates $0 \le y_t \le 1$: the rate of adoption is slow at first, then rapid as the innovation gains popularity, then slow again as the market becomes saturated:
$$y_t = \frac{\exp(x_t^\top\beta + t\delta + \varepsilon_t)}{1 + \exp(x_t^\top\beta + t\delta + \varepsilon_t)} \tag{18.3.48}$$
This can be linearized by the logit transformation:
$$\operatorname{logit}(y_t) = \log\frac{y_t}{1 - y_t} = x_t^\top\beta + t\delta + \varepsilon_t \tag{18.3.49}$$

Problem 259. 3 points Given a simple regression $y_t = \alpha_t + \beta x_t$ which deviates from an ordinary regression in two ways: (1) There is no disturbance term.
(2) The “constant term” $\alpha_t$ is random, i.e., in each time period $t$, the value of $\alpha_t$ is obtained by an independent drawing from a population with unknown mean $\mu$ and unknown variance $\sigma^2$. Is it possible to estimate all three parameters $\beta$, $\sigma^2$, and $\mu$, and to “predict” each $\alpha_t$? (Here I am using the term “prediction” for the estimation of a random parameter.) If yes, explain how you would estimate it, and if not, what is the best you can do?

Answer. Call $\varepsilon_t = \alpha_t - \mu$, then the equation reads $y_t = \mu + \beta x_t + \varepsilon_t$, with well behaved disturbances. Therefore one can estimate all the unknown parameters, and predict $\alpha_t$ by $\hat\mu + \hat\varepsilon_t$.

18.4. Estimation of the Variance

The formulas in this section use g-inverses (compare (A.3.1)) and are valid even if not all columns of $X$ are linearly independent. $q$ is the rank of $X$. The proofs are not any more complicated than in the case that $X$ has full rank, if one keeps in mind identity (A.3.3) and some other simple properties of g-inverses which are tacitly used at various places. Those readers who are only interested in the full-rank case should simply substitute $(X^\top X)^{-1}$ for $(X^\top X)^-$ and $k$ for $q$ ($k$ is the number of columns of $X$).

SSE, the attained minimum value of the Least Squares objective function, is a random variable too and we will now compute its expected value. It turns out that
$$\operatorname{E}[\mathit{SSE}] = \sigma^2(n - q) \tag{18.4.1}$$

Proof. $\mathit{SSE} = \hat\varepsilon^\top\hat\varepsilon$, where $\hat\varepsilon = y - X\hat\beta = y - X(X^\top X)^-X^\top y = My$, with $M = I - X(X^\top X)^-X^\top$. From $MX = O$ follows $\hat\varepsilon = M(X\beta + \varepsilon) = M\varepsilon$. Since $M$ is idempotent and symmetric, it follows $\hat\varepsilon^\top\hat\varepsilon = \varepsilon^\top M\varepsilon$, therefore $\operatorname{E}[\hat\varepsilon^\top\hat\varepsilon] = \operatorname{E}[\operatorname{tr}\varepsilon^\top M\varepsilon] = \operatorname{E}[\operatorname{tr}M\varepsilon\varepsilon^\top] = \sigma^2\operatorname{tr}M = \sigma^2\operatorname{tr}(I - X(X^\top X)^-X^\top) = \sigma^2(n - \operatorname{tr}(X^\top X)^-X^\top X) = \sigma^2(n - q)$.

Problem 260.

• a. 2 points Show that
$$\mathit{SSE} = \varepsilon^\top M\varepsilon \qquad\text{where } M = I - X(X^\top X)^-X^\top \tag{18.4.2}$$

Answer. $\mathit{SSE} = \hat\varepsilon^\top\hat\varepsilon$, where $\hat\varepsilon = y - X\hat\beta = y - X(X^\top X)^-X^\top y = My$ where $M = I - X(X^\top X)^-X^\top$. From $MX = O$ follows $\hat\varepsilon = M(X\beta + \varepsilon) = M\varepsilon$.
Since $M$ is idempotent and symmetric, it follows $\hat\varepsilon^\top\hat\varepsilon = \varepsilon^\top M\varepsilon$.

• b. 1 point Is SSE observed? Is $\varepsilon$ observed? Is $M$ observed?

• c. 3 points Under the usual assumption that $X$ has full column rank, show that
$$\operatorname{E}[\mathit{SSE}] = \sigma^2(n - k) \tag{18.4.3}$$

Answer. $\operatorname{E}[\hat\varepsilon^\top\hat\varepsilon] = \operatorname{E}[\operatorname{tr}\varepsilon^\top M\varepsilon] = \operatorname{E}[\operatorname{tr}M\varepsilon\varepsilon^\top] = \sigma^2\operatorname{tr}M = \sigma^2\operatorname{tr}(I - X(X^\top X)^-X^\top) = \sigma^2(n - \operatorname{tr}(X^\top X)^-X^\top X) = \sigma^2(n - k)$.

Problem 261. As an alternative proof of (18.4.3) show that $\mathit{SSE} = y^\top My$ and use theorem ??.

From (18.4.3) follows that $\mathit{SSE}/(n - q)$ is an unbiased estimate of $\sigma^2$. Although it is commonly suggested that $s^2 = \mathit{SSE}/(n - q)$ is an optimal estimator of $\sigma^2$, this is a fallacy. The question which estimator of $\sigma^2$ is best depends on the kurtosis of the distribution of the error terms. For instance, if the kurtosis is zero, which is the case when the error terms are normal, then a different scalar multiple of the SSE, namely, the Theil-Schweitzer estimator from [TS61]
$$\hat\sigma^2_{TS} = \frac{1}{n - q + 2}\,y^\top My = \frac{1}{n - q + 2}\sum_{i=1}^n \hat\varepsilon_i^2, \tag{18.4.4}$$
is biased but has lower MSE than $s^2$. Compare problem 163. The only thing one can say about $s^2$ is that it is a fairly good estimator which one can use when one does not know the kurtosis (but even in this case it is not the best one can do).

18.5. Mallows' Cp-Statistic as Estimator of the Mean Squared Error

Problem 262. We will compute here the MSE-matrix of $\hat y$ as an estimator of $\operatorname{E}[y]$ in a regression which does not use the correct $X$-matrix. For this we assume that $y = \eta + \varepsilon$ with $\varepsilon \sim (o, \sigma^2 I)$. $\eta = \operatorname{E}[y]$ is an arbitrary vector of constants, and we do not assume that $\eta = X\beta$ for some $\beta$, i.e., we do not assume that $X$ contains all the necessary explanatory variables. Regression of $y$ on $X$ gives the OLS estimator $\hat\beta = (X^\top X)^-X^\top y$.

• a. 2 points Show that the MSE matrix of $\hat y = X\hat\beta$ as estimator of $\eta$ is
$$\operatorname{MSE}[X\hat\beta;\eta] = \sigma^2 X(X^\top X)^-X^\top + M\eta\eta^\top M \tag{18.5.1}$$
where $M = I - X(X^\top X)^-X^\top$.

• b.
1 point Formula (18.5.1) for the MSE matrix depends on the unknown $\sigma^2$ and $\eta$ and is therefore useless for estimation. If one cannot get an estimate of the whole MSE matrix, an often-used second best choice is its trace. Show that
$$\operatorname{tr}\operatorname{MSE}[X\hat\beta;\eta] = \sigma^2 q + \eta^\top M\eta, \tag{18.5.2}$$
where $q$ is the rank of $X$.

• c. 3 points If an unbiased estimator of the true $\sigma^2$ is available (call it $s^2$), then an unbiased estimator of the righthand side of (18.5.2) can be constructed using this $s^2$ and the SSE of the regression, $\mathit{SSE} = y^\top My$. Show that
$$\operatorname{E}[\mathit{SSE} - (n - 2q)s^2] = \sigma^2 q + \eta^\top M\eta. \tag{18.5.3}$$
Hint: use equation (??). If one does not have an unbiased estimator $s^2$ of $\sigma^2$, one usually gets such an estimator by regressing $y$ on an $X$ matrix which is so large that one can assume that it contains the true regressors.

The statistic
$$C_p = \frac{\mathit{SSE}}{s^2} + 2q - n \tag{18.5.4}$$
is called Mallows' $C_p$ statistic. It is a consistent estimator of $\operatorname{tr}\operatorname{MSE}[X\hat\beta;\eta]/\sigma^2$. If $X$ contains all necessary variables, i.e., $\eta = X\beta$ for some $\beta$, then (18.5.2) becomes $\operatorname{tr}\operatorname{MSE}[X\hat\beta;\eta] = \sigma^2 q$, i.e., in this case $C_p$ should be close to $q$. Therefore the selection rule for regressions should be here to pick that regression for which the $C_p$-value is closest to $q$. (This is an explanation; nothing to prove here.)

If one therefore has several regressions and tries to decide which is the right one, it is recommended to plot $C_p$ versus $q$ for all regressions, and choose one for which this value is small and lies close to the diagonal. An example of this is given in problem 232.

CHAPTER 19

Nonspherical Positive Definite Covariance Matrix

The so-called “Generalized Least Squares” model specifies $y = X\beta + \varepsilon$ with $\varepsilon \sim (o, \sigma^2\Psi)$ where $\sigma^2$ is an unknown positive scalar, and $\Psi$ is a known positive definite matrix.

This is simply the OLS model in disguise. To see this, we need a few more facts about positive definite matrices. $\Psi$ is nonnegative definite if and only if a $Q$ exists with $\Psi = QQ^\top$.
If $\Psi$ is positive definite, this $Q$ can be chosen square and nonsingular. Then $P = Q^{-1}$ satisfies $P^\top P\Psi = P^\top PQQ^\top = I$, i.e., $P^\top P = \Psi^{-1}$, and also $P\Psi P^\top = PQQ^\top P^\top = I$. Premultiplying the GLS model by $P$ gives therefore a model whose disturbances have a spherical covariance matrix:
$$Py = PX\beta + P\varepsilon \qquad P\varepsilon \sim (o, \sigma^2 I) \tag{19.0.5}$$
The OLS estimate of $\beta$ in this transformed model is
$$\hat\beta = (X^\top P^\top PX)^{-1}X^\top P^\top Py = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y. \tag{19.0.6}$$
This $\hat\beta$ is the BLUE of $\beta$ in model (19.0.5), and since estimators which are linear in $Py$ are also linear in $y$ and vice versa, $\hat\beta$ is also the BLUE in the original GLS model.

Problem 263. 2 points Show that
$$\hat\beta - \beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}\varepsilon \tag{19.0.7}$$
and derive from this that $\hat\beta$ is unbiased and that $\operatorname{MSE}[\hat\beta;\beta] = \sigma^2(X^\top\Psi^{-1}X)^{-1}$.

Answer. Proof of (19.0.7) is very similar to proof of (18.0.7).

The objective function of the associated least squares problem is
$$\beta = \hat\beta \quad\text{minimizes}\quad (y - X\beta)^\top\Psi^{-1}(y - X\beta). \tag{19.0.8}$$
The normal equations are
$$X^\top\Psi^{-1}X\hat\beta = X^\top\Psi^{-1}y \tag{19.0.9}$$
If $X$ has full rank, then $X^\top\Psi^{-1}X$ is nonsingular, and the unique $\hat\beta$ minimizing (19.0.8) is
$$\hat\beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y \tag{19.0.10}$$

Problem 264. [Seb77, p. 386, 5] Show that if $\Psi$ is positive definite and $X$ has full rank, then also $X^\top\Psi^{-1}X$ is positive definite. You are allowed to use, without proof, that the inverse of a positive definite matrix is also positive definite.

Answer. From $X^\top\Psi^{-1}Xa = o$ follows $a^\top X^\top\Psi^{-1}Xa = 0$, and since $\Psi^{-1}$ is positive definite, it follows $Xa = o$, and since $X$ has full column rank, this implies $a = o$.

The least squares objective function of the transformed model, which $\beta = \hat\beta$ minimizes, can be written
$$(Py - PX\beta)^\top(Py - PX\beta) = (y - X\beta)^\top\Psi^{-1}(y - X\beta), \tag{19.0.11}$$
and whether one writes it in one form or the other, $1/(n - k)$ times the minimum value of that GLS objective function is still an unbiased estimate of $\sigma^2$.

Problem 265.
Show that the minimum value of the GLS objective function can be written in the form $y^\top My$ where $M = \Psi^{-1} - \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}$. Does $MX = O$ still hold? Does $M^2 = M$ or a similar simple identity still hold? Show that $M$ is nonnegative definite. Show that $\operatorname{E}[y^\top My] = (n - k)\sigma^2$.

Answer. In $(y - X\hat\beta)^\top\Psi^{-1}(y - X\hat\beta)$ plug in $\hat\beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y$ and multiply out to get $y^\top My$. Yes, $MX = O$ holds. $M$ is no longer idempotent, but it satisfies $M\Psi M = M$. One way to show that it is nnd would be to use the first part of the question: for all $z$, $z^\top Mz = (z - X\hat\beta)^\top\Psi^{-1}(z - X\hat\beta)$, where $\hat\beta$ is computed from $z$ in place of $y$; another way would be to use the second part of the question: $M$ is nnd because $M\Psi M = M$. To show the expected value, show first that $y^\top My = \varepsilon^\top M\varepsilon$, and then use those tricks with the trace again.

The simplest example of Generalized Least Squares is that where $\Psi$ is diagonal (heteroskedastic data). In this case, the GLS objective function $(y - X\beta)^\top\Psi^{-1}(y - X\beta)$ is simply a weighted least squares, with the weights being the inverses of the diagonal elements of $\Psi$. This vector of inverse diagonal elements can be specified with the optional weights argument in R, see the help-file for lm. Heteroskedastic data arise for instance when each data point is an average over a different number of individuals.

If one runs OLS on the original instead of the transformed model, one gets an estimator, we will call it here $\hat\beta_{OLS}$, which is still unbiased. The estimator is usually also consistent, but no longer BLUE. This not only makes it less efficient than the GLS, but one also gets the wrong results if one relies on the standard computer printouts for significance tests etc. The estimate of $\sigma^2$ generated by this regression is now usually biased. How biased it is depends on the $X$-matrix, but most often it seems biased upwards.
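As a rough illustration of the weighted least squares case just described, the following sketch simulates data in which each observation is an average over a different number of individuals, so that $\Psi$ is diagonal with entries $1/m_i$. The group sizes `m`, the true coefficients, and the value of $\sigma^2$ are made-up assumptions for the simulation, not taken from the notes. It computes the GLS estimator from formula (19.0.10), recomputes it by running OLS on the transformed model (19.0.5), and also runs plain OLS on the untransformed data for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 200, 2
sigma2 = 4.0                                  # assumed "unknown" scalar sigma^2
m = rng.integers(1, 20, size=n)               # hypothetical group sizes behind each averaged observation
psi_diag = 1.0 / m                            # diagonal of Psi: var[eps_i] = sigma2 / m_i
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
beta_true = np.array([1.0, 0.5])              # assumed true coefficients
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2 * psi_diag))

# GLS via (X' Psi^{-1} X)^{-1} X' Psi^{-1} y, using only the diagonal of Psi
w = 1.0 / psi_diag                            # weights = inverse diagonal elements of Psi
beta_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Same estimator via OLS on the transformed model P y = P X beta + P eps, P = diag(sqrt(w))
p = np.sqrt(w)
beta_transformed, *_ = np.linalg.lstsq(p[:, None] * X, p * y, rcond=None)

# 1/(n-k) times the minimized GLS objective function estimates sigma^2
resid = y - X @ beta_gls
s2_gls = (w * resid**2).sum() / (n - k)

# Plain OLS on the untransformed data: still unbiased, but not BLUE
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("GLS:", beta_gls, " transformed OLS:", beta_transformed)
print("OLS:", beta_ols, " s2 from GLS objective:", s2_gls)
```

The two GLS computations agree up to rounding, which is the point of the "OLS in disguise" argument. In R, the corresponding call would pass the vector `w` as the weights argument of lm, as mentioned above.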
The estimated standard errors in the regression printouts not only use the wrong $s$, but they also insert this wrong $s$ into the wrong formula $\sigma^2(X^\top X)^{-1}$ instead of $\sigma^2(X^\top\Psi^{-1}X)^{-1}$ for $\mathcal{V}[\hat\beta]$.

Problem 266. In the generalized least squares model $y = X\beta + \varepsilon$ with $\varepsilon \sim (o, \sigma^2\Psi)$, the BLUE is
$$\hat\beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y. \tag{19.0.12}$$
We will write $\hat\beta_{OLS}$ for the ordinary least squares estimator
$$\hat\beta_{OLS} = (X^\top X)^{-1}X^\top y \tag{19.0.13}$$
which has different properties now since we do not assume $\varepsilon \sim (o, \sigma^2 I)$ but $\varepsilon \sim (o, \sigma^2\Psi)$.

• a. 1 point Is $\hat\beta_{OLS}$ unbiased?

• b. 2 points Show that, still under the assumption $\varepsilon \sim (o, \sigma^2\Psi)$, $\mathcal{V}[\hat\beta_{OLS}] - \mathcal{V}[\hat\beta] = \mathcal{V}[\hat\beta_{OLS} - \hat\beta]$. (Write down the formulas for the left hand side and the right hand side and then show by matrix algebra that they are equal.) (This is what one should expect after Problem 170.) Since due to unbiasedness the covariance matrices are the MSE-matrices, this shows that $\operatorname{MSE}[\hat\beta_{OLS};\beta] - \operatorname{MSE}[\hat\beta;\beta]$ is nonnegative definite.

Answer. Verify equality of the following two expressions for the differences in MSE matrices:
\begin{align*}
\mathcal{V}[\hat\beta_{OLS}] - \mathcal{V}[\hat\beta] &= \sigma^2\bigl((X^\top X)^{-1}X^\top\Psi X(X^\top X)^{-1} - (X^\top\Psi^{-1}X)^{-1}\bigr)\\
&= \sigma^2\bigl((X^\top X)^{-1}X^\top - (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}\bigr)\,\Psi\,\bigl(X(X^\top X)^{-1} - \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}\bigr)
\end{align*}

Examples of GLS models are discussed in chapters ?? and ??.

CHAPTER 20

Best Linear Prediction

Best Linear Prediction is the second basic building block for the linear model, in addition to the OLS model. Instead of estimating a nonrandom parameter $\beta$ about which no prior information is available, in the present situation one predicts a random variable $z$ whose mean and covariance matrix are known. Most models to be discussed below are somewhere between these two extremes.

Christensen's [Chr87] is one of the few textbooks which treat best linear prediction on the basis of known first and second moments in parallel with the regression model.
The two models have indeed so much in common that they should be treated together.

20.1. Minimum Mean Squared Error, Unbiasedness Not Required

Assume the expected values of the random vectors $y$ and $z$ are known, and their joint covariance matrix is known up to an unknown scalar factor $\sigma^2 > 0$. We will write this as
$$\begin{bmatrix} y\\ z \end{bmatrix} \sim \left(\begin{bmatrix} \mu\\ \nu \end{bmatrix},\ \sigma^2\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix}\right), \qquad \sigma^2 > 0. \tag{20.1.1}$$
$y$ is observed but $z$ is not, and the goal is to predict $z$ on the basis of the observation of $y$.

There is a unique predictor of the form $z^* = B^*y + b^*$ (i.e., it is linear with a constant term, the technical term for this is “affine”) with the following two properties: it is unbiased, and the prediction error is uncorrelated with $y$, i.e.,
$$\mathcal{C}[z^* - z, y] = O. \tag{20.1.2}$$
The formulas for $B^*$ and $b^*$ are easily derived. Unbiasedness means $\nu = B^*\mu + b^*$, the predictor has therefore the form
$$z^* = \nu + B^*(y - \mu). \tag{20.1.3}$$
Since
$$z^* - z = B^*(y - \mu) - (z - \nu) = \begin{bmatrix} B^* & -I \end{bmatrix}\begin{bmatrix} y - \mu\\ z - \nu \end{bmatrix}, \tag{20.1.4}$$
the zero correlation condition (20.1.2) translates into
$$B^*\Omega_{yy} = \Omega_{zy}, \tag{20.1.5}$$
which, due to equation (A.5.13), holds for $B^* = \Omega_{zy}\Omega_{yy}^-$. Therefore the predictor
$$z^* = \nu + \Omega_{zy}\Omega_{yy}^-(y - \mu) \tag{20.1.6}$$
satisfies the two requirements.

Unbiasedness and condition (20.1.2) are sometimes interpreted to mean that $z^*$ is an optimal predictor. Unbiasedness is often naively (but erroneously) considered to be a necessary condition for good estimators. And if the prediction error were correlated with the observed variable, the argument goes, then it would be possible to improve the prediction. Theorem 20.1.1 shows that despite the flaws in the argument, the result which it purports to show is indeed valid: $z^*$ has the minimum MSE of all affine predictors, whether biased or not, of $z$ on the basis of $y$.

Theorem 20.1.1. In situation (20.1.1), the predictor (20.1.6) has, among all predictors of $z$ which are affine functions of $y$, the smallest MSE matrix.
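Its MSE matrix is
$$\operatorname{MSE}[z^*; z] = \operatorname{E}[(z^* - z)(z^* - z)^\top] = \sigma^2(\Omega_{zz} - \Omega_{zy}\Omega_{yy}^-\Omega_{yz}) = \sigma^2\Omega_{zz.y}. \tag{20.1.7}$$

Proof. Look at any predictor of the form $\tilde z = \tilde By + \tilde b$. Its bias is $\tilde d = \operatorname{E}[\tilde z - z] = \tilde B\mu + \tilde b - \nu$, and by (17.1.2) one can write
\begin{align}
\operatorname{E}[(\tilde z - z)(\tilde z - z)^\top] &= \mathcal{V}[(\tilde z - z)] + \tilde d\tilde d^\top \tag{20.1.8}\\
&= \mathcal{V}\Bigl[\begin{bmatrix} \tilde B & -I \end{bmatrix}\begin{bmatrix} y\\ z \end{bmatrix}\Bigr] + \tilde d\tilde d^\top \tag{20.1.9}\\
&= \sigma^2\begin{bmatrix} \tilde B & -I \end{bmatrix}\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix}\begin{bmatrix} \tilde B^\top\\ -I \end{bmatrix} + \tilde d\tilde d^\top. \tag{20.1.10}
\end{align}
This MSE-matrix is minimized if and only if $\tilde d = o$ and $\tilde B$ satisfies (20.1.5). To see this, take any solution $B^*$ of (20.1.5), and write $\tilde B = B^* + \tilde D$. Since, due to theorem A.5.11, $\Omega_{zy} = \Omega_{zy}\Omega_{yy}^-\Omega_{yy}$, it follows $\Omega_{zy}B^{*\top} = \Omega_{zy}\Omega_{yy}^-\Omega_{yy}B^{*\top} = \Omega_{zy}\Omega_{yy}^-\Omega_{yz}$. Therefore
\begin{align}
\operatorname{MSE}[\tilde z; z] &= \sigma^2\begin{bmatrix} B^* + \tilde D & -I \end{bmatrix}\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix}\begin{bmatrix} B^{*\top} + \tilde D^\top\\ -I \end{bmatrix} + \tilde d\tilde d^\top \notag\\
&= \sigma^2\begin{bmatrix} B^* + \tilde D & -I \end{bmatrix}\begin{bmatrix} \Omega_{yy}\tilde D^\top\\ -\Omega_{zz.y} + \Omega_{zy}\tilde D^\top \end{bmatrix} + \tilde d\tilde d^\top \tag{20.1.11}\\
&= \sigma^2(\Omega_{zz.y} + \tilde D\Omega_{yy}\tilde D^\top) + \tilde d\tilde d^\top. \tag{20.1.12}
\end{align}
The MSE matrix is therefore minimized (with minimum value $\sigma^2\Omega_{zz.y}$) if and only if $\tilde d = o$ and $\tilde D\Omega_{yy} = O$, which means that $\tilde B$, along with $B^*$, satisfies (20.1.5).

Problem 267. Show that the solution of this minimum MSE problem is unique in the following sense: if $B^*_1$ and $B^*_2$ are two different solutions of (20.1.5) and $y$ is any feasible observed value, plugged into equation (20.1.3) they will lead to the same predicted value $z^*$.

Answer. Comes from the fact that every feasible observed value of $y$ can be written in the form $y = \mu + \Omega_{yy}q$ for some $q$, therefore $B^*_i(y - \mu) = B^*_i\Omega_{yy}q = \Omega_{zy}q$.

The matrix $B^*$ is also called the regression matrix of $z$ on $y$, and the unscaled covariance matrix has the form
$$\Omega = \begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix} = \begin{bmatrix} \Omega_{yy} & \Omega_{yy}X^\top\\ X\Omega_{yy} & X\Omega_{yy}X^\top + \Omega_{zz.y} \end{bmatrix} \tag{20.1.13}$$
where we wrote here $B^* = X$ in order to make the analogy with regression clearer. A g-inverse is
$$\Omega^- = \begin{bmatrix} \Omega_{yy}^- + X^\top\Omega_{zz.y}^-X & -X^\top\Omega_{zz.y}^-\\ -\Omega_{zz.y}^-X & \Omega_{zz.y}^- \end{bmatrix} \tag{20.1.14}$$
and every g-inverse of the covariance matrix has a g-inverse of $\Omega_{zz.y}$ as its $zz$-partition. (Proof in Problem 392.)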
If $\Omega = \begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix}$ is nonsingular, (20.1.5) is also solved by $B^* = -(\Omega^{zz})^{-}\Omega^{zy}$ where $\Omega^{zz}$ and $\Omega^{zy}$ are the corresponding partitions of the inverse $\Omega^{-1}$. See Problem 392 for a proof. Therefore instead of (20.1.6) the predictor can also be written
$$z^* = \nu - (\Omega^{zz})^{-1}\Omega^{zy}(y - \mu) \tag{20.1.15}$$
(note the minus sign) or
$$z^* = \nu - \Omega_{zz.y}\Omega^{zy}(y - \mu). \tag{20.1.16}$$

Problem 268. This problem utilizes the concept of a bounded risk estimator, which is not yet explained very well in these notes. Assume $y$, $z$, $\mu$, and $\nu$ are jointly distributed random vectors. First assume $\nu$ and $\mu$ are observed, but $y$ and $z$ are not. Assume we know that in this case, the best linear bounded MSE predictor of $y$ and $z$ is $\mu$ and $\nu$, with prediction errors distributed as follows:
$$\begin{bmatrix} y - \mu\\ z - \nu \end{bmatrix} \sim \left(\begin{bmatrix} o\\ o \end{bmatrix},\ \sigma^2\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix}\right). \tag{20.1.17}$$
This is the initial information. Here it is unnecessary to specify the unconditional distributions of $\mu$ and $\nu$, i.e., $\operatorname{E}[\mu]$ and $\operatorname{E}[\nu]$ as well as the joint covariance matrix of $\mu$ and $\nu$ are not needed, even if they are known.

Then in a second step assume that an observation of $y$ becomes available, i.e., now $y$, $\nu$, and $\mu$ are observed, but $z$ still isn't. Then the predictor
$$z^* = \nu + \Omega_{zy}\Omega_{yy}^-(y - \mu) \tag{20.1.18}$$
is the best linear bounded MSE predictor of $z$ based on $y$, $\mu$, and $\nu$.

• a. Give special cases of this specification in which $\mu$ and $\nu$ are constant and $y$ and $z$ random, and one in which $\mu$ and $\nu$ and $y$ are random and $z$ is constant, and one in which $\mu$ and $\nu$ are random and $y$ and $z$ are constant.

Answer. If $\mu$ and $\nu$ are constant, they are written $\mu$ and $\nu$. From this follows $\mu = \operatorname{E}[y]$ and $\nu = \operatorname{E}[z]$ and $\sigma^2\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz} \end{bmatrix} = \mathcal{V}\Bigl[\begin{bmatrix} y\\ z \end{bmatrix}\Bigr]$, and every linear predictor has bounded MSE. Then the proof is as given earlier in this chapter. But an example in which $\mu$ and $\nu$ are not known constants but are observed random variables, and $y$ is also a random variable but $z$ is constant, is (21.0.26).
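Another example, in which $y$ and $z$ both are constants and $\mu$ and $\nu$ random, is constrained least squares (22.4.3).

• b. Prove equation (20.1.18).

Answer. In this proof we allow all four $\mu$ and $\nu$ and $y$ and $z$ to be random. A linear predictor based on $y$, $\mu$, and $\nu$ can be written as $\tilde z = By + C\mu + D\nu + d$, therefore $\tilde z - z = B(y - \mu) + (C + B)\mu + (D - I)\nu - (z - \nu) + d$. $\operatorname{E}[\tilde z - z] = o + (C + B)\operatorname{E}[\mu] + (D - I)\operatorname{E}[\nu] - o + d$. Assuming that $\operatorname{E}[\mu]$ and $\operatorname{E}[\nu]$ can be anything, the requirement of bounded MSE (or simply the requirement of unbiasedness, but this is not as elegant) gives $C = -B$ and $D = I$, therefore $\tilde z = \nu + B(y - \mu) + d$, and the estimation error is $\tilde z - z = B(y - \mu) - (z - \nu) + d$. Now continue as in the proof of theorem 20.1.1. I must still carry out this proof much more carefully!

Problem 269. 4 points According to (20.1.2), the prediction error $z^* - z$ is uncorrelated with $y$. If the distribution is such that the prediction error is even independent of $y$ (as is the case if $y$ and $z$ are jointly normal), then $z^*$ as defined in (20.1.6) is the conditional mean $z^* = \operatorname{E}[z|y]$, and its MSE-matrix as defined in (20.1.7) is the conditional variance $\mathcal{V}[z|y]$.

Answer. From independence follows $\operatorname{E}[z^* - z|y] = \operatorname{E}[z^* - z]$, and by the law of iterated expectations $\operatorname{E}[z^* - z] = o$. Rewrite this as $\operatorname{E}[z|y] = \operatorname{E}[z^*|y]$. But since $z^*$ is a function of $y$, $\operatorname{E}[z^*|y] = z^*$. Now the proof that the conditional dispersion matrix is the MSE matrix:
\begin{align}
\mathcal{V}[z|y] &= \operatorname{E}[(z - \operatorname{E}[z|y])(z - \operatorname{E}[z|y])^\top|y] = \operatorname{E}[(z - z^*)(z - z^*)^\top|y] \notag\\
&= \operatorname{E}[(z - z^*)(z - z^*)^\top] = \operatorname{MSE}[z^*; z]. \tag{20.1.19}
\end{align}

Problem 270. Assume the expected values of $x$, $y$ and $z$ are known, and their joint covariance matrix is known up to an unknown scalar factor $\sigma^2 > 0$.
$$\begin{bmatrix} x\\ y\\ z \end{bmatrix} \sim \left(\begin{bmatrix} \lambda\\ \mu\\ \nu \end{bmatrix},\ \sigma^2\begin{bmatrix} \Omega_{xx} & \Omega_{xy} & \Omega_{xz}\\ \Omega_{xy}^\top & \Omega_{yy} & \Omega_{yz}\\ \Omega_{xz}^\top & \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix}\right). \tag{20.1.20}$$
$x$ is the original information, $y$ is additional information which becomes available, and $z$ is the variable which we want to predict on the basis of this information.

• a. 2 points Show that $y^* = \mu + \Omega_{xy}^\top\Omega_{xx}^-(x - \lambda)$ is the best linear predictor of $y$ and $z^* = \nu + \Omega_{xz}^\top\Omega_{xx}^-(x - \lambda)$ the best linear predictor of $z$ on the basis of the observation of $x$, and that their joint MSE-matrix is
$$\operatorname{E}\left[\begin{bmatrix} y^* - y\\ z^* - z \end{bmatrix}\begin{bmatrix} (y^* - y)^\top & (z^* - z)^\top \end{bmatrix}\right] = \sigma^2\begin{bmatrix} \Omega_{yy} - \Omega_{xy}^\top\Omega_{xx}^-\Omega_{xy} & \Omega_{yz} - \Omega_{xy}^\top\Omega_{xx}^-\Omega_{xz}\\ \Omega_{yz}^\top - \Omega_{xz}^\top\Omega_{xx}^-\Omega_{xy} & \Omega_{zz} - \Omega_{xz}^\top\Omega_{xx}^-\Omega_{xz} \end{bmatrix}$$
which can also be written
$$= \sigma^2\begin{bmatrix} \Omega_{yy.x} & \Omega_{yz.x}\\ \Omega_{yz.x}^\top & \Omega_{zz.x} \end{bmatrix}.$$

Answer. This part of the question is a simple application of the formulas derived earlier. For the MSE-matrix you first get
$$\sigma^2\left(\begin{bmatrix} \Omega_{yy} & \Omega_{yz}\\ \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix} - \begin{bmatrix} \Omega_{xy}^\top\\ \Omega_{xz}^\top \end{bmatrix}\Omega_{xx}^-\begin{bmatrix} \Omega_{xy} & \Omega_{xz} \end{bmatrix}\right)$$

• b. 5 points Show that the best linear predictor of $z$ on the basis of the observations of $x$ and $y$ has the form
$$z^{**} = z^* + \Omega_{yz.x}^\top\Omega_{yy.x}^-(y - y^*) \tag{20.1.21}$$
This is an important formula. All you need to compute $z^{**}$ is the best estimate $z^*$ before the new information $y$ became available, the best estimate $y^*$ of that new information itself, and the joint MSE matrix of the two. The original data $x$ and the covariance matrix (20.1.20) do not enter this formula.

Answer. Follows from
$$z^{**} = \nu + \begin{bmatrix} \Omega_{xz}^\top & \Omega_{yz}^\top \end{bmatrix}\begin{bmatrix} \Omega_{xx} & \Omega_{xy}\\ \Omega_{xy}^\top & \Omega_{yy} \end{bmatrix}^-\begin{bmatrix} x - \lambda\\ y - \mu \end{bmatrix} =$$
Now apply (A.8.2):
\begin{align*}
&= \nu + \begin{bmatrix} \Omega_{xz}^\top & \Omega_{yz}^\top \end{bmatrix}\begin{bmatrix} \Omega_{xx}^- + \Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-\Omega_{xy}^\top\Omega_{xx}^- & -\Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-\\ -\Omega_{yy.x}^-\Omega_{xy}^\top\Omega_{xx}^- & \Omega_{yy.x}^- \end{bmatrix}\begin{bmatrix} x - \lambda\\ y - \mu \end{bmatrix}\\
&= \nu + \begin{bmatrix} \Omega_{xz}^\top & \Omega_{yz}^\top \end{bmatrix}\begin{bmatrix} \Omega_{xx}^-(x - \lambda) + \Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-(y^* - \mu) - \Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-(y - \mu)\\ -\Omega_{yy.x}^-(y^* - \mu) + \Omega_{yy.x}^-(y - \mu) \end{bmatrix}\\
&= \nu + \begin{bmatrix} \Omega_{xz}^\top & \Omega_{yz}^\top \end{bmatrix}\begin{bmatrix} \Omega_{xx}^-(x - \lambda) - \Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-(y - y^*)\\ +\Omega_{yy.x}^-(y - y^*) \end{bmatrix}\\
&= \nu + \Omega_{xz}^\top\Omega_{xx}^-(x - \lambda) - \Omega_{xz}^\top\Omega_{xx}^-\Omega_{xy}\Omega_{yy.x}^-(y - y^*) + \Omega_{yz}^\top\Omega_{yy.x}^-(y - y^*)\\
&= z^* + \bigl(\Omega_{yz}^\top - \Omega_{xz}^\top\Omega_{xx}^-\Omega_{xy}\bigr)\Omega_{yy.x}^-(y - y^*) = z^* + \Omega_{yz.x}^\top\Omega_{yy.x}^-(y - y^*)
\end{align*}

Problem 271.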
Assume $x$, $y$, and $z$ have a joint probability distribution, and the conditional expectation $\operatorname{E}[z|x, y] = \alpha^* + A^*x + B^*y$ is linear in $x$ and $y$.

• a. 1 point Show that $\operatorname{E}[z|x] = \alpha^* + A^*x + B^*\operatorname{E}[y|x]$. Hint: you may use the law of iterated expectations in the following form: $\operatorname{E}[z|x] = \operatorname{E}\bigl[\operatorname{E}[z|x, y]\,\big|\,x\bigr]$.

Answer. With this hint it is trivial: $\operatorname{E}[z|x] = \operatorname{E}\bigl[\alpha^* + A^*x + B^*y\,\big|\,x\bigr] = \alpha^* + A^*x + B^*\operatorname{E}[y|x]$.

• b. 1 point The next three examples are from [CW99, pp. 264/5]: Assume $\operatorname{E}[z|x, y] = 1 + 2x + 3y$, $x$ and $y$ are independent, and $\operatorname{E}[y] = 2$. Compute $\operatorname{E}[z|x]$.

Answer. According to the formula, $\operatorname{E}[z|x] = 1 + 2x + 3\operatorname{E}[y|x]$, but since $x$ and $y$ are independent, $\operatorname{E}[y|x] = \operatorname{E}[y] = 2$; therefore $\operatorname{E}[z|x] = 7 + 2x$. I.e., the slope is the same, but the intercept changes.

• c. 1 point Assume again $\operatorname{E}[z|x, y] = 1 + 2x + 3y$, but this time $x$ and $y$ are not independent but $\operatorname{E}[y|x] = 2 - x$. Compute $\operatorname{E}[z|x]$.

Answer. $\operatorname{E}[z|x] = 1 + 2x + 3(2 - x) = 7 - x$. In this situation, both slope and intercept change, but it is still a linear relationship.

• d. 1 point Again $\operatorname{E}[z|x, y] = 1 + 2x + 3y$, and this time the relationship between $x$ and $y$ is nonlinear: $\operatorname{E}[y|x] = 2 - e^x$. Compute $\operatorname{E}[z|x]$.

Answer. $\operatorname{E}[z|x] = 1 + 2x + 3(2 - e^x) = 7 + 2x - 3e^x$. This time the marginal relationship between $x$ and $z$ is no longer linear. This is so despite the fact that, if all the variables are included, i.e., if both $x$ and $y$ are included, then the relationship is linear.

• e. 1 point Assume $\operatorname{E}[f(z)|x, y] = 1 + 2x + 3y$, where $f$ is a nonlinear function, and $\operatorname{E}[y|x] = 2 - x$. Compute $\operatorname{E}[f(z)|x]$.

Answer. $\operatorname{E}[f(z)|x] = 1 + 2x + 3(2 - x) = 7 - x$. If one plots $z$ against $x$ and against $y$, then the plots should be similar, though not identical, since the same transformation $f$ will straighten them out. This is why the plots in the top row or right column of [CW99, p. 435] are so similar.
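Part c of Problem 271 can be checked numerically. The sketch below is only an illustration under added assumptions: the notes specify the conditional means $\operatorname{E}[z|x,y] = 1 + 2x + 3y$ and $\operatorname{E}[y|x] = 2 - x$ but no distributions, so standard normal $x$ and standard normal disturbances are assumed here. The regression of $z$ on both $x$ and $y$ should recover coefficients near $(1, 2, 3)$, while the regression of $z$ on $x$ alone should give an intercept near 7 and a slope near $-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed distributions (not specified in the notes): standard normal x and disturbances
x = rng.normal(size=n)
y = 2.0 - x + rng.normal(size=n)               # E[y|x] = 2 - x
z = 1.0 + 2.0 * x + 3.0 * y + rng.normal(size=n)  # E[z|x,y] = 1 + 2x + 3y

# Long regression: z on a constant, x, and y
X_long = np.column_stack([np.ones(n), x, y])
coef_long, *_ = np.linalg.lstsq(X_long, z, rcond=None)

# Short regression: z on a constant and x only, estimating E[z|x] = 7 - x
X_short = np.column_stack([np.ones(n), x])
coef_short, *_ = np.linalg.lstsq(X_short, z, rcond=None)

print("long regression (const, x, y):", coef_long)
print("short regression (const, x):  ", coef_short)
```

Both slope and intercept change when $y$ is omitted, as stated in the answer to part c, and the short regression is still linear because $\operatorname{E}[y|x]$ is linear in $x$.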
Connection between prediction and inverse prediction: If $y$ is observed and $z$ is to be predicted, the BLUP is $z^* - \nu = B^*(y - \mu)$ where $B^* = \Omega_{zy}\Omega_{yy}^-$. If $z$ is observed and $y$ is to be predicted, then the BLUP is $y^* - \mu = C^*(z - \nu)$ with $C^* = \Omega_{yz}\Omega_{zz}^-$. $B^*$ and $C^*$ are connected by the formula
$$\Omega_{yy}B^{*\top} = C^*\Omega_{zz}. \tag{20.1.22}$$
This relationship can be used for graphical regression methods [Coo98, pp. 187/8]: If $z$ is a scalar, it is much easier to determine the elements of $C^*$ than those of $B^*$. $C^*$ consists of the regression slopes in the scatter plot of each of the observed variables against $z$. They can be read off easily from a scatterplot matrix. This works not only if the distribution is Normal, but also with arbitrary distributions as long as all conditional expectations between the explanatory variables are linear.

Problem 272. In order to make relationship (20.1.22) more intuitive, assume $x$ and $\varepsilon$ are Normally distributed and independent of each other, and $\operatorname{E}[\varepsilon] = 0$. Define $y = \alpha + \beta x + \varepsilon$.

• a. Show that $\alpha + \beta x$ is the best linear predictor of $y$ based on the observation of $x$.

Answer. Follows from the fact that the predictor is unbiased and the prediction error is uncorrelated with $x$.

• b. Express $\beta$ in terms of the variances and covariances of $x$ and $y$.

Answer. $\operatorname{cov}[x, y] = \beta\operatorname{var}[x]$, therefore $\beta = \frac{\operatorname{cov}[x, y]}{\operatorname{var}[x]}$.

• c. Since $x$ and $y$ are jointly normal, they can also be written $x = \gamma + \delta y + \omega$ where $\omega$ is independent of $y$. Express $\delta$ in terms of the variances and covariances of $x$ and $y$, and show that $\operatorname{var}[y]\,\delta = \beta\operatorname{var}[x]$.

Answer. $\delta = \frac{\operatorname{cov}[x, y]}{\operatorname{var}[y]}$.

• d. Now let us extend the model a little: assume $x_1$, $x_2$, and $\varepsilon$ are Normally distributed and independent of each other, and $\operatorname{E}[\varepsilon] = 0$. Define $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. Again express $\beta_1$ and $\beta_2$ in terms of variances and covariances of $x_1$, $x_2$, and $y$.

Answer.
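Since $x_1$ and $x_2$ are independent, one gets the same formulas as in the univariate case: from $\operatorname{cov}[x_1,y] = \beta_1 \operatorname{var}[x_1]$ and $\operatorname{cov}[x_2,y] = \beta_2 \operatorname{var}[x_2]$ follows $\beta_1 = \operatorname{cov}[x_1,y]/\operatorname{var}[x_1]$ and $\beta_2 = \operatorname{cov}[x_2,y]/\operatorname{var}[x_2]$.

• e. Since $x_1$ and $y$ are jointly normal, they can also be written $x_1 = \gamma_1 + \delta_1 y + \omega_1$, where $\omega_1$ is independent of $y$. Likewise, $x_2 = \gamma_2 + \delta_2 y + \omega_2$, where $\omega_2$ is independent of $y$. Express $\delta_1$ and $\delta_2$ in terms of the variances and covariances of $x_1$, $x_2$, and $y$, and show that
\[ \begin{bmatrix}\delta_1\\ \delta_2\end{bmatrix}\operatorname{var}[y] = \begin{bmatrix}\operatorname{var}[x_1] & 0\\ 0 & \operatorname{var}[x_2]\end{bmatrix}\begin{bmatrix}\beta_1\\ \beta_2\end{bmatrix}. \tag{20.1.23} \]
This is (20.1.22) in the present situation.

Answer. $\delta_1 = \operatorname{cov}[x_1,y]/\operatorname{var}[y]$ and $\delta_2 = \operatorname{cov}[x_2,y]/\operatorname{var}[y]$.

20.2. The Associated Least Squares Problem

For every estimation problem there is an associated "least squares" problem. In the present situation, $z^*$ is that value which, together with the given observation $y$, "blends best" into the population defined by $\mu$, $\nu$ and the dispersion matrix $\Omega$, in the following sense: Given the observed value $y$, the vector $z^* = \nu + \Omega_{zy}\Omega_{yy}^-(y - \mu)$ is that value $z$ for which $\begin{bmatrix}y\\ z\end{bmatrix}$ has smallest Mahalanobis distance from the population defined by the mean vector $\begin{bmatrix}\mu\\ \nu\end{bmatrix}$ and the covariance matrix $\sigma^2\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}$.

In the case of singular $\Omega_{zz}$, it is only necessary to minimize among those $z$ which have finite distance from the population, i.e., which can be written in the form $z = \nu + \Omega_{zz} q$ for some $q$. We will also write $r = \operatorname{rank}\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}$. Therefore, $z^*$ solves the following "least squares problem:"
\[ z = z^* \text{ minimizes } \frac{1}{r\sigma^2}\begin{bmatrix}y-\mu\\ z-\nu\end{bmatrix}^\top \begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}^- \begin{bmatrix}y-\mu\\ z-\nu\end{bmatrix} \quad\text{s.t. } z = \nu + \Omega_{zz} q \text{ for some } q. \tag{20.2.1} \]
To prove this, use (A.8.2) to invert the dispersion matrix:
\[ \begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}^- = \begin{bmatrix}\Omega_{yy}^- + \Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & -\Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\\ -\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & \Omega_{zz.y}^-\end{bmatrix}. \tag{20.2.2} \]
If one plugs $z = z^*$ into this objective function, one obtains a very simple expression:
\[ (y-\mu)^\top \begin{bmatrix} I & \Omega_{yy}^-\Omega_{yz}\end{bmatrix} \begin{bmatrix}\Omega_{yy}^- + \Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & -\Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\\ -\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & \Omega_{zz.y}^-\end{bmatrix} \begin{bmatrix} I\\ \Omega_{zy}\Omega_{yy}^-\end{bmatrix}(y-\mu) = \tag{20.2.3} \]
\[ = (y-\mu)^\top \Omega_{yy}^- (y-\mu). \tag{20.2.4} \]
Now take any $z$ of the form $z = \nu + \Omega_{zz} q$ for some $q$ and write it in the form $z = z^* + \Omega_{zz} d$, i.e.,
\[ \begin{bmatrix}y-\mu\\ z-\nu\end{bmatrix} = \begin{bmatrix}y-\mu\\ z^*-\nu\end{bmatrix} + \begin{bmatrix}o\\ \Omega_{zz} d\end{bmatrix}. \]
Then the cross product terms in the objective function disappear:
\[ \begin{bmatrix} o^\top & d^\top\Omega_{zz}\end{bmatrix} \begin{bmatrix}\Omega_{yy}^- + \Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & -\Omega_{yy}^-\Omega_{yz}\Omega_{zz.y}^-\\ -\Omega_{zz.y}^-\Omega_{zy}\Omega_{yy}^- & \Omega_{zz.y}^-\end{bmatrix} \begin{bmatrix} I\\ \Omega_{zy}\Omega_{yy}^-\end{bmatrix}(y-\mu) = \begin{bmatrix} o^\top & d^\top\Omega_{zz}\end{bmatrix}\begin{bmatrix}\Omega_{yy}^-\\ O\end{bmatrix}(y-\mu) = 0. \tag{20.2.5} \]
Therefore the objective function at $z$ equals its value at $z^*$ plus the nonnegative quadratic form $d^\top\Omega_{zz}\Omega_{zz.y}^-\Omega_{zz}d$ (times $1/(r\sigma^2)$), i.e., any such $z$ gives a value at least as large as $z^*$ does.

Problem 273. Use problem 379 for an alternative proof of this.

From (20.2.1) follows that $z^*$ is the mode of the normal density function, and since the mode is the mean, this is an alternative proof, in the case of nonsingular covariance matrix, when the density exists, that $z^*$ is the normal conditional mean.

20.3. Prediction of Future Observations in the Regression Model

For a moment let us go back to the model $y = X\beta + \varepsilon$ with spherically distributed disturbances $\varepsilon \sim (o, \sigma^2 I)$. This time, our goal is not to estimate $\beta$, but the situation is the following: For a new set of observations of the explanatory variables $X_0$ the values of the dependent variable $y_0 = X_0\beta + \varepsilon_0$ have not yet been observed and we want to predict them. The obvious predictor is $y_0^* = X_0\hat\beta = X_0(X^\top X)^{-1}X^\top y$. Since
\[ y_0^* - y_0 = X_0(X^\top X)^{-1}X^\top y - y_0 = X_0(X^\top X)^{-1}X^\top X\beta + X_0(X^\top X)^{-1}X^\top\varepsilon - X_0\beta - \varepsilon_0 = X_0(X^\top X)^{-1}X^\top\varepsilon - \varepsilon_0 \tag{20.3.1} \]
one sees that $E[y_0^* - y_0] = o$, i.e., it is an unbiased predictor. And since $\varepsilon$ and $\varepsilon_0$ are uncorrelated, one obtains
\[ \operatorname{MSE}[y_0^*; y_0] = V[y_0^* - y_0] = V[X_0(X^\top X)^{-1}X^\top\varepsilon] + V[\varepsilon_0] \tag{20.3.2} \]
\[ = \sigma^2\bigl(X_0(X^\top X)^{-1}X_0^\top + I\bigr). \tag{20.3.3} \]
Problem 274 shows that this is the Best Linear Unbiased Predictor (BLUP) of $y_0$ on the basis of $y$.

Problem 274. The prediction problem in the Ordinary Least Squares model can be formulated as follows:
\[ \begin{bmatrix}y\\ y_0\end{bmatrix} = \begin{bmatrix}X\\ X_0\end{bmatrix}\beta + \begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix}, \qquad E\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \begin{bmatrix}o\\ o\end{bmatrix}, \qquad V\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \sigma^2\begin{bmatrix}I & O\\ O & I\end{bmatrix}. \tag{20.3.4} \]
$X$ and $X_0$ are known, $y$ is observed, $y_0$ is not observed.

• a. 4 points Show that $y_0^* = X_0\hat\beta$ is the Best Linear Unbiased Predictor (BLUP) of $y_0$ on the basis of $y$, where $\hat\beta$ is the OLS estimate in the model $y = X\beta + \varepsilon$.

Answer. Take any other predictor $\tilde y_0 = \tilde B y$ and write $\tilde B = X_0(X^\top X)^{-1}X^\top + D$. Unbiasedness means $E[\tilde y_0 - y_0] = X_0(X^\top X)^{-1}X^\top X\beta + DX\beta - X_0\beta = o$, from which follows $DX = O$. Because of unbiasedness we know $\operatorname{MSE}[\tilde y_0; y_0] = V[\tilde y_0 - y_0]$. Since the prediction error can be written $\tilde y_0 - y_0 = \begin{bmatrix}X_0(X^\top X)^{-1}X^\top + D & -I\end{bmatrix}\begin{bmatrix}y\\ y_0\end{bmatrix}$, one obtains
\[ V[\tilde y_0 - y_0] = \begin{bmatrix}X_0(X^\top X)^{-1}X^\top + D & -I\end{bmatrix} V\begin{bmatrix}y\\ y_0\end{bmatrix} \begin{bmatrix}X(X^\top X)^{-1}X_0^\top + D^\top\\ -I\end{bmatrix} \]
\[ = \sigma^2 \begin{bmatrix}X_0(X^\top X)^{-1}X^\top + D & -I\end{bmatrix}\begin{bmatrix}X(X^\top X)^{-1}X_0^\top + D^\top\\ -I\end{bmatrix} \]
\[ = \sigma^2 \bigl(X_0(X^\top X)^{-1}X^\top + D\bigr)\bigl(X_0(X^\top X)^{-1}X^\top + D\bigr)^\top + \sigma^2 I \]
\[ = \sigma^2 \bigl(X_0(X^\top X)^{-1}X_0^\top + DD^\top + I\bigr), \]
where the cross terms vanish because $DX = O$. This is smallest for $D = O$.

• b. 2 points From our formulation of the Gauss-Markov theorem in Theorem 18.1.1 it is obvious that the same $y_0^* = X_0\hat\beta$ is also the Best Linear Unbiased Estimator of $X_0\beta$, which is the expected value of $y_0$. You are not required to re-prove this here, but you are asked to compute $\operatorname{MSE}[X_0\hat\beta; X_0\beta]$ and compare it with $\operatorname{MSE}[y_0^*; y_0]$. Can you explain the difference?

Answer. Estimation error and MSE are
\[ X_0\hat\beta - X_0\beta = X_0(\hat\beta - \beta) = X_0(X^\top X)^{-1}X^\top\varepsilon \quad\text{due to (??)} \]
\[ \operatorname{MSE}[X_0\hat\beta; X_0\beta] = V[X_0\hat\beta - X_0\beta] = V[X_0(X^\top X)^{-1}X^\top\varepsilon] = \sigma^2 X_0(X^\top X)^{-1}X_0^\top. \]
It differs from the prediction MSE matrix by $\sigma^2 I$, which is the uncertainty about the value of the new disturbance $\varepsilon_0$ about which the data have no information.

[Gre97, p. 369] has an enlightening formula showing how the prediction intervals increase if one goes away from the center of the data.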
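The matrix identities used in Problem 274 can be verified numerically. The sketch below is only an illustration under assumed inputs: the dimensions, the value of $\sigma^2$, the random design matrices, and the helper name `prediction_mse` are hypothetical and do not come from the notes. It checks that the OLS predictor has MSE matrix (20.3.3), that this exceeds the estimation MSE of $X_0\hat\beta$ by exactly $\sigma^2 I$, and that adding any $D$ with $DX = O$ increases the MSE matrix by the nonnegative definite matrix $\sigma^2 DD^\top$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, sigma2 = 30, 4, 3, 2.0   # illustrative sizes and variance (assumptions)
X = rng.normal(size=(n, k))       # old regressors
X0 = rng.normal(size=(m, k))      # new regressors
XtX_inv = np.linalg.inv(X.T @ X)

def prediction_mse(B):
    """MSE matrix sigma^2 (B B' + I) of the linear predictor B y of y0.

    Valid when B X = X0 (unbiasedness) and V[eps; eps0] = sigma^2 I.
    """
    return sigma2 * (B @ B.T + np.eye(m))

B_ols = X0 @ XtX_inv @ X.T                                   # y0* = B_ols y
mse_blup = prediction_mse(B_ols)
mse_formula = sigma2 * (X0 @ XtX_inv @ X0.T + np.eye(m))     # (20.3.3)
mse_estimation = sigma2 * X0 @ XtX_inv @ X0.T                # MSE[X0 beta_hat; X0 beta]

# A competing unbiased predictor: B_ols + D with D X = O.
M = np.eye(n) - X @ XtX_inv @ X.T        # residual maker, M X = O
D = rng.normal(size=(m, n)) @ M
mse_other = prediction_mse(B_ols + D)
excess_eigs = np.linalg.eigvalsh(mse_other - mse_blup)       # eigenvalues of sigma^2 D D'

print(np.allclose(mse_blup, mse_formula), excess_eigs.min() >= -1e-8)
```

Multiplying by the residual maker `M` is just one convenient way to generate a `D` with $DX = O$; any other construction with that property would serve equally well.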
Now let us look at the prediction problem in the Generalized Least Squares model
\[ \begin{bmatrix}y\\ y_0\end{bmatrix} = \begin{bmatrix}X\\ X_0\end{bmatrix}\beta + \begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix}, \qquad E\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \begin{bmatrix}o\\ o\end{bmatrix}, \qquad V\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \sigma^2\begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix}. \tag{20.3.5} \]
$X$ and $X_0$ are known, $y$ is observed, $y_0$ is not observed, and we assume $\Psi$ is positive definite. If $C = O$, the BLUP of $y_0$ is $X_0\hat\beta$, where $\hat\beta$ is the BLUE in the model $y = X\beta + \varepsilon$. In other words, all new disturbances are simply predicted by zero. If past and future disturbances are correlated, this predictor is no longer optimal.

In [JHG+88, pp. 343–346] it is proved that the best linear unbiased predictor of $y_0$ is
\[ y_0^* = X_0\hat\beta + C^\top\Psi^{-1}(y - X\hat\beta), \tag{20.3.6} \]
where $\hat\beta$ is the generalized least squares estimator of $\beta$, and that its MSE-matrix $\operatorname{MSE}[y_0^*; y_0]$ is
\[ \sigma^2\bigl(\Psi_0 - C^\top\Psi^{-1}C + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C)\bigr). \tag{20.3.7} \]

Problem 275. Derive the formula for the MSE matrix from the formula of the predictor, and compute the joint MSE matrix for the predicted values and the parameter vector.

Answer.
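The prediction error is, using (19.0.7),
\[ y_0^* - y_0 = X_0\hat\beta - X_0\beta + X_0\beta - y_0 + C^\top\Psi^{-1}(y - X\beta + X\beta - X\hat\beta) \tag{20.3.8} \]
\[ = X_0(\hat\beta - \beta) - \varepsilon_0 + C^\top\Psi^{-1}\bigl(\varepsilon - X(\hat\beta - \beta)\bigr) \tag{20.3.9} \]
\[ = C^\top\Psi^{-1}\varepsilon + (X_0 - C^\top\Psi^{-1}X)(\hat\beta - \beta) - \varepsilon_0 \tag{20.3.10} \]
\[ = \begin{bmatrix} C^\top\Psi^{-1} + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & -I\end{bmatrix}\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix}. \tag{20.3.11} \]
The MSE-matrix is therefore $E[(y_0^* - y_0)(y_0^* - y_0)^\top] =$
\[ = \sigma^2 \begin{bmatrix} C^\top\Psi^{-1} + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & -I\end{bmatrix} \begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix} \begin{bmatrix}\Psi^{-1}C + \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C)\\ -I\end{bmatrix} \tag{20.3.12} \]
and the joint MSE matrix with the sampling error of the parameter vector $\hat\beta - \beta$ is
\[ \sigma^2 \begin{bmatrix} C^\top\Psi^{-1} + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & -I\\ (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & O\end{bmatrix} \begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix} \begin{bmatrix}\Psi^{-1}C + \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) & \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}\\ -I & O\end{bmatrix} = \tag{20.3.13} \]
\[ = \sigma^2 \begin{bmatrix} C^\top\Psi^{-1} + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & -I\\ (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1} & O\end{bmatrix} \begin{bmatrix} X(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) & X(X^\top\Psi^{-1}X)^{-1}\\ C^\top\Psi^{-1}C + C^\top\Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) - \Psi_0 & C^\top\Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}\end{bmatrix}. \tag{20.3.14} \]
If one multiplies this out, one gets
\[ \sigma^2 \begin{bmatrix}\Psi_0 - C^\top\Psi^{-1}C + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) & (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}\\ (X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) & (X^\top\Psi^{-1}X)^{-1}\end{bmatrix}. \tag{20.3.15} \]
The upper left diagonal element is as claimed in (20.3.7).

The strategy of the proof given in ITPE is similar to the strategy used to obtain the GLS results, namely, to transform the data in such a way that the disturbances are well behaved. Both data vectors $y$ and $y_0$ will be transformed, but this transformation must have the following additional property: the transformed $y$ must be a function of $y$ alone, not of $y_0$. Once such a transformation is found, it is easy to predict the transformed $y_0$ on the basis of the transformed $y$, and from this one also obtains a prediction of $y_0$ on the basis of $y$.

Here are some heuristics for understanding formula (20.3.6). Assume for a moment that $\beta$ was known. Then you can apply theorem ?? to the model
\[ \begin{bmatrix}y\\ y_0\end{bmatrix} \sim \left(\begin{bmatrix}X\beta\\ X_0\beta\end{bmatrix},\ \sigma^2\begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix}\right) \tag{20.3.16} \]
to get $y_0^* = X_0\beta + C^\top\Psi^{-1}(y - X\beta)$ as best linear predictor of $y_0$ on the basis of $y$. According to theorem ??, its MSE matrix is $\sigma^2(\Psi_0 - C^\top\Psi^{-1}C)$. Since $\beta$ is not known, replace it by $\hat\beta$, which gives exactly (20.3.6). This adds $\operatorname{MSE}[X_0\hat\beta + C^\top\Psi^{-1}(y - X\hat\beta);\ X_0\beta + C^\top\Psi^{-1}(y - X\beta)]$ to the MSE-matrix, which gives (20.3.7).

Problem 276. Show that
\[ \operatorname{MSE}[X_0\hat\beta + C^\top\Psi^{-1}(y - X\hat\beta);\ X_0\beta + C^\top\Psi^{-1}(y - X\beta)] = \sigma^2 (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C). \tag{20.3.17} \]

Answer. What is predicted is a random variable, therefore the MSE matrix is the covariance matrix of the prediction error. The prediction error is $(X_0 - C^\top\Psi^{-1}X)(\hat\beta - \beta)$, its covariance matrix is therefore $\sigma^2 (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C)$.

Problem 277. In the following we work with partitioned matrices. Given the model
\[ \begin{bmatrix}y\\ y_0\end{bmatrix} = \begin{bmatrix}X\\ X_0\end{bmatrix}\beta + \begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix}, \qquad E\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \begin{bmatrix}o\\ o\end{bmatrix}, \qquad V\begin{bmatrix}\varepsilon\\ \varepsilon_0\end{bmatrix} = \sigma^2\begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix}. \tag{20.3.18} \]
$X$ has full rank. $y$ is observed, $y_0$ is not observed. $C$ is not the null matrix.

• a. Someone predicts $y_0$ by $y_0^* = X_0\hat\beta$, where $\hat\beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y$ is the BLUE of $\beta$. Is this predictor unbiased?

Answer. Yes, since $E[y_0] = X_0\beta$, and $E[\hat\beta] = \beta$.

• b. Compute the MSE matrix $\operatorname{MSE}[X_0\hat\beta; y_0]$ of this predictor. Hint: For any matrix $B$, the difference $By - y_0$ can be written in the form $\begin{bmatrix}B & -I\end{bmatrix}\begin{bmatrix}y\\ y_0\end{bmatrix}$. Hint: For an unbiased predictor (or estimator), the MSE matrix is the covariance matrix of the prediction (or estimation) error.

Answer.
\[ E[(By - y_0)(By - y_0)^\top] = V[By - y_0] \tag{20.3.19} \]
\[ = V\left[\begin{bmatrix}B & -I\end{bmatrix}\begin{bmatrix}y\\ y_0\end{bmatrix}\right] \tag{20.3.20} \]
\[ = \sigma^2 \begin{bmatrix}B & -I\end{bmatrix}\begin{bmatrix}\Psi & C\\ C^\top & \Psi_0\end{bmatrix}\begin{bmatrix}B^\top\\ -I\end{bmatrix} \tag{20.3.21} \]
\[ = \sigma^2\bigl(B\Psi B^\top - C^\top B^\top - BC + \Psi_0\bigr). \tag{20.3.22} \]
Now one must use $B = X_0(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}$. One ends up with
\[ \operatorname{MSE}[X_0\hat\beta; y_0] = \sigma^2\bigl(X_0(X^\top\Psi^{-1}X)^{-1}X_0^\top - C^\top\Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}X_0^\top - X_0(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}C + \Psi_0\bigr). \tag{20.3.23} \]

• c. Compare its MSE-matrix with formula (20.3.7). Is the difference nonnegative definite?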
Answer. To compare it with the minimum MSE matrix, it can also be written as
\[ \operatorname{MSE}[X_0\hat\beta; y_0] = \sigma^2\bigl(\Psi_0 + (X_0 - C^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(X_0^\top - X^\top\Psi^{-1}C) - C^\top\Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}C\bigr), \tag{20.3.24} \]
i.e., it exceeds the minimum MSE matrix by $\sigma^2 C^\top\bigl(\Psi^{-1} - \Psi^{-1}X(X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}\bigr)C$. This is nnd because the matrix in parentheses is $M = M\Psi M$, refer here to Problem 265.

CHAPTER 21
Updating of Estimates When More Observations become Available

The theory of the linear model often deals with pairs of models which are nested in each other, one model either having more data or more stringent parameter restrictions than the other. We will discuss such nested models in three forms: in the remainder of the present chapter 21 we will see how estimates must be updated when more observations become available, in chapter 22 how the imposition of a linear constraint affects the parameter estimates, and in chapter 23 what happens if one adds more regressors.

Assume you have already computed the BLUE $\hat\beta$ on the basis of the observations $y = X\beta + \varepsilon$, and afterwards additional data $y_0 = X_0\beta + \varepsilon_0$ become available. Then $\hat\beta$ can be updated using the following principles:

Before the new observations became available, the information given in the original dataset not only allowed us to estimate $\beta$ by $\hat\beta$, but also yielded a prediction $y_0^* = X_0\hat\beta$ of the additional data. The estimation error $\hat\beta - \beta$ and the prediction error $y_0^* - y_0$ are unobserved, but we know their expected values (the zero vectors), and we also know their joint covariance matrix up to the unknown factor $\sigma^2$. After the additional data have become available, we can compute the actual value of the prediction error $y_0^* - y_0$. This allows us to also get a better idea of the actual value of the estimation error, and therefore we can get a better estimator of $\beta$. The following steps are involved:

(1) Make the best prediction $y_0^*$ of the new data $y_0$ based on $y$.
(2) Compute the joint covariance matrix of the prediction error $y_0^* - y_0$ of the new data by the old (which is observed) and the sampling error in the old regression $\hat\beta - \beta$ (which is unobserved).
(3) Use the formula for best linear prediction (??) to get a predictor $z^*$ of $\hat\beta - \beta$.
(4) Then $\hat{\hat\beta} = \hat\beta - z^*$ is the BLUE of $\beta$ based on the joint observations $y$ and $y_0$.
(5) The sum of squared errors of the updated model minus that of the basic model is the standardized prediction error $SSE^* - SSE = (y_0^* - y_0)^\top\Omega^{-1}(y_0^* - y_0)$, where $SSE^* = (y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta})$ and $V[y_0^* - y_0] = \sigma^2\Omega$.

In the case of one additional observation and spherical covariance matrix, this procedure yields the following formulas:

Problem 278. Assume $\hat\beta$ is the BLUE on the basis of the observation $y = X\beta + \varepsilon$, and a new observation $y_0 = x_0^\top\beta + \varepsilon_0$ becomes available. Show that the updated estimator has the form
\[ \hat{\hat\beta} = \hat\beta + (X^\top X)^{-1}x_0\,\frac{y_0 - x_0^\top\hat\beta}{1 + x_0^\top(X^\top X)^{-1}x_0}. \tag{21.0.25} \]

Answer. Set it up as follows:
\[ \begin{bmatrix} y_0 - x_0^\top\hat\beta\\ \beta - \hat\beta\end{bmatrix} \sim \left(\begin{bmatrix}0\\ o\end{bmatrix},\ \sigma^2\begin{bmatrix} x_0^\top(X^\top X)^{-1}x_0 + 1 & x_0^\top(X^\top X)^{-1}\\ (X^\top X)^{-1}x_0 & (X^\top X)^{-1}\end{bmatrix}\right) \tag{21.0.26} \]
and use (20.1.18). By the way, if the covariance matrix is not spherical but is $\begin{bmatrix}\Psi & c\\ c^\top & \psi_0\end{bmatrix}$ we get from (20.3.6)
\[ y_0^* = x_0^\top\hat\beta + c^\top\Psi^{-1}(y - X\hat\beta) \tag{21.0.27} \]
and from (20.3.15)
\[ \begin{bmatrix} y_0 - y_0^*\\ \beta - \hat\beta\end{bmatrix} \sim \left(\begin{bmatrix}0\\ o\end{bmatrix},\ \sigma^2\begin{bmatrix}\psi_0 - c^\top\Psi^{-1}c + (x_0^\top - c^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}(x_0 - X^\top\Psi^{-1}c) & (x_0^\top - c^\top\Psi^{-1}X)(X^\top\Psi^{-1}X)^{-1}\\ (X^\top\Psi^{-1}X)^{-1}(x_0 - X^\top\Psi^{-1}c) & (X^\top\Psi^{-1}X)^{-1}\end{bmatrix}\right). \tag{21.0.28} \]

• a. Show that the residual $\hat{\hat\varepsilon}_0$ from the full regression is the following nonrandom multiple of the "predictive" residual $y_0 - x_0^\top\hat\beta$:
\[ \hat{\hat\varepsilon}_0 = y_0 - x_0^\top\hat{\hat\beta} = \frac{1}{1 + x_0^\top(X^\top X)^{-1}x_0}\,(y_0 - x_0^\top\hat\beta). \tag{21.0.29} \]
Interestingly, this is the predictive residual divided by its relative variance (to standardize it one would have to divide it by its relative standard deviation). Compare this with (24.2.9).

Answer.
(21.0.29) can either be derived from (21.0.25), or from the following alternative application of the updating principle: All the information which the old observations have for the estimate of $x_0^\top\beta$ is contained in $\hat y_0 = x_0^\top\hat\beta$. The information which the updated regression, which includes the additional observation, has about $x_0^\top\beta$ can therefore be represented by the following two "observations":
\[ \begin{bmatrix}\hat y_0\\ y_0\end{bmatrix} = \begin{bmatrix}1\\ 1\end{bmatrix} x_0^\top\beta + \begin{bmatrix}\delta_1\\ \delta_2\end{bmatrix}, \qquad \begin{bmatrix}\delta_1\\ \delta_2\end{bmatrix} \sim \left(\begin{bmatrix}0\\ 0\end{bmatrix},\ \sigma^2\begin{bmatrix} x_0^\top(X^\top X)^{-1}x_0 & 0\\ 0 & 1\end{bmatrix}\right). \tag{21.0.30} \]
This is a regression model with two observations and one unknown parameter, $x_0^\top\beta$, which has a nonspherical error covariance matrix. The formula for the BLUE of $x_0^\top\beta$ in model (21.0.30) is
\[ \hat{\hat y}_0 = \left(\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix} x_0^\top(X^\top X)^{-1}x_0 & 0\\ 0 & 1\end{bmatrix}^{-1}\begin{bmatrix}1\\ 1\end{bmatrix}\right)^{-1} \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix} x_0^\top(X^\top X)^{-1}x_0 & 0\\ 0 & 1\end{bmatrix}^{-1}\begin{bmatrix}\hat y_0\\ y_0\end{bmatrix} \tag{21.0.31} \]
\[ = \frac{1}{1 + \dfrac{1}{x_0^\top(X^\top X)^{-1}x_0}}\left(\frac{\hat y_0}{x_0^\top(X^\top X)^{-1}x_0} + y_0\right) \tag{21.0.32} \]
\[ = \frac{1}{1 + x_0^\top(X^\top X)^{-1}x_0}\bigl(\hat y_0 + x_0^\top(X^\top X)^{-1}x_0\, y_0\bigr). \tag{21.0.33} \]
Now subtract (21.0.33) from $y_0$ to get (21.0.29).

Using (21.0.29), one can write (21.0.25) as
\[ \hat{\hat\beta} = \hat\beta + (X^\top X)^{-1}x_0\,\hat{\hat\varepsilon}_0. \tag{21.0.34} \]
Later, in (25.4.1), one will see that it can also be written in the form
\[ \hat{\hat\beta} = \hat\beta + (Z^\top Z)^{-1}x_0\,(y_0 - x_0^\top\hat\beta) \tag{21.0.35} \]
where $Z = \begin{bmatrix}X\\ x_0^\top\end{bmatrix}$.

Problem 279. Show the following fact which is point (5) in the above updating principle in this special case: If one takes the squares of the standardized predictive residuals, one gets the difference of the SSE for the regression with and without the additional observation $y_0$:
\[ SSE^* - SSE = \frac{(y_0 - x_0^\top\hat\beta)^2}{1 + x_0^\top(X^\top X)^{-1}x_0}. \tag{21.0.36} \]

Answer. The sum of squared errors in the old regression is $SSE = (y - X\hat\beta)^\top(y - X\hat\beta)$; the sum of squared errors in the updated regression is $SSE^* = (y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta}) + \hat{\hat\varepsilon}_0^2$. From (21.0.34) follows
\[ y - X\hat{\hat\beta} = y - X\hat\beta - X(X^\top X)^{-1}x_0\,\hat{\hat\varepsilon}_0. \tag{21.0.37} \]
If one squares this, the cross product terms fall away: $(y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta}) = (y - X\hat\beta)^\top(y - X\hat\beta) + \hat{\hat\varepsilon}_0\, x_0^\top(X^\top X)^{-1}x_0\,\hat{\hat\varepsilon}_0$. Adding $\hat{\hat\varepsilon}_0^2$ to both sides gives $SSE^* = SSE + \hat{\hat\varepsilon}_0^2\,(1 + x_0^\top(X^\top X)^{-1}x_0)$. Now use (21.0.29) to get (21.0.36).

CHAPTER 22
Constrained Least Squares

One of the assumptions for the linear model was that nothing is known about the true value of $\beta$. Any $k$-vector $\gamma$ is a possible candidate for the value of $\beta$. We used this assumption e.g. when we concluded that an unbiased estimator $\tilde B y$ of $\beta$ must satisfy $\tilde B X = I$. Now we will modify this assumption and assume we know that the true value $\beta$ satisfies the linear constraint $R\beta = u$. To fix notation, let $y$ be an $n \times 1$ vector, $u$ an $i \times 1$ vector, $X$ an $n \times k$ matrix, and $R$ an $i \times k$ matrix. In addition to our usual assumption that all columns of $X$ are linearly independent (i.e., $X$ has full column rank) we will also make the assumption that all rows of $R$ are linearly independent (which is called: $R$ has full row rank). In other words, the matrix of constraints $R$ does not include "redundant" constraints which are linear combinations of the other constraints.

22.1. Building the Constraint into the Model

Problem 280. Given a regression with a constant term and two explanatory variables which we will call $x$ and $z$, i.e.,
\[ y_t = \alpha + \beta x_t + \gamma z_t + \varepsilon_t \tag{22.1.1} \]

• a. 1 point How will you estimate $\beta$ and $\gamma$ if it is known that $\beta = \gamma$?

Answer. Write
\[ y_t = \alpha + \beta(x_t + z_t) + \varepsilon_t \tag{22.1.2} \]

• b. 1 point How will you estimate $\beta$ and $\gamma$ if it is known that $\beta + \gamma = 1$?

Answer. Setting $\gamma = 1 - \beta$ gives the regression
\[ y_t - z_t = \alpha + \beta(x_t - z_t) + \varepsilon_t \tag{22.1.3} \]

• c. 3 points Go back to a. If you add the original $z$ as an additional regressor into the modified regression incorporating the constraint $\beta = \gamma$, then the coefficient of $z$ is no longer an estimate of the original $\gamma$, but of a new parameter $\delta$ which is a linear combination of $\alpha$, $\beta$, and $\gamma$. Compute this linear combination, i.e., express $\delta$ in terms of $\alpha$, $\beta$, and $\gamma$. Remark (no proof required): this regression is equivalent to (22.1.1), and it allows you to test the constraint.

Answer.
If you add $z$ as additional regressor into (22.1.2), you get $y_t = \alpha + \beta(x_t + z_t) + \delta z_t + \varepsilon_t$. Now substitute the right hand side from (22.1.1) for $y_t$ to get $\alpha + \beta x_t + \gamma z_t + \varepsilon_t = \alpha + \beta(x_t + z_t) + \delta z_t + \varepsilon_t$. Cancelling out gives $\gamma z_t = \beta z_t + \delta z_t$, in other words, $\gamma = \beta + \delta$, i.e., $\delta = \gamma - \beta$. In this regression, therefore, the coefficient of $z$ is split into the sum of two terms: the first term is the value it should have if the constraint were satisfied, and the other term is the difference from that.

• d. 2 points Now do the same thing with the modified regression from part b which incorporates the constraint $\beta + \gamma = 1$: include the original $z$ as an additional regressor and determine the meaning of the coefficient of $z$.

What Problem 280 suggests is true in general: every constrained Least Squares problem can be reduced to an equivalent unconstrained Least Squares problem with fewer explanatory variables. Indeed, one can consider every least squares problem to be "constrained" because the assumption $E[y] = X\beta$ for some $\beta$ is equivalent to a linear constraint on $E[y]$. The decision not to include certain explanatory variables in the regression can be considered the decision to set certain elements of $\beta$ to zero, which is the imposition of a constraint. If one writes a certain regression model as a constrained version of some other regression model, this simply means that one is interested in the relationship between two nested regressions.

Problem 219 is another example here.

22.2. Conversion of an Arbitrary Constraint into a Zero Constraint

This section, which is nothing but the matrix version of Problem 280, follows [DM93, pp. 16–19]. By reordering the elements of $\beta$ one can write the constraint $R\beta = u$ in the form
\[ \begin{bmatrix}R_1 & R_2\end{bmatrix}\begin{bmatrix}\beta_1\\ \beta_2\end{bmatrix} \equiv R_1\beta_1 + R_2\beta_2 = u \tag{22.2.1} \]
where $R_1$ is a nonsingular $i \times i$ matrix. Why can that be done? The rank of $R$ is $i$, i.e., all the rows are linearly independent.
Since row rank is equal to column rank, there are also $i$ linearly independent columns. Use those for $R_1$. Using this same partition, the original regression can be written
\[ y = X_1\beta_1 + X_2\beta_2 + \varepsilon \tag{22.2.2} \]
Now one can solve (22.2.1) for $\beta_1$ to get
\[ \beta_1 = R_1^{-1}u - R_1^{-1}R_2\beta_2 \tag{22.2.3} \]
Plug (22.2.3) into (22.2.2) and rearrange to get a regression which is equivalent to the constrained regression:
\[ y - X_1R_1^{-1}u = (X_2 - X_1R_1^{-1}R_2)\beta_2 + \varepsilon \tag{22.2.4} \]
or
\[ y_* = Z_2\beta_2 + \varepsilon \tag{22.2.5} \]
One more thing is noteworthy here: if we add $X_1$ as additional regressors into (22.2.5), we get a regression that is equivalent to (22.2.2). To see this, define the difference between the left hand side and right hand side of (22.2.3) as $\gamma_1 = \beta_1 - R_1^{-1}u + R_1^{-1}R_2\beta_2$; then the constraint (22.2.1) is equivalent to the "zero constraint" $\gamma_1 = o$, and the regression
\[ y - X_1R_1^{-1}u = (X_2 - X_1R_1^{-1}R_2)\beta_2 + X_1(\beta_1 - R_1^{-1}u + R_1^{-1}R_2\beta_2) + \varepsilon \tag{22.2.6} \]
is equivalent to the original regression (22.2.2). (22.2.6) can also be written as
\[ y_* = Z_2\beta_2 + X_1\gamma_1 + \varepsilon \tag{22.2.7} \]
The coefficient of $X_1$, if it is added back into (22.2.5), is therefore $\gamma_1$.

Problem 281. [DM93] assert on p. 17, middle, that
\[ \mathcal{R}[X_1, Z_2] = \mathcal{R}[X_1, X_2], \tag{22.2.8} \]
where $Z_2 = X_2 - X_1R_1^{-1}R_2$. Give a proof.

Answer. We have to show
\[ \{z : z = X_1\gamma + X_2\delta\} = \{z : z = X_1\alpha + Z_2\beta\} \tag{22.2.9} \]
First $\subset$: given $\gamma$ and $\delta$ we need an $\alpha$ and $\beta$ with
\[ X_1\gamma + X_2\delta = X_1\alpha + (X_2 - X_1R_1^{-1}R_2)\beta \tag{22.2.10} \]
This can be accomplished with $\beta = \delta$ and $\alpha = \gamma + R_1^{-1}R_2\delta$. The other side is even more trivial: given $\alpha$ and $\beta$, multiplying out the right side of (22.2.10) gives $X_1\alpha + X_2\beta - X_1R_1^{-1}R_2\beta$, i.e., $\delta = \beta$ and $\gamma = \alpha - R_1^{-1}R_2\beta$.

22.3. Lagrange Approach to Constrained Least Squares

The constrained least squares estimator is that $k \times 1$ vector $\beta = \hat{\hat\beta}$ which minimizes $SSE = (y - X\beta)^\top(y - X\beta)$ subject to the linear constraint $R\beta = u$.
Again, we assume that $X$ has full column and $R$ full row rank.

The Lagrange approach to constrained least squares, which we follow here, is given in [Gre97, Section 7.3 on pp. 341/2], also [DM93, pp. 90/1]:

The Constrained Least Squares problem can be solved with the help of the "Lagrange function," which is a function of the $k \times 1$ vector $\beta$ and an additional $i \times 1$ vector $\lambda$ of "Lagrange multipliers":
\[ L(\beta, \lambda) = (y - X\beta)^\top(y - X\beta) + (R\beta - u)^\top\lambda \tag{22.3.1} \]
$\lambda$ can be considered a vector of "penalties" for violating the constraint. For every possible value of $\lambda$ one computes that $\beta = \tilde\beta$ which minimizes $L$ for that $\lambda$. (This is an unconstrained minimization problem.) It will turn out that for one of the values $\lambda = \lambda^*$, the corresponding $\beta = \hat{\hat\beta}$ satisfies the constraint. This $\hat{\hat\beta}$ is the solution of the constrained minimization problem we are looking for.

Problem 282. 4 points Show the following: If $\beta = \hat{\hat\beta}$ is the unconstrained minimum argument of the Lagrange function
\[ L(\beta, \lambda^*) = (y - X\beta)^\top(y - X\beta) + (R\beta - u)^\top\lambda^* \tag{22.3.2} \]
for some fixed value $\lambda^*$, and if at the same time $\hat{\hat\beta}$ satisfies $R\hat{\hat\beta} = u$, then $\beta = \hat{\hat\beta}$ minimizes $(y - X\beta)^\top(y - X\beta)$ subject to the constraint $R\beta = u$.

Answer. Since $\hat{\hat\beta}$ minimizes the Lagrange function, we know that
\[ (y - X\tilde\beta)^\top(y - X\tilde\beta) + (R\tilde\beta - u)^\top\lambda^* \ \ge\ (y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta}) + (R\hat{\hat\beta} - u)^\top\lambda^* \tag{22.3.3} \]
for all $\tilde\beta$. Since by assumption, $\hat{\hat\beta}$ also satisfies the constraint, this simplifies to:
\[ (y - X\tilde\beta)^\top(y - X\tilde\beta) + (R\tilde\beta - u)^\top\lambda^* \ \ge\ (y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta}). \tag{22.3.4} \]
This is still true for all $\tilde\beta$. If we only look at those $\tilde\beta$ which satisfy the constraint, we get
\[ (y - X\tilde\beta)^\top(y - X\tilde\beta) \ \ge\ (y - X\hat{\hat\beta})^\top(y - X\hat{\hat\beta}). \tag{22.3.5} \]
This means, $\hat{\hat\beta}$ is the constrained minimum argument.

Instead of imposing the constraint itself, one imposes a penalty function which has such a form that the agents will "voluntarily" heed the constraint.
This is a familiar principle in neoclassical economics: instead of restricting pollution to a certain level, tax the polluters so much that they will voluntarily stay within the desired level.

The proof which follows now not only derives the formula for $\hat{\hat\beta}$ but also shows that there is always a $\lambda^*$ for which $\hat{\hat\beta}$ satisfies $R\hat{\hat\beta} = u$.

Problem 283. 2 points Use the simple matrix differentiation rules $\partial(w^\top\beta)/\partial\beta^\top = w^\top$ and $\partial(\beta^\top M\beta)/\partial\beta^\top = 2\beta^\top M$ (for symmetric $M$) to compute $\partial L/\partial\beta^\top$ where
\[ L(\beta) = (y - X\beta)^\top(y - X\beta) + (R\beta - u)^\top\lambda \tag{22.3.6} \]

Answer. Write the objective function as $y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta + \lambda^\top R\beta - \lambda^\top u$ to get (22.3.7).

Our goal is to find a $\hat{\hat\beta}$ and a $\lambda^*$ so that (a) $\beta = \hat{\hat\beta}$ minimizes $L(\beta, \lambda^*)$ and (b) $R\hat{\hat\beta} = u$. In other words, $\hat{\hat\beta}$ and $\lambda^*$ together satisfy the following two conditions: (a) they must satisfy the first order condition for the unconstrained minimization of $L$ with respect to $\beta$, i.e., $\hat{\hat\beta}$ must annul
\[ \partial L/\partial\beta^\top = -2y^\top X + 2\beta^\top X^\top X + \lambda^{*\top}R, \tag{22.3.7} \]
and (b) $\hat{\hat\beta}$ must satisfy the constraint (22.3.9).

(22.3.7) and (22.3.9) are two linear matrix equations which can indeed be solved for $\hat{\hat\beta}$ and $\lambda^*$. I wrote (22.3.7) as a row vector, because the Jacobian of a scalar function is a row vector, but it is usually written as a column vector. Since this conventional notation is arithmetically a little simpler here, we will replace (22.3.7) with its transpose (22.3.8). Our starting point is therefore
\[ 2X^\top X\hat{\hat\beta} = 2X^\top y - R^\top\lambda^* \tag{22.3.8} \]
\[ R\hat{\hat\beta} - u = o \tag{22.3.9} \]
Some textbook treatments have an extra factor 2 in front of $\lambda^*$, which makes the math slightly smoother, but which has the disadvantage that the Lagrange multiplier can no longer be interpreted as the "shadow price" for violating the constraint.

Solve (22.3.8) for $\hat{\hat\beta}$ to get that $\hat{\hat\beta}$ which minimizes $L$ for any given $\lambda^*$:
\[ \hat{\hat\beta} = (X^\top X)^{-1}X^\top y - \tfrac{1}{2}(X^\top X)^{-1}R^\top\lambda^* = \hat\beta - \tfrac{1}{2}(X^\top X)^{-1}R^\top\lambda^* \tag{22.3.10} \]
Here $\hat\beta$ on the right hand side is the unconstrained OLS estimate.
Plug this formula for $\hat{\hat\beta}$ into (22.3.9) in order to determine that value of $\lambda^*$ for which the corresponding $\hat{\hat\beta}$ satisfies the constraint:
\[ R\hat\beta - \tfrac{1}{2}R(X^\top X)^{-1}R^\top\lambda^* - u = o. \tag{22.3.11} \]
Since $R$ has full row rank and $X$ full column rank, $R(X^\top X)^{-1}R^\top$ has an inverse (Problem 284). Therefore one can solve for $\lambda^*$:
\[ \lambda^* = 2\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u) \tag{22.3.12} \]
If one substitutes this $\lambda^*$ back into (22.3.10), one gets the formula for the constrained least squares estimator:
\[ \hat{\hat\beta} = \hat\beta - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u). \tag{22.3.13} \]

Problem 284. If $R$ has full row rank and $X$ full column rank, show that $R(X^\top X)^{-1}R^\top$ has an inverse.

Answer. Since it is nonnegative definite we have to show that it is positive definite. $b^\top R(X^\top X)^{-1}R^\top b = 0$ implies $b^\top R = o^\top$ because $(X^\top X)^{-1}$ is positive definite, and this implies $b = o$ because $R$ has full row rank.

Problem 285. Assume $\varepsilon \sim (o, \sigma^2\Psi)$ with a nonsingular $\Psi$ and show: If one minimizes $SSE = (y - X\beta)^\top\Psi^{-1}(y - X\beta)$ subject to the linear constraint $R\beta = u$, the formula for the minimum argument $\hat{\hat\beta}$ is the following modification of (22.3.13):
\[ \hat{\hat\beta} = \hat\beta - (X^\top\Psi^{-1}X)^{-1}R^\top\bigl(R(X^\top\Psi^{-1}X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u) \tag{22.3.14} \]
where $\hat\beta = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y$. This formula is given in [JHG+88, (11.2.38) on p. 457]. (Remark, which you are not asked to prove: this is the best linear unbiased estimator if $\varepsilon \sim (o, \sigma^2\Psi)$ among all linear estimators which are unbiased whenever the true $\beta$ satisfies the constraint $R\beta = u$.)

Answer. The Lagrange function is
\[ L(\beta, \lambda) = (y - X\beta)^\top\Psi^{-1}(y - X\beta) + (R\beta - u)^\top\lambda = y^\top\Psi^{-1}y - 2y^\top\Psi^{-1}X\beta + \beta^\top X^\top\Psi^{-1}X\beta + \lambda^\top R\beta - \lambda^\top u. \]
The Jacobian is
\[ \partial L/\partial\beta^\top = -2y^\top\Psi^{-1}X + 2\beta^\top X^\top\Psi^{-1}X + \lambda^\top R. \]
Transposing and setting it zero gives
\[ 2X^\top\Psi^{-1}X\hat{\hat\beta} = 2X^\top\Psi^{-1}y - R^\top\lambda^* \tag{22.3.15} \]
Solve (22.3.15) for $\hat{\hat\beta}$:
\[ \hat{\hat\beta} = (X^\top\Psi^{-1}X)^{-1}X^\top\Psi^{-1}y - \tfrac{1}{2}(X^\top\Psi^{-1}X)^{-1}R^\top\lambda^* = \hat\beta - \tfrac{1}{2}(X^\top\Psi^{-1}X)^{-1}R^\top\lambda^* \tag{22.3.16} \]
Here $\hat\beta$ is the unconstrained GLS estimate. Plug $\hat{\hat\beta}$ into the constraint (22.3.9):
\[ R\hat\beta - \tfrac{1}{2}R(X^\top\Psi^{-1}X)^{-1}R^\top\lambda^* - u = o. \tag{22.3.17} \]
Since $R$ has full row rank and $X$ full column rank and $\Psi$ is nonsingular, $R(X^\top\Psi^{-1}X)^{-1}R^\top$ still has an inverse. Therefore
\[ \lambda^* = 2\bigl(R(X^\top\Psi^{-1}X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u) \tag{22.3.18} \]
Now substitute this $\lambda^*$ back into (22.3.16):
\[ \hat{\hat\beta} = \hat\beta - (X^\top\Psi^{-1}X)^{-1}R^\top\bigl(R(X^\top\Psi^{-1}X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u). \tag{22.3.19} \]

22.4. Constrained Least Squares as the Nesting of Two Simpler Models

The imposition of a constraint can also be considered the addition of new information: a certain linear transformation of $\beta$, namely, $R\beta$, is observed without error.

Problem 286. Assume the random $\beta \sim (\hat\beta, \sigma^2(X^\top X)^{-1})$ is unobserved, but one observes $R\beta = u$.

• a. 2 points Compute the best linear predictor of $\beta$ on the basis of the observation $u$. Hint: First write down the joint means and covariance matrix of $u$ and $\beta$.

Answer.
\[ \begin{bmatrix}u\\ \beta\end{bmatrix} \sim \left(\begin{bmatrix}R\hat\beta\\ \hat\beta\end{bmatrix},\ \sigma^2\begin{bmatrix}R(X^\top X)^{-1}R^\top & R(X^\top X)^{-1}\\ (X^\top X)^{-1}R^\top & (X^\top X)^{-1}\end{bmatrix}\right). \tag{22.4.1} \]
Therefore application of formula (??) gives
\[ \beta^* = \hat\beta + (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(u - R\hat\beta). \tag{22.4.2} \]

• b. 1 point Look at the formula for the predictor you just derived. Have you seen this formula before? Describe the situation in which this formula is valid as a BLUE-formula, and compare the situation with the situation here.

Answer. Of course, constrained least squares. But in constrained least squares, $\beta$ is nonrandom and $\hat\beta$ is random, while here it is the other way round.

In the unconstrained OLS model, i.e., before the "observation" of $u = R\beta$, the best bounded MSE estimators of $u$ and $\beta$ are $R\hat\beta$ and $\hat\beta$, with the sampling errors having the following means and variances:
\[ \begin{bmatrix}u - R\hat\beta\\ \beta - \hat\beta\end{bmatrix} \sim \left(\begin{bmatrix}o\\ o\end{bmatrix},\ \sigma^2\begin{bmatrix}R(X^\top X)^{-1}R^\top & R(X^\top X)^{-1}\\ (X^\top X)^{-1}R^\top & (X^\top X)^{-1}\end{bmatrix}\right) \tag{22.4.3} \]
After the observation of $u$ we can therefore apply (20.1.18) to get exactly equation (22.3.13) for $\hat{\hat\beta}$.
This is probably the easiest way to derive this equation, but it derives constrained least squares by the minimization of the MSE-matrix, not by the least squares problem.

22.5. Solution by Quadratic Decomposition

An alternative purely algebraic solution method for this constrained minimization problem rewrites the OLS objective function in such a way that one sees immediately what the constrained minimum value is.

Start with the decomposition (14.2.12) which can be used to show optimality of the OLS estimate:
\[ (y - X\beta)^\top(y - X\beta) = (y - X\hat\beta)^\top(y - X\hat\beta) + (\beta - \hat\beta)^\top X^\top X(\beta - \hat\beta). \]
Split the second term again, using $\hat\beta - \hat{\hat\beta} = (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u)$:
\[ (\beta - \hat\beta)^\top X^\top X(\beta - \hat\beta) = \bigl(\beta - \hat{\hat\beta} - (\hat\beta - \hat{\hat\beta})\bigr)^\top X^\top X\bigl(\beta - \hat{\hat\beta} - (\hat\beta - \hat{\hat\beta})\bigr) \]
\[ = (\beta - \hat{\hat\beta})^\top X^\top X(\beta - \hat{\hat\beta}) - 2(\beta - \hat{\hat\beta})^\top X^\top X(X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u) + (\hat\beta - \hat{\hat\beta})^\top X^\top X(\hat\beta - \hat{\hat\beta}). \]
The cross product terms can be simplified to $-2(R\beta - u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u)$, and the last term is $(R\hat\beta - u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u)$. Therefore the objective function for an arbitrary $\beta$ can be written as
\[ (y - X\beta)^\top(y - X\beta) = (y - X\hat\beta)^\top(y - X\hat\beta) + (\beta - \hat{\hat\beta})^\top X^\top X(\beta - \hat{\hat\beta}) - 2(R\beta - u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u) + (R\hat\beta - u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u). \]
The first and last terms do not depend on $\beta$ at all; the third term is zero whenever $\beta$ satisfies $R\beta = u$; and the second term is minimized if and only if $\beta = \hat{\hat\beta}$, in which case it also takes the value zero.

22.6. Sampling Properties of Constrained Least Squares

Again, this variant of the least squares principle leads to estimators with desirable sampling properties. Note that $\hat{\hat\beta}$ is an affine function of $y$. We will compute $E[\hat{\hat\beta} - \beta]$ and $\operatorname{MSE}[\hat{\hat\beta}; \beta]$ not only in the case that the true $\beta$ satisfies $R\beta = u$, but also in the case that it does not.
For this, let us first get a suitable representation of the sampling error:
\[ \hat{\hat\beta} - \beta = (\hat\beta - \beta) + (\hat{\hat\beta} - \hat\beta) = (\hat\beta - \beta) - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}R(\hat\beta - \beta) - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\beta - u). \tag{22.6.1} \]
The last term is zero if $\beta$ satisfies the constraint. Now use (18.0.7) twice to get
\[ \hat{\hat\beta} - \beta = WX^\top\varepsilon - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\beta - u) \tag{22.6.2} \]
where
\[ W = (X^\top X)^{-1} - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}R(X^\top X)^{-1}. \tag{22.6.3} \]
If $\beta$ satisfies the constraint, (22.6.2) simplifies to $\hat{\hat\beta} - \beta = WX^\top\varepsilon$. In this case, therefore, $\hat{\hat\beta}$ is unbiased and $\operatorname{MSE}[\hat{\hat\beta}; \beta] = \sigma^2 W$ (Problem 287). Since $(X^\top X)^{-1} - W$ is nonnegative definite, $\operatorname{MSE}[\hat{\hat\beta}; \beta]$ is smaller than $\operatorname{MSE}[\hat\beta; \beta]$ by a nonnegative definite matrix. This should be expected, since $\hat{\hat\beta}$ uses more information than $\hat\beta$.

Problem 287.

• a. Show that $WX^\top XW = W$ (i.e., $X^\top X$ is a g-inverse of $W$).

Answer. This is a tedious matrix multiplication.

• b. Use this to show that $\operatorname{MSE}[\hat{\hat\beta}; \beta] = \sigma^2 W$.

(Without proof:) The Gauss-Markov theorem can be extended here as follows: the constrained least squares estimator is the best linear unbiased estimator among all linear (or, more precisely, affine) estimators which are unbiased whenever the true $\beta$ satisfies the constraint $R\beta = u$. Note that there are more estimators which are unbiased whenever the true $\beta$ satisfies the constraint than there are estimators which are unbiased for all $\beta$.

If $R\beta \ne u$, then $\hat{\hat\beta}$ is biased. Its bias is
\[ E[\hat{\hat\beta} - \beta] = -(X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\beta - u). \tag{22.6.4} \]
Due to the decomposition (17.1.2) of the MSE matrix into dispersion matrix plus squared bias, it follows
\[ \operatorname{MSE}[\hat{\hat\beta}; \beta] = \sigma^2 W + (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\beta - u)(R\beta - u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}R(X^\top X)^{-1}. \tag{22.6.5} \]
Even if the true parameter does not satisfy the constraint, it is still possible that the constrained least squares estimator has a better MSE matrix than the unconstrained one.
This is the case if and only if the true parameter values $\beta$ and $\sigma^2$ satisfy
\[
(R\beta - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\beta - u) \le \sigma^2. \tag{22.6.6}
\]
This equation, which is the same as [Gre97, (8-27) on p. 406], is an interesting result, because the obvious estimate of the lefthand side in (22.6.6) is $i$ times the value of the $F$-test statistic for the hypothesis $R\beta = u$. To test for this, one has to use the noncentral $F$-test with parameters $i$, $n - k$, and $1/2$.

Problem 288. 2 points This Problem motivates Equation (22.6.6). If $\hat{\hat\beta}$ is a better estimator of $\beta$ than $\hat\beta$, then $R\hat{\hat\beta} = u$ is also a better estimator of $R\beta$ than $R\hat\beta$. Show that this latter condition is not only necessary but already sufficient, i.e., if $MSE[R\hat\beta; R\beta] - MSE[u; R\beta]$ is nonnegative definite then $\beta$ and $\sigma^2$ satisfy (22.6.6). You are allowed to use, without proof, theorem A.5.9 in the mathematical Appendix.

Answer. We have to show that (22.6.6) follows if
\[
\sigma^2 R (X^\top X)^{-1} R^\top - (R\beta - u)(R\beta - u)^\top \tag{22.6.7}
\]
is nonnegative definite. Since $\Omega = \sigma^2 R (X^\top X)^{-1} R^\top$ has an inverse, theorem A.5.9 immediately leads to (22.6.6).

22.7. Estimation of the Variance in Constrained OLS

Next we will compute the expected value of the minimum value of the constrained OLS objective function, i.e., $E[\hat{\hat\varepsilon}^\top \hat{\hat\varepsilon}]$ where $\hat{\hat\varepsilon} = y - X\hat{\hat\beta}$, again without necessarily making the assumption that $R\beta = u$:
\[
\hat{\hat\varepsilon} = y - X\hat{\hat\beta} = \hat\varepsilon + X (X^\top X)^{-1} R^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta - u). \tag{22.7.1}
\]
Since $X^\top \hat\varepsilon = o$, it follows
\[
\hat{\hat\varepsilon}^\top \hat{\hat\varepsilon} = \hat\varepsilon^\top \hat\varepsilon + (R\hat\beta - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta - u). \tag{22.7.2}
\]
Now note that $E[R\hat\beta - u] = R\beta - u$ and $V[R\hat\beta - u] = \sigma^2 R (X^\top X)^{-1} R^\top$. Therefore use (??) in theorem ?? and $\operatorname{tr}\Bigl(R (X^\top X)^{-1} R^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1}\Bigr) = i$ to get
\[
E\Bigl[(R\hat\beta - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta - u)\Bigr] = \sigma^2 i + (R\beta - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\beta - u). \tag{22.7.3}
\]
Since $E[\hat\varepsilon^\top \hat\varepsilon] = \sigma^2 (n - k)$, it follows
\[
E[\hat{\hat\varepsilon}^\top \hat{\hat\varepsilon}] = \sigma^2 (n + i - k) + (R\beta - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\beta - u). \tag{22.7.4}
\]
In other words, $\hat{\hat\varepsilon}^\top \hat{\hat\varepsilon} / (n + i - k)$ is an unbiased estimator of $\sigma^2$ if the constraint holds, and it is biased upwards if the constraint does not hold. The adjustment of the degrees of freedom is what one should expect: a regression with $k$ explanatory variables and $i$ constraints can always be rewritten as a regression with $k - i$ different explanatory variables (see Section 22.2), and the distribution of the SSE does not depend on the values taken by the explanatory variables at all, only on how many there are. The unbiased estimate of $\sigma^2$ is therefore
\[
\hat{\hat\sigma}^2 = \hat{\hat\varepsilon}^\top \hat{\hat\varepsilon} / (n + i - k). \tag{22.7.5}
\]
Here is some geometric intuition: $y = X\hat\beta + \hat\varepsilon$ is an orthogonal decomposition, since $\hat\varepsilon$ is orthogonal to all columns of $X$. From orthogonality follows $y^\top y = \hat\beta^\top X^\top X \hat\beta + \hat\varepsilon^\top \hat\varepsilon$. If one splits up $y = X\hat{\hat\beta} + \hat{\hat\varepsilon}$, one should expect this to be orthogonal as well. But this is only the case if $u = o$. If $u \neq o$, one first has to shift the origin of the coordinate system to a point which can be written in the form $X\beta_0$ where $\beta_0$ satisfies the constraint:

Problem 289. 3 points Assume $\hat{\hat\beta}$ is the constrained least squares estimate, and $\beta_0$ is any vector satisfying $R\beta_0 = u$. Show that in the decomposition
\[
y - X\beta_0 = X(\hat{\hat\beta} - \beta_0) + \hat{\hat\varepsilon} \tag{22.7.6}
\]
the two vectors on the righthand side are orthogonal.

Answer. We have to show $(\hat{\hat\beta} - \beta_0)^\top X^\top \hat{\hat\varepsilon} = 0$. Since $\hat{\hat\varepsilon} = y - X\hat{\hat\beta} = y - X\hat\beta + X(\hat\beta - \hat{\hat\beta}) = \hat\varepsilon + X(\hat\beta - \hat{\hat\beta})$, and we already know that $X^\top \hat\varepsilon = o$, it is necessary and sufficient to show that $(\hat{\hat\beta} - \beta_0)^\top X^\top X (\hat\beta - \hat{\hat\beta}) = 0$. By (22.3.13),
\begin{align*}
(\hat{\hat\beta} - \beta_0)^\top X^\top X (\hat\beta - \hat{\hat\beta})
&= (\hat{\hat\beta} - \beta_0)^\top X^\top X (X^\top X)^{-1} R^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta - u) \\
&= (u - u)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta - u) = 0.
\end{align*}
If $u = o$, then one has two orthogonal decompositions: $y = \hat y + \hat\varepsilon$, and $y = \hat{\hat y} + \hat{\hat\varepsilon}$.
And if one connects the footpoints of these two orthogonal decompositions, one obtains an orthogonal decomposition into three parts:

Problem 290. Assume $\hat{\hat\beta}$ is the constrained least squares estimator subject to the constraint $R\beta = o$, and $\hat\beta$ is the unconstrained least squares estimator.

• a. 1 point With the usual notation $\hat y = X\hat\beta$ and $\hat{\hat y} = X\hat{\hat\beta}$, show that
\[
y = \hat{\hat y} + (\hat y - \hat{\hat y}) + \hat\varepsilon \tag{22.7.7}
\]
Point out these vectors in the reggeom simulation.

Answer. In the reggeom-simulation, $y$ is the purple line; $X\hat{\hat\beta}$ is the red line starting at the origin, one could also call it $\hat{\hat y}$; $X(\hat\beta - \hat{\hat\beta}) = \hat y - \hat{\hat y}$ is the light blue line, and $\hat\varepsilon$ is the green line which does not start at the origin. In other words: if one projects $y$ on a plane, and also on a line in that plane, and then connects the footpoints of these two projections, one obtains a zig-zag line with two right angles.

• b. 4 points Show that in (22.7.7) the three vectors $\hat{\hat y}$, $\hat y - \hat{\hat y}$, and $\hat\varepsilon$ are orthogonal. You are allowed to use, without proof, formula (22.3.13):

Answer. One has to verify that the scalar products of the three vectors on the right hand side of (22.7.7) are zero. $\hat{\hat y}^\top \hat\varepsilon = \hat{\hat\beta}^\top X^\top \hat\varepsilon = 0$ and $(\hat y - \hat{\hat y})^\top \hat\varepsilon = (\hat\beta - \hat{\hat\beta})^\top X^\top \hat\varepsilon = 0$ follow from $X^\top \hat\varepsilon = o$; geometrically one can simply say that $\hat y$ and $\hat{\hat y}$ are in the space spanned by the columns of $X$, and $\hat\varepsilon$ is orthogonal to that space. Finally, using (22.3.13) for $\hat\beta - \hat{\hat\beta}$,
\begin{align*}
\hat{\hat y}^\top (\hat y - \hat{\hat y}) &= \hat{\hat\beta}^\top X^\top X (\hat\beta - \hat{\hat\beta}) \\
&= \hat{\hat\beta}^\top X^\top X (X^\top X)^{-1} R^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} R\hat\beta \\
&= \hat{\hat\beta}^\top R^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} R\hat\beta = 0
\end{align*}
because $\hat{\hat\beta}$ satisfies the constraint $R\hat{\hat\beta} = o$, hence $\hat{\hat\beta}^\top R^\top = o^\top$.

Problem 291.
• a. 3 points In the model $y = \beta + \varepsilon$, where $y$ is a $n \times 1$ vector, and $\varepsilon \sim (o, \sigma^2 I)$, subject to the constraint $\iota^\top \beta = 0$, compute $\hat{\hat\beta}$, $\hat{\hat\varepsilon}$, and the unbiased estimate $\hat{\hat\sigma}^2$. Give general formulas and the numerical results for the case $y^\top = \begin{bmatrix} -1 & 0 & 1 & 2 \end{bmatrix}$.
All you need to do is evaluate the appropriate formulas and correctly count the number of degrees of freedom.

Answer. The unconstrained least squares estimate of $\beta$ is $\hat\beta = y$, and since $X = I$, $R = \iota^\top$, and $u = 0$, the constrained LSE has the form $\hat{\hat\beta} = y - \iota (\iota^\top \iota)^{-1} (\iota^\top y) = y - \iota \bar y$ by (22.3.13). If $y^\top = [-1, 0, 1, 2]$ this gives $\hat{\hat\beta}^\top = [-1.5, -0.5, 0.5, 1.5]$. The residuals in the constrained model are therefore $\hat{\hat\varepsilon} = \iota \bar y$, i.e., $\hat{\hat\varepsilon}^\top = [0.5, 0.5, 0.5, 0.5]$. Since one has $n$ observations, $n$ parameters and 1 constraint, the number of degrees of freedom is 1. Therefore $\hat{\hat\sigma}^2 = \hat{\hat\varepsilon}^\top \hat{\hat\varepsilon} / 1 = n \bar y^2$ which is $= 1$ in our case.

• b. 1 point Can you think of a practical situation in which this model might be appropriate?

Answer. This can occur if one measures data which theoretically add to zero, and the measurement errors are independent and have equal standard deviations.

• c. 2 points Check your results against a SAS printout (or do it in any other statistical package) with the data vector $y^\top = [-1\ 0\ 1\ 2]$. Here are the sas commands:

    data zeromean;
    input y x1 x2 x3 x4;
    cards;
    -1 1 0 0 0
     0 0 1 0 0
     1 0 0 1 0
     2 0 0 0 1
    ;
    proc reg;
    model y= x1 x2 x3 x4 / noint;
    restrict x1+x2+x3+x4=0;
    output out=zerout residual=ehat;
    run;
    proc print data=zerout;
    run;

Problem 292. Least squares estimates of the coefficients of a linear regression model often have signs that are regarded by the researcher to be ‘wrong’. In an effort to obtain the ‘right’ signs, the researcher may be tempted to drop statistically insignificant variables from the equation. [Lea75] showed that such attempts necessarily fail: there can be no change in sign of any coefficient which is more significant than the coefficient of the omitted variable. The present exercise shows this, using a different proof than Leamer’s. You will need the formula for the constrained least squares estimator subject to one linear constraint $r^\top \beta = u$, which is
\[
\hat{\hat\beta} = \hat\beta - V r \bigl(r^\top V r\bigr)^{-1} (r^\top \hat\beta - u) \tag{22.7.8}
\]
where $V = (X^\top X)^{-1}$.

• a.
In order to assess the sensitivity of the estimate of any linear combination of the elements of $\beta$, $\phi = t^\top \beta$, due to imposition of the constraint, it makes sense to divide the change $t^\top \hat\beta - t^\top \hat{\hat\beta}$ by the standard deviation of $t^\top \hat\beta$, i.e., to look at
\[
\frac{t^\top (\hat\beta - \hat{\hat\beta})}{\sigma \sqrt{t^\top (X^\top X)^{-1} t}}. \tag{22.7.9}
\]
Such a standardization allows you to compare the sensitivity of different linear combinations. Show that that linear combination of the elements of $\hat\beta$ which is affected most if one imposes the constraint $r^\top \beta = u$ is the constraint $t = r$ itself. If this value is small, then no other linear combination of the elements of $\hat\beta$ will be affected much by the imposition of the constraint either.

Answer. Using (22.7.8) and equation (25.4.1) one obtains
\begin{align*}
\max_t \frac{\bigl(t^\top (\hat\beta - \hat{\hat\beta})\bigr)^2}{\sigma^2\, t^\top (X^\top X)^{-1} t}
&= \frac{(\hat\beta - \hat{\hat\beta})^\top X^\top X (\hat\beta - \hat{\hat\beta})}{\sigma^2} \\
&= \frac{(r^\top \hat\beta - u)^\top \bigl(r^\top (X^\top X)^{-1} r\bigr)^{-1} (r^\top \hat\beta - u)}{\sigma^2}
= \frac{(r^\top \hat\beta - u)^2}{\sigma^2\, r^\top (X^\top X)^{-1} r}.
\end{align*}
The right hand side is the square of (22.7.9) evaluated at $t = r$, since $r^\top (\hat\beta - \hat{\hat\beta}) = r^\top \hat\beta - u$; the maximum is therefore attained at $t = r$.

22.8. Inequality Restrictions

With linear inequality restrictions, it makes sense to have $R$ of deficient rank: the restrictions are like two different half planes in the same plane, and together they define a quarter plane, or a triangle, etc.

One obvious approach would be: compute the unrestricted estimator, see what restrictions it violates, and apply these restrictions with equality. But this equality restricted estimator may then suddenly violate other restrictions.

One brute force approach would be: impose all combinations of restrictions and see if the so partially restricted parameter satisfies the other restrictions too; and among those that do, choose the one with the lowest SSE.

[Gre97, 8.5.3 on pp. 411/12] has good discussion. The inequality restricted estimator is biased, unless the true parameter value satisfies all inequality restrictions with equality. It is always a mixture between the unbiased $\hat\beta$ and some restricted estimator which is biased if this condition does not hold.
Its variance is always smaller than that of $\hat\beta$ but, incredibly, its MSE will sometimes be larger than that of $\hat\beta$. Don’t understand how this comes about.

22.9. Application: Biased Estimators and Pre-Test Estimators

The formulas about Constrained Least Squares which were just derived suggest that it is sometimes advantageous (in terms of MSE) to impose constraints even if they do not really hold. In other words, one should not put into a regression all explanatory variables which have an influence, but only the main ones. A logical extension of this idea is the common practice of first testing whether some variables have significant influence and dropping the variables if they do not. These so-called pre-test estimators are very common. [DM93, Chapter 3.7, pp. 94–98] says something about them. Pre-test estimation seems a good procedure, but the graph regarding MSE shows it is not: the pre-test estimator never has lowest MSE, and it has highest MSE exactly in the area where it is most likely to be applied.

CHAPTER 23

Additional Regressors

A good detailed explanation of the topics covered in this chapter is [DM93, pp. 19–24]. [DM93] use the addition of variables as their main paradigm for going from a more restrictive to a less restrictive model.

In this chapter, the usual regression model is given in the form
\[
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon = X\beta + \varepsilon, \qquad \varepsilon \sim (o, \sigma^2 I). \tag{23.0.1}
\]
We take a sequential approach to this regression. First we regress $y$ on $X_1$ alone, which gives the regression coefficient $\hat{\hat\beta}_1$. This by itself is an inconsistent estimator of $\beta_1$, but we will use it as a stepping stone towards the full regression. We make use of the information gained by the regression on $X_1$ in our computation of the full regression.
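To see concretely how the coefficient from the regression on $X_1$ alone differs from the one in the full regression, here is a small numerical sketch which is not part of the original notes. It assumes one column in each partition and uses the vectors $x_1 = (5, 0, 0)$, $x_2 = (-1, 4, 0)$, $y = (3, 3, 4)$ which appear in Problems 294 and 295 of this chapter; the function names are made up for illustration.

```python
# Short regression (y on x1 alone) versus full regression (y on x1 and x2),
# with a single column in each partition and no intercept.
# Vectors are those of Problems 294/295; function names are illustrative.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def simple_slope(x, v):
    """Slope in the regression of v on the single regressor x."""
    return dot(x, v) / dot(x, x)

def two_regressor_ols(x1, x2, y):
    """Solve the 2x2 normal equations for the regression of y on x1 and x2."""
    a, b, d = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    p, q = dot(x1, y), dot(x2, y)
    det = a * d - b * b
    return (d * p - b * q) / det, (a * q - b * p) / det

x1 = [5.0, 0.0, 0.0]
x2 = [-1.0, 4.0, 0.0]
y = [3.0, 3.0, 4.0]

b1_short = simple_slope(x1, y)                  # coefficient when x2 is left out
b1_full, b2_full = two_regressor_ols(x1, x2, y)

# The short coefficient absorbs that part of the effect of x2
# which can be explained by x1:
b1_implied = b1_full + simple_slope(x1, x2) * b2_full
```

With these numbers the short regression gives 0.6 as coefficient of $x_1$, while the full regression gives 0.75 for both coefficients; `b1_implied` reproduces the short coefficient from the full-regression results.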
Such a sequential approach may be appropriate in the following situations:
• If regression on $X_1$ is much simpler than the combined regression, for instance if $X_1$ contains dummy or trend variables, and the dataset is large. Example: model (??).
• If we want to fit the regressors in $X_2$ by graphical methods and those in $X_1$ by analytical methods (added variable plots).
• If we need an estimate of $\beta_2$ but are not interested in an estimate of $\beta_1$.
• If we want to test the joint significance of the regressors in $X_2$, while $X_1$ consists of regressors not being tested.

If one regresses $y$ on $X_1$, one gets $y = X_1 \hat{\hat\beta}_1 + \hat{\hat\varepsilon}$. Of course, $\hat{\hat\beta}_1$ is an inconsistent estimator of $\beta_1$, since some explanatory variables are left out. And $\hat{\hat\varepsilon}$ is orthogonal to $X_1$ but not to $X_2$.

The iterative “backfitting” method proceeds from here as follows: it regresses $\hat{\hat\varepsilon}$ on $X_2$, which gives another residual, which is again orthogonal on $X_2$ but no longer orthogonal on $X_1$. Then this new residual is regressed on $X_1$ again, etc.

In (23.0.1), $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ has full column rank, and the coefficient vector is $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$.

Problem 293. The purpose of this Problem is to get a graphical intuition of the issues in sequential regression. Make sure the stand-alone program xgobi is installed on your computer (in Debian GNU-Linux do apt-get install xgobi), and the R interface xgobi is installed (the R-command is simply install.packages("xgobi"), or, on a Debian system the preferred argument is install.packages("xgobi", lib = "/usr/lib/R/library")). You have to give the commands library(xgobi) and then reggeom(). This produces a graph in the XGobi window which looks like [DM93, Figure 3b on p. 22]. If you switch from the XYPlot view to the Rotation view, you will see the same lines rotating 3-dimensionally, and you can interact with this graph. You will see that this graph shows the dependent variable $y$, the regression of $y$ on $x_1$, and the regression of $y$ on $x_1$ and $x_2$.

• a.
1 point In order to show that you have correctly identified which line is $y$, please answer the following two questions: Which color is $y$: red, yellow, light blue, dark blue, green, purple, or white? If it is yellow, also answer the question: Is it that yellow line which is in part covered by a red line, or is it the other one? If it is red, green, or dark blue, also answer the question: Does it start at the origin or not?

• b. 1 point Now answer the same two questions about $x_1$.

• c. 1 point Now answer the same two questions about $x_2$.

• d. 1 point Now answer the same two questions about $\hat{\hat\varepsilon}$, the residual in the regression of $y$ on $x_1$.

• e. Now assume $x_1$ is the vector of ones. The $R^2$ of this regression is a ratio of the squared lengths of two of the lines in the regression. Which lines?

• f. 2 points If one regresses $\hat{\hat\varepsilon}$ on $x_2$, one gets a decomposition $\hat{\hat\varepsilon} = h + k$, where $h$ is a multiple of $x_2$ and $k$ orthogonal to $x_2$. This is the next step in the backfitting algorithm. Draw this decomposition into the diagram. The points are already invisibly present. Therefore you should use the line editor to connect the points. You may want to increase the magnification scale of the figure for this. (In my version of XGobi, I often lose lines if I try to add more lines. This seems to be a bug which will probably be fixed eventually.) Which label does the corner point of the decomposition have? Make a geometric argument that the new residual $k$ is no longer orthogonal to $x_1$.

• g. 1 point The next step in the backfitting procedure is to regress $k$ on $x_1$. The corner point for this decomposition is again invisibly in the animation. Identify the two endpoints of the residual in this regression. Hint: the R-command example(reggeom) produces a modified version of the animation in which the backfitting procedure is highlighted.
The successive residuals which are used as regressors are drawn in dark blue, and the quickly improving approximations to the fitted value are connected by a red zig-zag line.

• h. 1 point The diagram contains the points for two more backfitting steps. Identify the endpoints of both residuals.

• i. 2 points Of the five cornerpoints obtained by simple regressions, $c$, $p$, $q$, $r$, and $s$, three lie on one straight line, and the other two on a different straight line, with the intersection of these straight lines being the corner point in the multiple regression of $y$ on $x_1$ and $x_2$. Which three points are on the same line, and how can these two lines be characterized?

• j. 1 point Of the lines $cp$, $pq$, $qr$, and $rs$, two are parallel to $x_1$, and two parallel to $x_2$. Which two are parallel to $x_1$?

• k. 1 point Draw in the regression of $y$ on $x_2$.

• l. 3 points Which two variables are plotted against each other in an added-variable plot for $x_2$?

Here are the coordinates of some of the points in this animation:
\[
\begin{array}{rrrrr}
x_1 & x_2 & y & \hat y & \hat{\hat y} \\
5 & -1 & 3 & 3 & 3 \\
0 & 4 & 3 & 3 & 0 \\
0 & 0 & 4 & 0 & 0
\end{array}
\]
In the dataset which R submits to XGobi, all coordinates are multiplied by 1156, which has the effect that all the points included in the animation have integer coordinates.

Problem 294. 2 points How do you know that the decomposition
\[
\begin{bmatrix} 3 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 3 \\ 4 \end{bmatrix}
\]
is $y = \hat{\hat y} + \hat{\hat\varepsilon}$ in the regression of $y = \begin{bmatrix} 3 \\ 3 \\ 4 \end{bmatrix}$ on $x_1 = \begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix}$?

Answer. Besides the equation $y = \hat{\hat y} + \hat{\hat\varepsilon}$ we have to check two things: (1) $\hat{\hat y}$ is a linear combination of all the explanatory variables (here: is a multiple of $x_1$), and (2) $\hat{\hat\varepsilon}$ is orthogonal to all explanatory variables. Compare Problem ??.

Problem 295. 3 points In the same way, check that the decomposition
\[
\begin{bmatrix} 3 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 4 \end{bmatrix}
\]
is $y = \hat y + \hat\varepsilon$ in the regression of $y = \begin{bmatrix} 3 \\ 3 \\ 4 \end{bmatrix}$ on $x_1 = \begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix}$ and $x_2 = \begin{bmatrix} -1 \\ 4 \\ 0 \end{bmatrix}$.

Answer. Besides the equation $y = \hat y + \hat\varepsilon$ we have to check two things: (1) $\hat y$ is a linear combination of all the explanatory variables.
Since both $x_1$ and $x_2$ have zero as third coordinate, and they are linearly independent, they span the whole plane, therefore $\hat y$, which also has the third coordinate zero, is their linear combination. (2) $\hat\varepsilon$ is orthogonal to both explanatory variables because its only nonzero coordinate is the third.

The residuals $\hat{\hat\varepsilon}$ in the regression on $x_1$ are
\[
y - \hat{\hat y} = \begin{bmatrix} 3 \\ 3 \\ 4 \end{bmatrix} - \begin{bmatrix} 3 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 4 \end{bmatrix}.
\]
This vector is clearly orthogonal to $x_1 = \begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix}$. Now let us regress $\hat{\hat\varepsilon} = \begin{bmatrix} 0 \\ 3 \\ 4 \end{bmatrix}$ on $x_2 = \begin{bmatrix} -1 \\ 4 \\ 0 \end{bmatrix}$. Say $h$ is the vector of fitted values and $k$ the residual vector in this regression. We saw in problem 293 that this is the next step in backfitting, but $k$ is not the same as the residual vector $\hat\varepsilon$ in the multiple regression, because $k$ is not orthogonal to $x_1$. In order to get the correct residual in the joint regression and also the correct coefficient of $x_2$, one must regress $\hat{\hat\varepsilon}$ only on that part of $x_2$ which is orthogonal to $x_1$. This regressor is the dark blue line starting at the origin.

In formulas: One gets the correct $\hat\varepsilon$ and $\hat\beta_2$ by regressing $\hat{\hat\varepsilon} = M_1 y$ not on $X_2$ but on $M_1 X_2$, where $M_1 = I - X_1 (X_1^\top X_1)^{-1} X_1^\top$ is the matrix which forms the residuals under the regression on $X_1$. In other words, one has to remove the influence of $X_1$ not only from the dependent but also the independent variables. Instead of regressing the residuals $\hat{\hat\varepsilon} = M_1 y$ on $X_2$, one has to regress them on what is new about $X_2$ after we know $X_1$, i.e., on what remains of $X_2$ after taking out the effect of $X_1$, which is $M_1 X_2$. The regression which gets the correct $\hat\beta_2$ is therefore
\[
M_1 y = M_1 X_2 \hat\beta_2 + \hat\varepsilon \tag{23.0.2}
\]
In formulas, the correct $\hat\beta_2$ is
\[
\hat\beta_2 = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y. \tag{23.0.3}
\]
This regression also yields the correct covariance matrix. (The only thing which is not right is the number of degrees of freedom).
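The difference between the naive backfitting step and the regression (23.0.2) can be verified with the three-dimensional vectors used above. This is a minimal sketch, not part of the original notes; it assumes a single column in each of $X_1$ and $X_2$, and the helper names are made up for illustration.

```python
# Regressing M1 y on x2 (naive backfitting step) versus on M1 x2, as in (23.0.2),
# for the vectors x1 = (5,0,0), x2 = (-1,4,0), y = (3,3,4) from the text.
# Helper names are illustrative.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def residual_after(x, v):
    """Residual of v after regression on the single regressor x,
    i.e. M v with M = I - x (x'x)^{-1} x'."""
    c = dot(x, v) / dot(x, x)
    return [vi - c * xi for vi, xi in zip(v, x)]

x1 = [5.0, 0.0, 0.0]
x2 = [-1.0, 4.0, 0.0]
y = [3.0, 3.0, 4.0]

m1y = residual_after(x1, y)       # residual of y on x1
m1x2 = residual_after(x1, x2)     # part of x2 orthogonal to x1

# Naive step: regress M1 y on x2 itself; k is orthogonal to x2 but not to x1.
k = residual_after(x2, m1y)

# Step (23.0.2): regress M1 y on M1 x2; this gives beta_2 of the full regression
# as in (23.0.3), and the residual of the full regression.
beta2 = dot(m1x2, m1y) / dot(m1x2, m1x2)
eps = residual_after(m1x2, m1y)
```

For these data `m1y` is $(0, 3, 4)$, `m1x2` is $(0, 4, 0)$, `beta2` is 0.75 and `eps` is $(0, 0, 4)$, in agreement with the decomposition in Problem 295, whereas `k` has a nonzero scalar product with `x1`.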
The regression is therefore fully representative of the additional effect of $x_2$, and the plot of $\hat{\hat\varepsilon}$ against $M_1 X_2$ with the fitted line drawn (which has the correct slope $\hat\beta_2$) is called the “added variable plot” for $X_2$. [CW99, pp. 244–246] has a good discussion of added variable plots.

Problem 296. 2 points Show that in the model (23.0.1), the estimator $\hat\beta_2 = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y$ is unbiased. Compute $MSE[\hat\beta_2; \beta_2]$.

Answer. $\hat\beta_2 - \beta_2 = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 (X_1 \beta_1 + X_2 \beta_2 + \varepsilon) - \beta_2 = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 \varepsilon$; therefore $MSE[\hat\beta_2; \beta_2] = \sigma^2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 M_1 X_2 (X_2^\top M_1 X_2)^{-1} = \sigma^2 (X_2^\top M_1 X_2)^{-1}$.

In order to get an estimate of $\beta_1$, one can again do what seems intuitive, namely, regress $y - X_2 \hat\beta_2$ on $X_1$. This gives
\[
\hat\beta_1 = (X_1^\top X_1)^{-1} X_1^\top (y - X_2 \hat\beta_2). \tag{23.0.4}
\]
This regression also gives the right residuals, but not the right estimates of the covariance matrix.

Problem 297. The three Figures in [DM93, p. 22] can be seen in XGobi if you use the instructions in Problem 293. The purple line represents the dependent variable $y$, and the two yellow lines the explanatory variables $x_1$ and $x_2$. ($x_1$ is the one which is in part red.) The two green lines represent the unconstrained regression $y = \hat y + \hat\varepsilon$, and the two red lines the constrained regression $y = \hat{\hat y} + \hat{\hat\varepsilon}$ where $y$ is only regressed on $x_1$. The two dark blue lines, barely visible against the dark blue background, represent the regression of $x_2$ on $x_1$.

• a. The first diagram which XGobi shows on startup is [DM93, diagram (b) on p. 22]. Go into the Rotation view and rotate the diagram in such a way that the view is [DM93, Figure (a)]. You may want to delete the two white lines, since they are not shown in Figure (a).

• b.
Make a geometric argument that the light blue line, which represents $\hat y - \hat{\hat y} = X(\hat\beta - \hat{\hat\beta})$, is orthogonal on the green line $\hat\varepsilon$ (this is the green line which ends at the point $y$, i.e., not the green line which starts at the origin).

Answer. The light blue line lies in the plane spanned by $x_1$ and $x_2$, and $\hat\varepsilon$ is orthogonal to this plane.

• c. Make a geometric argument that the light blue line is also orthogonal to the red line $\hat{\hat y}$ emanating from the origin.

Answer. This is a little trickier. The red line $\hat{\hat\varepsilon}$ is orthogonal to $x_1$, and the green line $\hat\varepsilon$ is also orthogonal to $x_1$. Together, $\hat\varepsilon$ and $\hat{\hat\varepsilon}$ span therefore the plane orthogonal to $x_1$. Since the light blue line lies in the plane spanned by $\hat\varepsilon$ and $\hat{\hat\varepsilon}$, it is orthogonal to $x_1$, and therefore also to $\hat{\hat y}$, which is a multiple of $x_1$.

Question 297 shows that the decomposition $y = \hat{\hat y} + (\hat y - \hat{\hat y}) + \hat\varepsilon$ is orthogonal, i.e., all 3 vectors $\hat{\hat y}$, $\hat y - \hat{\hat y}$, and $\hat\varepsilon$ are orthogonal to each other. This is (22.7.6) in the special case that $u = o$ and therefore $\beta_0 = o$.

One can use this same animation also to show the following: If you first project the purple line on the plane spanned by the yellow lines, you get the green line in the plane. If you then project that green line on $x_1$, which is a subspace of the plane, then you get the red section of the yellow line. This is the same result as if you had projected the purple line directly on $x_1$. A matrix-algebraic proof of this fact is given in (A.6.3).

The same animation allows us to verify the following:
• In the regression of $y$ on $x_1$, the coefficient is $\hat{\hat\beta}_1$, and the residual is $\hat{\hat\varepsilon}$.
• In the regression of $y$ on $x_1$ and $x_2$, the coefficients are $\hat\beta_1$, $\hat\beta_2$, and the residual is $\hat\varepsilon$.
• In the regression of $y$ on $x_1$ and $M_1 x_2$, the coefficients are $\hat{\hat\beta}_1$, $\hat\beta_2$, and the residual is $\hat\varepsilon$. The residual is $\hat\varepsilon$ because the space spanned by the regressors is the same as in the regression on $x_1$ and $x_2$, and $\hat\varepsilon$ only depends on that space.
• In the regression of $y$ on $M_1 x_2$, the coefficient is $\hat\beta_2$, because the regressor I am leaving out is orthogonal to $M_1 x_2$. The residual contains the contribution of the left-out variable, i.e., it is $\hat\varepsilon + \hat{\hat\beta}_1 x_1$.
• But in the regression of $\hat{\hat\varepsilon} = M_1 y$ on $M_1 x_2$, the coefficient is $\hat\beta_2$ and the residual $\hat\varepsilon$.

This last statement is (23.0.3).

Now let us turn to proving all this mathematically. The “brute force” proof, i.e., the proof which is conceptually simplest but has to plow through some tedious mathematics, uses (14.2.4) with partitioned matrix inverses. For this we need (23.0.5).

Problem 298. 4 points This is a simplified version of question 393. Show the following, by multiplying $X^\top X$ with its alleged inverse: If $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ has full column rank, then $(X^\top X)^{-1}$ is the following partitioned matrix:
\[
\begin{bmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix}^{-1}
= \begin{bmatrix}
(X_1^\top X_1)^{-1} + K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top K_1 & -K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} \\
-(X_2^\top M_1 X_2)^{-1} X_2^\top K_1 & (X_2^\top M_1 X_2)^{-1}
\end{bmatrix} \tag{23.0.5}
\]
where $M_1 = I - X_1 (X_1^\top X_1)^{-1} X_1^\top$ and $K_1 = X_1 (X_1^\top X_1)^{-1}$.

From (23.0.5) one sees that the covariance matrix in regression (23.0.3) is the lower right partition of the covariance matrix in the full regression (23.0.1).

Problem 299. 6 points Use the usual formula $\hat\beta = (X^\top X)^{-1} X^\top y$ together with (23.0.5) to prove (23.0.3) and (23.0.4).

Answer. (14.2.4) reads here
\[
\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}
= \begin{bmatrix}
(X_1^\top X_1)^{-1} + K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top K_1 & -K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} \\
-(X_2^\top M_1 X_2)^{-1} X_2^\top K_1 & (X_2^\top M_1 X_2)^{-1}
\end{bmatrix}
\begin{bmatrix} X_1^\top y \\ X_2^\top y \end{bmatrix} \tag{23.0.6}
\]
Since $M_1 = I - K_1 X_1^\top$, one can simplify
\begin{align*}
\hat\beta_2 &= -(X_2^\top M_1 X_2)^{-1} X_2^\top K_1 X_1^\top y + (X_2^\top M_1 X_2)^{-1} X_2^\top y \tag{23.0.7} \\
&= (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y \tag{23.0.8} \\
\hat\beta_1 &= (X_1^\top X_1)^{-1} X_1^\top y + K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top K_1 X_1^\top y - K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top y \tag{23.0.9} \\
&= K_1^\top y - K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top (I - K_1 X_1^\top) y \tag{23.0.10} \\
&= K_1^\top y - K_1^\top X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y \tag{23.0.11} \\
&= K_1^\top (y - X_2 \hat\beta_2) \tag{23.0.12}
\end{align*}

[Gre97, pp.
245–7] follows a different proof strategy: he solves the partitioned normal equations
\[
\begin{bmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix}
\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}
= \begin{bmatrix} X_1^\top y \\ X_2^\top y \end{bmatrix} \tag{23.0.13}
\]
directly, without going through the inverse. A third proof strategy, used by [Seb77, pp. 65–72], is followed in Problems 301 and 302.

Problem 300. 5 points [Gre97, problem 18 on p. 326]. The following matrix gives the slope in the simple regression of the column variable on the row variable:
\[
\begin{array}{c|ccc}
 & y & x_1 & x_2 \\ \hline
y & 1 & 0.03 & 0.36 \\
x_1 & 0.4 & 1 & 0.3 \\
x_2 & 1.2 & 0.075 & 1
\end{array} \tag{23.0.14}
\]
For example, if $y$ is regressed on $x_1$, the slope is 0.4, but if $x_1$ is regressed on $y$, the slope is 0.03. All variables have zero means, so the constant terms in all regressions are zero. What are the two slope coefficients in the multiple regression of $y$ on $x_1$ and $x_2$? Hint: Use the partitioned normal equation as given in [Gre97, p. 245] in the special case when each of the partitions of $X$ has only one column.

Answer.
\[
\begin{bmatrix} x_1^\top x_1 & x_1^\top x_2 \\ x_2^\top x_1 & x_2^\top x_2 \end{bmatrix}
\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}
= \begin{bmatrix} x_1^\top y \\ x_2^\top y \end{bmatrix} \tag{23.0.15}
\]
The first row reads
\[
\hat\beta_1 + (x_1^\top x_1)^{-1} x_1^\top x_2 \hat\beta_2 = (x_1^\top x_1)^{-1} x_1^\top y \tag{23.0.16}
\]
which is the upper line of [Gre97, (6.24) on p. 245], and in our numbers this is $\hat\beta_1 = 0.4 - 0.3 \hat\beta_2$. The second row reads
\[
(x_2^\top x_2)^{-1} x_2^\top x_1 \hat\beta_1 + \hat\beta_2 = (x_2^\top x_2)^{-1} x_2^\top y \tag{23.0.17}
\]
or in our numbers $0.075 \hat\beta_1 + \hat\beta_2 = 1.2$. Plugging in the formula for $\hat\beta_1$ gives $0.075 \cdot 0.4 - 0.075 \cdot 0.3 \hat\beta_2 + \hat\beta_2 = 1.2$. This gives $\hat\beta_2 = 1.17 / 0.9775 = 1.196931$, i.e., roughly 1.2, and $\hat\beta_1 = 0.4 - 0.3 \cdot 1.196931 = 0.0409207$, i.e., roughly 0.041.

Problem 301. Derive (23.0.3) and (23.0.4) from the first order conditions for minimizing
\[
(y - X_1 \beta_1 - X_2 \beta_2)^\top (y - X_1 \beta_1 - X_2 \beta_2). \tag{23.0.18}
\]
Answer. Start by writing down the OLS objective function for the full model. Perhaps we can use the more sophisticated matrix differentiation rules?
\[
(y - X_1 \beta_1 - X_2 \beta_2)^\top (y - X_1 \beta_1 - X_2 \beta_2) = y^\top y + \beta_1^\top X_1^\top X_1 \beta_1 + \beta_2^\top X_2^\top X_2 \beta_2 - 2 y^\top X_1 \beta_1 - 2 y^\top X_2 \beta_2 + 2 \beta_2^\top X_2^\top X_1 \beta_1. \tag{23.0.19}
\]
Taking partial derivatives with respect to $\beta_1^\top$ and $\beta_2^\top$ gives
\begin{align*}
& 2 \beta_1^\top X_1^\top X_1 - 2 y^\top X_1 + 2 \beta_2^\top X_2^\top X_1 \quad \text{or, transposed} \quad 2 X_1^\top X_1 \beta_1 - 2 X_1^\top y + 2 X_1^\top X_2 \beta_2 \tag{23.0.20} \\
& 2 \beta_2^\top X_2^\top X_2 - 2 y^\top X_2 + 2 \beta_1^\top X_1^\top X_2 \quad \text{or, transposed} \quad 2 X_2^\top X_2 \beta_2 - 2 X_2^\top y + 2 X_2^\top X_1 \beta_1 \tag{23.0.21}
\end{align*}
Setting them zero and replacing $\beta_1$ by $\hat\beta_1$ and $\beta_2$ by $\hat\beta_2$ gives
\begin{align*}
X_1^\top X_1 \hat\beta_1 &= X_1^\top (y - X_2 \hat\beta_2) \tag{23.0.22} \\
X_2^\top X_2 \hat\beta_2 &= X_2^\top (y - X_1 \hat\beta_1). \tag{23.0.23}
\end{align*}
Premultiply (23.0.22) by $X_1 (X_1^\top X_1)^{-1}$:
\[
X_1 \hat\beta_1 = X_1 (X_1^\top X_1)^{-1} X_1^\top (y - X_2 \hat\beta_2). \tag{23.0.24}
\]
Plug this into (23.0.23):
\begin{align*}
X_2^\top X_2 \hat\beta_2 &= X_2^\top \bigl(y - X_1 (X_1^\top X_1)^{-1} X_1^\top y + X_1 (X_1^\top X_1)^{-1} X_1^\top X_2 \hat\beta_2\bigr) \tag{23.0.25} \\
X_2^\top M_1 X_2 \hat\beta_2 &= X_2^\top M_1 y. \tag{23.0.26}
\end{align*}
(23.0.26) is the normal equation of the regression of $M_1 y$ on $M_1 X_2$; it immediately implies (23.0.3). Once $\hat\beta_2$ is known, (23.0.22) is the normal equation of the regression of $y - X_2 \hat\beta_2$ on $X_1$, which gives (23.0.4).

Problem 302. Using (23.0.3) and (23.0.4) show that the residuals in regression (23.0.1) are identical to those in the regression of $M_1 y$ on $M_1 X_2$.

Answer.
\begin{align*}
\hat\varepsilon &= y - X_1 \hat\beta_1 - X_2 \hat\beta_2 \tag{23.0.27} \\
&= y - X_1 (X_1^\top X_1)^{-1} X_1^\top (y - X_2 \hat\beta_2) - X_2 \hat\beta_2 \tag{23.0.28} \\
&= M_1 y - M_1 X_2 \hat\beta_2. \tag{23.0.29}
\end{align*}

Problem 303. The following problem derives one of the main formulas for adding regressors, following [DM93, pp. 19–24]. We are working in model (23.0.1).

• a. 1 point Show that, if $X$ has full column rank, then $X^\top X$, $X_1^\top X_1$, and $X_2^\top X_2$ are nonsingular. Hint: A matrix $X$ has full column rank if $Xa = o$ implies $a = o$.

Answer. From $X^\top X a = o$ follows $a^\top X^\top X a = 0$ which can also be written $\lVert Xa \rVert^2 = 0$. Therefore $Xa = o$, and since the columns are linearly independent, it follows $a = o$. $X_1^\top X_1$ and $X_2^\top X_2$ are nonsingular because, along with $X$, also $X_1$ and $X_2$ have full column rank.

• b. 1 point Define $M = I - X (X^\top X)^{-1} X^\top$ and $M_1 = I - X_1 (X_1^\top X_1)^{-1} X_1^\top$. Show that both $M$ and $M_1$ are projection matrices. (Give the definition of a projection matrix.)
Which spaces do they project on? Which space is bigger?

Answer. A projection matrix is symmetric and idempotent. That $MM = M$ is easily verified. $M$ projects on the orthogonal complement of the column space of $X$, and $M_1$ on that of $X_1$. I.e., $M_1$ projects on the larger space.

• c. 2 points Prove that $M_1 M = M$ and that $M X_1 = O$ as well as $M X_2 = O$. You will need each of these equations below. What is their geometric meaning?

Answer. $X_1 = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} I \\ O \end{bmatrix} = XA$, say. Therefore $M_1 M = \bigl(I - XA (A^\top X^\top X A)^{-1} A^\top X^\top\bigr) M = M$ because $X^\top M = O$. Geometrically this means that the space on which $M$ projects is a subspace of the space on which $M_1$ projects. To show that $M X_2 = O$ note that $X_2$ can be written in the form $X_2 = XB$, too; this time, $B = \begin{bmatrix} O \\ I \end{bmatrix}$. $M X_2 = O$ means geometrically that $M$ projects on a space that is orthogonal to all columns of $X_2$.

• d. 2 points Show that $M_1 X_2$ has full column rank.

Answer. If $M_1 X_2 b = o$, then $X_2 b = X_1 a$ for some $a$. We showed this in Problem 196. Therefore $\begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} -a \\ b \end{bmatrix} = o$, and since $\begin{bmatrix} X_1 & X_2 \end{bmatrix}$ has full column rank, it follows $\begin{bmatrix} -a \\ b \end{bmatrix} = \begin{bmatrix} o \\ o \end{bmatrix}$, in particular $b = o$.

• e. 1 point Here is some more notation: the regression of $y$ on $X_1$ and $X_2$ can also be represented by the equation
\[
y = X_1 \hat\beta_1 + X_2 \hat\beta_2 + \hat\varepsilon \tag{23.0.30}
\]
The difference between (23.0.1) and (23.0.30) is that (23.0.30) contains the parameter estimates, not their true values, and the residuals, not the true disturbances. Explain the difference between residuals and disturbances, and between the fitted regression line and the true regression line.

• f. 1 point Verify that premultiplication of (23.0.30) by $M_1$ gives
\[
M_1 y = M_1 X_2 \hat\beta_2 + \hat\varepsilon \tag{23.0.31}
\]
Answer. We need $M_1 X_1 = O$ and $M_1 \hat\varepsilon = M_1 M y = M y = \hat\varepsilon$ (or this can also be seen because $X_1^\top \hat\varepsilon = o$).

• g. 2 points Prove that (23.0.31) is the fit which one gets if one regresses $M_1 y$ on $M_1 X_2$.
In other words, if one runs OLS with dependent variable $M_1 y$ and explanatory variables $M_1 X_2$, one gets the same $\hat\beta_2$ and $\hat\varepsilon$ as in (23.0.31), which are the same $\hat\beta_2$ and $\hat\varepsilon$ as in the complete regression (23.0.30).

Answer. According to Problem ?? we have to check $X_2^\top M_1 \hat\varepsilon = X_2^\top M_1 M y = X_2^\top M y = O y = o$.

• h. 1 point Show that $V[\hat\beta_2] = \sigma^2 (X_2^\top M_1 X_2)^{-1}$. Are the variance estimates and confidence intervals valid, which the computer automatically prints out if one regresses $M_1 y$ on $M_1 X_2$?

Answer. Yes except for the number of degrees of freedom.

• i. 4 points If one premultiplies (23.0.1) by $M_1$, one obtains
\[
M_1 y = M_1 X_2 \beta_2 + M_1 \varepsilon, \qquad M_1 \varepsilon \sim (o, \sigma^2 M_1) \tag{23.0.32}
\]
Although the covariance matrix of the disturbance $M_1 \varepsilon$ in (23.0.32) is no longer spherical, show that nevertheless the $\hat\beta_2$ obtained by running OLS on (23.0.32) is the BLUE of $\beta_2$ based on the information given in (23.0.32) (i.e., assuming that $M_1 y$ and $M_1 X_2$ are known, but not necessarily $M_1$, $y$, and $X_2$ separately). Hint: this proof is almost identical to the proof that for spherically distributed disturbances the OLS is BLUE (e.g. given in [DM93, p. 159]), but you have to add some $M_1$’s to your formulas.

Answer. Any other linear estimator $\tilde\gamma$ of $\beta_2$ can be written as $\tilde\gamma = \bigl((X_2^\top M_1 X_2)^{-1} X_2^\top + C\bigr) M_1 y$. Its expected value is $E[\tilde\gamma] = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 X_2 \beta_2 + C M_1 X_2 \beta_2$. For $\tilde\gamma$ to be unbiased, regardless of the value of $\beta_2$, $C$ must satisfy $C M_1 X_2 = O$. From this follows
\[
MSE[\tilde\gamma; \beta_2] = V[\tilde\gamma] = \sigma^2 \bigl((X_2^\top M_1 X_2)^{-1} X_2^\top + C\bigr) M_1 \bigl(X_2 (X_2^\top M_1 X_2)^{-1} + C^\top\bigr) = \sigma^2 (X_2^\top M_1 X_2)^{-1} + \sigma^2 C M_1 C^\top,
\]
i.e., it exceeds the MSE-matrix of $\hat\beta_2$ by a nonnegative definite matrix. Is it unique? The formula for the BLUE is not unique, since one can add any $C$ with $C M_1 C^\top = O$ or equivalently $C M_1 = O$ or $C = A X_1^\top$ for some $A$.
However such a $C$ applied to a dependent variable of the form $M_1 y$ will give the null vector, therefore the values of the BLUE for those values of $y$ which are possible are indeed unique.

• j. 1 point Once $\hat\beta_2$ is known, one can move it to the left hand side in (23.0.30) to get
\[ y - X_2 \hat\beta_2 = X_1 \hat\beta_1 + \hat\varepsilon \tag{23.0.33} \]
Prove that one gets the right values of $\hat\beta_1$ and of $\hat\varepsilon$ if one regresses $y - X_2 \hat\beta_2$ on $X_1$.

Answer. The simplest answer just observes that $X_1^\top \hat\varepsilon = o$. Or: The normal equation for this pseudo-regression is $X_1^\top y - X_1^\top X_2 \hat\beta_2 = X_1^\top X_1 \hat\beta_1$, which holds due to the normal equation for the full model.

• k. 1 point Does (23.0.33) also give the right covariance matrix for $\hat\beta_1$?

Answer. No, since $y - X_2 \hat\beta_2$ has a different covariance matrix than $\sigma^2 I$.

The following problems give some applications of the results in Problem 303. You are allowed to use the results of Problem 303 without proof.

Problem 304. Assume your regression involves an intercept, i.e., the matrix of regressors is $\begin{bmatrix} \iota & X \end{bmatrix}$, where $X$ is the matrix of the “true” explanatory variables with no vector of ones built in, and $\iota$ the vector of ones. The regression can therefore be written
\[ y = \iota\alpha + X\beta + \varepsilon. \tag{23.0.34} \]

• a. 1 point Show that the OLS estimate of the slope parameters $\beta$ can be obtained by regressing $\underline{y}$ on $\underline{X}$ without intercept, where $\underline{y}$ and $\underline{X}$ are the variables with their means taken out, i.e., $\underline{y} = Dy$ and $\underline{X} = DX$, with $D = I - \frac{1}{n}\iota\iota^\top$.

Answer. This is called the “sweeping out of means.” It follows immediately from (23.0.3). This is the usual procedure to do regression with a constant term: in simple regression $y_i = \alpha + \beta x_i + \varepsilon_i$, (23.0.3) is equation (14.2.22):
\[ \hat\beta = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}. \tag{23.0.35} \]

• b. Show that the OLS estimate of the intercept is $\hat\alpha = \bar y - \bar x^\top \hat\beta$ where $\bar x^\top$ is the row vector of column means of $X$, i.e., $\bar x^\top = \frac{1}{n}\iota^\top X$.

Answer. This is exactly (23.0.4).
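Here is a more specific argument: The intercept $\hat\alpha$ is obtained by regressing $y - X\hat\beta$ on $\iota$. The normal equation for this second regression is $\iota^\top y - \iota^\top X \hat\beta = \iota^\top \iota \hat\alpha$. If $\bar y$ is the mean of $y$, and $\bar x^\top$ the row vector consisting of means of the columns of $X$, then this gives $\bar y = \bar x^\top \hat\beta + \hat\alpha$. In the case of simple regression, this was derived earlier as formula (14.2.23).

• c. 2 points Show that $\operatorname{MSE}[\hat\beta; \beta] = \sigma^2 (\underline{X}^\top \underline{X})^{-1}$, where $\underline{X} = DX$ is the matrix of regressors with their column means taken out. (Use the formula for $\hat\beta$.)

Answer. Since
\[ \begin{bmatrix} \iota^\top \\ X^\top \end{bmatrix} \begin{bmatrix} \iota & X \end{bmatrix} = \begin{bmatrix} n & n\bar x^\top \\ n\bar x & X^\top X \end{bmatrix}, \tag{23.0.36} \]
it follows by Problem 393
\[ \left( \begin{bmatrix} \iota^\top \\ X^\top \end{bmatrix} \begin{bmatrix} \iota & X \end{bmatrix} \right)^{-1} = \begin{bmatrix} 1/n + \bar x^\top (\underline{X}^\top \underline{X})^{-1} \bar x & -\bar x^\top (\underline{X}^\top \underline{X})^{-1} \\ -(\underline{X}^\top \underline{X})^{-1} \bar x & (\underline{X}^\top \underline{X})^{-1} \end{bmatrix} \tag{23.0.37} \]
In other words, one simply does as if the actual regressors had been the data with their means removed, and then takes the inverse of that design matrix. The only place where one has to be careful is the number of degrees of freedom. See also Seber [Seb77, section 11.7] about centering and scaling the data.

• d. 3 points Show that $\hat y - \iota\bar y = \underline{X}\hat\beta$.

Answer. First note that $X = \underline{X} + \frac{1}{n}\iota\iota^\top X = \underline{X} + \iota\bar x^\top$ where $\bar x^\top$ is the row vector of means of $X$. By definition, $\hat y = \iota\hat\alpha + X\hat\beta = \iota\hat\alpha + \underline{X}\hat\beta + \iota\bar x^\top\hat\beta = \iota(\hat\alpha + \bar x^\top\hat\beta) + \underline{X}\hat\beta = \iota\bar y + \underline{X}\hat\beta$.

• e. 2 points Show that $R^2 = \dfrac{\underline{y}^\top \underline{X} (\underline{X}^\top \underline{X})^{-1} \underline{X}^\top \underline{y}}{\underline{y}^\top \underline{y}}$, where $\underline{y} = Dy$.

Answer.
\[ R^2 = \frac{(\hat y - \bar y\iota)^\top (\hat y - \bar y\iota)}{\underline{y}^\top \underline{y}} = \frac{\hat\beta^\top \underline{X}^\top \underline{X} \hat\beta}{\underline{y}^\top \underline{y}} \tag{23.0.38} \]
and now plugging in the formula for $\hat\beta$ the result follows.

• f. 3 points Now, split once more $\underline{X} = \begin{bmatrix} \underline{X}_1 & \underline{x}_2 \end{bmatrix}$ where the second partition $\underline{x}_2$ consists of one column only, and $\underline{X}$ is, as above, the $X$ matrix with the column means taken out. Conformably, $\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}$. Show that
\[ \operatorname{var}[\hat\beta_2] = \frac{\sigma^2}{\underline{x}_2^\top \underline{x}_2} \, \frac{1}{1 - R^2_{2\cdot}} \tag{23.0.39} \]
where $R^2_{2\cdot}$ is the $R^2$ in the regression of $x_2$ on all other variables in $X$. This is in [Gre97, (9.3) on p. 421]. Hint: you should first show that $\operatorname{var}[\hat\beta_2] = \sigma^2 / \underline{x}_2^\top M_1 \underline{x}_2$ where $M_1 = I - \underline{X}_1 (\underline{X}_1^\top \underline{X}_1)^{-1} \underline{X}_1^\top$.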
Here is an interpretation of (23.0.39) which you don’t have to prove: $\sigma^2 / \underline{x}_2^\top \underline{x}_2$ (underlining denotes the variable with its mean taken out) is the variance in a simple regression with a constant term and $x_2$ as the only explanatory variable, and $1/(1 - R^2_{2\cdot})$ is called the variance inflation factor.

Answer. Note that we are not talking about the variance of the constant term but that of all the other terms.
\[ \underline{x}_2^\top M_1 \underline{x}_2 = \underline{x}_2^\top \underline{x}_2 - \underline{x}_2^\top \underline{X}_1 (\underline{X}_1^\top \underline{X}_1)^{-1} \underline{X}_1^\top \underline{x}_2 = \underline{x}_2^\top \underline{x}_2 \left( 1 - \frac{\underline{x}_2^\top \underline{X}_1 (\underline{X}_1^\top \underline{X}_1)^{-1} \underline{X}_1^\top \underline{x}_2}{\underline{x}_2^\top \underline{x}_2} \right) \tag{23.0.40} \]
and since the fraction is $R^2_{2\cdot}$, i.e., it is the $R^2$ in the regression of $x_2$ on all other variables in $X$, we get the result.

CHAPTER 24

Residuals: Standardized, Predictive, “Studentized”

24.1. Three Decisions about Plotting Residuals

After running a regression it is always advisable to look at the residuals. Here one has to make three decisions.

The first decision is whether to look at the ordinary residuals
\[ \hat\varepsilon_i = y_i - x_i^\top \hat\beta \tag{24.1.1} \]
($x_i^\top$ is the $i$th row of $X$), or the “predictive” residuals, which are the residuals computed using the OLS estimate of $\beta$ gained from all the other data except the data point where the residual is taken. If one writes $\hat\beta(i)$ for the OLS estimate without the $i$th observation, the defining equation for the $i$th predictive residual, which we call $\hat\varepsilon_i(i)$, is
\[ \hat\varepsilon_i(i) = y_i - x_i^\top \hat\beta(i). \tag{24.1.2} \]

The second decision is whether to standardize the residuals or not, i.e., whether to divide them by their estimated standard deviations or not. Since $\hat\varepsilon = My$, the variance of the $i$th ordinary residual is
\[ \operatorname{var}[\hat\varepsilon_i] = \sigma^2 m_{ii} = \sigma^2 (1 - h_{ii}), \tag{24.1.3} \]
and regarding the predictive residuals it will be shown below, see (24.2.9), that
\[ \operatorname{var}[\hat\varepsilon_i(i)] = \frac{\sigma^2}{m_{ii}} = \frac{\sigma^2}{1 - h_{ii}}. \tag{24.1.4} \]
Here
\[ h_{ii} = x_i^\top (X^\top X)^{-1} x_i. \tag{24.1.5} \]
(Note that $x_i$ is the $i$th row of $X$ written as a column vector.) $h_{ii}$ is the $i$th diagonal element of the “hat matrix” $H = X (X^\top X)^{-1} X^\top$, the projector on the column space of $X$. This projector is called “hat matrix” because $\hat y = Hy$, i.e., $H$ puts the “hat” on $y$.

Problem 305.
2 points Show that the $i$th diagonal element of the “hat matrix” $H = X (X^\top X)^{-1} X^\top$ is $x_i^\top (X^\top X)^{-1} x_i$ where $x_i$ is the $i$th row of $X$ written as a column vector.

Answer. In terms of $e_i$, the $n$-vector with 1 on the $i$th place and 0 everywhere else, $x_i = X^\top e_i$, and the $i$th diagonal element of the hat matrix is $e_i^\top H e_i = e_i^\top X (X^\top X)^{-1} X^\top e_i = x_i^\top (X^\top X)^{-1} x_i$.

Problem 306. 2 points The variance of the $i$th disturbance is $\sigma^2$. Is the variance of the $i$th residual bigger than $\sigma^2$, smaller than $\sigma^2$, or equal to $\sigma^2$? (Before doing the math, first argue in words what you would expect it to be.) What about the variance of the predictive residual? Prove your answers mathematically. You are allowed to use (24.2.9) without proof.

Answer. Here is only the math part of the answer: $\hat\varepsilon = My$. Since $M = I - H$ is idempotent and symmetric, we get $V[My] = \sigma^2 M$, in particular this means $\operatorname{var}[\hat\varepsilon_i] = \sigma^2 m_{ii}$ where $m_{ii}$ is the $i$th diagonal element of $M$. Then $m_{ii} = 1 - h_{ii}$. Since all diagonal elements of projection matrices are between 0 and 1, the answer is: the variances of the ordinary residuals cannot be bigger than $\sigma^2$. Regarding predictive residuals, if we plug $m_{ii} = 1 - h_{ii}$ into (24.2.9) it becomes
\[ \hat\varepsilon_i(i) = \frac{1}{m_{ii}} \hat\varepsilon_i, \qquad \text{therefore} \qquad \operatorname{var}[\hat\varepsilon_i(i)] = \frac{1}{m_{ii}^2} \sigma^2 m_{ii} = \frac{\sigma^2}{m_{ii}} \tag{24.1.6} \]
which is bigger than $\sigma^2$.

Problem 307. Decide in the following situations whether you want predictive residuals or ordinary residuals, and whether you want them standardized or not.

• a. 1 point You are looking at the residuals in order to check whether the associated data points are outliers and perhaps do not belong in the model.

Answer. Here one should use the predictive residuals. If the $i$th observation is an outlier which should not be in the regression, then one should not use it when running the regression. Its inclusion may have a strong influence on the regression result, and therefore the residual may not be as conspicuous. One should standardize them.

• b.
1 point You are looking at the residuals in order to assess whether there is heteroskedasticity.

Answer. Here you want them standardized, but there is no reason to use the predictive residuals. Ordinary residuals are a little more precise than predictive residuals because they are based on more observations.

• c. 1 point You are looking at the residuals in order to assess whether the disturbances are autocorrelated.

Answer. Same answer as for b.

• d. 1 point You are looking at the residuals in order to assess whether the disturbances are normally distributed.

Answer. In my view, one should make a normal QQ-plot of standardized residuals, but one should not use the predictive residuals. To see why, let us first look at the distribution of the standardized residuals before division by $s$. Each $\hat\varepsilon_i / \sqrt{1 - h_{ii}}$ is normally distributed with mean zero and standard deviation $\sigma$. (But different such residuals are not independent.) If one takes a QQ-plot of those residuals against the normal distribution, one will get in the limit a straight line with slope $\sigma$. If one divides every residual by $s$, the slope will be close to 1, but one will again get something approximating a straight line. The fact that $s$ is random does not affect the relation of the residuals to each other, and this relation is what determines whether or not the QQ-plot approximates a straight line.

But Belsley, Kuh, and Welsch [BKW80, p. 43] draw a normal probability plot of the studentized, not the standardized, residuals. They give no justification for their choice. I think it is the wrong choice.

• e. 1 point Is there any situation in which you do not want to standardize the residuals?

Answer. Standardization is a mathematical procedure which is justified when certain conditions hold. But there is no guarantee that these conditions actually hold, and in order to get a more immediate impression of the fit of the curve one may want to look at the unstandardized residuals.
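
The four kinds of residuals distinguished in Problem 307 can be computed side by side in R. The following is only a minimal sketch: the regression on R's built-in longley data is an illustrative choice, and the variable names (e_pred, e_std, e_stdpred) are made up for this example. It uses (24.1.6) for the predictive residuals and draws the normal QQ-plot of the standardized ordinary residuals discussed in part d.

```r
# Illustrative regression on R's built-in Longley data (an assumed example).
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Population + Year,
          data = longley)

e <- resid(fit)           # ordinary residuals
h <- hatvalues(fit)       # leverages h_ii, the diagonal of the hat matrix
s <- summary(fit)$sigma   # s, the estimate of sigma from the full regression

e_pred    <- e / (1 - h)            # predictive residuals, by (24.1.6)
e_std     <- e / (s * sqrt(1 - h))  # standardized ordinary residuals
e_stdpred <- rstudent(fit)          # standardized predictive residuals

# Normal QQ-plot of the standardized ordinary residuals (Problem 307 d).
qqnorm(e_std)
qqline(e_std)
```

The hand-computed e_std should agree with R's own rstandard(fit); comparing the two is a quick check that the leverages and s were used correctly.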
The third decision is how to plot the residuals. Never do it against $y$. Either do it against the predicted $\hat y$, or make several plots against all the columns of the $X$-matrix. In time series, also a plot of the residuals against time is called for.

Another option is the partial residual plot; see also (23.0.2). Say $\hat\beta[h]$ is the estimated parameter vector, which is estimated with the full model, but after estimation we drop the $h$th parameter, and $X[h]$ is the $X$-matrix without the $h$th column, and $x_h$ is the $h$th column of the $X$-matrix. Then by (23.0.4), the estimate of the $h$th slope parameter is the same as that in the simple regression of $y - X[h]\hat\beta[h]$ on $x_h$. The plot of $y - X[h]\hat\beta[h]$ against $x_h$ is called the $h$th partial residual plot.

To understand this better, start out with a regression $y_i = \alpha + \beta x_i + \gamma z_i + \varepsilon_i$, which gives you the fit $y_i = \hat\alpha + \hat\beta x_i + \hat\gamma z_i + \hat\varepsilon_i$. Now if you regress $y_i - \hat\alpha - \hat\beta x_i$ on $x_i$ and $z_i$ then the intercept will be zero and the estimated coefficient of $x_i$ will be zero, and the estimated coefficient of $z_i$ will be $\hat\gamma$, and the residuals will be $\hat\varepsilon_i$. The plot of $y_i - \hat\alpha - \hat\beta x_i$ versus $z_i$ is the partial residuals plot for $z$.

24.2. Relationship between Ordinary and Predictive Residuals

In equation (24.1.2), the $i$th predictive residual was defined in terms of $\hat\beta(i)$, the parameter estimate from the regression of $y$ on $X$ with the $i$th observation left out. We will show now that there is a very simple mathematical relationship between the $i$th predictive residual and the $i$th ordinary residual, namely, equation (24.2.9). (It is therefore not necessary to run $n$ different regressions to get the $n$ predictive residuals.)

We will write $y(i)$ for the $y$ vector with the $i$th element deleted, and $X(i)$ is the matrix $X$ with the $i$th row deleted.

Problem 308. 2 points Show that
\[ X(i)^\top X(i) = X^\top X - x_i x_i^\top \tag{24.2.1} \]
\[ X(i)^\top y(i) = X^\top y - x_i y_i. \tag{24.2.2} \]
Answer. Write (24.2.2) as $X^\top y = X(i)^\top y(i) + x_i y_i$, and observe that with our definition of $x_i$ as column vectors representing the rows of $X$, $X^\top = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}$. Therefore
\[ X^\top y = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = x_1 y_1 + \cdots + x_n y_n. \tag{24.2.3} \]

An important stepping stone towards the proof of (24.2.9) is equation (24.2.8), which gives a relationship between $h_{ii}$ and
\[ h_{ii}(i) = x_i^\top (X(i)^\top X(i))^{-1} x_i. \tag{24.2.4} \]
$\hat y_i(i) = x_i^\top \hat\beta(i)$ has variance $\sigma^2 h_{ii}(i)$. The following problems give the steps necessary to prove (24.2.8). We begin with a simplified version of theorem A.8.2 in the Mathematical Appendix:

Theorem 24.2.1. Let $A$ be a nonsingular $k \times k$ matrix, $\delta \neq 0$ a scalar, and $b$ a $k \times 1$ vector with $b^\top A^{-1} b + \delta \neq 0$. Then
\[ \Bigl( A + \frac{b b^\top}{\delta} \Bigr)^{-1} = A^{-1} - \frac{A^{-1} b b^\top A^{-1}}{\delta + b^\top A^{-1} b}. \tag{24.2.5} \]

Problem 309. Prove (24.2.5) by showing that the product of the matrix with its alleged inverse is the unit matrix.

Problem 310. As an application of (24.2.5) show that
\[ (X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_i x_i^\top (X^\top X)^{-1}}{1 - h_{ii}} \tag{24.2.6} \]
is the inverse of $X(i)^\top X(i)$.

Answer. This is (24.2.5), or (A.8.20), with $A = X^\top X$, $b = x_i$, and $\delta = -1$.

Problem 311. Using (24.2.6) show that
\[ (X(i)^\top X(i))^{-1} x_i = \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i, \tag{24.2.7} \]
and using (24.2.7) show that $h_{ii}(i)$ is related to $h_{ii}$ by the equation
\[ 1 + h_{ii}(i) = \frac{1}{1 - h_{ii}} \tag{24.2.8} \]
[Gre97, (9-37) on p. 445] was apparently not aware of this relationship.

Problem 312. Prove the following mathematical relationship between predictive residuals and ordinary residuals:
\[ \hat\varepsilon_i(i) = \frac{1}{1 - h_{ii}} \hat\varepsilon_i \tag{24.2.9} \]
which is the same as (21.0.29), only in a different notation.

Answer. For this we have to apply the above mathematical tools. With the help of (24.2.7) (transpose it!)
and (24.2.2), (24.1.2) becomes
\begin{align*}
\hat\varepsilon_i(i) &= y_i - x_i^\top (X(i)^\top X(i))^{-1} X(i)^\top y(i) \\
&= y_i - \frac{1}{1 - h_{ii}} x_i^\top (X^\top X)^{-1} (X^\top y - x_i y_i) \\
&= y_i - \frac{1}{1 - h_{ii}} x_i^\top \hat\beta + \frac{1}{1 - h_{ii}} x_i^\top (X^\top X)^{-1} x_i y_i \\
&= y_i \Bigl( 1 + \frac{h_{ii}}{1 - h_{ii}} \Bigr) - \frac{1}{1 - h_{ii}} x_i^\top \hat\beta \\
&= \frac{1}{1 - h_{ii}} (y_i - x_i^\top \hat\beta)
\end{align*}
This is a little tedious but simplifies extremely nicely at the end.

The relationship (24.2.9) is so simple because the estimation of $\eta_i = x_i^\top \beta$ can be done in two steps. First collect the information which the $n - 1$ observations other than the $i$th contribute to the estimation of $\eta_i = x_i^\top \beta$; this information is contained in $\hat y_i(i)$. The information from all observations except the $i$th can be written as
\[ \hat y_i(i) = \eta_i + \delta_i, \qquad \delta_i \sim (0, \sigma^2 h_{ii}(i)) \tag{24.2.10} \]
Here $\delta_i$ is the “sampling error” or “estimation error” $\hat y_i(i) - \eta_i$ from the regression of $y(i)$ on $X(i)$. If we combine this compound “observation” with the $i$th observation $y_i$, we get
\[ \begin{bmatrix} \hat y_i(i) \\ y_i \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \eta_i + \begin{bmatrix} \delta_i \\ \varepsilon_i \end{bmatrix}, \qquad \begin{bmatrix} \delta_i \\ \varepsilon_i \end{bmatrix} \sim \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma^2 \begin{bmatrix} h_{ii}(i) & 0 \\ 0 & 1 \end{bmatrix} \right) \tag{24.2.11} \]
This is a regression model similar to model (14.1.1), but this time with a nonspherical covariance matrix.

Problem 313. Show that the BLUE of $\eta_i$ in model (24.2.11) is
\[ \hat y_i = (1 - h_{ii}) \hat y_i(i) + h_{ii} y_i = \hat y_i(i) + h_{ii} \hat\varepsilon_i(i) \tag{24.2.12} \]
Hint: apply (24.2.8). Use this to prove (24.2.9).

Answer. As shown in problem 178, the BLUE in this situation is the weighted average of the observations with the weights proportional to the inverses of the variances. I.e., the first observation has weight
\[ \frac{1/h_{ii}(i)}{1/h_{ii}(i) + 1} = \frac{1}{1 + h_{ii}(i)} = 1 - h_{ii}. \tag{24.2.13} \]
Since the sum of the weights must be 1, the weight of the second observation is $h_{ii}$.

Here is an alternative solution, using formula (19.0.6) for the BLUE, which reads here (note that $h_{ii}(i) = h_{ii}/(1 - h_{ii})$ by (24.2.8))
\begin{align*}
\hat y_i &= \left( \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \frac{h_{ii}}{1 - h_{ii}} & 0 \\ 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \frac{h_{ii}}{1 - h_{ii}} & 0 \\ 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} \hat y_i(i) \\ y_i \end{bmatrix} \\
&= h_{ii} \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} \frac{1 - h_{ii}}{h_{ii}} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \hat y_i(i) \\ y_i \end{bmatrix} = (1 - h_{ii}) \hat y_i(i) + h_{ii} y_i.
\end{align*}
Now subtract this last formula from $y_i$ to get $y_i - \hat y_i = (1 - h_{ii})(y_i - \hat y_i(i))$, which is (24.2.9).

24.3. Standardization

In this section we will show that the standardized predictive residual is what is sometimes called the “studentized” residual. It is recommended not to use the term “studentized residual” but say “standardized predictive residual” instead.

The standardization of the ordinary residuals has two steps: every $\hat\varepsilon_i$ is divided by its “relative” standard deviation $\sqrt{1 - h_{ii}}$, and then by $s$, an estimate of $\sigma$, the standard deviation of the true disturbances. In formulas,
\[ \text{the $i$th standardized ordinary residual} = \frac{\hat\varepsilon_i}{s \sqrt{1 - h_{ii}}}. \tag{24.3.1} \]
Standardization of the $i$th predictive residual has the same two steps: first divide the predictive residual (24.2.9) by the relative standard deviation, and then divide by $s(i)$. But a look at formula (24.2.9) shows that the ordinary and the predictive residual differ only by a nonrandom factor. Therefore the first step of the standardization yields exactly the same result whether one starts with an ordinary or a predictive residual. Standardized predictive residuals differ therefore from standardized ordinary residuals only in the second step:
\[ \text{the $i$th standardized predictive residual} = \frac{\hat\varepsilon_i}{s(i) \sqrt{1 - h_{ii}}}. \tag{24.3.2} \]
Note that equation (24.3.2) writes the standardized predictive residual as a function of the ordinary residual, not the predictive residual. The standardized predictive residual is sometimes called the “studentized” residual.

Problem 314. 3 points The $i$th predictive residual has the formula
\[ \hat\varepsilon_i(i) = \frac{1}{1 - h_{ii}} \hat\varepsilon_i \tag{24.3.3} \]
You do not have to prove this formula, but you are asked to derive the standard deviation of $\hat\varepsilon_i(i)$, and to derive from it a formula for the standardized $i$th predictive residual.

The similarity between the two formulas (24.3.1) and (24.3.2) has led to widespread confusion. Even [BKW80] seem to have been unaware of the significance of “studentization”; they do not work with the concept of predictive residuals at all.
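
Relations (24.2.9) and (24.3.2) can be checked numerically by actually running the $n$ leave-one-out regressions that the formulas make unnecessary. The sketch below is an illustration under assumptions, not part of the original notes: it uses R's built-in longley data with an illustrative set of regressors, and the names e_pred and s_i are chosen for this example only.

```r
# Illustrative regression on R's built-in Longley data (an assumed example).
fit <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Population + Year,
          data = longley)
n <- nrow(longley)
e <- resid(fit)       # ordinary residuals
h <- hatvalues(fit)   # leverages h_ii

# Brute force: drop observation i, refit, and predict the dropped observation.
e_pred <- numeric(n)  # predictive residuals as defined in (24.1.2)
s_i    <- numeric(n)  # s(i), the estimate of sigma without observation i
for (i in seq_len(n)) {
  fit_i     <- lm(formula(fit), data = longley[-i, ])
  e_pred[i] <- longley$Employed[i] - predict(fit_i, newdata = longley[i, ])
  s_i[i]    <- summary(fit_i)$sigma
}

# (24.2.9): the predictive residual is the ordinary residual over 1 - h_ii.
print(max(abs(e_pred - e / (1 - h))))

# (24.3.2): the standardized predictive residual, compared with rstudent().
print(max(abs(e / (s_i * sqrt(1 - h)) - rstudent(fit))))
```

Both printed discrepancies should be zero up to rounding error. The second comparison also illustrates that what R calls rstudent() is the standardized predictive residual of (24.3.2).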
The standardized predictive residuals have a $t$-distribution, because they are a normally distributed variable divided by the square root of an independent $\chi^2$ over its degrees of freedom. (But note that the joint distribution of all standardized predictive residuals is not a multivariate $t$.) Therefore one can use the quantiles of the $t$-distribution to judge, from the size of these residuals, whether one has an extreme observation or not.

Problem 315. Following [DM93, p. 34], we will use (23.0.3) and the other formulas regarding additional regressors to prove the following: If you add a dummy variable which has the value 1 for the $i$th observation and the value 0 for all other observations to your regression, then the coefficient estimate of this dummy is the $i$th predictive residual, and the coefficient estimate of the other parameters after inclusion of this dummy is equal to $\hat\beta(i)$. To fix notation (and without loss of generality), assume the $i$th observation is the last observation, i.e., $i = n$, and put the dummy variable first in the regression:
\[ \begin{bmatrix} y(n) \\ y_n \end{bmatrix} = \begin{bmatrix} o & X(n) \\ 1 & x_n^\top \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon(n) \\ \varepsilon_n \end{bmatrix} \qquad \text{or} \qquad y = \begin{bmatrix} e_n & X \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \varepsilon \tag{24.3.4} \]

• a. 2 points With the definition $X_1 = e_n = \begin{bmatrix} o \\ 1 \end{bmatrix}$, write $M_1 = I - X_1 (X_1^\top X_1)^{-1} X_1^\top$ as a $2 \times 2$ partitioned matrix.

Answer.
\[ M_1 = \begin{bmatrix} I & o \\ o^\top & 1 \end{bmatrix} - \begin{bmatrix} o \\ 1 \end{bmatrix} \begin{bmatrix} o^\top & 1 \end{bmatrix} = \begin{bmatrix} I & o \\ o^\top & 0 \end{bmatrix}; \qquad \begin{bmatrix} I & o \\ o^\top & 0 \end{bmatrix} \begin{bmatrix} z(i) \\ z_i \end{bmatrix} = \begin{bmatrix} z(i) \\ 0 \end{bmatrix} \tag{24.3.5} \]
i.e., $M_1$ simply annuls the last element.

• b. 2 points Either show mathematically, perhaps by evaluating $(X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y$, or give a good heuristic argument (as [DM93] do), that regressing $M_1 y$ on $M_1 X$ gives the same parameter estimate as regressing $y$ on $X$ with the $n$th observation dropped.

Answer. (23.0.2) reads here
\[ \begin{bmatrix} y(n) \\ 0 \end{bmatrix} = \begin{bmatrix} X(n) \\ o^\top \end{bmatrix} \hat\beta(i) + \begin{bmatrix} \hat\varepsilon(i) \\ 0 \end{bmatrix} \tag{24.3.6} \]
in other words, the estimate of $\beta$ is indeed $\hat\beta(i)$, and the first $n - 1$ elements of the residual are indeed the residuals one gets in the regression without the $i$th observation.
This is so ugly because the singularity shows here in the zeros of the last row; usually it does not show so much. But this way one also sees that it gives zero as the last residual, and this is what one needs to know!

To have a mathematical proof that the last row with zeros does not affect the estimate, evaluate (23.0.3)
\begin{align*}
\hat\beta_2 &= (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y \\
&= \left( \begin{bmatrix} X(n)^\top & x_n \end{bmatrix} \begin{bmatrix} I & o \\ o^\top & 0 \end{bmatrix} \begin{bmatrix} X(n) \\ x_n^\top \end{bmatrix} \right)^{-1} \begin{bmatrix} X(n)^\top & x_n \end{bmatrix} \begin{bmatrix} I & o \\ o^\top & 0 \end{bmatrix} \begin{bmatrix} y(n) \\ y_n \end{bmatrix} \\
&= (X(n)^\top X(n))^{-1} X(n)^\top y(n) = \hat\beta(n)
\end{align*}

• c. 2 points Use the fact that the residuals in the regression of $M_1 y$ on $M_1 X$ are the same as the residuals in the full regression (24.3.4) to show that $\hat\alpha$ is the $n$th predictive residual.

Answer. $\hat\alpha$ is obtained from that last row, which reads $y_n = \hat\alpha + x_n^\top \hat\beta(i)$, i.e., $\hat\alpha$ is the predictive residual.

• d. 2 points Use (23.0.3) with $X_1$ and $X_2$ interchanged to get a formula for $\hat\alpha$.

Answer. $\hat\alpha = (X_1^\top M X_1)^{-1} X_1^\top M y = \frac{1}{m_{nn}} \hat\varepsilon_n = \frac{1}{1 - h_{nn}} \hat\varepsilon_n$, here $M = I - X (X^\top X)^{-1} X^\top$.

• e. 2 points From (23.0.4) follows that also $\hat\beta_2 = (X_2^\top X_2)^{-1} X_2^\top (y - X_1 \hat\beta_1)$. Use this to prove
\[ \hat\beta - \hat\beta(i) = \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i \hat\varepsilon_i \tag{24.3.7} \]
which is [DM93, equation (1.40) on p. 33].

Answer. For this we also need to show that one gets the right $\hat\beta(i)$ if one regresses $y - e_n \hat\alpha$, or, in other words $y - e_n \hat\varepsilon_n(n)$, on $X$. In other words, $\hat\beta(n) = (X^\top X)^{-1} X^\top (y - e_n \hat\varepsilon_n(n))$, which is exactly (25.4.1).

CHAPTER 25

Regression Diagnostics

“Regression Diagnostics” can either concentrate on observations or on variables. Regarding observations, it looks for outliers or influential data in the dataset. Regarding variables, it checks whether there are highly collinear variables, or it keeps track of how much each variable contributes to the MSE of the regression. Collinearity is discussed in [DM93, 6.3] and [Gre97, 9.2].
Regression diagnostics needs five to ten times more computer resources than the regression itself, and often relies on graphics; therefore it has only recently become part of the standard procedures.

Problem 316. 1 point Define multicollinearity.

• a. 2 points What are the symptoms of multicollinearity?

• b. 2 points How can one detect multicollinearity?

• c. 2 points How can one remedy multicollinearity?

25.1. Missing Observations

First case: data on $y$ are missing. If you fill in the missing $y$ with a least squares predictor, this will not change the estimates, and although the computer will think the result is more efficient, it isn’t. What other schemes are there? Filling in the missing $y$ by the arithmetic mean of the observed $y$ does not give an unbiased estimator.

General conclusion: in a single-equation context, filling in missing $y$ is not a good idea.

Now missing values in the $X$-matrix. If there is only one regressor and a constant term, then the zero order filling in of $\bar x$ “results in no changes and is equivalent with dropping the incomplete data.”

The alternative, filling it with zeros and adding a dummy for the data with missing observation, amounts to exactly the same thing.

The only case where filling in missing data makes sense is: if you have multiple regression and you can predict the missing data in the $X$ matrix from the other data in the $X$ matrix.

25.2. Grouped Data

If single observations are replaced by arithmetic means of groups of observations, then the error variances vary with the size of the group. If one takes this into consideration, GLS still has good properties, although having the original data is of course more efficient.

25.3. Influential Observations and Outliers

The following discussion focuses on diagnostics regarding observations. To be more precise, we will investigate how each single observation affects the fit established by the other data.
(One may also ask how the addition of any two observations affects the fit, etc.)

25.3.1. The “Leverage”. The $i$th diagonal element $h_{ii}$ of the “hat matrix” is called the “leverage” of the $i$th observation. The leverage satisfies the following identity
\[ \hat y_i = (1 - h_{ii}) \hat y_i(i) + h_{ii} y_i \tag{25.3.1} \]
$h_{ii}$ is therefore the weight which $y_i$ has in the least squares estimate $\hat y_i$ of $\eta_i = x_i^\top \beta$, compared with all other observations, which contribute to $\hat y_i$ through $\hat y_i(i)$. The larger this weight, the more strongly this one observation will influence the estimate of $\eta_i$ (and if the estimate of $\eta_i$ is affected, then other parameter estimates may be affected too).

Problem 317. 3 points Explain the meanings of all the terms in equation (25.3.1) and use that equation to explain why $h_{ii}$ is called the “leverage” of the $i$th observation. Is every observation with high leverage also “influential” (in the sense that its removal would greatly change the regression estimates)?

Answer. $\hat y_i$ is the fitted value for the $i$th observation, i.e., it is the BLUE of $\eta_i$, of the expected value of the $i$th observation. It is a weighted average of two quantities: the actual observation $y_i$ (which has $\eta_i$ as expected value), and $\hat y_i(i)$, which is the BLUE of $\eta_i$ based on all the other observations except the $i$th. The weight of the $i$th observation in this weighted average is called the “leverage” of the $i$th observation. The sum of all leverages is always $k$, the number of parameters in the regression. If the leverage of one individual point is much greater than $k/n$, then this point has much more influence on its own fitted value than one should expect just based on the number of observations.

Leverage is not the same as influence; if an observation has high leverage, but by accident the observed value $y_i$ is very close to $\hat y_i(i)$, then removal of this observation will not change the regression results much. Leverage is potential influence.
Leverage does not depend on any of the observations of $y$; one only needs the $X$ matrix to compute it.

Those observations whose $x$-values are away from the other observations have “leverage” and can therefore potentially influence the regression results more than the others. $h_{ii}$ serves as a measure of this distance. Note that $h_{ii}$ only depends on the $X$ matrix, not on $y$, i.e., points may have a high leverage but not be influential, because the associated $y_i$ blends well into the fit established by the other data. However, regardless of the observed value of $y$, observations with high leverage always affect the covariance matrix of $\hat\beta$.
\[ h_{ii} = \frac{\det(X^\top X) - \det(X(i)^\top X(i))}{\det(X^\top X)}, \tag{25.3.2} \]
where $X(i)$ is the $X$-matrix without the $i$th observation.

Problem 318. Prove equation (25.3.2).

Answer. Since $X(i)^\top X(i) = X^\top X - x_i x_i^\top$, use theorem A.7.3 with $W = X^\top X$, $\alpha = -1$, and $d = x_i$.

Problem 319. Prove the following facts about the diagonal elements of the so-called “hat matrix” $H = X (X^\top X)^{-1} X^\top$, which has its name because $Hy = \hat y$, i.e., it puts the hat on $y$.

• a. 1 point $H$ is a projection matrix, i.e., it is symmetric and idempotent.

Answer. Symmetry follows from the laws for the transposes of products: $H^\top = (ABC)^\top = C^\top B^\top A^\top = H$ where $A = X$, $B = (X^\top X)^{-1}$ which is symmetric, and $C = X^\top$. Idempotency: $X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top$.

• b. 1 point Prove that a symmetric idempotent matrix is nonnegative definite.

Answer. If $H$ is symmetric and idempotent, then for arbitrary $g$, $g^\top H g = g^\top H^\top H g = \|Hg\|^2 \geq 0$. But $g^\top H g \geq 0$ for all $g$ is the criterion which makes $H$ nonnegative definite.

• c. 2 points Show that
\[ 0 \leq h_{ii} \leq 1 \tag{25.3.3} \]

Answer. If $e_i$ is the vector with a 1 on the $i$th place and zeros everywhere else, then $e_i^\top H e_i = h_{ii}$. From $H$ nonnegative definite follows therefore that $h_{ii} \geq 0$.
$h_{ii} \leq 1$ follows because $I - H$ is symmetric and idempotent (and therefore nonnegative definite) as well: it is the projection on the orthogonal complement.

• d. 2 points Show: the average value of the $h_{ii}$ is $\sum_i h_{ii}/n = k/n$, where $k$ is the number of columns of $X$. (Hint: for this you must compute the trace $\operatorname{tr} H$.)

Answer. The average can be written as
\[ \tfrac{1}{n} \operatorname{tr}(H) = \tfrac{1}{n} \operatorname{tr}(X (X^\top X)^{-1} X^\top) = \tfrac{1}{n} \operatorname{tr}(X^\top X (X^\top X)^{-1}) = \tfrac{1}{n} \operatorname{tr}(I_k) = \frac{k}{n}. \]
Here we used $\operatorname{tr} BC = \operatorname{tr} CB$ (Theorem A.1.2).

• e. 1 point Show that $\frac{1}{n}\iota\iota^\top$ is a projection matrix. Here $\iota$ is the $n$-vector of ones.

• f. 2 points Show: If the regression has a constant term, then $H - \frac{1}{n}\iota\iota^\top$ is a projection matrix.

Answer. If $\iota$, the vector of ones, is one of the columns of $X$ (or a linear combination of these columns), this means there is a vector $a$ with $\iota = Xa$. From this follows $H\iota\iota^\top = X (X^\top X)^{-1} X^\top X a \iota^\top = X a \iota^\top = \iota\iota^\top$. One can use this to show that $H - \frac{1}{n}\iota\iota^\top$ is idempotent:
\[ (H - \tfrac{1}{n}\iota\iota^\top)(H - \tfrac{1}{n}\iota\iota^\top) = HH - H \tfrac{1}{n}\iota\iota^\top - \tfrac{1}{n}\iota\iota^\top H + \tfrac{1}{n}\iota\iota^\top \tfrac{1}{n}\iota\iota^\top = H - \tfrac{1}{n}\iota\iota^\top - \tfrac{1}{n}\iota\iota^\top + \tfrac{1}{n}\iota\iota^\top = H - \tfrac{1}{n}\iota\iota^\top. \]

• g. 1 point Show: If the regression has a constant term, then one can sharpen inequality (25.3.3) to $1/n \leq h_{ii} \leq 1$.

Answer. $H - \iota\iota^\top/n$ is a projection matrix, therefore nonnegative definite, therefore its diagonal elements $h_{ii} - 1/n$ are nonnegative.

• h. 3 points Why is $h_{ii}$ called the “leverage” of the $i$th observation? To get full points, you must give a really good verbal explanation.

Answer. Use equation (24.2.12). Effect on any other linear combination of $\hat\beta$ is less than the effect on $\hat y_i$. Distinguish from influence. Leverage depends only on $X$ matrix, not on $y$.

$h_{ii}$ is closely related to the test statistic testing whether the $x_i$ comes from the same multivariate normal distribution as the other rows of the $X$-matrix. Belsley, Kuh, and Welsch [BKW80, p. 17] say those observations $i$ with $h_{ii} > 2k/n$, i.e., more than twice the average, should be considered as “leverage points” which might deserve some attention.

25.4. Sensitivity of Estimates to Omission of One Observation

The most straightforward approach to sensitivity analysis is to see how the estimates of the parameters of interest are affected if one leaves out the $i$th observation. In the case of linear regression, it is not necessary for this to run $n$ different regressions, but one can derive simple formulas for the changes in the parameters of interest. Interestingly, the various sensitivity measures to be discussed below only depend on the two quantities $h_{ii}$ and $\hat\varepsilon_i$.

25.4.1. Changes in the Least Squares Estimate. Define $\hat\beta(i)$ to be the OLS estimate computed without the $i$th observation, and $\hat\varepsilon_i(i) = \frac{1}{1 - h_{ii}} \hat\varepsilon_i$ the $i$th predictive residual. Then
\[ \hat\beta - \hat\beta(i) = (X^\top X)^{-1} x_i \hat\varepsilon_i(i) \tag{25.4.1} \]

Problem 320. Show (25.4.1) by methods very similar to the proof of (24.2.9).

Answer. Here is this brute-force proof, I think from [BKW80]: Let $y(i)$ be the $y$ vector with the $i$th observation deleted. As shown in Problem 308, $X(i)^\top y(i) = X^\top y - x_i y_i$. Therefore by (24.2.6)
\begin{align*}
\hat\beta(i) &= (X(i)^\top X(i))^{-1} X(i)^\top y(i) = \Bigl( (X^\top X)^{-1} + \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i x_i^\top (X^\top X)^{-1} \Bigr) (X^\top y - x_i y_i) \\
&= \hat\beta - (X^\top X)^{-1} x_i y_i + \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i x_i^\top \hat\beta - \frac{h_{ii}}{1 - h_{ii}} (X^\top X)^{-1} x_i y_i \\
&= \hat\beta - \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i y_i + \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i x_i^\top \hat\beta = \hat\beta - \frac{1}{1 - h_{ii}} (X^\top X)^{-1} x_i \hat\varepsilon_i
\end{align*}

To understand (25.4.1), note the following fact which is interesting in its own right: $\hat\beta(i)$, which is defined as the OLS estimator if one drops the $i$th observation, can also be obtained as the OLS estimator if one replaces the $i$th observation by the prediction of the $i$th observation on the basis of all other observations, i.e., by $\hat y_i(i)$. Writing $y((i))$ for the vector $y$ whose $i$th observation has been replaced in this way, one obtains
\[ \hat\beta = (X^\top X)^{-1} X^\top y; \qquad \hat\beta(i) = (X^\top X)^{-1} X^\top y((i)). \tag{25.4.2} \]
Since $y - y((i)) = e_i \hat\varepsilon_i(i)$ and $x_i = X^\top e_i$, (25.4.1) follows.

The quantities $h_{ii}$, $\hat\beta(i) - \hat\beta$, and $s^2(i)$ are computed by the R-function lm.influence.
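
As a hedged illustration of (25.4.1), the following R sketch compares the formula with the result of refitting the regression $n$ times. The model on R's built-in longley data and the names X_plus, delta_formula, and delta_refit are assumptions made for this example. Because these data are highly collinear, $(X^\top X)^{-1} x_i$ is taken from the QR decomposition stored in the fit instead of inverting $X^\top X$ explicitly; this is a numerical convenience chosen here, not a step taken from the notes.

```r
# Illustrative regression on R's built-in Longley data (an assumed example).
fit  <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Population + Year,
           data = longley)
n    <- nrow(longley)
infl <- lm.influence(fit)
h    <- infl$hat                 # leverages h_ii
s_i  <- infl$sigma               # s(i), sigma estimated without observation i
e_pred <- resid(fit) / (1 - h)   # predictive residuals

# Column i of X_plus is (X'X)^{-1} x_i, computed from the stored QR
# decomposition rather than from an explicit inverse of X'X.
X_plus <- qr.coef(fit$qr, diag(n))

# Right-hand side of (25.4.1): row i holds (X'X)^{-1} x_i times e_pred[i].
delta_formula <- t(X_plus) * e_pred

# Left-hand side by brute force: beta-hat minus beta-hat(i) from n refits.
delta_refit <- t(sapply(seq_len(n), function(i)
  coef(fit) - coef(lm(formula(fit), data = longley[-i, ]))))

# Largest discrepancy relative to the largest change in any coefficient.
print(max(abs(delta_formula - delta_refit)) / max(abs(delta_refit)))
```

The printed ratio should be small; it will not be exactly zero because both sides are affected by rounding error in this ill-conditioned dataset.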
Compare [CH93, pp. 129–131].

25.4.2. Scaled Measures of Sensitivity. In order to assess the sensitivity of the estimate of any linear combination of the elements of $\beta$, $\phi = t^\top \beta$, it makes sense to divide the change in $t^\top \hat\beta$ due to omission of the $i$th observation by the standard deviation of $t^\top \hat\beta$, i.e., to look at
\[ \frac{t^\top (\hat\beta - \hat\beta(i))}{\sigma \sqrt{t^\top (X^\top X)^{-1} t}}. \tag{25.4.3} \]
Such a standardization makes it possible to compare the sensitivity of different linear combinations, and to ask: Which linear combination of the elements of $\hat\beta$ is affected most if one drops the $i$th observation? Interestingly and, in hindsight, perhaps not surprisingly, the linear combination which is most sensitive to the addition of the $i$th observation, is $t = x_i$.

For a mathematical proof we need the following inequality, which is nothing but the Cauchy-Schwarz inequality in disguise:

Theorem 25.4.1. If $\Omega$ is positive definite symmetric, then
\[ \max_g \frac{(g^\top x)^2}{g^\top \Omega g} = x^\top \Omega^{-1} x. \tag{25.4.4} \]
If the denominator in the fraction on the lefthand side is zero, then $g = o$ and therefore the numerator is necessarily zero as well. In this case, the fraction itself should be considered zero.

Proof: As in the derivation of the BLUE with nonspherical covariance matrix, pick a nonsingular $Q$ with $\Omega = QQ^\top$, and define $P = Q^{-1}$. Then it follows $P \Omega P^\top = I$. Define $y = Px$ and $h = Q^\top g$. Then $h^\top y = g^\top x$, $h^\top h = g^\top \Omega g$, and $y^\top y = x^\top \Omega^{-1} x$. Therefore (25.4.4) follows from the Cauchy-Schwarz inequality $(h^\top y)^2 \leq (h^\top h)(y^\top y)$.

Using Theorem 25.4.1 and equation (25.4.1) one obtains
\begin{align*}
\max_t \frac{\bigl(t^\top (\hat\beta - \hat\beta(i))\bigr)^2}{\sigma^2 \, t^\top (X^\top X)^{-1} t} &= \frac{1}{\sigma^2} (\hat\beta - \hat\beta(i))^\top X^\top X (\hat\beta - \hat\beta(i)) \\
&= \frac{1}{\sigma^2} x_i^\top (X^\top X)^{-1} X^\top X (X^\top X)^{-1} x_i \, \hat\varepsilon_i^2(i) = \frac{h_{ii}}{\sigma^2} \hat\varepsilon_i^2(i) \tag{25.4.5}
\end{align*}
Now we will show that the linear combination which attains this maximum, i.e., which is most sensitive to the addition of the $i$th observation, is $t = x_i$.
If one premultiplies (25.4.1) by $x_i^\top$ one obtains
\[ \hat y_i - \hat y_i(i) = x_i^\top\hat\beta - x_i^\top\hat\beta(i) = h_{ii}\,\hat\varepsilon_i(i) = \frac{h_{ii}}{1-h_{ii}}\,\hat\varepsilon_i \tag{25.4.6} \]
If one divides (25.4.6) by the standard deviation of $\hat y_i$, i.e., if one applies the construction (25.4.3), one obtains
\[ \frac{\hat y_i - \hat y_i(i)}{\sigma\sqrt{h_{ii}}} = \frac{\sqrt{h_{ii}}}{\sigma}\,\hat\varepsilon_i(i) = \frac{\sqrt{h_{ii}}}{\sigma(1-h_{ii})}\,\hat\varepsilon_i \tag{25.4.7} \]
If $\hat y_i$ changes only little (compared with the standard deviation of $\hat y_i$) if the $i$th observation is removed, then no other linear combination of the elements of $\hat\beta$ will be affected much by the omission of this observation either.

The righthand side of (25.4.7), with $\sigma$ estimated by $s(i)$, is called by [BKW80] and many others DFFITS (which stands for DiFference in FIT, Standardized). If one takes its square, divides it by $k$, and estimates $\sigma^2$ by $s^2$ (which is more consistent than using $s^2(i)$, since one standardizes by the standard deviation of $t^\top\hat\beta$ and not by that of $t^\top\hat\beta(i)$), one obtains Cook's distance [Coo77]. (25.4.5) gives an equation for Cook's distance in terms of $\hat\beta - \hat\beta(i)$:
\[ \text{Cook's distance} = \frac{(\hat\beta - \hat\beta(i))^\top X^\top X (\hat\beta - \hat\beta(i))}{k s^2} = \frac{h_{ii}}{k s^2}\,\hat\varepsilon_i^2(i) = \frac{h_{ii}}{k s^2 (1-h_{ii})^2}\,\hat\varepsilon_i^2 \tag{25.4.8} \]

Problem 321. Can you think of a situation in which an observation has a small residual but a large “influence” as measured by Cook’s distance?

Answer. Assume “all observations are clustered near each other while the solitary odd observation lies a way out” as Kmenta wrote in [Kme86, p. 426]. If the observation happens to lie on the regression line, then it can be discovered by its influence on the variance-covariance matrix (25.3.2), i.e., in this case only the $h_{ii}$ count.

Problem 322. The following is the example given in [Coo77]. In R, the command data(longley) makes the data frame longley available, which has the famous Longley-data, a standard example for a highly multicollinear dataset. These data are also available on the web at www.econ.utah.edu/ehrbar/data/longley.txt.
attach(longley) makes the individual variables available as R-objects.

• a. 3 points Look at the data in a scatterplot matrix and explain what you see. Later we will see that one of the observations is much more influential in the regression than the rest. Can you see from the scatterplot matrix which observation that might be?

Answer. In linux, you first have to give the command x11() in order to make the graphics window available. In windows, this is not necessary. It is important to display the data in a reasonable order, therefore instead of pairs(longley) you should do something like attach(longley) and then pairs(cbind(Year, Population, Employed, Unemployed, Armed.Forces, GNP, GNP.deflator)). Put Year first, so that all variables are plotted against Year on the horizontal axis. Population vs. Year is a very smooth line. Population vs. GNP is also quite smooth. You see the huge increase in the armed forces in 1951 due to the Korean War, which led to a (temporary) drop in unemployment and a (not so temporary) jump in the GNP deflator. Otherwise the unemployed show the stop-and-go scenario of the fifties. Unemployed is not correlated with anything. One should expect a strong negative correlation between Employed and Unemployed, but this is not the case.

• b. 4 points Run a regression of the model Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + Year and discuss the result.

Answer. To fit a regression run longley.fit <- lm(Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + Year). You can see the regression results by typing summary(longley.fit). Armed forces and unemployed are significant and have negative sign, as expected. GNP and Population are insignificant and have negative sign too; this is not expected. GNP, Population and Year are highly collinear.

• c. 3 points Make plots of the ordinary residuals and the standardized residuals against time. How do they differ?
In R, the commands are plot(Year, residuals(longley.fit), type="h", ylab="Ordinary Residuals in Longley Regression"). In order to get the next plot in a different graphics window, so that you can compare them, do now either x11() in linux or windows() in windows, and then plot(Year, rstandard(longley.fit), type="h", ylab="Standardized Residuals in Longley Regression"). Answer. You see that the standardized residuals at the edge of the dataset are bigger than the ordinary residuals. The datapoints at the edge are better able to attract the regression plane than those in the middle, therefore the ordinary residuals are “too small.” Standardization corrects for this. • d. 4 points Make plots of the predictive residuals. Apparently there is no special command in R to do this, therefore you should use formula (24.2.9). Also plot the standardized predictive residuals, and compare them. Answer. The predictive residuals are plot(Year, residuals(longley.fit)/(1-hatvalues(longley.fit)), type="h", ylab="Predictive Residuals in Longley Regression"). The standardized predictive residuals are often called studentized residuals, plot(Year, rstudent(longley.fit), type="h", ylab="Standardized predictive Residuals in Longley Regression"). A comparison shows an opposite effect as with the ordinary residuals: the predictive residuals at the edge of the dataset are too large, and standardization corrects this. Specific results: standardized predictive residual in 1950 smaller than that in 1962, but predictive residual in 1950 is very close to 1962. standardized predictive residual in 1951 smaller than that in 1956, but predictive residual in 1951 is larger than in 1956. Largest predictive residual is 1951, but largest standardized predictive residual is 1956. • e. 3 points Make a plot of the leverage, i.e., the hii -values, using plot(Year, hatvalues(longley.fit), type="h", ylab="Leverage in Longley Regression"), and explain what leverage means. • f . 
3 points One observation is much more influential than the others; which is it? First look at the plots for the residuals, then look also at the plot for leverage, and try to guess which is the most influential observation. Then do it the right way. Can you give reasons based on your prior knowledge about the time period involved why an observation in that year might be influential?

Answer. The “right” way is to use Cook’s distance: plot(Year, cooks.distance(longley.fit), type="h", ylab="Cook’s Distance in Longley Regression"). One sees that 1951 towers above all others. It does not have highest leverage, but it has second-highest, and a bigger residual than the point with the highest leverage. 1951 has the largest distance of .61. The second largest is the last observation in the dataset, 1962, with a distance of .47, and the others have .24 or less. Cook says: removal of the 1951 point will move the least squares estimate to the edge of a 35% confidence region around $\hat\beta$. This point is probably so influential because 1951 was the first full year of the Korean war. One would not be able to detect this point from the ordinary residuals, standardized or not! The predictive residuals are a little better; their maximum is at 1951, but several other residuals are almost as large. 1951 is so influential because it has an extremely high hat-value, and one of the highest values for the ordinary residuals!

At the end don’t forget to detach(longley) if you have attached it before.

25.4.3. Changes in the Sum of Squared Errors. For the computation of $s^2(i)$ from the regression results one can take advantage of the following simple relationship between the SSE for the regression with and without the $i$th observation:
\[ SSE - SSE(i) = \frac{\hat\varepsilon_i^2}{1-h_{ii}} \tag{25.4.9} \]

Problem 323. Use (25.4.9) to derive the following formula for $s^2(i)$:
\[ s^2(i) = \frac{1}{n-k-1}\Bigl( (n-k)s^2 - \frac{\hat\varepsilon_i^2}{1-h_{ii}} \Bigr) \tag{25.4.10} \]

Answer.
This merely involves re-writing $SSE$ and $SSE(i)$ in terms of $s^2$ and $s^2(i)$.
\[ s^2(i) = \frac{SSE(i)}{n-1-k} = \frac{1}{n-k-1}\Bigl( SSE - \frac{\hat\varepsilon_i^2}{1-h_{ii}} \Bigr) \tag{25.4.11} \]

Proof of equation (25.4.9):
\begin{align*}
SSE(i) &= \sum_{j\colon j\ne i} \bigl(y_j - x_j^\top\hat\beta(i)\bigr)^2 = \sum_{j\colon j\ne i} \bigl(y_j - x_j^\top\hat\beta - x_j^\top(\hat\beta(i) - \hat\beta)\bigr)^2 \\
&= \sum_{j\colon j\ne i} \Bigl(\hat\varepsilon_j + \frac{h_{ji}}{1-h_{ii}}\hat\varepsilon_i\Bigr)^2 = \sum_{j} \Bigl(\hat\varepsilon_j + \frac{h_{ji}}{1-h_{ii}}\hat\varepsilon_i\Bigr)^2 - \Bigl(\frac{\hat\varepsilon_i}{1-h_{ii}}\Bigr)^2 \\
&= \sum_j \hat\varepsilon_j^2 + \frac{2\hat\varepsilon_i}{1-h_{ii}} \sum_j h_{ij}\hat\varepsilon_j + \Bigl(\frac{\hat\varepsilon_i}{1-h_{ii}}\Bigr)^2 \sum_j h_{ji}^2 - \Bigl(\frac{\hat\varepsilon_i}{1-h_{ii}}\Bigr)^2
\end{align*}
In the last line the first term is $SSE$. The second term is zero because $H\hat\varepsilon = o$. Furthermore, $h_{ii} = \sum_j h_{ji}^2$ because $H$ is symmetric and idempotent, therefore the sum of the last two items is $-\hat\varepsilon_i^2/(1-h_{ii})$.

Note that every single relationship we have derived so far is a function of $\hat\varepsilon_i$ and $h_{ii}$.

Problem 324. 3 points What are the main concepts used in modern “Regression Diagnostics”? Can it be characterized to be a careful look at the residuals, or does it have elements which cannot be inferred from the residuals alone?

Answer. Leverage (sometimes it is called “potential”) is something which cannot be inferred from the residuals, it does not depend on $y$ at all.

Problem 325. An observation in a linear regression model is “influential” if its omission causes large changes to the regression results. Discuss how you would ascertain in practice whether a given observation is influential or not.

• a. What is meant by leverage? Does high leverage necessarily imply that an observation is influential?

Answer. Leverage is potential influence. It only depends on $X$, not on $y$. It is the distance of the observation from the center of gravity of all observations. Whether this is actual influence depends on the $y$-values.

• b. How are the concepts of leverage and influence affected by sample size?

• c. What steps would you take when alerted to the presence of an influential observation?

Answer. Make sure you know whether the results you rely on are affected if that influential observation is dropped.
Try to find out why this observation is influential (e.g. in the Longley data the observations in the year when the Korean War started are influential).

• d. What is a “predictive residual” and how does it differ from an ordinary residual?

• e. Discuss situations in which one would want to deal with the “predictive” residuals rather than the ordinary residuals, and situations in which one would want residuals standardized versus situations in which it would be preferable to have the unstandardized residuals.

Problem 326. 6 points Describe what you would do to ascertain that a regression you ran is correctly specified.

Answer. Economic theory behind that regression, size and sign of coefficients, plot residuals versus predicted values, time, and every independent variable, run all tests: $F$-test, $t$-tests, $R^2$, DW, portmanteau test, forecasting, multicollinearity, influence statistics, overfitting to see if other variables are significant, try to defeat the result by using alternative variables, divide time period into subperiods in order to see if parameters are constant over time, pre-test specification assumptions.

CHAPTER 26

Asymptotic Properties of the OLS Estimator

A much more detailed treatment of the contents of this chapter can be found in [DM93, Chapters 4 and 5].

Here we are concerned with the consistency of the OLS estimator for large samples. In other words, we assume that our regression model can be extended to encompass an arbitrary number of observations. First we assume that the regressors are nonstochastic, and we will make the following assumption:
\[ Q = \lim_{n\to\infty} \frac{1}{n} X^\top X \text{ exists and is nonsingular.} \tag{26.0.12} \]

Two examples where this is not the case. Look at the model $y_t = \alpha + \beta t + \varepsilon_t$. Here
\[ X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & n \end{bmatrix}. \quad\text{Therefore}\quad X^\top X = \begin{bmatrix} 1+1+1+\cdots+1 & 1+2+3+\cdots+n \\ 1+2+3+\cdots+n & 1+4+9+\cdots+n^2 \end{bmatrix} = \begin{bmatrix} n & n(n+1)/2 \\ n(n+1)/2 & n(n+1)(2n+1)/6 \end{bmatrix}, \]
and $\frac{1}{n} X^\top X \to \begin{bmatrix} 1 & \infty \\ \infty & \infty \end{bmatrix}$. Here the assumption (26.0.12) does not hold, but one can still prove consistency and asymptotic normality; the estimators converge even faster than in the usual case.

The other example is the model $y_t = \alpha + \beta\lambda^t + \varepsilon_t$ with a known $\lambda$ with $-1 < \lambda < 1$. Here
\[ X^\top X = \begin{bmatrix} 1+1+\cdots+1 & \lambda+\lambda^2+\cdots+\lambda^n \\ \lambda+\lambda^2+\cdots+\lambda^n & \lambda^2+\lambda^4+\cdots+\lambda^{2n} \end{bmatrix} = \begin{bmatrix} n & (\lambda-\lambda^{n+1})/(1-\lambda) \\ (\lambda-\lambda^{n+1})/(1-\lambda) & (\lambda^2-\lambda^{2n+2})/(1-\lambda^2) \end{bmatrix}. \]
Therefore $\frac{1}{n} X^\top X \to \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, which is singular. In this case, a consistent estimate of $\beta$ does not exist: future observations depend on $\beta$ so little that even with infinitely many observations there is not enough information to get the precise value of $\beta$.

We will show that under assumption (26.0.12), $\hat\beta$ and $s^2$ are consistent. However this assumption is really too strong for consistency. A weaker set of assumptions is the Grenander conditions, see [Gre97, p. 275]. To write down the Grenander conditions, remember that presently $X$ depends on $n$ (in that we only look at the first $n$ elements of $y$ and first $n$ rows of $X$), therefore the column vectors $x_j$ also depend on $n$ (although we are not indicating this here). Therefore $x_j^\top x_j$ depends on $n$ as well, and we will make this dependency explicit by writing $x_j^\top x_j = d_{nj}^2$. Then the first Grenander condition is $\lim_{n\to\infty} d_{nj}^2 = +\infty$ for all $j$. Second: for all $i$ and $k$, $\lim_{n\to\infty} \max_{i=1\cdots n} x_{ij}/d_{nj}^2 = 0$ (here is a typo in Greene, he leaves the max out). Third: Sample correlation matrix of the columns of $X$ minus the constant term converges to a nonsingular matrix.

Consistency means that the estimates converge in probability towards the true value. For $\hat\beta$ this can be written as $\operatorname{plim}_{n\to\infty} \hat\beta_n = \beta$. This means by definition that for all $\varepsilon > 0$ follows $\lim_{n\to\infty} \Pr[|\hat\beta_n - \beta| \le \varepsilon] = 1$.

The probability limit is one of several concepts of limits used in probability theory.
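The definition of consistency just given can be illustrated by simulation. The following is a minimal sketch under assumptions made up for this illustration (true coefficients 1 and 2, a deterministic periodic regressor so that $\frac{1}{n} X^\top X$ converges to a nonsingular limit, and skewed non-normal disturbances with mean zero and variance one). It estimates $\Pr[|\hat\beta_n - \beta| \le \delta]$ for the slope by Monte Carlo for three sample sizes.

```r
# Sketch: Monte Carlo estimate of Pr[|b_n - beta| <= delta] for growing n.
set.seed(1)                        # arbitrary seed, for reproducibility only
beta  <- c(1, 2)                   # hypothetical true intercept and slope
delta <- 0.1                       # plays the role of epsilon in the definition

hit_rate <- function(n, reps = 2000) {
  x <- rep_len(seq(0, 1, by = 0.1), n)    # nonstochastic, periodic regressor
  hits <- replicate(reps, {
    eps <- rexp(n) - 1                    # skewed disturbances, mean 0, variance 1
    y   <- beta[1] + beta[2] * x + eps
    b   <- cov(x, y) / var(x)             # OLS slope in a simple regression
    abs(b - beta[2]) <= delta
  })
  mean(hits)                              # share of replications within delta
}

rates <- sapply(c(50, 500, 5000), hit_rate)
print(rates)
```

The estimated probabilities should rise towards one as the sample size grows, although the disturbances are not normal. A simulation of this kind illustrates the definition; it proves nothing.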
We will need the following properties of the plim here:

(1) For nonrandom magnitudes, the probability limit is equal to the ordinary limit.

(2) It satisfies the Slutsky theorem, that for a continuous function $g$,
\[ \operatorname{plim} g(z) = g(\operatorname{plim}(z)). \tag{26.0.13} \]

(3) If the MSE-matrix of an estimator converges towards the null matrix, then the estimator is consistent.

(4) Khinchine’s theorem: the sample mean of an i.i.d. distribution is a consistent estimate of the population mean, even if the distribution does not have a population variance.

26.1. Consistency of the OLS estimator

For the proof of consistency of the OLS estimators $\hat\beta$ and of $s^2$ we need the following result:
\[ \operatorname{plim} \frac{1}{n} X^\top \varepsilon = o. \tag{26.1.1} \]
I.e., the true $\varepsilon$ is asymptotically orthogonal to all columns of $X$. This follows immediately from $MSE[o; X^\top\varepsilon/n] = E[X^\top\varepsilon\varepsilon^\top X/n^2] = \sigma^2 X^\top X/n^2$, which converges towards $O$.

In order to prove consistency of $\hat\beta$ and $s^2$, transform the formulas for $\hat\beta$ and $s^2$ in such a way that they are written as continuous functions of terms each of which converges for $n \to \infty$, and then apply Slutsky’s theorem. Write $\hat\beta$ as
\begin{align}
\hat\beta &= \beta + (X^\top X)^{-1} X^\top \varepsilon = \beta + \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \frac{X^\top \varepsilon}{n} \tag{26.1.2} \\
\operatorname{plim} \hat\beta &= \beta + \lim \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \operatorname{plim} \frac{X^\top \varepsilon}{n} \tag{26.1.3} \\
&= \beta + Q^{-1} o = \beta. \tag{26.1.4}
\end{align}

Let’s look at the geometry of this when there is only one explanatory variable. The specification is therefore $y = x\beta + \varepsilon$. The assumption is that $\varepsilon$ is asymptotically orthogonal to $x$. In small samples, it only happens by sheer accident with probability 0 that $\varepsilon$ is orthogonal to $x$. Only $\hat\varepsilon$ is. But now let’s assume the sample grows larger, i.e., the vectors $y$ and $x$ become very high-dimensional observation vectors, i.e. we are drawing here a two-dimensional subspace out of a very high-dimensional space. As more and more data are added, the observation vectors also become longer and longer. But if we divide each vector by $\sqrt{n}$, then the lengths of these normalized vectors stabilize. The squared length of the vector $\varepsilon/\sqrt{n}$ has the plim of $\sigma^2$.
Furthermore, assumption (26.0.12) means in our case that $\operatorname{plim}_{n\to\infty} \frac{1}{n} x^\top x$ exists and is nonsingular. This is the squared length of $\frac{1}{\sqrt{n}} x$. I.e., if we normalize the vectors by dividing them by $\sqrt{n}$, then they do not get longer but converge towards a finite length. And the result (26.1.1) $\operatorname{plim} \frac{1}{n} x^\top\varepsilon = 0$ means now that with this normalization, $\varepsilon/\sqrt{n}$ becomes more and more orthogonal to $x/\sqrt{n}$. I.e., if $n$ is large enough, asymptotically, not only $\hat\varepsilon$ but also the true $\varepsilon$ is orthogonal to $x$, and this means that asymptotically $\hat\beta$ converges towards the true $\beta$.

For the proof of consistency of $s^2$ we need, among others, that $\operatorname{plim} \frac{\varepsilon^\top\varepsilon}{n} = \sigma^2$, which is a consequence of Khinchine’s theorem. Since $\hat\varepsilon^\top\hat\varepsilon = \varepsilon^\top M \varepsilon$ it follows
\begin{align*}
\frac{\hat\varepsilon^\top\hat\varepsilon}{n-k} &= \frac{n}{n-k}\, \varepsilon^\top \Bigl( \frac{I}{n} - \frac{X}{n} \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \frac{X^\top}{n} \Bigr) \varepsilon \\
&= \frac{n}{n-k} \Bigl( \frac{\varepsilon^\top\varepsilon}{n} - \frac{\varepsilon^\top X}{n} \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \frac{X^\top\varepsilon}{n} \Bigr) \to 1 \cdot \bigl( \sigma^2 - o^\top Q^{-1} o \bigr).
\end{align*}

26.2. Asymptotic Normality of the Least Squares Estimator

To show asymptotic normality of an estimator, multiply the sampling error by $\sqrt{n}$, so that the variance is stabilized.

We have seen $\operatorname{plim} \frac{1}{n} X^\top\varepsilon = o$. Now look at $\frac{1}{\sqrt{n}} X^\top\varepsilon_n$. Its mean is $o$ and its covariance matrix $\sigma^2 \frac{X^\top X}{n}$. Shape of distribution, due to a variant of the Central Limit Theorem, is asymptotically normal: $\frac{1}{\sqrt{n}} X^\top\varepsilon_n \to N(o, \sigma^2 Q)$. (Here the convergence is convergence in distribution.)

We can write $\sqrt{n}(\hat\beta_n - \beta) = \bigl(\frac{X^\top X}{n}\bigr)^{-1} \bigl(\frac{1}{\sqrt{n}} X^\top\varepsilon_n\bigr)$. Therefore its limiting covariance matrix is $Q^{-1}\sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}$. Therefore $\sqrt{n}(\hat\beta_n - \beta) \to N(o, \sigma^2 Q^{-1})$ in distribution. One can also say: the asymptotic distribution of $\hat\beta$ is $N(\beta, \sigma^2 (X^\top X)^{-1})$.

From this follows $\sqrt{n}(R\hat\beta_n - R\beta) \to N(o, \sigma^2 R Q^{-1} R^\top)$, and therefore
\[ n (R\hat\beta_n - R\beta)^\top \bigl(R Q^{-1} R^\top\bigr)^{-1} (R\hat\beta_n - R\beta) \to \sigma^2 \chi^2_i. \tag{26.2.1} \]
Divide by $s^2$ and replace in the limiting case $Q$ by $X^\top X/n$ and $s^2$ by $\sigma^2$ to get
\[ \frac{(R\hat\beta_n - R\beta)^\top \bigl(R (X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta_n - R\beta)}{s^2} \to \chi^2_i \tag{26.2.2} \]
in distribution. All this is not a proof; the point is that in the denominator, the distribution is divided by the increasingly bigger number $n - k$, while in the numerator, it is divided by the constant $i$; therefore asymptotically the denominator can be considered 1.

The central limit theorems only say that for $n \to \infty$ these converge towards the $\chi^2$, which is asymptotically equal to the $F$ distribution. It is easily possible that before one gets to the limit, the $F$-distribution is better.

Problem 327. Are the residuals $y - X\hat\beta$ asymptotically normally distributed?

Answer. Only if the disturbances are normal, otherwise of course not! We can show that $\sqrt{n}(\varepsilon - \hat\varepsilon) = \sqrt{n} X(\hat\beta - \beta) \sim N(o, \sigma^2 X Q X^\top)$.

Now these results also go through if one has stochastic regressors. [Gre97, 6.7.7] shows that the above condition (26.0.12) with the lim replaced by plim holds if $x_i$ and $\varepsilon_i$ are an i.i.d. sequence of random variables.

Problem 328. 2 points In the regression model with random regressors $y = X\beta + \varepsilon$, you only know that $\operatorname{plim} \frac{1}{n} X^\top X = Q$ is a nonsingular matrix, and $\operatorname{plim} \frac{1}{n} X^\top\varepsilon = o$. Using these two conditions, show that the OLS estimate is consistent.

Answer. $\hat\beta = (X^\top X)^{-1} X^\top y = \beta + (X^\top X)^{-1} X^\top\varepsilon$ due to (18.0.7), and $\operatorname{plim} (X^\top X)^{-1} X^\top\varepsilon = \operatorname{plim} \bigl( \bigl(\frac{X^\top X}{n}\bigr)^{-1} \frac{X^\top\varepsilon}{n} \bigr) = Q^{-1} o = o$.

CHAPTER 27

Least Squares as the Normal Maximum Likelihood Estimate

Now assume $\varepsilon$ is multivariate normal. We will show that in this case the OLS estimator $\hat\beta$ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of $y$. First look at one $y_t$ which is $y_t \sim N(x_t^\top\beta, \sigma^2)$, where $X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}$, i.e., $x_t$ is the $t$th row of $X$. It is written as a column vector, since we follow the “column vector convention.” The (marginal) density function for this one observation is
\[ f_{y_t}(y_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_t - x_t^\top\beta)^2/2\sigma^2}. \tag{27.0.3} \]
Since the $y_i$ are stochastically independent, their joint density function is the product, which can be written as
\[ f_y(y) = (2\pi\sigma^2)^{-n/2} \exp\Bigl( -\frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta) \Bigr). \tag{27.0.4} \]
To compute the maximum likelihood estimator, it is advantageous to start with the log likelihood function:
\[ \log f_y(y; \beta, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta). \tag{27.0.5} \]
Assume for a moment that $\sigma^2$ is known. Then the MLE of $\beta$ is clearly equal to the OLS $\hat\beta$. Since $\hat\beta$ does not depend on $\sigma^2$, it is also the maximum likelihood estimate when $\sigma^2$ is unknown. $\hat\beta$ is a linear function of $y$. Linear transformations of normal variables are normal. Normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of $\beta$ is therefore $\hat\beta \sim N(\beta, \sigma^2 (X^\top X)^{-1})$.

If we replace $\beta$ in the log likelihood function (27.0.5) by $\hat\beta$, we get what is called the log likelihood function with $\beta$ “concentrated out.”
\[ \log f_y(y; \beta = \hat\beta, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - X\hat\beta)^\top (y - X\hat\beta). \tag{27.0.6} \]
One gets the maximum likelihood estimate of $\sigma^2$ by maximizing this “concentrated” log likelihood function. Taking the derivative with respect to $\sigma^2$ (consider $\sigma^2$ the name of a variable, not the square of another variable), one gets
\[ \frac{\partial}{\partial\sigma^2} \log f_y(y; \hat\beta) = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^4} (y - X\hat\beta)^\top (y - X\hat\beta) \tag{27.0.7} \]
Setting this zero gives
\[ \tilde\sigma^2 = \frac{(y - X\hat\beta)^\top (y - X\hat\beta)}{n} = \frac{\hat\varepsilon^\top\hat\varepsilon}{n}. \tag{27.0.8} \]
This is a scalar multiple of the unbiased estimate $s^2 = \hat\varepsilon^\top\hat\varepsilon/(n - k)$ which we had earlier.

Let’s look at the distribution of $s^2$ (from which that of its scalar multiples follows easily). It is a quadratic form in a normal variable. Such quadratic forms very often have $\chi^2$ distributions.

Now recall equation 7.4.9 characterizing all the quadratic forms of multivariate normal variables that are $\chi^2$’s.
Here it is again: Assume $y$ is a multivariate normal vector random variable with mean vector $\mu$ and covariance matrix $\sigma^2\Psi$, and $\Omega$ is a symmetric nonnegative definite matrix. Then
\[ (y - \mu)^\top \Omega (y - \mu) \sim \sigma^2\chi^2_k \quad\text{iff}\quad \Psi\Omega\Psi\Omega\Psi = \Psi\Omega\Psi, \tag{27.0.9} \]
and $k$ is the rank of $\Psi\Omega$.

This condition is satisfied in particular if $\Psi = I$ (the identity matrix) and $\Omega^2 = \Omega$, and this is exactly our situation.
\[ \hat\sigma^2 = \frac{(y - X\hat\beta)^\top (y - X\hat\beta)}{n-k} = \frac{\varepsilon^\top (I - X(X^\top X)^{-1}X^\top) \varepsilon}{n-k} = \frac{\varepsilon^\top M \varepsilon}{n-k} \tag{27.0.10} \]
where $M^2 = M$ and $\operatorname{rank} M = n - k$. (This last identity because for idempotent matrices, rank = tr, and we computed its tr above.) Therefore $s^2 \sim \sigma^2\chi^2_{n-k}/(n - k)$, from which one obtains again unbiasedness, but also that $\operatorname{var}[s^2] = 2\sigma^4/(n - k)$, a result that one cannot get from mean and variance alone.

Problem 329. 4 points Show that, if $y$ is normally distributed, $s^2$ and $\hat\beta$ are independent.

Answer. We showed in question 246 that $\hat\beta$ and $\hat\varepsilon$ are uncorrelated, therefore in the normal case independent, therefore $\hat\beta$ is also independent of any function of $\hat\varepsilon$, such as $\hat\sigma^2$.

Problem 330. Computer assignment: You run a regression with 3 explanatory variables, no constant term, the sample size is 20, the errors are normally distributed and you know that $\sigma^2 = 2$. Plot the density function of $s^2$. Hint: The command dchisq(x,df=25) returns the density of a Chi-square distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example, this is not the number of degrees of freedom you need here.

• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?

Answer. Start with the Theil-Schweitzer plot, because it is higher.
> x <- seq(from = 0, to = 6, by = 0.01)
> Density <- (19/2)*dchisq((19/2)*x, df=17)
> plot(x, Density, type="l", lty=2)
> lines(x,(17/2)*dchisq((17/2)*x, df=17))
> title(main = "Unbiased versus Theil-Schweitzer Variance Estimate, 17 d.f.")

Now let us derive the maximum likelihood estimator in the case of nonspherical but positive definite covariance matrix. I.e., the model is $y = X\beta + \varepsilon$, $\varepsilon \sim N(o, \sigma^2\Psi)$. The density function is
\[ f_y(y) = (2\pi\sigma^2)^{-n/2}\, |\det\Psi|^{-1/2} \exp\Bigl( -\frac{1}{2\sigma^2} (y - X\beta)^\top \Psi^{-1} (y - X\beta) \Bigr). \tag{27.0.11} \]

Problem 331. Derive (27.0.11) as follows: Take a matrix $P$ with the property that $P\varepsilon$ has covariance matrix $\sigma^2 I$. Write down the joint density function of $P\varepsilon$. Since $y$ is a linear transformation of $\varepsilon$, one can apply the rule for the density function of a transformed random variable.

Answer. Write $\Psi = QQ^\top$ with $Q$ nonsingular and define $P = Q^{-1}$ and $v = P\varepsilon$. Then $V[v] = \sigma^2 P Q Q^\top P^\top = \sigma^2 I$, therefore
\[ f_v(v) = (2\pi\sigma^2)^{-n/2} \exp\Bigl( -\frac{1}{2\sigma^2} v^\top v \Bigr). \tag{27.0.12} \]
For the transformation rule, write $v$, whose density function you know, as a function of $y$, whose density function you want to know. $v = P(y - X\beta)$; therefore the Jacobian matrix is $\partial v/\partial y^\top = \partial(Py - PX\beta)/\partial y^\top = P$, or one can see it also element by element
\[ \begin{bmatrix} \frac{\partial v_1}{\partial y_1} & \cdots & \frac{\partial v_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial v_n}{\partial y_1} & \cdots & \frac{\partial v_n}{\partial y_n} \end{bmatrix} = P, \tag{27.0.13} \]
therefore one has to do two things: first, substitute $P(y - X\beta)$ for $v$ in formula (27.0.12), and secondly multiply by the absolute value of the determinant of the Jacobian. Here is how to express the determinant of the Jacobian in terms of $\Psi$: From $\Psi^{-1} = (QQ^\top)^{-1} = (Q^\top)^{-1}Q^{-1} = (Q^{-1})^\top Q^{-1} = P^\top P$ follows $(\det P)^2 = (\det\Psi)^{-1}$, hence $|\det P| = 1/\sqrt{\det\Psi}$.

From (27.0.11) one obtains the following log likelihood function:
\[ \log f_y(y) = -\frac{n}{2} \ln 2\pi - \frac{n}{2} \ln\sigma^2 - \frac{1}{2} \ln\det[\Psi] - \frac{1}{2\sigma^2} (y - X\beta)^\top \Psi^{-1} (y - X\beta). \tag{27.0.14} \]
Here, usually not only the elements of $\beta$ are unknown, but also $\Psi$ depends on unknown parameters.
Instead of concentrating out $\beta$, we will first concentrate out $\sigma^2$, i.e., we will compute the maximum of this likelihood function over $\sigma^2$ for any given set of values for the data and the other parameters:
\begin{align}
\frac{\partial}{\partial\sigma^2} \log f_y(y) &= -\frac{n}{2} \frac{1}{\sigma^2} + \frac{(y - X\beta)^\top \Psi^{-1} (y - X\beta)}{2\sigma^4} \tag{27.0.15} \\
\tilde\sigma^2 &= \frac{(y - X\beta)^\top \Psi^{-1} (y - X\beta)}{n}. \tag{27.0.16}
\end{align}
Whatever the value of $\beta$ or the values of the unknown parameters in $\Psi$, $\tilde\sigma^2$ is the value of $\sigma^2$ which, together with the given $\beta$ and $\Psi$, gives the highest value of the likelihood function. If one plugs this $\tilde\sigma^2$ into the likelihood function, one obtains the so-called “concentrated likelihood function” which then only has to be maximized over $\beta$ and $\Psi$:
\[ \log f_y(y; \tilde\sigma^2) = -\frac{n}{2} (1 + \ln 2\pi - \ln n) - \frac{n}{2} \ln (y - X\beta)^\top \Psi^{-1} (y - X\beta) - \frac{1}{2} \ln\det[\Psi] \tag{27.0.17} \]
This objective function has to be maximized with respect to $\beta$ and the parameters entering $\Psi$. If $\Psi$ is known, then this is clearly maximized by the $\hat\beta$ minimizing (19.0.11), therefore the GLS estimator is also the maximum likelihood estimator.

If $\Psi$ depends on unknown parameters, it is interesting to compare the maximum likelihood estimator with the nonlinear least squares estimator. The objective function minimized by nonlinear least squares is $(y - X\beta)^\top \Psi^{-1} (y - X\beta)$, which is the sum of squares of the innovation parts of the residuals. These two objective functions therefore differ by the factor $(\det[\Psi])^{1/n}$, which only matters if there are unknown parameters in $\Psi$. Asymptotically, the objective functions are identical.

Using the factorization theorem for sufficient statistics, one also sees easily that $\hat\sigma^2$ and $\hat\beta$ together form sufficient statistics for $\sigma^2$ and $\beta$. For this use the identity
\begin{align}
(y - X\beta)^\top (y - X\beta) &= (y - X\hat\beta)^\top (y - X\hat\beta) + (\beta - \hat\beta)^\top X^\top X (\beta - \hat\beta) \notag \\
&= (n - k)s^2 + (\beta - \hat\beta)^\top X^\top X (\beta - \hat\beta). \tag{27.0.18}
\end{align}
Therefore the observation $y$ enters the likelihood function only through the two statistics $\hat\beta$ and $s^2$.
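The concentration step (27.0.15)–(27.0.17) can be checked numerically. The following is only a sketch under made-up assumptions: a small simulated data set, a known diagonal $\Psi$, and an arbitrary trial value for $\beta$. The function names loglik and loglik_conc are introduced here for illustration. It compares $\tilde\sigma^2$ from (27.0.16) with a numerical maximization of (27.0.14) over $\sigma^2$, and the plugged-in log likelihood with formula (27.0.17).

```r
# Sketch: numerical check of the concentrated log likelihood (27.0.17).
quad_form <- function(beta, Psi, y, X) {
  r <- y - drop(X %*% beta)
  sum(r * solve(Psi, r))               # (y - X beta)' Psi^{-1} (y - X beta)
}
log_det <- function(Psi) as.numeric(determinant(Psi, logarithm = TRUE)$modulus)

# Log likelihood (27.0.14)
loglik <- function(beta, sigma2, Psi, y, X) {
  n <- length(y)
  -n / 2 * log(2 * pi) - n / 2 * log(sigma2) - log_det(Psi) / 2 -
    quad_form(beta, Psi, y, X) / (2 * sigma2)
}

# Concentrated log likelihood (27.0.17), with sigma2 maximized out
loglik_conc <- function(beta, Psi, y, X) {
  n <- length(y)
  -n / 2 * (1 + log(2 * pi) - log(n)) -
    n / 2 * log(quad_form(beta, Psi, y, X)) - log_det(Psi) / 2
}

set.seed(3)                            # arbitrary seed
n    <- 12
X    <- cbind(1, rnorm(n))             # hypothetical regressor matrix
Psi  <- diag(runif(n, 0.5, 3))         # hypothetical known covariance pattern
y    <- rnorm(n)                       # any data vector serves for this check
beta <- c(0.3, -0.7)                   # arbitrary trial value, not an estimate

sigma2_tilde <- quad_form(beta, Psi, y, X) / n          # formula (27.0.16)
opt <- optimize(function(s2) loglik(beta, s2, Psi, y, X),
                interval = c(1e-3, 100), maximum = TRUE)

print(c(sigma2_tilde = sigma2_tilde, numerical = opt$maximum))
print(c(plugged_in   = loglik(beta, sigma2_tilde, Psi, y, X),
        concentrated = loglik_conc(beta, Psi, y, X)))
```

The two entries of each printed pair should agree, the first pair only up to the tolerance of the numerical optimizer. The check works for any trial $\beta$, which is the point of concentrating out $\sigma^2$ first.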
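The factorization of the likelihood function is therefore the trivial factorization in which that part which does not depend on the unknown parameters but only on the data is unity.

Problem 332. 12 points The log likelihood function in the linear model is given by (27.0.5). Show that the inverse of the information matrix is
\[ \begin{bmatrix} \sigma^2 (X^\top X)^{-1} & o \\ o^\top & 2\sigma^4/n \end{bmatrix} \tag{27.0.19} \]
The information matrix can be obtained in two different ways. Its typical element has the following two forms:
\[ E\Bigl[ \frac{\partial \ln\ell}{\partial\theta_i} \frac{\partial \ln\ell}{\partial\theta_k} \Bigr] = -E\Bigl[ \frac{\partial^2 \ln\ell}{\partial\theta_i\, \partial\theta_k} \Bigr], \tag{27.0.20} \]
or written as matrix derivatives
\[ E\Bigl[ \frac{\partial \ln\ell}{\partial\theta} \frac{\partial \ln\ell}{\partial\theta^\top} \Bigr] = -E\Bigl[ \frac{\partial^2 \ln\ell}{\partial\theta\, \partial\theta^\top} \Bigr]. \tag{27.0.21} \]
In our case $\theta = \begin{bmatrix} \beta \\ \sigma^2 \end{bmatrix}$. The expectation is taken under the assumption that the parameter values are the true values. Compute it both ways.

Answer. The log likelihood function can be written as
\[ \ln\ell = -\frac{n}{2} \ln 2\pi - \frac{n}{2} \ln\sigma^2 - \frac{1}{2\sigma^2} (y^\top y - 2 y^\top X\beta + \beta^\top X^\top X\beta). \tag{27.0.22} \]
The first derivatives were already computed for the maximum likelihood estimators:
\begin{align}
\frac{\partial}{\partial\beta^\top} \ln\ell &= -\frac{1}{2\sigma^2} (-2 y^\top X + 2\beta^\top X^\top X) = \frac{1}{\sigma^2} (y - X\beta)^\top X = \frac{1}{\sigma^2} \varepsilon^\top X \tag{27.0.23} \\
\frac{\partial}{\partial\sigma^2} \ln\ell &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} (y - X\beta)^\top (y - X\beta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \varepsilon^\top\varepsilon \tag{27.0.24}
\end{align}
By the way, one sees that each of these has expected value zero, which is a fact that is needed to prove consistency of the maximum likelihood estimator.

The formula with only one partial derivative will be given first, although it is more tedious: Multiplying the vector of first derivatives by its transpose, we get a symmetric 2 × 2 partitioned matrix with the diagonal elements
\[ E\Bigl[ \frac{1}{\sigma^4} X^\top \varepsilon\varepsilon^\top X \Bigr] = \frac{1}{\sigma^2} X^\top X \tag{27.0.25} \]
and
\[ E\Bigl[ \Bigl( -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \varepsilon^\top\varepsilon \Bigr)^2 \Bigr] = \operatorname{var}\Bigl[ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \varepsilon^\top\varepsilon \Bigr] = \operatorname{var}\Bigl[ \frac{1}{2\sigma^4} \varepsilon^\top\varepsilon \Bigr] = \frac{1}{4\sigma^8}\, 2n\sigma^4 = \frac{n}{2\sigma^4} \tag{27.0.26} \]
One of the off-diagonal elements is $\bigl( -\frac{n}{2\sigma^4} + \frac{1}{2\sigma^6} \varepsilon^\top\varepsilon \bigr) \varepsilon^\top X$. Its expected value is zero: $E[\varepsilon] = o$, and also $E[\varepsilon\varepsilon^\top\varepsilon] = o$ since its $i$th component is $E[\varepsilon_i \sum_j \varepsilon_j^2] = \sum_j E[\varepsilon_i \varepsilon_j^2]$. If $i \ne j$, then $\varepsilon_i$ is independent of $\varepsilon_j^2$, therefore $E[\varepsilon_i \varepsilon_j^2] = 0 \cdot \sigma^2 = 0$. If $i = j$, we get $E[\varepsilon_i^3] = 0$ since $\varepsilon_i$ has a symmetric distribution.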
It is easier if we differentiate once more:
\begin{align}
\frac{\partial^2}{\partial\beta\, \partial\beta^\top} \ln\ell &= -\frac{1}{\sigma^2} X^\top X \tag{27.0.27} \\
\frac{\partial^2}{\partial\beta\, \partial\sigma^2} \ln\ell &= -\frac{1}{\sigma^4} X^\top (y - X\beta) = -\frac{1}{\sigma^4} X^\top\varepsilon \tag{27.0.28} \\
\frac{\partial^2}{(\partial\sigma^2)^2} \ln\ell &= \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} (y - X\beta)^\top (y - X\beta) = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} \varepsilon^\top\varepsilon \tag{27.0.29}
\end{align}
This gives the top matrix in [JHG+88, (6.1.24b)]:
\[ \begin{bmatrix} -\frac{1}{\sigma^2} X^\top X & -\frac{1}{\sigma^4} (X^\top y - X^\top X\beta) \\ -\frac{1}{\sigma^4} (X^\top y - X^\top X\beta)^\top & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} (y - X\beta)^\top (y - X\beta) \end{bmatrix} \tag{27.0.30} \]
Now assume that $\beta$ and $\sigma^2$ are the true values, take expected values, and reverse the sign. This gives the information matrix
\[ \begin{bmatrix} \sigma^{-2} X^\top X & o \\ o^\top & n/(2\sigma^4) \end{bmatrix} \tag{27.0.31} \]
For the lower righthand side corner we need that $E[(y - X\beta)^\top (y - X\beta)] = E[\varepsilon^\top\varepsilon] = n\sigma^2$.

Taking inverses gives (27.0.19), which is a lower bound for the covariance matrix; we see that $s^2$ with $\operatorname{var}[s^2] = 2\sigma^4/(n - k)$ does not attain the bound. However one can show with other means that it is nevertheless efficient.

CHAPTER 28

Random Regressors

Until now we always assumed that $X$ was nonrandom, i.e., the hypothetical repetitions of the experiment used the same $X$ matrix. In the nonexperimental sciences, such as economics, this assumption is clearly inappropriate. It is only justified because most results valid for nonrandom regressors can be generalized to the case of random regressors. To indicate that the regressors are random, we will write them as $X$.

28.1. Strongest Assumption: Error Term Well Behaved Conditionally on Explanatory Variables

The assumption which we will discuss first is that $X$ is random, but the classical assumptions hold conditionally on $X$, i.e., the conditional expectation $E[\varepsilon|X] = o$, and the conditional variance-covariance matrix $V[\varepsilon|X] = \sigma^2 I$. In this situation, the least squares estimator has all the classical properties conditionally on $X$, for instance $E[\hat\beta|X] = \beta$, $V[\hat\beta|X] = \sigma^2 (X^\top X)^{-1}$, $E[s^2|X] = \sigma^2$, etc.

Moreover, certain properties of the Least Squares estimator remain valid unconditionally.
An application of the law of iterated expectations shows that the least squares estimator $\hat\beta$ is still unbiased. Start with (18.0.7):
\begin{align}
\hat\beta - \beta &= (X^\top X)^{-1} X^\top\varepsilon \tag{28.1.1} \\
E[\hat\beta - \beta|X] &= E[(X^\top X)^{-1} X^\top\varepsilon|X] = (X^\top X)^{-1} X^\top E[\varepsilon|X] = o. \tag{28.1.2} \\
E[\hat\beta - \beta] &= E\bigl[ E[\hat\beta - \beta|X] \bigr] = o. \tag{28.1.3}
\end{align}

Problem 333. 1 point In the model with random explanatory variables $X$ you are considering an estimator $\tilde\beta$ of $\beta$. Which statement is stronger: $E[\tilde\beta] = \beta$, or $E[\tilde\beta|X] = \beta$? Justify your answer.

Answer. The second statement is stronger. The first statement follows from the second by the law of iterated expectations.

Problem 334. 2 points Assume the regressors $X$ are random, and the classical assumptions hold conditionally on $X$, i.e., $E[\varepsilon|X] = o$ and $V[\varepsilon|X] = \sigma^2 I$. Show that $s^2$ is an unbiased estimate of $\sigma^2$.

Answer. From the theory with nonrandom explanatory variables follows $E[s^2|X] = \sigma^2$. Therefore $E[s^2] = E\bigl[E[s^2|X]\bigr] = E[\sigma^2] = \sigma^2$. In words: if the expectation conditional on $X$ does not depend on $X$, then it is also the unconditional expectation.

The law of iterated expectations can also be used to compute the unconditional MSE matrix of $\hat\beta$:
\begin{align}
MSE[\hat\beta; \beta] &= E[(\hat\beta - \beta)(\hat\beta - \beta)^\top] \tag{28.1.4} \\
&= E\bigl[ E[(\hat\beta - \beta)(\hat\beta - \beta)^\top|X] \bigr] \tag{28.1.5} \\
&= E[\sigma^2 (X^\top X)^{-1}] \tag{28.1.6} \\
&= \sigma^2 E[(X^\top X)^{-1}]. \tag{28.1.7}
\end{align}

Problem 335. 2 points Show that $s^2 (X^\top X)^{-1}$ is an unbiased estimator of $MSE[\hat\beta; \beta]$.

Answer.
\begin{align}
E[s^2 (X^\top X)^{-1}] &= E\bigl[ E[s^2 (X^\top X)^{-1}|X] \bigr] \tag{28.1.8} \\
&= E[\sigma^2 (X^\top X)^{-1}] \tag{28.1.9} \\
&= \sigma^2 E[(X^\top X)^{-1}] \tag{28.1.10} \\
&= MSE[\hat\beta; \beta] \quad\text{by (28.1.7).} \tag{28.1.11}
\end{align}

The Gauss-Markov theorem generalizes in the following way: Say $\tilde\beta$ is an estimator, linear in $y$, but not necessarily in $X$, satisfying $E[\tilde\beta|X] = \beta$ (which is stronger than unbiasedness); then $MSE[\tilde\beta; \beta] \ge MSE[\hat\beta; \beta]$.
The proof is immediate: we know by the usual Gauss-Markov theorem that $\operatorname{MSE}[\tilde\beta;\beta\mid X] \ge \operatorname{MSE}[\hat\beta;\beta\mid X]$, and taking expected values preserves this inequality: $E\bigl[\operatorname{MSE}[\tilde\beta;\beta\mid X]\bigr] \ge E\bigl[\operatorname{MSE}[\hat\beta;\beta\mid X]\bigr]$, but this expected value is exactly the unconditional MSE.

The assumption $E[\varepsilon\mid X] = o$ can also be written $E[y\mid X] = X\beta$, and $V[\varepsilon\mid X] = \sigma^2 I$ can also be written as $V[y\mid X] = \sigma^2 I$. Both of these are assumptions about the conditional distribution of y given that the random regressor matrix takes a particular value X, for all such values. This suggests the following broadening of the regression paradigm: y and X are jointly distributed random variables, and one is interested in how the conditional distribution of y given X depends on X. If the expected value of this distribution depends linearly on X, and the variance of this distribution is constant, then this is the linear regression model discussed above. But the expected value might also depend on X in a nonlinear fashion (nonlinear least squares), and the variance may not be constant, in which case the intuition that y is some function of X plus some error term may no longer be appropriate; y may for instance be the outcome of a binary choice, the probability of which depends on X (see chapter ??; the generalized linear model).

28.2. Contemporaneously Uncorrelated Disturbances

In many situations with random regressors, the condition $E[\varepsilon\mid X] = o$ is not satisfied. Instead, the columns of X are contemporaneously uncorrelated with $\varepsilon$, but they may be correlated with past values of $\varepsilon$. The main example here is regression with a lagged dependent variable. In this case, OLS is no longer unbiased, but asymptotically it still has all the good properties: it is consistent and asymptotically normal with the covariance matrix which one would expect. Asymptotically, the computer printout is still valid. This is a very important result, which is often used in econometrics, but most econometrics textbooks do not even start to prove it. There is a proof in [Kme86, pp. 749–757], and one in [Mal80, pp. 535–539].

Problem 336.
Since least squares with random regressors is appropriate whenever the disturbances are contemporaneously uncorrelated with the explanatory variables, a friend of yours proposes to test for random explanatory variables by checking whether the sample correlation coefficients between the residuals and the explanatory variables are significantly different from zero or not. Is this an appropriate statistic?

Answer. No. The sample correlation coefficients are always zero! The least squares residuals are by construction orthogonal to every regressor, so that (in a regression with an intercept) their sample correlation with each regressor vanishes regardless of how the disturbances behave.

28.3. Disturbances Correlated with Regressors in Same Observation

But if $\varepsilon$ is contemporaneously correlated with X, then OLS is inconsistent. This can be the case in some dynamic processes (lagged dependent variable as regressor, and autocorrelated errors, see question ??), when there are, in addition to the relation which one wants to test with the regression, other relations making the righthand side variables dependent on the lefthand side variable, or when the righthand side variables are measured with errors. This is usually the case in economics, and econometrics has developed the technique of simultaneous equations estimation to deal with it.

Problem 337. 3 points What does one have to watch out for if some of the regressors are random?

CHAPTER 29

The Mahalanobis Distance

Everything in this chapter is unpublished work, presently still in draft form. The aim is to give a motivation for the least squares objective function in terms of an initial measure of precision. The case of prediction is mathematically simpler than that of estimation, therefore this chapter will only discuss prediction. We assume that the joint distribution of y and z has the form

$$\begin{bmatrix}y\\ z\end{bmatrix} \sim \left(\begin{bmatrix}X\\ W\end{bmatrix}\beta,\ \sigma^2\begin{bmatrix}\Omega_{yy} & \Omega_{yz}\\ \Omega_{zy} & \Omega_{zz}\end{bmatrix}\right),\qquad \sigma^2>0 \text{ otherwise unknown, } \beta \text{ unknown as well.} \tag{29.0.1}$$

y is observed but z is not and has to be predicted. But assume we are not interested in the MSE since we do the experiment only once.
We want to predict z in such a way that, whatever the true value of $\beta$, the predicted value $z^*$ “blends in” best with the given data y.

There is an important conceptual difference between this criterion and the one based on the MSE. The present criterion cannot be applied until after the data are known, therefore it is called a “final” criterion as opposed to the “initial” criterion of the MSE. See Barnett [Bar82, pp. 157–159] for a good discussion of these issues.

How do we measure the degree to which a given data set “blends in,” i.e., is not an outlier for a given distribution? Hypothesis testing uses this criterion. The most often-used testing principle is: reject the null hypothesis if the observed value of a certain statistic is too much of an outlier for the distribution which this statistic would have under the null hypothesis. If the statistic is a scalar, and if under the null hypothesis this statistic has expected value $\mu$ and standard deviation $\sigma$, then one often uses an estimate of $|x-\mu|/\sigma$, the number of standard deviations the observed value is away from the mean, to measure the “distance” of the observed value x from the distribution $(\mu, \sigma^2)$. The Mahalanobis distance generalizes this concept to the case that the test statistic is a vector random variable.

29.1. Definition of the Mahalanobis Distance

Since it is mathematically more convenient to work with the squared distance than with the distance itself, we will make the following thought experiment to motivate the Mahalanobis distance. How could one generalize the squared scalar distance $(y-\mu)^2/\sigma^2$ for the distance of a vector value y from the distribution of the vector random variable $y \sim (\mu, \sigma^2\Omega)$?
If all $y_i$ have the same variance $\sigma^2$, i.e., if $\Omega = I$, one might measure the squared distance of y from the distribution $(\mu, \sigma^2\Omega)$ by $\frac{1}{\sigma^2}\max_i (y_i-\mu_i)^2$, but since the maximum from two trials is bigger than the value from one trial only, one should perhaps divide this by the expected value of such a maximum. If the variances are different, say $\sigma_i^2$, one might want to look at the number of standard deviations which the “worst” component of y is away from what would be its mean if the vector y were an observation of the random vector y, i.e., the squared distance of the observed vector from the distribution would be $\max_i \frac{(y_i-\mu_i)^2}{\sigma_i^2}$, again normalized by its expected value.

The principle actually used by the Mahalanobis distance goes only a small step further than the examples just cited. It is coordinate-free, i.e., any linear combinations of the elements of y are considered on equal footing with these elements themselves. In other words, it does not distinguish between variates and variables. The distance of a given vector value from a certain multivariate distribution is defined to be the distance of the “worst” linear combination of the elements of this vector from the univariate distribution of this linear combination, normalized in such a way that the expected value of this distance is 1.

Definition 29.1.1. Given a random n-vector y which has an expected value and a nonsingular covariance matrix. The squared “Mahalanobis distance” or “statistical distance” of the observed value y from the distribution of y is defined to be

$$\operatorname{MHD}[y;y] = \frac{1}{n}\max_g \frac{\bigl(g^\top y - E[g^\top y]\bigr)^2}{\operatorname{var}[g^\top y]}. \tag{29.1.1}$$

(Here the first argument of MHD and the y in the first term of the numerator denote the observed value; the y inside $E[\cdot]$ and $\operatorname{var}[\cdot]$ denotes the random vector.) If the denominator $\operatorname{var}[g^\top y]$ is zero, then $g = o$, therefore the numerator is zero as well. In this case the fraction is defined to be zero.

Theorem 29.1.2. Let y be a vector random variable with $E[y] = \mu$ and $V[y] = \sigma^2\Omega$, $\sigma^2 > 0$ and $\Omega$ positive definite.
The squared Mahalanobis distance of the value y from the distribution of y is equal to

$$\operatorname{MHD}[y;y] = \frac{1}{n\sigma^2}(y-\mu)^\top\Omega^{-1}(y-\mu) \tag{29.1.2}$$

Proof. (29.1.2) is a simple consequence of (25.4.4). It is also somewhat intuitive since the righthand side of (29.1.2) can be considered a division of the square of $y-\mu$ by the covariance matrix of y.

The Mahalanobis distance is an asymmetric measure; a large value indicates a bad fit of the hypothetical population to the observation, while a value of, say, 0.1 does not necessarily indicate a better fit than a value of 1.

Problem 338. Let y be a random n-vector with expected value $\mu$ and nonsingular covariance matrix $\sigma^2\Omega$. Show that the expected value of the Mahalanobis distance of the observations of y from the distribution of y is 1, i.e.,

$$E\bigl[\operatorname{MHD}[y;y]\bigr] = 1 \tag{29.1.3}$$

Answer.

$$E\Bigl[\frac{1}{n\sigma^2}(y-\mu)^\top\Omega^{-1}(y-\mu)\Bigr] = E\Bigl[\operatorname{tr}\frac{1}{n\sigma^2}\Omega^{-1}(y-\mu)(y-\mu)^\top\Bigr] = \operatorname{tr}\Bigl(\frac{1}{n\sigma^2}\Omega^{-1}\sigma^2\Omega\Bigr) = \frac{1}{n}\operatorname{tr}(I) = 1. \tag{29.1.4}$$

(29.1.2) is, up to a constant factor, the quadratic form in the exponent of the normal density function of y. For a normally distributed y, therefore, all observations located on the same density contour have equal distance from the distribution.

The Mahalanobis distance is also defined if the covariance matrix of y is singular. In this case, certain nonzero linear combinations of the elements of y are known with certainty. Certain vectors can therefore not possibly be realizations of y, i.e., the set of realizations of y does not fill the whole $\mathbb{R}^n$.

Problem 339. 2 points The random vector $y = \begin{bmatrix}y_1\\ y_2\\ y_3\end{bmatrix}$ has mean $\begin{bmatrix}1\\ 2\\ -3\end{bmatrix}$ and covariance matrix $\frac{1}{3}\begin{bmatrix}2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{bmatrix}$. Is this covariance matrix singular? If so, give a linear combination of the elements of y which is known with certainty. And give a value which can never be a realization of y. Prove everything you state.

Answer.
Yes, it is singular;

$$\begin{bmatrix}2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2\end{bmatrix}\begin{bmatrix}1\\ 1\\ 1\end{bmatrix} = \begin{bmatrix}0\\ 0\\ 0\end{bmatrix} \tag{29.1.5}$$

I.e., $y_1 + y_2 + y_3 = 0$ because its variance is 0 and its mean is zero as well since $\begin{bmatrix}1 & 1 & 1\end{bmatrix}\begin{bmatrix}1\\ 2\\ -3\end{bmatrix} = 0$. Consequently any vector whose components do not add up to zero, for instance $\begin{bmatrix}1 & 1 & 1\end{bmatrix}^\top$, can never be a realization of y.

Definition 29.1.3. Given a vector random variable y which has a mean and a covariance matrix. A value y has infinite statistical distance from this random variable, i.e., it cannot possibly be a realization of this random variable, if a vector of coefficients g exists such that $\operatorname{var}[g^\top y] = 0$ but $g^\top y \ne g^\top E[y]$. If such a g does not exist, then the squared Mahalanobis distance of y from y is defined as in (29.1.1), with n replaced by $\operatorname{rank}[\Omega]$. If the denominator in (29.1.1) is zero, then it no longer necessarily follows that $g = o$, but it nevertheless follows that the numerator is zero, and the fraction should in this case again be considered zero.

If $\Omega$ is singular, then the inverse $\Omega^{-1}$ in formula (29.1.2) must be replaced by a “g-inverse.” A g-inverse of a matrix A is any matrix $A^-$ which satisfies $AA^-A = A$. G-inverses always exist, but they are usually not unique.

Problem 340. a is a scalar. What is its g-inverse $a^-$?

Theorem 29.1.4. Let y be a random variable with $E[y] = \mu$ and $V[y] = \sigma^2\Omega$, $\sigma^2 > 0$. If it is not possible to express the vector y in the form $y = \mu + \Omega a$ for some a, then the squared Mahalanobis distance of y from the distribution of y is infinite, i.e., $\operatorname{MHD}[y;y] = \infty$; otherwise

$$\operatorname{MHD}[y;y] = \frac{1}{\sigma^2\operatorname{rank}[\Omega]}(y-\mu)^\top\Omega^-(y-\mu) \tag{29.1.6}$$

Now we will discuss how a given observation vector can be extended by additional observations in such a way that the Mahalanobis distance of the whole vector from its distribution is minimized.

CHAPTER 30

Interval Estimation

We will first show how the least squares principle can be used to construct confidence regions, and then we will derive the properties of these confidence regions.

30.1.
A Basic Construction Principle for Confidence Regions

The least squares objective function, whose minimum argument gave us the BLUE, naturally allows us to generate confidence intervals or higher-dimensional confidence regions. A confidence region for $\beta$ based on $y = X\beta + \varepsilon$ can be constructed as follows:

• Draw the OLS estimate $\hat\beta$ into k-dimensional space; it is the vector which minimizes $SSE = (y-X\hat\beta)^\top(y-X\hat\beta)$.
• For every other vector $\tilde\beta$ one can define the sum of squared errors associated with that vector as $SSE_{\tilde\beta} = (y-X\tilde\beta)^\top(y-X\tilde\beta)$. Draw the level hypersurfaces (if k = 2: level lines) of this function. These are ellipsoids centered on $\hat\beta$.
• Each of these ellipsoids is a confidence region for $\beta$. Different confidence regions differ by their coverage probabilities.
• If one is only interested in certain coordinates of $\beta$ and not in the others, or in some other linear transformation $R\beta$, then the corresponding confidence regions are the corresponding transformations of this ellipsoid. Geometrically this can best be seen if this transformation is an orthogonal projection; then the confidence ellipse of the transformed vector $R\beta$ is also a projection or “shadow” of the confidence region for the whole vector. Projections of the same confidence region have the same confidence level, independent of the direction in which this projection goes.

The confidence regions for $\beta$ with coverage probability $\pi$ will be written here as $B_{\beta;\pi}$ or, if we want to make its dependence on the observation vector y explicit, $B_{\beta;\pi}(y)$. These confidence regions are bounded by level lines of the SSE, and mathematically, it is advantageous to define these level lines by their level relative to the minimum level, i.e., as the set of all $\tilde\beta$ for which the quotient of the attained $SSE_{\tilde\beta} = (y-X\tilde\beta)^\top(y-X\tilde\beta)$ divided by the smallest possible $SSE = (y-X\hat\beta)^\top(y-X\hat\beta)$ is smaller than or equal to a given number.
In formulas,

$$\tilde\beta \in B_{\beta;\pi}(y) \iff \frac{(y-X\tilde\beta)^\top(y-X\tilde\beta)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,k} \tag{30.1.1}$$

It will be shown below, in the discussion following (30.2.1), that $c_{\pi;n-k,k}$ only depends on $\pi$ (the confidence level), $n-k$ (the degrees of freedom in the regression), and k (the dimension of the confidence region).

To get a geometric intuition of this principle, look at the case k = 2, in which the parameter vector $\beta$ has only two components. For each possible value $\tilde\beta$ of the parameter vector, the associated sum of squared errors is $SSE_{\tilde\beta} = (y-X\tilde\beta)^\top(y-X\tilde\beta)$. This is a quadratic function of $\tilde\beta$, whose level lines form concentric ellipses as shown in Figure 1. The center of these ellipses is the unconstrained least squares estimate. Each of the ellipses is a confidence region for $\beta$ for a different confidence level.

If one needs a confidence region not for the whole vector $\beta$ but, say, for i linearly independent linear combinations $R\beta$ (here R is an $i\times k$ matrix with full row rank), then the above principle applies in the following way: the vector $\tilde u$ lies in the confidence region for $R\beta$ generated by y for confidence level $\pi$, notation $B_{R\beta;\pi}$, if and only if there is a $\tilde\beta$ in the confidence region (30.1.1) (with the parameters adjusted to reflect the dimensionality of $\tilde u$) which satisfies $R\tilde\beta = \tilde u$:

$$\tilde u \in B_{R\beta;\pi}(y) \iff \text{there exists a } \tilde\beta \text{ with } \tilde u = R\tilde\beta \text{ and } \frac{(y-X\tilde\beta)^\top(y-X\tilde\beta)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,i} \tag{30.1.2}$$

Problem 341. Why does one have to change the value of c when one goes over to the projections of the confidence regions?

Answer. Because the projection is a many-to-one mapping, and vectors which are not in the original ellipsoid may still end up in the projection.

Again let us illustrate this with the 2-dimensional case in which the confidence region for $\beta$ is an ellipse, as drawn in Figure 1, called $B_{\beta;\pi}(y)$.
Starting with this ellipse, the above criterion defines individual confidence intervals for linear combinations $u = r^\top\beta$ by the rule: $\tilde u \in B_{r^\top\beta;\pi}(y)$ iff a $\tilde\beta \in B_\beta(y)$ exists with $r^\top\tilde\beta = \tilde u$. For $r = \begin{bmatrix}1\\ 0\end{bmatrix}$, this interval is simply the projection of the ellipse on the horizontal axis, and for $r = \begin{bmatrix}0\\ 1\end{bmatrix}$ it is the projection on the vertical axis.

The same argument applies for all vectors r with $r^\top r = 1$. The inner product of two vectors is the length of the first vector times the length of the projection of the second vector on the first. If $r^\top r = 1$, therefore, $r^\top\tilde\beta$ is simply the length of the orthogonal projection of $\tilde\beta$ on the line generated by the vector r. Therefore the confidence interval for $r^\top\beta$ is simply the projection of the ellipse on the line generated by r. (This projection is sometimes called the “shadow” of the ellipse.)

The confidence region for $R\beta$ can also be defined as follows: $\tilde u$ lies in this confidence region if and only if the “best” $\hat{\hat\beta}$ which satisfies $R\hat{\hat\beta} = \tilde u$ lies in the confidence region (30.1.1), this best $\hat{\hat\beta}$ being, of course, the constrained least squares estimate subject to the constraint $R\beta = \tilde u$, whose formula is given by (22.3.13). The confidence region for $R\beta$ consists therefore of all $\tilde u$ for which the constrained least squares estimate $\hat{\hat\beta} = \hat\beta - (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-\tilde u)$ satisfies condition (30.1.1):

$$\tilde u \in B_{R\beta;\pi}(y) \iff \frac{(y-X\hat{\hat\beta})^\top(y-X\hat{\hat\beta})}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,i} \tag{30.1.3}$$

One can also write it as

$$\tilde u \in B_{R\beta;\pi}(y) \iff \frac{SSE_{\text{constrained}}}{SSE_{\text{unconstrained}}} \le c_{\pi;n-k,i} \tag{30.1.4}$$

i.e., those $\tilde u$ are in the confidence region which, if imposed as a constraint on the regression, will not make the SSE too much bigger.

Figure 1. Confidence Ellipse with “Shadows”

In order to transform (30.1.3) into a mathematically more convenient form, write it as

$$\tilde u \in B_{R\beta;\pi}(y) \iff \frac{(y-X\hat{\hat\beta})^\top(y-X\hat{\hat\beta}) - (y-X\hat\beta)^\top(y-X\hat\beta)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,i} - 1$$

and then use (22.7.2) to get

$$\tilde u \in B_{R\beta;\pi}(y) \iff \frac{(R\hat\beta-\tilde u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-\tilde u)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,i} - 1 \tag{30.1.5}$$

This formula has the great advantage that $\hat{\hat\beta}$ no longer appears in it. The condition whether $\tilde u$ belongs to the confidence region is here formulated in terms of $\hat\beta$ alone.

Problem 342. Using (14.2.12), show that (30.1.1) can be rewritten as

$$\tilde\beta \in B_{\beta;\pi}(y) \iff \frac{(\hat\beta-\tilde\beta)^\top X^\top X(\hat\beta-\tilde\beta)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,k} - 1 \tag{30.1.6}$$

Verify that this is the same as (30.1.5) in the special case R = I.

Problem 343.
You have run a regression with intercept, but you are not interested in the intercept per se but need a joint confidence region for all slope parameters. Using the notation of Problem 304, show that this confidence region has the form

$$\tilde\beta \in B_{\beta;\pi}(y) \iff \frac{(\hat\beta-\tilde\beta)^\top \underline{X}^\top\underline{X}(\hat\beta-\tilde\beta)}{(\underline{y}-\underline{X}\hat\beta)^\top(\underline{y}-\underline{X}\hat\beta)} \le c_{\pi;n-k,k-1} - 1 \tag{30.1.7}$$

(here $\underline{X}$ and $\underline{y}$ stand for the regressors and the dependent variable with their means swept out). I.e., we are sweeping the means out of both regressors and dependent variables, and then we act as if the regression never had an intercept and use the formula for the full parameter vector (30.1.6) for these transformed data (except that the number of degrees of freedom $n-k$ still reflects the intercept as one of the explanatory variables).

Answer. Write the full parameter vector as $\begin{bmatrix}\alpha\\ \beta\end{bmatrix}$ and $R = \begin{bmatrix}o & I\end{bmatrix}$. Use (30.1.5) but instead of $\tilde u$ write $\tilde\beta$. The only tricky part is the following, which uses (23.0.37); on the left, $X^\top X$ refers to the full regressor matrix including the column of ones:

$$R(X^\top X)^{-1}R^\top = \begin{bmatrix}o & I\end{bmatrix}\begin{bmatrix}1/n + \bar x^\top(\underline{X}^\top\underline{X})^{-1}\bar x & -\bar x^\top(\underline{X}^\top\underline{X})^{-1}\\ -(\underline{X}^\top\underline{X})^{-1}\bar x & (\underline{X}^\top\underline{X})^{-1}\end{bmatrix}\begin{bmatrix}o^\top\\ I\end{bmatrix} = (\underline{X}^\top\underline{X})^{-1} \tag{30.1.8}$$

The denominator is $(y-\iota\hat\alpha-X\hat\beta)^\top(y-\iota\hat\alpha-X\hat\beta)$, but since $\hat\alpha = \bar y - \bar x^\top\hat\beta$, see problem 204, this denominator can be rewritten as $(\underline{y}-\underline{X}\hat\beta)^\top(\underline{y}-\underline{X}\hat\beta)$.

Figure 2. Confidence Band for Regression Line

Problem 344. 3 points We are in the simple regression $y_t = \alpha + \beta x_t + \varepsilon_t$. If one draws, for every value of x, a 95% confidence interval for $\alpha + \beta x$, one gets a “confidence band” around the fitted line, as shown in Figure 2. Is the probability that this confidence band covers the true regression line over its whole length equal to 95%, greater than 95%, or smaller than 95%? Give a good verbal reasoning for your answer. You should make sure that your explanation is consistent with the fact that the confidence interval is random and the true regression line is fixed.

30.2. Coverage Probability of the Confidence Regions

The probability that any given known value $\tilde u$ lies in the confidence region (30.1.3) depends on the unknown $\beta$.
But we will show now that the “coverage probability” of the region, i.e., the probability with which the confidence region contains the unknown true value $u = R\beta$, does not depend on any unknown parameters.

To get the coverage probability, we must substitute $\tilde u = R\beta$ (where $\beta$ is the true parameter value) in (30.1.5). This gives

$$R\beta \in B_{R\beta;\pi}(y) \iff \frac{(R\hat\beta-R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-R\beta)}{(y-X\hat\beta)^\top(y-X\hat\beta)} \le c_{\pi;n-k,i} - 1 \tag{30.2.1}$$

Let us look at numerator and denominator separately. Under the Normality assumption, $R\hat\beta \sim N\bigl(R\beta, \sigma^2R(X^\top X)^{-1}R^\top\bigr)$. Therefore, by (7.4.9), the distribution of the numerator of (30.2.1) is

$$(R\hat\beta-R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-R\beta) \sim \sigma^2\chi^2_i. \tag{30.2.2}$$

This probability distribution only depends on one unknown parameter, namely, $\sigma^2$. Regarding the denominator, remember that, by (18.4.2), $(y-X\hat\beta)^\top(y-X\hat\beta) = \varepsilon^\top M\varepsilon$, and if we apply (7.4.9) to this we can see that

$$(y-X\hat\beta)^\top(y-X\hat\beta) \sim \sigma^2\chi^2_{n-k} \tag{30.2.3}$$

Furthermore, numerator and denominator are independent. To see this, look first at $\hat\beta$ and $\hat\varepsilon$. By Problem 246 they are uncorrelated, and since they are also jointly Normal, it follows that they are independent. If $\hat\beta$ and $\hat\varepsilon$ are independent, any functions of $\hat\beta$ are independent of any functions of $\hat\varepsilon$. The numerator in the test statistic (30.2.1) is a function of $\hat\beta$ and the denominator is a function of $\hat\varepsilon$; therefore they are independent, as claimed. Lastly, if we divide numerator by denominator, the unknown “nuisance parameter” $\sigma^2$ in their probability distributions cancels out, i.e., the distribution of the quotient is fully known.

To sum up: if $\tilde u$ is the true value $\tilde u = R\beta$, then the test statistic in (30.2.1) can no longer be observed, but its distribution is known; it is a $\chi^2_i$ divided by an independent $\chi^2_{n-k}$.
Therefore, for every value c, the probability that the confidence region (30.1.5) contains the true $R\beta$ can be computed, and conversely, for any desired coverage probability, the appropriate critical value c can be computed. As claimed, this critical value only depends on the confidence level $\pi$ and $n-k$ and i.

30.3. Conventional Formulas for the Test Statistics

In order to get this test statistic into the form in which it is conventionally tabulated, we must divide both numerator and denominator of (30.1.5) by their degrees of freedom, to get a $\chi^2_i/i$ divided by an independent $\chi^2_{n-k}/(n-k)$. This quotient is said to have an F-distribution with i and $n-k$ degrees of freedom.

The F-distribution is defined as $F_{i,j} = \frac{\chi^2_i/i}{\chi^2_j/j}$ instead of the seemingly simpler formula $\frac{\chi^2_i}{\chi^2_j}$, because the division by the degrees of freedom makes all F-distributions and the associated critical values similar: an observed value below 1 is never significant at the conventional levels, and whether a greater value is significant depends on the degrees of freedom (for a single constraint and the 5% level, the critical value is about 4 unless $n-k$ is small).

Therefore the condition deciding whether a given vector $\tilde u$ lies in the confidence region for $R\beta$ with confidence level $\pi = 1-\alpha$ is conventionally formulated as follows:

$$\frac{(SSE_{\text{constrained}} - SSE_{\text{unconstrained}})/\text{number of constraints}}{SSE_{\text{unconstrained}}/(\text{number of observations} - \text{number of coefficients in unconstrained model})} \le F_{(i,n-k;\alpha)} \tag{30.3.1}$$

Here the constrained SSE is the SSE in the model estimated with the constraint $R\beta = \tilde u$ imposed, and $F_{(i,n-k;\alpha)}$ is the upper $\alpha$ quantile of the F distribution with i and $n-k$ degrees of freedom, i.e., it is that scalar c for which a random variable F which has an F distribution with i and $n-k$ degrees of freedom satisfies $\Pr[F \ge c] = \alpha$.

30.4. Interpretation in terms of Studentized Mahalanobis Distance

The division of numerator and denominator by their degrees of freedom also gives us a second intuitive interpretation of the test statistic in terms of the Mahalanobis distance, see chapter 29.
If one divides the denominator by its degrees of freedom, one gets an unbiased estimate of $\sigma^2$:

$$s^2 = \frac{1}{n-k}(y-X\hat\beta)^\top(y-X\hat\beta). \tag{30.4.1}$$

Therefore from (30.1.5) one gets the following alternative formula for the joint confidence region $B(y)$ for the vector parameter $u = R\beta$ for confidence level $\pi = 1-\alpha$:

$$\tilde u \in B_{R\beta;1-\alpha}(y) \iff \frac{1}{s^2}(R\hat\beta-\tilde u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-\tilde u) \le iF_{(i,n-k;\alpha)} \tag{30.4.2}$$

Here $\hat\beta$ is the least squares estimator of $\beta$, and $s^2 = (y-X\hat\beta)^\top(y-X\hat\beta)/(n-k)$ the unbiased estimator of $\sigma^2$. Therefore $\hat\Sigma = s^2(X^\top X)^{-1}$ is the estimated covariance matrix as available in the regression printout, and $\hat V = s^2R(X^\top X)^{-1}R^\top$ is the estimate of the covariance matrix of $R\hat\beta$. Another way to write (30.4.2) is therefore

$$B(y) = \{\tilde u \in \mathbb{R}^i : (R\hat\beta-\tilde u)^\top\hat V^{-1}(R\hat\beta-\tilde u) \le iF_{(i,n-k;\alpha)}\}. \tag{30.4.3}$$

This formula allows a suggestive interpretation. Whether $\tilde u$ lies in the confidence region or not depends on the Mahalanobis distance which the actual value of $R\hat\beta$ has from the distribution which $R\hat\beta$ would have if the true parameter vector were to satisfy the constraint $R\beta = \tilde u$. It is not the Mahalanobis distance itself but only an estimate of it because $\sigma^2$ is replaced by its unbiased estimate $s^2$.

These formulas are also useful for drawing the confidence ellipses. The r which you need in equation (7.3.22) in order to draw the confidence ellipse is $r = \sqrt{iF_{(i,n-k;\alpha)}}$. This is the same as the local variable mult in the following S-function to draw this ellipse: its arguments are the center point (a 2-vector b), the estimated covariance matrix (a 2 × 2 matrix C), the degrees of freedom in the denominator of the F distribution (the scalar df), and the confidence level (the scalar level between 0 and 1 which defaults to 0.95 if not specified).
    confelli <-
    function(b, C, df, level = 0.95, xlab = "", ylab = "", add = TRUE, prec = 51)
    {
      # Plot an ellipse with "covariance matrix" C, center b, and P-content
      # level according to the F(2,df) distribution.
      # Sent to S-NEWS on May 19, 1999 by Roger Koenker
      # Department of Economics
      # University of Illinois
      # Champaign, IL 61820
      # url: http://www.econ.uiuc.edu
      # email roger@ysidro.econ.uiuc.edu
      # vox: 217-333-4558
      # fax: 217-244-6678.
      # Included in the ecmet package with his permission.
      d <- sqrt(diag(C))                      # standard deviations of the two components
      dfvec <- c(2, df)                       # degrees of freedom of the F distribution
      phase <- acos(C[1, 2]/(d[1] * d[2]))    # arc cosine of the correlation coefficient
      angles <- seq(-pi, pi, length.out = prec)
      mult <- sqrt(dfvec[1] * qf(level, dfvec[1], dfvec[2]))
      xpts <- b[1] + d[1] * mult * cos(angles)
      ypts <- b[2] + d[2] * mult * cos(angles + phase)
      if(add) lines(xpts, ypts)
      else plot(xpts, ypts, type = "l", xlab = xlab, ylab = ylab)
    }

The mathematics why this works is in Problem 146.

Problem 345. 3 points In the regression model $y = X\beta + \varepsilon$ you observe y and the (nonstochastic) X and you construct the following confidence region $B(y)$ for $R\beta$, where R is an $i\times k$ matrix with full row rank:

$$B(y) = \{u \in \mathbb{R}^i : (R\hat\beta-u)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-u) \le i s^2 F_{(i,n-k;\alpha)}\}. \tag{30.4.4}$$

Compute the probability that B contains the true $R\beta$.

Answer.

$$\Pr[B(y) \ni R\beta] = \Pr\bigl[(R\hat\beta-R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-R\beta) \le iF_{(i,n-k;\alpha)}s^2\bigr] \tag{30.4.5}$$

$$= \Pr\Bigl[\frac{(R\hat\beta-R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta-R\beta)/i}{s^2} \le F_{(i,n-k;\alpha)}\Bigr] = 1-\alpha \tag{30.4.6}$$

This interpretation with the Mahalanobis distance is commonly used for the construction of t-intervals. A t-interval is a special case of the above confidence region for the case i = 1. The confidence interval with confidence level $1-\alpha$ for the scalar parameter $u = r^\top\beta$, where $r \ne o$ is a vector of constant coefficients, can be written as

$$B(y) = \{u \in \mathbb{R} : |u - r^\top\hat\beta| \le t_{(n-k;\alpha/2)}\, s_{r^\top\hat\beta}\}. \tag{30.4.7}$$

What do those symbols mean? $\hat\beta$ is the least squares estimator of $\beta$.
$t_{(n-k;\alpha/2)}$ is the upper $\alpha/2$-quantile of the t distribution with $n-k$ degrees of freedom, i.e., it is that scalar c for which a random variable t which has a t distribution with $n-k$ degrees of freedom satisfies $\Pr[t \ge c] = \alpha/2$. Since by symmetry $\Pr[t \le -c] = \alpha/2$ as well, one obtains the inequality relevant for a two-sided test:

$$\Pr[|t| \ge t_{(n-k;\alpha/2)}] = \alpha. \tag{30.4.8}$$

Finally, $s_{r^\top\hat\beta}$ is the estimated standard deviation of $r^\top\hat\beta$. It is computed by the following three steps: First write down the variance of $r^\top\hat\beta$:

$$\operatorname{var}[r^\top\hat\beta] = \sigma^2 r^\top(X^\top X)^{-1}r. \tag{30.4.9}$$

Secondly, replace $\sigma^2$ by its unbiased estimator $s^2 = (y-X\hat\beta)^\top(y-X\hat\beta)/(n-k)$, and thirdly take the square root. This gives $s_{r^\top\hat\beta} = s\sqrt{r^\top(X^\top X)^{-1}r}$.

Problem 346. Which element(s) on the right hand side of (30.4.7) depend(s) on y?

Answer. $\hat\beta$ depends on y, and also $s_{r^\top\hat\beta}$ depends on y through $s^2$.

Let us verify that the coverage probability, i.e., the probability that the confidence interval constructed using formula (30.4.7) contains the true value $r^\top\beta$, is, as claimed, $1-\alpha$:

$$\Pr[B(y) \ni r^\top\beta] = \Pr\bigl[|r^\top\hat\beta - r^\top\beta| \le t_{(n-k;\alpha/2)}\, s_{r^\top\hat\beta}\bigr] \tag{30.4.10}$$

$$= \Pr\Bigl[\bigl|r^\top(X^\top X)^{-1}X^\top\varepsilon\bigr| \le t_{(n-k;\alpha/2)}\, s\sqrt{r^\top(X^\top X)^{-1}r}\Bigr] \tag{30.4.11}$$

$$= \Pr\Bigl[\Bigl|\frac{r^\top(X^\top X)^{-1}X^\top\varepsilon}{s\sqrt{r^\top(X^\top X)^{-1}r}}\Bigr| \le t_{(n-k;\alpha/2)}\Bigr] \tag{30.4.12}$$

$$= \Pr\Bigl[\Bigl|\frac{r^\top(X^\top X)^{-1}X^\top\varepsilon}{\sigma\sqrt{r^\top(X^\top X)^{-1}r}} \Big/ \frac{s}{\sigma}\Bigr| \le t_{(n-k;\alpha/2)}\Bigr] = 1-\alpha. \tag{30.4.13}$$

This last equality holds because the expression left of the big slash is a standard normal, and the expression on the right of the big slash is the square root of an independent $\chi^2_{n-k}$ divided by $n-k$. The random variable between the absolute signs has therefore a t-distribution, and (30.4.13) follows from (30.4.8).

In R, one obtains $t_{(n-k;\alpha/2)}$ by giving the command qt(1-alpha/2, n-k). Here qt stands for t-quantile [BCW96, p. 48].
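As a sketch of how (30.4.7) might be evaluated in practice, the following R lines compute a t-interval for $r^\top\beta$ by matrix algebra and compare it with the interval reported by R's built-in confint for the same coefficient. The data are simulated, and all variable names and numerical values (X, y, r, alpha, the true coefficients) are illustrative assumptions rather than part of these notes.

```r
# Sketch: t-interval (30.4.7) for r'beta; simulated data, illustrative names.
set.seed(2)
n <- 30; k <- 3; alpha <- 0.05
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 0.5, -1)) + rnorm(n)
XtXinv <- solve(crossprod(X))
bhat <- drop(XtXinv %*% crossprod(X, y))           # OLS estimate
s2 <- sum((y - drop(X %*% bhat))^2) / (n - k)      # unbiased estimate of sigma^2
r <- c(0, 1, 0)                                    # picks the second coefficient
se <- sqrt(s2 * drop(t(r) %*% XtXinv %*% r))       # estimated std. dev. of r'bhat
tcrit <- qt(1 - alpha/2, n - k)                    # upper alpha/2 quantile
interval <- sum(r * bhat) + c(-1, 1) * tcrit * se
fit <- lm(y ~ X - 1)                               # same regression via lm
print(interval)
print(confint(fit, level = 1 - alpha)[2, ])
```

Both printed intervals should agree up to rounding error, since confint applies the same formula to the coefficient selected by r.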
One needs 1-alpha/2 instead of alpha/2 because it is the usual convention for quantiles (or cumulative distribution functions) to be defined as lower quantiles, i.e., as the probabilities of a random variable being ≤ a given number, while test statistics are usually designed in such a way that the significant values are the high values, i.e., for testing one needs the upper quantiles.

There is a basic duality between confidence intervals and hypothesis tests. Chapter 31 is therefore a discussion of the same subject under a slightly different angle:

CHAPTER 31

Three Principles for Testing a Linear Constraint

We work in the model $y = X\beta + \varepsilon$ with normally distributed errors $\varepsilon \sim N(o, \sigma^2 I)$. There are three basic approaches to test the null hypothesis $R\beta = u$. In the linear model, these three approaches are mathematically equivalent, but if one goes over to nonlinear least squares or maximum likelihood estimators, they lead to different (although asymptotically equivalent) tests.

(1) (“Wald Criterion”) Compute the vector of OLS estimates $\hat\beta$, and reject the null hypothesis if $R\hat\beta$ is “too far away” from u. For this criterion one only needs the unconstrained estimator, not the constrained one.

(2) (“Likelihood Ratio Criterion”) Estimate the model twice: once with the constraint $R\beta = u$, and once without the constraint. Reject the null hypothesis if the model with the constraint imposed has a much worse fit than the model without the constraint.

(3) (“Lagrange Multiplier Criterion”) This third criterion is based on the constrained estimator only. It has two variants. In its “score test” variant, one rejects the null hypothesis if the vector of derivatives of the unconstrained least squares objective function, evaluated at the constrained estimate $\hat{\hat\beta}$, is too far away from o. In the variant which has given this criterion its name, one rejects if the vector of Lagrange multipliers needed for imposing the constraint is too far away from o.
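In the linear model the first two criteria lead to the same number. The following R sketch, on simulated data, computes the studentized distance of $R\hat\beta$ from u divided by the number of constraints (criterion 1), and the relative increase of the SSE when the constraint is imposed, scaled in the conventional F form (criterion 2). The constraint matrix R, the vector u and all other names and values are illustrative assumptions; the constrained estimate uses the formula for $\hat{\hat\beta}$ quoted in Section 30.1.

```r
# Sketch: Wald-type and fit-comparison statistics for H0: R beta = u.
set.seed(3)
n <- 40; k <- 3
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 0.5, -0.5)) + rnorm(n)
R <- rbind(c(0, 1, 0), c(0, 0, 1)); u <- c(0, 0)    # assumed H0: both slopes zero
i <- nrow(R)
XtXinv <- solve(crossprod(X))
bhat <- drop(XtXinv %*% crossprod(X, y))            # unconstrained OLS
sse_u <- sum((y - drop(X %*% bhat))^2)
s2 <- sse_u / (n - k)
d <- drop(R %*% bhat) - u
A <- R %*% XtXinv %*% t(R)
wald <- drop(t(d) %*% solve(A, d)) / (i * s2)       # criterion (1)
bcon <- bhat - drop(XtXinv %*% t(R) %*% solve(A, d))  # constrained OLS
sse_r <- sum((y - drop(X %*% bcon))^2)
fstat <- ((sse_r - sse_u) / i) / (sse_u / (n - k))  # criterion (2)
pval <- pf(fstat, i, n - k, lower.tail = FALSE)     # upper tail of F(i, n-k)
print(c(wald = wald, fstat = fstat, pval = pval))
```

The two statistics coincide up to rounding error, so either one can be compared with the upper quantile of the F distribution with i and n − k degrees of freedom.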
Many textbooks inadvertently and implicitly distinguish between (1) and (2) as follows: they introduce the t-test for one parameter by principle (1), and the F-test for several parameters by principle (2). Later, the student is surprised to find out that the t-test and the F-test in one dimension are equivalent, i.e., that the difference between t-test and F-test has nothing to do with the dimension of the parameter vector to be tested. Some textbooks make the distinction between (1) and (2) explicit. For instance [Chr87, p. 29ff] distinguishes between “testing linear parametric functions” and “testing models.” However, the distinction between all three principles was introduced into the linear model only after the discovery that these three principles give different but asymptotically equivalent tests in Maximum Likelihood estimation. Compare [DM93, Chapter 3.6] about this.

31.1. Mathematical Detail of the Three Approaches

(1) For the “Wald criterion” we must specify what it means that $R\hat\beta$ is “too far away” from $u$. The Mahalanobis distance gives such a criterion: If the true $\beta$ satisfies $R\beta = u$, then $R\hat\beta \sim (u, \sigma^2 R(X^\top X)^{-1}R^\top)$, and the Mahalanobis distance of the observed value of $R\hat\beta$ from this distribution is a logical candidate for the Wald criterion. The only problem is that $\sigma^2$ is not known, therefore we have to use the “studentized” Mahalanobis distance in which $\sigma^2$ is replaced by $s^2$. Conventionally, in the context of linear regression, the Mahalanobis distance is also divided by the number of degrees of freedom; this normalizes its expected value to 1. Replacing $\sigma^2$ by $s^2$ and dividing by $i$ gives the test statistic

(31.1.1)  $\dfrac{1}{i}\,(R\hat\beta - u)^\top \bigl(R(X^\top X)^{-1}R^\top s^2\bigr)^{-1} (R\hat\beta - u).$

(2) Here are the details for the second approach, the “goodness-of-fit criterion.” In order to compare the fit of the models, we look at the attained SSEs.
Of course, the constrained $SSE_r$ is never smaller than the unconstrained $SSE_u$, even if the true parameter vector satisfies the constraint. But if we divide $SSE_r$ by its degrees of freedom $n + i - k$, it is an unbiased estimator of $\sigma^2$ if the constraint holds and it is biased upwards if the constraint does not hold. The unconstrained $SSE_u$, divided by its degrees of freedom, on the other hand, is always an unbiased estimator of $\sigma^2$. If the constraint holds, the SSEs divided by their respective degrees of freedom should give roughly equal numbers. According to this, a feasible test statistic would be

(31.1.2)  $\dfrac{SSE_r/(n+i-k)}{SSE_u/(n-k)}$

and one would reject if this is too much $> 1$. The following variation of this is more convenient, since its distribution does not depend on $n$, $k$ and $i$ separately, but only through $n-k$ and $i$.

(31.1.3)  $\dfrac{(SSE_r - SSE_u)/i}{SSE_u/(n-k)}$

It still has the property that the numerator is an unbiased estimator of $\sigma^2$ if the constraint holds and biased upwards if the constraint does not hold, and the denominator is always an unbiased estimator. Furthermore, in this variation, the numerator and denominator are independent random variables. If this test statistic is much larger than 1, then the constraints are incompatible with the data and the null hypothesis must be rejected. The statistic (31.1.3) can also be written as

(31.1.4)  $\dfrac{(SSE_{\text{constrained}} - SSE_{\text{unconstrained}})/\text{number of constraints}}{SSE_{\text{unconstrained}}/(\text{number of observations} - \text{number of coefficients in unconstrained model})}$

The equivalence of formulas (31.1.1) and (31.1.4) is a simple consequence of (22.7.2).

(3) And here are the details about the score test variant of the Lagrange multiplier criterion: The Jacobian of the least squares objective function is

(31.1.5)  $\dfrac{\partial}{\partial\beta^\top}\,(y - X\beta)^\top (y - X\beta) = -2(y - X\beta)^\top X.$

This is a row vector consisting of all the partial derivatives.
Taking its transpose, in order to get a column vector, and plugging the constrained least squares estimate $\hat{\hat\beta}$ into it gives $-2X^\top(y - X\hat{\hat\beta})$. Again we need the Mahalanobis distance of this observed value from the distribution which the random variable

(31.1.6)  $-2X^\top (y - X\hat{\hat\beta})$

has if the true $\beta$ satisfies $R\beta = u$. If this constraint is satisfied, $\hat{\hat\beta}$ is unbiased, therefore (31.1.6) has expected value zero. Furthermore, if one premultiplies (22.7.1) by $X^\top$ one gets $X^\top(y - X\hat{\hat\beta}) = R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u)$, therefore $V[X^\top(y - X\hat{\hat\beta})] = \sigma^2 R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}R$; and now one can see that $\frac{1}{\sigma^2}(X^\top X)^{-1}$ is a g-inverse of this covariance matrix. Therefore the Mahalanobis distance of the observed value from the distribution is

(31.1.7)  $\dfrac{1}{\sigma^2}\,(y - X\hat{\hat\beta})^\top X (X^\top X)^{-1} X^\top (y - X\hat{\hat\beta}).$

The Lagrange multiplier statistic is based on the restricted estimator alone. If one wanted to take this principle seriously one would have to replace $\sigma^2$ by the unbiased estimate from the restricted model to get the “score form” of the Lagrange Multiplier Test statistic. But in the linear model this has the consequence that the denominator in the test statistic is no longer independent of the numerator, and since the test statistic is a function of the ratio of the constrained and unconstrained estimates of $\sigma^2$ anyway, one will only get yet another monotonic transformation of the same test statistic. If one were to use the unbiased estimate from the unrestricted model, one would exactly get the Wald statistic back, as one can verify using (22.3.13).

This same statistic can also be motivated in terms of the Lagrange multipliers, and this is where this testing principle gets its name, although the applications usually use the score form. According to (22.3.12), the Lagrange multiplier is $\lambda = 2\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat\beta - u)$.
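If the constraint holds, then $E[\lambda] = o$, and $V[\lambda] = 4\sigma^2\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}$. The Mahalanobis distance of the observed value from this distribution is

(31.1.8)  $\lambda^\top (V[\lambda])^{-1} \lambda = \dfrac{1}{4\sigma^2}\,\lambda^\top R(X^\top X)^{-1}R^\top \lambda.$

Using (22.7.1) one can verify that this is the same as (31.1.7).

Problem 347. Show that (31.1.7) is equal to the righthand side of (31.1.8).

Problem 348. 10 points Prove that $\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon} - \hat\varepsilon^\top\hat\varepsilon$ can be written alternatively in the following five ways:

(31.1.9)  $\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon} - \hat\varepsilon^\top\hat\varepsilon = (\hat\beta - \hat{\hat\beta})^\top X^\top X (\hat\beta - \hat{\hat\beta})$

(31.1.10)  $= (R\hat\beta - u)^\top \bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1} (R\hat\beta - u)$

(31.1.11)  $= \dfrac{1}{4}\,\lambda^\top R(X^\top X)^{-1}R^\top \lambda$

(31.1.12)  $= \hat{\hat\varepsilon}^\top X (X^\top X)^{-1} X^\top \hat{\hat\varepsilon}$

(31.1.13)  $= (\hat{\hat\varepsilon} - \hat\varepsilon)^\top (\hat{\hat\varepsilon} - \hat\varepsilon)$

Furthermore show that

(31.1.14)  $X^\top X$ is $\sigma^2$ times a g-inverse of $V[\hat\beta - \hat{\hat\beta}]$

(31.1.15)  $\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}$ is $\sigma^2$ times the inverse of $V[R\hat\beta - u]$

(31.1.16)  $\frac{1}{4} R(X^\top X)^{-1}R^\top$ is $\sigma^2$ times the inverse of $V[\lambda]$

(31.1.17)  $(X^\top X)^{-1}$ is $\sigma^2$ times a g-inverse of $V[X^\top(y - X\hat{\hat\beta})]$

(31.1.18)  $I$ is $\sigma^2$ times a g-inverse of $V[\hat{\hat\varepsilon} - \hat\varepsilon]$

and show that $-2X^\top(y - X\hat{\hat\beta})$ is the gradient of the SSE objective function evaluated at $\hat{\hat\beta}$. By the way, one should be a little careful in interpreting (31.1.12) because $X(X^\top X)^{-1}X^\top$ is not $\sigma^2$ times the g-inverse of $V[\hat{\hat\varepsilon}]$.

Answer.

(31.1.19)  $\hat{\hat\varepsilon} = y - X\hat{\hat\beta} = X\hat\beta + \hat\varepsilon - X\hat{\hat\beta} = X(\hat\beta - \hat{\hat\beta}) + \hat\varepsilon,$

and since $X^\top\hat\varepsilon = o$, the righthand decomposition is an orthogonal decomposition. This gives (31.1.9) above:

(31.1.20)  $\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon} = (\hat\beta - \hat{\hat\beta})^\top X^\top X (\hat\beta - \hat{\hat\beta}) + \hat\varepsilon^\top\hat\varepsilon.$

Using (22.3.13) one obtains $V[\hat\beta - \hat{\hat\beta}] = \sigma^2 (X^\top X)^{-1}R^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}R(X^\top X)^{-1}$. This is a singular matrix, and one verifies immediately that $\frac{1}{\sigma^2}X^\top X$ is a g-inverse of it.

To obtain (31.1.10), which is (22.7.2), one has to plug (22.3.13) into (31.1.20). Clearly, $V[R\hat\beta - u] = \sigma^2 R(X^\top X)^{-1}R^\top$. For (31.1.11) one needs the formula for the Lagrange multiplier (22.3.12).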
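The equivalence of the Wald form (31.1.1), the SSE comparison (31.1.4), and the score form built on (31.1.12) can be checked numerically. The following is a minimal sketch under assumptions that are not in the notes: a simple regression $y = a + bx$ with the single constraint $b = u$ (so $i = 1$, $k = 2$), a hypothetical function name, and made-up data. The score form is studentized with the unrestricted $s^2$, which according to the text gives the Wald statistic back.

```python
import math


def slope_test_statistics(x, y, u):
    """F statistic for H0: slope = u in y = a + b*x, computed three ways."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

    # Unconstrained OLS fit and its residual sum of squares.
    b_hat = sxy / sxx
    a_hat = ybar - b_hat * xbar
    sse_u = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse_u / (n - 2)                      # k = 2 coefficients

    # Constrained fit: slope fixed at u, intercept still free.
    a_con = ybar - u * xbar
    resid_con = [yi - a_con - u * xi for xi, yi in zip(x, y)]
    sse_r = sum(e ** 2 for e in resid_con)

    # (31.1.1) Wald form; here R (X'X)^{-1} R' = 1/sxx and i = 1.
    f_wald = (b_hat - u) ** 2 * sxx / s2
    # (31.1.4) comparison of constrained and unconstrained SSE.
    f_fit = (sse_r - sse_u) / (sse_u / (n - 2))
    # (31.1.12) squared length of the projection of the constrained
    # residuals on the column space of X = [1, x], divided by s2.
    proj_sq = sum(resid_con) ** 2 / n + sum(
        (xi - xbar) * e for xi, e in zip(x, resid_con)) ** 2 / sxx
    f_score = proj_sq / s2
    return f_wald, f_fit, f_score


# Hypothetical data.
f_wald, f_fit, f_score = slope_test_statistics(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1], u=1.0)
assert math.isclose(f_wald, f_fit, rel_tol=1e-8)
assert math.isclose(f_wald, f_score, rel_tol=1e-8)
```

The projection in the score form uses the orthogonal basis $\iota$, $x - \iota\bar x$ of the column space of $X$ instead of forming $(X^\top X)^{-1}$ explicitly; this is a convenience of the two-column case, not a general recipe.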
The test statistic defined alternatively either by (31.1.1) or (31.1.4) or (31.1.7) or (31.1.8) has the following nice properties:

• $E(SSE_u) = E(\hat\varepsilon^\top\hat\varepsilon) = \sigma^2(n-k)$, which holds whether or not the constraint is true. Furthermore it was shown earlier that

(31.1.21)  $E(SSE_r - SSE_u) = \sigma^2 i + (R\beta - u)^\top \bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1} (R\beta - u),$

i.e., this expected value is equal to $\sigma^2 i$ if the constraint is true, and larger otherwise. If one divides $SSE_u$ and $SSE_r - SSE_u$ by their respective degrees of freedom, as is done in (31.1.4), one obtains therefore: the denominator is always an unbiased estimator of $\sigma^2$, regardless of whether the null hypothesis is true or not. The numerator is an unbiased estimator of $\sigma^2$ when the null hypothesis is correct, and has a positive bias otherwise.

• If the distribution of $\varepsilon$ is normal, then numerator and denominator are independent. The numerator is a function of $\hat\beta$ and the denominator one of $\hat\varepsilon$, and $\hat\beta$ and $\hat\varepsilon$ are independent.

• Again under the assumption of normality, numerator and denominator are distributed as $\sigma^2\chi^2$ with $i$ and $n-k$ degrees of freedom, divided by their respective degrees of freedom. If one divides them, the common factor $\sigma^2$ cancels out, and the ratio has an F distribution. Since both numerator and denominator have the same expected value $\sigma^2$, the value of this F statistic should be in the order of magnitude of 1. If it is much larger than that, the null hypothesis is to be rejected. (Precise values are in the F-tables.)

31.2. Examples of Tests of Linear Hypotheses

Some tests can be read off directly from the computer printouts. One example is the t-test for an individual component of $\beta$. The situation is $y = X\beta + \varepsilon$, where $\beta = [\beta_1 \cdots \beta_k]^\top$, and we want to test $\beta_j = u$. Here $R = e_j^\top = [\,0 \cdots 0\ 1\ 0 \cdots 0\,]$, with the 1 in the $j$th place, and $u$ is the 1-vector $u$, and $i = 1$. Therefore $R(X^\top X)^{-1}R^\top = d^{jj}$, the $j$th diagonal element of $(X^\top X)^{-1}$, and (31.1.1) becomes

(31.2.1)  $\dfrac{(\hat\beta_j - u)^2}{s^2 d^{jj}} \sim F_{1,n-k}$ when $H$ is true.
This is the square of a random variable which has a t-distribution:

(31.2.2)  $\dfrac{\hat\beta_j - u}{s\sqrt{d^{jj}}} \sim t_{n-k}$ when $H$ is true.

This latter test statistic is simply $\hat\beta_j - u$ divided by the estimated standard deviation of $\hat\beta_j$.

If one wants to test that a certain linear combination of the parameter values is equal to (or bigger than or smaller than) a given value, say $r^\top\beta = u$, one can use a t-test as well. The test statistic is, again, simply $r^\top\hat\beta - u$ divided by the estimated standard deviation of $r^\top\hat\beta$:

(31.2.3)  $\dfrac{r^\top\hat\beta - u}{s\sqrt{r^\top (X^\top X)^{-1} r}} \sim t_{n-k}$ when $H$ is true.

By this one can for instance also test whether the sum of certain regression coefficients is equal to 1, or whether two regression coefficients are equal to each other (but not the hypothesis that three coefficients are equal to each other).

Many textbooks use the Wald criterion to derive the t-test, and the Likelihood Ratio criterion to derive the F-test. Our approach showed that the Wald criterion can be used for simultaneous testing of several hypotheses as well. The t-test is equivalent to an F-test if only one hypothesis is tested, i.e., if $R$ is a row vector. The only difference is that with the t-test one can test one-sided hypotheses, with the F-test one cannot.

Next let us discuss the test for the existence of a relationship, “the” F-test which every statistics package performs automatically whenever the regression has a constant term: it is the test whether all the slope parameters are zero, such that only the intercept may take a nonzero value.

Problem 349. 4 points In the model $y = X\beta + \varepsilon$ with intercept, show that the test statistic for testing whether all the slope parameters are zero is

(31.2.4)  $\dfrac{(y^\top X\hat\beta - n\bar y^2)/(k-1)}{(y^\top y - y^\top X\hat\beta)/(n-k)}$

This is [Seb77, equation (4.26) on p. 110]. What is the distribution of this test statistic if the null hypothesis is true (i.e., if all the slope parameters are zero)?

Answer. The distribution is $\sim F_{k-1,n-k}$.
(31.2.4) is most conveniently derived from (31.1.4). In the constrained model, which has only a constant term and no other explanatory variables, i.e., $y = \iota\mu + \varepsilon$, the BLUE is $\hat\mu = \bar y$. Therefore the constrained residual sum of squares $SSE_{\text{const.}}$ is what is commonly called SST (“total” or, more precisely, “corrected total” sum of squares):

(31.2.5)  $SSE_{\text{const.}} = SST = (y - \iota\bar y)^\top (y - \iota\bar y) = y^\top (y - \iota\bar y) = y^\top y - n\bar y^2$

while the unconstrained residual sum of squares is what is usually called SSE:

(31.2.6)  $SSE_{\text{unconst.}} = SSE = (y - X\hat\beta)^\top (y - X\hat\beta) = y^\top (y - X\hat\beta) = y^\top y - y^\top X\hat\beta.$

This last equation holds because $X^\top(y - X\hat\beta) = X^\top\hat\varepsilon = o$. A more elegant way is perhaps

(31.2.7)  $SSE_{\text{unconst.}} = SSE = \hat\varepsilon^\top\hat\varepsilon = y^\top M^\top M y = y^\top M y = y^\top y - y^\top X (X^\top X)^{-1} X^\top y = y^\top y - y^\top X\hat\beta$

According to (14.3.12) we can write $SSR = SST - SSE$, therefore the F-statistic is

(31.2.8)  $\dfrac{SSR/(k-1)}{SSE/(n-k)} = \dfrac{(y^\top X\hat\beta - n\bar y^2)/(k-1)}{(y^\top y - y^\top X\hat\beta)/(n-k)} \sim F_{k-1,n-k}$ if $H_0$ is true.

Problem 350. 2 points Can one compute the value of the F-statistic testing for the existence of a relationship if one only knows the coefficient of determination $R^2 = SSR/SST$, the number of observations $n$, and the number of regressors (counting the constant term as one of the regressors) $k$?

Answer.

(31.2.9)  $F = \dfrac{SSR/(k-1)}{SSE/(n-k)} = \dfrac{n-k}{k-1}\,\dfrac{SSR}{SST - SSR} = \dfrac{n-k}{k-1}\,\dfrac{R^2}{1 - R^2}.$

Other, similar F-tests are: the F-test that all among a number of additional variables have the coefficient zero, the F-test that three or more coefficients are equal. One can use the t-test for testing whether two coefficients are equal, but not for three. It may be possible that the t-test for $\beta_1 = \beta_2$ does not reject and the t-test for $\beta_2 = \beta_3$ does not reject either, but the t-test for $\beta_1 = \beta_3$ does reject!

Problem 351. 4 points [Seb77, exercise 4b.5 on p.
109/10] In the model $y = \beta + \varepsilon$ with $\varepsilon \sim N(o, \sigma^2 I)$ and subject to the constraint $\iota^\top\beta = 0$, which we had in Problem 291, compute the test statistic for the hypothesis $\beta_1 = \beta_3$.

Answer. In this problem, the “unconstrained” model for the purposes of testing is already constrained, it is subject to the constraint $\iota^\top\beta = 0$. The “constrained” model has the additional constraint $R\beta = [\,1\ 0\ {-1}\ 0 \cdots 0\,]\begin{bmatrix}\beta_1\\ \vdots\\ \beta_k\end{bmatrix} = 0$. In Problem 291 we computed the “unconstrained” estimates $\hat\beta = y - \iota\bar y$ and $s^2 = n\bar y^2 = (y_1 + \cdots + y_n)^2/n$. You are allowed to use this without proving it again. Therefore $R\hat\beta = y_1 - y_3$; its variance is $2\sigma^2$, and the F test statistic is $\dfrac{n(y_1 - y_3)^2}{2(y_1 + \cdots + y_n)^2} \sim F_{1,1}$. The “unconstrained” model had 4 parameters subject to one constraint, therefore it had 3 free parameters, i.e., $k = 3$, $n = 4$, and $i = 1$.

Another important F-test is the “Chow test” named after its popularizer Chow [Cho60]: it tests whether two regressions have equal coefficients (assuming that the disturbance variances are equal). For this one has to run three regressions. If the first regression has $n_1$ observations and sum of squared errors $SSE_1$, and the second regression $n_2$ observations and $SSE_2$, and the combined regression (i.e., the restricted model) has $SSE_r$, then the test statistic is

(31.2.10)  $\dfrac{(SSE_r - SSE_1 - SSE_2)/k}{(SSE_1 + SSE_2)/(n_1 + n_2 - 2k)}.$

If $n_2 < k$, the second regression cannot be run by itself. In this case, the unconstrained model has “too many” parameters: they give an exact fit for the second group of observations ($SSE_2 = 0$), and in addition not all parameters are identified. In effect this second regression has only $n_2$ parameters. These parameters can be considered dummy variables for every observation, i.e., this test can be interpreted to be a test whether the $n_2$ additional observations come from the same population as the $n_1$ first ones. The test statistic becomes

(31.2.11)  $\dfrac{(SSE_r - SSE_1)/n_2}{SSE_1/(n_1 - k)}.$

This latter is called the “predictive Chow test,” because in its Wald version it looks at the prediction errors involving observations in the second regression.

The following is a special case of the Chow test, in which one can give a simple formula for the test statistic.

Problem 352. Assume you have $n_1$ observations $u_j \sim N(\mu_1, \sigma^2)$ and $n_2$ observations $v_j \sim N(\mu_2, \sigma^2)$, all independent of each other, and you want to test whether $\mu_1 = \mu_2$. (Note that the variances are known to be equal).

• a. 2 points Write the model in the form $y = X\beta + \varepsilon$.

Answer.

(31.2.12)  $\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \iota_1\mu_1 + \varepsilon_1 \\ \iota_2\mu_2 + \varepsilon_2 \end{bmatrix} = \begin{bmatrix} \iota_1 & o \\ o & \iota_2 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix};$

here $\iota_1$ and $\iota_2$ are vectors of ones of appropriate lengths.

• b. 2 points Compute $(X^\top X)^{-1}$ in this case.

Answer.

(31.2.13)  $X^\top X = \begin{bmatrix} \iota_1^\top & o^\top \\ o^\top & \iota_2^\top \end{bmatrix} \begin{bmatrix} \iota_1 & o \\ o & \iota_2 \end{bmatrix} = \begin{bmatrix} n_1 & 0 \\ 0 & n_2 \end{bmatrix}$

(31.2.14)  $(X^\top X)^{-1} = \begin{bmatrix} \frac{1}{n_1} & 0 \\ 0 & \frac{1}{n_2} \end{bmatrix}$

• c. 2 points Compute $\hat\beta = (X^\top X)^{-1}X^\top y$ in this case.

Answer.

(31.2.15)  $X^\top y = \begin{bmatrix} \iota_1^\top & o^\top \\ o^\top & \iota_2^\top \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n_1} u_i \\ \sum_{j=1}^{n_2} v_j \end{bmatrix}$

(31.2.16)  $\hat\beta = (X^\top X)^{-1}X^\top y = \begin{bmatrix} \frac{1}{n_1} & 0 \\ 0 & \frac{1}{n_2} \end{bmatrix} \begin{bmatrix} \sum_{i=1}^{n_1} u_i \\ \sum_{j=1}^{n_2} v_j \end{bmatrix} = \begin{bmatrix} \bar u \\ \bar v \end{bmatrix}$

• d. 3 points Compute $SSE = (y - X\hat\beta)^\top (y - X\hat\beta)$ and $s^2$, the unbiased estimator of $\sigma^2$, in this case.

Answer.

(31.2.17)  $y - X\hat\beta = \begin{bmatrix} u \\ v \end{bmatrix} - \begin{bmatrix} \iota_1 & o \\ o & \iota_2 \end{bmatrix} \begin{bmatrix} \bar u \\ \bar v \end{bmatrix} = \begin{bmatrix} u - \iota_1\bar u \\ v - \iota_2\bar v \end{bmatrix}$

(31.2.18)  $SSE = \sum_{i=1}^{n_1} (u_i - \bar u)^2 + \sum_{j=1}^{n_2} (v_j - \bar v)^2$

(31.2.19)  $s^2 = \dfrac{\sum_{i=1}^{n_1} (u_i - \bar u)^2 + \sum_{j=1}^{n_2} (v_j - \bar v)^2}{n_1 + n_2 - 2}$

• e. 1 point Next, the hypothesis $\mu_1 = \mu_2$ must be written in the form $R\beta = u$. Since in the present case $R$ has just one row, it should be written as a row-vector $R = r^\top$, and since the vector $u$ has only one component, it should be written as a scalar $u$, i.e., the hypothesis should be written in the form $r^\top\beta = u$. What are $r$ and $u$ in our case?

Answer. Since $\beta = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$, the constraint can be written as

(31.2.20)  $\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = 0$, i.e., $r = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $u = 0$.

• f. 2 points Compute the standard deviation of $r^\top\hat\beta$.

Answer. First compute the variance and then take the square root.
(31.2.21)  $\operatorname{var}[r^\top\hat\beta] = \sigma^2 r^\top (X^\top X)^{-1} r = \sigma^2 \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} \frac{1}{n_1} & 0 \\ 0 & \frac{1}{n_2} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \sigma^2 \Bigl(\dfrac{1}{n_1} + \dfrac{1}{n_2}\Bigr)$

One can also see this without matrix algebra. $\operatorname{var}[\bar u] = \sigma^2\frac{1}{n_1}$, $\operatorname{var}[\bar v] = \sigma^2\frac{1}{n_2}$, and since $\bar u$ and $\bar v$ are independent, the variance of the difference is the sum of the variances.

• g. 2 points Use (31.2.3) to derive the formula for the t-test.

Answer. The test statistic is $\bar u - \bar v$ divided by its estimated standard deviation, i.e.,

(31.2.22)  $\dfrac{\bar u - \bar v}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1+n_2-2}$ when $H$ is true.

Problem 353. [Seb77, exercise 4d-3] Given $n+1$ observations $y_j$ from a $N(\mu, \sigma^2)$. After the first $n$ observations, it is suspected that a sudden change in the mean of the distribution occurred, i.e., that $y_{n+1} \sim N(\nu, \sigma^2)$ with $\nu \ne \mu$. We will use here three different approaches to derive the same test statistic for testing the hypothesis that the $(n+1)$st observation has the same population mean as the previous observations, i.e., that $\nu = \mu$, against the two-sided alternative. The formulas for this statistic should be given in terms of the observations $y_i$. It is recommended to use the notation $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{\bar y} = \frac{1}{n+1}\sum_{j=1}^{n+1} y_j$.

• a. 3 points First you should derive this statistic by testing whether $\nu - \mu = 0$ (the “Wald principle”). For this you must compute the BLUE of $\nu - \mu$ and its standard deviation and construct the t statistic from this.

Answer. BLUE of $\mu$ is $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$, and that of $\nu$ is $y_{n+1}$. BLUE of $\nu - \mu$ is $y_{n+1} - \bar y$; since the test is two-sided, the sign does not matter and we work with $\bar y - y_{n+1}$. Because of independence $\operatorname{var}[\bar y - y_{n+1}] = \operatorname{var}[\bar y] + \operatorname{var}[y_{n+1}] = \sigma^2((1/n) + 1) = \sigma^2(n+1)/n$. Standard deviation is $\sigma\sqrt{(n+1)/n}$.

For the denominator in the t-statistic you need the $s^2$ from the unconstrained regression, which is

(31.2.23)  $s^2 = \dfrac{1}{n-1} \sum_{j=1}^{n} (y_j - \bar y)^2$

What happened to the $(n+1)$st observation here? It always has a zero residual. And the factor $1/(n-1)$ should really be written $1/(n+1-2)$: there are $n+1$ observations and 2 parameters.
Divide $\bar y - y_{n+1}$ by its standard deviation and replace $\sigma$ by $s$ (the square root of $s^2$) to get the t statistic

(31.2.24)  $\dfrac{\bar y - y_{n+1}}{s\sqrt{1 + \frac{1}{n}}}$

• b. 2 points One can interpret this same formula also differently (and this is why this test is sometimes called the “predictive” Chow test). Compute the Best Linear Unbiased Predictor of $y_{n+1}$ on the basis of the first $n$ observations, call it $\hat{\hat y}(n+1)_{n+1}$. Show that the predictive residual $y_{n+1} - \hat{\hat y}(n+1)_{n+1}$, divided by the square root of $\operatorname{MSE}[\hat{\hat y}(n+1)_{n+1}; y_{n+1}]$, with $\sigma$ replaced by $s$ (based on the first $n$ observations only), is equal to the above t statistic.

Answer. BLUP of $y_{n+1}$ based on first $n$ observations is $\bar y$ again. Since it is unbiased, $\operatorname{MSE}[\bar y; y_{n+1}] = \operatorname{var}[\bar y - y_{n+1}] = \sigma^2(n+1)/n$. From now on everything is as in part a.

• c. 6 points Next you should show that the above two formulas are identical to the statistic based on comparing the SSEs of the constrained and unconstrained models (the likelihood ratio principle). Give a formula for the constrained $SSE_r$, the unconstrained $SSE_u$, and the F-statistic.

Answer. According to the Likelihood Ratio principle, one has to compare the residual sums of squares in the regressions under the assumption that the mean did not change with that under the assumption that the mean changed. If the mean did not change (constrained model), then $\bar{\bar y}$ is the OLS estimate of $\mu$. In order to make it easier to derive the difference between constrained and unconstrained SSE, we will write the constrained SSE as follows:

$SSE_r = \sum_{j=1}^{n+1} (y_j - \bar{\bar y})^2 = \sum_{j=1}^{n+1} y_j^2 - (n+1)\bar{\bar y}^2 = \sum_{j=1}^{n+1} y_j^2 - \dfrac{1}{n+1}(n\bar y + y_{n+1})^2$

If one allows the mean to change (unconstrained model), then $\bar y$ is the BLUE of $\mu$, and $y_{n+1}$ is the BLUE of $\nu$.

$SSE_u = \sum_{j=1}^{n} (y_j - \bar y)^2 + (y_{n+1} - y_{n+1})^2 = \sum_{j=1}^{n} y_j^2 - n\bar y^2.$

Now subtract:

$SSE_r - SSE_u = y_{n+1}^2 + n\bar y^2 - \dfrac{1}{n+1}(n\bar y + y_{n+1})^2$

$= y_{n+1}^2 + n\bar y^2 - \dfrac{1}{n+1}\bigl(n^2\bar y^2 + 2n\bar y\, y_{n+1} + y_{n+1}^2\bigr)$

$= \Bigl(1 - \dfrac{1}{n+1}\Bigr) y_{n+1}^2 + \Bigl(n - \dfrac{n^2}{n+1}\Bigr)\bar y^2 - \dfrac{n}{n+1}\, 2\bar y\, y_{n+1}$

$= \dfrac{n}{n+1}\,(y_{n+1} - \bar y)^2.$

Interestingly, this depends on the first $n$ observations only through $\bar y$.

Since the unconstrained model has $n+1$ observations and 2 parameters, the test statistic is

(31.2.25)  $\dfrac{SSE_r - SSE_u}{SSE_u/(n+1-2)} = \dfrac{\frac{n}{n+1}(y_{n+1} - \bar y)^2}{\sum_{j=1}^{n} (y_j - \bar y)^2/(n-1)} = \dfrac{(y_{n+1} - \bar y)^2\, n(n-1)}{\sum_{j=1}^{n} (y_j - \bar y)^2\,(n+1)} \sim F_{1,n-1}$

This is the square of the t statistic (31.2.24).

31.2.1. Goodness of Fit Test.

Problem 354. [Seb77, pp. 117–119] Given a regression model with $k$ independent variables. There are $n$ observations of the vector of independent variables, and for each of these $n$ values there is not one but $r > 1$ different replicated observations of the dependent variable. This model can be written

(31.2.26)  $y_{mq} = \sum_{j=1}^{k} x_{mj}\beta_j + \varepsilon_{mq}$  or  $y_{mq} = x_m^\top\beta + \varepsilon_{mq}$,

where $m = 1, \ldots, n$, $j = 1, \ldots, k$, $q = 1, \ldots, r$, and $x_m^\top$ is the $m$th row of the $X$ matrix. For simplicity we assume that $r$ does not depend on $m$, i.e., each observation of the independent variables has the same number of repetitions. We also assume that the $n \times k$ matrix $X$ has full column rank.

• a. 2 points In this model it is possible to test whether the regression line is in fact a straight line. If it is not a straight line, then each observation of the independent variables $x_m$ has a different coefficient vector $\beta_m$ associated with it, i.e., the model is

(31.2.27)  $y_{mq} = \sum_{j=1}^{k} x_{mj}\beta_{mj} + \varepsilon_{mq}$  or  $y_{mq} = x_m^\top\beta_m + \varepsilon_{mq}$.

This unconstrained model does not have enough information to estimate any of the individual coefficients $\beta_{mj}$. Explain how it is nevertheless still possible to compute $SSE_u$.

Answer.
Even though the individual coefficients $\beta_{mj}$ are not identified, their linear combination $\eta_m = x_m^\top\beta_m = \sum_{j=1}^{k} x_{mj}\beta_{mj}$ is identified; one unbiased estimator, although by far not the best one, is any individual observation $y_{mq}$. This linear combination is all one needs to compute $SSE_u$, the sum of squared errors in the unconstrained model.

• b. 2 points Writing your estimate of $\eta_m = x_m^\top\beta_m$ as $\tilde\eta_m$, give the formula of the sum of squared errors of this estimate, and by taking the first order conditions, show that the unconstrained least squares estimate of $\eta_m$ is $\hat\eta_m = \bar y_{m\cdot}$ for $m = 1, \ldots, n$, where $\bar y_{m\cdot} = \frac{1}{r}\sum_{q=1}^{r} y_{mq}$ (i.e., the dot in the subscript indicates taking the mean).

Answer. If we know the $\tilde\eta_m$, the sum of squared errors no longer depends on the independent observations $x_m$ but is simply

(31.2.28)  $SSE_u = \sum_{m,q} (y_{mq} - \tilde\eta_m)^2$

First order conditions are

(31.2.29)  $\dfrac{\partial}{\partial\tilde\eta_h} \sum_{m,q} (y_{mq} - \tilde\eta_m)^2 = \dfrac{\partial}{\partial\tilde\eta_h} \sum_{q} (y_{hq} - \tilde\eta_h)^2 = -2\sum_{q} (y_{hq} - \tilde\eta_h) = 0$

• c. 1 point The sum of squared errors associated with this least squares estimate is the unconstrained sum of squared errors $SSE_u$. How would you set up a regression with dummy variables which would give you this $SSE_u$?

Answer. The unconstrained model should be regressed in the form $y_{mq} = \eta_m + \varepsilon_{mq}$. I.e., string out the matrix $Y$ as a vector and for each column of $Y$ introduce a dummy variable which is $= 1$ if the given observation was originally in this column.

• d. 2 points Next turn to the constrained model (31.2.26). If $X$ has full column rank, then it is fully identified. Writing $\tilde\beta_j$ for your estimates of $\beta_j$, give a formula for the sum of squared errors of this estimate. By taking the first order conditions, show that the estimate $\hat\beta$ is the same as the estimate in the model without replicated observations

(31.2.30)  $z_m = \sum_{j=1}^{k} x_{mj}\beta_j + \varepsilon_m,$

where $z_m = \bar y_{m\cdot}$ as defined above.

• e.
2 points If $SSE_c$ is the SSE in the constrained model (31.2.26) and $SSE_b$ the SSE in (31.2.30), show that $SSE_c = r \cdot SSE_b + SSE_u$.

Answer. For every $m$ we have $\sum_q (y_{mq} - x_m^\top\hat\beta)^2 = \sum_q (y_{mq} - \bar y_{m\cdot})^2 + r(\bar y_{m\cdot} - x_m^\top\hat\beta)^2$; therefore $SSE_c = \sum_{m,q} (y_{mq} - \bar y_{m\cdot})^2 + r\sum_m (\bar y_{m\cdot} - x_m^\top\hat\beta)^2$.

• f. 3 points Write down the formula of the F-test in terms of $SSE_u$ and $SSE_c$ with a correct accounting of the degrees of freedom, and give this formula also in terms of $SSE_u$ and $SSE_b$.

Answer. Unconstrained model has $n$ parameters, and constrained model has $k$ parameters; the number of additional “constraints” is therefore $n - k$. This gives the F-statistic

(31.2.31)  $\dfrac{(SSE_c - SSE_u)/(n-k)}{SSE_u/\bigl(n(r-1)\bigr)} = \dfrac{r\,SSE_b/(n-k)}{SSE_u/\bigl(n(r-1)\bigr)}$

31.3. The F-Test Statistic is a Function of the Likelihood Ratio

Problem 355. The critical region of the generalized likelihood ratio test can be written as

(31.3.1)  $C = \Bigl\{ y_1, \ldots, y_n : \dfrac{\sup_{\theta\in\Omega} \ell(y_1, \ldots, y_n; \theta_1, \ldots, \theta_k)}{\sup_{\theta\in\omega} \ell(y_1, \ldots, y_n; \theta_1, \ldots, \theta_k)} \ge k \Bigr\},$

where $\omega$ refers to the null and $\Omega$ to the alternative hypothesis (it is assumed that the hypotheses are nested, i.e., $\omega \subset \Omega$). In other words, one rejects the hypothesis if the maximal achievable likelihood level with the restriction imposed is much lower than that without the restriction. If $\hat\theta$ is the unrestricted and $\hat{\hat\theta}$ the restricted maximum likelihood estimator, then the test statistic is

(31.3.2)  $LR = 2\bigl(\log \ell(y, \hat\theta) - \log \ell(y, \hat{\hat\theta})\bigr) \to \chi^2_i$

where $i$ is the number of restrictions. In this exercise we are proving that the F-test in the linear model is equivalent to the generalized likelihood ratio test. (You should assume here that both $\beta$ and $\sigma^2$ are unknown.) All this is in [Gre97, p. 304].

• a. 1 point Since we only have constraints on $\beta$ and not on $\sigma^2$, it makes sense to first compute the concentrated likelihood function with $\sigma^2$ concentrated out.
Derive the formula for this concentrated likelihood function which is given in [Gre97, just above (6.88)].

Answer.

(31.3.3)  Concentrated $\log \ell(y; \beta) = -\dfrac{n}{2}\Bigl(1 + \log 2\pi + \log\bigl(\tfrac{1}{n}(y - X\beta)^\top (y - X\beta)\bigr)\Bigr)$

• b. 2 points In the case of a linear restriction, show that $LR$ is connected with the F-statistic $F$ as follows:

(31.3.4)  $LR = n \log\Bigl(1 + \dfrac{i}{n-k}\,F\Bigr)$

Answer. $LR = -n\Bigl(\log \tfrac{1}{n}\hat\varepsilon^\top\hat\varepsilon - \log \tfrac{1}{n}\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon}\Bigr) = n \log \dfrac{\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon}}{\hat\varepsilon^\top\hat\varepsilon}$. In order to connect this with the F statistic note that

(31.3.5)  $F = \dfrac{n-k}{i}\Bigl(\dfrac{\hat{\hat\varepsilon}^\top\hat{\hat\varepsilon}}{\hat\varepsilon^\top\hat\varepsilon} - 1\Bigr)$

31.4. Tests of Nonlinear Hypotheses

Make a linear approximation of the constraint; one needs the Jacobian for this. Here is an example where a nonlinear hypothesis arises naturally:

Problem 356. [Gre97, Example 7.14 on p. 361]: The model

(31.4.1)  $C_t = \alpha + \beta Y_t + \gamma C_{t-1} + \varepsilon_t$

has different long run and short run propensities to consume. Give formulas for both.

Answer. Short-run is $\beta$; to compute the long run propensity, which would prevail in the stationary state when $C_t = C_{t-1}$, write $C_\infty = \alpha + \beta Y_\infty + \gamma C_\infty + \varepsilon_\infty$ or $C_\infty(1 - \gamma) = \alpha + \beta Y_\infty + \varepsilon_\infty$ or $C_\infty = \alpha/(1-\gamma) + \beta/(1-\gamma)\,Y_\infty + \varepsilon_\infty/(1-\gamma)$. Therefore long run propensity is $\delta = \beta/(1-\gamma)$.

31.5. Choosing Between Nonnested Models

Throwing all regressors into the same regression is a straightforward way out but not very good. The J-test (the J comes from “joint”) is better: throw the predicted values of one of the two models as a regressor into the other model and test whether this predicted value has a nonzero coefficient. Here is more detail: if the null hypothesis is that model 1 is right, then throw the predicted value of model 2 into model 1 and test the null hypothesis that the coefficient of this predicted value is zero. If Model 1 is right, then this additional regressor leaves all other estimators unbiased, and the true coefficient of the additional regressor is 0.
If Model 2 is right, then asymptotically, this additional regressor should be the only regressor in the combined model with a nonzero coefficient (its coefficient is $= 1$ asymptotically, and all the other regressors should have coefficient zero.) Whenever nonnested hypotheses are tested, it is possible that both hypotheses are rejected, or that neither hypothesis is rejected by this criterion.

CHAPTER 32

Instrumental Variables

Compare here [DM93, chapter 7] and [Gre97, Section 6.7.8]. Greene first introduces the simple instrumental variables estimator and then shows that the generalized one picks out the best linear combinations for forming simple instruments. I will follow [DM93] and first introduce the generalized instrumental variables estimator, and then go down to the simple one.

In this chapter, we will discuss a sequence of models $y_n = X_n\beta + \varepsilon_n$, where $\varepsilon_n \sim (o_n, \sigma^2 I_n)$, and $X_n$ are $n \times k$-matrices of random regressors, and the number of observations $n \to \infty$. We do not make the assumption $\operatorname{plim} \frac{1}{n} X_n^\top \varepsilon_n = o$ which would ensure consistency of the OLS estimator (compare Problem 328). Instead, a sequence of $n \times m$ matrices of (random or nonrandom) “instrumental variables” $W_n$ is available which satisfies the following three conditions:

(32.0.1)  $\operatorname{plim} \frac{1}{n} W_n^\top \varepsilon_n = o$

(32.0.2)  $\operatorname{plim} \frac{1}{n} W_n^\top W_n = Q$ exists, is nonrandom and nonsingular

(32.0.3)  $\operatorname{plim} \frac{1}{n} W_n^\top X_n = D$ exists, is nonrandom and has full column rank

Full column rank in (32.0.3) is only possible if $m \ge k$.

In this situation, regression of $y$ on $X$ is inconsistent. But if one regresses $y$ on the projection of $X$ on $R[W]$, the column space of $W$, one obtains a consistent estimator. This is called the instrumental variables estimator.

If $x_i$ is the $i$th column vector of $X$, then $W(W^\top W)^{-1}W^\top x_i$ is the projection of $x_i$ on the space spanned by the columns of $W$. Therefore the matrix $W(W^\top W)^{-1}W^\top X$ consists of the columns of $X$ projected on $R[W]$. This is what we meant by the projection of $X$ on $R[W]$.
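The projection step can be illustrated with one regressor and one instrument. The following is only a sketch under assumptions that are not in the notes: the data are hypothetical and constructed so that $\varepsilon$ is exactly orthogonal to $w$ in the sample (a finite-sample stand-in for condition (32.0.1)) but not orthogonal to $x$, and the helper names are made up. Regressing $y$ on the projection of $x$ on $w$ then recovers $\beta$, while regressing $y$ on $x$ itself does not.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def ols_no_intercept(x, y):
    """Slope from regressing y on the single regressor x (no intercept)."""
    return dot(x, y) / dot(x, x)


def iv_by_projection(x, w, y):
    """Regress y on the projection of x on the single instrument w."""
    coef = dot(w, x) / dot(w, w)          # (w'w)^{-1} w'x
    x_hat = [coef * wi for wi in w]       # projection w (w'w)^{-1} w'x
    return ols_no_intercept(x_hat, y)


# Hypothetical data: eps is orthogonal to w but correlated with x.
w = [1.0, -1.0, 1.0, -1.0]
eps = [1.0, 1.0, -1.0, -1.0]
x = [2.0 * wi + ei for wi, ei in zip(w, eps)]
beta = 2.0
y = [beta * xi + ei for xi, ei in zip(x, eps)]

b_ols = ols_no_intercept(x, y)        # pulled away from beta because x'eps != 0
b_iv = iv_by_projection(x, w, y)      # coincides with the ratio w'y / w'x
```

With a single instrument the projection estimator reduces to $w^\top y / w^\top x$, which is the simple instrumental variables estimator mentioned at the beginning of the chapter.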
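With these projections as regressors, the vector of regression coefficients becomes the “generalized instrumental variables estimator”

(32.0.4)  $\tilde\beta = \bigl(X^\top W (W^\top W)^{-1} W^\top X\bigr)^{-1} X^\top W (W^\top W)^{-1} W^\top y$

Problem 357. 3 points We are in the model $y = X\beta + \varepsilon$ and we have a matrix $W$ of “instrumental variables” which satisfies the following three conditions: $\operatorname{plim} \frac{1}{n} W^\top\varepsilon = o$, $\operatorname{plim} \frac{1}{n} W^\top W = Q$ exists, is nonrandom and positive definite, and $\operatorname{plim} \frac{1}{n} W^\top X = D$ exists, is nonrandom and has full column rank. Show that the instrumental variables estimator

(32.0.5)  $\tilde\beta = \bigl(X^\top W (W^\top W)^{-1} W^\top X\bigr)^{-1} X^\top W (W^\top W)^{-1} W^\top y$

is consistent. Hint: Write $\tilde\beta_n - \beta = B_n \cdot \frac{1}{n} W^\top\varepsilon$ and show that the sequence of matrices $B_n$ has a plim.

Answer. Write it as

$\tilde\beta_n = \bigl(X^\top W (W^\top W)^{-1} W^\top X\bigr)^{-1} X^\top W (W^\top W)^{-1} W^\top (X\beta + \varepsilon)$

$= \beta + \bigl(X^\top W (W^\top W)^{-1} W^\top X\bigr)^{-1} X^\top W (W^\top W)^{-1} W^\top \varepsilon$

$= \beta + \Bigl( (\tfrac{1}{n} X^\top W)(\tfrac{1}{n} W^\top W)^{-1}(\tfrac{1}{n} W^\top X) \Bigr)^{-1} (\tfrac{1}{n} X^\top W)(\tfrac{1}{n} W^\top W)^{-1}\, \tfrac{1}{n} W^\top \varepsilon,$

i.e., the $B_n$ and $B$ of the hint are as follows:

$B_n = \Bigl( (\tfrac{1}{n} X^\top W)(\tfrac{1}{n} W^\top W)^{-1}(\tfrac{1}{n} W^\top X) \Bigr)^{-1} (\tfrac{1}{n} X^\top W)(\tfrac{1}{n} W^\top W)^{-1}$

$B = \operatorname{plim} B_n = (D^\top Q^{-1} D)^{-1} D^\top Q^{-1}$

Problem 358. Assume $\operatorname{plim} \frac{1}{n} X^\top X$ exists, and $\operatorname{plim} \frac{1}{n} X^\top\varepsilon$ exists. (We only need the existence, not that the first is nonsingular and the second zero). Show that $\sigma^2$ can be estimated consistently by $s^2 = \frac{1}{n}(y - X\tilde\beta)^\top (y - X\tilde\beta)$.

Answer. $y - X\tilde\beta = X\beta + \varepsilon - X\tilde\beta = \varepsilon - X(\tilde\beta - \beta)$. Therefore

$\dfrac{1}{n}(y - X\tilde\beta)^\top (y - X\tilde\beta) = \dfrac{1}{n}\varepsilon^\top\varepsilon - \dfrac{2}{n}\varepsilon^\top X(\tilde\beta - \beta) + (\tilde\beta - \beta)^\top \Bigl(\dfrac{1}{n} X^\top X\Bigr)(\tilde\beta - \beta).$

All summands have plims, the plim of the first is $\sigma^2$ and those of the other two are zero.

Problem 359. In the situation of Problem 357, add the stronger assumption $\frac{1}{\sqrt n} W_n^\top \varepsilon_n \to N(o, \sigma^2 Q)$, and show that $\sqrt n(\tilde\beta_n - \beta) \to N(o, \sigma^2 (D^\top Q^{-1} D)^{-1})$.

Answer. $\tilde\beta_n - \beta = B_n \frac{1}{n} W_n^\top \varepsilon_n$, therefore $\sqrt n(\tilde\beta_n - \beta) = B_n n^{-1/2} W_n^\top \varepsilon_n \to B\,N(o, \sigma^2 Q) = N(o, \sigma^2 BQB^\top)$. Since $B = (D^\top Q^{-1} D)^{-1} D^\top Q^{-1}$, the result follows.

From Problem 359 follows that for finite samples approximately $\tilde\beta_n - \beta \sim N\bigl(o, \frac{\sigma^2}{n}(D^\top Q^{-1} D)^{-1}\bigr)$.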
Since $\frac{1}{n} (D^\top Q^{-1} D)^{-1} = \bigl(nD^\top (nQ)^{-1} nD\bigr)^{-1}$, $\operatorname{MSE}[\tilde\beta; \beta]$ can be estimated by $s^2 \bigl(X^\top W(W^\top W)^{-1}W^\top X\bigr)^{-1}$.

The estimator (32.0.4) is sometimes called the two stages least squares estimate, because the projection of $X$ on the column space of $W$ can be considered the predicted values if one regresses every column of $X$ on $W$. I.e., instead of regressing $y$ on $X$ one regresses $y$ on those linear combinations of the columns of $W$ which best approximate the columns of $X$. Here is more detail: the matrix of estimated coefficients in the first regression is $\hat\Pi = (W^\top W)^{-1}W^\top X$, and the predicted values in this regression are $\hat X = W\hat\Pi = W(W^\top W)^{-1}W^\top X$. The second regression, which regresses $y$ on $\hat X$, gives the coefficient vector

(32.0.6)  $\tilde\beta = (\hat X^\top \hat X)^{-1} \hat X^\top y.$

If you plug this in you see this is exactly (32.0.4) again.

Now let's look at the geometry of instrumental variable regression of one variable $y$ on one other variable $x$ with $w$ as an instrument. The specification is $y = x\beta + \varepsilon$. On p. 280 we visualized the asymptotic results if $\varepsilon$ is asymptotically orthogonal to $x$. Now let us assume $\varepsilon$ is asymptotically not orthogonal to $x$. One can visualize this as three vectors, again normalized by dividing by $\sqrt{n}$, but now even in the asymptotic case the $\varepsilon$-vector is not orthogonal to $x$. (Draw $\varepsilon$ vertically, and make $x$ long enough that $\beta < 1$.) We assume $n$ is large enough so that the asymptotic results hold for the sample already (or, perhaps better, that the difference between the sample and its plim is only infinitesimal). Therefore the OLS regression, which estimates $\beta$ by $x^\top y / x^\top x$, is inconsistent. Let $O$ be the origin, $A$ the point on the $x$-vector where $\varepsilon$ branches off (i.e., the end of $x\beta$), furthermore let $B$ be the point on the $x$-vector where the orthogonal projection of $y$ comes down, and $C$ the end of the $x$-vector. Then $x^\top y = \overline{OC}\,\overline{OB}$ and $x^\top x = \overline{OC}^2$, therefore $x^\top y / x^\top x = \overline{OB}/\overline{OC}$, which would be the $\beta$ if the errors were orthogonal.
Now introduce a new variable $w$ which is orthogonal to the errors. (Since $\varepsilon$ is vertical, $w$ is on the horizontal axis.) Call $D$ the projection of $y$ on $w$, which is the prolongation of the vector $\varepsilon$, and call $E$ the end of the $w$-vector, and call $F$ the projection of $x$ on $w$. Then $w^\top y = \overline{OE}\,\overline{OD}$, and $w^\top x = \overline{OE}\,\overline{OF}$. Therefore $w^\top y / w^\top x = (\overline{OE}\,\overline{OD})/(\overline{OE}\,\overline{OF}) = \overline{OD}/\overline{OF} = \overline{OA}/\overline{OC} = \beta$. Or geometrically it is obvious that the regression of $y$ on the projection of $x$ on $w$ will give the right $\hat\beta$. One also sees here why the $s^2$ based on this second regression is inconsistent.

If I allow two instruments, the two instruments must be in the horizontal plane perpendicular to the vector $\varepsilon$ which is assumed still vertical. Here we project $x$ on this horizontal plane and then regress the $y$, which stays where it is, on this $x$. In this way the residuals have the right direction!

What if there is one instrument, but it does not lie in the same plane as $x$ and $y$? This is the most general case as long as there is only one regressor and one instrument. This instrument $w$ must lie somewhere in the horizontal plane. We have to project $x$ on it, and then regress $y$ on this projection. Look at it this way: take the plane orthogonal to $w$ which goes through point $C$. The projection of $x$ on $w$ is the intersection of the ray generated by $w$ with this plane. Now move this plane parallel until it intersects point $A$. Then the intersection with the $w$-ray is the projection of $y$ on $w$. But this latter plane contains $\varepsilon$, since $\varepsilon$ is orthogonal to $w$. This makes sure that the regression gives the right results.

Problem 360. 4 points The asymptotic MSE matrix of the instrumental variables estimator with $W$ as matrix of instruments is $\sigma^2 \operatorname{plim} \bigl(\frac{1}{n} X^\top W(W^\top W)^{-1}W^\top X\bigr)^{-1}$. Show that if one adds more instruments, then this asymptotic MSE-matrix can only decrease. It is sufficient to show that the inequality holds before going over to the plim, i.e., if $W = \begin{bmatrix} U & V \end{bmatrix}$, then

(32.0.7)  $\bigl(X^\top U(U^\top U)^{-1}U^\top X\bigr)^{-1} - \bigl(X^\top W(W^\top W)^{-1}W^\top X\bigr)^{-1}$

is nonnegative definite. Hints: (1) Use theorem A.5.5 in the Appendix (proof is not required). (2) Note that $U = WG$ for some $G$. Can you write this $G$ in partitioned matrix form? (3) Show that, whatever $W$ and $G$, $W(W^\top W)^{-1}W^\top - WG(G^\top W^\top WG)^{-1}G^\top W^\top$ is idempotent.

Answer.

(32.0.8)  $U = \begin{bmatrix} U & V \end{bmatrix} \begin{bmatrix} I \\ O \end{bmatrix} = WG$ where $G = \begin{bmatrix} I \\ O \end{bmatrix}$.

Problem 361. 2 points Show: if a matrix $D$ has full column rank and is square, then it has an inverse.

Answer. Here you need that column rank is row rank: if $D$ has full column rank it also has full row rank. And to make the proof complete you need: if $A$ has a left inverse $L$ and a right inverse $R$, then $L$ is the only left inverse and $R$ the only right inverse and $L = R$. Proof: $L = L(AR) = (LA)R = R$.

Problem 362. 2 points If $W^\top X$ is square and has full column rank, then it is nonsingular. Show that in this case (32.0.4) simplifies to the "simple" instrumental variables estimator:

(32.0.9)  $\tilde\beta = (W^\top X)^{-1} W^\top y$

Answer. In this case the big inverse can be split into three:

(32.0.10)  $\tilde\beta = \bigl(X^\top W(W^\top W)^{-1}W^\top X\bigr)^{-1} X^\top W(W^\top W)^{-1}W^\top y$

(32.0.11)  $= (W^\top X)^{-1}\, W^\top W\, (X^\top W)^{-1} X^\top W (W^\top W)^{-1} W^\top y = (W^\top X)^{-1} W^\top y$

Problem 363. We only have one regressor with intercept, i.e., $X = \begin{bmatrix} \iota & x \end{bmatrix}$, and we have one instrument $w$ for $x$ (while the constant term is its own instrument), i.e., $W = \begin{bmatrix} \iota & w \end{bmatrix}$. Show that the instrumental variables estimators for slope and intercept are

(32.0.12)  $\tilde\beta = \dfrac{\sum_t (w_t - \bar w)(y_t - \bar y)}{\sum_t (w_t - \bar w)(x_t - \bar x)}$

(32.0.13)  $\tilde\alpha = \bar y - \tilde\beta \bar x$

Hint: the math is identical to that in question 200.

Problem 364. 2 points Show that, if there are as many instruments as there are observations, then the instrumental variables estimator (32.0.4) becomes identical to OLS.

Answer. In this case $W$ has an inverse, therefore the projection on $R[W]$ is the identity. Staying in the algebraic paradigm, $(W^\top W)^{-1} = W^{-1}(W^\top)^{-1}$.

An implication of Problem 364 is that one must be careful not to include too many instruments if one has a small sample.
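The algebra of Problems 362 to 364 can be checked numerically on a small data set. The sketch below uses made-up data; the helper name `giv`, the seed, the sample size and the coefficients are assumptions for illustration only. It compares the generalized estimator (32.0.4) with the simple estimator (32.0.9), with the slope and intercept formulas (32.0.12) and (32.0.13), and with OLS when there are as many instruments as observations.

```python
import numpy as np

def giv(X, W, y):
    """Generalized IV estimator (32.0.4), computed via the projection of X on R[W]."""
    X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]   # W (W'W)^{-1} W' X
    # X_hat' X equals X' W (W'W)^{-1} W' X because the projector is symmetric
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(1)        # arbitrary seed, made-up data
n = 12
w = rng.normal(size=n)
x = 0.7 * w + rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
iota = np.ones(n)
X = np.column_stack([iota, x])
W = np.column_stack([iota, w])        # m = k, so W'X is square

# Simple IV estimator (32.0.9) and the scalar formulas (32.0.12), (32.0.13)
simple_iv = np.linalg.solve(W.T @ X, W.T @ y)
dw = w - w.mean()
slope = (dw @ (y - y.mean())) / (dw @ (x - x.mean()))
intercept = y.mean() - slope * x.mean()

# As many instruments as observations (m = n): the projection is the identity
W_full = rng.normal(size=(n, n))
ols = np.linalg.lstsq(X, y, rcond=None)[0]

same_as_simple = np.allclose(giv(X, W, y), simple_iv)
same_as_formulas = np.allclose(simple_iv, [intercept, slope])
same_as_ols = np.allclose(giv(X, W_full, y), ols)
print(same_as_simple, same_as_formulas, same_as_ols)
```

All three comparisons are expected to hold up to rounding error, since they are exact algebraic identities rather than asymptotic statements.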
Asymptotically it is better to have more instruments, but for $n = m$, the instrumental variables estimator is equal to OLS, i.e., the sequence of instrumental variables estimators starts at the (inconsistent) OLS. If one uses fewer instruments, then the asymptotic MSE matrix is not so good, but one may get a sequence of estimators which moves away from the inconsistent OLS more quickly.

APPENDIX A

Matrix Formulas

In this Appendix, efforts are made to give some of the familiar matrix lemmas in their most general form. The reader should be warned: the concept of a deficiency matrix and the notation which uses a thick fraction line for multiplication with a scalar g-inverse are my own.

A.1. A Fundamental Matrix Decomposition

Theorem A.1.1. Every matrix $B$ which is not the null matrix can be written as a product of two matrices $B = CD$, where $C$ has a left inverse $L$ and $D$ a right inverse $R$, i.e., $LC = DR = I$. This identity matrix is $r \times r$, where $r$ is the rank of $B$.

A proof is in [Rao73, p. 19]. This is the fundamental theorem of algebra, that every homomorphism can be written as a product of epimorphism and monomorphism, together with the fact that all epimorphisms and monomorphisms split, i.e., have one-sided inverses.

One such factorization is given by the singular value theorem: If $B = P^\top \Lambda Q$ is the svd as in Theorem A.9.2, then one might set e.g. $C = P^\top \Lambda$ and $D = Q$, consequently $L = \Lambda^{-1} P$ and $R = Q^\top$. In this decomposition, the first row/column carries the largest weight and gives the best approximation in a least squares sense, etc.

The trace of a square matrix is defined as the sum of its diagonal elements. The rank of a matrix is defined as the number of its linearly independent rows, which is equal to the number of its linearly independent columns (row rank = column rank).

Theorem A.1.2. $\operatorname{tr} BC = \operatorname{tr} CB$.

Problem 365. Prove theorem A.1.2.

Problem 366. Use theorem A.1.1 to prove that if $BB = B$, then $\operatorname{rank} B = \operatorname{tr} B$.

Answer.
Premultiply the equation $CD = CDCD$ by $L$ and postmultiply it by $R$ to get $DC = I_r$. This is useful for the trace: $\operatorname{tr} B = \operatorname{tr} CD = \operatorname{tr} DC = \operatorname{tr} I_r = r$. I have this proof from [Rao73, p. 28].

Theorem A.1.3. $B = O$ if and only if $B^\top B = O$.

A.2. The Spectral Norm of a Matrix

The spectral norm of a matrix extends the Euclidean norm $\|z\|$ from vectors to matrices. Its definition is $\|A\| = \max_{\|z\|=1} \|Az\|$. This spectral norm is the maximum singular value $\mu_{\max}$, and if $A$ is square, then $\|A^{-1}\| = 1/\mu_{\min}$. It is a true norm, i.e., $\|A\| = 0$ if and only if $A = O$, furthermore $\|\lambda A\| = |\lambda| \cdot \|A\|$, and the triangle inequality $\|A + B\| \le \|A\| + \|B\|$ holds. In addition, it obeys $\|AB\| \le \|A\| \cdot \|B\|$.

Problem 367. Show that the spectral norm is the maximum singular value.

Answer. Use the definition

(A.2.1)  $\|A\|^2 = \max_z \dfrac{z^\top A^\top A z}{z^\top z}.$

Write $A = P^\top \Lambda Q$ as in (A.9.1). Then $z^\top A^\top A z = z^\top Q^\top \Lambda^2 Q z$. Therefore we can first show: there is a $z$ in the form $z = Q^\top x$ which attains this maximum. Proof: for every $z$ which has a nonzero value in the numerator of (A.2.1), set $x = Qz$. Then $x \ne o$, and $Q^\top x$ attains the same value as $z$ in the numerator of (A.2.1), and a smaller or equal value in the denominator. Therefore one can restrict the search for the maximum argument to vectors of the form $Q^\top x$. But for them the objective function becomes $\dfrac{x^\top \Lambda^2 x}{x^\top x}$, which is maximized by $x = i_1$, the first unit vector (or column vector of the unit matrix). Therefore the squared spectral norm is $\lambda_{11}^2$, and therefore the spectral norm itself is $\lambda_{11}$.

A.3. Inverses and g-Inverses of Matrices

A g-inverse of a matrix $A$ is any matrix $A^-$ satisfying

(A.3.1)  $A = AA^-A.$

It always exists but is not always unique. If $A$ is square and nonsingular, then $A^{-1}$ is its only g-inverse.

Problem 368. Show that a symmetric matrix $\Omega$ has a g-inverse which is also symmetric.

Answer. Use $\Omega^- \Omega\, (\Omega^-)^\top$.

The definition of a g-inverse is apparently due to [Rao62]. It is sometimes called the "conditional inverse" [Gra83, p. 129].
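The defining property (A.3.1), the non-uniqueness of g-inverses, and the characterization of the spectral norm in Problem 367 can be checked numerically. This is only a sketch: the test matrix, the seed and the helper name `is_g_inverse` are assumptions for illustration. It relies on two facts: the Moore-Penrose inverse returned by `numpy.linalg.pinv` is one particular g-inverse, and for any matrix $Z$ of suitable shape $A^- + (I - A^-A)Z$ is again a g-inverse, because $A(I - A^-A) = O$.

```python
import numpy as np

rng = np.random.default_rng(0)                          # arbitrary seed
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 5))   # 4 x 5 matrix of rank 2

def is_g_inverse(G, A, tol=1e-10):
    """Check the defining property (A.3.1): A = A G A."""
    return np.allclose(A @ G @ A, A, atol=tol)

G1 = np.linalg.pinv(A)                        # Moore-Penrose inverse, one g-inverse
Z = rng.normal(size=G1.shape)
G2 = G1 + (np.eye(5) - G1 @ A) @ Z            # a different g-inverse of the same A

print(is_g_inverse(G1, A), is_g_inverse(G2, A), np.allclose(G1, G2))

# Spectral norm: max of ||Az|| over unit vectors z equals the largest singular value
singular_values = np.linalg.svd(A, compute_uv=False)
spectral_norm = np.linalg.norm(A, 2)
z = rng.normal(size=(5, 1000))
z /= np.linalg.norm(z, axis=0)                # 1000 random unit vectors
largest_sampled = np.linalg.norm(A @ z, axis=0).max()
print(spectral_norm, singular_values[0], largest_sampled)
```

The random unit vectors only give a lower bound on the maximum, so `largest_sampled` should not exceed the largest singular value but need not reach it.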
This g-inverse, and not the Moore-Penrose generalized inverse or pseudoinverse $A^+$, is needed for the linear model. The Moore-Penrose generalized inverse is a g-inverse that in addition satisfies $A^+AA^+ = A^+$, and $AA^+$ as well as $A^+A$ symmetric. It always exists and is also unique, but the additional requirements are burdensome ballast. [Gre97, pp. 44-5] also advocates the Moore-Penrose inverse, but he does not really use it. If he were to try to use it, he would probably soon discover that it is not appropriate. The book [Alb72] does the linear model with the Moore-Penrose inverse. It is a good demonstration of how complicated everything gets if one uses an inappropriate mathematical tool.

Problem 369. Use theorem A.1.1 to prove that every matrix has a g-inverse.

Answer. Simple: a null matrix has its transpose as g-inverse, and if $A \ne O$ then $RL$ is such a g-inverse.

The g-inverse of a number is its inverse if the number is nonzero, and is arbitrary otherwise. Scalar expressions written as fractions are in many cases the multiplication by a g-inverse. We will use a fraction with a thick horizontal rule to indicate where this is the case. In other words, by definition,

(A.3.2)  $\genfrac{}{}{1.2pt}{}{a}{b} = b^- a.$  Compare that with the ordinary fraction $\dfrac{a}{b}$.

This idiosyncratic notation allows one to write certain theorems in a more concise form, but it requires more work in the proofs, because one has to consider the additional case that the denominator is zero. Theorems A.5.8 and A.8.2 are examples.

Theorem A.3.1. If $B = AA^-B$ holds for one g-inverse $A^-$ of $A$, then it holds for all g-inverses. If $A$ is symmetric and $B = AA^-B$, then also $B^\top = B^\top A^-A$. If $B = BA^-A$ and $C = AA^-C$ then $BA^-C$ is independent of the choice of g-inverses.

Proof. Assume the identity $B = AA^+B$ holds for some fixed g-inverse $A^+$ (which may be, as the notation suggests, the Moore-Penrose g-inverse, but this is not necessary), and let $A^-$ be a different g-inverse. Then $AA^-B = AA^-AA^+B = AA^+B = B$.
For the second statement one merely has to take transposes and note that a matrix is a g-inverse of a symmetric $A$ if and only if its transpose is. For the third statement: $BA^+C = BA^-AA^+AA^-C = BA^-AA^-C = BA^-C$. Here $+$ signifies a different g-inverse; again, it is not necessarily the Moore-Penrose one.

Problem 370. Show that $x$ satisfies $x = Ba$ for some $a$ if and only if $x = BB^-x$.

Theorem A.3.2. Both $A^\top (AA^\top)^-$ and $(A^\top A)^- A^\top$ are g-inverses of $A$.

Proof. We have to show

(A.3.3)  $A = AA^\top (AA^\top)^- A$

which is [Rao73, (1b.5.5) on p. 26]. Define $D = A - AA^\top (AA^\top)^- A$ and show, by multiplying out, that $DD^\top = O$.

A.4. Deficiency Matrices

Here is again some idiosyncratic terminology and notation. It gives an explicit algebraic formulation for something that is often done implicitly or in a geometric paradigm. A matrix $G$ will be called a "left deficiency matrix" of $S$, in symbols, $G \perp S$, if $GS = O$, and for all $Q$ with $QS = O$ there is an $X$ with $Q = XG$. This factorization property is an algebraic formulation of the geometric concept of a null space. It is symmetric in the sense that $G \perp S$ is also equivalent with: $GS = O$, and for all $R$ with $GR = O$ there is a $Y$ with $R = SY$. In other words, $G \perp S$ and $S^\top \perp G^\top$ are equivalent.

This symmetry follows from the following characterization of a deficiency matrix which is symmetric:

Theorem A.4.1. $T \perp U$ iff $TU = O$ and $T^\top T + UU^\top$ nonsingular.

Proof. This proof here seems terribly complicated. There must be a simpler way. Proof of "⇒": Assume $T \perp U$. Take any $\gamma$ with $\gamma^\top T^\top T \gamma + \gamma^\top UU^\top \gamma = 0$, i.e., $T\gamma = o$ and $\gamma^\top U = o^\top$. From this one can show that $\gamma = o$: since $T\gamma = o$, there is a $\xi$ with $\gamma = U\xi$, therefore $\gamma^\top \gamma = \gamma^\top U\xi = 0$. To prove "⇐" assume $TU = O$ and $T^\top T + UU^\top$ is nonsingular. To show that $T \perp U$ take any $B$ with $BU = O$. Then $B = B(T^\top T + UU^\top)(T^\top T + UU^\top)^{-1} = BT^\top T(T^\top T + UU^\top)^{-1}$. In the same way one gets $T = TT^\top T(T^\top T + UU^\top)^{-1}$.
Premultiply this last equation by $T^\top T(T^\top TT^\top T)^- T^\top$ and use theorem A.3.2 to get $T^\top T(T^\top TT^\top T)^- T^\top T = T^\top T(T^\top T + UU^\top)^{-1}$. Inserting this into the equation for $B$ gives $B = BT^\top T(T^\top TT^\top T)^- T^\top T$, i.e., $B$ factors over $T$.

The R/Splus-function Null gives the transpose of a deficiency matrix.

Theorem A.4.2. If for all $Y$, $BY = O$ implies $AY = O$, then an $X$ exists with $A = XB$.

Problem 371. Prove theorem A.4.2.

Answer. Let $B \perp C$. Choosing $Y = C$, it follows that $AC = O$, hence $X$ exists.

Problem 372. Show that $I - SS^- \perp S$.

Answer. Clearly, $(I - SS^-)S = O$. Now if $QS = O$, then $Q = Q(I - SS^-)$, i.e., the $X$ whose existence is postulated in the definition of a deficiency matrix is $Q$ itself.

Problem 373. Show that $S \perp U$ if and only if $S$ is a matrix with maximal rank which satisfies $SU = O$. In other words, one cannot add linearly independent rows to $S$ in such a way that the new matrix $T$ still satisfies $TU = O$.

Answer. First assume $S \perp U$ and take any additional row $t^\top$ so that $\begin{bmatrix} S \\ t^\top \end{bmatrix} U = \begin{bmatrix} O \\ o^\top \end{bmatrix}$. Then there exists a $\begin{bmatrix} Q \\ r^\top \end{bmatrix}$ such that $\begin{bmatrix} S \\ t^\top \end{bmatrix} = \begin{bmatrix} Q \\ r^\top \end{bmatrix} S$, i.e., $S = QS$, and $t^\top = r^\top S$. But this last equation means that $t^\top$ is a linear combination of the rows of $S$ with the $r_i$ as coefficients. Now conversely, assume $S$ is such that one cannot add a linearly independent row $t^\top$ such that $\begin{bmatrix} S \\ t^\top \end{bmatrix} U = \begin{bmatrix} O \\ o^\top \end{bmatrix}$, and let $PU = O$. Then all rows of $P$ must be linear combinations of rows of $S$ (otherwise one could add such a row to $S$ and get the result which was just ruled out), therefore $P = AS$ where $A$ is the matrix of coefficients of these linear combinations.

The deficiency matrix is not unique, but we will use the concept of a deficiency matrix in a formula only when this formula remains correct for every deficiency matrix. One can make deficiency matrices unique if one requires them to be projection matrices.

Problem 374. Given $X$ and a symmetric nonnegative definite $\Omega$ such that $X = \Omega W$ for some $W$. Show that $X \perp U$ if and only if $X^\top \Omega^- X \perp U$.

Answer.
One has to show that $XY = O$ is equivalent to $X^\top \Omega^- XY = O$. ⇒ clear; for ⇐ note that $X^\top \Omega^- X = W^\top \Omega W$, therefore $XY = \Omega WY = \Omega W(W^\top \Omega W)^- W^\top \Omega WY = \Omega W(W^\top \Omega W)^- X^\top \Omega^- XY = O$.

A matrix is said to have full column rank if all its columns are linearly independent, and full row rank if its rows are linearly independent. The deficiency matrix provides a "holistic" definition for which it is not necessary to look at single rows and columns. $X$ has full column rank if and only if $X \perp O$, and full row rank if and only if $O \perp X$.

Problem 375. Show that the following three statements are equivalent: (1) $X$ has full column rank, (2) $X^\top X$ is nonsingular, and (3) $X$ has a left inverse.

Answer. Here use $X \perp O$ as the definition of "full column rank." Then (1) ⇔ (2) is theorem A.4.1. Now (1) ⇒ (3): Since $IO = O$, a $P$ exists with $I = PX$. And (3) ⇒ (1): if a $P$ exists with $I = PX$, then any $Q$ with $QO = O$ can be factored over $X$, simply say $Q = QPX$.

Note that the usual solution of linear matrix equations with g-inverses involves a deficiency matrix:

Theorem A.4.3. The solution of the consistent matrix equation $TX = A$ is

(A.4.1)  $X = T^-A + UW$

where $T \perp U$ and $W$ is arbitrary.

Proof. Given consistency, i.e., the existence of at least one $Z$ with $TZ = A$, (A.4.1) defines indeed a solution, since $TX = TT^-A + TUW = TT^-TZ = TZ = A$. Conversely, if $Y$ satisfies $TY = A$, then $T(Y - T^-A) = O$, therefore $Y - T^-A = UW$ for some $W$.

Theorem A.4.4. Let $L \perp T \perp U$ and $J \perp HU \perp R$; then

$\begin{bmatrix} L & O \\ -JHT^- & J \end{bmatrix} \perp \begin{bmatrix} T \\ H \end{bmatrix} \perp UR.$

Proof. First deficiency relation: Since $I - T^-T = UW$ for some $W$, $-JHT^-T + JH = O$, therefore the matrix product is zero. Now assume $\begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} T \\ H \end{bmatrix} = O$. Then $BHU = O$, i.e., $B = DJ$ for some $D$. Then $AT = -DJH$, which has as general solution $A = -DJHT^- + CL$ for some $C$. This together gives $\begin{bmatrix} A & B \end{bmatrix} = \begin{bmatrix} C & D \end{bmatrix} \begin{bmatrix} L & O \\ -JHT^- & J \end{bmatrix}$. Now the second deficiency relation: clearly, the product of the matrices is zero. If $M$ satisfies $TM = O$, then $M = UN$ for some $N$.
If $M$ furthermore satisfies $HM = O$, then $HUN = O$, therefore $N = RP$ for some $P$, therefore $M = URP$.

Theorem A.4.5. Assume $\Omega$ is nonnegative definite symmetric and $K$ is such that $K\Omega$ is defined. Then the matrix

(A.4.2)  $\Xi = \Omega - \Omega K^\top (K\Omega K^\top)^- K\Omega$

has the following properties:
(1) $\Xi$ does not depend on the choice of g-inverse of $K\Omega K^\top$ used in (A.4.2).
(2) Any g-inverse of $\Omega$ is also a g-inverse of $\Xi$, i.e. $\Xi \Omega^- \Xi = \Xi$.
(3) $\Xi$ is nonnegative definite and symmetric.
(4) For every $P \perp \Omega$ follows $\begin{bmatrix} K \\ P \end{bmatrix} \perp \Xi$.
(5) If $T$ is any other right deficiency matrix of $\begin{bmatrix} K \\ P \end{bmatrix}$, i.e., if $\begin{bmatrix} K \\ P \end{bmatrix} \perp T$, then

(A.4.3)  $\Xi = T(T^\top \Omega^- T)^- T^\top.$

Hint: show that any $D$ satisfying $\Xi = TDT^\top$ is a g-inverse of $T^\top \Omega^- T$. In order to apply (A.4.3) show that the matrix $T = SK$ where $K \perp S$ and $PS \perp K$ is a right deficiency matrix of $\begin{bmatrix} K \\ P \end{bmatrix}$.

Proof of theorem A.4.5: Independence of choice of g-inverse follows from theorem A.5.10. That $\Omega^-$ is a g-inverse is also an immediate consequence of theorem A.5.10. From the factorization $\Xi = \Xi \Omega^- \Xi$ follows also that $\Xi$ is nnd symmetric (since every nnd symmetric $\Omega$ also has a symmetric nnd g-inverse).

(4) Deficiency property: From $\begin{bmatrix} K \\ P \end{bmatrix} Q = O$ follows $KQ = O$ and $PQ = O$. From this second equation and $P \perp \Omega$ follows $Q = \Omega R$ for some $R$. Since $K\Omega R = KQ = O$, it follows $Q = \Omega R = (\Omega - \Omega K^\top (K\Omega K^\top)^- K\Omega)R$.

Proof of (5): Since $\begin{bmatrix} K \\ P \end{bmatrix} \Xi = O$ it follows $\Xi = TA$ for some $A$, and therefore $\Xi = \Xi \Omega^- \Xi = TA\Omega^- A^\top T^\top = TDT^\top$ where $D = A\Omega^- A^\top$.

Before going on we need a lemma. Since $(I - \Omega\Omega^-)\Omega = O$, there exists an $N$ with $I - \Omega\Omega^- = NP$, therefore $T - \Omega\Omega^- T = NPT = O$ or

(A.4.4)  $T = \Omega\Omega^- T$

Using (A.4.4) one can show the hint: that any $D$ satisfying $\Xi = TDT^\top$ is a g-inverse of $T^\top \Omega^- T$:

(A.4.5)  $T^\top \Omega^- TDT^\top \Omega^- T \equiv T^\top \Omega^- (\Omega - \Omega K^\top (K\Omega K^\top)^- K\Omega)\Omega^- T = T^\top \Omega^- T.$

To complete the proof of (5) we have to show that the expression $T(T^\top \Omega^- T)^- T^\top$ does not depend on the choice of the g-inverse of $T^\top \Omega^- T$. This follows from $T(T^\top \Omega^- T)^- T^\top = \Omega\Omega^- T(T^\top \Omega^- T)^- T^\top \Omega^- \Omega$ and theorem A.5.10.
Theorem A.4.6. Given two matrices $T$ and $U$. Then $T \perp U$ if and only if for any $D$ the following two statements are equivalent:

(A.4.6)  $TD = O$

and

(A.4.7)  For all $C$ which satisfy $CU = O$ follows $CD = O$.

A.5. Nonnegative Definite Symmetric Matrices

By definition, a symmetric matrix $\Omega$ is nonnegative definite if $a^\top \Omega a \ge 0$ for all vectors $a$. It is positive definite if $a^\top \Omega a > 0$ for all vectors $a \ne o$.

Theorem A.5.1. $\Omega$ is nonnegative definite symmetric if and only if it can be written in the form $\Omega = A^\top A$ for some $A$.

Theorem A.5.2. If $\Omega$ is nonnegative definite, and $a^\top \Omega a = 0$, then already $\Omega a = o$.

Theorem A.5.3. $A$ is positive definite if and only if it is nonnegative definite and nonsingular.

Theorem A.5.4. If the symmetric matrix $A$ has a nnd g-inverse then $A$ itself is also nnd.

Theorem A.5.5. If $\Omega$ and $\Sigma$ are positive definite, then $\Omega - \Sigma$ is positive (nonnegative) definite if and only if $\Sigma^{-1} - \Omega^{-1}$ is.

Theorem A.5.6. If $\Omega$ and $\Sigma$ are nonnegative definite, then $\operatorname{tr}(\Omega\Sigma) \ge 0$.

Problem 376. Prove theorem A.5.6.

Answer. Find any factorization $\Sigma = PP^\top$. Then $\operatorname{tr}(\Omega\Sigma) = \operatorname{tr}(P^\top \Omega P) \ge 0$.

Theorem A.5.7. If $\Omega$ is nonnegative definite symmetric, then

(A.5.1)  $(g^\top \Omega a)^2 \le g^\top \Omega g \; a^\top \Omega a,$

for arbitrary vectors $a$ and $g$. Equality holds if and only if $\Omega g$ and $\Omega a$ are linearly dependent, i.e., $\alpha$ and $\beta$ exist, not both zero, such that $\Omega g\alpha + \Omega a\beta = o$.

Proof: First we will show that the condition for equality is sufficient. Therefore assume $\Omega g\alpha + \Omega a\beta = o$ for a certain $\alpha$ and $\beta$, which are not both zero. Without loss of generality we can assume $\alpha \ne 0$. Then we can solve $a^\top \Omega g\alpha + a^\top \Omega a\beta = 0$ to get $a^\top \Omega g = -(\beta/\alpha)\, a^\top \Omega a$, therefore the lefthand side of (A.5.1) is $(\beta/\alpha)^2 (a^\top \Omega a)^2$. Furthermore we can solve $g^\top \Omega g\alpha + g^\top \Omega a\beta = 0$ to get $g^\top \Omega g = -(\beta/\alpha)\, g^\top \Omega a = (\beta/\alpha)^2 a^\top \Omega a$, therefore the righthand side of (A.5.1) is $(\beta/\alpha)^2 (a^\top \Omega a)^2$ as well, i.e., (A.5.1) holds with equality.
Secondly we will show that (A.5.1) holds in the general case and that, if it holds with equality, $\Omega g$ and $\Omega a$ are linearly dependent. We will split this second half of the proof into two substeps. First verify that (A.5.1) holds if $g^\top \Omega g = 0$. If this is the case, then already $\Omega g = o$, therefore the $\Omega g$ and $\Omega a$ are linearly dependent and, by the first part of the proof, (A.5.1) holds with equality.

The second substep is the main part of the proof. Assume $g^\top \Omega g \ne 0$. Since $\Omega$ is nonnegative definite, it follows

(A.5.2)  $0 \le \Bigl(a - g\dfrac{g^\top \Omega a}{g^\top \Omega g}\Bigr)^{\!\top} \Omega \Bigl(a - g\dfrac{g^\top \Omega a}{g^\top \Omega g}\Bigr) = a^\top \Omega a - 2\dfrac{(g^\top \Omega a)^2}{g^\top \Omega g} + \dfrac{(g^\top \Omega a)^2}{g^\top \Omega g} = a^\top \Omega a - \dfrac{(g^\top \Omega a)^2}{g^\top \Omega g}.$

From this follows (A.5.1). If (A.5.2) is an equality, then already $\Omega\Bigl(a - g\dfrac{g^\top \Omega a}{g^\top \Omega g}\Bigr) = o$, which means that $\Omega g$ and $\Omega a$ are linearly dependent.

Theorem A.5.8. In the situation of theorem A.5.7, one can take g-inverses as follows without disturbing the inequality

(A.5.3)  $\genfrac{}{}{1.2pt}{}{(g^\top \Omega a)^2}{g^\top \Omega g} \le a^\top \Omega a.$

Equality holds if and only if a $\gamma \ne 0$ exists with $\Omega g = \Omega a\gamma$.

Problem 377. Show that if $\Omega$ is nonnegative definite, then its elements satisfy

(A.5.4)  $\omega_{ij}^2 \le \omega_{ii}\,\omega_{jj}$

Answer. Let $a$ and $b$ be the $i$th and $j$th unit vector. Then

(A.5.5)  $\dfrac{(b^\top \Omega a)^2}{b^\top \Omega b} \le \max_g \dfrac{(g^\top \Omega a)^2}{g^\top \Omega g} = a^\top \Omega a.$

Problem 378. Assume $\Omega$ nonnegative definite symmetric. If $x$ satisfies $x = \Omega a$ for some $a$, show that

(A.5.6)  $\max_g \dfrac{(g^\top x)^2}{g^\top \Omega g} = x^\top \Omega^- x.$

Furthermore show that equality holds if and only if $\Omega g = x\gamma$ for some $\gamma \ne 0$.

Answer. From $x = \Omega a$ follows $g^\top x = g^\top \Omega a$ and $x^\top \Omega^- x = a^\top \Omega a$; therefore it follows from theorem A.5.8.

Problem 379. Assume $\Omega$ nonnegative definite symmetric, $x$ satisfies $x = \Omega a$ for some $a$, and $R$ is such that $Rx$ is defined. Show that

(A.5.7)  $x^\top R^\top (R\Omega R^\top)^- Rx \le x^\top \Omega^- x$

Answer. Follows from

(A.5.8)  $\max_h \dfrac{(h^\top Rx)^2}{h^\top R\Omega R^\top h} \le \max_g \dfrac{(g^\top x)^2}{g^\top \Omega g}$

because on the term on the lhs maximization is done over the smaller set of $g$ which have the form $R^\top h$. An alternative proof would be to show that $\Omega - \Omega R^\top (R\Omega R^\top)^- R\Omega$ is nnd (it has $\Omega^-$ as g-inverse).
Problem 380. Assume $\Omega$ nonnegative definite symmetric. Show that

(A.5.9)  $\max_{g \colon g = \Omega a \text{ for some } a} \dfrac{(g^\top x)^2}{g^\top \Omega^- g} = x^\top \Omega x.$

Answer. Since $g = \Omega a$ for some $a$, maximize over $a$ instead of $g$. This reduces it to theorem A.5.8:

(A.5.10)  $\max_{g \colon g = \Omega a \text{ for some } a} \dfrac{(g^\top x)^2}{g^\top \Omega^- g} = \max_a \dfrac{(a^\top \Omega x)^2}{a^\top \Omega a} = x^\top \Omega x$

Theorem A.5.9. Let $\Omega$ be symmetric and nonnegative definite, and $x$ an arbitrary vector. Then $\Omega - xx^\top$ is nonnegative definite if and only if the following two conditions hold: $x$ can be written in the form $x = \Omega a$ for some $a$, and $x^\top \Omega^- x \le 1$ for one (and therefore for all) g-inverses $\Omega^-$ of $\Omega$.

Problem 381. Prove theorem A.5.9.

Answer. Assume $x = \Omega a$ and $x^\top \Omega^- x = a^\top \Omega a \le 1$; then for any $g$, $g^\top (\Omega - xx^\top)g = g^\top \Omega g - g^\top \Omega a\, a^\top \Omega g \ge a^\top \Omega a\; g^\top \Omega g - g^\top \Omega a\, a^\top \Omega g \ge 0$ by theorem A.5.7.

Conversely, assume $x$ cannot be written in the form $x = \Omega a$ for some $a$; then a $g$ exists with $g^\top \Omega = o^\top$ but $g^\top x \ne 0$. Then $g^\top (\Omega - xx^\top)g < 0$, therefore not nnd.

Finally assume $x^\top \Omega^- x = a^\top \Omega a > 1$; then $a^\top (\Omega - xx^\top)a = a^\top \Omega a - (a^\top \Omega a)^2 < 0$, therefore again not nnd.

Theorem A.5.10. If $\Omega$ and $\Sigma$ are nonnegative definite symmetric, and $K$ a matrix so that $\Sigma K\Omega$ is defined, then

(A.5.11)  $K\Omega = (K\Omega K^\top + \Sigma)(K\Omega K^\top + \Sigma)^- K\Omega.$

Furthermore, $\Omega K^\top (K\Omega K^\top + \Sigma)^- K\Omega$ is independent of the choice of g-inverses.

Problem 382. Prove theorem A.5.10.

Answer. To see that (A.5.11) is a special case of (A.3.3), take any $Q$ with $\Omega = QQ^\top$ and $P$ with $\Sigma = PP^\top$ and define $A = \begin{bmatrix} KQ & P \end{bmatrix}$. The independence of the choice of g-inverses follows from theorem A.3.1 together with (A.5.11).

The following was apparently first shown in [Alb69] for the special case of the Moore-Penrose pseudoinverse:

Theorem A.5.11.
The symmetric partitioned matrix $\Omega = \begin{bmatrix} \Omega_{yy} & \Omega_{yz} \\ \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix}$ is nonnegative definite if and only if the following conditions hold:

(A.5.12)  $\Omega_{yy}$ and $\Omega_{zz.y} := \Omega_{zz} - \Omega_{yz}^\top \Omega_{yy}^- \Omega_{yz}$ are both nonnegative definite, and

(A.5.13)  $\Omega_{yz} = \Omega_{yy}\Omega_{yy}^- \Omega_{yz}$

Reminder: It follows from theorem A.3.1 that (A.5.13) holds for some g-inverse if and only if it holds for all, and that, if it holds, $\Omega_{zz.y}$ is independent of the choice of the g-inverse.

Proof of theorem A.5.11: First we prove the necessity of the three conditions in the theorem. If the symmetric partitioned matrix $\Omega$ is nonnegative definite, there exists an $R$ with $\Omega = R^\top R$. Write $R = \begin{bmatrix} R_y & R_z \end{bmatrix}$ to get

$\begin{bmatrix} \Omega_{yy} & \Omega_{yz} \\ \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix} = \begin{bmatrix} R_y^\top R_y & R_y^\top R_z \\ R_z^\top R_y & R_z^\top R_z \end{bmatrix}.$

$\Omega_{yy}$ is nonnegative definite because it is equal to $R_y^\top R_y$, and (A.5.13) follows from (A.5.11): $\Omega_{yy}\Omega_{yy}^- \Omega_{yz} = R_y^\top R_y (R_y^\top R_y)^- R_y^\top R_z = R_y^\top R_z = \Omega_{yz}$. To show that $\Omega_{zz.y}$ is nonnegative definite, define $S = (I - R_y(R_y^\top R_y)^- R_y^\top)R_z$. Then $S^\top S = R_z^\top \bigl(I - R_y(R_y^\top R_y)^- R_y^\top\bigr) R_z = \Omega_{zz.y}$.

To show sufficiency of the three conditions of theorem A.5.11, assume the symmetric $\begin{bmatrix} \Omega_{yy} & \Omega_{yz} \\ \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix}$ satisfies them. Pick two matrices $Q$ and $S$ so that $\Omega_{yy} = Q^\top Q$ and $\Omega_{zz.y} = S^\top S$. Then

$\begin{bmatrix} \Omega_{yy} & \Omega_{yz} \\ \Omega_{yz}^\top & \Omega_{zz} \end{bmatrix} = \begin{bmatrix} Q^\top & O \\ \Omega_{yz}^\top \Omega_{yy}^- Q^\top & S^\top \end{bmatrix} \begin{bmatrix} Q & Q\Omega_{yy}^- \Omega_{yz} \\ O & S \end{bmatrix},$

therefore nonnegative definite.

Problem 383. [SM86, A 3.2/11] Given a positive definite matrix $Q$ and a positive definite $\tilde Q$ with $Q^* = \tilde Q - Q$ nonnegative definite.

• a. Show that $Q - Q\tilde Q^{-1}Q$ is nonnegative definite.

Answer. We know that $Q^{-1} - \tilde Q^{-1}$ is nnd, therefore $QQ^{-1}Q - Q\tilde Q^{-1}Q$ is nnd.

• b. This part is more difficult: Show that also $Q^* - Q^*\tilde Q^{-1}Q^*$ is nonnegative definite.

Answer. We will write it in a symmetric form from which it is obvious that it is nonnegative definite:

(A.5.14)  $Q^* - Q^*\tilde Q^{-1}Q^* = Q^* - Q^*(Q + Q^*)^{-1}Q^*$

(A.5.15)  $= Q^*(Q + Q^*)^{-1}(Q + Q^* - Q^*) = Q^*(Q + Q^*)^{-1}Q$

(A.5.16)  $= Q(Q + Q^*)^{-1}(Q + Q^*)Q^{-1}Q^*(Q + Q^*)^{-1}Q$

(A.5.17)  $= Q\tilde Q^{-1}(Q^* + Q^*Q^{-1}Q^*)\tilde Q^{-1}Q.$

Problem 384. Given the vector $h \ne o$.
For which values of the scalar $\gamma$ is the matrix $I - \dfrac{hh^\top}{\gamma}$ singular, nonsingular, nonnegative definite, a projection matrix, orthogonal?

Answer. It is nnd iff $\gamma \ge h^\top h$, because of theorem A.5.9. One easily verifies that it is orthogonal iff $\gamma = h^\top h/2$, and it is a projection matrix iff $\gamma = h^\top h$. Now let us prove that it is singular iff $\gamma = h^\top h$: if this condition holds, then the matrix annuls $h$; now assume the condition does not hold, i.e., $\gamma \ne h^\top h$, and take any $x$ with $\bigl(I - \frac{hh^\top}{\gamma}\bigr)x = o$. It follows $x = h\alpha$ where $\alpha = h^\top x/\gamma$, therefore $\bigl(I - \frac{hh^\top}{\gamma}\bigr)x = h\alpha(1 - h^\top h/\gamma)$. Since $h \ne o$ and $1 - h^\top h/\gamma \ne 0$ this can only be the null vector if $\alpha = 0$.

A.6. Projection Matrices

Problem 385. Show that $X(X^\top X)^- X^\top$ is the projection matrix on the range space $R[X]$ of $X$, i.e., on the space spanned by the columns of $X$. This is true whether or not $X$ has full column rank.

Answer. Idempotence requires theorem A.3.2, and symmetry the invariance under choice of g-inverse. Furthermore one has to show $X(X^\top X)^- X^\top a = a$ holds if and only if $a = Xb$ for some $b$. ⇒ is clear, and ⇐ follows from theorem A.3.2.

Theorem A.6.1. Let $P$ and $Q$ be projection matrices, i.e., both are symmetric and idempotent. Then the following five conditions are equivalent, each meaning that the space on which $P$ projects is a subspace of the space on which $Q$ projects:

(A.6.1)  $R[P] \subset R[Q]$

(A.6.2)  $QP = P$

(A.6.3)  $PQ = P$

(A.6.4)  $Q - P$ projection matrix

(A.6.5)  $Q - P$ nonnegative definite.

(A.6.2) is geometrically trivial. It means: if one first projects on a certain space, and then on a larger space which contains the first space as a subspace, then nothing happens under this second projection because one is already in the larger space. (A.6.3) is geometrically not trivial and worth remembering: if one first projects on a certain space, and then on a smaller space which is a subspace of the first space, then the result is the same as if one had projected directly on the smaller space.
(A.6.4) means: the difference $Q - P$ is the projection on the orthogonal complement of $R[P]$ in $R[Q]$. And (A.6.5) means: the projection of a vector on the smaller space cannot be longer than that on the larger space.

Problem 386. Prove theorem A.6.1.

Answer. Instead of going in a circle it is more natural to show (A.6.1) ⇔ (A.6.2) and (A.6.3) ⇔ (A.6.2) and then go in a circle for the remaining conditions: (A.6.2), (A.6.3) ⇒ (A.6.4) ⇒ (A.6.5) ⇒ (A.6.3).

(A.6.1) ⇒ (A.6.2): $R[P] \subset R[Q]$ means that for every $c$ exists a $d$ with $Pc = Qd$. Therefore for all $c$ follows $QPc = QQd = Qd = Pc$, i.e., $QP = P$.

(A.6.2) ⇒ (A.6.1): if $Pc = QPc$ for all $c$, then clearly $R[P] \subset R[Q]$.

(A.6.2) ⇒ (A.6.3) by symmetry of $P$ and $Q$: If $QP = P$ then $PQ = P^\top Q^\top = (QP)^\top = P^\top = P$.

(A.6.3) ⇒ (A.6.2) follows in exactly the same way: If $PQ = P$ then $QP = Q^\top P^\top = (PQ)^\top = P^\top = P$.

(A.6.2), (A.6.3) ⇒ (A.6.4): Symmetry of $Q - P$ clear, and $(Q - P)(Q - P) = Q - P - P + P = Q - P$.

(A.6.4) ⇒ (A.6.5): $c^\top (Q - P)c = c^\top (Q - P)^\top (Q - P)c \ge 0$.

(A.6.5) ⇒ (A.6.3): First show that, if $Q - P$ nnd, then $Qc = o$ implies $Pc = o$. Proof: from $Q - P$ nnd and $Qc = o$ follows $0 \le c^\top (Q - P)c = -c^\top Pc \le 0$, therefore equality throughout, i.e., $0 = c^\top Pc = c^\top P^\top Pc = \|Pc\|^2$ and therefore $Pc = o$. Secondly: this is also true for matrices: $QC = O$ implies $PC = O$, since it is valid for every column of $C$. Thirdly: Since $Q(I - Q) = O$, it follows $P(I - Q) = O$, which is (A.6.3).

Problem 387. If $Y = XA$ for some $A$, show that $Y(Y^\top Y)^- Y^\top X(X^\top X)^- X^\top = Y(Y^\top Y)^- Y^\top$.

Answer. $Y = XA$ means that every column of $Y$ is a linear combination of columns of $X$:

(A.6.6)  $\begin{bmatrix} y_1 & \cdots & y_m \end{bmatrix} = X \begin{bmatrix} a_1 & \cdots & a_m \end{bmatrix} = \begin{bmatrix} Xa_1 & \cdots & Xa_m \end{bmatrix}.$

Therefore geometrically the statement follows from the fact shown in Problem 385 that the above matrices are projection matrices on the column spaces. But it can also be shown algebraically: $Y(Y^\top Y)^- Y^\top X(X^\top X)^- X^\top = Y(Y^\top Y)^- A^\top X^\top X(X^\top X)^- X^\top = Y(Y^\top Y)^- Y^\top$.

Problem 388.
(Not eligible for in-class exams) Let $Q$ be a projection matrix (i.e., a symmetric and idempotent matrix) with the property that $Q = XAX^\top$ for some $A$. Define $\tilde X = (I - Q)X$. Then
\[ X(X^\top X)^- X^\top = \tilde X(\tilde X^\top \tilde X)^- \tilde X^\top + Q. \tag{A.6.7} \]
Hint: this can be done through a geometric argument. If you want to do it algebraically, you might want to use the fact that $(X^\top X)^-$ is also a g-inverse of $\tilde X^\top \tilde X$.

Answer. Geometric argument: $Q$ is a projector on a subspace of the range space of $X$. The columns of $\tilde X$ are projections of the columns of $X$ on the orthogonal complement of the space on which $Q$ projects. The equation which we have to prove shows therefore that the projection on the column space of $X$ is the sum of the projection on the space $Q$ projects on and the projection on the orthogonal complement of that space in the column space of $X$.

Now an algebraic proof: First let us show that $(X^\top X)^-$ is a g-inverse of $\tilde X^\top \tilde X$, i.e., let us evaluate
\[ X^\top (I - Q)X(X^\top X)^- X^\top (I - Q)X = X^\top X(X^\top X)^- X^\top X - X^\top X(X^\top X)^- X^\top QX - X^\top QX(X^\top X)^- X^\top X + X^\top QX(X^\top X)^- X^\top QX \tag{A.6.8} \]
\[ = X^\top X - X^\top QX - X^\top QX + X^\top QX = X^\top (I - Q)X. \tag{A.6.9} \]
Only for the fourth term did we need the condition $Q = XAX^\top$:
\[ X^\top XAX^\top X(X^\top X)^- X^\top XAX^\top X = X^\top XAX^\top XAX^\top X = X^\top QQX = X^\top QX. \tag{A.6.10} \]
Using this g-inverse we have
\[ X(X^\top X)^- X^\top - \tilde X(X^\top X)^- \tilde X^\top = X(X^\top X)^- X^\top - (I - Q)X(X^\top X)^- X^\top (I - Q) = \tag{A.6.11} \]
\[ = X(X^\top X)^- X^\top - X(X^\top X)^- X^\top + X(X^\top X)^- X^\top Q + QX(X^\top X)^- X^\top - QX(X^\top X)^- X^\top Q = X(X^\top X)^- X^\top - X(X^\top X)^- X^\top + Q + Q - Q = Q. \tag{A.6.12} \]

Problem 389. Given any projection matrix $P$. Show that its $i$th diagonal element can be written
\[ p_{ii} = \sum_j p_{ij}^2. \tag{A.6.13} \]

Answer. From idempotence $P = PP$ follows $p_{ii} = \sum_j p_{ij} p_{ji}$, now use symmetry to get (A.6.13).

A.7. Determinants

Theorem A.7.1. The determinant of a block-triangular matrix is the product of the determinants of the blocks in the diagonal. In other words,
\[ \begin{vmatrix} A & B \\ O & D \end{vmatrix} = |A|\,|D|. \tag{A.7.1} \]

For the proof recall the definition of a determinant. A mapping $\pi\colon \{1, \ldots, n\} \to \{1, \ldots, n\}$ is called a permutation if and only if it is one-to-one if and only if it is onto. Permutations can be classified as even or odd according to whether they can be written as the product of an even or odd number of transpositions. Then the determinant is defined as
\[ \det(A) = \sum_{\pi\colon \pi \text{ even}} a_{1\pi(1)} \cdots a_{n\pi(n)} - \sum_{\pi\colon \pi \text{ odd}} a_{1\pi(1)} \cdots a_{n\pi(n)}. \tag{A.7.2} \]
Now assume $A$ is $m \times m$, $1 \le m < n$. If a $j \le m$ exists with $\pi(j) > m$ then not all $i \le m$ can be images of other points $j \le m$, i.e., there must be at least one $j > m$ with $\pi(j) \le m$. Therefore, in a block triangular matrix in which all $a_{ij} = 0$ for $i \le m$, $j > m$, only those permutations give a nonzero product which remain in the two submatrices straddling the diagonal.

Theorem A.7.2. If $B = AA^- B$, then the following identity is valid between determinants:
\[ \begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A|\,|E| \quad\text{where } E = D - CA^- B. \tag{A.7.3} \]

Proof: Postmultiply by a matrix whose determinant, by theorem A.7.1, is one, and then apply theorem A.7.1 once more:
\[ \begin{vmatrix} A & B \\ C & D \end{vmatrix} = \begin{vmatrix} A & B \\ C & D \end{vmatrix} \begin{vmatrix} I & -A^- B \\ O & I \end{vmatrix} = \begin{vmatrix} A & O \\ C & D - CA^- B \end{vmatrix} = |A|\,|D - CA^- B|. \tag{A.7.4} \]

Problem 390. Show the following counterpart of theorem A.7.2: If $C = DD^- C$, then the following identity is valid between determinants:
\[ \begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A - BD^- C|\,|D|. \tag{A.7.5} \]

Answer.
\[ \begin{vmatrix} A & B \\ C & D \end{vmatrix} = \begin{vmatrix} A & B \\ C & D \end{vmatrix} \begin{vmatrix} I & O \\ -D^- C & I \end{vmatrix} = \begin{vmatrix} A - BD^- C & B \\ O & D \end{vmatrix} = |A - BD^- C|\,|D|. \tag{A.7.6} \]

Problem 391. Show that whenever $BC$ and $CB$ are defined, it follows $|I - BC| = |I - CB|$.

Answer. Set $A = I$ and $D = I$ in (A.7.3) and (A.7.5).

Theorem A.7.3. Assume that $d = WW^- d$. Then
\[ \det(W + \alpha \cdot dd^\top) = \det(W)(1 + \alpha d^\top W^- d). \tag{A.7.7} \]

Proof: If $\alpha = 0$, then there is nothing to prove. Otherwise look at the determinant of the matrix
\[ H = \begin{bmatrix} W & d \\ d^\top & -1/\alpha \end{bmatrix}. \tag{A.7.8} \]
Equations (A.7.3) and (A.7.5) give two expressions for it:
\[ \det(H) = \det(W)(-1/\alpha - d^\top W^- d) = -\frac{1}{\alpha} \det(W + \alpha dd^\top). \tag{A.7.9} \]

A.8. More About Inverses

Problem 392. Given a partitioned matrix $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ which satisfies $B = AA^- B$ and $C = CA^- A$. (These conditions hold for instance, due to theorem A.5.11, if $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ is nonnegative definite symmetric, but they also hold in the nonsymmetric case if $A$ is nonsingular, which by theorem A.7.2 is the case if the whole partitioned matrix is nonsingular.) Define $E = D - CA^- B$, $F = A^- B$, and $G = CA^-$.

• a. Prove that in terms of $A$, $E$, $F$, and $G$, the original matrix can be written as
\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix} \tag{A.8.1} \]
(this is trivial), and that (this is the nontrivial part)
\[ \begin{bmatrix} A^- + FE^- G & -FE^- \\ -E^- G & E^- \end{bmatrix} \tag{A.8.2} \]
is a g-inverse of $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$.

Answer. This here is not the shortest proof because I was still wondering if it could be formulated in a more general way. Multiply out but do not yet use the conditions $B = AA^- B$ and $C = CA^- A$:
\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} A^- + FE^- G & -FE^- \\ -E^- G & E^- \end{bmatrix} = \begin{bmatrix} AA^- - (I - AA^-)BE^- G & (I - AA^-)BE^- \\ (I - EE^-)G & EE^- \end{bmatrix} \tag{A.8.3} \]
and
\[ \begin{bmatrix} AA^- - (I - AA^-)BE^- G & (I - AA^-)BE^- \\ (I - EE^-)G & EE^- \end{bmatrix} \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} A + (I - AA^-)BE^- C(I - A^- A) & B - (I - AA^-)B(I - E^- E) \\ C - (I - EE^-)C(I - A^- A) & D \end{bmatrix}. \tag{A.8.4} \]
One sees that not only the conditions $B = AA^- B$ and $C = CA^- A$, but also the conditions $B = AA^- B$ and $C = EE^- C$, or alternatively the conditions $B = BE^- E$ and $C = CA^- A$ imply the statement. I think one can also work with the conditions $AA^- B = BD^- D$ and $DD^- C = CA^- A$. Note that the lower right partition is $D$ no matter what.

• b. If $\begin{bmatrix} U & V \\ W & X \end{bmatrix}$ is a g-inverse of $\begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix}$, show that $X$ is a g-inverse of $E$.

Answer. The g-inverse condition means
\[ \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix} \begin{bmatrix} U & V \\ W & X \end{bmatrix} \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix} = \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix}. \tag{A.8.5} \]
The first matrix product evaluated is
\[ \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix} \begin{bmatrix} U & V \\ W & X \end{bmatrix} = \begin{bmatrix} AU + AFW & AV + AFX \\ GAU + EW + GAFW & GAV + EX + GAFX \end{bmatrix}. \tag{A.8.6} \]
The g-inverse condition means therefore
\[ \begin{bmatrix} AU + AFW & AV + AFX \\ GAU + EW + GAFW & GAV + EX + GAFX \end{bmatrix} \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix} = \begin{bmatrix} A & AF \\ GA & E + GAF \end{bmatrix}. \tag{A.8.7} \]
For the upper left partition this means $AUA + AFWA + AVGA + AFXGA = A$, and for the upper right partition it means $AUAF + AFWAF + AVE + AVGAF + AFXE + AFXGAF = AF$. Postmultiply the upper left equation by $F$ and subtract from the upper right to get $AVE + AFXE = O$. For the lower left we get $GAUA + EWA + GAFWA + GAVGA + EXGA + GAFXGA = GA$. Premultiplication of the upper left equation by $G$ and subtraction gives $EWA + EXGA = O$. For the lower right corner we get $GAUAF + EWAF + GAFWAF + GAVE + EXE + GAFXE + GAVGAF + EXGAF + GAFXGAF = E + GAF$. Since $AVE + AFXE = O$ and $EWA + EXGA = O$, this simplifies to $GAUAF + GAFWAF + EXE + GAVGAF + GAFXGAF = E + GAF$. And if one premultiplies the upper left equation by $G$ and postmultiplies it by $F$ and subtracts it from this, one gets $EXE = E$.

Problem 393. Show that a g-inverse of the matrix
\[ \begin{bmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix} \tag{A.8.8} \]
has the form
\[ \begin{bmatrix} (X_1^\top X_1)^- + D_1^\top X_2 (X_2^\top M_1 X_2)^- X_2^\top D_1 & -D_1^\top X_2 (X_2^\top M_1 X_2)^- \\ -(X_2^\top M_1 X_2)^- X_2^\top D_1 & (X_2^\top M_1 X_2)^- \end{bmatrix} \tag{A.8.9} \]
where $M_1 = I - X_1 (X_1^\top X_1)^- X_1^\top$ and $D_1 = X_1 (X_1^\top X_1)^-$.

Answer. Either show it by multiplying it out, or apply Problem 392.

Problem 394. Show that the following are g-inverses:
\[ \begin{bmatrix} I & X \\ X^\top & X^\top X \end{bmatrix}^- = \begin{bmatrix} I & O \\ O & O \end{bmatrix}, \qquad \begin{bmatrix} X^\top X & X^\top \\ X & I \end{bmatrix}^- = \begin{bmatrix} (X^\top X)^- & O \\ O & I - X(X^\top X)^- X^\top \end{bmatrix}. \tag{A.8.10} \]

Answer. Either do it by multiplying it out, or apply problem 392.

Problem 395. Assume again $B = AA^- B$ and $C = CA^- A$, but assume this time that $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ is nonsingular. Then $A$ is nonsingular, and if $\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} P & Q \\ R & S \end{bmatrix}$, then the determinant
\[ \begin{vmatrix} A & B \\ C & D \end{vmatrix} = \frac{|A|}{|S|}. \tag{A.8.11} \]

Answer. The determinant is, by (A.7.3), $|A|\,|E|$ where $E = D - CA^- B$. By assumption, this determinant is nonzero, therefore also $|A|$ and $|E|$ are nonzero, i.e., $A$ and $E$ are nonsingular. Therefore (A.8.2) reads
\[ \begin{bmatrix} P & Q \\ R & S \end{bmatrix} = \begin{bmatrix} A^{-1} + FE^{-1} G & -FE^{-1} \\ -E^{-1} G & E^{-1} \end{bmatrix}, \tag{A.8.12} \]
i.e., $S = E^{-1} = (D - CA^- B)^{-1}$, hence $|A|\,|E| = |A| / |S|$.

Theorem A.8.1.
Given a $m \times n$ matrix $A$, a $m \times h$ matrix $B$, a $k \times n$ matrix $C$, and a $k \times h$ matrix $D$ satisfying $AA^- B = BD^- D$ and $DD^- C = CA^- A$. Then the following are g-inverses:
\[ (A + BD^- C)^- = A^- - A^- B(D + CA^- B)^- CA^- \tag{A.8.13} \]
\[ (D + CA^- B)^- = D^- - D^- C(A + BD^- C)^- BD^-. \tag{A.8.14} \]

Problem 396. Prove theorem A.8.1.

Answer. Proof: Define $E = D + CA^- B$. Then it follows from the assumptions that
\[ (A + BD^- C)(A^- - A^- BE^- CA^-) = AA^- - BD^- DE^- CA^- + BD^- CA^- - BD^- CA^- BE^- CA^- = \tag{A.8.15} \]
\[ = AA^- + BD^- (I - EE^-)CA^-. \tag{A.8.16} \]
Since $AA^- (A + BD^- C) = A + BD^- C$, we have to show that the second term on the right-hand side annuls $(A + BD^- C)$. Indeed,
\[ BD^- (I - EE^-)CA^- (A + BD^- C) = \tag{A.8.17} \]
\[ = BD^- CA^- A + BD^- CA^- BD^- C - BD^- EE^- CA^- A - BD^- EE^- CA^- BD^- C = \tag{A.8.18} \]
\[ = BD^- (D + CA^- B - EE^- D - EE^- CA^- B)D^- C = BD^- (E - EE^- E)D^- C = O. \tag{A.8.19} \]

Theorem A.8.2. (Sherman-Morrison-Woodbury theorem) Given a $m \times n$ matrix $A$, a $m \times 1$ vector $b$ satisfying $AA^- b = b$, a $n \times 1$ vector $c$ satisfying $c^\top A^- A = c^\top$, and a scalar $\delta$. If $A^-$ is a g-inverse of $A$, then
\[ A^- - \frac{A^- bc^\top A^-}{c^\top A^- b + \delta} \quad\text{is a g-inverse of}\quad A + \frac{bc^\top}{\delta}. \tag{A.8.20} \]

Problem 397. Prove theorem A.8.2.

Answer. It is a special case of theorem A.8.1.

Theorem A.8.3. For any symmetric nonnegative definite $r \times r$ matrix $A$,
\[ (\det A)\, e^{-(\operatorname{tr} A)} \le e^{-r}, \tag{A.8.21} \]
with equality holding if and only if $A = I$.

Problem 398. Prove Theorem A.8.3. Hint: Let $\lambda_1, \ldots, \lambda_r$ be the eigenvalues of $A$. Then $\det A = \prod_i \lambda_i$, and $\operatorname{tr} A = \sum_i \lambda_i$.

Answer. Therefore the inequality reads
\[ \prod_{i=1}^r \lambda_i e^{-\lambda_i} \le e^{-r}. \tag{A.8.22} \]
For this it is sufficient to show for each value of $\lambda$
\[ \lambda e^{-\lambda} \le e^{-1}, \tag{A.8.23} \]
which follows immediately by taking the derivative: $e^{-\lambda} - \lambda e^{-\lambda} = 0$ gives $\lambda = 1$. The matrix with all eigenvalues being equal to 1 is the identity matrix.

A.9. Eigenvalues and Singular Value Decomposition

Every symmetric matrix $B$ has real eigenvalues and a system of orthogonal eigenvectors which span the whole space.
If one normalizes these eigenvectors and combines them as row vectors into a matrix $T$, then orthonormality means $TT^\top = I$, and since $T$ is square, $TT^\top = I$ also implies $T^\top T = I$, i.e., $T$ is an orthogonal matrix. The existence of a complete set of real eigenvectors is therefore equivalent to the following matrix algebraic result: For every symmetric matrix $B$ there is an orthogonal transformation $T$ so that $BT^\top = T^\top \Lambda$ where $\Lambda$ is a diagonal matrix. Equivalently one could write $B = T^\top \Lambda T$. And if $B$ has rank $r$, then $r$ of the diagonal elements are nonzero and the others zero. If one removes those eigenvectors from $T$ which belong to the eigenvalue zero, and calls the remaining matrix $P$, one gets the following:

Theorem A.9.1. If $B$ is a symmetric $n \times n$ matrix of rank $r$, then a $r \times n$ matrix $P$ exists with $PP^\top = I$ (any $P$ satisfying this condition which is not a square matrix is called incomplete orthogonal), and $B = P^\top \Lambda P$, where $\Lambda$ is a $r \times r$ diagonal matrix with all diagonal elements nonzero.

Proof. Let $T$ be an orthogonal matrix whose rows are eigenvectors of $B$, and partition it $T = \begin{bmatrix} P \\ Q \end{bmatrix}$ where $P$ consists of all eigenvectors with nonzero eigenvalue (there are $r$ of them). The eigenvalue property reads
\[ B \begin{bmatrix} P^\top & Q^\top \end{bmatrix} = \begin{bmatrix} P^\top & Q^\top \end{bmatrix} \begin{bmatrix} \Lambda & O \\ O & O \end{bmatrix}; \]
therefore by orthogonality $T^\top T = I$ follows
\[ B = \begin{bmatrix} P^\top & Q^\top \end{bmatrix} \begin{bmatrix} \Lambda & O \\ O & O \end{bmatrix} \begin{bmatrix} P \\ Q \end{bmatrix} = P^\top \Lambda P. \]
Orthogonality also means $TT^\top = I$, i.e.,
\[ \begin{bmatrix} P \\ Q \end{bmatrix} \begin{bmatrix} P^\top & Q^\top \end{bmatrix} = \begin{bmatrix} I & O \\ O & I \end{bmatrix}, \]
therefore $PP^\top = I$.

Problem 399. If $B$ is a $n \times n$ symmetric matrix of rank $r$ and $B^2 = B$, i.e., $B$ is a projection, then a $r \times n$ matrix $P$ exists with $B = P^\top P$ and $PP^\top = I$.

Answer. Let $t$ be an eigenvector of the projection matrix $B$ with eigenvalue $\lambda$. Then $B^2 t = Bt$, i.e., $\lambda^2 t = \lambda t$, and since $t \ne o$, $\lambda^2 = \lambda$. This is a quadratic equation with solutions $\lambda = 0$ or $\lambda = 1$. The matrix $\Lambda$ from theorem A.9.1, whose diagonal elements are the nonzero eigenvalues, is therefore an identity matrix.

A theorem similar to A.9.1 holds for arbitrary matrices. It is called the “singular value decomposition”:

Theorem A.9.2. Let $B$ be a $m \times n$ matrix of rank $r$.
Then $B$ can be expressed as
\[ B = P^\top \Lambda Q \tag{A.9.1} \]
where $\Lambda$ is a $r \times r$ diagonal matrix with positive diagonal elements, and $PP^\top = I$ as well as $QQ^\top = I$. The diagonal elements of $\Lambda$ are called the singular values of $B$.

Proof. If $P^\top \Lambda Q$ is the svd of $B$ then $P^\top \Lambda QQ^\top \Lambda P = P^\top \Lambda^2 P$ is the eigenvalue decomposition of $BB^\top$. We will use this fact to construct $P$ and $Q$, and then verify condition (A.9.1). $P$ and $Q$ have $r$ rows each, write them
\[ P = \begin{bmatrix} p_1^\top \\ \vdots \\ p_r^\top \end{bmatrix} \quad\text{and}\quad Q = \begin{bmatrix} q_1^\top \\ \vdots \\ q_r^\top \end{bmatrix}. \tag{A.9.2} \]
Then the $p_i$ are orthonormal eigenvectors of $BB^\top$ corresponding to the nonzero eigenvalues $\lambda_i^2$, and $q_i = B^\top p_i \lambda_i^{-1}$. The proof that this definition is symmetric is left as exercise problem 400 below. Now find $p_{r+1}, \ldots, p_m$ such that $p_1, \ldots, p_m$ is a complete set of orthonormal vectors, i.e., $p_1 p_1^\top + \cdots + p_m p_m^\top = I$. Then
\[ B = (p_1 p_1^\top + \cdots + p_m p_m^\top)B \tag{A.9.3} \]
\[ = (p_1 p_1^\top + \cdots + p_r p_r^\top)B \quad\text{because } p_i^\top B = o^\top \text{ for } i > r \tag{A.9.4} \]
\[ = (p_1 q_1^\top \lambda_1 + \cdots + p_r q_r^\top \lambda_r) = P^\top \Lambda Q. \tag{A.9.5} \]

Problem 400. Show that the $q_i$ are orthonormal eigenvectors of $B^\top B$ corresponding to the same eigenvalues $\lambda_i^2$.

Answer.
\[ q_i^\top q_j = \lambda_i^{-1} p_i^\top BB^\top p_j \lambda_j^{-1} = \lambda_i^{-1} p_i^\top p_j \lambda_j^2 \lambda_j^{-1} = \delta_{ij} \quad\text{(Kronecker symbol)} \tag{A.9.6} \]
\[ B^\top B q_i = B^\top BB^\top p_i \lambda_i^{-1} = B^\top p_i \lambda_i = q_i \lambda_i^2. \tag{A.9.7} \]

Problem 401. Show that $Bq_i = \lambda_i p_i$ and $B^\top p_i = \lambda_i q_i$.

Answer. The second condition comes from the definition $q_i = B^\top p_i \lambda_i^{-1}$, and premultiply this definition by $B$ to get $Bq_i = BB^\top p_i \lambda_i^{-1} = \lambda_i^2 p_i \lambda_i^{-1} = \lambda_i p_i$.

Let $P_0$ and $Q_0$ be such that $\begin{bmatrix} P \\ P_0 \end{bmatrix}$ and $\begin{bmatrix} Q \\ Q_0 \end{bmatrix}$ are orthogonal. Then the singular value decomposition can also be written in the full form, in which the matrix in the middle is $m \times n$:
\[ B = \begin{bmatrix} P^\top & P_0^\top \end{bmatrix} \begin{bmatrix} \Lambda & O \\ O & O \end{bmatrix} \begin{bmatrix} Q \\ Q_0 \end{bmatrix}. \tag{A.9.8} \]

Problem 402. Let $\lambda_1$ be the biggest diagonal element of $\Lambda$, and let $c$ and $d$ be two vectors with the properties that $c^\top Bd$ is defined and $c^\top c = 1$ as well as $d^\top d = 1$. Show that $c^\top Bd \le \lambda_1$. The other singular values maximize among those who are orthogonal to the prior maximizers.

Answer. $c^\top Bd = c^\top P^\top \Lambda Qd = h^\top \Lambda k$ where we call $Pc = h$ and $Qd = k$.
By Cauchy-Schwarz (A.5.1), $(h^\top \Lambda k)^2 \le (h^\top \Lambda h)(k^\top \Lambda k)$. Now $h^\top \Lambda h = \sum_i \lambda_{ii} h_i^2 \le \lambda_{11} \sum_i h_i^2 = \lambda_{11} h^\top h$. Now we only have to show that $h^\top h \le 1$: $1 - h^\top h = c^\top c - c^\top P^\top Pc = c^\top (I - P^\top P)c = c^\top (I - P^\top P)(I - P^\top P)c \ge 0$; here we used that $PP^\top = I$, therefore $P^\top P$ is idempotent, therefore also $I - P^\top P$ is idempotent.

APPENDIX B

Arrays of Higher Rank

This chapter was presented at the Array Programming Languages Conference in Berlin, on July 24, 2000.

Besides scalars, vectors, and matrices, also higher arrays are necessary in statistics; for instance, the “covariance matrix” of a random matrix is really an array of rank 4, etc. Usually, such higher arrays are avoided in the applied sciences because of the difficulty of writing them on a two-dimensional sheet of paper. The following symbolic notation makes the structure of arrays explicit without writing them down element by element. It is hoped that this makes arrays easier to understand, and that this notation leads to simple high-level user interfaces for programming languages manipulating arrays.

B.1. Informal Survey of the Notation

Each array is symbolized by a rectangular tile with arms sticking out, similar to a molecule. Tiles with one arm are vectors, those with two arms matrices, those with more arms are arrays of higher rank (or “valence” as in [SS35], [Mor73], and [MS86, p. 12]), and those without arms are scalars. The arrays considered here are rectangular, not “ragged,” therefore in addition to their rank we only need to know the dimension of each arm; it can be thought of as the number of fingers associated with this arm. Arrays can only hold hands (i.e., “contract” along two arms) if the hands have the same number of fingers.

Sometimes it is convenient to write the dimension of each arm at the end of the arm, i.e., a $m \times n$ matrix $A$ can be represented as m ─[A]─ n. Matrix products are represented by joining the obvious arms: if $B$ is $n \times q$, then the matrix product $AB$ is m ─[A]─ n ─[B]─ q or, in short, ─[A]─[B]─.
The notation allows the reader to always tell which arm is which, even if the arms are not marked. If m ─[C]─ r is $m \times r$, then the product $C^\top A$ is

(B.1.1) $C^\top A$ = [tile diagram: two equivalent drawings of r ─[C]─ m ─[A]─ n]

In the second representation, the tile representing $C$ is turned by 180 degrees. Since the white part of the frame of $C$ is at the bottom, not on the top, one knows that the West arm of $C$, not its East arm, is concatenated with the West arm of $A$. The transpose of m ─[C]─ r is r ─[C]─ m, i.e., it is not a different entity but the same entity in a different position. The order in which the elements are arranged on the page (or in computer memory) is not a part of the definition of the array itself. Likewise, there is no distinction between row vectors and column vectors.

Vectors are usually, but not necessarily, written in such a way that their arm points West (column vector convention). If $a$ and $b$ are vectors, their scalar product $a^\top b$ is the concatenation [a]─[b] which has no free arms, i.e., it is a scalar, and their outer product $ab^\top$ is ─[a] [b]─, which is a matrix. Juxtaposition of tiles represents the outer product, i.e., the array consisting of all the products of elements of the arrays represented by the tiles placed side by side.

The trace of a square matrix $Q$ is the concatenation [tile diagram: $Q$ with its two arms joined to each other], which is a scalar since no arms are sticking out. In general, concatenation of two arms of the same tile represents contraction, i.e., summation over equal values of the indices associated with these two arms. This notation makes it obvious that $\operatorname{tr} XY = \operatorname{tr} YX$, because by definition there is no difference between [tile diagram: $X$ and $Y$ joined in a closed loop] and [tile diagram: the same loop with $Y$ drawn first]. Also [tile diagrams: further drawings of the same loop] etc. represent the same array (here array of rank zero, i.e., scalar). Each of these tiles can be evaluated in essentially two different ways. One way is

(1) Juxtapose the tiles for $X$ and $Y$, i.e., form their outer product, which is an array of rank 4 with typical element $x_{mp} y_{qn}$.

(2) Connect the East arm of $X$ with the West arm of $Y$.
This is a contraction, resulting in an array of rank 2, the matrix product $XY$, with typical element $\sum_p x_{mp} y_{pn}$.

(3) Now connect the West arm of $X$ with the East arm of $Y$. The result of this second contraction is a scalar, the trace $\operatorname{tr} XY = \sum_{p,m} x_{mp} y_{pm}$.

An alternative sequence of operations evaluating this same graph would be

(1) Juxtapose the tiles for $X$ and $Y$.

(2) Connect the West arm of $X$ with the East arm of $Y$ to get the matrix product $YX$.

(3) Now connect the East arm of $X$ with the West arm of $Y$ to get $\operatorname{tr} YX$.

The result is the same, the notation does not specify which of these alternative evaluation paths is meant, and a computer receiving commands based on this notation can choose the most efficient evaluation path. Probably the most efficient evaluation path is given by (B.2.8) below: take the element-by-element product of $X$ with the transpose of $Y$, and add all the elements of the resulting matrix.

If the user specifies $\operatorname{tr}(XY)$, the computer is locked into one evaluation path: it first has to compute the matrix product $XY$, even if $X$ is a column vector and $Y$ a row vector and it would be much more efficient to compute it as $\operatorname{tr}(YX)$, and then form the trace, i.e., throw away all off-diagonal elements. If the trace is specified as [tile diagram: $X$ and $Y$ joined in a closed loop], the computer can choose the most efficient of a number of different evaluation paths transparently to the user. This advantage of the graphical notation is of course even more important if the graphs are more complex.

There is also the “diagonal” array, which in the case of rank 3 can be written

(B.1.2) [tile diagram: a tile ∆ with three arms, each of dimension n, shown in several equivalent configurations]

or similar configurations. It has 1’s down the main diagonal and 0’s elsewhere. It can be used to construct the diagonal matrix diag($x$) of a vector (the square matrix with the vector in the diagonal and zeros elsewhere) as

(B.1.3) diag($x$) = [tile diagram: the rank-3 ∆ of dimension n with one arm joined to $x$],

the diagonal vector of a square matrix (i.e., the vector containing its diagonal elements) as

(B.1.4) [tile diagram: the rank-3 ∆ with two of its arms joined to the two arms of $A$],

and the “Hadamard product” (element-by-element product) of two vectors $x * y$ as

(B.1.5) $x * y$ = [tile diagram: the rank-3 ∆ with one arm joined to $x$ and one arm joined to $y$].

All these are natural operations involving vectors and matrices, but the usual matrix notation cannot represent them and therefore ad-hoc notation must be invented for it. In our graphical representation, however, they all can be built up from a small number of atomic operations, which will be enumerated in Section B.2.

Each such graph can be evaluated in a number of different ways, and all these evaluations give the same result. In principle, each graph can be evaluated as follows: form the outer product of all arrays involved, and then contract along all those pairs of arms which are connected. For practical implementations it is more efficient to develop functions which connect two arrays along one or several of their arms without first forming outer products, and to perform the array concatenations recursively in such a way that contractions are done as early as possible. A computer might be programmed to decide on the most efficient construction path for any given array.

B.2. Axiomatic Development of Array Operations

The following sketch shows how this axiom system might be built up. Since I am an economist I do not plan to develop the material presented here any further. Others are invited to take over. If you are interested in working on this, I would be happy to hear from you; email me at ehrbar@econ.utah.edu

There are two kinds of special arrays: unit vectors and diagonal arrays.

For every natural number $m \ge 1$, $m$ unit vectors m ─[i] ($i = 1, \ldots, m$) exist. Despite the fact that the unit vectors are denoted here by numbers, there is no intrinsic ordering among them; they might as well have the names “red, green, blue, . . . 
” (From (B.2.4) and other axioms below it will follow that each unit vector can be represented as a $m$-vector with 1 as one of the components and 0 elsewhere.)

For every rank $\ge 1$ and dimension $n \ge 1$ there is a unique diagonal array denoted by ∆. Their main properties are (B.2.1) and (B.2.2). (This and the other axioms must be formulated in such a way that it will be possible to show that the diagonal arrays of rank 1 are the “vectors of ones” $\iota$ which have 1 in every component; diagonal arrays of rank 2 are the identity matrices; and for higher ranks, all arms of a diagonal array have the same dimension, and their $ijk\cdots$ element is 1 if $i = j = k = \cdots$ and 0 otherwise.) Perhaps it makes sense to define the diagonal array of rank 0 and dimension $n$ to be the scalar $n$, and to declare all arrays which are everywhere 0-dimensional to be diagonal.

There are only three operations on arrays: their outer product, represented by writing them side by side, contraction, represented by the joining of arms, and the direct sum, which will be defined now:

The direct sum is the operation by which a vector can be built up from scalars, a matrix from its row or column vectors, an array of rank 3 from its layers, etc. The direct sum of a set of $r$ similar arrays (i.e., arrays which have the same number of arms, and corresponding arms have the same dimensions) is an array which has one additional arm, called the reference arm of the direct sum. If one “saturates” the reference arm with the $i$th unit vector, one gets the $i$th original array back, and this property defines the direct sum uniquely:

[tile equation: if $S$, with a reference arm of dimension $r$ and further arms of dimensions $m$, $n$, $q$, is the direct sum $\bigoplus_{i=1}^r A_i$ of arrays $A_i$ with arms of dimensions $m$, $n$, $q$, then joining the reference arm of $S$ to the $i$th unit vector gives $A_i$]

It is impossible to tell which is the first summand and which the second; direct sum is an operation defined on finite sets of arrays (where different elements of a set may be equal to each other in every respect but still have different identities).
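The defining property of the direct sum can be tried out on a small example. The following Python fragment is only a sketch under stated assumptions: the function names `unit_vector`, `direct_sum`, and `saturate_reference_arm` are hypothetical, nested lists stand in for arrays, and storing the summands in a list imposes an ordering which the abstract definition above deliberately avoids. It stacks three 2 × 2 matrices along a new reference arm and recovers one of them by saturating that arm with a unit vector.

```python
def unit_vector(i, m):
    """Return the i-th unit vector of dimension m (index i is 0-based)."""
    return [1 if k == i else 0 for k in range(m)]


def direct_sum(arrays):
    """Stack similar matrices along a new leading arm (the reference arm)."""
    return [[row[:] for row in a] for a in arrays]


def saturate_reference_arm(s, v):
    """Contract the reference arm of s with the vector v: sum_k v[k] * s[k]."""
    rows, cols = len(s[0]), len(s[0][0])
    return [[sum(v[k] * s[k][i][j] for k in range(len(s)))
             for j in range(cols)]
            for i in range(rows)]


a1 = [[1, 2], [3, 4]]
a2 = [[5, 6], [7, 8]]
a3 = [[9, 10], [11, 12]]

s = direct_sum([a1, a2, a3])          # rank 3: reference arm of dimension 3
recovered = saturate_reference_arm(s, unit_vector(1, 3))
print(recovered)                      # [[5, 6], [7, 8]], i.e. a2
```

In this representation the reference arm is simply the outermost list index; any other position for it would describe the same abstract array.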
There is a broad rule of associativity: the order in which outer products and contractions are performed does not matter, as long as, at the end, the right arms are connected with each other. And there are distributive rules involving (contracted) outer products and direct sums.

Additional rules apply for the special arrays. If two different diagonal arrays join arms, the result is again a diagonal array. For instance, the following three concatenations of diagonal three-way arrays are identical, and they all evaluate to the (for a given dimension) unique diagonal array of rank 4:

(B.2.1) [tile equation: three different ways of joining two rank-3 ∆ tiles along one arm, each equal to the rank-4 ∆]

The diagonal array of rank 2 is neutral under concatenation, i.e., it can be written as

(B.2.2) n ─[∆]─ n = n ──── n,

because attaching it to any array will not change this array. (B.2.1) and (B.2.2) make it possible to represent diagonal arrays simply as the branching points of several arms. This will make the array notation even simpler. However in the present introductory article, all diagonal arrays will be shown explicitly, and the vector of ones will be denoted m ─[ι] instead of m ─[∆] or perhaps m ─[δ].

Unit vectors concatenate as follows:

(B.2.3) [i]─ m ─[j] = 1 if $i = j$, and 0 otherwise,

and the direct sum of all unit vectors is the diagonal array of rank 2:

(B.2.4) $\bigoplus_{i=1}^n$ [i]─ n = n ─[∆]─ n = n ──── n.

I am sure there will be modifications if one works it all out in detail, but if done right, the number of axioms should be fairly small. Element-by-element addition of arrays is not an axiom because it can be derived: if one saturates the reference arm of a direct sum with the vector of ones, one gets the element-by-element sum of the arrays in this direct sum. Multiplication of an array by a scalar is also contained in the above system of axioms: it is simply the outer product with an array of rank zero.

Problem 403. Show that the saturation of an arm of a diagonal array with the vector of ones is the same as dropping this arm.

Answer.
Since the vector of ones is the diagonal array of rank 1, this is a special case of the general concatenation rule for diagonal arrays.

Problem 404. Show that the diagonal matrix of the vector of ones is the identity matrix, i.e.,

(B.2.5) [tile equation: the rank-3 ∆ of dimension n with one arm joined to ι equals the plain line n ──── n]

Answer. In view of (B.2.2), this is a special case of Problem 403.

Problem 405. A trivial array operation is the addition of an arm of dimension 1; for instance, this is how a $n$-vector can be turned into a $n \times 1$ matrix. Is this operation contained in the above system of axioms?

Answer. It is a special case of the direct sum: the direct sum of one array only, the only effect of which is the addition of the reference arm.

From (B.2.4) and (B.2.2) follows that every array of rank $k$ can be represented as a direct sum of arrays of rank $k - 1$, and recursively, as iterated direct sums of those scalars which one gets by saturating all arms with unit vectors. Hence the following “extensionality property”: if the arrays $A$ and $B$ are such that for all possible conformable choices of unit vectors $\kappa_1 \cdots \kappa_8$ follows

(B.2.6) [tile equation: $A$ with its eight arms saturated by $\kappa_1, \ldots, \kappa_8$ equals $B$ with its eight arms saturated by the same unit vectors]

then $A = B$. This is why the saturation of an array with unit vectors can be considered one of its “elements,” i.e.,

(B.2.7) [tile diagram: $A$ with its eight arms saturated by $\kappa_1, \ldots, \kappa_8$] $= a_{\kappa_1 \kappa_2 \kappa_3 \kappa_4 \kappa_5 \kappa_6 \kappa_7 \kappa_8}$.

From (B.2.3) and (B.2.4) follows that the concatenation of two arrays by joining one or more pairs of arms consists in forming all possible products and summing over those subscripts (arms) which are joined to each other. For instance, if

m ─[A]─ n ─[B]─ r = m ─[C]─ r,

then $c_{\mu\rho} = \sum_{\nu=1}^n a_{\mu\nu} b_{\nu\rho}$. This is one of the most basic facts if one thinks of arrays as collections of elements. From this point of view, the proposed notation is simply a graphical elaboration of Einstein’s summation convention. But in the holistic approach taken by the proposed system of axioms, which is informed by category theory, it is an implication; it comes at the end, not the beginning.
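The element formula $c_{\mu\rho} = \sum_\nu a_{\mu\nu} b_{\nu\rho}$ can be compared with the “outer product first, then contract” evaluation described in Section B.1. The following Python sketch does this for one small example; the helper names `outer`, `contract_inner`, and `matmul` are hypothetical, and nested lists with a fixed index order stand in for the abstract arrays.

```python
def outer(a, b):
    """Rank-4 outer product of two matrices, indexed t[m][p][q][n] = a[m][p] * b[q][n]."""
    return [[[[a_mp * b_qn for b_qn in b_row] for b_row in b]
             for a_mp in a_row]
            for a_row in a]


def contract_inner(t):
    """Join the second and third arm of a rank-4 array: sum over p of t[m][p][p][n]."""
    m_dim, p_dim, n_dim = len(t), len(t[0]), len(t[0][0][0])
    return [[sum(t[m][p][p][n] for p in range(p_dim)) for n in range(n_dim)]
            for m in range(m_dim)]


def matmul(a, b):
    """Ordinary matrix product, c[m][n] = sum over p of a[m][p] * b[p][n]."""
    return [[sum(a[m][p] * b[p][n] for p in range(len(b)))
             for n in range(len(b[0]))]
            for m in range(len(a))]


a = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
b = [[1, 0], [2, 1], [0, 3]]      # 3 x 2

via_outer = contract_inner(outer(a, b))
direct = matmul(a, b)
print(via_outer)                  # [[5, 11], [14, 23]]
print(via_outer == direct)        # True
```

Forming the full outer product first needs storage for all $2 \cdot 3 \cdot 3 \cdot 2$ products, whereas `matmul` contracts as it goes; this is the efficiency point made at the end of Section B.1.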
Instead of considering arrays as bags filled with elements, with the associated false problem of specifying the order in which the elements are packed into the bag, this notation and system of axioms consider each array as an abstract entity, associated with a certain finite graph. These entities can be operated on as specified in the axioms, but the only time they lose their abstract character is when they are fully saturated, i.e., concatenated with each other in such a way that no free arms are left: in this case they become scalars. An array of rank 1 is not the same as a vector, although it can be represented as a vector—after an ordering of its elements has been specified. This ordering is not part of the definition of the array itself. (Some vectors, such as time series, have an intrinsic ordering, but I am speaking here of the simplest case where they do not.) Also the ordering of the arms is not specified, and the order in which a set of arrays is packed into its direct sum is not specified either. These axioms therefore make a strict distinction between the abstract entities themselves (which the user is interested in) and their various representations (which the computer worries about). Maybe the following examples may clarify these points. If you specify a set of colors as {red, green, blue}, then this representation has an ordering built in: red comes first, then green, then blue. However this ordering is not part of the definition of the set; {green, red, blue} is the same set. The two notations are two different representations of the same set. Another example: mathematicians usually distinguish between the outer products A ⊗ B and B ⊗ A; there is a “natural isomorphism” between them but they are two different objects. In the system of axioms proposed here these two notations are two different representations of the same object, as in the set example. 
This object is represented by a graph which has $A$ and $B$ as nodes, but it is not apparent from this graph which node comes first. Interesting conceptual issues are involved here. The proposed axioms are quite different from e.g. [Mor73].

Problem 406. The trace of the product of two matrices can be written as
\[ \operatorname{tr}(XY) = \iota^\top (X * Y^\top)\iota. \tag{B.2.8} \]
I.e., one forms the element-by-element product of $X$ and $Y^\top$ and takes the sum of all the elements of the resulting matrix. Use tile notation to show that this gives indeed $\operatorname{tr}(XY)$.

Answer. In analogy with (B.1.5), the Hadamard product of the two matrices $X$ and $Z$, i.e., their element by element multiplication, is

$X * Z$ = [tile diagram: two rank-3 ∆ tiles, each joining one arm of $X$ with the corresponding arm of $Z$ and leaving one arm free]

If $Z = Y^\top$, one gets

$X * Y^\top$ = [tile diagram: the same configuration with $Y$ in place of $Z$, its two arms interchanged].

Therefore one gets, using (B.2.5):

$\iota^\top (X * Y^\top)\iota$ = [tile diagram: the previous configuration with both free arms saturated by ι] = [tile diagram: $X$ and $Y$ joined in a closed loop] = $\operatorname{tr}(XY)$

B.3. An Additional Notational Detail

Besides turning a tile by 90, 180, or 270 degrees, the notation proposed here also allows flipping the tile over. The flipped tile (here drawn without its arms) is simply the tile laid on its face; i.e., those parts of the frame, which are black on the side visible to the reader, are white on the opposite side and vice versa. If one flips a tile, the arms appear in a mirror-symmetric manner. For a matrix, flipping over is equivalent to turning by 180 degrees, i.e., there is no difference between the matrix [tile diagram: $A$ flipped over] and the matrix [tile diagram: $A$ turned by 180 degrees]. Since sometimes one and sometimes the other notation seems more natural, both will be used. For higher arrays, flipping over arranges the arms in a different fashion, which is sometimes convenient in order to keep the graphs uncluttered. It will be especially useful for differentiation.
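Identity (B.2.8) from Problem 406 can also be checked numerically. The sketch below (plain Python, with hypothetical helper names `matmul`, `trace`, and `hadamard_trace`) computes the trace in three ways for one small example: via the matrix product $XY$, via $YX$, and via the sum of the element-by-element product of $X$ with the transpose of $Y$, which never forms a matrix product.

```python
def matmul(a, b):
    """Ordinary matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def hadamard_trace(x, y):
    """Sum over all i, j of x[i][j] * y[j][i], i.e. the sum of all elements of X * Y'."""
    return sum(x[i][j] * y[j][i]
               for i in range(len(x))
               for j in range(len(x[0])))


x = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
y = [[1, 0], [2, 1], [0, 3]]      # 3 x 2

t_xy = trace(matmul(x, y))        # trace of a 2 x 2 product
t_yx = trace(matmul(y, x))        # trace of a 3 x 3 product
t_had = hadamard_trace(x, y)      # no matrix product formed
print(t_xy, t_yx, t_had)          # 28 28 28
```

The three evaluation paths differ in the number of multiplications they need, which is the point of letting the computer choose among them.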
If one allows turning in 90 degree increments and flipping, each array can be represented in eight different positions, as shown here with a hypothetical array of rank 3:

[tile diagram: the rank-3 array $L$ with arms of dimensions $k$, $m$, $n$, drawn in its eight positions (four rotations, each flipped and unflipped)]

The black-and-white pattern at the edge of the tile indicates whether and how much the tile has been turned and/or flipped over, so that one can keep track which arm is which. In the above example, the arm with dimension $k$ will always be called the West arm, whatever position the tile is in.

B.4. Equality of Arrays and Extended Substitution

Given the flexibility of representing the same array in various positions for concatenation, specific conventions are necessary to determine when two such arrays in generalized positions are equal to each other. Expressions like

[tile equations: $A = B$ and $K = K$ with the tiles on the two sides drawn in different positions]

are not allowed. The arms on both sides of the equal sign must be parallel, in order to make it clear which arm corresponds to which. A permissible way to write the above expressions would therefore be

[tile equations: $A = B$ and $K = K$ with corresponding arms drawn parallel on both sides]

One additional benefit of this tile notation is the ability to substitute arrays with different numbers of arms into an equation. This is also a necessity since the number of possible arms is unbounded. This multiplicity can only be coped with because each arm in an identity written in this notation can be replaced by a bundle of many arms.

Extended substitution also makes it possible to extend definitions familiar from matrices to higher arrays. For instance we want to be able to say that the array Ω is symmetric if and only if [tile equation: Ω equals Ω with its two arms interchanged]. This notion of symmetry is not limited to arrays of rank 2. The arms of this array may symbolize not just a single arm, but whole bundles of arms; for instance an array Σ satisfying [tile equation: Σ equals Σ with its two bundles of arms interchanged] is symmetric according to this definition, and so is every scalar.
Also the notion of a nonnegative definite matrix, or of a matrix inverse or generalized inverse, or of a projection matrix, can be extended to arrays in this way.

B.5. Vectorization and Kronecker Product

One conventional, generally accepted method to deal with arrays of rank > 2 is the Kronecker product. If $A$ and $B$ are both matrices, then the outer product in tile notation is

(B.5.1) [tile diagram: the tiles $A$ and $B$ placed side by side, with four free arms]

Since this is an array of rank 4, there is no natural way to write its elements down on a sheet of paper. This is where the Kronecker product steps in. The Kronecker product of two matrices is their outer product written again as a matrix. Its definition includes a protocol how to arrange the elements of an array of rank 4 as a matrix. Alongside the Kronecker product, also the vectorization operator is useful, which is a protocol how to arrange the elements of a matrix as a vector, and also the so-called “commutation matrices” may become necessary. Here are the relevant definitions:

B.5.1. Vectorization of a Matrix. If $A$ is a matrix, then vec($A$) is the vector obtained by stacking the column vectors on top of each other, i.e.,
\[ \text{if } A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix} \text{ then } \operatorname{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}. \tag{B.5.2} \]
The vectorization of a matrix is merely a different arrangement of the elements of the matrix on paper, just as the transpose of a matrix.

Problem 407. Show that $\operatorname{tr}(B^\top C) = (\operatorname{vec} B)^\top \operatorname{vec} C$.

Answer. Both sides are $\sum_{j,i} b_{ji} c_{ji}$. (B.5.28) is a proof in tile notation which does not have to look at the matrices involved element by element.

By the way, a better protocol for vectorizing would have been to assemble all rows into one long row vector and then converting it into a column vector. In other words
\[ \text{if } B = \begin{bmatrix} b_1^\top \\ \vdots \\ b_m^\top \end{bmatrix} \text{ then } \operatorname{vec}(B) \text{ should have been defined as } \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}. \]
The usual protocol of stacking the columns is inconsistent with the lexicographical ordering used in the Kronecker product.
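Both stacking protocols can be tried on a small example. The following Python sketch (the helper names `vec_cols`, `vec_rows`, `dot`, and `trace_bt_c` are hypothetical) checks the identity of Problem 407, $\operatorname{tr}(B^\top C) = (\operatorname{vec} B)^\top \operatorname{vec} C$, and shows that it holds under either protocol, since both sides simply sum the products $b_{ji} c_{ji}$.

```python
def vec_cols(m):
    """Conventional vec: stack the columns of m on top of each other."""
    return [m[i][j] for j in range(len(m[0])) for i in range(len(m))]


def vec_rows(m):
    """Alternative protocol: write the rows of m one after another."""
    return [entry for row in m for entry in row]


def dot(u, v):
    """Scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def trace_bt_c(b, c):
    """Trace of B'C, computed from the diagonal of the product B'C."""
    n_cols = len(b[0])
    return sum(sum(b[j][i] * c[j][i] for j in range(len(b)))
               for i in range(n_cols))


b = [[1, 2, 3], [4, 5, 6]]
c = [[7, 8, 9], [1, 0, 2]]

print(vec_cols(b))                        # [1, 4, 2, 5, 3, 6]
print(vec_rows(b))                        # [1, 2, 3, 4, 5, 6]
print(trace_bt_c(b, c))                   # 66
print(dot(vec_cols(b), vec_cols(c)))      # 66
print(dot(vec_rows(b), vec_rows(c)))      # 66
```

The two protocols differ only in the order of the entries, so any identity that sums over all entries, like this one, is unaffected; identities involving the Kronecker product are not.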
Using the alternative definition, equation (B.5.19) which will be discussed below would be a little more intelligible; it would read $\operatorname{vec}(ABC) = (A \otimes C^\top) \operatorname{vec} B$ with the alternative definition of vec. Also the definition of vectorization in tile notation would be a little less awkward; instead of (B.5.24) one would have

[Tile diagram: vec A, with arm mn, equals Π with West arm mn whose two East arms m and n are joined to the arms of A.]

But this is merely a side remark; we will use the conventional definition (B.5.2) throughout.

B.5.2. Kronecker Product of Matrices. Let A and B be two matrices, say A is m × n and B is r × q. Their Kronecker product A ⊗ B is the mr × nq matrix which in partitioned form can be written

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix} \tag{B.5.3}$$

This convention of how to write the elements of an array of rank 4 as a matrix is not symmetric, so that usually $A \otimes C \ne C \otimes A$. Both Kronecker products represent the same abstract array, but they arrange it differently on the page. However, in many other respects, the Kronecker product maintains the properties of outer products.

Problem 408. [The71, pp. 303–306] Prove the following simple properties of the Kronecker product:

$$(A \otimes B)^\top = A^\top \otimes B^\top \tag{B.5.4}$$
$$(A \otimes B) \otimes C = A \otimes (B \otimes C) \tag{B.5.5}$$
$$I \otimes I = I \tag{B.5.6}$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD \tag{B.5.7}$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \tag{B.5.8}$$
$$(A \otimes B)^{-} = A^{-} \otimes B^{-} \tag{B.5.9}$$
$$A \otimes (B + C) = A \otimes B + A \otimes C \tag{B.5.10}$$
$$(A + B) \otimes C = A \otimes C + B \otimes C \tag{B.5.11}$$
$$(cA) \otimes B = A \otimes (cB) = c(A \otimes B) \tag{B.5.12}$$
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \otimes B = \begin{bmatrix} A_{11} \otimes B & A_{12} \otimes B \\ A_{21} \otimes B & A_{22} \otimes B \end{bmatrix} \tag{B.5.13}$$
$$\operatorname{rank}(A \otimes B) = (\operatorname{rank} A)(\operatorname{rank} B) \tag{B.5.14}$$
$$\operatorname{tr}(A \otimes B) = (\operatorname{tr} A)(\operatorname{tr} B) \tag{B.5.15}$$
$$\text{If } a \text{ is a } 1 \times 1 \text{ matrix, then } a \otimes B = B \otimes a = aB \tag{B.5.16}$$
$$\det(A \otimes B) = (\det(A))^n (\det(B))^k \tag{B.5.17}$$

where A is k × k and B is n × n.

Answer. For the determinant use the following facts: if a is an eigenvector of A with eigenvalue α and b is an eigenvector of B with eigenvalue β, then a ⊗ b is an eigenvector of A ⊗ B with eigenvalue αβ.
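The determinant is the product of all eigenvalues (multiple eigenvalues being counted several times). Count how many there are. An alternative approach would be to write $A \otimes B = (A \otimes I)(I \otimes B)$ and then to argue that $\det(A \otimes I) = (\det(A))^n$ and $\det(I \otimes B) = (\det(B))^k$. The formula for the rank can be shown using $\operatorname{rank}(A) = \operatorname{tr}(AA^{-})$; compare Problem 366.

Problem 409. 2 points [JHG+88, pp. 962–4] Write down the Kronecker product of

$$A = \begin{bmatrix} 1 & 3 \\ 2 & 0 \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} 2 & 2 & 0 \\ 1 & 0 & 3 \end{bmatrix}. \tag{B.5.18}$$

Show that $A \otimes B \ne B \otimes A$. Which other facts about the outer product do not carry over to the Kronecker product?

Answer.

$$A \otimes B = \begin{bmatrix} 2 & 2 & 0 & 6 & 6 & 0 \\ 1 & 0 & 3 & 3 & 0 & 9 \\ 4 & 4 & 0 & 0 & 0 & 0 \\ 2 & 0 & 6 & 0 & 0 & 0 \end{bmatrix} \qquad B \otimes A = \begin{bmatrix} 2 & 6 & 2 & 6 & 0 & 0 \\ 4 & 0 & 4 & 0 & 0 & 0 \\ 1 & 3 & 0 & 0 & 3 & 9 \\ 2 & 0 & 0 & 0 & 6 & 0 \end{bmatrix}$$

Partitioning of the matrix on the right does not carry over.

Problem 410. [JHG+88, p. 965] Show that

$$\operatorname{vec}(ABC) = (C^\top \otimes A) \operatorname{vec}(B). \tag{B.5.19}$$

Answer. Assume A is k × m, B is m × n, and C is n × p. Write

$$A = \begin{bmatrix} a_1^\top \\ \vdots \\ a_k^\top \end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix}.$$

Then

$$(C^\top \otimes A) \operatorname{vec} B = \begin{bmatrix} c_{11}A & c_{21}A & \cdots & c_{n1}A \\ c_{12}A & c_{22}A & \cdots & c_{n2}A \\ \vdots & \vdots & \ddots & \vdots \\ c_{1p}A & c_{2p}A & \cdots & c_{np}A \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix} = \begin{bmatrix} c_{11}a_1^\top b_1 + c_{21}a_1^\top b_2 + \cdots + c_{n1}a_1^\top b_n \\ c_{11}a_2^\top b_1 + c_{21}a_2^\top b_2 + \cdots + c_{n1}a_2^\top b_n \\ \vdots \\ c_{11}a_k^\top b_1 + c_{21}a_k^\top b_2 + \cdots + c_{n1}a_k^\top b_n \\ c_{12}a_1^\top b_1 + c_{22}a_1^\top b_2 + \cdots + c_{n2}a_1^\top b_n \\ c_{12}a_2^\top b_1 + c_{22}a_2^\top b_2 + \cdots + c_{n2}a_2^\top b_n \\ \vdots \\ c_{12}a_k^\top b_1 + c_{22}a_k^\top b_2 + \cdots + c_{n2}a_k^\top b_n \\ \vdots \\ c_{1p}a_1^\top b_1 + c_{2p}a_1^\top b_2 + \cdots + c_{np}a_1^\top b_n \\ c_{1p}a_2^\top b_1 + c_{2p}a_2^\top b_2 + \cdots + c_{np}a_2^\top b_n \\ \vdots \\ c_{1p}a_k^\top b_1 + c_{2p}a_k^\top b_2 + \cdots + c_{np}a_k^\top b_n \end{bmatrix}$$

One obtains the same result by vectorizing the matrix

$$ABC = \begin{bmatrix} a_1^\top b_1 & a_1^\top b_2 & \cdots & a_1^\top b_n \\ a_2^\top b_1 & a_2^\top b_2 & \cdots & a_2^\top b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_k^\top b_1 & a_k^\top b_2 & \cdots & a_k^\top b_n \end{bmatrix} \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{np} \end{bmatrix}$$

$$= \begin{bmatrix} a_1^\top b_1 c_{11} + a_1^\top b_2 c_{21} + \cdots + a_1^\top b_n c_{n1} & a_1^\top b_1 c_{12} + a_1^\top b_2 c_{22} + \cdots + a_1^\top b_n c_{n2} & \cdots & a_1^\top b_1 c_{1p} + a_1^\top b_2 c_{2p} + \cdots + a_1^\top b_n c_{np} \\ a_2^\top b_1 c_{11} + a_2^\top b_2 c_{21} + \cdots + a_2^\top b_n c_{n1} & a_2^\top b_1 c_{12} + a_2^\top b_2 c_{22} + \cdots + a_2^\top b_n c_{n2} & \cdots & a_2^\top b_1 c_{1p} + a_2^\top b_2 c_{2p} + \cdots + a_2^\top b_n c_{np} \\ \vdots & \vdots & \ddots & \vdots \\ a_k^\top b_1 c_{11} + a_k^\top b_2 c_{21} + \cdots + a_k^\top b_n c_{n1} & a_k^\top b_1 c_{12} + a_k^\top b_2 c_{22} + \cdots + a_k^\top b_n c_{n2} & \cdots & a_k^\top b_1 c_{1p} + a_k^\top b_2 c_{2p} + \cdots + a_k^\top b_n c_{np} \end{bmatrix}.$$

The main challenge in this automatic proof is to fit the many matrix rows, columns, and single elements involved on the same sheet of paper. Among the shuffling of matrix entries, it is easy to lose track of how the result comes about. Later, in equation (B.5.29), a compact and intelligible proof will be given in tile notation.

The dispersion of a random matrix Y is often given as the matrix $\mathcal{V}[\operatorname{vec} Y]$, where the vectorization is usually not made explicit, i.e., this matrix is denoted $\mathcal{V}[Y]$.

Problem 411. If $\mathcal{V}[\operatorname{vec} Y] = \Sigma \otimes \Omega$ and P and Q are matrices of constants, show that $\mathcal{V}[\operatorname{vec} PYQ] = (Q^\top \Sigma Q) \otimes (P \Omega P^\top)$.

Answer. Apply (B.5.19): $\mathcal{V}[\operatorname{vec} PYQ] = \mathcal{V}[(Q^\top \otimes P) \operatorname{vec} Y] = (Q^\top \otimes P)(\Sigma \otimes \Omega)(Q \otimes P^\top)$. Now apply (B.5.7).

Problem 412. 2 points If α and γ are vectors, then show that $\operatorname{vec}(\alpha\gamma^\top) = \gamma \otimes \alpha$.

Answer. One sees this by writing down the matrices, or one can use (B.5.19) with A = α, B = 1, the 1 × 1 matrix, and $C = \gamma^\top$.

Problem 413. 2 points If α is a nonrandom vector and δ a random vector, show that $\mathcal{V}[\delta \otimes \alpha] = \mathcal{V}[\delta] \otimes (\alpha\alpha^\top)$.

Answer.

$$\delta \otimes \alpha = \begin{bmatrix} \alpha\delta_1 \\ \vdots \\ \alpha\delta_n \end{bmatrix}$$

$$\mathcal{V}[\delta \otimes \alpha] = \begin{bmatrix} \alpha \operatorname{var}[\delta_1] \alpha^\top & \alpha \operatorname{cov}[\delta_1, \delta_2] \alpha^\top & \cdots & \alpha \operatorname{cov}[\delta_1, \delta_n] \alpha^\top \\ \alpha \operatorname{cov}[\delta_2, \delta_1] \alpha^\top & \alpha \operatorname{var}[\delta_2] \alpha^\top & \cdots & \alpha \operatorname{cov}[\delta_2, \delta_n] \alpha^\top \\ \vdots & \vdots & \ddots & \vdots \\ \alpha \operatorname{cov}[\delta_n, \delta_1] \alpha^\top & \alpha \operatorname{cov}[\delta_n, \delta_2] \alpha^\top & \cdots & \alpha \operatorname{var}[\delta_n] \alpha^\top \end{bmatrix}$$

$$= \begin{bmatrix} \operatorname{var}[\delta_1] \alpha\alpha^\top & \operatorname{cov}[\delta_1, \delta_2] \alpha\alpha^\top & \cdots & \operatorname{cov}[\delta_1, \delta_n] \alpha\alpha^\top \\ \operatorname{cov}[\delta_2, \delta_1] \alpha\alpha^\top & \operatorname{var}[\delta_2] \alpha\alpha^\top & \cdots & \operatorname{cov}[\delta_2, \delta_n] \alpha\alpha^\top \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}[\delta_n, \delta_1] \alpha\alpha^\top & \operatorname{cov}[\delta_n, \delta_2] \alpha\alpha^\top & \cdots & \operatorname{var}[\delta_n] \alpha\alpha^\top \end{bmatrix} = \mathcal{V}[\delta] \otimes \alpha\alpha^\top$$

B.5.3. The Commutation Matrix. Besides the Kronecker product and the vectorization operator, also the "commutation matrix" [MN88, pp. 46/7], [Mag88, p. 35] is needed for certain operations involving arrays of higher rank. Assume A is m × n.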
Then the commutation matrix $K^{(m,n)}$ is the mn × mn matrix which transforms vec A into $\operatorname{vec}(A^\top)$:

$$K^{(m,n)} \operatorname{vec} A = \operatorname{vec}(A^\top) \tag{B.5.20}$$

The main property of the commutation matrix is that it makes it possible to commute the Kronecker product. For any m × n matrix A and r × q matrix B it follows that

$$K^{(r,m)} (A \otimes B) K^{(n,q)} = B \otimes A \tag{B.5.21}$$

Problem 414. Use (B.5.20) to compute $K^{(2,3)}$.

Answer.

$$K^{(2,3)} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \tag{B.5.22}$$

B.5.4. Kronecker Product and Vectorization in Tile Notation. The Kronecker product of the m × n matrix A and the r × q matrix B is the following concatenation of A and B with members of a certain family of three-way arrays $\Pi^{(i,j)}$:

(B.5.23) [Tile diagram: A ⊗ B, with arms mr and nq, equals a first Π with West arm mr whose East arms m and r join A and B, whose remaining arms n and q join a second Π with arm nq.]

Strictly speaking we should have written $\Pi^{(m,r)}$ and $\Pi^{(n,q)}$ for the two Π-arrays in (B.5.23), but the superscripts can be inferred from the context: the first superscript is the dimension of the Northeast arm, and the second that of the Southeast arm.

Vectorization uses a member of the same family $\Pi^{(m,n)}$ to convert the matrix A, drawn with arms n and m, into the vector

(B.5.24) [Tile diagram: vec A, with arm mn, equals Π with West arm mn whose East arms m and n are joined to the arms of A.]

This equation is a little awkward because the A is here an n × m matrix, while elsewhere it is an m × n matrix. It would have been more consistent with the lexicographical ordering used in the Kronecker product to define vectorization as the stacking of the row vectors; then some of the formulas would have looked more natural.

The array $\Pi^{(m,n)}$ [tile diagram: Π with West arm mn and East arms m and n] exists for every m ≥ 1 and n ≥ 1. The dimension of the West arm is always the product of the dimensions of the two East arms. The elements of $\Pi^{(m,n)}$ will be given in (B.5.30) below; but first I will list three important properties of these arrays and give examples of their application.

First of all, each $\Pi^{(m,n)}$ satisfies

(B.5.25) [Tile diagram: two copies of Π joined along their mn arms, leaving arms m, n on each side, equal the bent arm m–m written next to the bent arm n–n.]

Let us discuss the meaning of (B.5.25) in detail.
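As a numerical aside on the matrix-level definitions above, (B.5.3), (B.5.19), (B.5.20), and (B.5.21) can be spot-checked on the matrices of Problem 409. This is a minimal sketch in plain Python under the assumption that matrices are lists of rows; the function names `kron` and `commutation`, and the extra matrix `C`, are hypothetical choices made here for illustration.

```python
# Matrices are lists of rows; all names below are illustrative.

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def vec(a):
    """Stack the columns, as in (B.5.2)."""
    return [x for col in zip(*a) for x in col]

def kron(a, b):
    """Kronecker product in the partitioned form (B.5.3)."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def commutation(m, n):
    """K^(m,n): permutation matrix with K vec(A) = vec(A') for m x n A."""
    k = [[0] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            # a_ij is element j*m + i of vec(A) and element i*n + j of vec(A')
            k[i * n + j][j * m + i] = 1
    return k

A = [[1, 3], [2, 0]]            # m x n = 2 x 2, from (B.5.18)
B = [[2, 2, 0], [1, 0, 3]]      # r x q = 2 x 3, from (B.5.18)
C = [[1, 0], [2, 1], [0, 1]]    # 3 x 2, chosen only for this check

# (B.5.20): K^(2,3) vec(B) equals vec(B')
check_20 = matvec(commutation(2, 3), vec(B)) == vec(transpose(B))

# (B.5.21): K^(r,m) (A kron B) K^(n,q) equals B kron A
lhs_21 = matmul(matmul(commutation(2, 2), kron(A, B)), commutation(2, 3))
check_21 = lhs_21 == kron(B, A)

# (B.5.19): vec(ABC) equals (C' kron A) vec(B)
check_19 = vec(matmul(matmul(A, B), C)) == matvec(kron(transpose(C), A), vec(B))

print(check_20, check_21, check_19)
```

Building the commutation matrix directly from index positions, rather than from the Π arrays, keeps the sketch independent of the element formula that the notes only give later in (B.5.30).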
The lefthand side of (B.5.25) shows the concatenation of two copies of the three-way array $\Pi^{(m,n)}$ in a certain way that yields a 4-way array. Now look at the righthand side. The bent arm m–m by itself (which was bent only in order to remove any doubt about which arm to the left of the equal sign corresponds to which arm to the right) represents the neutral element under concatenation (i.e., the m × m identity matrix). Writing two arrays next to each other without joining any arms represents their outer product, i.e., the array whose rank is the sum of the ranks of the arrays involved, and whose elements are all possible products of elements of the first array with elements of the second array.

The second identity satisfied by $\Pi^{(m,n)}$ is

(B.5.26) [Tile diagram: two copies of Π joined along both their m arms and their n arms equal the straight arm mn–mn.]

Finally, there is also associativity:

(B.5.27) [Tile diagram: an arm mnp split into arms m, n, p by two Π arrays, grouping either m with n first or n with p first; both groupings are equal.]

Here is the answer to Problem 407 in tile notation:

(B.5.28) $\operatorname{tr} B^\top C$ = [tile diagrams: B and C joined along both pairs of arms; the same with a pair of Π arrays inserted; vec B joined to vec C] = $(\operatorname{vec} B)^\top \operatorname{vec} C$

Equation (B.5.25) was central for obtaining the result. The answer to Problem 410 also relies on equation (B.5.25):

(B.5.29) $(C^\top \otimes A) \operatorname{vec} B$ = [tile diagrams: C and A joined through two Π arrays to a third Π attached to B; after cancelling a pair of Π arrays, a single Π attached to the chain A, B, C] = $\operatorname{vec} ABC$

B.5.5. Looking Inside the Kronecker Arrays. It is necessary to open up the arrays from the Π-family and look at them "element by element," in order to verify (B.5.23), (B.5.24), (B.5.25), (B.5.26), and (B.5.27). The elements of $\Pi^{(m,n)}$, which can be written in tile notation by saturating the array with unit vectors, are

$$\pi^{(m,n)}_{\theta\mu\nu} = \begin{cases} 1 & \text{if } \theta = (\mu - 1)n + \nu \\ 0 & \text{otherwise.} \end{cases} \tag{B.5.30}$$

[Tile diagram accompanying (B.5.30): Π with its mn arm joined to the unit vector θ and its m and n arms joined to the unit vectors μ and ν.]

Note that for every θ there is exactly one μ and one ν such that $\pi^{(m,n)}_{\theta\mu\nu} = 1$; for all other values of μ and ν, $\pi^{(m,n)}_{\theta\mu\nu} = 0$.

Writing $a_{\nu\mu}$ for the elements of A (the tile A saturated with the unit vectors ν and μ) and $c_\theta$ for the elements of vec A, (B.5.24) reads

$$c_\theta = \sum_{\mu,\nu} \pi^{(m,n)}_{\theta\mu\nu} a_{\nu\mu}, \tag{B.5.31}$$

which coincides with definition (B.5.2) of vec A.

One also checks that (B.5.23) is (B.5.3). Calling A ⊗ B = C, it follows from (B.5.23) that

$$c_{\phi\theta} = \sum_{\mu,\nu,\rho,\kappa} \pi^{(m,r)}_{\phi\mu\rho} a_{\mu\nu} b_{\rho\kappa} \pi^{(n,q)}_{\theta\nu\kappa}. \tag{B.5.32}$$

For 1 ≤ φ ≤ r one gets a nonzero $\pi^{(m,r)}_{\phi\mu\rho}$ only for μ = 1 and ρ = φ, and for 1 ≤ θ ≤ q one gets a nonzero $\pi^{(n,q)}_{\theta\nu\kappa}$ only for ν = 1 and κ = θ. Therefore $c_{\phi\theta} = a_{11} b_{\phi\theta}$ for all elements of matrix C with φ ≤ r and θ ≤ q. Etc.

The proof of (B.5.25) uses the fact that for every θ there is exactly one μ and one ν such that $\pi^{(m,n)}_{\theta\mu\nu} \ne 0$:

$$\sum_{\theta=1}^{mn} \pi^{(m,n)}_{\theta\mu\nu} \pi^{(m,n)}_{\theta\omega\sigma} = \begin{cases} 1 & \text{if } \mu = \omega \text{ and } \nu = \sigma \\ 0 & \text{otherwise} \end{cases} \tag{B.5.33}$$

Similarly, (B.5.26) and (B.5.27) can be shown by elementary but tedious proofs. The best verification of these rules is their implementation in a computer language, see Section ?? below.

B.5.6. The Commutation Matrix in Tile Notation. The simplest way to represent the commutation matrix $K^{(m,n)}$ in a tile is

(B.5.34) [Tile diagram: $K^{(m,n)}$, with arms mn and mn, equals two Π arrays joined along their m and n arms.]

This should not be confused with the lefthand side of (B.5.26): $K^{(m,n)}$ is composed of $\Pi^{(m,n)}$ on its West and $\Pi^{(n,m)}$ on its East side, while (B.5.26) contains $\Pi^{(m,n)}$ twice. We will therefore use the following representation, mathematically equivalent to (B.5.34), which makes it easier to see the effects of $K^{(m,n)}$:

(B.5.35) [Tile diagram: $K^{(m,n)}$ drawn as two Π arrays whose m and n arms cross between them.]

Problem 415. Using the definition (B.5.35) show that $K^{(m,n)} K^{(n,m)} = I_{mn}$, the mn × mn identity matrix.

Answer. You will need (B.5.25) and (B.5.26).

Problem 416. Prove (B.5.21) in tile notation.

Answer. Start with a tile representation of $K^{(r,m)} (A \otimes B) K^{(n,q)}$:

[Tile diagram: the Π pair of $K^{(r,m)}$ with arm rm, followed by the Π, A, B, Π concatenation of A ⊗ B, followed by the Π pair of $K^{(n,q)}$ with arm nq.]

Now use (B.5.25) twice to get

[Tile diagram: Π with arm rm joined to A and B with crossed arms and closed by Π with arm nq, which equals Π with arm rm joined to B (arms r, q) above A (arms m, n) and closed by Π with arm nq.]

APPENDIX C

Matrix Differentiation

C.1. First Derivatives

Let us first consider the scalar case and then generalize from there. The derivative of a function f is often written

$$\frac{dy}{dx} = f'(x) \tag{C.1.1}$$

Multiply through by dx to get $dy = f'(x)\,dx$. In order to see the meaning of this equation, we must know the definition $dy = f(x + dx) - f(x)$. Therefore one obtains $f(x + dx) = f(x) + f'(x)\,dx$.
If one holds x constant and only varies dx this formula shows that in an infinitesimal neighborhood of x, the function f is an affine function of dx, i.e., a linear function of dx with a constant term: $f(x)$ is the intercept, i.e., the value for dx = 0, and $f'(x)$ is the slope parameter.

Now let us transfer this argument to vector functions y = f(x). Here y is an n-vector and x an m-vector, i.e., f is an n-tuple of functions of m variables each

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_1(x_1, \ldots, x_m) \\ \vdots \\ f_n(x_1, \ldots, x_m) \end{bmatrix} \tag{C.1.2}$$

One may also say, f is an n-vector, each element of which depends on x. Again, under certain differentiability conditions, it is possible to write this function infinitesimally as an affine function, i.e., one can write

$$f(x + dx) = f(x) + A\,dx. \tag{C.1.3}$$

Here the coefficient of dx is no longer a scalar but necessarily a matrix A (whose elements again depend on x). A is called the Jacobian matrix of f. The Jacobian matrix generalizes the concept of a derivative to vectors. Instead of a prime denoting the derivative, as in $f'(x)$, one writes A = Df.

Problem 417. 2 points If f is a scalar function of a vector argument x, is its Jacobian matrix A a row vector or a column vector? Explain why this must be so.

The Jacobian A defined in this way turns out to have a very simple functional form: its elements are the partial derivatives of all components of f with respect to all components of x:

$$a_{ij} = \frac{\partial f_i}{\partial x_j}. \tag{C.1.4}$$

Since in this matrix f acts as a column and x as a row vector, this matrix can be written, using matrix differentiation notation, as $A(x) = \partial f(x)/\partial x^\top$.

Strictly speaking, matrix notation can be used for matrix differentiation only if we differentiate a column vector (or scalar) with respect to a row vector (or scalar), or if we differentiate a scalar with respect to a matrix or a matrix with respect to a scalar. If we want to differentiate matrices with respect to vectors or vectors with respect to matrices or matrices with respect to each other, we need the tile notation for arrays. A different, much less enlightening approach is to first "vectorize" the matrices involved. Both of those methods will be discussed later.

If the dependence of y on x can be expressed in terms of matrix operations or more general array concatenations, then some useful matrix differentiation rules exist.

The simplest matrix differentiation rule, for $f(x) = w^\top x$ with

$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix} \quad\text{and}\quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \tag{C.1.5}$$

is

$$\partial w^\top x / \partial x^\top = w^\top \tag{C.1.6}$$

Here is the proof of (C.1.6):

$$\frac{\partial w^\top x}{\partial x^\top} = \begin{bmatrix} \frac{\partial}{\partial x_1}(w_1 x_1 + \cdots + w_n x_n) & \cdots & \frac{\partial}{\partial x_n}(w_1 x_1 + \cdots + w_n x_n) \end{bmatrix} = \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix} = w^\top$$

The second rule, for $f(x) = x^\top M x$ and M symmetric, is:

$$\partial x^\top M x / \partial x^\top = 2 x^\top M. \tag{C.1.7}$$

To show (C.1.7), write

$$\begin{aligned} x^\top M x = {} & x_1 m_{11} x_1 + x_1 m_{12} x_2 + \cdots + x_1 m_{1n} x_n \\ {} + {} & x_2 m_{21} x_1 + x_2 m_{22} x_2 + \cdots + x_2 m_{2n} x_n \\ & \vdots \\ {} + {} & x_n m_{n1} x_1 + x_n m_{n2} x_2 + \cdots + x_n m_{nn} x_n \end{aligned}$$

and take the partial derivative of this sum with respect to each of the $x_i$. For instance, differentiation with respect to $x_1$ gives

$$\begin{aligned} \partial x^\top M x / \partial x_1 = {} & 2 m_{11} x_1 + m_{12} x_2 + \cdots + m_{1n} x_n \\ {} + {} & x_2 m_{21} \\ & \vdots \\ {} + {} & x_n m_{n1} \end{aligned}$$

Now split the upper diagonal element, writing it as $m_{11} x_1 + x_1 m_{11}$, to get

$$\begin{aligned} = {} & m_{11} x_1 + m_{12} x_2 + \cdots + m_{1n} x_n \\ {} + {} & x_1 m_{11} \\ {} + {} & x_2 m_{21} \\ & \vdots \\ {} + {} & x_n m_{n1} \end{aligned}$$

The sum of the elements in the first row is the first element of the column vector $Mx$, and the sum of the elements in the column underneath is the first element of the row vector $x^\top M$. Overall this has to be arranged as a row vector, since we differentiate with respect to $\partial x^\top$, therefore we get

$$\partial x^\top M x / \partial x^\top = x^\top (M + M^\top). \tag{C.1.8}$$

This is true for arbitrary M, and for symmetric M, it simplifies to (C.1.7). The formula for symmetric M is all we need, since a quadratic form with an unsymmetric M is identical to that with the symmetric $(M + M^\top)/2$.
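Rule (C.1.8), and the remark about replacing M by its symmetric part, can be checked against finite differences. The sketch below uses plain Python; the function names and the particular M and x are assumptions chosen for illustration, and the central-difference step h is an arbitrary small value.

```python
# Numerical check of d(x'Mx)/dx' = x'(M + M') for a non-symmetric M.
# Names and test values are illustrative, not taken from the notes.

def quad_form(m, x):
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_gradient(m, x):
    """Row vector x'(M + M'), as in (C.1.8)."""
    n = len(x)
    return [sum(x[i] * (m[i][j] + m[j][i]) for i in range(n)) for j in range(n)]

def numeric_gradient(f, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(x)):
        up, down = x[:], x[:]
        up[j] += h
        down[j] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

M = [[2.0, 1.0, 0.0],
     [3.0, 4.0, -1.0],
     [0.5, 2.0, 1.0]]
x = [1.0, -2.0, 0.5]

# Symmetric part (M + M')/2 gives the same quadratic form
S = [[(M[i][j] + M[j][i]) / 2 for j in range(3)] for i in range(3)]

g_analytic = analytic_gradient(M, x)
g_numeric = numeric_gradient(lambda v: quad_form(M, v), x)

print(g_analytic)
print(g_numeric)
print(quad_form(M, x), quad_form(S, x))
```

Because the quadratic form is a second-degree polynomial, central differences agree with the analytic gradient up to rounding error, so any visible discrepancy would point to an indexing mistake rather than to truncation error.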
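Here is the tile notation for matrix differentiation: If the array y (with arm n) depends on the array x (with arm m), then

(C.1.9) [Tile diagram: the array A, with arms n and m, is defined as ∂y/∂x.]

is that array which satisfies

(C.1.10) [Tile diagram: A concatenated with dx along the arm m equals dy.]

i.e.,

(C.1.11) [Tile diagram: ∂y/∂x concatenated with dx equals dy.]

Extended substitutability applies here: y and x are not necessarily vectors; the arms with dimension m and n can represent different bundles of several arms.

In tiles, (C.1.6) is

(C.1.12) [Tile diagram: the derivative of w joined to x, with respect to x, equals w.]

and (C.1.8) is

(C.1.13) [Tile diagram: the derivative of x joined to M joined to x, with respect to x, equals the sum of M with x attached to one arm and M with x attached to the other arm.]

In (C.1.6) and (C.1.7), we took the derivatives of scalars with respect to vectors. The simplest example of a derivative of a vector with respect to a vector is a linear function. This gives the most basic matrix differentiation rule: If y = Ax is a linear vector function, then its derivative is that same linear vector function:

$$\partial A x / \partial x^\top = A, \tag{C.1.14}$$

or in tiles

(C.1.15) [Tile diagram: the derivative of A joined to x, with respect to x, equals A.]

Problem 418. Show that

$$\frac{\partial \operatorname{tr} AX}{\partial X^\top} = A. \tag{C.1.16}$$

In tiles it reads

(C.1.17) [Tile diagram: the derivative of A joined to X along both arms m and n, with respect to X, equals A.]

Answer. $\operatorname{tr}(AX) = \sum_{i,j} a_{ij} x_{ji}$, i.e., the coefficient of $x_{ji}$ is $a_{ij}$.

Here is a differentiation rule for a matrix with respect to a matrix, first written element by element, and then in tiles: If Y = AXB, i.e., $y_{im} = \sum_{j,k} a_{ij} x_{jk} b_{km}$, then $\partial y_{im} / \partial x_{jk} = a_{ij} b_{km}$, because for every fixed i and m this sum contains only one term which has $x_{jk}$ in it, namely, $a_{ij} x_{jk} b_{km}$. In tiles:

(C.1.18) [Tile diagram: the derivative of the chain A, X, B with respect to X equals A and B written next to each other.]

Equations (C.1.17) and (C.1.18) can be obtained from (C.1.12) and (C.1.15) by extended substitution, since a bundle of several arms can always be considered as one arm. For instance, (C.1.17) can be written

[Tile diagram: the derivative of A joined to X by two parallel arms, with respect to X, equals A.]

and this is a special case of (C.1.12), since the two parallel arms can be treated as one arm. With a better development of the logic underlying this notation, it will not be necessary to formulate them as separate theorems; all matrix differentiation rules given so far are trivial applications of (C.1.15).

Problem 419. As a special case of (C.1.18) show that $\partial x^\top A y / \partial A^\top = y x^\top$.

Answer.

(C.1.19) [Tile diagram: the derivative of the chain x, A, y with respect to A equals x and y written next to each other.]

Here is a basic differentiation rule for bilinear array concatenations: if

(C.1.20) [Tile diagram: y equals the three-way array A with two of its arms each joined to a copy of x.]

then one gets the following simple generalization of (C.1.13):

(C.1.21) [Tile diagram: the derivative of (C.1.20) with respect to x equals the sum of A with x joined to its first x-arm and A with x joined to its second x-arm.]

Proof. $y_i = \sum_{j,k} a_{ijk} x_j x_k$. For a given i, this has $x_p^2$ in the term $a_{ipp} x_p^2$, and it has $x_p$ in the terms $a_{ipk} x_p x_k$ where $p \ne k$, and in $a_{ijp} x_j x_p$ where $j \ne p$. The derivatives of these terms are $2 a_{ipp} x_p + \sum_{k \ne p} a_{ipk} x_k + \sum_{j \ne p} a_{ijp} x_j$, which simplifies to $\sum_k a_{ipk} x_k + \sum_j a_{ijp} x_j$. This is the i, p-element of the matrix on the rhs of (C.1.21).

But there are also other ways to have the array X occur twice in a concatenation Y. If $Y = X^\top X$ then $y_{ik} = \sum_j x_{ji} x_{jk}$ and therefore $\partial y_{ik} / \partial x_{lm} = 0$ if $m \ne i$ and $m \ne k$. Now assume $m = i \ne k$: $\partial y_{ik} / \partial x_{li} = \partial x_{li} x_{lk} / \partial x_{li} = x_{lk}$. Now assume $m = k \ne i$: $\partial y_{ik} / \partial x_{lk} = \partial x_{li} x_{lk} / \partial x_{lk} = x_{li}$. And if $m = k = i$ then one gets the sum of the two above: $\partial y_{ii} / \partial x_{li} = \partial x_{li}^2 / \partial x_{li} = 2 x_{li}$. In tiles this is

(C.1.22) [Tile diagram: the derivative of X joined to X, with respect to X, with arms i, k, l, m, equals the sum of two terms, each consisting of one copy of X next to an identity arm.]

This rule is helpful for differentiating the multivariate Normal likelihood function.

A computer implementation of this tile notation should contain algorithms to automatically take the derivatives of these array concatenations.

Here are some more matrix differentiation rules:

Chain rule: If g = g(η) and η = η(β) are two vector functions, then

$$\partial g / \partial \beta^\top = \partial g / \partial \eta^\top \cdot \partial \eta / \partial \beta^\top \tag{C.1.23}$$

For instance, the linear least squares objective function is $SSE = (y - X\hat\beta)^\top (y - X\hat\beta) = \hat\varepsilon^\top \hat\varepsilon$ where $\hat\varepsilon = y - X\hat\beta$. Application of the chain rule gives $\partial SSE / \partial \hat\beta^\top = \partial SSE / \partial \hat\varepsilon^\top \cdot \partial \hat\varepsilon / \partial \hat\beta^\top = 2 \hat\varepsilon^\top (-X)$ which is the same result as in (14.2.2).

If A is nonsingular then

$$\frac{\partial \log \det A}{\partial A^\top} = A^{-1} \tag{C.1.24}$$

Proof in [Gre97, pp. 52/3].

Bibliography

[AD75] J. Aczél and Z. Daróczy. On Measures of Information and their Characterizations. Academic Press, 1975. 40
[Alb69] Arthur E. Albert. Conditions for positive and negative semidefiniteness in terms of pseudoinverses.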
SIAM (Society for Industrial and Applied Mathematics) Journal of Applied Mathematics, 17:434–440, 1969. 328
[Alb72] Arthur E. Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York and London, 1972. 322
[Ame85] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985. 85
[Ame94] Takeshi Amemiya. Introduction to Statistics and Econometrics. Harvard University Press, Cambridge, MA, 1994. 12, 13, 16, 28, 32, 78, 122, 123, 141
[Bar82] Vic Barnett. Comparative Statistical Inference. Wiley, New York, 1982. 293
[BCW96] Richard A. Becker, John M. Chambers, and Allan R. Wilks. The New S Language: A Programming Environment for Data Analysis and Graphics. Chapman and Hall, 1996. Reprint of the 1988 Wadsworth edition. 147, 304
[BD77] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day, San Francisco, 1977. 74, 137, 140, 152
[Ber91] Ernst R. Berndt. The Practice of Econometrics: Classic and Contemporary. Addison-Wesley, Reading, Massachusetts, 1991. 184, 190
[BKW80] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, 1980. 264, 267, 273, 274, 275
[CD28] Charles W. Cobb and Paul H. Douglas. A theory of production. American Economic Review, 18(1, Suppl.):139–165, 1928. J. 177, 178
[CD97] Wojciech W. Charemza and Derek F. Deadman. New Directions in Econometric Practice: General to Specific Modelling, Cointegration, and Vector Autoregression. Edward Elgar, Cheltenham, UK; Lynne, NH, 2nd edition, 1997. 159
[CH93] John M. Chambers and Trevor Hastie, editors. Statistical Models in S. Chapman and Hall, 1993. 190, 274
[Cho60] G. C. Chow. Tests of equality between sets of coefficients in two linear regressions. Econometrica, 28:591–605, July 1960. 310
[Chr87] Ronald Christensen. Plane Answers to Complex Questions; The Theory of Linear Models. Springer-Verlag, New York, 1987. 225, 305
[Coh50] A. C. Cohen. Estimating the mean and variance of normal populations from singly and doubly truncated samples. Annals of Mathematical Statistics, pages 557–569, 1950. 61
[Coo77] R. Dennis Cook. Detection of influential observations in linear regression. Technometrics, 19(1):15–18, February 1977. 275
[Coo98] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions through Graphics. Series in Probability and Statistics. Wiley, New York, 1998. 78, 229
[Cor69] J. Cornfield. The Bayesian outlook and its applications. Biometrics, 25:617–657, 1969. 140
[Cow77] Frank Alan Cowell. Measuring Inequality: Techniques for the Social Sciences. Wiley, New York, 1977. 63
[Cra43] A. T. Craig. Note on the independence of certain quadratic forms. Annals of Mathematical Statistics, 14:195, 1943. 98
[CT91] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Series in Telecommunications. Wiley, New York, 1991. 41, 42
[CW99] R. Dennis Cook and Sanford Weisberg. Applied Regression Including Computing and Graphics. Wiley, 1999. 229, 255
[Daw79a] A. P. Dawid. Conditional independence in statistical theory. JRSS(B), 41(1):1–31, 1979. 20, 22
[Daw79b] A. P. Dawid. Some misleading arguments involving conditional independence. JRSS(B), 41(2):249–252, 1979. 22
[Daw80] A. P. Dawid. Conditional independence for statistical operations. Annals of Statistics, 8:598–617, 1980. 22
[Dhr86] Phoebus J. Dhrymes. Limited dependent variables. In Zvi Griliches and Michael D. Intriligator, editors, Handbook of Econometrics, volume 3, chapter 27, pages 1567–1631. North-Holland, Amsterdam, 1986. 32
[DL91] Gerard Dumenil and Dominique Levy. The U.S. economy since the Civil War: Sources and construction of the series. Technical report, CEPREMAP, LAREA-CEDRA, December 1991. 189
[DM93] Russell Davidson and James G. MacKinnon. Estimation and Inference in Econometrics. Oxford University Press, New York, 1993. 163, 166, 207, 209, 242, 243, 251, 253, 256, 259, 260, 268, 269, 271, 279, 305, 317
[Dou92] Christopher Dougherty. Introduction to Econometrics. Oxford University Press, Oxford, 1992. 190
[DP20] R. E. Day and W. M. Persons. An index of the physical volume of production. Review of Economic Statistics, II:309–37, 361–67, 1920. 177
[Fis] R. A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22. 125
[Fri57] Milton Friedman. A Theory of the Consumption Function. Princeton University Press, 1957. 104
[FS91] Milton Friedman and Anna J. Schwartz. Alternative approaches to analyzing economic data. American Economic Review, 81(1):39–49, March 1991. 160
[GJM96] Amos Golan, George Judge, and Douglas Miller. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Wiley, Chichester, England, 1996. 47
[Gra76] Franklin A. Graybill. Theory and Application of the Linear Model. Duxbury Press, North Scituate, Mass., 1976. 175
[Gra83] Franklin A. Graybill. Matrices with Applications in Statistics. Wadsworth and Brooks/Cole, Pacific Grove, CA, second edition, 1983. 322
[Gre97] William H. Greene. Econometric Analysis. Prentice Hall, Upper Saddle River, NJ, third edition, 1997. 60, 63, 157, 163, 166, 169, 214, 216, 217, 232, 243, 248, 251, 257, 258, 262, 266, 271, 279, 281, 315, 317, 322, 357
[Gre03] William H. Greene. Econometric Analysis. Prentice Hall, Upper Saddle River, New Jersey 07458, fifth edition, 2003.
[Hal78] Robert E. Hall. Stochastic implications of the life cycle-permanent income hypothesis: Theory and evidence. Journal of Political Economy, pages 971–987, December 1978. 81
[Hen95] David F. Hendry. Dynamic Econometrics. Oxford University Press, Oxford, New York, 1995. 157
[HK79] James M. Henle and Eugene M. Kleinberg. Infinitesimal Calculus. MIT Press, 1979. 26
[Hou51] H. S. Houthakker. Some calculations on electricity consumption in Great Britain.
Journal of the Royal Statistical Society (A), (114 part III):351–371, 1951. J. 184, 185
[HT83] Robert V. Hogg and Elliot A. Tanis. Probability and Statistical Inference. Macmillan, second edition, 1983. 7, 65, 96
[HVdP02] Ben J. Heijdra and Frederick Van der Ploeg. Foundations of Modern Macroeconomics. Oxford University Press, 2002. 106
[JHG+88] George G. Judge, R. Carter Hill, William E. Griffiths, Helmut Lütkepohl, and Tsoung-Chao Lee. Introduction to the Theory and Practice of Econometrics. Wiley, New York, second edition, 1988. 150, 159, 166, 169, 215, 233, 245, 286, 346
[JK70] Norman Johnson and Samuel Kotz. Continuous Univariate Distributions, volume 1. Houghton Mifflin, Boston, 1970. 61, 63
[KA69] J. Koerts and A. P. J. Abramanse. On the Theory and Application of the General Linear Model. Rotterdam University Press, Rotterdam, 1969. 170
[Kap89] Jagat Narain Kapur. Maximum Entropy Models in Science and Engineering. Wiley, 1989. 46, 60
[Ken98] Peter Kennedy. A Guide to Econometrics. MIT Press, Cambridge, MA, fourth edition, 1998.
[Khi57] R. T. Khinchin. Mathematical Foundations of Information Theory. Dover Publications, New York, 1957. 40
[Kme86] Jan Kmenta. Elements of Econometrics. Macmillan, New York, second edition, 1986. 275, 290
[Knu81] Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, second edition, 1981. 4
[Krz88] W. J. Krzanowski. Principles of Multivariate Analysis: A User's Perspective. Clarendon Press, Oxford, 1988. 174
[KS79] Sir Maurice Kendall and Alan Stuart. The Advanced Theory of Statistics, volume 2. Griffin, London, fourth edition, 1979. 117, 125
[Ksh19] Anant M. Kshirsagar. Multivariate Analysis. Marcel Dekker, New York and Basel, 19?? 98, 99
[Lan69] H. O. Lancaster. The Chi-Squared Distribution. Wiley, 1969. 98
[Lar82] Harold Larson. Introduction to Probability and Statistical Inference. Wiley, 1982. 15, 32, 52, 68, 69, 132
[Lea75] Edward E. Leamer. A result on the sign of the restricted least squares estimator. Journal of Econometrics, 3:387–390, 1975. 250
[Mag88] Jan R. Magnus. Linear Structures. Oxford University Press, New York, 1988. 348
[Mal80] E. Malinvaud. Statistical Methods of Econometrics. North-Holland, Amsterdam, third edition, 1980. 290
[MN88] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester, 1988. 348
[Mor65] A. Q. Morton. The authorship of Greek prose (with discussion). Journal of the Royal Statistical Society, Series A, 128:169–233, 1965. 23, 24
[Mor73] Trenchard More, Jr. Axioms and theorems for a theory of arrays. IBM Journal of Research and Development, 17(2):135–175, March 1973. 337, 342
[Mor02] Jamie Morgan. The global power of orthodox economics. Journal of Critical Realism, 1(2):7–34, May 2002. 158
[MR91] Ieke Moerdijk and Gonzalo E. Reyes. Models for Smooth Infinitesimal Analysis. Springer-Verlag, New York, 1991. 26
[MS86] Parry Hiram Moon and Domina Eberle Spencer. Theory of Holors; A Generalization of Tensors. Cambridge University Press, 1986. 337
[Rao62] C. Radhakrishna Rao. A note on a generalized inverse of a matrix with applications to problems in mathematical statistics. Journal of the Royal Statistical Society, Series B, 24:152–158, 1962. 322
[Rao73] C. Radhakrishna Rao. Linear Statistical Inference and Its Applications. Wiley, New York, second edition, 1973. 37, 78, 122, 139, 173, 321, 323
[Rao97] C. Radhakrishna Rao. Statistics and Truth: Putting Chance to Work. World Scientific, Singapore, second edition, 1997. 4
[Rei89] Rolf-Dieter Reiss. Approximate Distributions of Order Statistics. Springer-Verlag, New York, 1989. 29, 30
[Rén70] Alfred Rényi. Foundations of Probability. Holden-Day, San Francisco, 1970. 1, 4, 13
[Rie77] E. Rietsch. The maximum entropy approach to inverse problems. Journal of Geophysics, 42, 1977. 45
[Rie85] E. Rietsch. On an alleged breakdown of the maximum-entropy principle. In C. Ray Smith and Jr W. T. Grandy, editors, Maximum-Entropy and Bayesian Methods in Inverse Problems, pages 67–82. D. Reidel, Dordrecht, Boston, Lancaster, 1985. 45
[Rob70] Herbert Robbins. Statistical methods related to the law of the iterated logarithm. Annals of Mathematical Statistics, 41:1397–1409, 1970. 18
[Rob74] Abraham Robinson. Non-Standard Analysis. North Holland, Amsterdam, 1974. 26
[Ron02] Amit Ron. Regression analysis and the philosophy of social science: A critical realist view. Journal of Critical Realism, 1(1):119–142, November 2002. 158
[Roy97] Richard M. Royall. Statistical evidence: A Likelihood Paradigm. Number 71 in Monographs on Statistics and Applied Probability. Chapman & Hall, London; New York, 1997. 18, 142
[RZ78] L. S. Robertson and P. L. Zador. Driver education and fatal crash involvement of teenage drivers. American Journal of Public Health, 68:959–65, 1978. 34
[Seb77] G. A. F. Seber. Linear Regression Analysis. Wiley, New York, 1977. 98, 221, 257, 261, 309, 310, 312, 313
[Sel58] H. C. Selvin. Durkheim's suicide and problems of empirical research. American Journal of Sociology, 63:607–619, 1958. 34
[SG85] John Skilling and S. F. Gull. Algorithms and applications. In C. Ray Smith and Jr W. T. Grandy, editors, Maximum-Entropy and Bayesian Methods in Inverse Problems, pages 83–132. D. Reidel, Dordrecht, Boston, Lancaster, 1985. 47
[SM86] Hans Schneeweiß and Hans-Joachim Mittag. Lineare Modelle mit fehlerbehafteten Daten. Physica Verlag, Heidelberg, Wien, 1986. 328
[Spr98] Peter Sprent. Data Driven Statistical Methods. Texts in statistical science. Chapman & Hall, London; New York, 1st edition, 1998. 34
[SS35] J. A. Schouten and Dirk J. Struik. Einführung in die neueren Methoden der Differentialgeometrie, volume I. 1935. 337
[SW76] Thomas J. Sargent and Neil Wallace. Rational expectations and the theory of economic policy. Journal of Monetary Economics, 2:169–183, 1976. 106
[The71] Henri Theil. Principles of Econometrics. Wiley, New York, 1971. 346
[Tin51] J. Tinbergen. Econometrics. George Allen & Unwin Ltd., London, 1951. 159
[TS61] Henri Theil and A. Schweitzer. The best quadratic estimator of the residual variance in regression analysis. Statistica Neerlandica, 15:19–23, 1961. 115, 218
[Wit85] Uli Wittman. Das Konzept rationaler Preiserwartungen, volume 241 of Lecture Notes in Economics and Mathematical Systems. Springer, 1985. 82
[Yul07] G. V. Yule. On the theory of correlation for any number of variables treated by a new system of notation. Proc. Roy. Soc. London A, 79:182, 1907. 175