Introduction

Review of syllabus:

Contact Info
Warren 254
2552086
drj3@cornell.edu
Office Hours: W 12 – 2.

Policies: Please read the syllabus.

A note on the Textbook: No readings are required. However, I expect (particularly with the more complicated topics we will cover) that simply listening to my lecture will not provide adequate understanding. My suggestion: Use my lectures and the homework assignments to point you to relevant topics in the book. I will try to give chapters or sections when I can. So long as you understand the material covered in class, you should be able to do well on the tests and problem sets. More may be required for your particular project or critique.

If you find yourself unable to understand the textbook I have chosen for the course, feel free to use another. If you talk to me, I may be able to suggest books for specific topics. The syllabus lists texts that may help. These should be available at the Mann Library. Anyone using one of these on a regular basis, please send me a note as to which you found most useful, for which topics, and why. This will help me to streamline the course and enhance the experience of future students.

The course homepage can be found on the Blackboard system. The stable url is:
http://blackboard.cornell.edu/webapps/portal/frameset.jsp?/tab=courses&url=/bin/common/course.pl?course_id=_10513_1

All lecture notes, handouts and homework assignments are available on the website. My preference is not to hand out any paper in class. Are there any for whom this will be a particular burden?

Expectations:
1. Learn to conduct and understand applied research in economics.
2. Understand basic theory of estimation.

Possibility of feedback: I need all the feedback I can get. If you have suggestions for improving the course (or wish to tell us how we are unfair, unclear, or just dense) please talk to me. I do not claim to be a good teacher, but I am open to any suggestions you may have.
I will regularly include feedback questions on problem sets and tests. These questions are required, but graded only on the basis of whether the question was attempted or not.

Grading: Homework (50 pts), Project (130 pts), Critique (20 pts) and 2 Prelims (100 pts).

Two assignments can be turned in anytime before the last day of class:

Critique: This is a relatively short assignment of no more than two typed double spaced pages. I would prefer the average length to be near ¾ of one page. This project must be completed alone. Find an article (newspaper, journal or otherwise) that makes use of econometrics, and write a critique. In your critique, you should include: a brief summary of the article focusing on the use of econometrics, why the use of econometrics is interesting, and how the application may be flawed.

Econometrics Project: I don't like the idea of econometrics projects because it is extremely difficult to find an interesting problem that can be researched within a semester. Hence, econometrics projects are notoriously boring to read and difficult for students to complete. However, there is no better way to learn applied econometrics than to conduct one's own research.

Here is the deal: You are encouraged to use your own data and problem (but I can help in a pinch). Start working now!

Due date: December 4, 2008

Requirements:
1. An interesting economic problem you believe you can address using econometrics (theory and statistics).
2. Write a one page proposal and return it to me by September 18, 2008. Included in the proposal: Problem, Why it is interesting, Possible economic theories, How you will obtain the required data (if not a classroom set). The purpose of this is to let me identify groups that may be too ambitious, and help them correct their course. The earlier you get this to me, the earlier I can identify potential problems.
3. A 12 to 20 page rough draft (including statistical analysis) will be due on October 23, 2008.
The paper should include: Introduction, motivation for the question, a review of similar studies or other relevant research, an exposition of the economic theory, a discussion of the econometric methods, a list of econometric results, a discussion of the results and tests, and conclusions drawn from the econometrics. I will provide feedback on this draft.
4. The final project will be due on the last day of class. Late projects will have 5% deducted daily from the grade.

Scientific Method vs. Econometric practice:

Definition of Econometrics: Econometrics is economic measurement.

Scientific method:
1. Observe some phenomenon
2. Create a hypothesis to explain the phenomenon
3. Make predictions employing the hypothesis
4. Test the predictions through experimentation or more observations (usually using statistical theory).
5. Modify hypothesis and repeat 3 and 4.

Econometric Practice:
1. Observe some behavior
2. Create an economic model (hypothesis) to explain the behavior
3. Alter the model to allow estimation with available data (which may be the same as that observed in 1, or may be a new data set altogether).
4. Use available data to test the modified hypothesis using econometric theory.
5. Repeat steps 2, 3 and 4 (or any combination of 2, 3, and 4).

In general, econometric analysis is a marriage of economic models and statistical methods. Many of the statistical theories are based on ideal situations that one never finds in practice. Econometric theorists work on extending our knowledge of statistical theory so that more situations and models of behavior may be analyzed. A good applied econometrician learns to do the best she can with what is available. While this may not always (or ever) lead the researcher to truth, it will lead to greater understanding of the motivations and problems examined.

This is an applied econometrics course. The purpose of this course is to teach you to take economic questions and find the best answers available.
I will try to devote as much time to hands-on problem solving as I can. However, in order to apply econometric theory to any problem, it is necessary first to understand econometric theory. This will necessitate some mathematics on problem sets and (gasp!) exams. In particular there are some classic proofs that no econometrics course could ignore. Any involved proofs required for a test will be revealed in class (i.e. I will tell you "This proof will be on the test."). I will not do this for simpler arithmetic problems (like taking an expected value). I don't want math or proofs to be a hangup for anyone in the course, but I do require understanding. Hence you may be asked to write down the steps of a proof and write a short paragraph on what it says and why it is important. Most important for the course is the ability to know when to apply which tools.

Terminology

We will restrict our attention to economic models that can be written as a mathematical expression. For example, suppose we wished to explore the demand for boneless skinless chicken breast

\[ Q_d = f(p, p_s, p_c, y, z), \tag{1} \]

where $Q_d$ is quantity demanded, $p$ is the price, $p_s$ are the prices of substitutes (like whole chickens, or chicken with skin and bones, or beef), $p_c$ are the prices of complements (BBQ sauce, rice, etc.), $y$ is income, and $z$ are other factors that may affect demand (like nutritional perceptions).

Variables on the right hand side of (1) are called explanatory variables because they explain the quantity demanded. The variable(s) on the left hand side are called dependent variables. This is because our economic theory has informed us that quantity demanded is dependent upon the various prices, income and perceptions. The text also calls these 'outcome variables.' I have not seen this terminology in practice and don't know why they used it (I'll ask George next time I see him).

There are two types of explanatory variables:
1. Independent or exogenous, and
2. Endogenous.

Independent variables are those like income and health perceptions. They cause quantity demanded to change, and not the other way around. If we suppose the price of boneless skinless chicken breast to be determined by a market (rather than a price floor etc.) then price will be endogenous. In this case prices will be determined by the intersection of supply and demand curves. Price affects quantity, but quantity affects price. It is important to recognize when a variable is endogenous. The presence of endogenous variables causes problems with standard theory. We will talk about how to resolve these problems near the end of the course.

In this course we will only deal with linear relationships. A linear relationship (at least as far as we are concerned) is one that can be represented as
\[ y = m_0 + m_1 x_1 + m_2 x_2 + \cdots \]

where $y, x_1, \ldots$ are observed data, and $m_0, m_1, \ldots$ are unknown parameters we want to learn about. Most of the techniques we will apply are generalizable to nonlinear equations. We use linear equations because the mathematics is simpler to understand and the linear form can be very general.

After having identified a problem and an economic model, we will need to find a linear mathematical relationship that represents the behavioral predictions of the model. In most cases the condition to be estimated will be a first order condition.

Let's take an example from intermediate micro. We wish to explain the consumption of good $x$. Suppose there are 100 individuals (index each individual by $i = 1, \ldots, 100$); each can consume two goods (measured by $x_i$ and $y_i$). Each individual has an endowment of wealth $w_i$. We are able to obtain data only on the first 10 individuals. For these individuals we observe quantity consumed (our dependent variables), prices and wealth (our independent variables). We suppose that all individuals have identical utility functions, and that each maximizes utility, or

\[ \max_{x_i, y_i} U(x_i, y_i) \quad \text{subject to} \quad p_x x_i + p_y y_i = w_i . \]

The solution to this problem can be represented as

\[ \frac{\partial U}{\partial x_i}\, p_y = \frac{\partial U}{\partial y_i}\, p_x , \qquad p_x x_i + p_y y_i = w_i . \]

If we thought utility could be represented in the Cobb-Douglas form ($U(x_i, y_i) = x_i^{\alpha} y_i^{\beta}$) then we could rewrite the optimization conditions as

\[ \alpha\, y_i\, p_y = \beta\, x_i\, p_x , \qquad p_x x_i + p_y y_i = w_i , \]

or,

\[ x_i = \frac{\alpha}{\alpha + \beta}\,\frac{w_i}{p_x} , \qquad y_i = \frac{\beta}{\alpha + \beta}\,\frac{w_i}{p_y} . \]

This may not look like a linear relationship, but it is linear in the observed variables. We know $w_i$, $p_x$ and $p_y$, and hence have observed the ratios $w_i / p_x = W_{xi}$ and $w_i / p_y = W_{yi}$. Further, let

\[ \gamma_1 = \frac{\alpha}{\alpha + \beta} , \qquad \gamma_2 = \frac{\beta}{\alpha + \beta} . \]

If we find these two values we can deduce the ratio $\alpha / \beta = \gamma_1 / \gamma_2$ (because $\gamma_1 + \gamma_2 = 1$, the model pins down $\alpha$ and $\beta$ only up to a common scale). So we have a linear relationship we can estimate:

\[ x_i = \gamma_1 W_{xi} , \qquad y_i = \gamma_2 W_{yi} . \]
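The derivation above can be checked numerically. The following is a minimal sketch, assuming hypothetical values for the preference parameters and prices (none of these numbers come from the notes); the helper name `cobb_douglas_demand` is also an illustrative choice. It computes the demands from the closed-form solution and confirms that they satisfy both the budget constraint and the first order condition.

```python
def cobb_douglas_demand(w, px, py, alpha, beta):
    """Utility-maximizing bundle for U(x, y) = x**alpha * y**beta
    subject to px*x + py*y = w (illustrative helper, not from the notes)."""
    gamma1 = alpha / (alpha + beta)   # share of wealth spent on x
    gamma2 = beta / (alpha + beta)    # share of wealth spent on y
    x = gamma1 * w / px               # x_i = gamma1 * W_xi
    y = gamma2 * w / py               # y_i = gamma2 * W_yi
    return x, y


if __name__ == "__main__":
    alpha, beta = 0.4, 0.6            # hypothetical preference parameters
    px, py = 2.0, 5.0                 # hypothetical prices
    for w in (2.0, 12.0, 8.0):        # hypothetical wealth levels
        x, y = cobb_douglas_demand(w, px, py, alpha, beta)
        budget_gap = px * x + py * y - w              # should be ~0
        foc_gap = alpha * y * py - beta * x * px      # should be ~0
        print(f"w={w:5.1f}  x={x:.4f}  y={y:.4f}  "
              f"budget gap={budget_gap:.1e}  FOC gap={foc_gap:.1e}")
```

Doubling both `alpha` and `beta` leaves the demands unchanged, which is one way to see that only the shares $\gamma_1$ and $\gamma_2$ can be learned from consumption data.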
Suppose that we observed $p_y = p_x = 1$ and we found the following data:

Observation (i)    x_i        y_i        w_i = W_xi = W_yi
      1            1.5282     0.4742      2
      2            4.2117     7.7883     12
      3            6.9832     5.0168     12
      4            3.0636     4.9364      8
      5            6.5139     9.4861     16
      6            6.6668     7.3332     14
      7            4.0593     5.9407     10
      8            2.3044     3.6956      6
      9           15.9677    26.0323     42
     10            8.2944    11.7056     20

If we only wished to find $\gamma_1$, we might think it reasonable to use algebra. Looking at the first line we have $1.5282 = 2\gamma_1$, which would lead us to believe that $\gamma_1 = .7641$. However, we have a problem in one unknown ($\gamma_1$) with 10 equations. Observations 2 and 3 have exactly identical wealth but very different consumption levels. Thus no solution exists!

At this point a reasonable person may conclude that individuals do not behave according to our model. Instead, econometricians assume that there must be some source of error. In our case, maybe individuals would like to buy the optimal bundle according to our model, but they make calculation errors so that we need to rewrite our equilibrium condition as

\[ x_i = \gamma_1 W_{xi} + \varepsilon_i , \]

where $\varepsilon_i$ is some random factor.

[Figure: "Plotted values". Amounts of X and Y (vertical axis) plotted against W (horizontal axis).]

Despite not fitting our model exactly, the plot reveals that the data display a nearly linear relationship. Our job is to find the line that best approximates this relationship. The best fit will depend crucially on what we believe $\varepsilon_i$ is and how it behaves. This is where statistics comes into the picture.

After we have estimated these values, we can test some pieces of our model assuming that other pieces hold. For example, our model implies that $\gamma_1 + \gamma_2 = 1$. This could be tested assuming that all other hypotheses are correct. Of course,
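if we reject that this sum is 1, many of our other hypotheses are in doubt.

A word about why linear functions are useful. Some of you may be familiar with Taylor's theorem, Taylor approximations, etc. Taylor's theorem tells us that if we wish to approximate a continuous (and sufficiently many times continuously differentiable) function

\[ y = f(x_1, x_2, \ldots, x_k) \]

we could use a form that is linear in powers of the $x_i$. In particular, we could use

\[
\begin{aligned}
\hat{y} = f(0, 0, \ldots, 0)
 &+ \frac{\partial f}{\partial x_1} x_1 + \cdots + \frac{\partial f}{\partial x_k} x_k \\
 &+ \frac{1}{2}\frac{\partial^2 f}{\partial x_1^2} x_1^2
  + \frac{1}{2}\frac{\partial^2 f}{\partial x_2 \partial x_1} x_2 x_1 + \cdots
  + \frac{1}{2}\frac{\partial^2 f}{\partial x_k \partial x_1} x_k x_1 \\
 &+ \frac{1}{2}\frac{\partial^2 f}{\partial x_1 \partial x_2} x_1 x_2
  + \frac{1}{2}\frac{\partial^2 f}{\partial x_2^2} x_2^2
  + \frac{1}{2}\frac{\partial^2 f}{\partial x_3 \partial x_2} x_2 x_3 + \cdots
  + \frac{1}{2}\frac{\partial^2 f}{\partial x_k \partial x_2} x_k x_2 + \cdots
\end{aligned}
\]

where every derivative is evaluated at $(0, 0, \ldots, 0)$. If our function is well behaved, and we use a long enough Taylor expansion, the approximation of $f$ can be made as close as we like. Notice that if we knew the values of $x_1, \ldots, x_k$, the only unknowns above are the derivatives. However, these are all constant parameters. Hence, in many cases a relationship that is linear in its unknown parameters can represent a nonlinear function to any accuracy we need.

Statistics Review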
We need to have some language to talk about the properties of $\varepsilon_i$. Statistics gives us this language. I expect you to understand the following definitions:

Sample space: the set of all possible outcomes (in the case of $\varepsilon_i$, it is all values we believe $\varepsilon_i$ can take on).

Element: one possible outcome (like $\varepsilon_i = 0$).

Event: A subset of the sample space (like $\varepsilon_i$ falls in the interval $[-1, 1]$).

Mutually exclusive: Two events are mutually exclusive if no element can be included in both events. In other words both can't happen at once. If $\varepsilon_i$ falls in the interval $[-1, 1]$ then $\varepsilon_i$ cannot also be in the interval $[2, 5]$. These events are mutually exclusive.

Exhaustive: A set of events is exhaustive if their union is identical to the sample space. The events $(-\infty, 0)$ and $[0, \infty)$ are exhaustive if $\varepsilon_i$ is a real number.

Probability: Intuitively, the probability of event $X$, written $P(X)$, is a measure of the proportion of times event $X$ would occur if an experiment were repeated an infinite number of times. Any probability measure $P(X)$ has the following properties:
1. $1 \ge P(X) \ge 0$.
2. Let $A, B, C, \ldots$ be a set of exhaustive events. Then $P(A \cup B \cup C \cup \ldots) = 1$.
3. If $X$ and $Y$ are mutually exclusive events, then $P(X \cup Y) = P(X) + P(Y)$.

Random variable: A random variable is a function that assigns each element of the sample space to one and only one real number. If we were flipping coins, we may assign a variable $x$ based on the outcome, e.g. every head $x = 1$, and every tail $x = 0$. Thus $x$ is a random variable. Any function of a random variable is a random variable. There are commonly two types of random variables:

Discrete random variable: A random variable is discrete if it can take on a countable number of outcomes (like our coin flipping example).

Continuous random variable: A random variable is continuous if it can assume a continuum of values (like a person's height).

Probability density function (PDF): $f(x)$ is a probability density function if
1. $\int_{-\infty}^{\infty} f(x)\,dx = 1$ if continuous, or $\sum_{j=1}^{J} f(x_j) = 1$ if discrete, where $\{x_j\}_{j=1}^{J}$ are the possible values of $x$.
2. $f(x) \ge 0$.
3. If $x$ is a discrete random variable, $f(x)$ is a probability measure. If $x$ is a continuous random variable, then $\int_a^b f(x)\,dx = P(a < x < b)$ (thus the probability of taking on any specific value is zero).

Examples:

Discrete PDF:

\[
f(x) =
\begin{cases}
.25 & \text{if } x = 1 \\
.5 & \text{if } x = 2 \\
.25 & \text{if } x = 3 \\
0 & \text{otherwise}
\end{cases}
\]

[Figure: "Discrete PDF".]

Continuous PDF: The Normal (or Gaussian) distribution can be written

\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \]

[Figure: "Normal Distribution".]

This is a distribution you should become very familiar with.

Joint probability density function: A joint probability density function relates the probability distribution of two or more possibly related random variables, $f(x_1, x_2)$. It satisfies

\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1 , \]

\[ f(x_1, x_2) \ge 0 , \]

\[ \Pr(a_1 < x_1 < b_1,\; a_2 < x_2 < b_2) = \int_{a_2}^{b_2}\int_{a_1}^{b_1} f(x_1, x_2)\,dx_1\,dx_2 . \]

[Figure: "Bivariate Normal PDF".]

The last property leads us to the notion of a marginal PDF.

Marginal PDF: This is the PDF of one of a set of related random variables,

\[ f(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2 . \]

The properties of joint PDFs imply that the above must satisfy the properties of a PDF for the random variable $x_1$. This will allow us to talk about the properties of $x_1$ no matter what the outcome $x_2$.

Conditional density function: We may wish to know the probability of $x_1$ taking on certain values given a certain value of $x_2$ (or conditional on $x_2$). Define a conditional density function as

\[ f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} . \]

Independence: The random variables $x_1, x_2$ are independent if $f(x_1, x_2) = f(x_1) f(x_2)$.
In this case the two random variables are not related in a statistical sense.

Cumulative Distribution Function (CDF): This function yields the probability of a random variable falling at or below some threshold. We usually denote a PDF with a lower case letter (like $f(x)$) and the CDF with an upper case letter (like $F(x)$).

\[ F(x_0) \equiv P(x \le x_0) = \int_{-\infty}^{x_0} f(x)\,dx . \]

Moments

Let $g(x)$ be any function of the random variable $x$; then the expectation of $g(x)$ is given by

\[ E(g(x)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx \quad \text{(continuous)}, \qquad E(g(x)) = \sum_{j=1}^{J} g(x_j) f(x_j) \quad \text{(discrete)}. \]

The expectation of $x$ itself is commonly called the population mean or average. Properties of $E(\cdot)$:
1. If $a$ is a constant, $E(a) = a$.
2. $E(a + bx) = a + bE(x)$.
3. $E(ax_1 + bx_2) = aE(x_1) + bE(x_2)$.
4. If $x_1, x_2$ are independent then $E(x_1 x_2) = E(x_1) E(x_2)$.

The nth moment of a random variable is defined as

\[ \mu_n = E(x^n) = \int_{-\infty}^{\infty} x^n f(x)\,dx \quad \text{(continuous)}, \qquad \mu_n = E(x^n) = \sum_{j=1}^{J} (x_j)^n f(x_j) \quad \text{(discrete)}. \]

The 1st moment (or $n = 1$) is just the mean or expected value (denoted $\mu$) of the distribution. The nth central moment of a distribution is defined as

\[ E\big((x - E(x))^n\big) . \]

The second central moment is commonly known as the population variance

\[ VAR(x) = \sigma^2 = E\big((x - E(x))^2\big) = E(x^2) - E(x)^2 , \]

which provides a measure of dispersion around the mean. Properties of $VAR(x)$:
1. If $a$ is a constant, $VAR(a) = 0$.
2. $VAR(a + bx) = b^2 VAR(x)$.
3. If $x_1, x_2$ are independent then $VAR(ax_1 + bx_2) = a^2 VAR(x_1) + b^2 VAR(x_2)$.

Other central moments are often used to measure asymmetry of the distribution (skewness). The standard deviation (a measure of the typical deviation from the mean) is given by

\[ \sigma = \sqrt{\sigma^2} = \sqrt{E(x^2) - E(x)^2} . \]

Covariance and Correlation
Covariance measures the relationship between two random variables. If the covariance is positive, then the variables tend to move together (a higher $x_1$ means a higher $x_2$ is likely). If the covariance is negative, then the variables tend to move in opposite directions (a higher $x_1$ means a lower $x_2$ is likely). The covariance is defined as

\[ COV(x_1, x_2) = E\big[(x_1 - \mu_1)(x_2 - \mu_2)\big] = E(x_1 x_2) - E(x_1) E(x_2) . \]

Properties of covariance:
1. If $x_1$ and $x_2$ are independent, then $COV(x_1, x_2) = 0$, although a zero covariance does not imply independence.
2. Changing the units of measurement changes the covariance. For example, if we were measuring the height of individuals and the length of their foot in meters, the covariance might appear very small compared to a measurement in millimeters.

To see this last point, let $x_1, x_2$ be measurements in millimeters. Then measurements in meters are just $x_1 / 1000$ and $x_2 / 1000$.

\[
\begin{aligned}
COV\!\left(\frac{x_1}{1{,}000}, \frac{x_2}{1{,}000}\right)
 &= E\!\left(\frac{x_1}{1{,}000}\,\frac{x_2}{1{,}000}\right) - E\!\left(\frac{x_1}{1{,}000}\right) E\!\left(\frac{x_2}{1{,}000}\right) \\
 &= \frac{1}{1{,}000{,}000}\big[E(x_1 x_2) - E(x_1) E(x_2)\big]
  = \frac{1}{1{,}000{,}000}\, COV(x_1, x_2)
\end{aligned}
\]

Correlation is a normalized measure of covariance. 'Normalized' means that correlation is independent of measurement unit. Define correlation as

\[ \rho(x_1, x_2) = \frac{COV(x_1, x_2)}{\sigma_1 \sigma_2} . \]

To see that this is now independent of measurement units, let's take our example from before:

\[
\begin{aligned}
\rho\!\left(\frac{x_1}{1{,}000}, \frac{x_2}{1{,}000}\right)
 &= \frac{COV\!\left(\frac{x_1}{1{,}000}, \frac{x_2}{1{,}000}\right)}{\sqrt{VAR\!\left(\frac{x_1}{1{,}000}\right) VAR\!\left(\frac{x_2}{1{,}000}\right)}}
  = \frac{\frac{1}{1{,}000{,}000}\, COV(x_1, x_2)}{\sqrt{\left(\frac{1}{1{,}000}\right)^2 VAR(x_1) \left(\frac{1}{1{,}000}\right)^2 VAR(x_2)}} \\
 &= \frac{COV(x_1, x_2)}{\sqrt{VAR(x_1)\, VAR(x_2)}} = \rho(x_1, x_2)
\end{aligned}
\]
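The unit argument above can also be seen numerically. This is a small sketch using made-up height and foot-length measurements (the data and the helper names `covariance` and `correlation` are illustrative assumptions, not part of the notes). It treats the five observations as an equally weighted population, converts millimeters to meters, and compares the two statistics.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance E[(a - E a)(b - E b)] with equal weights."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def correlation(a, b):
    """Covariance divided by the product of the standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical measurements in millimeters (not data from the notes).
height_mm = [1600.0, 1700.0, 1750.0, 1800.0, 1900.0]
foot_mm = [230.0, 250.0, 255.0, 270.0, 285.0]

# The same measurements expressed in meters.
height_m = [h / 1000 for h in height_mm]
foot_m = [f / 1000 for f in foot_mm]

if __name__ == "__main__":
    print("covariance  (mm):", covariance(height_mm, foot_mm))
    print("covariance  (m): ", covariance(height_m, foot_m))
    print("correlation (mm):", correlation(height_mm, foot_mm))
    print("correlation (m): ", correlation(height_m, foot_m))
```

The covariance shrinks by a factor of one million when the units change, while the two correlations agree up to floating-point rounding.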
Properties of correlation:
1. $1 \ge \rho \ge -1$.
2. If the two variables are independent, $\rho = 0$.
3. If $\rho = 1$ the variables are perfectly positively correlated.
4. If $\rho = -1$ the variables are perfectly negatively correlated.

A Discrete Example: Let $x_1, x_2$ have the following joint pdf

\[
f(x_1, x_2) =
\begin{cases}
.35 & \text{if } x_1 = 1,\; x_2 = 1 \\
.04 & \text{if } x_1 = 1,\; x_2 = 0 \\
.03 & \text{if } x_1 = 0,\; x_2 = 1 \\
.35 & \text{if } x_1 = 0,\; x_2 = 0 \\
.03 & \text{if } x_1 = -1,\; x_2 = 1 \\
.20 & \text{if } x_1 = -1,\; x_2 = 0
\end{cases}
\]

We can find the joint CDF

\[
F(x_1, x_2) =
\begin{cases}
0 & \text{if } x_1 < -1 \text{ or } x_2 < 0 \\
.20 & \text{if } -1 \le x_1 < 0,\; 0 \le x_2 < 1 \\
.20 + .35 = .55 & \text{if } 0 \le x_1 < 1,\; 0 \le x_2 < 1 \\
.20 + .35 + .04 = .59 & \text{if } 1 \le x_1,\; 0 \le x_2 < 1 \\
.20 + .03 = .23 & \text{if } -1 \le x_1 < 0,\; 1 \le x_2 \\
.20 + .03 + .03 + .35 = .61 & \text{if } 0 \le x_1 < 1,\; 1 \le x_2 \\
.20 + .35 + .35 + .04 + .03 + .03 = 1 & \text{if } 1 \le x_1,\; 1 \le x_2
\end{cases}
\]

We can find the marginal densities

\[
f(x_1) =
\begin{cases}
.35 + .04 = .39 & \text{if } x_1 = 1 \\
.03 + .35 = .38 & \text{if } x_1 = 0 \\
.03 + .20 = .23 & \text{if } x_1 = -1
\end{cases}
\qquad
f(x_2) =
\begin{cases}
.35 + .03 + .03 = .41 & \text{if } x_2 = 1 \\
.04 + .35 + .20 = .59 & \text{if } x_2 = 0
\end{cases}
\]

The marginal CDFs

\[
F(x_1) =
\begin{cases}
.39 + .38 + .23 = 1 & \text{if } 1 \le x_1 \\
.38 + .23 = .61 & \text{if } 0 \le x_1 < 1 \\
.23 & \text{if } -1 \le x_1 < 0 \\
0 & \text{if } x_1 < -1
\end{cases}
\qquad
F(x_2) =
\begin{cases}
.41 + .59 = 1 & \text{if } 1 \le x_2 \\
.59 & \text{if } 0 \le x_2 < 1 \\
0 & \text{if } x_2 < 0
\end{cases}
\]

The conditional densities

\[
f(x_1 \mid x_2) =
\begin{cases}
.35 / .41 & \text{if } x_1 = 1,\; x_2 = 1 \\
.04 / .59 & \text{if } x_1 = 1,\; x_2 = 0 \\
.03 / .41 & \text{if } x_1 = 0,\; x_2 = 1 \\
.35 / .59 & \text{if } x_1 = 0,\; x_2 = 0 \\
.03 / .41 & \text{if } x_1 = -1,\; x_2 = 1 \\
.20 / .59 & \text{if } x_1 = -1,\; x_2 = 0
\end{cases}
\qquad
f(x_2 \mid x_1) =
\begin{cases}
.35 / .39 & \text{if } x_1 = 1,\; x_2 = 1 \\
.04 / .39 & \text{if } x_1 = 1,\; x_2 = 0 \\
.03 / .38 & \text{if } x_1 = 0,\; x_2 = 1 \\
.35 / .38 & \text{if } x_1 = 0,\; x_2 = 0 \\
.03 / .23 & \text{if } x_1 = -1,\; x_2 = 1 \\
.20 / .23 & \text{if } x_1 = -1,\; x_2 = 0
\end{cases}
\]

We can also find the moments of our distributions:
\[
\begin{aligned}
E(x_1) &= .39 \times 1 + .38 \times 0 + .23 \times (-1) = .16 \\
E(x_2) &= .41 \times 1 + .59 \times 0 = .41 \\
E(x_1^2) &= .39 \times 1^2 + .38 \times 0^2 + .23 \times (-1)^2 = .62 \\
E(x_2^2) &= .41 \times 1^2 + .59 \times 0^2 = .41 \\
VAR(x_1) &= E(x_1^2) - E(x_1)^2 = .62 - (.16)^2 = .5944 \\
VAR(x_2) &= E(x_2^2) - E(x_2)^2 = .41 - (.41)^2 = .2419
\end{aligned}
\]
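These moments can be recomputed directly from the joint PDF of the discrete example. The sketch below stores the six probabilities in a dictionary and applies the discrete expectation formula; the names `joint` and `expect` are illustrative choices, not notation from the notes. The same helper also gives $E(x_1 x_2)$, so the covariance and correlation of the example fall out with no extra work.

```python
from math import sqrt

# Joint PDF of the discrete example: (x1, x2) -> probability.
joint = {
    (1, 1): 0.35, (1, 0): 0.04,
    (0, 1): 0.03, (0, 0): 0.35,
    (-1, 1): 0.03, (-1, 0): 0.20,
}


def expect(g):
    """E[g(x1, x2)] as the probability-weighted sum over all outcomes."""
    return sum(p * g(x1, x2) for (x1, x2), p in joint.items())


mean1 = expect(lambda x1, x2: x1)
mean2 = expect(lambda x1, x2: x2)
var1 = expect(lambda x1, x2: x1 ** 2) - mean1 ** 2    # E(x1^2) - E(x1)^2
var2 = expect(lambda x1, x2: x2 ** 2) - mean2 ** 2    # E(x2^2) - E(x2)^2
cov12 = expect(lambda x1, x2: x1 * x2) - mean1 * mean2
corr12 = cov12 / sqrt(var1 * var2)

if __name__ == "__main__":
    print(f"E(x1)={mean1:.4f}  E(x2)={mean2:.4f}")
    print(f"VAR(x1)={var1:.4f}  VAR(x2)={var2:.4f}")
    print(f"COV={cov12:.4f}  correlation={corr12:.4f}")
```

Passing a lambda to `expect` keeps every moment a one-line application of the same formula, which mirrors how the hand calculations are organized.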
Covariance

\[
\begin{aligned}
COV(x_1, x_2) &= E(x_1 x_2) - E(x_1) E(x_2) \\
 &= \big(.35 \times 1 \times 1 + .04 \times 1 \times 0 + .03 \times 0 \times 1 + .35 \times 0 \times 0 + .03 \times (-1) \times 1 + .20 \times (-1) \times 0\big) - .16 \times .41 \\
 &= .35 - .03 - .0656 = .2544
\end{aligned}
\]

And correlation

\[ \rho(x_1, x_2) = \frac{COV(x_1, x_2)}{\sigma_1 \sigma_2} = \frac{.2544}{\sqrt{.5944 \times .2419}} = .6709 . \]

Some Useful Distributions and Their Properties

Normal Distribution: The normal PDF is given by (I'd memorize this)

\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \]

with mean $\mu$ and variance $\sigma^2$. We will often write $x \sim N(\mu, \sigma^2)$ (read $x$ is distributed normally with mean $\mu$ and variance $\sigma^2$). Important properties:
1. The distribution is symmetric around $\mu$.
2. $P(\mu - 2\sigma < x < \mu + 2\sigma) \approx .95$.
3. If $x_1$ and $x_2$ are jointly normal, and $COV(x_1, x_2) = 0$, then the variables are independent.
4. If $x_1 \sim N(\mu_1, \sigma_1^2)$ and $x_2 \sim N(\mu_2, \sigma_2^2)$ are jointly normal, then the random variable $z = ax_1 + bx_2 \sim N\big(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + 2ab\,COV(x_1, x_2) + b^2\sigma_2^2\big)$.
5. This last property allows us to create a standard normal ($N(0, 1)$) variable from any normally distributed random variable. If $x \sim N(\mu, \sigma^2)$, then $z = \frac{x - \mu}{\sigma} \sim N(0, 1)$. This transformation will allow us to deal easily with any normally distributed random variable. It is particularly useful in hypothesis testing. Unfortunately, we need to know $\mu$ and $\sigma$ to make this transformation.
6. We have a lot of good reasons to believe that many processes begin to look like normal distributions for a large enough sample. For example:

THE CENTRAL LIMIT THEOREM: If $x_1, \ldots, x_N$ are independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$, let

\[ \hat{\mu} = \frac{\sum_{n=1}^{N} x_n}{N} , \]

and let

\[ \frac{\hat{\mu} - \mu}{\sigma / \sqrt{N}} \]

have some probability distribution $G$; then $G$ approaches the standard normal probability distribution as $N \to \infty$.

Chi-Squared: The Chi-Squared PDF is given by (don't bother memorizing this)
\[
f(x) =
\begin{cases}
\dfrac{x^{\frac{r-2}{2}}\, e^{-\frac{x}{2}}}{2^{\frac{r}{2}}\, \Gamma\!\left(\frac{r}{2}\right)} & \text{if } x > 0 \\[2ex]
0 & \text{otherwise}
\end{cases}
\]

where,

\[ \Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy \tag{2} \]

This function has the remarkable property that $\Gamma(i) = (i - 1)!$ if $i$ is a positive integer greater than 1. The parameter $r$ is called the degrees of freedom. We will often write $x \sim \chi^2(r)$ (read $x$ is distributed Chi-square with $r$ degrees of freedom). The mean and variance are $r$ and $2r$ respectively. Properties:
1. The Chi-square distribution is skewed (meaning not symmetric) and changes shape dramatically for different degrees of freedom.

[Figure: Chi-square PDFs in three panels: "2 Degrees of Freedom", "8 Degrees of Freedom", "10 Degrees of Freedom".]

2. If $x \sim N(0, 1)$, then $z = x^2 \sim \chi^2(1)$.
3. If $\{x_n\}_{n=1}^{N}$ are independent and standard normal, then $z = \sum_{n=1}^{N} x_n^2 \sim \chi^2(N)$.
4. This last property means that if $\{x_n\}_{n=1}^{N}$ are independent and normally distributed (not necessarily standard normal), $z = \sum_{n=1}^{N} \left(\frac{x_n - \mu_n}{\sigma_n}\right)^2 \sim \chi^2(N)$. This should remind you somewhat of the formula for variance.
5. If $\{x_n\}_{n=1}^{N}$ are independent and distributed $\chi^2(r_n)$, then $z = \sum_{n=1}^{N} x_n \sim \chi^2\!\left(\sum_{n=1}^{N} r_n\right)$.

Student-t distribution: A Student-t distributed variable has the following PDF (you will probably never see this again)

\[ f(x) = \frac{\Gamma\!\left(\frac{r+1}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right) \Gamma\!\left(\frac{r}{2}\right)} \left(\frac{h}{r}\right)^{\frac{1}{2}} \left[1 + \frac{h}{r}(x - \mu)^2\right]^{-\frac{r+1}{2}} \]
where $r$ is the number of degrees of freedom, the mean is $\mu$, and $\sigma^2 = \frac{1}{h}\,\frac{r}{r - 2}$, if $r > 2$. It is interesting to see the similarities between the Chi-square and normal pdfs. Properties of the t-distribution:
1. The distribution is symmetric and has the look and feel of a normal only with thicker tails. The higher the degrees of freedom, the closer the distribution to normal. For our purposes we will treat any t with higher than 20 degrees of freedom as a normal distribution.

[Figure: t PDFs in three panels: "2 Degrees of Freedom", "8 Degrees of Freedom", "20 Degrees of Freedom".]

2. If $x_1 \sim N(0, 1)$ and $x_2 \sim \chi^2(r)$ are independent, then $z = \frac{x_1}{\sqrt{x_2 / r}} \sim t(r)$ (this notation refers to a standard t, or t with $\mu = 0$, $h = 1$).

Fisher's F: The PDF of an F distributed random variable is about as complicated as the t, involves a named function that you are unfamiliar with, and is not terribly enlightening. Properties:
1. If $x_1 \sim \chi^2(r_1)$, and $x_2 \sim \chi^2(r_2)$, and the random variables are independent, then $z = \frac{x_1 / r_1}{x_2 / r_2} \sim F(r_1, r_2)$.
2. Again we call $r_1, r_2$ degrees of freedom.
3. If $x \sim t(r)$, then $z = x^2 \sim F(1, r)$ (see point 2 in the discussion of the t-distribution).

Estimators and Hypothesis Testing

In (almost) all the situations we will deal with, it is useful to assume that we know the functional form of the PDF of the random variables we are observing. In many cases we will assume that a random variable is distributed normally. We will then attempt to learn about the parameters of the distribution through our observations.

Suppose that we have a sample of $N$ observations of the random variable $x$. We will generally assume that
1. $\{x_n\}_{n=1}^{N}$ are independently distributed.
2. Each $x_n$ has a PDF with unknown (but identical and stable) parameters. In the case of normality, $x_n \sim N(\mu, \sigma^2)$.

We will then want to find estimates of the parameters ($\mu$ and $\sigma^2$). Two common estimators are

\[ \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \quad \text{(called the sample mean)} \]

\[ \hat{\sigma}^2 = \frac{1}{N - 1}\sum_{n=1}^{N} (x_n - \hat{\mu})^2 \quad \text{or} \quad \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2 \quad \text{(called the sample variance).} \]

These estimators yield point estimates, or a single value estimate of a parameter. The probability of these estimates being correct is 0. We will always use a hat (^) to denote an estimated value. An estimator is a function of the sample data. Because it is a function of random variables, it is itself a random variable with a distribution of its own. Our
assumptions about the distribution of $\{x_n\}_{n=1}^{N}$ allow us to derive the distribution for our estimators.

Important properties of point estimators:
1. Unbiasedness: $\hat{\theta}$ is unbiased if $E(\hat{\theta}) = \theta$.
2. Minimum variance: $\hat{\theta}_1$ is a minimum variance estimator among some class of estimators, if $VAR(\hat{\theta}_1) \le VAR(\hat{\theta})$ for every other estimator $\hat{\theta}$ in the class.
3. Efficiency: $\hat{\theta}$ is efficient if it is a minimum variance estimator among the class of unbiased estimators.
4. Best Linear Unbiased Estimator (BLUE): If $\hat{\theta}$ is a linear function of the data, and is efficient among the class of linear unbiased estimators, then it is BLUE.

For example, $\hat{\mu}$ is a sum of normal random variables. We know from the previous section that

\[ \hat{\mu} \sim N\!\left(\frac{1}{N}\sum_{n=1}^{N} \mu = \mu,\;\; \frac{1}{N^2}\sum_{n=1}^{N} \sigma^2 = \frac{\sigma^2}{N}\right) . \]

Thus, $\hat{\mu}$ is unbiased, and linear. We will also be able to show that it is BLUE.

If we knew $\sigma^2$ then we could make exact probability statements regarding the value of the mean. The most important of these types of statements are interval estimates. For example, because the standard deviation of $\hat{\mu}$ is $\sigma / \sqrt{N}$, we know that $P\!\left(\hat{\mu} - 2\frac{\sigma}{\sqrt{N}} < \mu < \hat{\mu} + 2\frac{\sigma}{\sqrt{N}}\right) \approx .95$. This is called an interval estimate because we are estimating the limits of an interval (rather than a single value) for the mean. Our assessment of probability is heavily dependent upon our assumptions of distribution and our knowledge of the variance.

Confidence interval: Anytime we construct a statement $P(\hat{\theta}_1 < \theta < \hat{\theta}_2) = 1 - \alpha$, we will refer to $(\hat{\theta}_1, \hat{\theta}_2)$ as a confidence interval, and $1 - \alpha$ as the level of confidence. We interpret this as: There is a $1 - \alpha$ probability of $(\hat{\theta}_1, \hat{\theta}_2)$ containing the true value.

Hypothesis Testing: We will often be interested in testing a hypothesis against the available data. In order to do this we will need to:
1. State a null (or currently accepted) hypothesis. Denote this $H_0$.
2. State an alternative hypothesis. Denote this $H_1$.
3. Construct a test given a specific level of confidence desired.
4. Either reject the null in favor of the alternative hypothesis, or fail to reject the null hypothesis (we never accept any hypothesis as true).

Most commonly we will construct tests based on confidence intervals. For example we may wish to test whether the mean height of the students in our class is different from 6 feet. So

\[ H_0: \mu = 6 \qquad H_1: \mu \ne 6 \]

Suppose we found our sample mean to be $\hat{\mu} = 5.5$, and we knew the variance to be $\sigma^2 = 2.4$, with a sample size of 4. We desire $\alpha = .05$. We know that

\[ \frac{\hat{\mu} - \mu}{\sigma / \sqrt{N}} = \frac{5.5 - \mu}{0.775} \sim N(0, 1) . \]

Our test should reject if $\frac{5.5 - 6}{0.775} = -0.645$ is very different from zero (either positive or negative), because zero is the mean value of our test statistic. Because we are looking for any deviation (positive or negative) we will construct a confidence interval that is symmetric around zero. We know that

\[ P\!\left(-2 < \frac{5.5 - \mu}{0.775} < 2\right) \approx .95 \]

because the standard deviation of a standard normal is 1 (for other levels of confidence you can look up the needed values in the back of the book). Our test statistic, $-0.645$, falls into the interval $[-2, 2]$, so we fail to reject the null hypothesis. By manipulating the inequalities in the probability functions, we find

\[ P(3.95 < \mu < 7.05) \approx .95 . \]
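The symmetric test above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `two_sided_z_test` and its return layout are illustrative assumptions. Note one deliberate difference from the hand calculation: the code asks `statistics.NormalDist` for the exact critical value (about 1.96) instead of rounding it to 2, so the interval it reports is slightly narrower than the one in the text.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu0, variance, n, alpha=0.05):
    """Symmetric z test of H0: mu = mu0 when the variance is known.

    Returns the test statistic, the critical value, the reject decision,
    and the (1 - alpha) confidence interval for mu.
    """
    se = sqrt(variance / n)                        # sigma / sqrt(N)
    z_stat = (sample_mean - mu0) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    reject = abs(z_stat) > z_crit
    interval = (sample_mean - z_crit * se, sample_mean + z_crit * se)
    return z_stat, z_crit, reject, interval


if __name__ == "__main__":
    # Numbers from the class-height example: mean 5.5, variance 2.4, N = 4.
    z_stat, z_crit, reject, interval = two_sided_z_test(5.5, 6.0, 2.4, 4)
    print(f"test statistic = {z_stat:.3f}")
    print(f"critical value = +/-{z_crit:.3f}")
    print(f"reject H0?       {reject}")
    print(f"95% interval   = ({interval[0]:.2f}, {interval[1]:.2f})")
```

Checking whether the hypothesized value 6 lies inside the returned interval gives the same decision as comparing the test statistic with the critical value.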
In this case we wished to find any deviation from $H_0: \mu = 6$, so we constructed a test that is symmetric around our estimate. To find the critical values (in this case $-2$ and $2$) we look on page 844 in the book. The book gives us the probability that $0 < \frac{5.5 - \mu}{0.775} < z$. At $z = 2.00$, this probability is 0.4772. Because we are using a test that is symmetric, we must multiply by two to find
$$P\left(-z < \frac{5.5 - \mu}{0.775} < z\right) = .9544,$$
which is approximately what we want.

If instead we wanted to test
$$H_0: \mu \geq 6, \qquad H_1: \mu < 6$$
we would want to construct a one tailed test. We should reject if $\frac{5.5 - 6}{0.775} = -0.645$ is too negative. We would want to find $z$ such that $P\left(\frac{5.5 - \mu}{0.775} > z\right) = .95$. Because the normal is symmetric, we know that
$$P\left(0 > \frac{5.5 - \mu}{0.775} > z\right) = P\left(\frac{5.5 - \mu}{0.775} < 0\right) - P\left(\frac{5.5 - \mu}{0.775} < z\right) = .5 - .05 = .45.$$
So, we look around page 844 for .45. This occurs at $-z = 1.65$. Solving the inequality we find $\mu < 5.5 + 1.65 \times 0.775 = 6.779$. The hypothesized value of 6 satisfies this bound (equivalently, $-0.645 > -1.65$). Hence we fail to reject our null.
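The one tailed calculation can be sketched the same way. The helper name `lower_tail_z_test` is an assumption made for illustration, and the exact critical value (about $-1.645$) is used instead of the table value 1.65, so the bound differs from the hand calculation in the third decimal place.

```python
import math
from statistics import NormalDist


def lower_tail_z_test(sample_mean, mu0, variance, n, alpha=0.05):
    """One tailed z test of H0: mu >= mu0 against H1: mu < mu0."""
    se = math.sqrt(variance / n)
    z_stat = (sample_mean - mu0) / se
    z_crit = NormalDist().inv_cdf(alpha)      # about -1.645 for alpha = .05
    upper_bound = sample_mean - z_crit * se   # largest mu0 that is not rejected
    return z_stat, z_crit, upper_bound, z_stat < z_crit


z_stat, z_crit, upper_bound, reject = lower_tail_z_test(5.5, 6.0, 2.4, 4)
print(round(z_stat, 3), round(z_crit, 3), round(upper_bound, 3), reject)
```

Only the left tail is used here, so all of $\alpha$ sits in one critical region instead of being split in half.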
The figure below displays the critical regions for symmetric and one tailed tests. If a test statistic falls into the critical region, the hypothesis is rejected. Hence, the critical region must have probability $\alpha$. If the test is symmetric, each critical region has probability $\alpha / 2$. The symmetric test rejects if the hypothesis is too far from the estimate, no matter which side. Hence this is appropriate if we want to test for equality. The one tailed test rejects if the hypothesis is too far to one side of the estimate. This is appropriate for inequality hypotheses. The figure suggests we are testing if the true value is greater than some number (because we reject if the hypothesis is too high).

[Figure: two panels, "Symmetric Test" and "One Tailed Test". Each plots probability density (vertical axis, 0 to 0.4) against X (horizontal axis, -4 to 4) and marks the PDF, the estimate, and the critical region(s).]

Errors
There are two types of errors:
1. Type I error: Rejecting $H_0$ when it is in fact true. The probability of making a Type I error is just $\alpha$ from above. Hence we can control this directly by controlling $\alpha$.
2. Type II error: Failing to reject $H_0$ when it is false. We call one minus the probability of a Type II error the power of a test. If there is a high probability of failing to reject when false, then the test is not very powerful.

Generally, we cannot control the power of a test directly. Having a larger sample size will almost always increase the power of a test. Also, having a larger $\alpha$ will increase the power of a test, but at the cost of more Type I errors. In theory, we should always balance $\alpha$ against the power of the test given our sample size and what we wish to test. In practice $\alpha$ is almost always set to 0.01, 0.05, or 0.10, and almost always 0.05. This is more out of lazy habit and ease of communication than for any methodological reason. We almost always run tests to reject a hypothesis. Failing to reject a hypothesis almost never says anything interesting to the reader (with some few exceptions).

What about T
In practice we will never know $\sigma^2$. This is why even those of us who don't drink may be thankful for Guinness Beer. One of their employees developed the T distribution to resolve this issue.
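We know that
$$\frac{\hat{\mu} - \mu}{\sqrt{\sigma^2 / N}} \sim N(0,1),$$
but we don't know what $\sigma^2$ is. Our natural instinct might be to use the sample variance $\hat{\sigma}^2$ in place of $\sigma^2$. The problem is that
$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2 / N}} \sim \; ?$$
Let's see what happens if we make some meaningless transformations:
$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2 / N}} = \frac{(\hat{\mu} - \mu) / \sqrt{\sigma^2 / N}}{\sqrt{(\hat{\sigma}^2 / N) / (\sigma^2 / N)}}.$$
Now, we know the numerator is distributed standard normal. What about the denominator? With $\hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2$,
$$\frac{\hat{\sigma}^2 / N}{\sigma^2 / N} = \frac{\frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2}{\sigma^2} = \frac{1}{N-1} \sum_{n=1}^{N} \left( \frac{x_n - \hat{\mu}}{\sigma} \right)^2.$$
This last expression should start to look familiar. We know that $\frac{x_n - \mu}{\sigma} \sim N(0,1)$. Replacing $\mu$ with the estimate $\hat{\mu}$ uses up one degree of freedom, so
$$\sum_{n=1}^{N} \left( \frac{x_n - \hat{\mu}}{\sigma} \right)^2 \sim \chi^2(N-1).$$
Thus, we know that
$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2 / N}} \sim t(N-1).$$
With large samples there is usually little difference between a T and a Normal test. With smaller samples, it is important to use the correct test. We construct the test statistic as we did before, only replacing the population variance with the sample variance. We then use the t-test table to construct the test.

Notation
Anyone wishing to study econometrics further will need to be able to use matrix notation and matrix operations. See Appendix 3B of GHJ if you want some help.

Definitions:
Vector: a vector is a one dimensional array of numbers or variables. We will always use column vectors. For example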
1 3 x= 4 2 The vector x is a column vector because it consists of one column of numbers. When specifying the dimensions of a matrix/vector, we will say the matrix has dimensions n × k where n is the number of rows, and k is the number of columns. The vector x has 31 dimensions 4 × 1 . We will also refer to elements of a matrix/vector this way. For example, the element in the 3rd row and 1st column of x is x31 = 4 .
1 2 A= 4 2 1 5 4 1 4 3 4 3 2 2 6 1 The matrix A is a 4x4 matrix. A 23 = 4 . We also call A a square matrix, because it has the same number of rows and columns. Diagonal matrix: A matrix that only has nonzero values along the diagonal (D11,D22, …) 1 0 0 D = 0 4 0 0 0 7 Identity matrix: a diagonal matrix with all 1’s along the diagonal. We refer to an n × n identity matrix as I n .
I n A = AI k = A . Once matrix multiplication is defined, you will note that 1 0 0 I = 0 1 0 0 0 1 Symmetric matrix: A square matrix such that Aij = A ji for all i, j . 1 2 3 A = 2 4 6 3 6 5 Idempotent: A matrix is idempotent if and only if the matrix multiplied by itself equals itself (We will define matrix multiplication later). Some examples: [1] × [1] = [1]
In × In = In Matrix operations Trace: The function trace ( A ) sums the diagonal elements of a square matrix. 32 trace ( A ) = ∑ Aii
i =1 n where n is the dimension of A. If 1 2 3 A = 2 4 6 3 6 5 then trace ( A ) = 1 + 4 + 5 = 10 . Addition and Subtraction: You may only add and subtract matrices that have the same dimensions. If A is an n × k matrix and B is an n × k , the A + B = C is also an n × k with each element Cij = Aij + Bij . 3 2 8 2 3 7 5 5 15 6 3 2 + 7 2 3 = 13 5 5 12 1 4 11 2 5 23 3 9 Scalar multiplication: Multiplying any matrix by a scalar (a single number) results in each element being multiplied by that number. 1 2 3 4 8 12 4 2 4 6 = 8 16 24 3 6 5 12 24 20 Transpose: For any n × k matrix A, its transpose (written A ' ) is a k × n matrix B with Bij = A ji . 1 2 3 A = 4 5 6 7 8 9 1 4 7 A ' = 2 5 8 3 6 9 (so the columns become rows). 33 Matrix multiplication: Suppose we want to find AB . This multiplication can only be carried out if A is n × k and B is k × m . In other words, the number of columns in the first matrix must equal the number of rows in the second. If C = AB , then Cij = ∑ Ait Btj .
t =1 m Example: 1 2 3 4 A = 5 6 7 8 , 9 10 11 12 1 2 3 4 5 6 7 8 9 10 . B= 11 12 13 14 15 16 17 18 19 20 Then, 11 11 11 11 11 0 × × × 1 1+2×6+3× 1+4× 6 1×2+2×7+3× 2+4× 7 1 3+2×8+3× 3+4× 8 1 4+2×9+3× 4+4× 9 1×5+2× 0+3× 5+4×2 5× +6×6+7× 1+8× 6 5×2+6×7+7× 2+8× 7 5×3+6×8+7× 3+8× 8 5×4+6×9+7× 4+8× 9 5×5+6× 0+7× 5+8×2 A = 1 B 11 11 11 11 11 0 9× +1 ×6+1 × 1+1 × 6 9×2+1 ×7+1 × 2+1 × 7 9×3+1 ×8+1 × 3+1 × 8 9×4+1 ×9+1 × 4+1 × 9 9×5+1 × 0+1 × 5+1 ×2 0 11 2 1 0 11 2 1 0 11 2 1 0 1 1 1 2 0 1 0 11 2 1 12345 1 0 1 0 1 0 1 0 1 0 2 6 2 2 2 8 3 4 3 0 =4 7 9 2 5 3 2 4 4 4 6 5 8 5 0 8 2 6 0 5 Note that BA does not exist, because B has 5 columns and A has 3 rows. Inverse: the inverse of a matrix A, written A −1 , if it exists, is a matrix such that
AA −1 = A −1A = I a You may need this for a homework or two: A = 11 a21 a12 then a22 − a12 a11 A −1 =
Also, for any diagonal matrix, a22 1 a11a22 − a12 a21 − a21 34 1 D 11 D −1 = 0 ⋮ 0 1 D22 0 ⋯ 0 ⋱ Some Rules for Matrix Operations (and two more Definitions) Commutative: addition. Distributive: ( A + B ) C = AC + BC and A ( B + C ) = AB + AC . Transposition: ( ABC ) ' = C ' B ' A ' . Inverse: ( AB ) = B −1A −1
−1 ( AB ) C = A ( BC ) for multiplication and ( A + B ) + C = A + ( B + C) for A square matrix is invertible only if its determinant ( det ( A ) ) is not 0. I won’t ask you te find determinants of anything but a 2x2 matrix on the homework. This determinant is det ( A ) = a11a22 − a21a12 . Positive definite matrix: A matrix A is positive definite if for any nonzero column vector that is conformable for multiplication x , x ' Ax > 0 . If x ' Ax ≥ 0 then we say the matrix is positive semidefinite. Matrix Calculus First we must define a function. Let f ( x ) : R n → R m (a function mapping n dimensional vectors into m dimensional vectors). For example x1 x 2 f (x) = 5 x1 2 x1 + x2 35 x maps any two dimensional vector 1 into a four dimensional vector. x2 Differentiation: In one dimension, differentiation finds something like a slope. In multiple dimensions, differentiation finds the slope in each of the cardinal directions (in the direction of each axis). The result is a matrix called a Jacobian denoted J = element of the Jacobian is J ij ( x ) = function ∂f j ∂xi ∂f . Each ∂x . J is a n × m matrix. In the case of the above ∂x1 ∂x1 J= ∂x 1 ∂x2 ∂x2 ∂x1 ∂x2 ∂x2 ∂5 x1 ∂x1 ∂5 x1 ∂x2 ∂ ( 2 x1 + x2 ) ∂x1 = 1 0 5 2 ∂ ( 2 x1 + x2 ) 0 1 0 1 ∂x2 If we are dealing with a linear function, finding the Jacobian is simple: f ( x ) = y = Ax
then, J = A'. On the other hand, if f ( x ) = y = xA
then J = A. 1 0 To see this from the function above, note that f ( x ) = Ax = 5 2
Hence J = A ' . 0 x1 x 1 1 x2 = x2 5 x1 0 1 2 x1 + x2 Linear Quadratic Forms: A linear quadratic function has the form f ( x ) = x ' Ax where A is a square matrix. The Jacobian of this function is 36 J ( x ) = Ax + A ' x
If A is symmetric then J ( x ) = 2 Ax . For example, x ' Ax = [ x1 3 2 x1 x2 ] = [3 x1 + 4 x2 4 1 x2 x 2 x1 + x2 ] 1 = 3 x12 + 6 x1 x2 + x2 2 x2 The Jacobian for the above is 6 x1 + 6 x2 6 6 x1 3 2 3 4 x1 ∂ 2 2 3 x1 + 6 x1 x2 + x2 = 6 x + 2 x = 6 2 x = 4 1 + 2 1 x . ∂x 2 2 1 2
Or J ( x ) = Ax + A ' x . Hessian: The Hessian is the second derivative of a matrix function, denoted H. In the case of a linear quadratic form, x ' Ax , H (x) = A + A '
Or, if $A$ is symmetric, $H(x) = 2A$.

Why Matrices?
Matrices eliminate the need for out of control summation notation. For example, the sample variance from before can be written in summation notation as:
$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( x_n - \frac{1}{N} \sum_{m=1}^{N} x_m \right)^2.$$
Vectors and matrices take care of summation in a simpler notation. For example,
$$x'x = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = x_1^2 + x_2^2 + \dots + x_n^2.$$
Let
$$i_n = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
be a column vector of dimension $n$ with all elements equal to 1. Define
$$A_n = I_n - \frac{1}{n} i_n i_n'.$$
This matrix has some very special properties:

1. $A_n$ is symmetric. For example,
$$A_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{bmatrix}.$$

2. $A_n$ is idempotent.
$$A_n A_n = \left( I_n - \frac{1}{n} i_n i_n' \right) \left( I_n - \frac{1}{n} i_n i_n' \right) = I_n I_n - \frac{1}{n} I_n i_n i_n' - \frac{1}{n} i_n i_n' I_n + \frac{1}{n^2} i_n i_n' i_n i_n'$$
$$= I_n - \frac{2}{n} i_n i_n' + \frac{1}{n^2} i_n (i_n' i_n) i_n' = I_n - \frac{2}{n} i_n i_n' + \frac{n}{n^2} i_n i_n' = I_n - \frac{1}{n} i_n i_n',$$
because $i_n' i_n = n$.

3. $A_n$ subtracts the mean from column data. Suppose that $X$ is an $n \times k$ matrix of data, with each column representing $n$ observations of the $k$th variable. For example, we could have columns representing age, height, sex, etc., and each row representing a student in the class:
$$X = \begin{bmatrix} 21 & 6.1 & 1 \\ 22 & 5.9 & 0 \\ 19 & 6.5 & 0 \\ 25 & 5.2 & 1 \end{bmatrix}$$
Note that the means of the three columns are 21.75, 5.925, and 0.5. By subtracting the column mean from each element, we obtain the deviation from the sample mean, or
$$X - \bar{X} = \begin{bmatrix} -0.75 & 0.175 & 0.5 \\ 0.25 & -0.025 & -0.5 \\ -2.75 & 0.575 & -0.5 \\ 3.25 & -0.725 & 0.5 \end{bmatrix}.$$
Instead we could multiply,
$$A_n X = \left( I_4 - \frac{1}{4} i_4 i_4' \right) X = X - \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} X = X - \frac{1}{4} \begin{bmatrix} \sum_{i=1}^{4} X_{i1} & \sum_{i=1}^{4} X_{i2} & \sum_{i=1}^{4} X_{i3} \\ \vdots & \vdots & \vdots \\ \sum_{i=1}^{4} X_{i1} & \sum_{i=1}^{4} X_{i2} & \sum_{i=1}^{4} X_{i3} \end{bmatrix} = \begin{bmatrix} -0.75 & 0.175 & 0.5 \\ 0.25 & -0.025 & -0.5 \\ -2.75 & 0.575 & -0.5 \\ 3.25 & -0.725 & 0.5 \end{bmatrix}.$$
Hence, using the vector $x$ from before,
$$Ax = \begin{bmatrix} x_1 - \frac{1}{N} \sum_{i=1}^{N} x_i \\ x_2 - \frac{1}{N} \sum_{i=1}^{N} x_i \\ \vdots \\ x_N - \frac{1}{N} \sum_{i=1}^{N} x_i \end{bmatrix}.$$
Thus,
$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^2 = \frac{1}{N-1} (Ax)'Ax = \frac{1}{N-1} x'A'Ax = \frac{1}{N-1} x'Ax.$$
Another reason for using matrices is that we are often interested in multivariate random variables.

Multivariate Statistics
Let $x$ denote an $n \times 1$ vector of random variables,
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$
Now we can begin to use the same operators as before. For example, we could find the expectation of $x$:
$$E[x] = \begin{bmatrix} E[x_1] \\ E[x_2] \\ \vdots \\ E[x_n] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \mu_x.$$
Now instead of finding the variance of each individual element, we will use the Variance-Covariance Matrix. This is more useful as it tells us about the relationship between each of the random variables.
$$VAR(x) = \Sigma = E\left[ (x - \mu_x)(x - \mu_x)' \right] = \begin{bmatrix} E(x_1 - \mu_1)(x_1 - \mu_1) & E(x_1 - \mu_1)(x_2 - \mu_2) & \cdots & E(x_1 - \mu_1)(x_n - \mu_n) \\ E(x_2 - \mu_2)(x_1 - \mu_1) & E(x_2 - \mu_2)(x_2 - \mu_2) & \cdots & E(x_2 - \mu_2)(x_n - \mu_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(x_n - \mu_n)(x_1 - \mu_1) & E(x_n - \mu_n)(x_2 - \mu_2) & \cdots & E(x_n - \mu_n)(x_n - \mu_n) \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1n} & \sigma_{2n} & \cdots & \sigma_n^2 \end{bmatrix}$$
where $\sigma_{ij} = COV(x_i, x_j)$. This is necessarily a symmetric matrix. We find the same equivalent rules as before:
$$E[Ax + b] = A\mu_x + b$$
$$VAR(Ax + b) = E\left[ (Ax + b - E(Ax + b))(Ax + b - E(Ax + b))' \right] = E\left[ (Ax + b - A\mu_x - b)(Ax + b - A\mu_x - b)' \right]$$
$$= E\left[ A(x - \mu_x)(x - \mu_x)'A' \right] = A \, E\left[ (x - \mu_x)(x - \mu_x)' \right] A' = A \Sigma A'$$
We will also specify multivariate distributions (or joint distributions of many random variables) using matrix notation.

Multivariate normal: Let $x$ be an $n \times 1$ vector of random variables. If $x$ is jointly normally distributed, then we write $x \sim N(\mu_x, \Sigma)$. Note that this specifies covariances. As with the normal specified previously, $z = Ax + b \sim N(A\mu_x + b, A \Sigma A')$. The multivariate standard normal is $N(0, I)$, so each variable has mean zero, variance 1, and all are uncorrelated. The pdf for a multivariate normal is
$$f(x) = (2\pi)^{-n/2} \, |\Sigma|^{-1/2} \exp\left( -\frac{(x - \mu_x)' \Sigma^{-1} (x - \mu_x)}{2} \right).$$
Also, if $x \sim N(0, I)$, then $x'x = \sum_{i=1}^{n} x_i^2 \sim \chi^2(n)$. If $A$ is idempotent (and symmetric), then $x'Ax \sim \chi^2(trace(A))$.
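The centering matrix $A_n$ defined earlier is a convenient idempotent matrix to experiment with. The sketch below, in plain Python with hypothetical helper names (`centering_matrix`, `matmul`, `trace`), checks numerically that $A_4 A_4 = A_4$, that its trace is $n - 1 = 3$, and that $x'Ax / (N-1)$ reproduces the usual sample variance for the vector $x = (1, 3, 4, 2)'$ used in the notation section. It is a numerical illustration under those assumptions, not a proof.

```python
import statistics


def centering_matrix(n):
    """Return A_n = I_n - (1/n) i_n i_n' as a list of rows."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    """Sum the diagonal elements of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


n = 4
A = centering_matrix(n)
AA = matmul(A, A)
is_idempotent = all(abs(AA[i][j] - A[i][j]) < 1e-12
                    for i in range(n) for j in range(n))

x = [1.0, 3.0, 4.0, 2.0]
Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]  # deviations from the mean
quad_form = sum(x[i] * Ax[i] for i in range(n))                 # x'Ax
sample_var = quad_form / (n - 1)

print(is_idempotent, trace(A), sample_var, statistics.variance(x))
```

The trace of $n - 1$ lines up with the degrees of freedom that remain once the mean has been estimated.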
We will begin by dealing with random variables distributed $N(\mu_x, \sigma^2 I)$ (independent and homoskedastic).

Some rules for independence:
1. If $x \sim N(\mu_x, \Sigma)$, then $z_1 = (x - \mu_x)'A(x - \mu_x)$ and $z_2 = (x - \mu_x)'B(x - \mu_x)$ can only be independent if $AB = 0$.
2. If $x \sim N(0, I)$ and $L$ is a $k \times n$ matrix of rank $k$, then $Lx$ and $x'Ax$ are independent only if $LA = 0$.

Multivariate Sample Statistics
Suppose we use the data matrix specified before
$$X = \begin{bmatrix} 21 & 6.1 & 1 \\ 22 & 5.9 & 0 \\ 19 & 6.5 & 0 \\ 25 & 5.2 & 1 \end{bmatrix}$$
We will want to know the sample means by column.
$$\hat{\mu}_x = \frac{1}{n} X' i_n = \frac{1}{4} \begin{bmatrix} 21 & 22 & 19 & 25 \\ 6.1 & 5.9 & 6.5 & 5.2 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 21 + 22 + 19 + 25 \\ 6.1 + 5.9 + 6.5 + 5.2 \\ 1 + 0 + 0 + 1 \end{bmatrix} = \begin{bmatrix} 21.75 \\ 5.925 \\ 0.5 \end{bmatrix}$$
To find the sample variance covariance matrix:
$$\hat{\Sigma} = \frac{1}{n-1} X'AX = \frac{1}{n-1} (AX)'AX = \frac{1}{3} \begin{bmatrix} -0.75 & 0.25 & -2.75 & 3.25 \\ 0.175 & -0.025 & 0.575 & -0.725 \\ 0.5 & -0.5 & -0.5 & 0.5 \end{bmatrix} \begin{bmatrix} -0.75 & 0.175 & 0.5 \\ 0.25 & -0.025 & -0.5 \\ -2.75 & 0.575 & -0.5 \\ 3.25 & -0.725 & 0.5 \end{bmatrix}$$
$$= \frac{1}{3} \begin{bmatrix} 18.75 & -4.075 & 2.5 \\ -4.075 & 0.8875 & -0.55 \\ 2.5 & -0.55 & 1.00 \end{bmatrix} = \begin{bmatrix} 6.25 & -1.3583 & 0.8333 \\ -1.3583 & 0.2958 & -0.1833 \\ 0.8333 & -0.1833 & 0.3333 \end{bmatrix}$$
The diagonal elements are the sample variances. The off diagonal elements tell us the covariances. So, for instance, age has a negative relationship with height for our sample ($-1.3583$). Having the gender value equal to 1 has a positive relation to age ($0.8333$) and a negative relation to height ($-0.1833$).

The Simple Linear Regression

Models to Decisions
When we research, we should be guided by some specific research questions: 1. Can I increase my profit by increasing price, or altering the quality of my product? 2. Will raising the minimum wage improve overall welfare? 3. How will new regulations affect input supply? 4. Is the model of consumption consistent with observed behavior? 42 In the majority of applied problems (like 1,2 and 3) we will first need to find a mathematical model that hypothesizes some relationship between the variables of interest. If none exists we can construct one, or alter an existing model to suit our purpose. As stated, it is often useful (although not necessary) to find a linear approximation of the modeled relationship. For this course, we will always use a linear approximation. We will want to find data that relate to the variables of our model. And then, after estimating, we will want to answer the question using statistical tests. Lets get back to the consumption problem we had looked at. We wanted to estimate xi = γ 1 wi + εi , px where xi is consumption, wi is wealth and px is the price of the good. We can now construct our observed consumption vector and relative wealth vector. 1.5282 2 4.2117 12 6.9832 12 3.0636 8 6.5139 16 x= ,w = 6.6668 14 4.0593 10 2.3044 6 15.9677 42 8.2944 20 We can now specify our model x = γ 1w + ε
where ε is the vector of error terms. We must now assume some distribution for ε . One easy assumption to make is that ε ∼ N ( 0, σ 2 I ) . Assuming the errors have mean zero, allows us to construct an unbiased estimate of γ 1 . In fact, we can construct several: γˆ = 1 10 xi ∑ = .4548 10 i =1 wi 43 γˆ1 = ( w ' w ) w ' x = .4023 −1 γˆ1 =
To see that these are unbiased note that: x1 = .7641 . w1 1 10 x 1 10 γ w + ε 1 10 γ 1wi + E ( ε ) E ∑ i = E ∑ 1 i = γ1 = ∑ wi 10 i =1 wi 10 i =1 wi 10 i =1 10 10 wi xi ∑ ∑ wi (γ 1wi + ε i ) −1 = E i =1 10 = E ( w ' w ) w ' x = E i =1 10 2 2 ∑ wi ∑ wi i =1 i =1 ( ) ∑ w (γ w + E ( ε ) )
i 1 i i i =1 10 ∑w
i =1 10 2 i ∑w
= γ1 10 2 i ∑w
i =1 i =1 10 = γ1
2 i x γ w + ε γ w + E ( ε1 ) w E 1 = E 1 1 1 = 1 1 = γ1 1 = γ1 w1 w1 w1 w1 How do we know which to choose? There are several standard estimators using several differing criteria. We will use the following to represent a standard linear model: x11 x y = Xβ + ε = 21 ⋮ xn1 x12 x22 ⋮ xn 2 ⋯ x1k β1 ε1 x11 β1 + x12 β 2 + ⋯ + x1k β k + ε1 ⋯ x2 k β 2 ε 2 x21 β1 + x22 β 2 + ⋯ + x2 k β k + ε 2 + = ⋱ ⋮ ⋮ ⋮ ⋮ ⋯ xnk β k ε n xn1 β1 + xn 2 β 2 + ⋯ + xnk β k + ε n where we have n observations and k explanatory variables. The parameters we wish to estimate of β , Σ . In general, the first column of x will all be ones, as in 44 1 x12 1 x 22 ⋮ ⋮ 1 xn 2 ⋯ x1k ⋯ x2 k , ⋱ ⋮ ⋯ xnk This makes the first parameter, β1 a constant term, or 1 x12 1 x 22 y = Xβ + ε = ⋮ ⋮ 1 xn 2 ⋯ x1k β1 ε1 β1 + x12 β 2 + ⋯ + x1k β k + ε1 ⋯ x2 k β 2 ε 2 β1 + x22 β 2 + ⋯ + x2 k β k + ε 2 + = ⋱ ⋮ ⋮ ⋮ ⋮ ⋯ xnk β k ε n β1 + xn 2 β 2 + ⋯ + xnk β k + ε n Even if our theory says no constant term should exist, we will often estimate a constant term to test if it is equal to zero. Estimation of this sort is called a regression. Estimators of Interest
Least Squares Estimator
We saw before that the regression problem is equivalent to having an overidentified set of linear equations. One best guess at the parameters is to use estimates that minimize the sum of squared estimated errors. We will call this the Ordinary Least Squares (OLS) estimator. If we minimized error (rather than squared error) we would obtain a line with errors equal to $-\infty$.

Derivation (testable):
Matrix notation: The sum of squared errors can be written as
$$\sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta) = y'y - y'X\beta - (X\beta)'y + (X\beta)'X\beta = y'y - y'X\beta - \beta'X'y + \beta'X'X\beta$$
To minimize we again take the derivative with respect to our estimators and set the value equal to 0:
$$\frac{\partial}{\partial \hat{\beta}} \left( y'y - y'X\hat{\beta} - \hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta} \right) = -(y'X)' - X'y + 2X'X\hat{\beta} = -2X'y + 2X'X\hat{\beta} = 0,$$
or,
$$X'X\hat{\beta} = X'y.$$
Solving this yields
$$\hat{\beta} = (X'X)^{-1}X'y.$$
Note this requires the inverse to exist. The second order condition is
$$\frac{\partial^2}{\partial \hat{\beta} \, \partial \hat{\beta}'} \left( y'y - y'X\hat{\beta} - \hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta} \right) = \frac{\partial}{\partial \hat{\beta}'} \left( -2X'y + 2X'X\hat{\beta} \right) = 2X'X.$$
Thus we have a global optimum if X ' X is positive definite. Summation Notation???? Hard to do generally. If we assume an extremely simple form: yi = β1 + xi 2 β 2 + xi 3 β 3 + ε i ˆ then we want to find β that minimizes ∑( y − x
i i =1 n i1 ˆ ˆ ˆ β1 − xi 2 β 2 − xi 3 β 3 ) 2 Differentiating yields
ˆ ∂∑ εi2
i =1 n ˆ ∂β1 ˆ ∂∑ εi2
i =1 n ˆ ˆ ˆ = ∑ −2 yi − β1 − xi 2 β 2 − xi 3 β 3 (1) = 0
i =1 n ( ( ( ) ˆ ∂β 2 ˆ ∂∑ εi2
i =1 n ˆ ˆ ˆ = ∑ −2 yi − β1 − xi 2 β 2 − xi 3 β 3
i =1 n )( x )( x i2 )=0 )=0 ˆ ∂β 3 ˆ ˆ ˆ = ∑ −2 yi − β1 − xi 2 β 2 − xi 3 β 3
i =1 n i3 Solving these simultaneously yields 46 ˆ β1 = 1n 1n 1n ∑ yi − βˆ2 n ∑ xi 2 −βˆ3 n ∑ xi 3 n i =1 i =1 i =1 n n n n 2 ∑ yi xi 2 ∑ xi 3 − ∑ yi xi 3 ∑ xi 2 xi 3 i =1 i =1 i =1 ˆ β 2 = i =1 2 n n n ∑ xi 22 ∑ xi32 − ∑ xi 2 xi3 i =1 i =1 i =1 n n n n ∑ yi xi3 ∑ xi 22 − ∑ yi xi 2 ∑ xi 2 xi3 i =1 i =1 i =1 ˆ β 3 = i =1 2 n n n 2 2 ∑ xi 2 ∑ xi 3 − ∑ xi 2 xi 3 i =1 i =1 i =1 I’d really recommend using matrix notation for this derivation. It makes life simpler on all of us. Under some common assumptions the OLS estimator has some desirable properties. Some assumptions that are always employed: The functional form is correct. There are no pertinent variables that have been omitted from the regression. We must remember these two assumptions no matter what estimator we choose. Assumption : E ( ε i ) = 0 , X is nonstochastic (not random), and linearly independent. Properties: 1.Unbiased
−1 −1 −1 −1 ˆ E β = E ( X ' X ) X ' y = E ( X ' X ) X ' ( Xβ + ε ) = ( X ' X ) X ' Xβ + ( X ' X ) X ' E ( ε ) () ( )( ) = β +0= β
Assumptions: ε i ∼ N ( 0, σ 2 I ) (normal, independent, homoskedastic), E ( X ' ε ) = 0 (in other words if X is random, then it is uncorrelated with the error).
Properties: 1. Unbiased
−1 ˆ 2. β ∼ N β , σ 2 ( X ' X ) . To see this note ( ) VAR ( β ) = VAR ( X ' X ) X ' ( Xβ + ε ) = VAR ( X ' X ) X ' Xβ + ( X ' X ) X ' ε = ( X ' X ) X ' σ 2 IX ( X ' X ) = σ 2 ( X ' X ) X ' X ( X ' X ) = σ 2 ( X ' X )
−1 −1 −1 −1 ( −1 ) ( −1 −1 ) −1 47 ˆ 3. β is BLUE. Proof (known as GaussMarkov Theorem): 1. The estimator is clearly linear in y. In other words it can be written
Ay . 2. We have shown the estimator is unbiased. ˆ 3. We need then to show that β has the least variance of any linear unbiased estimator. In the case of a vector of estimates, a minimum variance estimator is an estimator β * such that for and constant vector c ,
ɶ VAR ( c ' β * ) ≤ VAR c ' β
ɶ where β is any other estimator. ( ) This last statement is equivalent to the statement that
ɶ VAR ( β * ) = VAR β + P () ˆ where P is a positive definite matrix. We will prove that β is least
ɶ variance by construction. Suppose that β is an unbiased linear estimator ɶ ɶ that has minimum variance. Since β is linear, we can write β = a ' y . ɶ Hence, E β = a ' y = a ' E ( y ) = a ' E ( Xβ + ε ) = a ' Xβ ɶ and VAR β = VAR ( a ' y ) = a 'VAR ( y ) a = a 'σ 2 Ia = σ 2a ' a . Then a must () () solve
ɶ min VAR β = σ 2a ' a () subject to
ɶ E β = a ' Xβ = β . () The solution to the Lagrangian problem must also solve the above L = σ 2a ' a − λ ' ( I − a ' X )
The first order conditions are: 48 (3) ∂L = 2σ 2a − Xλ = 0 ∂a ∂L = I −a'X = 0 ∂λ Premultiplying the first equation by X ' and solving yields λ = 2σ 2 ( X ' X ) X ' a
a'X = I Substitution obtains
−1 −1 λ = 2σ 2 ( X ' X ) I .
Again substituting into (3) obtains 2σ 2a − 2σ 2 X ( X ' X ) = 0 , or a = X ( X ' X) . Which means that
ɶ β = a ' y = ( X ' X) X ' y .
We will use the following example data
$$y = \begin{bmatrix} 22 \\ 42 \\ 39 \end{bmatrix} = X\beta + \varepsilon = \begin{bmatrix} 1 & 3 \\ 1 & 5 \\ 1 & 7 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix} = \begin{bmatrix} \beta_1 + 3\beta_2 + \varepsilon_1 \\ \beta_1 + 5\beta_2 + \varepsilon_2 \\ \beta_1 + 7\beta_2 + \varepsilon_3 \end{bmatrix}$$
Our estimator is
$$\hat{\beta} = (X'X)^{-1}X'y = \left( \begin{bmatrix} 1 & 1 & 1 \\ 3 & 5 & 7 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 1 & 5 \\ 1 & 7 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 & 1 \\ 3 & 5 & 7 \end{bmatrix} \begin{bmatrix} 22 \\ 42 \\ 39 \end{bmatrix} = \begin{bmatrix} 3 & 15 \\ 15 & 83 \end{bmatrix}^{-1} \begin{bmatrix} 103 \\ 549 \end{bmatrix}$$
$$= \frac{1}{249 - 225} \begin{bmatrix} 83 & -15 \\ -15 & 3 \end{bmatrix} \begin{bmatrix} 103 \\ 549 \end{bmatrix} = \frac{1}{24} \begin{bmatrix} 314 \\ 102 \end{bmatrix} = \begin{bmatrix} 13.0833 \\ 4.25 \end{bmatrix}$$
Thus $\hat{\beta}_1 = 13.0833$, $\hat{\beta}_2 = 4.25$. We know that
$$\hat{\beta} \sim N\left( \beta, \; \sigma^2 \begin{bmatrix} 3.4583 & -0.6250 \\ -0.6250 & 0.1250 \end{bmatrix} \right),$$
but we don't know $\sigma^2$. This means we will need to construct a t-test when working with our estimates. Whenever we estimate the variance covariance matrix of our parameter estimates $VAR(\hat{\beta})$, we call the square root of the diagonal elements of the matrix estimates of standard error. These diagonal elements estimate the variance of our estimators; their square root is an estimate of standard deviation. However, we have "used up" some of our degrees of freedom. In fact, for each parameter we use in the linear model, we have fixed some value of epsilon. To see this, suppose we had only two observations and two parameters. Then, we would have a perfectly identified system of equations, and $\hat{\varepsilon}_1 = \hat{\varepsilon}_2 = 0$. Thus, there are only $n - k$ independent elements of the vector $\hat{\varepsilon}$. This might sound convoluted, but, in any case
$$\hat{\sigma}^2 = \frac{1}{n-k} \, y' \left( I - X(X'X)^{-1}X' \right) y = \frac{1}{n-k} \sum_{i=1}^{3} \hat{\varepsilon}_i^2 = \frac{1}{3-2} \sum_{i=1}^{3} \left( y_i - x_{i1}\hat{\beta}_1 - x_{i2}\hat{\beta}_2 \right)^2$$
$$= (22 - 13.0833 - 3 \times 4.25)^2 + (42 - 13.0833 - 5 \times 4.25)^2 + (39 - 13.0833 - 7 \times 4.25)^2 = 88.1667$$
Which is again distributed proportional to a Chi-square with $n - k$ degrees of freedom, thus
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 \, xx_{jj}}} \sim t(n - k)$$
where $xx_{jj}$ is the $jj$th element of $(X'X)^{-1}$. Now we could test the hypothesis:
$$H_0: \beta_1 = 0, \qquad H_A: \beta_1 \neq 0$$
This is called a significance test. Almost every econometric study reports the results of significance tests for all coefficients. If the coefficient is not statistically different from zero, it is considered to be a poor explanatory variable. If the coefficient is significantly different from zero, some evidence is provided for the model employed. In our case, our test statistic is
$$\frac{\hat{\beta}_1 - 0}{\sqrt{\hat{\sigma}^2 \, xx_{11}}} = \frac{13.0833}{\sqrt{88.1667 \times 3.4583}} = 0.749 \sim t(1)$$
We need to use a two tailed test, with $\alpha = .05$. Looking at page 845 of our textbook, they are using values for a one tailed test. So, we look in the column for $\alpha = .025$ (because the other .025 probability is assigned to the left hand critical region). We see that with one degree of freedom, our value would have to exceed 12.706 to reject the null. Hence, $\hat{\beta}_1$ is insignificant.
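The worked example can be reproduced with a short sketch in plain Python. The $2 \times 2$ inverse is written out by hand using the formula from the notation section, so the code only applies when there are exactly two regressors (a constant and one variable). All variable names here are illustrative choices.

```python
import math

# Example data from the text: a column of ones and one regressor.
X = [[1.0, 3.0], [1.0, 5.0], [1.0, 7.0]]
y = [22.0, 42.0, 39.0]
n, k = len(X), len(X[0])

# Build X'X (k x k) and X'y (k x 1).
xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
       for a in range(k)]
xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]

# Invert the 2 x 2 matrix X'X with the closed-form rule; requires det != 0.
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
xtx_inv = [[xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det, xtx[0][0] / det]]

# OLS estimate: beta_hat = (X'X)^{-1} X'y.
beta = [sum(xtx_inv[a][b] * xty[b] for b in range(k)) for a in range(k)]

# Residuals and the variance estimate with n - k degrees of freedom.
resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
sigma2 = sum(e * e for e in resid) / (n - k)

# t statistic for H0: beta_1 = 0 (the constant term).
t_stat = beta[0] / math.sqrt(sigma2 * xtx_inv[0][0])

print([round(b, 4) for b in beta], round(sigma2, 4), round(t_stat, 3))
```

With only one degree of freedom the critical value is very large, so a statistic of about 0.75 is far from rejecting the null.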
[Figure: "Fitted Values". Observed y and fitted values yhat plotted against x (x from 0 to 8, y from 0 to 45), with a linear trend line through yhat.]

Maximum Likelihood
The maximum likelihood principle is a way to maximize the likelihood of observing the sample data. (60 years ago this was called inverse probability and was strictly taboo. Now it is accepted practice.) If we have a pdf given by $f(x \mid \theta)$, where $\theta$ is the set of unknown parameters, then the likelihood function is given by
$$L(\theta \mid X) = f(X \mid \theta),$$
where X is the observed data. A likelihood function is a function of unknown distribution parameters given data. The pdf is a function of random variables given parameter values. (Does this seem inane?) the maximum likelihood estimator (MLE) is defined by 51 θˆ = arg max L (θ  X ) .
θ So if we need to find the MLE estimator, we will take the derivative of the likelihood function with respect to each parameter and set the value to 0. For example, let’s use the example from the previous section, y = Xβ + ε with ε ∼ N ( 0, σ 2 I ) .
The pdf of ε can be written (ε ) ' I (ε ) 1 2 ( y − Xβ ) ' ( y − Xβ ) 1 2 f (ε  β , σ 2I ) = exp − exp − = 2 2 2 2σ 2σ 2 2πσ 2πσ n n Hence, our likelihood function is given by ( y − Xβ ) ' ( y − Xβ ) 1 2 L ( β , σ I  y, X ) = exp − 2 2σ 2 2πσ 2
n In order to make optimization easier, we often take the log of the likelihood function. The log function is monotonic, so maximizing the log likelihood is equivalent to maximizing the likelihood.
n n 1 2 1 2 ( y − Xβ ) ' ( y − Xβ ) l ( β , σ I  y , X ) = ln + ln 2 − 2π σ 2σ 2 2 = n 1 ln 2 2π 1 1 n + ln 2 − 2 ( y ' y − y ' Xβ − β ' X ' y + β ' X ' Xβ ) 2σ 2 σ Thus, the first order conditions are
∂l 1 ˆ = X ' y + X ' y − 2 X ' Xβ = 0 ˆ ∂β 2σ 2 ˆ ˆ 2 y − Xβ ' y − Xβ ∂l n =− 2 + 2 ˆ ∂σ 2 2σ (σˆ 2 ) ( ) ( )( ) =− Solving these equations obtains ˆˆ n 2ε ' ε + =0 2 2 ˆ 2σ ˆ 4 (σ 2 ) 52 ˆ β = ( X ' X) X ' y ˆ σ2 = ˆˆ ε 'ε n −1 Using summation notation: Each individual ε i ∼ N ( 0, σ 2 ) . So, K y − x β ik k 1 i k =1 − σ 2 ∑ f (ε i  β , σ 2 ) = 1 2πσ
2 e 1 ε 2 − i 2 σ = 1 2πσ 2 e 2 Because each draw is independent, the joint pdf can be written
2 K y − x β i ∑ ik k 1 k =1 − 2 σ n 1 f (ε  β , σ 2 ) = ∏ e . 2 i =1 2πσ Thus the likelihood function can be written
2 K y − x β i ∑ ik k 1 k =1 − 2 σ n 1 L ( β , σ 2  X, β ) = ∏ e , 2 i =1 2πσ and the log likelihood can be written 2 2 K K yi − ∑ xik β k yi − ∑ xik β k n n k =1 k =1 = − n ln 2π − n ln σ 2 − 2 1 ln 1 − l ( β , σ  X, β ) = ∑ ∑ 2 2 2 2πσ 2 2σ 2 2 2σ i =1 i =1 The first order conditions are 53 K ˆ 2 yi − ∑ xik β k xik n ∂l k =1 =0 =∑ ˆ ∂β k i =1 2πσ 2 ∂l σ 2 =− n + i =1 =0 ˆ 2σ 2 4 (σ 2 )2 ˆ ˆ 2∑ ε i 2 n ˆ ˆ It is simple to confirm that β are the same as the least squares estimator, and that σ 2 is just the sum of squared errors divided by n. There is a formula for finding the asymptotic standard errors of MLE’s. However, the formula complicated and generally requires the use of numerical tools, etc. Econometrics packages will calculate these for you. I think this is a bit beyond the scope of the course to talk too much about this. Assumptions: We have the correct pdf specified, and supposing that the likelihood function meets some regularity conditions (dealing with identification, convergence and the space of possible parameters) Properties of MLE: 1. Consistent: What does this mean? lim P θˆ − θ < ε = 1, ∀ε > 0 .
n →∞ ( ) Or, as the sample size increases, the parameter converges in probability to the true value. This is sort of like saying that if we had a large enough sample, MLE would be unbiased. Note that the MLE estimate of the linear parameters are unbiased. This is not always the case. For example, the MLE estimate for variance is biased: n 2 ˆ ∑ εi n − k 2 i =1 = ˆ E (σ MLE ) = E σ2 n n However, if we take the limit as n gets large the value converges to the truth. 2. Asymptotically normal: What does this mean? This means the central limit theorem applies. In fact, we have assumed the prerequisites for the CLT. 54 3. Asymptotically Efficient: What does this mean? This means that as n → ∞ , the distribution of the estimator has a lower variance than all other consistent estimators. These are very good properties, but they only hold in infinite samples (which we will never have). The standard approach is to argue that your sample is large enough that we can invoke these properties. Method of Moments Estimators Affectionately called MOM. This method is the simplest of the three. Using MOM means we set the sample moment equal to the population moments. In the case of our linear model, we assume (4)
n ε 1 ˆ ε ' in = ∑ i = 0 , n i =1 n because we assume we know that the error has mean 0. We also assume that
1 n ˆ n ∑ xi1ε i i =1 0 1 n ˆ 1 n ∑ xi 2ε i = 0 . ˆ X 'ε = i =1 ⋮ n ⋮ n 0 1 ∑x ε ˆ n i =1 in i (5) because we assume we know that E ( X ' ε ) = 0 . Substituting into (4) and (5), and multiplying by n yields ( y − Xβˆ ) ' i = 0 ˆ ˆ X ' ( y − Xβ ) = X ' y − X ' Xβ = 0
n Solving the second yields ˆ β = ( X ' X) X ' y Substituting into the first yields (so long as there is an intercept term)
−1 55 ( y − X ( X ' X) −1 −1 X ' y ' in = In − X ( X ' X ) X ' y ' in = 0 ) ( ) To see this last point, try premultiplying both sides of the equation X ' I − X ( X ' X ) −1 X ' y ' i = X × 0 n n X ' y − X ' X ( X ' X ) −1 X ' y ' i n = 0 [X ' y − X ' y] 'in = 0 Assumptions: Some of the sample moments are identical to the population moments (no assumption about specific distribution), and some other reasonable assumptions (like MLE) Properties: 1. Consistent 2. Not necessarily asymptotically efficient (may not converge to unbiased estimator). 3. Asymptotically normal. Generalized MOM estimates have become quite popular because of the few assumptions necessary, and the simplicity of estimation. ( ) End of Material For First Midterm
In General
While we have seen that these estimators are the same in the case of a specific linear model, this will not be the case generally. I have tried to list the important properties of each estimator. Deciding on which to use should be a product of the application. What properties are most desirable considering:
1. The problem to be addressed
2. The size of sample available
3. The mathematical form of the theory.

For everything we do in this class we will be dealing with linear estimates, making the estimators the same. Notice that I have not given formulas for variance of the MLE and MOM estimators. In general these are complicated (although your computer software will find them for you). In many cases we will only know the asymptotic variance.

There are several measures of fit that may come in handy:
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} \hat{\varepsilon}_i^2} = \frac{SSR}{SSR + SSE}, \quad \text{where } \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$
This value ranges between 0 and 1. The higher $R^2$, the closer the predicted values to the true values. We sometimes think of this as the percent of explained variance. Some things to remember:
1. We can inflate $R^2$ by inserting new variables into our model even if they don't have anything to do with our dependent variable. For example, if we had $n$ independent observations, and we had $k = n$ independent variables, then $R^2 = 1$ (no matter what variables or model we are looking at). To see this, note that when we minimize our squared error, we are simply solving a system of $n$ equations in $n$ unknowns. Hence we can find a solution such that the estimated error in each observation is 0. This will mean that
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (0)^2} = 1$$
In order to overcome this problem (in a really imperfect way) some use
$$\bar{R}^2 = 1 + (R^2 - 1) \frac{n-1}{n-k},$$
which discounts the use of more variables.

These measures are highly imperfect, but may be useful. They are particularly useful when first learning to interpret regression results.

Notes on Specification
There are several specification tricks that may be very useful. I will list some here:

Dummy Variables
We may want to include variables like sex, race, religion or other politically inflammatory controls. The problem with these variables is that they take discrete values. We will often use

$$x_{ik} = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male.} \end{cases}$$

If we used this in a regression, then we would interpret $\beta_k$ as the effect of being female on the dependent variable. We can also use this for more than binary categories. For example, in the behavior of Jewish consumers, one may wish to differentiate between the categories of Reformed, Orthodox, and Conservative. In this case we would create

$$x_{i1} = \begin{cases} 1 & \text{if Reformed} \\ 0 & \text{if not Reformed,} \end{cases} \quad x_{i2} = \begin{cases} 1 & \text{if Orthodox} \\ 0 & \text{if not Orthodox,} \end{cases} \quad x_{i3} = \begin{cases} 1 & \text{if Conservative} \\ 0 & \text{if not Conservative.} \end{cases}$$

When using dummy variables, we must be careful not to fall into the dummy variable trap. Suppose we wished to use the three variables above to explain the consumption of pork:

$$y_i = \beta_0 x_{i0} + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i,$$

where $x_{i0} = 1$, so that the first parameter is a constant term. If every person in our sample falls into one of the three categories, then $x_{i1} + x_{i2} + x_{i3} = 1 = x_{i0}$. But this means that the columns of our X matrix are not linearly independent, and, hence, $(X'X)^{-1}$ does not exist. For example, if

$$X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix},$$

then

$$X'X = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}.$$

To find the inverse, we must solve $X'XA = I$, or,

$$\begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

This gives us 9 equations with 9 unknowns. If we examine just the first column of A, we notice that

$$2a_{11} + a_{21} + a_{31} = 1, \qquad a_{11} + a_{21} = 0, \qquad a_{11} + a_{31} = 0.$$

This shows that there can be no solution, as $(a_{11} + a_{21}) + (a_{11} + a_{31}) = 2a_{11} + a_{21} + a_{31} = 0 \neq 1$.
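The same conclusion can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using the small two-observation X from the example; the variable names are illustrative only. It confirms that X'X has rank 2 rather than 3, and that removing one dummy column restores full column rank.

```python
import numpy as np

# Two-observation design from the example above:
# columns are the constant, dummy 1, dummy 2 (the dummies sum to the constant).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
XtX = X.T @ X

rank_full = np.linalg.matrix_rank(XtX)   # rank of the 3x3 matrix X'X
det_full = np.linalg.det(XtX)            # zero (up to rounding) when singular

# Removing one dummy column removes the exact linear dependence.
X_drop = X[:, :2]
rank_drop = np.linalg.matrix_rank(X_drop.T @ X_drop)

print(XtX)
print("rank of X'X with all dummies:", rank_full)
print("determinant of X'X:", det_full)
print("rank of the 2x2 X'X after removing one dummy:", rank_drop)
```

A rank below the number of columns is exactly the condition under which the inverse fails to exist.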
Thus we cannot include all categories and an intercept in the same regression. Sometimes we will wish to estimate without a constant term. This means that our category estimates are, in effect, an intercept term for each category. More often, we will drop one of the categories, making it the “base case.” For example, if we drop Conservative,

$$y_i = \beta_0 x_{i0} + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$

then $\beta_1, \beta_2$ are now the difference in pork consumption between Reformed and Conservative consumers, and the difference between Orthodox and Conservative consumers, respectively. Conservative is called the base case because everything is compared to the Conservative case.

We may also wish to use interaction variables. For example, maybe being male or female alters the effect of price on consumption of a product. We can find this effect by using

$$q_i = \beta_0 + \beta_1 p_i + \beta_2 x_i p_i,$$

where $x_i$ is a dummy variable representing gender. We can thus interpret the last coefficient as the effect of gender on the slope of the demand curve.

Some Nonlinear Relations which can be Estimated Linearly

Very often we are interested in nonlinear functional forms. Estimating nonlinear forms is sometimes quite easy using MLE or nonlinear least squares, etc. The same general principles apply when using nonlinear forms. However, we will usually need to use a climbing algorithm to find the estimates. Some nonlinear forms can be represented easily as a linear function of some transformation of the data. Sometimes we will make use of these when
1. we hypothesize that a relationship exists, but
2. we don’t know what relationship exists.
We can then subsequently test for correct specifications.

Powers:
Powers are often used when there are diminishing marginal returns to some variable. For example, suppose we thought that apple consumption, $y$, was increasing with income, $x_1$, but that this effect weakens as the individual gets richer. Then we could define a new variable,

$$x_2 = \begin{pmatrix} x_{11}^2 \\ x_{21}^2 \\ \vdots \\ x_{n1}^2 \end{pmatrix},$$

and we could then estimate $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$.
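As a rough illustration of this construction, here is a minimal sketch assuming NumPy and entirely made-up data. The data are generated from a known quadratic with no error term, so the least-squares coefficients can be compared with the values used to build y.

```python
import numpy as np

# Made-up income values (x1) and consumption (y) built from a known quadratic,
# so the recovered coefficients can be checked by eye.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 1.0 + 2.0 * x1 - 0.1 * x1**2

x2 = x1**2                                        # the new variable: squared income
X = np.column_stack([np.ones_like(x1), x1, x2])  # constant, x1, x2

# Least-squares estimates of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With real data the fit would not be exact, but the mechanics are the same: the squared term is just another column of X.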
Our hypothesis is equivalent to $H_0: \beta_1 > 0, \beta_2 < 0$. We will talk about how to test a joint hypothesis later. In fact, if we had infinite data, we could create a form that was a linear function of all powers of each variable. So long as the true relationship met the requirements of Taylor’s theorem (dealing with continuity), and the error structure was well behaved, we could then find the true relationship. We will never have infinite data, but we may be able to use higher order terms to approximate nonlinear relationships.

Log
Log relationships are used in a bunch of ways. Very often demand equations will take a form

$$\ln q = \beta_0 + \beta_1 \ln p_1 + \beta_2 \ln p_2 + \cdots$$

for a few important reasons. This is called a log-log form. There are some very common demand theories that result in functional forms of this nature. Another reason is that, using this form:

$$\frac{1}{q}\,dq = \frac{\beta_1}{p_1}\,dp_1,$$

or, rearranging,

$$\frac{dq}{dp_1}\frac{p_1}{q} = \varepsilon_{qp_1} = \beta_1.$$

In other words, by estimating the coefficients, we are estimating price (or income) elasticities. Another common form is the log-linear form. In this form we transform only the dependent variable

$$\ln y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots$$

This is often done to take care of problems with the error terms. We have made many specific assumptions about the error terms. After estimation, if our estimated errors do not fit our assumptions, our results are suspect. We will talk about formally testing these error terms later. An easy way to check your assumptions is to plot the errors on a graph and see if they look like independently identically distributed normal random draws. Sometimes transforming the left hand side variable will draw the error terms closer to the necessary assumptions. There are many other similar transformations (look up Box-Cox sometime if you run into problems like this).

Percentages
Sometimes our dependent variables are constructed such that the errors cannot be normally distributed. For example, suppose our dependent variable is: a percentage, a binary variable (like heads or tails), categorical data, or count data (like number of calls in an hour). In each of these cases, the variable cannot take on certain values. For example, if we observe a voting percentage equal to 0%, there is no way our error term can be positive. If it were, the true value would have to be a negative percentage (how do fewer than zero vote?). Note: if no value is observed on one of these thresholds, it may be appropriate to use OLS. The proper way to deal with these problems is to use a pdf, or estimation procedure, that truly represents the process. Before we have those tools, or if our data is insufficient to use those tools, it may be appropriate to use a transformation of the data. For example, if $y_i$ is the percent voting for proposition 8 (expressed as a fraction between 0 and 1), then

$$\ln\left(\frac{y_i}{1 - y_i}\right)$$

is increasing in $y_i$, and ranges from negative infinity to infinity. It is okay to be creative with econometrics, so long as you can justify your creativity, and can substantiate your claims.

Inference and Decisions

Estimation is pointless unless we can address the questions that led us to estimate in the first place. Sometimes these will be simple tests of a single parameter. We have already addressed how to construct such a test. Other times, it may involve many parameters, or even comparing an entire model to another model. The tests that should be reported are those that either answer the research question, or those that show the plausibility of your assumptions about the data.

Hypothesis Testing

Often we will want to test hypotheses involving many of the parameter estimates at once. For example, $H_0: \beta_1 > 0, \beta_2 < 0$. There are three main tests to use when testing joint hypotheses.
If our test is linear (and for this class it better be), then we will want to represent the hypotheses using some matrix R and some vector r. For my examples, I will illustrate two tailed tests (or tests of equality). The same principles apply as before to one tailed tests, and your software will calculate them for you. So, suppose we had estimated

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

and we wished to test the hypothesis $H_0: \beta_1 = 1, \beta_2 = \beta_0$ against $H_1: \beta_1 \neq 1$ or $\beta_2 \neq \beta_0$. We need to find a matrix representation. We can rewrite our hypothesis

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & -1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad R\beta = r.$$

Note that R and r have the same number of rows as there are hypotheses (or restrictions), and R has the same number of columns as there are parameters. Having this matrix in hand, we can proceed with constructing our test. (Generalizing these tests to nonlinear restrictions is simple, but discussion is eliminated to save time. If you need to run nonlinear tests, either consult the book or ask for help).

Wald Test

This test statistic takes the following form

$$\lambda_W = \frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}{\hat\sigma^2} \sim \chi^2(J)$$

where J, the degrees of freedom, is equal to the number of restrictions (or number of rows of R). We will reject this test for high values of $\lambda_W$. For $\alpha = .05$, with 2 degrees of freedom, we would reject for $\lambda_W > 5.99$ (see the table in the book). Can you see where the Chi-square distribution comes from (refer to previous sections)? In smaller samples, it is more accurate to use an F statistic of the following form

$$\frac{\lambda_W}{J} \sim F(J, n - k).$$

Again we reject for large values.

Lagrange Multiplier Test

This test is a little more involved than the Wald. Note that we could actually estimate the model using the restrictions implied by the hypothesis. For example, we could estimate

$$y_i = x_{i1} + \beta_2 (1 + x_{i2}),$$

which is equivalent to assuming $H_0$ above. This restriction will result in a different estimate of $\sigma^2$, which we will call $\hat\sigma_R^2$. This test statistic takes the following form

$$\lambda_{LM} = \frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}{\hat\sigma_R^2} \sim \chi^2(J).$$

Again we reject for large values. An F statistic can be formed (by dividing by J as above) for small samples.

Likelihood Ratio Test

When we use a computer to estimate, we will often be told the value of the log likelihood function evaluated at our estimates. We will call this value $l(\hat\beta)$. We will call the equivalent, imposing the restrictions in R, $l(\hat\beta_R)$. Then the LR test statistic can be written

$$\lambda_{LR} = 2\left(l(\hat\beta) - l(\hat\beta_R)\right) \sim \chi^2(J).$$

It is a little harder to see how this distribution is found the way we have written this. Suffice it to say this statistic is also a function of the sum of squared error terms (try taking the log of the normal pdf and summing over observations).

Specification Tests

It will often be necessary to test one model against another. There are several methods for doing so. It is not appropriate to use a statistical test as the sole reason for choosing one model over another. Why? The researcher will always have more information than the test statistic (e.g. nonsample information about which model is more useful, more reasonable etc.) For example, it is unreasonable to throw out variables simply because they are insignificant if our theory suggests we should include the variables. This may only signal that we have insufficient data. We may get the chance to talk about how to use nonsample information later.

$R^2$: If
1. You are comparing models with the same dependent variables (so variance has the same scale)
2. The same number of independent variables.
3. Both have a constant term
Then: It is reasonable to compare the $R^2$ when selecting which model is a better fit. In cases where the number of variables differ we may use $\bar R^2$. Neither are great measures of fit. But, they are easy to use.

Nested Tests: Often we will be able to test models against each other by nesting one model within another model. For example, suppose we wished to test which of the following models was a better fit

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
$$y = \gamma_0 + \gamma_1 x_1 + \cdots + \gamma_s x_s$$

where $s < k$. In this case our test can be constructed using the Wald, LM, LR, or F tests. If they differ by only one coefficient we could use a t-test.

AIC: The Akaike Information Criterion allows us to compare completely different models with the same dependent variable. It is based on a Bayesian loss function (we may talk about this later) displaying a tradeoff between fit and number of independent variables. For any model the AIC is:

$$AIC = \ln\left(\frac{\hat\varepsilon'\hat\varepsilon}{n}\right) + \frac{2k}{n}.$$

We would like to find the model with the LOWEST AIC. AIC measures loss, hence, we minimize it.

Box-Cox: We may wish to test the linear model against the log linear model. The two models we wish to compare are

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
$$\ln y = \gamma_0 + \gamma_1 x_1 + \cdots + \gamma_k x_k$$

Let $\hat\varepsilon$ be the fitted error for the first model and $\hat\varepsilon_l$ be the fitted error for the log linear form. Finally, let

$$\tilde y = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln y_i\right) = (y_1 y_2 \cdots y_n)^{1/n}.$$

This is just the geometric mean of y. Then

$$\lambda = \frac{n}{2}\,\frac{\hat\varepsilon'\hat\varepsilon / \tilde y^2}{\hat\varepsilon_l'\hat\varepsilon_l} \sim \chi^2(1).$$

This is a (strange) two tailed test. The null hypothesis of this test is that the models are equivalent. If the test statistic is lower than the left (lower) critical value, then the linear model is preferred. If the test statistic is larger than the right (higher) critical value, then the log linear form is preferred. If the test fails to reject, it is inconclusive. In this case you may decide based on which is more convenient, or correct by some other criteria.

Other Tests: There are other tests in the book. Feel free to use any test you feel you understand how to use. Make sure the test is appropriate for the question, and that you know how to interpret the results. If it is a test we have not used in class, it would be good to describe the test in a paragraph or two. Let me know you have used it correctly and understand what that means.

Tests of the Underlying Assumptions

After we have estimated using OLS (or any other estimator) it is necessary to test the plausibility of our underlying assumptions. In the case of OLS, we have assumed:
1. Homoskedastic error term
2. Independent error terms
3. If X is random, then X is uncorrelated with the error term
4. X is linearly independent, allowing us to take the inverse (If this doesn’t hold, then X is multicollinear).
5. The error term is normally distributed.

We will discuss these and how to test for them in the following sections. Note that all of these tests use a null hypothesis of the standard assumption. This means we will likely reject the standard assumptions only if the truth contradicts the standard assumptions, and our test is powerful (read: we have a lot of data). If we fail to reject, we may still have problems. We may need to adjust critical values for our sample size. In many cases it may be helpful to simply plot the error term and inspect the graph for problems that tests cannot see. In later sections we will discuss how to overcome these problems.

Heteroskedasticity

After estimation, it is a good idea to inspect error terms for obvious heteroskedasticity. Generally, we look to see if the variance of the error term is related to one (or a few) of the independent variables, or time if you are using a time series. Below find just such a graph. If the plots fan out or in with one of the independent variables, then it is likely that our error terms are heteroskedastic.
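A numerical stand-in for this visual inspection can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (not from the course), NumPy is assumed, and the error standard deviation is deliberately made proportional to x so that the residuals fan out. Comparing the residual variance for small and large values of x is a crude substitute for looking at the plot.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption for illustration): the error standard
# deviation grows in proportion to x, so the residuals should fan out.
n = 200
x = np.linspace(1.0, 100.0, n)
y = 5.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

# Ordinary least squares fit and residuals
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Compare residual spread for small x and large x (x is already sorted).
var_low = resid[: n // 2].var()
var_high = resid[n // 2 :].var()
print("residual variance, low x: ", var_low)
print("residual variance, high x:", var_high)
print("ratio high/low:", var_high / var_low)
```

A ratio far from 1 is the numerical counterpart of residuals fanning out (or in) on the graph.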
[Figure: two scatter plots of Estimated Error against An Independent Variable. First panel: "Homoskedastic Error Term". Second panel: "Heteroskedastic Error Term".]

There are many tests of heteroskedasticity, and the most useful test will depend upon which type of heteroskedasticity you expect to find in your data. I will describe two useful tests here:

Breusch-Pagan test (sounds like we’re testing for pagans): Suppose that the variance of our error term is a function of some of the independent variables. For example,

$$\sigma_i^2 = \alpha_0 + \sum_{j=1}^{k} \alpha_j x_{ij}.$$

We wish to test

$$H_0: \alpha_j = 0 \text{ for all } j > 0 \text{ (homoskedasticity)}, \qquad H_1: \alpha_j \neq 0 \text{ for some } j > 0.$$

In order to test this, we will run a new regression using the estimated error terms (squared) from our original regression. We will estimate

$$\hat\varepsilon_i^2 = \alpha_0 + \sum_{j=1}^{k} \alpha_j x_{ij}.$$

For lack of better notation, call the fitted value of the squared error term from this regression $\hat{\hat\varepsilon}_i^2$, and write $\overline{\hat\varepsilon^2} = \frac{1}{n}\sum_{i=1}^{n} \hat\varepsilon_i^2$ for the average squared error. The Breusch-Pagan statistic is

$$BP = \frac{\sum_{i=1}^{n} \left(\hat{\hat\varepsilon}_i^2 - \overline{\hat\varepsilon^2}\right)^2}{2\left(\frac{1}{n}\sum_{i=1}^{n} \hat\varepsilon_i^2\right)^2} \sim \chi^2(k - 1).$$

This is approximately $nR^2$ from the test regression. This test assumes that error terms are distributed normally. If there is evidence that they are not approximately normal, then this test is invalid.

Goldfeld-Quandt test (sounds like we’re testing for quandts): This test is a little more straightforward. Suppose we believe one group of the data has a larger error variance than another (maybe prewar vs. post war, or based on some other observed variable). Then, we simply divide the data into the two groups (if time is a factor, we may even leave some years out), use estimated errors to estimate variance for each group, and test the null hypothesis that both have the same variance. So suppose we believe all observations $i > I$ have a different variance than $i \le I$. Then we would use

$$\hat\sigma_1^2 = \frac{1}{I}\sum_{i=1}^{I} \hat\varepsilon_i^2, \qquad \hat\sigma_2^2 = \frac{1}{n - I - 1}\sum_{i=I+1}^{n} \hat\varepsilon_i^2.$$

We test the hypothesis

$$H_0: \sigma_1^2 = \sigma_2^2, \qquad H_1: \sigma_1^2 \neq \sigma_2^2.$$

Our test statistic is

$$GQ = \frac{\hat\sigma_1^2}{\hat\sigma_2^2} \sim F(I - k,\; n - I - 1 - k).$$

We reject our null if the test statistic falls in the critical region (in either tail).

If we find that our error terms are heteroskedastic, our estimates of $\beta$ will still be unbiased, if all other assumptions are correct. However, our estimates of the variance of $\hat\beta$ will be biased, and, hence, our tests will be invalid if we do not correct our estimates. We will talk about how to do this later.

Autocorrelation

When dealing with time series data, it is often the case that our error terms will not be independent, or $COV(\varepsilon_t, \varepsilon_{t+1}) \neq 0$. Below is a plot of what autocorrelated error terms might look like. In this graph the error terms sit very close to the last error term. This graph is somewhat exaggerated, and, in fact, it may be very hard to spot autocorrelation. There are several reasons for autocorrelation to appear:
1. Omission of a relevant variable.
2. Incorrect functional form.
3. Inherently correlated error terms.

[Figure: "Autocorrelated Error Term". Estimated Error plotted against Time.]

In the case of autocorrelation, the variance-covariance matrix is no longer a diagonal matrix. Autocorrelation can take many forms. One of the simpler forms is called an autoregressive (AR) process. The error term follows an AR(1) (first order autoregressive)
process if it can be written

$$\varepsilon_t = \rho\varepsilon_{t-1} + \mu_t$$

where $\mu_t$ is independently normal with mean 0. If $\rho$ is positive, shocks persist to a certain extent (measured by $\rho$). If $\rho$ is negative, shocks will alternate directions on average (up, down, up, down etc.). An AR(1) process is said to be stationary if $|\rho| < 1$. If $\rho$ is equal to 1, we have a random walk. If $|\rho|$ is greater than 1, we have a nonstationary process. It is really difficult to do much with a nonstationary process in the way of econometrics. A nonstationary process implies that the variance and mean of $\varepsilon_t$ change over time. If we have autocorrelated errors, we hope the process is stationary so that $COV(\varepsilon_t, \varepsilon_{t-1})$ is not dependent on $t$. Showing the properties of a stationary process:

$$E(\varepsilon_t) = \rho E(\varepsilon_{t-1}) + E(\mu_t) = \rho E(\varepsilon_{t-1}).$$

Hence, if $E(\varepsilon_t) = E(\varepsilon_{t-1})$, then $E(\varepsilon_t) = E(\varepsilon_{t-1}) = 0$. And,

$$VAR(\varepsilon_t) = \rho^2 VAR(\varepsilon_{t-1}) + VAR(\mu_t) + 2\rho\,COV(\varepsilon_{t-1}, \mu_t) = \rho^2 VAR(\varepsilon_{t-1}) + \sigma_\mu^2.$$

If $VAR(\varepsilon_t) = VAR(\varepsilon_{t-1})$, then

$$VAR(\varepsilon_t) = \frac{\sigma_\mu^2}{1 - \rho^2},$$

which can only be positive (and thus valid) if $|\rho| < 1$. Lastly,

$$COR(\varepsilon_t, \varepsilon_{t-1}) = \frac{COV(\varepsilon_t, \varepsilon_{t-1})}{\sigma_\varepsilon \sigma_\varepsilon} = \frac{E\left[(\varepsilon_t - E(\varepsilon_t))(\varepsilon_{t-1} - E(\varepsilon_{t-1}))\right]}{\sigma_\varepsilon^2} = \frac{E(\varepsilon_t \varepsilon_{t-1})}{\sigma_\varepsilon^2} = \frac{E\left((\rho\varepsilon_{t-1} + \mu_t)\varepsilon_{t-1}\right)}{\sigma_\varepsilon^2} = \frac{\rho\sigma_\varepsilon^2}{\sigma_\varepsilon^2} = \rho.$$

That should have been predictable.

Testing for Autocorrelation: The most common test for autocorrelation is the Durbin-Watson test: After estimating with OLS we test the null hypothesis that $\rho = 0$, by forming the statistic

$$d = \frac{\sum_{t=2}^{T} (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^{T} \hat\varepsilon_t^2} \approx 2(1 - \rho).$$

The test statistic will be near 2 if no autocorrelation exists. The actual distribution of d is dependent upon the data matrix. However, if our software will not find the results of the
test, we can employ the bounds listed in the table in the back of our textbook. We reject the null hypothesis of no autocorrelation if $d < d_l$ or $d > 4 - d_l$. We fail to reject the null hypothesis if $d_u < d < 4 - d_u$. If our test statistic falls into neither of these categories, our test is inconclusive. If we find autocorrelation then:
1. $\hat\beta$ is still unbiased.
2. $\hat\beta$ is no longer efficient.
3. OLS standard errors are biased, and hence all tests run with OLS estimates are invalid.

Multicollinearity

If two (or some linear combination of) independent variables are highly correlated, then we face a problem of multicollinearity. Exact multicollinearity exists if

$$\sum_{j=1}^{k} x_{ij}\alpha_j = 0$$

for all i, and for some set of constants $\alpha$ (not all zero). This means that the columns of X are not linearly independent. The dummy-variable trap is one example of exact multicollinearity. If we face exact multicollinearity, then $(X'X)^{-1}$ does not exist, and OLS estimates cannot be obtained. This rarely happens in practice, unless we have unintentionally created exact multicollinearity. We will face similar problems if there is nearly exact multicollinearity. This will happen when independent variables are highly correlated. The effect can be:
1. OLS cannot determine the individual effects of correlated variables with any degree of exactness (the effects of correlated variables may be mixed up).
2. Affected coefficients are likely to be large, and have large standard errors. T-tests may tell us that variables are insignificant, but F-tests will suggest they are significant.
3. Estimators are very sensitive to the deletion of a few observations.
4. Estimates vary widely (may even change sign) when other variables are dropped.

How to detect a problem:
1. Check the correlation between variables. If two independent variables have a correlation coefficient greater than 0.8, then there may be problems.
2. Regress one independent variable on other independent variables. If the SSE is low, then there may be a problem.

Normality

Finally, if the residuals are not approximately normal, then our tests are not valid. There are many tests of normality. Most look for some sort of asymmetry. The Jarque-Bera Test is one of the more popular. It is based on the third and fourth central sample moments. The third and fourth (standardized) central moments of a normal variable are

$$\frac{E\left[(x - E(x))^3\right]}{\sigma^3} = \text{Skewness} = 0, \qquad \frac{E\left[(x - E(x))^4\right]}{\sigma^4} = \text{Kurtosis} = 3.$$

The sample equivalents are

$$S = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\hat\varepsilon_i - \frac{1}{n}\sum_{j=1}^{n}\hat\varepsilon_j\right)^3}{\hat\sigma^3}, \qquad K = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\hat\varepsilon_i - \frac{1}{n}\sum_{j=1}^{n}\hat\varepsilon_j\right)^4}{\hat\sigma^4}.$$

(If you use an intercept the average error is 0). The Jarque-Bera test tests the null hypothesis that $S = 0, K = 3$. The test statistic is

$$JB = n\left[\frac{S^2}{6} + \frac{(K - 3)^2}{24}\right] \sim \chi^2(2).$$

We reject normality if JB is large.
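The Jarque-Bera calculation is short enough to sketch directly. The function below is a minimal illustration, assuming NumPy and the usual form $JB = n[S^2/6 + (K-3)^2/24]$; the function name and the tiny residual vector are made up for the example, and $\hat\sigma$ is taken (as an assumption) to be the square root of the second central sample moment with a 1/n divisor.

```python
import numpy as np

def jarque_bera(resid):
    """Return (S, K, JB) for a 1-D array of residuals.

    Assumption: sigma-hat is the square root of the (1/n) second central
    sample moment; other texts may divide by n - k instead.
    """
    e = np.asarray(resid, dtype=float)
    n = e.size
    d = e - e.mean()                 # deviations from the average residual
    sigma = np.sqrt(np.mean(d**2))
    S = np.mean(d**3) / sigma**3     # sample skewness
    K = np.mean(d**4) / sigma**4     # sample kurtosis
    JB = n * (S**2 / 6.0 + (K - 3.0)**2 / 24.0)
    return S, K, JB

# Tiny made-up residual vector, symmetric about zero
S, K, JB = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(S, K, JB)
```

The resulting JB value would then be compared with a $\chi^2(2)$ critical value; with only five observations the asymptotic distribution is of course a poor guide.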
Forecasting

Applied econometricians will want to ask questions that may be more relevant than “which model is the best fit?” In fact, the purpose behind economic modeling is to reliably predict behavior. We may want to know how prices will react if government programs are altered, or how profits will react to a new marketing plan. Does econometrics allow us to do this? In some situations we can make reasonable predictions. Factors in the value of econometric forecasting:
1. Inclusion of all relevant variables.
2. Correct functional form and assumptions.
3. Forecasting within sample: If your forecasts include values for independent variables that are very different from those used in estimation, forecasts will be inaccurate.
4. Short range predictions: The further into the future you try to predict, the less accurate the forecasts. Some studies show economists' predictions become biased if they predict more than three months ahead. This will depend upon the questions you ask the data.

Forecasting using time series data is a little different (particularly if there is autocorrelation). Using the standard linear model, suppose we estimated

$$\hat y = X\hat\beta.$$

Thus, we could predict the value of y given some hypothetical values of X (denote the hypothetical value by the $1 \times k$ row vector $x_0$). However, this would only provide a point estimate, without providing any notion of confidence. For example, previously in these notes we estimated

$$\hat y = 13.0833 + 4.25x.$$

We can then predict that if $x_0 = 6$, then $\hat y = 38.5833$ (note that $x = 6$ is well within the range of data used for estimation, $x = 3, 5, 7$). We can make a more informative prediction by forming a confidence interval. Note that both estimates for $\beta$ are distributed normally. This means that $\hat y$, because it is the sum of normal random variables, is distributed normal. In order to form a standard normal statistic, we will need to find the variance of $\hat y - y$. This is given by

$$\begin{aligned}
VAR(\hat y - y) &= E\left[(\hat y - y)^2\right] - \left(E[\hat y - y]\right)^2 \\
&= E\left[(x_0\hat\beta - x_0\beta - \varepsilon)(x_0\hat\beta - x_0\beta - \varepsilon)'\right] - \left(E\left[x_0\hat\beta - x_0\beta - \varepsilon\right]\right)^2 \\
&= E\left[(x_0\hat\beta - x_0\beta)(x_0\hat\beta - x_0\beta)'\right] + E\left[(x_0\hat\beta - x_0\beta)(-\varepsilon)\right] + E\left[(-\varepsilon)(x_0\hat\beta - x_0\beta)'\right] + E(\varepsilon^2) - 0 \\
&= x_0 E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] x_0' + \sigma^2 \\
&= x_0 VAR(\hat\beta)\, x_0' + \sigma^2.
\end{aligned}$$

From the previous example, we found

$$\hat\beta \sim N\left(\beta,\; \sigma^2 \begin{pmatrix} 3.4583 & -0.6250 \\ -0.6250 & 0.1250 \end{pmatrix}\right).$$

Thus

$$VAR(\hat y - y) = \begin{pmatrix} 1 & 6 \end{pmatrix} \sigma^2 \begin{pmatrix} 3.4583 & -0.6250 \\ -0.6250 & 0.1250 \end{pmatrix} \begin{pmatrix} 1 \\ 6 \end{pmatrix} + \sigma^2 = 1.4583\sigma^2.$$

Thus,

$$\frac{\hat y - y}{\sqrt{1.4583\sigma^2}} \sim N(0, 1),$$

or, using $\hat y = 38.5833$, $\hat\sigma^2 = 88.1667$, we know that

$$\frac{38.5833 - y}{11.339} \sim t(n - k = 1).$$

From this, using the t(1) critical value of 12.706, we find a 95% confidence interval of approximately $[-105.49, 182.66]$. This is a very inaccurate prediction. This is mainly because we don’t have enough observations to narrow our prediction. Typically we would at least like to predict the sign of the value (or the sign of the change in value).
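The arithmetic behind this interval can be reproduced in a few lines. This is a minimal sketch assuming NumPy; the estimates, covariance matrix and error variance are the numbers quoted above, while the critical value 12.706 (the 97.5% quantile of a t distribution with 1 degree of freedom) is hardcoded from a standard t table as an assumption rather than computed.

```python
import numpy as np

# Numbers quoted in the notes above
beta_hat = np.array([13.0833, 4.25])          # intercept and slope estimates
C = np.array([[3.4583, -0.6250],
              [-0.6250, 0.1250]])             # VAR(beta_hat) / sigma^2
sigma2_hat = 88.1667                          # estimated error variance
x0 = np.array([1.0, 6.0])                     # constant and x = 6

# Assumption: 97.5% quantile of t with 1 degree of freedom, from a t table.
T_CRIT = 12.706

y_hat = x0 @ beta_hat                         # point forecast
var_factor = x0 @ C @ x0 + 1.0                # x0 C x0' + 1
se = np.sqrt(var_factor * sigma2_hat)         # forecast standard error

lower = y_hat - T_CRIT * se
upper = y_hat + T_CRIT * se
print(y_hat, var_factor, se)
print(lower, upper)
```

With only one degree of freedom the critical value is very large, which is the main reason the interval is so wide.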
Research Process

Since this is an applied course, I feel obligated to say something about the process of conducting econometric research. I think any student of econometrics should have (1) A healthy understanding of the reasons and rigor of standard econometric theory and techniques, and (2) A healthy disrespect for econometric practice.

General Research Steps
1. Identify an economic question of interest.
2. Identify (or develop) economic theories addressing the question.
3. Identify data that is available to inform the theory.
4. Become “comfortable” with the chosen data.
5. Use theory and availability of data to derive a functional form.
6. Estimate given standard assumptions.
7. Test the standard assumptions and gauge fit.
8. Revise estimates based on violations of assumptions.
9. Conduct tests or predictions to address questions.

Here are some traps some researchers fall into:
a. Relying on the data for a theory, then testing with the same data. If you use significance tests to determine which variables should be included in a regression, peculiar (or unlikely) samples can lead you to incorrect conclusions. Any tests run after estimation will have severely deflated power, and standard errors will be incorrect.
b. Failing to connect the theory to your functional form. You should be able to show that the functional form you use is equivalent (in some mathematical way) to your theory. This may be difficult to do with the tools you have now (try using Taylor expansions).
c. Failing to identify endogeneity, or other obvious problems with the estimation technique. Make sure your statistical assumptions are not horrendously contradicted by the model.

Violations of Standard Assumptions

We have already talked about how to test for the important violations. We will now discuss what to do when you find violations of the standard assumptions. Most of these procedures only revise the standard error of the coefficient estimates.
Low significance may be a sign of violations. Very often, applied workers will first learn about how to deal with a certain type of problem by finding nonsensical parameter or standard error estimates and trying to find a solution. By getting comfortable with the data you are using, it may be easier to identify a problem or solution. These are skills that must be learned while conducting research.

1. Heteroskedasticity

Generalized Least Squares: If the error term is heteroskedastic, but not autocorrelated, then

$$VAR(\varepsilon) = \sigma^2 \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix} = \sigma^2 V$$

where $d_1, \ldots, d_n$ can be different from one another. If we are to use a Least Squares estimator, we would like to alter our model of the error term such that the Gauss-Markov theorem is applicable. This is done by defining a new data matrix

$$y^* = \Lambda y, \qquad X^* = \Lambda X, \qquad \varepsilon^* = \Lambda\varepsilon$$

(so that the standard linear equation is premultiplied by $\Lambda$) where $\Lambda$ is a matrix such that

$$\Lambda'\Lambda = V^{-1}, \qquad \Lambda V \Lambda' = I.$$

What does this do for us?
a. $E(\varepsilon^*) = E(\Lambda\varepsilon) = \Lambda E(\varepsilon) = 0$
b. $VAR(\varepsilon^*) = VAR(\Lambda\varepsilon) = \Lambda VAR(\varepsilon)\Lambda' = \sigma^2\Lambda V\Lambda' = \sigma^2 I$.

Pretty neat trick. Since this now satisfies Gauss-Markov, GLS is BLUE. Hence we estimate

$$y^* = X^*\beta + \varepsilon^*$$

using OLS, and call the estimate $\hat\beta_{GLS}$ for generalized least squares (GLS). In order to form test statistics, you will need to know

$$E(\hat\beta_{GLS}) = \beta \quad \text{(can you show this?)}$$

$$\begin{aligned}
VAR(\hat\beta_{GLS}) &= VAR\left((X^{*\prime}X^*)^{-1}X^{*\prime}y^*\right) = VAR\left((X^{*\prime}X^*)^{-1}X^{*\prime}(X^*\beta + \Lambda\varepsilon)\right) \\
&= VAR\left((X^{*\prime}X^*)^{-1}X^{*\prime}X^*\beta + (X^{*\prime}X^*)^{-1}X^{*\prime}\varepsilon^*\right) = VAR\left((X^{*\prime}X^*)^{-1}X^{*\prime}\varepsilon^*\right) \\
&= (X^{*\prime}X^*)^{-1}X^{*\prime}\,VAR(\varepsilon^*)\left[(X^{*\prime}X^*)^{-1}X^{*\prime}\right]' = (X^{*\prime}X^*)^{-1}X^{*\prime}(\sigma^2 I)X^*(X^{*\prime}X^*)^{-1} \\
&= \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'\Lambda'\Lambda X)^{-1} = \sigma^2(X'V^{-1}X)^{-1}.
\end{aligned}$$

We can estimate $\sigma^2$ using

$$\hat\sigma^2 = \frac{\hat\varepsilon^{*\prime}\hat\varepsilon^*}{n - k}.$$

Now all that is left is to determine what V and $\Lambda$ are. When using GLS we hypothesize that the variance of the error term is a (linear) function of the data. For our purposes, suppose that we think variance is a function of income. Then we would specify

$$V = \begin{pmatrix} a_1 & & 0 \\ & \ddots & \\ 0 & & a_n \end{pmatrix}$$

where $a_i$ is income for observation i. Since the variance is $\sigma^2 V$, any linear coefficient is swallowed up in $\sigma^2$ (in other words we can’t differentiate between the two). In this case,

$$\Lambda = \begin{pmatrix} \frac{1}{\sqrt{a_1}} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{\sqrt{a_n}} \end{pmatrix}.$$

Thus by transforming the data we simply divide all data (dependent and independent variables) by the square root of income.

Using this form makes a pretty strong assumption about the nature of the heteroskedasticity. We may wish to estimate the nature of the heteroskedasticity. In this case we would use a Feasible Generalized Least Squares Estimator (FGLS). This is called Estimated Generalized Least Squares (EGLS) in the book p. 498. If we suppose that we can group the data so that

$$VAR(\varepsilon_i) = \sigma_1^2 \text{ for } i \le n_1, \qquad VAR(\varepsilon_i) = \sigma_2^2 \text{ for } i > n_1,$$

where the first $n_1$ observations form the first group and the remaining $n_2$ observations form the second, then the FGLS estimator can be obtained for the variance covariance matrix. The variance covariance matrix must be

$$\begin{pmatrix}
\sigma_1^2 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & \sigma_1^2 & 0 & \cdots & 0 \\
0 & \cdots & 0 & \sigma_2^2 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & \sigma_2^2
\end{pmatrix}.$$

In order to obtain estimates of $\sigma_1^2, \sigma_2^2$ we will need to partition the data into its two groups. Then (1) estimate $y_1 = X_1\beta + \varepsilon$ for the group with variance $\sigma_1^2$ using OLS. Estimate $\hat\sigma_1^2 = \frac{\hat\varepsilon_1'\hat\varepsilon_1}{n_1 - k}$ using the residuals from this regression. Then, (2) estimate $y_2 = X_2\beta + \varepsilon$ for the group with variance $\sigma_2^2$ using OLS. Estimate $\hat\sigma_2^2 = \frac{\hat\varepsilon_2'\hat\varepsilon_2}{n_2 - k}$ using the residuals from this second regression. Hence, we find

$$\hat\Lambda = \begin{pmatrix}
\frac{1}{\hat\sigma_1} & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & \frac{1}{\hat\sigma_1} & 0 & \cdots & 0 \\
0 & \cdots & 0 & \frac{1}{\hat\sigma_2} & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & \frac{1}{\hat\sigma_2}
\end{pmatrix}$$

(note the square roots have been taken, and that each observation is divided by its group's estimated standard deviation so that $\hat\Lambda'\hat\Lambda$ estimates the inverse of the variance covariance matrix). Now, we transform the data using this matrix and find our estimates. FGLS is not BLUE. In fact it has some peculiar properties:
−1 ˆ E β FGLS = E ( X* ' X* ) X* ' y *
* * −1 * ( ) ) ( = E (( X ' X ) X ' ( X β + ε )) = β + E (( X ' X ) X 'ε ) = β
* * * * −1 * * This last step is hard to show (so trust me) because X* now depends upon the stochastic error term. 79 −1 ˆ VAR β FGLS = VAR ( X* ' X* ) X* ' ε ( ) ( ) which we have a hard time reducing, again because X* is now stochastic. FGLS is not BLUE. We cannot show that FGLS is minimum variance (we can’t easily express the variance). Further, FGLS is no longer a linear estimator of y! Hence, in small samples, all we know is that FGLS is unbiased. As the sample gets large, FGLS approaches GLS in distribution. Hence, in large samples, we can treat FGLS as if we were using GLS. It is possible to use more general forms of heteroskedasticity. This would require estimating using nonlinear least squares. If you need something like this for your research paper, see the TA, or me. It is reasonable to learn how to do this on your own. WhiteHeteroskedasticity Consistent Matrix (MOM estimator): GLS in not very reasonable in most situations, and FGLS is messy. The most popular way to deal with heteroskedasticity is to use the MOM estimator found by White. From before, we found the MOM estimator ˆ β = ( X ' X) X ' y , without assuming homoskedasticity. We need now to find an estimate of the variancecovariance matrix that is useful with heteroskedasticity. The variance of the estimator is
$$VAR(\hat\beta) = VAR\left[(X'X)^{-1}X'y\right] = (X'X)^{-1}X'\Omega X(X'X)^{-1},$$

where $\Omega$ is the general variance-covariance matrix. The principles of MOM allow us to find an estimator for this matrix by simply requiring the sample moments to be identical to the true moments. In other words, we replace each unknown variance $E(\varepsilon_i^2)$ on the diagonal of $\Omega$ with its sample counterpart $\hat\varepsilon_i^2$, so that $\hat\Omega = \mathrm{diag}(\hat\varepsilon_1^2, \ldots, \hat\varepsilon_n^2)$. (Only the diagonal is used: with OLS residuals $X'\hat\varepsilon = 0$, so the full outer product $\hat\varepsilon\hat\varepsilon'$ would give a matrix of zeros.)

Then our MOM estimator for the variance-covariance matrix of $\hat\beta$ is

$$\hat V = (X'X)^{-1}X'\hat\Omega X(X'X)^{-1}.$$

This is called the White heteroskedasticity-consistent matrix, containing the standard error estimates for $\hat\beta$. Because this is a MOM estimator (and it meets certain requirements) it is consistent. This is particularly strange given the fact that $\hat\Omega$ is not a consistent estimator of $\Omega$. Most statistical packages will have a canned command to use White standard errors in tests.

Of note: in this diagonal form the White matrix handles heteroskedasticity only; related robust estimators (for example, Newey-West) extend the idea to autocorrelated errors. The White matrix is consistent, but not efficient. This means that in small samples we may obtain better estimates by modeling out the heteroskedasticity and using FGLS or GLS.

2. Autocorrelation

Autocorrelation can be very difficult to deal with. For some simple forms it may be reasonable to use a robust matrix of the White type. I wouldn't advise this when using time series data where errors are likely serially correlated. Serial correlation is correlation across time (if this period depends very heavily on last period, etc.). If it is known that time is the main source of autocorrelation, it is best to model the autocorrelation. The tools for dealing with autocorrelation are largely the same as those for dealing with heteroskedasticity.

GLS: We can attempt to transform our model so that errors are uncorrelated. For example, if the error term follows an AR(1) process, then

$$\varepsilon_t = \rho\varepsilon_{t-1} + \mu_t.$$
Suppose we let

$$y_t^{*} = y_t - \rho y_{t-1}, \qquad x_t^{*} = x_t - \rho x_{t-1}, \qquad \varepsilon_t^{*} = \varepsilon_t - \rho\varepsilon_{t-1} = \mu_t.$$

Then, by simple subtraction, we have $y^{*} = X^{*}\beta + \varepsilon^{*}$, where we know $\varepsilon^{*}$ is not autocorrelated by definition of $\mu$. This transformation did not involve multiplication as with heteroskedasticity. Hence, standard errors reported by running OLS on the transformed regression are valid.

Problems with this transformation:
a. We don't know $\rho$.
b. We lose the first observation (because no $t = 0$ observation exists).
c. We must transform our estimates to interpret them (for the constant term, $(1-\rho)\hat\beta = \hat\beta_{GLS}$ in this case).

We can fix problem b by transforming only the first observation in the following way:

$$y_1^{*} = \sqrt{1-\rho^2}\,y_1, \qquad x_1^{*} = \sqrt{1-\rho^2}\,x_1, \qquad \varepsilon_1^{*} = \sqrt{1-\rho^2}\,\varepsilon_1.$$

Note

$$E(\varepsilon_1^{*}) = \sqrt{1-\rho^2}\,E(\varepsilon_1) = 0,$$

$$VAR(\varepsilon_1^{*}) = VAR\left(\sqrt{1-\rho^2}\,\varepsilon_1\right) = (1-\rho^2)\frac{\sigma_\mu^2}{1-\rho^2} = \sigma_\mu^2.$$

By modifying the transformation this way and subsequently performing OLS on this modified transformation, we find a BLUE estimator (we now satisfy Gauss-Markov).

FGLS: We can form this estimator in three steps:

First: estimate $y_t = x_t\beta + \varepsilon_t$ using OLS.

Second: estimate $\rho$ using the fitted residuals $\hat\varepsilon$ from the first step, applying OLS to $\hat\varepsilon_t = \rho\hat\varepsilon_{t-1} + \mu_t$.
Third: use $\hat\rho$ to transform the data as in GLS.

Problem: the standard errors are only correct if the sample is large.

3. Multicollinearity

For the most part, the only real way to overcome near-exact multicollinearity is to obtain more (or different) data. Another way to overcome the problem is to impose restrictions on the parameter estimates using nonsample information (i.e. things you know, but didn't learn from the sample you are using). This is particularly difficult at this stage, because the only restrictions you have the tools to impose are equality restrictions like those used in the F tests (see page 4378 for a description of the restricted least squares estimator). This is a cruddy way to incorporate nonsample information. Some great ways to use nonsample information have been outlined by Zellner, using Bayesian and Maximum Entropy techniques. These techniques can often overcome multicollinearity (or other numerical, rather than modeling, problems). These topics are generally ignored even in some graduate econometrics programs.

4. Nonnormality

If you find your estimated errors are significantly nonnormally distributed, you may be in trouble at this level (or at any level). The first thing to try is to transform your dependent variable (using log or other monotonic functions). You may also try transforming some of the independent variables, or including lagged variables if applicable. It may be possible to use ML estimates with a different distribution (if you are adventurous). Some of the more common nonnormal regressions are easy to use (usually for count data, truncated data, or discrete choices). However, it may be hard to justify more general distributions. The two main causes of nonnormal residuals are (1) not enough data and (2) an incorrect model. Some have begun using nonparametric techniques, maximum entropy methods, or bootstrapping to overcome these problems. If you really run into problems, you might see me about using bootstrapping techniques.
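To make the bootstrapping idea concrete, here is a minimal sketch in Python (NumPy only). It assumes a "pairs" bootstrap, in which whole rows of the data matrix are resampled with replacement, and it uses synthetic data with skewed errors. The function names `ols` and `bootstrap_se`, the data-generating values, and the choice of 2000 resamples are illustrative assumptions, not part of the course material.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def bootstrap_se(X, y, n_boot=2000, seed=0):
    """Pairs bootstrap: resample rows of (X, y) with replacement,
    re-estimate each time, and summarize the resulting estimates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # row indices, with replacement
        draws[b] = ols(X[idx], y[idx])
    se = draws.std(axis=0, ddof=1)              # bootstrap standard errors
    ci = np.percentile(draws, [2.5, 97.5], axis=0)  # percentile intervals
    return se, ci

# Synthetic example (assumed values): errors are skewed, not normal.
rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + (rng.exponential(1.0, n) - 1.0)

beta_hat = ols(X, y)
se, ci = bootstrap_se(X, y)
print("estimates:", beta_hat)
print("bootstrap standard errors:", se)
print("95% percentile intervals:\n", ci)
```

The percentile interval is only one of several ways to turn the resampled estimates into a confidence interval; it is used here because it requires no distributional assumption.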
Bootstrapping determines standard errors by randomly resampling from your data matrix many (possibly thousands of) times and then using the sample distribution of the resulting estimates to determine confidence intervals. There is some voodoo in this method, but it has gained wide acceptance.

Omission and Inclusion

If we know we have a linear model, often the question is which variables we should include and which we shouldn't. There is not always a clear-cut answer. However, we can use our theory to help us understand the consequences of the two possible errors. First, suppose the true model is

$$y = X\beta + Z\gamma + \varepsilon,$$

but instead we estimate

$$y = X\beta + \varepsilon.$$

In other words, we exclude variables we should include. Our estimator will be

$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\gamma = 0.$$

Certainly our estimate of $\gamma$ is biased, but what about the variables we have included?

$$E(\hat\beta) = E\left[(X'X)^{-1}X'y\right] = E\left[(X'X)^{-1}X'(X\beta + Z\gamma + \varepsilon)\right]$$

$$= E\left[(X'X)^{-1}X'X\beta\right] + E\left[(X'X)^{-1}X'Z\gamma\right] + E\left[(X'X)^{-1}X'\varepsilon\right] = \beta + (X'X)^{-1}X'Z\gamma.$$

Thus our estimate is biased unless $\gamma = 0$ or $X'Z = 0$. This last condition is called orthogonality, and is generally interpreted as being unrelated or uncorrelated. Vectors are orthogonal if they form a right angle. The variance of our estimate is

$$VAR(\hat\beta) = VAR\left(\hat\beta - E(\hat\beta)\right) = VAR\left[(X'X)^{-1}X'\varepsilon\right] = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$

If we had estimated with the omitted variables, we would have found (trust me)

$$VAR(\hat\beta) = \sigma^2\left(X'X - X'Z(Z'Z)^{-1}Z'X\right)^{-1}.$$

The variance of the estimates in $\hat\beta$ is unambiguously smaller (or at least no larger) when we omit variables.

Suppose instead that

$$y = X\beta + \varepsilon$$

is the correct model, but that we included more variables by estimating

$$y = X\beta + Z\gamma + \varepsilon.$$

These OLS estimates will still be unbiased,

$$E(\hat\beta) = \beta, \qquad E(\hat\gamma) = 0,$$
but, as we calculated before, the variance will be greater for having included more variables. Hence, when deciding what variables to include we need to weigh the bias against the variance of our estimates.
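The bias and variance trade-off derived above can be checked with a small simulation. The sketch below (Python, NumPy only) holds X and Z fixed, redraws the errors many times, and compares the short regression (Z omitted) with the long regression (Z included). All numerical values, and names such as `theory_bias`, are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000

# Fixed design: z is correlated with x, so omitting z should bias the slope.
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # included regressors
Z = z[:, None]                         # omitted regressor
XZ = np.column_stack([X, Z])

beta = np.array([1.0, 2.0])            # assumed true coefficients on X
gamma = np.array([3.0])                # assumed true coefficient on Z

# Bias predicted by the formula (X'X)^{-1} X'Z gamma
theory_bias = np.linalg.solve(X.T @ X, X.T @ Z @ gamma)

short = np.empty((reps, 2))            # estimates omitting Z
long_ = np.empty((reps, 3))            # estimates including Z
for r in range(reps):
    y = X @ beta + Z @ gamma + rng.normal(size=n)
    short[r] = np.linalg.solve(X.T @ X, X.T @ y)
    long_[r] = np.linalg.solve(XZ.T @ XZ, XZ.T @ y)

mc_bias = short.mean(axis=0) - beta
print("simulated bias (short):", mc_bias)
print("predicted bias (short):", theory_bias)
print("mean slope (long):     ", long_[:, 1].mean())
print("slope variance short vs long:", short[:, 1].var(), long_[:, 1].var())
```

In runs of this kind the short regression is biased by the predicted amount but has the smaller sampling variance, while the long regression is centered on the true slope with a larger variance.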
Various Tools

There are many models we may be interested in that do not meet the requirements of OLS, GLS or the other modifications given above. Some are very easy to learn on your own. Here I will cover those I believe are harder to understand and significantly useful. I will not cover these in any depth. I will basically tell you why you might need them, and give you a basic understanding of the technique. More advanced econometrics classes will delve deeper into the theory.

Instrumental Variables (See page 458 – 463)

Up until now we have always assumed that $X$ was uncorrelated with $\varepsilon$. However, suppose we are dealing with survey questions like "What percentage of clothing purchased do you return for a refund?" Asking individuals to recall such items may introduce error in the measurement of $X$. If there is measurement error, this error becomes part of our general error term, making $X$ correlated with the error term.

Consequences of correlation between error and independent variables:
a. OLS produces biased estimates.
b. OLS produces inconsistent estimates (they don't converge to the truth).

We can overcome this problem using a MOM estimator (they are consistent given some very general assumptions). Before, we placed the moment restrictions

$$\sum_{i=1}^{n}\hat\varepsilon_i = 0, \qquad \frac{1}{n}X'\hat\varepsilon = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Now, we know the second of these two restrictions should not hold (because this is the sample covariance between $X$ and the error). So we need to find another moment restriction to derive our estimator. Suppose we could find another set of variables, $X_{IV}$, such that $COV(X_{IV}, \varepsilon) = 0$, but that $COV(X, X_{IV})$ is very high (i.e. the variables are uncorrelated with the error, but highly correlated with the independent variables we are interested in). Suppose we found as many of these variables as there are independent variables. We could premultiply our normal regression equation,

$$X_{IV}'y = X_{IV}'X\beta + X_{IV}'\varepsilon.$$

Because $COV(X_{IV}, \varepsilon) = 0$, we require that

$$X_{IV}'\hat\varepsilon = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Thus,

$$X_{IV}'y = X_{IV}'X\hat\beta_{IV}.$$

If $X_{IV}'X$ is invertible (requiring both matrices to be the same size) then

$$\hat\beta_{IV} = (X_{IV}'X)^{-1}X_{IV}'y;$$

otherwise, we can use a more general form I won't go into here (the computer will calculate it for you if you need it). The variables in $X_{IV}$ are called instruments (I don't know why). Notice that OLS is just a special case of instrumental variables (where $COV(X, \varepsilon) = 0$ and $COV(X, X) = VAR(X)$). The MOM estimates of the standard errors are obtained from the sample moments:

$$\widehat{VAR}(\hat\beta_{IV}) = \hat\sigma^2(X_{IV}'X)^{-1}(X_{IV}'X_{IV})(X'X_{IV})^{-1}.$$

The greatest difficulty with using instrumental variables is deciding on instruments. There are more general formulas if we have a different number of instruments. The same principles apply. Generally, the more variables you can find that meet the requirements of an instrument the better. It is like incorporating more information into your estimation.

An example of a good instrument: I had a professor who wanted to estimate the monetary return to education. The problem is that individuals' decisions to attend a school may be correlated with their ability, which is also correlated with their pay. He found a small religious college in Idaho that determined the majority of admissions by lottery. This lottery assigned a number that was uncorrelated with ability, but correlated with the decision to attend. This was a convenient instrument. Instruments are not always easy to find. Many researchers spend their time trying to find clever instruments.

Two Stage Least Squares (see pp 581 – 636)

Of utmost importance in economics is estimating endogenous relationships. For example, we may wish to estimate a demand curve. However, it is only possible to observe price and quantity in equilibrium. Suppose we had the following system
$$Q^D = X\beta + \varepsilon, \qquad Q^S = Z\gamma + \varepsilon.$$

Then the choice of $Q$ and the independent variables for price are endogenously determined by the system. In other words, price determines $Q$, but $Q$ also determines price. This means that the error term will also be involved in determining price; hence, OLS will be inconsistent. However, in this case we have some natural instruments. Among our independent variables there are also exogenous variables (those determined completely outside of our system). For example, individual income will determine quantity demanded, but is unlikely to be affected by quantity demanded or supplied. There are also variables in the supply equation (like production environment, weather, etc.) that are exogenous. These variables will be correlated with the price (our endogenous variable) but not with the error. Hence, we can use the exogenous variables from both equations as instruments and estimate using the instrumental variables procedure. The resulting estimates are called Two Stage Least Squares estimates (because we can obtain the same estimates using a double application of least squares; see pp 614 – 615).

Seemingly Unrelated Regressions (SUR, see pages 549 – 554)

Suppose that we wished to estimate two equations representing individual responses. For example, we may want to estimate

$$y^1 = X\beta^1 + Z^1\gamma^1 + \varepsilon^1, \qquad y^2 = X\beta^2 + Z^2\gamma^2 + \varepsilon^2,$$

where the first equation may represent our consumption of economic information, and the second our consumption of political information. It seems reasonable that the error terms might be correlated across equations for the same individual (i.e. maybe some people just don't like reading, no matter what the explanatory variables). When parameters may all differ, the dependent variables differ, but the errors may be correlated, we have seemingly unrelated regression equations. While using OLS for each equation individually will produce unbiased results, we can reduce the standard errors by making use of data from the other equation. Hence, OLS is inefficient. The system can be reduced to a single equation with heteroskedastic and autocorrelated errors,

$$\begin{pmatrix} y^1 \\ y^2 \end{pmatrix} = \begin{pmatrix} X & 0 & Z^1 & 0 \\ 0 & X & 0 & Z^2 \end{pmatrix}\begin{pmatrix} \beta^1 \\ \beta^2 \\ \gamma^1 \\ \gamma^2 \end{pmatrix} + \begin{pmatrix} \varepsilon^1 \\ \varepsilon^2 \end{pmatrix},$$

or

$$y = D\beta + \varepsilon.$$

Now we could use GLS to obtain BLUE estimates. SUR makes a special assumption about the autocorrelation of the error terms. Namely, $COV(\varepsilon_i^1, \varepsilon_i^2) = \sigma_{12}$, and $COV(\varepsilon_i^1, \varepsilon_j^2) = 0$ if $i \ne j$. Thus, the variance-covariance matrix of the error vector can be represented as

$$\Omega = VAR\begin{pmatrix} \varepsilon^1 \\ \varepsilon^2 \end{pmatrix} = \begin{pmatrix} \sigma_1^2 I_n & \sigma_{12} I_n \\ \sigma_{12} I_n & \sigma_2^2 I_n \end{pmatrix}.$$

Using GLS, we obtain the estimator

$$\hat\beta = (D'\Omega^{-1}D)^{-1}D'\Omega^{-1}y.$$

The standard errors can be estimated as

$$VAR(\hat\beta) = (D'\Omega^{-1}D)^{-1}.$$

The problem with these estimates is that we will not in practice know $\sigma_1^2, \sigma_2^2, \sigma_{12}$. As with FGLS, we can estimate these parameters using

$$\hat\sigma_1^2 = \frac{1}{n}\hat\varepsilon_1'\hat\varepsilon_1, \qquad \hat\sigma_2^2 = \frac{1}{n}\hat\varepsilon_2'\hat\varepsilon_2, \qquad \hat\sigma_{12} = \frac{1}{n}\hat\varepsilon_1'\hat\varepsilon_2.$$

Although this is similar to FGLS, the estimator is unbiased, but not efficient.

Probit/Logit (See Chapter 23)

We may want to use a binary (or dummy) variable as a dependent variable. For example, we may want to explain why some individuals subscribe to the internet and others do not. Suppose we believe the decision can be summarized as

$$y_i = \begin{cases} 1 & \text{if } x\beta + \varepsilon \ge 0 \\ 0 & \text{if } x\beta + \varepsilon < 0. \end{cases}$$

In other words, there is some regression line representing utility (or profit) from using the internet. If this utility passes a certain threshold, the individual subscribes. Using the threshold of 0 is completely general if a constant term is included in the parameters (we would not be able to tell the difference between the constant term and the cutoff). There are several generalizations of this model, including multinomial choices and more general error structures. If the error term is standard normal, then
$$P(y_i = 1) = P(\varepsilon \ge -x\beta) = \int_{-x\beta}^{\infty}\phi(z)\,dz = 1 - \Phi(-x\beta),$$

$$P(y_i = 0) = P(\varepsilon < -x\beta) = \int_{-\infty}^{-x\beta}\phi(z)\,dz = \Phi(-x\beta),$$

where $\phi$ is the standard normal pdf, and $\Phi$ is the standard normal cdf. Here, we have assumed that the variance is 1. This, also, does not restrict generality. Assuming a different variance will only rescale our estimates of $\beta$ without changing their relative size. This gives us enough information to estimate what is called the probit model using an MLE. The likelihood function can be written as

$$L = \prod_{i=1}^{n}\Phi(-x_i\beta)^{1-y_i}\left[1 - \Phi(-x_i\beta)\right]^{y_i},$$

so the log likelihood function is

$$l = \sum_{i=1}^{n}\left\{(1-y_i)\ln\Phi(-x_i\beta) + y_i\ln\left[1 - \Phi(-x_i\beta)\right]\right\}.$$

We use computer programs to maximize this likelihood function with respect to our parameter estimates. An alternative to the probit model is the logit model. The only difference between the models is the distribution that we use to describe the error term. We replace the standard normal cdf, $\Phi$, with the logistic distribution function

$$F(z) = \frac{1}{1 + e^{-z}}.$$

Usually the results from using each process are similar.

Tobit

Economists often run into problems of censored dependent variables. A variable is censored if it is constrained in some way. For example, individual consumption of beans is constrained to be nonnegative. If we wish to estimate a bean consumption function,

$$y_i = \begin{cases} x_i\beta + \varepsilon_i & \text{if } x_i\beta + \varepsilon_i \ge 0 \\ 0 & \text{if } x_i\beta + \varepsilon_i < 0, \end{cases}$$

we run into problems with our standard assumptions. In particular, if we use OLS, our estimates will be biased. The first figure below shows how the OLS estimates using the entire data set are extremely biased. One solution might be to throw out the censored data. However, this also biases our estimates. The second figure shows how this may take place. This happens because we throw out only observations that have a smaller $y$. Thus we bias our intercept estimate upward, and our slope estimate downward. In fact these estimates are inconsistent. In this case we have created a truncated sample. Tobin suggested using an MLE similar to the probit model (hence the name Tobit) to deal with the censored regression problem. We can write the likelihood function as
$$L = \prod_{i=1}^{n}\Phi\left(-\frac{x_i\beta}{\sigma}\right)^{1-\tilde y_i}\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_i - \sum_{k=1}^{K}x_{ik}\beta_k\right)^2}{2\sigma^2}\right)\right]^{\tilde y_i},$$

where

$$\tilde y_i = \begin{cases} 1 & \text{if } y_i > 0 \\ 0 & \text{if } y_i = 0. \end{cases}$$

[Figure: Bias of OLS in Censored Model (Y plotted against X)]

[Figure: Bias of OLS Without Censored Observations (Y plotted against X)]

We can then write the log likelihood function as

$$l = \sum_{i=1}^{n}\left\{(1-\tilde y_i)\ln\Phi\left(-\frac{x_i\beta}{\sigma}\right) + \tilde y_i\left[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{\left(y_i - \sum_{k=1}^{K}x_{ik}\beta_k\right)^2}{2\sigma^2}\right]\right\}.$$

Again, we maximize this function over our parameter estimates. We use computer programs to give us the estimates and the resulting standard error estimates.

Poisson Regression

Long ago, a man was hired to model the frequency of soldiers being kicked by mules in the French army. Evidently this was a big problem at the time. He came up with the following distribution to model this event:

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$

This is called the Poisson distribution (in honor of the mule kick counter). This distribution has mean and variance equal to $\lambda$. Typically this distribution is used to model count data (like the number of kicks in a certain time period). Count data are ordered integer values (like the number of cars stopped at a particular stoplight at a given time). These are sometimes called rare events, because they do not happen continuously. This is a bit of a misnomer.

[Figure: Poisson Distribution with Lambda=5]

We may wish to estimate a model with a dependent variable consisting of count data. For example, we may wish to know how some set of variables affects the number of bankruptcies in a given year. The Poisson regression model specifies that each $y_i$ is drawn from a Poisson distribution with parameter $\lambda_i$, where $\ln\lambda_i = x_i\beta$. This can be estimated using MLE. The likelihood function is

$$L = \prod_{i=1}^{n}\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}.$$

Thus the log likelihood function is

$$l = \sum_{i=1}^{n}\left[-\lambda_i + y_i\ln\lambda_i - \ln y_i!\right].$$

At this point we could substitute in the model equation $\ln\lambda_i = x_i\beta$ and use the computer to find the optimum. This model displays the standard properties of ML estimation. The fact that we require the variance and mean to be equal may be a problem. There are some ways around it (but they are complicated).
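As an illustration of what "use the computer to find the optimum" can look like, here is a minimal sketch in Python (NumPy only) that maximizes the Poisson log likelihood with $\ln\lambda_i = x_i\beta$ by Newton's method. The function name `fit_poisson`, the starting values, and the synthetic data are assumptions made for the example; a statistical package would normally do this for you.

```python
import numpy as np

def fit_poisson(X, y, tol=1e-10, max_iter=50):
    """Maximize l(beta) = sum(-lam_i + y_i * x_i beta - ln y_i!),
    with lam_i = exp(x_i beta), by Newton's method."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())       # start: constant-only fit (column 0 = ones)
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        grad = X.T @ (y - lam)               # score vector
        info = X.T @ (X * lam[:, None])      # minus the Hessian
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(X @ beta)
    cov = np.linalg.inv(X.T @ (X * lam[:, None]))  # inverse information matrix
    return beta, np.sqrt(np.diag(cov))

# Synthetic count data (assumed true coefficients 0.5 and 0.3).
rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 3, n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x))

beta_hat, se_hat = fit_poisson(X, y)
score = X.T @ (y - np.exp(X @ beta_hat))     # should be ~0 at the optimum
print("estimates:", beta_hat)
print("standard errors:", se_hat)
```

The standard errors come from the inverse of the information matrix evaluated at the estimates, which is the usual ML approach; they rely on the mean-equals-variance restriction discussed above.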
Multiple Equations and Identification

Suppose we had the following model

$$Y\Gamma_i + XB_i + e_i = 0,$$

where $Y$ is a $T \times M$ matrix of endogenous variables (dependent variables), $\Gamma_i$ is an $M \times 1$ vector of coefficients, $X$ is a $T \times K$ matrix of exogenous variables and lagged endogenous variables, $B_i$ is a $K \times 1$ vector of coefficients, $e_i$ is a $T \times 1$ vector of unobservables, and the index $i$ denotes the $i$th equation in a system of several equations. Note that the above has $M + K$ unknowns. Our exogenous variables only identify $K$ of these parameters using exogenous information. In order to identify the other $M$ parameters, we need to include information from other equations or restrictions. But how do we know if we have included enough information in the model to identify these parameters? This system can be rewritten as

$$Y\Gamma + XB + e = 0.$$

If it is possible to invert $\Gamma$, then we can reduce this model to

$$Y\Gamma\Gamma^{-1} + XB\Gamma^{-1} + e\Gamma^{-1} = 0,$$

or

$$Y = -XB\Gamma^{-1} - e\Gamma^{-1},$$

or

$$Y = X\Pi + V.$$

This is called the reduced form. Thus we need to estimate all the elements of $\Pi = -B\Gamma^{-1}$, or $\Pi\Gamma = -B$, which can be rewritten as

$$\begin{pmatrix} \Pi & I_K \end{pmatrix}\begin{pmatrix} \Gamma_i \\ B_i \end{pmatrix} = 0,$$

a system of $M + K$ unknowns and $K$ equations. One source of identification is a restriction on parameters. For example, we may know that one of the endogenous variables is not a factor in one of the equations. Thus we could impose this restriction and possibly identify that particular equation. We will denote these restrictions using the matrix $R_i$, of dimension $J \times (M + K)$ and rank $J < M + K$ (only $J$ independent rows), in the equation

$$R_i\begin{pmatrix} \Gamma_i \\ B_i \end{pmatrix} = R_i\Delta_i = 0.$$

This is an odd way to note this, but if an element of $R_i$ is equal to 1, then the corresponding coefficient must be 0. In this way we are adding identifying equations, so our complete system can be written

$$\begin{pmatrix} (\Pi \;\; I_K) \\ R_i \end{pmatrix}\Delta_i = 0.$$

If we are able to solve this system, then $\begin{pmatrix} (\Pi \;\; I_K) \\ R_i \end{pmatrix}$ must have rank $M + K - 1$, allowing us to identify $\Delta_i$. Writing $\Delta = \begin{pmatrix} \Gamma \\ B \end{pmatrix}$ for the coefficients of the whole system, this will be the case if and only if $rank(R_i\Delta) = M - 1$. This is called the rank condition. A necessary but not sufficient condition is that $J \ge M - 1$. This is called the order condition (it is not sufficient because zeros will cause loss of rank in multiplication).

The possibilities:
1. The $i$th equation is not identified if $rank(R_i\Delta) < M - 1$, which must be the case if $rank(R_i) < M - 1$. In this case no consistent estimators exist.
2. The $i$th equation is just identified if $rank(R_i\Delta) = M - 1$ and $rank(R_i) = M - 1$. In this case the reduced form is unique. Three stage least squares will be consistent and more efficient. Two stage least squares is consistent.
3. The $i$th equation is overidentified if $rank(R_i\Delta) = M - 1$ and $rank(R_i) > M - 1$. In this case the reduced form is not unique. Three stage least squares is consistent and more efficient. Two stage least squares is consistent.

There are many ways to determine rank. Here are some simple rules often used by econometricians.
1. An equation that contains one endogenous variable and all predetermined variables in the system is just identified.
2. An equation that contains all the variables in the system is not identified.
3. If none of the excluded variables of the $i$th equation appears in the $j$th equation, the $i$th equation is not identified.
4. If two equations contain the same set of variables, neither is identified.
5. If any of the excluded variables of the $i$th equation does not appear in any linear combination of the other $M - 1$ equations, the $i$th equation is not identified.

Here it might be useful to give an example. From JHGLL, page 361, suppose we had the following system of equations:

$$y_1 = y_2\gamma_{21} + y_3\gamma_{31} + x_1\beta_{11} + e_1$$
$$y_2 = y_1\gamma_{12} + x_1\beta_{12} + x_2\beta_{22} + x_3\beta_{32} + x_4\beta_{42} + e_2$$
$$y_3 = y_1\gamma_{13} + y_2\gamma_{23} + x_1\beta_{13} + x_5\beta_{53} + e_3.$$

Is equation 1 identified? Equation 1 includes all of the endogenous variables, but excludes the 2nd, 3rd, 4th and 5th predetermined variables. Clearly rules 1 and 2 do not apply. Both equations 2 and 3 contain one or more of the excluded variables, so rule 3 does not apply. Rule 4 does not apply. Each of the excluded variables appears in the other two equations, thus rule 5 does not apply. Where does this leave us? We need to check the rank and order conditions. The system can be written

$$\begin{pmatrix} y_1 & y_2 & y_3 \end{pmatrix}\begin{pmatrix} 1 & -\gamma_{12} & -\gamma_{13} \\ -\gamma_{21} & 1 & -\gamma_{23} \\ -\gamma_{31} & 0 & 1 \end{pmatrix} - \begin{pmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{pmatrix}\begin{pmatrix} \beta_{11} & \beta_{12} & \beta_{13} \\ 0 & \beta_{22} & 0 \\ 0 & \beta_{32} & 0 \\ 0 & \beta_{42} & 0 \\ 0 & 0 & \beta_{53} \end{pmatrix} - e = 0.$$

Thus,

$$R_1\Delta_1 = R_1\begin{pmatrix} 1 & -\gamma_{21} & -\gamma_{31} & -\beta_{11} & 0 & 0 & 0 & 0 \end{pmatrix}'.$$

Because we know that many of the betas are zero, we can specify that

$$R_1 = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

This has rank $4 = M + 1 = 3 + 1$. This is greater than $M - 1$, thus the order condition is met. This means the equation is possibly identified. Next, we need to check the rank condition:

$$R_1\Delta = R_1\begin{pmatrix} 1 & -\gamma_{12} & -\gamma_{13} \\ -\gamma_{21} & 1 & -\gamma_{23} \\ -\gamma_{31} & 0 & 1 \\ -\beta_{11} & -\beta_{12} & -\beta_{13} \\ 0 & -\beta_{22} & 0 \\ 0 & -\beta_{32} & 0 \\ 0 & -\beta_{42} & 0 \\ 0 & 0 & -\beta_{53} \end{pmatrix} = \begin{pmatrix} 0 & -\beta_{22} & 0 \\ 0 & -\beta_{32} & 0 \\ 0 & -\beta_{42} & 0 \\ 0 & 0 & -\beta_{53} \end{pmatrix}.$$

This clearly has rank $2 = M - 1 = 3 - 1$, thus this equation is identified. Because $rank(R_1) = 4 > M - 1$, it falls under possibility 3 above (overidentified).
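The rank and order checks in this example are easy to verify numerically. The sketch below (Python, NumPy only) fills the structural coefficients with arbitrary non-zero placeholder values, builds $\Delta$ by stacking the endogenous and predetermined coefficient blocks, and computes the two ranks. The placeholder values and variable names are assumptions; only the zero pattern comes from the example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 3, 5   # endogenous and predetermined variables in the example

# Arbitrary non-zero placeholders for the free structural coefficients.
g21, g31, g12, g13, g23 = rng.uniform(0.5, 1.5, size=5)
b11, b12, b13, b22, b32, b42, b53 = rng.uniform(0.5, 1.5, size=7)

# Coefficients on (y1, y2, y3); column j belongs to equation j.
Gamma = np.array([[1.0,  -g12, -g13],
                  [-g21,  1.0, -g23],
                  [-g31,  0.0,  1.0]])

# Coefficients on (x1, ..., x5), carrying the minus sign from the system.
B = -np.array([[b11, b12, b13],
               [0.0, b22, 0.0],
               [0.0, b32, 0.0],
               [0.0, b42, 0.0],
               [0.0, 0.0, b53]])

Delta = np.vstack([Gamma, B])          # (M + K) x M

# Equation 1 excludes x2, x3, x4, x5: one restriction row per excluded variable.
R1 = np.hstack([np.zeros((4, 4)), np.eye(4)])

rank_R1 = np.linalg.matrix_rank(R1)
rank_R1_Delta = np.linalg.matrix_rank(R1 @ Delta)

print("restrictions hold for equation 1:", np.allclose(R1 @ Delta[:, 0], 0))
print("rank(R1) =", rank_R1, " (order condition needs at least", M - 1, ")")
print("rank(R1 Delta) =", rank_R1_Delta, " (rank condition needs exactly", M - 1, ")")
```

Using random non-zero values is a convenience: the rank of $R_1\Delta$ depends only on which coefficients are zero, except for special parameter values that happen to create extra linear dependence.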
Three Stage Least Squares

Zellner and Theil introduced the three stage least squares estimator as an efficient way to estimate systems of at least identified equations. This estimator was based on the idea of the Seemingly Unrelated Regression, thus taking account of variation in all equations. Suppose we had the following system of equations

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} = \begin{pmatrix} Z_1 & & & \\ & Z_2 & & \\ & & \ddots & \\ & & & Z_M \end{pmatrix}\begin{pmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_M \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{pmatrix},$$

where $y_i$ is the vector of data for the $i$th dependent variable, $Z_i = [Y_i \;\; X_i]$, where $Y_i$ is the matrix of all endogenous variables that have nonzero coefficients in equation $i$, $X_i$ are the predetermined variables that have nonzero coefficients in equation $i$, and $\delta_i = \begin{pmatrix} \gamma_i \\ \beta_i \end{pmatrix}$. Note the blanks are zero matrices. It will be useful to transform the model using $X$, the matrix of all predetermined variables in the system:

$$\begin{pmatrix} X'y_1 \\ X'y_2 \\ \vdots \\ X'y_M \end{pmatrix} = \begin{pmatrix} X'Z_1 & & & \\ & X'Z_2 & & \\ & & \ddots & \\ & & & X'Z_M \end{pmatrix}\begin{pmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_M \end{pmatrix} + \begin{pmatrix} X'e_1 \\ X'e_2 \\ \vdots \\ X'e_M \end{pmatrix}.$$

The Kronecker product, $\otimes$, becomes a convenient shorthand. Define

$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1J}B \\ a_{21}B & a_{22}B & \cdots & a_{2J}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{I1}B & a_{I2}B & \cdots & a_{IJ}B \end{pmatrix},$$

thus

$$I \otimes A = \begin{pmatrix} A & & & \\ & A & & \\ & & \ddots & \\ & & & A \end{pmatrix}.$$

Our transformation can be written as

$$(I \otimes X')y = (I \otimes X')Z\delta + (I \otimes X')e,$$

where

$$Z = \begin{pmatrix} Z_1 & & & \\ & Z_2 & & \\ & & \ddots & \\ & & & Z_M \end{pmatrix}.$$

In order to obtain the GLS estimator, we must know the variance-covariance matrix of the system,

$$E\left[(I \otimes X')ee'(I \otimes X)\right] = E\left[(I \otimes X')E[ee' \mid X](I \otimes X)\right] = E\left[(I \otimes X')(\Sigma \otimes I)(I \otimes X)\right] = \Sigma \otimes E[X'X] = \Sigma \otimes X'X.$$

Thus, the GLS estimator is given by

$$\hat\delta_{GLS} = \left\{Z'\left[\Sigma^{-1} \otimes X(X'X)^{-1}X'\right]Z\right\}^{-1}Z'\left[\Sigma^{-1} \otimes X(X'X)^{-1}X'\right]y.$$

The problem: we don't know $\Sigma$. So we need to generate the estimates of variance that will appear in our estimates of our coefficients. Note that we cannot use our estimates to generate $\hat\Sigma$, because our estimates depend on $\hat\Sigma$. Instead, we will use two stage least squares estimates to generate the variance-covariance estimates (hence three stage least squares). We will use the following formula:

$$\hat\sigma_{ij} = \frac{(y_i - Z_i\hat\delta_{i,2SLS})'(y_j - Z_j\hat\delta_{j,2SLS})}{T}.$$

This produces a better estimate than two stage least squares by using information from each of the equations to produce estimates.

Time Series (See chapters 20 and 21)

Whole courses of study are devoted to time series analysis. The motivation for time series analysis is something like "modeling this phenomenon is too hard, but maybe I can find some temporal pattern." If modeling is very complicated, there is a long time series data set, and you wish only to make short run predictions, time series modeling may be an option. These models depend on past values of variables to predict future values of the same variables. There are two general types of processes:

a. Autoregressive process, AR($i$):

$$y_t = \theta_1 y_{t-1} + \theta_2 y_{t-2} + \cdots + \theta_i y_{t-i} + \varepsilon_t.$$

b. Moving average process, MA($j$):

$$y_t = \mu + \varepsilon_t + \alpha_1\varepsilon_{t-1} + \alpha_2\varepsilon_{t-2} + \cdots + \alpha_j\varepsilon_{t-j}.$$

Some variables may follow both at once (ARMA). Believe it or not, these series are not that hard to work with (at least at an elementary level). I may go over more if we have time (doubtful).

Efficiency and Failings of Econometric Theory

Now that we have spent so much time learning to use the techniques of statistics and estimation, it may be time to take stock. Even though we have worked to make our estimation general, there are still big questions as to the validity of the remaining assumptions. Further, how reasonable is it to use the hypothesis testing regime that is standard? There are substantial problems with much of the practice of econometrics. It is important to realize the weaknesses of standard practice, and adapt practice to your own circumstances.

Problems with Hypothesis Testing Decisions

There are a few problems with hypothesis testing. One important problem comes from the use of significance tests. Researchers (biologists, economists, everyone) often test the hypothesis that a coefficient is equal to zero to determine if it is important or not. If it is not significant, the variables are often discarded.
This makes little sense. Consider the following axioms (proposed by Nester [1]):
a. All variables have an effect
b. All variables interact
c. All variables are correlated
d. No two populations are identical in any respect
e. No data are normally distributed
f. Variances are never equal
g. All models are wrong
h. No two numbers are the same
i. Many numbers are very small

[1] For a good read, try Nester, Marks R., "An Applied Statistician's Creed" in Applied Statistics (1996) No. 4, pp 401 – 410.

Each of these axioms seems absolutely reasonable (in fact infallible). If we accept these, then testing significance provides no information about the importance of the variable. We only find out that (1) the effect may be small relative to the variance, and (2) we don't have enough data. There are major problems in even well known studies stemming from this problem. This is a problem with current studies of how effective charter schools are. When we find that they have no significant effect, we only say that we don't have enough data to determine the effect with any accuracy. If you examine their confidence intervals, there is also a possibility that they have a huge positive effect.

Another problem with hypothesis testing has to do with the choice of confidence level. The practice of selecting a level of significance and using that level to make all decisions is what economists call irrational! What does this mean? Let's take an extreme case. Suppose you needed to predict the amount of food supplies necessary to relieve a starving country. If you predict too high, the cost will be higher. If you predict too low, people will starve to death. Suppose we needed to decide between two hypotheses,

$$H_0: y = \hat y, \qquad H_1: y = \tilde y > \hat y.$$

Does it really make sense to decide to use the estimate $\tilde y$ only if there is less than a 5% chance that $\hat y$ is a better estimate? This would mean you are willing to have a 94% chance people would die to avoid the cost of providing more food (no matter how small the cost of providing that food!). Hypothesis testing is irrational because it doesn't take into account the real world consequences of type I and type II errors. In business situations, there are smarter ways to determine which hypothesis to employ.

Problems With Maximum Likelihood

Maximum likelihood estimation is very useful. However, there are problems with standard error estimates (particularly in nonlinear models). Suppose we estimated a model using MLE, but we didn't like the results of one of our hypothesis tests regarding an estimate $\hat\gamma$. We could then transform our model in some inconsequential way, and come up with different standard errors that say what we like. For example, we could estimate $1/\beta = \gamma$. So long as we are transforming only the coefficient estimates, our transformation should not change the estimate, only the standard error estimates. We could just search until we find a form that says what we want. Does this sound good?

Problems with scientific method of learning

The method of learning employed by science is slow. In fact, it is probably inappropriate for use in policy or business. Earlier we outlined the scientific method as
1. Observe some phenomenon.
2. Create a hypothesis to explain the phenomenon.
3. Make predictions employing the hypothesis.
4. Test the predictions through experimentation or more observations (usually using statistical theory).
5. Modify the hypothesis and repeat 3 and 4.

One of the mainstays of science is to have a maintained hypothesis that everyone tries to disprove by finding a possibly better hypothesis. However, when testing, the scientist gives extra weight to the maintained hypothesis, and only rejects it if the evidence is conclusive. Again we run into the problem of weighing values. Science is ignorant of values or how to weight them.
When making business decisions, it is probably best to use the best performing model, rather than the scientifically acceptable model.

Another problem has to do with replicability. The scientific method relies on separate researchers confirming the finding of another study in a separate experiment. This has led to a culture among some economists where each individual gets a new data set and conducts estimation. When economists do this, their estimates are no longer consistent. Why? Suppose I had a sample of 10 observations I used to estimate. Then later I would throw out those observations for a new 10 observations, etc., etc. My estimates would not converge to anything more believable than the first set of estimates. In fact, about 1 out of every 20 times I estimate, I will find something significantly different from the first set of estimates. I may be able to publish this in a journal because it is a unique finding. An efficient use of information would build new estimates using the information from previous studies and the new data.

Lastly, economists are often unwilling to use information that might be helpful in solving an applied problem. When determining the effect of a price change, an economist will use past performance data to estimate. While this is useful, we might be able to improve estimates by including some less rigorous information. For example, we could ask individuals what they would do if the price changed by x. There are biases in the responses to such questions, but that does not mean that the responses contain no information. Too often economic surveys fail to ask direct questions when they may be useful.

Possible solutions

I have raised a few questions regarding the use of standard economic practice. Many of the questions I have raised have some well understood (but controversial) solutions. While not all problems can be easily solved, the use of Bayesian statistics can eliminate many of the circularities I have talked about.
We have been using frequentist theory, based on our notion of probability as the eventual proportion of outcomes if we had an infinite number of draws. Frequentist estimation assumes that parameters are fixed, and that our estimates are variable. On the other hand, Bayesians assume that parameters are random, but our estimates are fixed. Every Bayesian estimation starts with describing the information already known about the parameters (called a prior), and incorporating the new information provided by the data. The researcher defines a loss function, incorporating their own penalties for various mistakes, and finds estimates that minimize loss. Finally, hypotheses are tested using odds ratios and a loss function.

Many economists and scientists oppose the use of priors, as they may be subjective. In policy and business decisions, Bayesian estimation may make better sense, as it allows the decisionmaker to incorporate their own values and nonsample information. Bayesian estimates may be less objective, but real world decisions do not often involve objective valuation.

Bayesian Econometrics (In just one lecture)

p. 763 – 826

Until now we have been using frequentist theories of estimation and testing. These theories assume that distributions have fixed (but unknown) parameters, but that our data is random. We then estimate a distribution that represents the randomness that is inherent in the system. For example, using ML, we often estimate the mean and variance of a normal distribution. The derived distribution is the distribution of some error term that can be used to derive distributions for our parameter estimates (which are random) while we believe the parameters are fixed. The Bayesian approach uses the distribution to represent our knowledge about the parameters of a system, which are random from the scientist's point of view. This is a less messy treatment (in my opinion). We suppose the uncertainty is in the thing we are uncertain about.
In other words, we would derive a distribution of mean and variance parameters from the data and prior information. Point estimates are not generated by this procedure.

All inference in Bayesian statistics is based upon the following identity defining a conditional pdf:

$$f(y \mid \theta) = \frac{f(y, \theta)}{f(\theta)}.$$

This leads us to the following relationships

$$f(y, \theta) = f(y \mid \theta)\, f(\theta) = f(\theta \mid y)\, f(y),$$

which leads us to Bayes' rule

$$f(\theta \mid y) = \frac{f(\theta)\, f(y \mid \theta)}{f(y)}.$$

This holds where $f(y) \neq 0$. Let $y$ be some sample observation, and $\theta$ be some unknown parameter.

The function $f(\theta)$ represents our prior knowledge of $\theta$. This distribution will be very wide if we don't know anything about $\theta$. Alternatively, if we know the exact value of $\theta$, then $f(\theta)$ will have probability one at that value and zero elsewhere. We will most often refer to this as the prior, and represent it as $p(\theta)$.

The function $f(y \mid \theta)$ represents the distribution of the sample data given some specific parameter values. This is referred to as the likelihood function. It contains the same information that a pdf contains in ML estimation. MLE just maximizes this function over $\theta$. We will often represent the likelihood function as $l(y \mid \theta)$.

The function $f(\theta \mid y)$ is called the posterior and will often be written as $p(\theta \mid y)$. This represents what we know about $\theta$ after incorporating the information in the prior and likelihood function. Bayes' rule is a way of efficiently combining nonsample and sample information.

The function I haven't mentioned, $f(y)$, the unconditional density of $y$, is generally not known. We usually only know about the distribution of $y$ given some parameters. Luckily, this function does not contain $\theta$, and thus behaves as a constant relative to the parameter values. In fact

$$f(y) = \int_{-\infty}^{\infty} p(\theta)\, l(y \mid \theta)\, d\theta = \int_{-\infty}^{\infty} f(y, \theta)\, d\theta,$$

which, given $y$, is a constant. We know that the denominator must force the posterior to integrate to 1. Instead of making this complicated calculation, we will work in proportions. So,

$$p(\theta \mid y) \propto p(\theta)\, l(y \mid \theta).$$
This simplification will save us a lot of time. Let's look at some graphical examples.

[Figure: Uninformative Prior (one observation). Prior, likelihood, and posterior plotted over the range −10 to 10.]

[Figure: Informative Normal Prior. Prior, likelihood, and posterior plotted over the range −10 to 10.]

[Figure: Truncating Prior. Prior, likelihood, and posterior plotted over the range −10 to 10.]
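The three graphical cases (an uninformative prior, an informative normal prior, and a truncating prior) can be mimicked numerically by evaluating prior times likelihood on a grid and rescaling so the result integrates to one. The sketch below is only an illustration under assumptions of my own choosing: a single observation y = 1 from a normal distribution with known standard deviation 2, a N(0, 1) informative prior, and a truncating prior that rules out negative values. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def grid_posterior(prior, likelihood, grid):
    """Posterior density on an evenly spaced grid: prior * likelihood, rescaled."""
    step = grid[1] - grid[0]
    unnormalized = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnormalized) * step  # numerical stand-in for f(y)
    return [u / total for u in unnormalized]


def grid_mean(density, grid):
    """Mean of a density tabulated on an evenly spaced grid."""
    step = grid[1] - grid[0]
    return sum(t * d for t, d in zip(grid, density)) * step


# Same horizontal range as the figures: theta from -10 to 10.
GRID = [-10.0 + 0.01 * i for i in range(2001)]
Y_OBS, SD = 1.0, 2.0  # assumed single observation and known sd


def likelihood(theta):
    """l(y | theta) for the one assumed observation."""
    return normal_pdf(Y_OBS, theta, SD)


PRIORS = {
    "uninformative": lambda theta: 1.0,  # flat, not a proper pdf
    "informative normal": lambda theta: normal_pdf(theta, 0.0, 1.0),
    "truncating": lambda theta: 1.0 if theta >= 0.0 else 0.0,
}

if __name__ == "__main__":
    for name, prior in PRIORS.items():
        posterior = grid_posterior(prior, likelihood, GRID)
        print(name, "posterior mean:", round(grid_mean(posterior, GRID), 3))
```

With the flat prior the posterior simply reproduces the shape of the likelihood, the normal prior pulls the posterior toward the prior mean, and the truncating prior removes all posterior mass from the excluded region.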
The Linear Case

There is a lot of controversy surrounding the use of priors. Sometimes we will wish to include some nonsample information in our prior (by limiting the probability placed on a certain set of parameter values). For example, it may be unreasonable that a set of parameters is negative, or highly likely that a set of parameters are equal. Some scientists see including this information as subjective and therefore invalid. A common starting point for an initial study is an uninformative prior. Suppose we wish to estimate $y = X\beta + \varepsilon$, with a likelihood function given by

$$l(y, X \mid \beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} \propto \frac{1}{\sigma^{n}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\},$$

or homoskedastic normal (this should look familiar by now). If we didn't know anything about the distribution of parameters, our prior should reflect that. However, we know that $\sigma$ must be positive. This is commonly represented as
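$$p(\beta, \sigma) \propto \begin{cases} \dfrac{1}{\sigma} & \text{if } \sigma > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Note that this prior function cannot be a proper pdf. Very often we will specify prior functions that are not true pdfs, but represent clearly what we know. The joint posterior is given by

$$p(\beta, \sigma \mid y, X) \propto \begin{cases} \dfrac{1}{\sigma^{n+1}} \exp\left\{-\dfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} & \text{if } \sigma > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The joint posterior itself is generally not useful. However, by integrating out the variables we don't care about, we can find the marginal distribution of a single parameter (or set of parameters). These will allow us to conduct tests. By integrating we find that

$$p(\beta \mid y, X) \propto \int_{0}^{\infty} \frac{1}{\sigma^{n+1}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} d\sigma \propto \left\{ v s^2 + (\beta - \hat{\beta})' X'X (\beta - \hat{\beta}) \right\}^{-n/2},$$

where, in the usual notation, $\hat{\beta}$ is the OLS estimate, $v = n - k$, and $s^2 = (y - X\hat{\beta})'(y - X\hat{\beta})/v$. In other words, the slope parameters are distributed with a joint t distribution. Integrating out all but one slope parameter yields the condition that

$$\frac{\beta_i - \hat{\beta}_i^{OLS}}{s \left[ (X'X)^{-1}_{ii} \right]^{1/2}} \sim t(n - k).$$

Also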
$$p(\sigma \mid y, X) \propto \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{\sigma^{n+1}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} d\beta_1 \cdots d\beta_k \propto \frac{1}{\sigma^{v+1}} \exp\left\{-\frac{v s^2}{2\sigma^2}\right\},$$

which is an inverse gamma distribution. Note that this is not exactly like our OLS estimation, because we have estimated the distribution of $\beta$, rather than $\hat{\beta}$.

Subsequent studies could employ these posterior pdfs as priors. Let

$$p(\beta, \sigma \mid y_1, X_1) \propto \begin{cases} \dfrac{1}{\sigma^{n_1+1}} \exp\left\{-\dfrac{1}{2\sigma^2}(y_1 - X_1\beta)'(y_1 - X_1\beta)\right\} & \text{if } \sigma > 0 \\ 0 & \text{otherwise} \end{cases}$$

be a prior created from an original study, and

$$l(y_2, X_2 \mid \beta, \sigma^2) = \frac{1}{\sigma^{n_2}} \exp\left\{-\frac{1}{2\sigma^2}(y_2 - X_2\beta)'(y_2 - X_2\beta)\right\}$$

be the likelihood function for our current study. Then, combining information from the two studies, we obtain

$$p(\beta, \sigma \mid y_1, y_2, X_1, X_2) \propto \frac{1}{\sigma^{n_1+n_2+1}} \exp\left\{-\frac{1}{2\sigma^2}\left[(y_1 - X_1\beta)'(y_1 - X_1\beta) + (y_2 - X_2\beta)'(y_2 - X_2\beta)\right]\right\}.$$

Using this we could find more accurate estimates for $\beta$ than could be had from either study independently.

Point estimates

If we need point estimates, we can obtain these by defining a loss function. This loss function is subjective, but can be constructed to incorporate the real world cost of error. The most common loss function is the squared error loss function, defined by

$$L = (\hat{\theta} - \theta)^2.$$

This is in fact the loss function used in OLS. Another common loss function is the absolute error,

$$L = |\theta - \hat{\theta}|.$$

In fact any function that reflects the costs of inaccuracy will do. To find a point estimate, we minimize expected loss,

$$\min_{\hat{\theta}} E\left[L(\hat{\theta}, \theta)\right] = \min_{\hat{\theta}} \int_{-\infty}^{\infty} p(\theta \mid y)\, L(\hat{\theta}, \theta)\, d\theta = \min_{\hat{\theta}} \int_{-\infty}^{\infty} p(\theta \mid y)\, (\hat{\theta} - \theta)^2\, d\theta.$$

For the simple linear specification above, if we use a symmetric loss function we will obtain the OLS estimates.

Hypothesis Testing

Suppose we wished to compare two hypotheses, $H_0$ and $H_1$. Our hypotheses can be represented by

$$H_0: \theta = \theta_0, \qquad H_1: \theta = \theta_1.$$

If these two hypotheses are mutually exclusive and exhaustive, we can define

$$w = \begin{cases} 0 & \text{if } H_0 \text{ is true} \\ 1 & \text{if } H_1 \text{ is true.} \end{cases}$$

We assume we know the likelihood function $l(y \mid w)$. We may have some prior information about the hypotheses that can be represented in a prior, $p(w)$. If we have no prior information we might choose

$$p(w) = \begin{cases} .5 & \text{if } w = 1 \\ .5 & \text{if } w = 0. \end{cases}$$

From Bayes' rule we have

$$p(w \mid y) = \frac{p(w)\, l(y \mid w)}{p(y)}.$$

We thus have the posterior probability of $H_0$,

$$p(0 \mid y) = \frac{p(0)\, l(y \mid 0)}{p(y)},$$

and of $H_1$,

$$p(1 \mid y) = \frac{p(1)\, l(y \mid 1)}{p(y)}.$$

We will typically represent this in the form of posterior odds

$$K_{01} = \frac{p(0 \mid y)}{p(1 \mid y)} = \frac{p(0)\, l(y \mid 0)}{p(1)\, l(y \mid 1)},$$

which will be larger when the null hypothesis is more probable.

(Could use coin toss example here)

Most researchers will only report posterior odds without making a decision in favor of one or the other. Making a decision requires defining a loss function. We will actually have to define four losses, one for each combination of true hypothesis and chosen hypothesis: $L(H_0, \hat{H}_0)$, $L(H_1, \hat{H}_0)$, $L(H_0, \hat{H}_1)$, $L(H_1, \hat{H}_1)$. Then, we calculate our expected loss for each hypothesis:

$$E[L \mid \hat{H}_0] = p(H_0 \mid y)\, L(H_0, \hat{H}_0) + p(H_1 \mid y)\, L(H_1, \hat{H}_0)$$

$$E[L \mid \hat{H}_1] = p(H_0 \mid y)\, L(H_0, \hat{H}_1) + p(H_1 \mid y)\, L(H_1, \hat{H}_1)$$

We would then choose the hypothesis that minimizes our expected loss.

If we were to toss a coin $n$ times, and it came up heads $n_1$ times, we would learn something about the distribution of heads and tails. We might assume the binomial likelihood function

$$p(n_1 \mid \theta, n) = \frac{n!}{n_1!\,(n - n_1)!}\, \theta^{n_1} (1 - \theta)^{n - n_1}.$$

Suppose also we had a prior distribution given by the beta pdf:

$$p(\theta) \propto \theta^{a-1} (1 - \theta)^{b-1}.$$

For right now, suppose $a = b = 20$, so that the prior looks like the graph below. Then, the posterior distribution is just

$$p(\theta \mid n_1, n) \propto \frac{n!}{n_1!\,(n - n_1)!}\, \theta^{n_1 + 20 - 1} (1 - \theta)^{n - n_1 + 20 - 1}.$$

This is again a beta pdf, now with parameters $a' = n_1 + 20$ and $b' = n - n_1 + 20$, and so with mean

$$\frac{a'}{a' + b'} = \frac{n_1 + 20}{n + 40}.$$

Also, the variance of the distribution of $\theta$ decreases as we obtain new observations. Alternatively, we could use the prior

$$p(\theta) = \frac{1}{\theta (1 - \theta)}$$

to represent very little knowledge of the parameter value.

Suppose instead we had two competing hypotheses: $H_0: \theta = .5$, $H_1: \theta = .6$. Regarding these hypotheses, suppose we have a prior

$$p(\theta) = \begin{cases} .5 & \text{if } \theta = .5 \\ .5 & \text{if } \theta = .6. \end{cases}$$

Then, we would find posterior odds

$$K_{01} = \frac{.5 \times .5^{n_1} \times .5^{n - n_1}}{.5 \times .6^{n_1} \times .4^{n - n_1}}.$$

We could then define loss functions (written, as above, with the true hypothesis first and the chosen hypothesis second):

$$L(H_0, \hat{H}_0) = 0, \qquad L(H_0, \hat{H}_1) = 2, \qquad L(H_1, \hat{H}_0) = 50, \qquad L(H_1, \hat{H}_1) = 0.$$

Then, we could calculate our expected loss for each hypothesis:

$$E[L \mid \hat{H}_0] = \frac{.5 \times .5^{n_1} \times .5^{n - n_1}}{.5 \times .5^{n_1} \times .5^{n - n_1} + .5 \times .6^{n_1} \times .4^{n - n_1}} \times 0 + \frac{.5 \times .6^{n_1} \times .4^{n - n_1}}{.5 \times .5^{n_1} \times .5^{n - n_1} + .5 \times .6^{n_1} \times .4^{n - n_1}} \times 50$$

$$E[L \mid \hat{H}_1] = \frac{.5 \times .5^{n_1} \times .5^{n - n_1}}{.5 \times .5^{n_1} \times .5^{n - n_1} + .5 \times .6^{n_1} \times .4^{n - n_1}} \times 2 + \frac{.5 \times .6^{n_1} \times .4^{n - n_1}}{.5 \times .5^{n_1} \times .5^{n - n_1} + .5 \times .6^{n_1} \times .4^{n - n_1}} \times 0$$
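The coin toss calculations can be checked numerically. The following is a minimal sketch in Python, not part of the original notes. It assumes the standard beta-binomial update (a beta prior with parameters a and b becomes a beta posterior with parameters a + n1 and b + n − n1), and it uses the loss values from the expected-loss arithmetic in the example: a loss of 50 for choosing H0 when H1 is true and 2 for choosing H1 when H0 is true. The function names and the sample of 5 heads in 10 tosses are hypothetical.

```python
def beta_posterior(n_heads, n_tosses, a=20.0, b=20.0):
    """Update a Beta(a, b) prior with binomial data.

    Returns the posterior parameters, posterior mean, and posterior variance.
    """
    a_post = a + n_heads
    b_post = b + (n_tosses - n_heads)
    total = a_post + b_post
    mean = a_post / total
    variance = a_post * b_post / (total ** 2 * (total + 1.0))
    return a_post, b_post, mean, variance


def posterior_probabilities(n_heads, n_tosses, theta0=0.5, theta1=0.6, prior0=0.5):
    """Posterior probabilities of H0: theta = theta0 and H1: theta = theta1.

    The binomial coefficient is the same under both hypotheses, so it cancels.
    """
    n_tails = n_tosses - n_heads
    weight0 = prior0 * theta0 ** n_heads * (1.0 - theta0) ** n_tails
    weight1 = (1.0 - prior0) * theta1 ** n_heads * (1.0 - theta1) ** n_tails
    total = weight0 + weight1
    return weight0 / total, weight1 / total


# Keys are (true hypothesis, chosen hypothesis); values are assumed losses.
LOSS = {(0, 0): 0.0, (1, 0): 50.0, (0, 1): 2.0, (1, 1): 0.0}


def expected_losses(p0, p1, loss=LOSS):
    """Expected loss of choosing H0 and of choosing H1."""
    choose_h0 = p0 * loss[(0, 0)] + p1 * loss[(1, 0)]
    choose_h1 = p0 * loss[(0, 1)] + p1 * loss[(1, 1)]
    return choose_h0, choose_h1


if __name__ == "__main__":
    n_heads, n_tosses = 5, 10  # hypothetical data
    print("beta posterior:", beta_posterior(n_heads, n_tosses))
    p0, p1 = posterior_probabilities(n_heads, n_tosses)
    print("posterior odds K01:", p0 / p1)
    print("expected losses (choose H0, choose H1):", expected_losses(p0, p1))
```

With these hypothetical numbers the posterior odds favor H0, yet the expected loss is lower when choosing H1, because wrongly keeping H0 is assumed to be so much more costly. This is the sense in which the decision depends on the loss function and not only on the odds.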
 Fall '08
 STAFF
