ECON2206 Revision Notes

W2 SIMPLE REGRESSION MODEL

MOTIVATION
Much of applied econometric analysis is interested in "explaining y in terms of x" and confronts three issues:
1) Since there is never an exact relationship between y and x, how do we account for the "other unobserved" variables?
2) What is the functional relationship between y and x?
3) How do we invoke a ceteris paribus relationship, or a causal effect, between y and x?

The simple linear regression model is: y = β0 + β1x + u
- u is the stochastic error or disturbance term and represents all unobserved factors other than x.
- If all other factors (u) are held fixed, so that the change in u is zero, we can observe the functional relationship between y and x.
- If we take the expected value of the model (with E(u) = 0), we get E(y|x) = β0 + β1x, so x has a linear effect on y.

We will only get reliable estimates of β0 and β1 if we make restricting assumptions about u. As long as the intercept β0 is included in the model, nothing is lost by assuming that the expected value of u in the population is zero: E(u) = 0.

ZCM
Our crucial assumption concerns the conditional distribution of u given any value of x: the average value of u does not depend on the value of x.
E(u|x) = E(u) = 0
This is the zero conditional mean (ZCM) assumption.
- The average value of the unobserved factors is the same across the population.
- An important implication of ZCM is that u and x are uncorrelated.

OLS
Ordinary Least Squares (OLS) is a method for estimating the unknown parameters in a linear regression model. The estimates of β0 and β1 are found by minimising the sum of squared residuals, that is, the distance between the observations in the sample and the responses predicted by the line.
- Fitted values and estimates are denoted by a HAT.
- The values predicted for y when x = xi (where xi is observation i) are called the fitted values, ŷi.
- There is a fitted value for every observation in the sample.
- The residual for each observation is the difference between the actual value, yi, and its fitted value: ûi = yi − ŷi.
- The OLS regression line is also called the sample regression function (SRF).

Properties of OLS Residuals
- If ûi (the residual for observation i) is positive, the line underpredicts yi.
- If ûi is negative, the line overpredicts yi.
- The sum (and hence the sample average) of the OLS residuals is zero.
- The sample covariance between the regressors and the OLS residuals is zero.

Variation
Total sum of squares (SST) = total sample variation in the yi
Explained sum of squares (SSE) = total sample variation in the ŷi
Residual sum of squares (SSR) = sample variation in the ûi
The total variation in y can thus be decomposed as: SST = SSE + SSR

Goodness-of-Fit (R²)
The R-squared is the ratio of explained variation to total variation; its interpretation is the fraction of sample variation in y that is explained by x.
R² = SSE / SST
- The R-squared of the regression is sometimes referred to as the coefficient of determination.
- It is the percentage of sample variation in y that is explained by x.

Units of Measurement & Functional Form
A linear relationship between the dependent and independent variables is not appropriate for every economic application. The different functional forms are:
- LEVEL-LEVEL: y and x. A one-unit increase in x changes y by β1 units.
- LEVEL-LOG: y and log(x). A 1% increase in x changes y by β1/100 units.
- LOG-LEVEL: log(y) and x. A one-unit increase in x changes y by approximately (100·β1)%. β1 is also called the semi-elasticity of y with respect to x.
- LOG-LOG: log(y) and log(x). A 1% increase in x changes y by β1%. β1 is also called the elasticity of y with respect to x.

Underlying Assumptions of OLS
To establish unbiasedness of the OLS estimators, we need a set of assumptions:
o SLR1: The population model is linear in parameters.
o SLR2: Random sampling — y, x and u are all viewed as random variables from a random sample of size n.
o SLR3: There is some variation in the xi — necessary, and if x varies in the population, random samples of x will typically contain variation.
o SLR4: ZCM — we need to impose the zero conditional mean assumption to obtain unbiased estimators of β0 and β1.
o SLR5: Homoskedasticity — we also assume that the variance of the error term u is constant conditional on x, also known as the "constant variance assumption".
  - Denoted by Var(u|x) = σ².
  - Although it plays no role in showing that the OLS estimators are unbiased, it simplifies the variance calculations for the OLS estimators.

When the OLS estimators are unbiased, they are centred about the true population parameter. That is, unbiasedness implies that β̂1 is centred about β1.

W3 Multiple Regression Model
Multiple regression is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that affect the dependent variable; in simple regression these factors were hidden in the error term. By including more than one variable, the slope parameters have ceteris paribus interpretations.

SLR v MLR
There is a simple relationship between β̃1 (the simple regression coefficient) and β̂1 (the multiple regression coefficient):
β̃1 = β̂1 + β̂2·δ̃1
where δ̃1 is the slope coefficient from the simple regression of xi2 on xi1. This equation shows how the simple regression coefficient differs from the partial effect of x1 on y. There are two distinct cases where they are equal:
1. The partial effect of x2 on ŷ is zero in the sample, that is, β̂2 = 0.
2. x1 and x2 are uncorrelated in the sample, that is, δ̃1 = 0.

Goodness of Fit
As with simple regression, we can define SST, SSE and SSR in the same way. Similarly, SST = SSE + SSR, and the R-squared is defined as the proportion of sample variation in yi that is explained by the OLS regression line.
R² = SSE / SST
- An important fact about R² is that it never decreases, and usually increases, when another independent variable is added to a regression.
- This is because SSR never increases when additional regressors are added to the model.
- This makes it a poor tool for deciding whether one variable or several variables should be added to a model. The factor that should determine whether an explanatory variable belongs in a model is whether it has a ceteris paribus effect on y in the population.

Underlying Assumptions
The assumptions required to obtain unbiased OLS estimators are:
o MLR1: The population model is LINEAR in parameters.
o MLR2: Random sampling of observations.
o MLR3: None of the independent variables is constant and there are NO exact linear relationships among the independent variables.
  - Perfect collinearity is not allowed.
  - MLR3 does allow the independent variables to be correlated, just not perfectly correlated.
  - Perfect collinearity includes the case where one variable is a constant multiple of another, or where one independent variable can be expressed as an exact linear function of two or more of the others.
o MLR4: ZCM — the error term u has an expected value of zero given any values of the independent variables: E(u|x1, ..., xk) = 0.
  - Ways MLR4 can fail: omitting important factors, and functional form misspecification (e.g. using x where x² belongs).

Under assumptions MLR1 to MLR4, we have:
E(β̂j) = βj, for j = 0, 1, ..., k
That is, for any values of the population parameters, the OLS estimators are unbiased estimators of the population parameters.
- When we say that OLS is unbiased under MLR1 to MLR4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples.

What about including irrelevant variables in the regression model?
- Also known as overspecifying the model.
- Occurs when one or more independent variables are included in the model even though they have no partial effect on y in the population.
- What is the effect on the unbiasedness of OLS? NO EFFECT.
- But it can have undesirable effects on the variances of the OLS estimators.

OMITTED VARIABLE BIAS
When we omit a variable that actually belongs in the true population model, this is called excluding a relevant variable, or underspecifying the model.
- When we omit a variable, the omitted variable is absorbed into the error term instead: v = β2x2 + u.
- Starting from a true two-variable model, omitting x2 leaves a simple regression whose coefficient on x1 is β̃1, with:
E(β̃1) = E(β̂1 + β̂2·δ̃1) = β1 + β2·δ̃1

There is bias in the simple regression coefficient:
Bias(β̃1) = E(β̃1) − β1 = β2·δ̃1
β2·δ̃1 is the OMITTED VARIABLE BIAS. The only two cases where the simple regression coefficient is unbiased are:
1) β2 = 0, so x2 does not appear in the true model.
2) The sample covariance between x1 and x2 is zero (they are uncorrelated), so δ̃1 = 0.

When x1 and x2 are correlated, δ̃1 has the same sign as the correlation between x1 and x2:
- If δ̃1 > 0, the variables are positively correlated.
- If δ̃1 < 0, the variables are negatively correlated.

The sign of the bias depends on the signs of both β2 and δ̃1. The bias in the simple regression coefficient when x2 is omitted is as follows:

           corr(x1, x2) > 0   corr(x1, x2) < 0
  β2 > 0   positive bias      negative bias
  β2 < 0   negative bias      positive bias

Variance of the OLS Estimators
MLR5: Homoskedasticity assumption — the error term u has the same variance given any values of the explanatory variables: Var(u|x) = σ².
- The variance of the error term, conditional on all the explanatory variables, is constant.
We invoke this assumption in order to simplify the formulas that measure the spread in the sampling distribution of the OLS estimators.

Gauss-Markov Assumptions
Assumptions MLR1 through MLR5 are known as the Gauss-Markov assumptions.
- Under these five assumptions, the variances of the OLS estimators are given by:
Var(β̂j) = σ² / [SSTj·(1 − Rj²)]
A large variance means a less precise estimator, which translates into a larger confidence interval and less accurate hypothesis tests.

Components of the OLS variances:

The error variance σ²: a feature of the population, nothing to do with the sample size.
- A larger error variance means larger variances for the OLS estimators, and thus makes it harder to estimate the partial effect of any independent variable on y.
- The only way to reduce it is to add more explanatory variables to the equation (take factors out of the error term).

The total sample variation SSTj: the larger SSTj is, the smaller the variance of the OLS estimator. Thus, all else being equal, it is better to have as much sample variation in xj as possible when estimating the parameters.
- Increasing the sample size increases SSTj.
- The extreme case of NO sample variation in xj is a violation of MLR3.

The R-squared of the other independent variables, Rj²: the R-squared from regressing xj on all the other independent variables; it measures the goodness-of-fit of the linear relationships among the independent variables. An Rj² closer to 1 indicates that the other regressors explain much of the sample variation in xj (higher correlation).
- The extreme case Rj² = 1 is ruled out by MLR3, as it indicates perfect collinearity among the independent variables.
- When Rj² is CLOSE to 1 we have multicollinearity, which is not a violation of MLR3.

Multicollinearity
It is better to have less correlation between xj and the other independent variables when estimating βj.
- Dropping a highly correlated variable can reduce multicollinearity, but can result in omitted variable bias.
- Multicollinearity can be mitigated by collecting more data.
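The role of Rj² in Var(β̂j) can be made concrete with a small sketch. This is illustrative only: the data are made up, and r_squared is a helper defined here (not a library function) that computes the R² from a simple regression, which for two regressors is exactly Rj². The factor 1/(1 − Rj²) is the variance inflation caused by correlation among the regressors.

```python
# Sketch: how correlation between two regressors inflates Var(beta_j_hat)
# through the factor 1/(1 - R_j^2). Data below are hypothetical.

def r_squared(y, x):
    """R^2 from the simple regression of y on x (with intercept)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ssr / sst

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1]   # nearly a linear function of x1

r2_1 = r_squared(x1, x2)               # R_1^2: regress x1 on the other regressor
inflation = 1 / (1 - r2_1)             # multiplies sigma^2/SST_1 in Var(beta_1_hat)
print(round(r2_1, 3), round(inflation, 1))
```

With regressors this highly correlated, R1² is close to 1 and the variance of β̂1 is inflated by a factor of roughly a hundred relative to the uncorrelated case, which is the multicollinearity problem described above.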
Under the Gauss-Markov assumptions, E(σ̂²) = σ², so σ̂² is an unbiased estimator of σ².

Efficiency of OLS: the Gauss-Markov Theorem
Under MLR1 through MLR5, the OLS estimators β̂j are the BEST LINEAR UNBIASED ESTIMATORS (BLUE).
- An estimator is a rule that can be applied to any sample of data to produce an estimate.
- An estimator is unbiased when its expected value equals the population parameter value.
- An estimator is linear if it can be expressed as a linear function of the data on the dependent variable.
- "Best" is defined as smallest variance (efficient).
Thus: linear, unbiased and efficient estimators.

W4 Multiple Regression: Inference
Knowing the expected value and variance of the OLS estimators is useful, but we also need to know the full sampling distribution of the estimators in order to perform statistical inference.
- Under the Gauss-Markov assumptions alone, the distribution of the estimators can have virtually any shape.
- We must invoke the NORMALITY ASSUMPTION for the unobserved error term in the population.

MLR6: The population error term u is independent of the explanatory variables and is normally distributed with zero mean and variance σ²:
u ~ Normal(0, σ²)
- MLR6 implies MLR4 (ZCM) and MLR5 (homoskedasticity).

MLR1 to MLR6 are known as the CLASSICAL LINEAR MODEL (CLM) assumptions.
- Under the CLM assumptions, the OLS estimators have a stronger efficiency (smallest variance) property than under the Gauss-Markov assumptions:
- the OLS estimators are the minimum variance unbiased estimators.
- The CLM assumptions also imply that the OLS estimators are normally distributed.

Under the CLM assumptions MLR1 through MLR6, conditional on the sample values of the independent variables:
β̂j ~ Normal[βj, Var(β̂j)]

Testing Hypotheses about a Single Population Parameter
We can hypothesise about the value of βj and then use statistical inference to test our hypothesis.
In order to conduct hypothesis tests, we need the following result:
(β̂j − βj) / se(β̂j) ~ t(n−k−1)
- where t(n−k−1) is the t distribution with n − k − 1 degrees of freedom (the number of values that are free to vary); n = number of observations and k = number of slope estimates, and we subtract 1 more to account for the intercept estimate.

Our primary interest lies in testing the NULL hypothesis, H0: βj = aj.
- To test H0, we use the t statistic (called the t ratio when aj = 0):
t = (β̂j − aj) / se(β̂j), where se(β̂j) = sqrt[Var(β̂j)]

One-sided alternatives
To determine a rule for rejecting H0, we need to decide on the relevant alternative hypothesis. First consider a one-sided alternative: H1: βj > 0.
- We must decide on a significance level, the probability of rejecting H0 when it is in fact true.
- A significance level of 5% means that we are willing to mistakenly reject H0, when it is in fact true, 5% of the time.

REJECTION rule: H0 is rejected in favour of H1 at the 5% significance level if t-stat > c.
- c is the critical value, obtained from the t distribution tables using the significance level and the degrees of freedom. If df is large (> 120), the t distribution is close to the standard normal.
- If the one-sided alternative is H1: βj < 0, then the rejection rule is simply t-stat < −c.
- Note that when we draw a graph, the area of the rejection region equals the chosen significance level.

Two-sided alternative
This is when the alternative is two-sided: H1: βj ≠ 0.
- This alternative states that xj has a ceteris paribus effect on y without specifying whether the effect is positive or negative.
- When the alternative is two-sided, we are interested in the absolute value of the t statistic. The rejection rule for H0 is therefore |t-stat| > c.
- The area of the rejection region at either end of the distribution is therefore half the significance level.

When H0 is rejected, we conclude that xj is statistically significant at the 5% level. When H0 is not rejected, we conclude that xj is statistically insignificant at the 5% level.

Computing p-values
The p-value answers: "Given the observed value of the t statistic, what is the smallest significance level at which the null hypothesis would be rejected?"
- The smaller the p-value, the greater the evidence against the null.
- For a one-sided alternative it is P(T > t): the probability, under the null distribution, of observing a value beyond the observed t statistic.

Economic/Statistical Significance
- An explanatory variable is statistically significant when its t ratio is sufficiently large (in absolute value) that H0 is rejected in favour of H1 at the chosen significance level.
- An explanatory variable is economically significant when the size of its estimated coefficient is sufficiently large.
- An important xj should be both economically and statistically significant.
1) Check the statistical significance of a variable, and also observe the magnitude of its coefficient to get an idea of its economic significance.
2) If a variable is not statistically significant at the chosen level, we can still ask whether it has the expected effect on y and whether that effect is large. Also compute p-values.
3) It is common to find variables with small t statistics that have the wrong sign; we can conclude that such variables are statistically insignificant.

Confidence Intervals
Under the classical linear model assumptions, we can construct a confidence interval for each population parameter. It provides a range of likely values for the parameter, not just a point estimate. The interval is given by:
β̂j ± c·se(β̂j)
- The value of c is obtained from the t(n−k−1) distribution tables; when df is large (> 120) the distribution is very close to the standard normal and we can use N(0,1) critical values.

The width of the interval depends on the standard error and the critical value:
- Higher confidence level → larger c → wider CI
- Larger standard error → wider CI

As the significance level falls, the critical value increases, so we require a larger and larger t statistic to reject the null.
As the degrees of freedom get large, the t distribution approaches the standard normal distribution.

TEST: When testing H0 against H1 using a CI, with a 5% significance level:
- We reject H0 at the 5% significance level if the 95% confidence interval does not contain the value aj.

The F-Test
We may wish to test multiple hypotheses about the underlying parameters, for example whether a set of independent variables has no partial effect on the dependent variable, instead of testing one variable at a time.
- To check whether a group of x variables has a JOINT effect on y, this interest can be formulated as the null:
H0: β3 = β4 = β5 = 0
H1: H0 is not true
- Tested using restricted and unrestricted models. The alternative holds if any or all of the coefficients differ from zero.

A test of multiple restrictions is called a multiple hypotheses test or a joint hypotheses test. We need the F statistic to test joint hypotheses:
1) From the unrestricted model (the model with all variables), obtain SSRur.
2) From the restricted model (the model with the q variables dropped), obtain SSRr.
3) Compute the F statistic:
F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]
OR use the R-squared form:
F = [(R²ur − R²r)/q] / [(1 − R²ur)/(n − k − 1)]

The F statistic can never be negative, because SSRr can be no smaller than SSRur.
- q is the numerator degrees of freedom, equal to dfr − dfur.
- n − k − 1 is the denominator degrees of freedom, equal to dfur.

REJECTION rule: Reject H0 in favour of H1 when F > c at the chosen significance level.
- The critical value c depends on q, n − k − 1 and the chosen significance level.
- If H0 is rejected, we say the variables are jointly statistically significant at that significance level.
- If H0 is not rejected, we say the variables are jointly statistically insignificant at that significance level.

It is possible for variables to be jointly significant but individually insignificant, due to highly correlated variables.
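The R-squared form of the F statistic can be sketched as a quick worked example. The numbers for R²ur, R²r, n, k and q are made up for illustration; the helper f_stat is hypothetical, not a library function.

```python
# Sketch of the R-squared form of the F statistic for q exclusion
# restrictions. All numbers below are hypothetical.

def f_stat(r2_ur, r2_r, q, n, k):
    """F = [(R2_ur - R2_r)/q] / [(1 - R2_ur)/(n - k - 1)]."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))

# Suppose n = 100 observations, k = 5 regressors in the unrestricted
# model, and we drop q = 3 of them in the restricted model.
F = f_stat(r2_ur=0.40, r2_r=0.30, q=3, n=100, k=5)
print(round(F, 2))   # -> 5.22
```

F is then compared with the critical value of the F(3, 94) distribution at the chosen significance level; at the 5% level this critical value is well below 5.22, so the three variables would be jointly significant here.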
Computing p-values for F tests
In the context of F testing, the p-value is defined as P(F(q, n−k−1) > F-statistic).
- It is the probability of observing a value of F at least as large as we did, given that the null hypothesis is true.
- It is the probability mass of the F distribution beyond the observed F statistic.
- The smaller the p-value, the greater the evidence against the null.

W5 Multiple Linear Regression: Asymptotics
What do we need for inference? We need the sampling distribution of the OLS estimators:
a) MLR1-4 imply the OLS estimators are unbiased, so the expected value of each OLS estimator equals the population parameter.
b) MLR1-6 (the classical linear model) imply the OLS estimators are normally distributed, as well as being the minimum variance unbiased estimators.
c) Normality gives the exact distributions of the t statistic and the F statistic, which are the basis for inference.

Without MLR6, exact normality does not hold. However, it holds approximately in LARGE samples, so inference can be based on large-sample approximations.

Consistency of the OLS estimators
When β̂j is an estimator of the parameter βj from a sample of size n, β̂j is a consistent estimator of βj if the probability that β̂j differs from βj (by more than any fixed amount) tends to zero as n goes to infinity.
- Consistency comes from the law of large numbers.
- Consistency is not enough for statistical inference; we need a sampling distribution for the OLS estimators.

How to do inference without MLR6?
Asymptotic normality means approximately normally distributed, thanks to a large enough sample size.
- Under MLR1-MLR5 (without MLR6: normality), we can still do inference because of the central limit theorem: when n is large, the OLS estimators are approximately normally distributed.
- n > 30 is often taken as satisfactory for "large" n, although this may not be sufficient for all possible distributions of u; it depends on the distribution of the error term.

Asymptotic normality of the OLS estimators also implies that the F statistic under H0 has an approximate F distribution in large samples.
Thus, joint hypothesis testing can be carried out as before.

Large Sample Test: Wald Statistic — joint hypotheses with large samples
As before, when performing a joint test we have restricted and unrestricted models, from which we obtain SSRr and SSRur respectively. The test statistic is:
W = (SSRr − SSRur) / [SSRur/(n − k − 1)], approximately ~ χ²(q)
Decision rule: reject H0 if W > c, where c is the chi-squared critical value.

Large Sample Test: Lagrange Multiplier
a) The LM statistic requires estimation of the restricted model only: run the restricted regression and save the residuals.
b) Regress the residuals on all the independent variables (the auxiliary regression) and save the R-squared from this regression, Ru².
c) Compute the LM (Lagrange multiplier) statistic: LM = n × Ru² (sample size multiplied by the R-squared from step b).
d) Reject H0 if LM > c, where c is the critical value from a chi-squared distribution.

Functional Forms

Logarithmic forms — rules of thumb:
- When a variable is a positive dollar amount, the log is often taken (wages, prices, sales, market values, etc.).
- Variables such as population, total number of employees, school enrolment and so on often appear in logarithmic form.
- Variables measured in years usually appear in original form.
- Variables that are a proportion or a percentage can be either original or logarithmic, although there is a tendency to use them in original form.

Quadratic forms
Quadratic functions are often used to capture increasing or decreasing marginal effects. For example:
wage = β0 + β1·exper − β2·exper² (with β1, β2 > 0)
- This implies that experience has a diminishing effect on wage.
- When the coefficient on x is positive and the coefficient on x² is negative, the quadratic has a parabolic shape and the turning point is always determined by: x* = β̂1 / (2·β̂2)

Interaction Terms
Sometimes it is natural for the partial effect, elasticity or semi-elasticity of the dependent variable with respect to one explanatory variable to depend on the magnitude of another explanatory variable.
For example: y = β0 + β1x1 + β2x2 + β3x1x2 + u
- The partial effect of x2 on y is linearly related to x1: ∂y/∂x2 = β2 + β3x1
- The term x1x2 is an interaction term between x1 and x2.

Goodness-of-Fit (R-squared)
Even though the R-squared is the proportion of variation in y that is explained by the independent variables, irrelevant regressors can still increase it.

Adjusted R-squared
The adjusted R² is given by:
adjusted R² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]
- This formula depends explicitly on k, the number of independent variables, and thus imposes a penalty for adding additional independent variables to a model.
- We know that R² never falls when a new independent variable is added to a regression equation; the adjusted R² corrects for this.

Regressor Selection
"Overcontrolling" for factors in multiple regression occurs when we control for factors we should not be controlling for, even when the goodness-of-fit favours their inclusion. In some cases it makes no sense to hold some factors fixed, because they should be allowed to change when a policy variable changes.

We should always include independent variables that affect y and are uncorrelated with all the other x variables. WHY? Doing so does not induce multicollinearity, and it reduces the error variance. In large samples, the standard errors of the OLS estimators will be reduced as well.

Prediction
Predictions, although useful, are subject to sampling variation, because they are obtained using the OLS estimates.

Point prediction: What is the predicted y for given values of the x's, that is, E(y|x)?
- Simply plug the given values into the OLS estimated equation.
- It is an expected value of y given x.

Interval prediction: What if we want to measure the uncertainty of this predicted value? It is natural to construct a confidence interval for the true expected value, θ0, centred about the predicted value θ̂0.
- To obtain a CI for θ0 we need a standard error for θ̂0.
- For large samples, the 95% confidence interval is approximately θ̂0 ± 2·se(θ̂0).
- How do we obtain the standard error for θ̂0? Use cj to denote the given value substituted for each xj:
i) First write θ0 = β0 + β1c1 + ... + βkck, so that β0 = θ0 − β1c1 − ... − βkck.
ii) Substitute this into y = β0 + β1x1 + ... + βkxk + u.
iii) This gives y = θ0 + β1(x1 − c1) + ... + βk(xk − ck) + u.
iv) Run the regression of yi on (xi1 − c1), ..., (xik − ck). The OLS estimate of θ0 and its standard error are the intercept and its standard error from this regression.
We can then use this standard error in the computation of the confidence interval.

Predicting y when log(y) is the dependent variable
We must undo the log. However, we cannot just take the exponential of the predicted value of log(y); this will systematically underestimate the expected value of y. If our model follows the CLM assumptions MLR1-MLR6, it can be shown that:
E(y|x) = exp(σ²/2)·exp(β0 + β1x1 + ... + βkxk)
where σ² is the variance of the error term u. [If u ~ Normal(0, σ²), then the expected value of exp(u) is exp(σ²/2).] The equation needed to predict y is:
ŷ = exp(σ̂²/2)·exp(predicted log(y))
- where σ̂² is the unbiased estimator of σ².
- If we only assume that u is independent of the explanatory variables, we can denote the expected value of exp(u) by α0. Given an estimate of α0, we can predict y as:
ŷ = α̂0·exp(predicted log(y))
To obtain the estimate α̂0 of E[exp(u)], we use the method of moments estimator:
α̂0 = n⁻¹ Σ exp(ûi)

Residual Analysis
Sometimes it is useful to examine whether the actual value of the dependent variable is above or below the predicted value, that is, to examine the residuals for individual observations. This is called residual analysis.
- Is yi substantially less than ŷi?
- Look at ûi/σ̂ to see whether the ratio is within 1 standard deviation.
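The method of moments step can be sketched as follows. The residuals and the fitted value of log(y) below are hypothetical numbers, not estimates from a real regression; the point is that α̂0 exceeds 1 (exp is convex and the residuals average zero), which is exactly why exponentiating the fitted log(y) alone underpredicts y.

```python
import math

# Sketch of the adjustment when log(y) is the dependent variable:
# y_hat = alpha0_hat * exp(logy_hat), with alpha0_hat = mean of exp(u_hat_i).
# Residuals and fitted value below are hypothetical.
log_resids = [0.5, -0.3, 0.1, -0.2, 0.4, -0.5]        # hypothetical u_hat_i (sum to 0)
alpha0_hat = sum(math.exp(u) for u in log_resids) / len(log_resids)

logy_hat = 2.0                                         # hypothetical fitted log(y)
y_naive = math.exp(logy_hat)                           # systematically too small
y_pred = alpha0_hat * math.exp(logy_hat)               # adjusted prediction

print(round(alpha0_hat, 4), y_pred > y_naive)
```

Here α̂0 ≈ 1.07, so the naive prediction exp(predicted log(y)) would understate y by about 7% in this made-up example.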
W6 MLR: Binary Variables
Much information comes in binary form, and it is appropriate to describe it with binary variables, or dummy variables. We can use such dummy variables to incorporate qualitative information into regression models. For example:
wage = β0 + δ0·female + β1·educ + u
- δ0 is the difference in hourly wage between females and males, where female = 1 for women and female = 0 for men (thus male is the base group), given the same amount of education.
- Therefore δ0 determines whether there is discrimination against women: if δ0 < 0, women earn less than men on average.

The KEY idea is that the level of education is the same in both expectations; the difference is therefore due to gender only.
- The coefficient on the dummy variable female measures the average difference in hourly wage between the two gender groups, given the same level of education.
- Use only ONE dummy variable to describe the presence of two groups, as introducing two would induce perfect collinearity.

Generally, if β̂1 is the coefficient on a dummy variable x1 and log(y) is the dependent variable, the exact percentage difference in the predicted y when x1 = 1 versus when x1 = 0 is given by:
100·[exp(β̂1) − 1]
- This gives a more accurate estimate of the percentage difference by which y for x1 = 1 differs from y for x1 = 0.

Interpreting dummy variables when the dependent variable is log(y)
When a dummy variable appears in a model with a log dependent variable, its coefficient is interpreted as a percentage difference.
- GIVEN the same values of the other independent variables, the coefficient δ0 on the dummy shows the (approximate) percentage difference in y between the two groups.

This is relevant for policy analysis, as the coefficient on the dummy variable is the ceteris paribus, or causal, effect of the policy. The dummy variable should represent the treatment group, while other factors should be included as controls.
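The exact-percentage formula can be checked with a small sketch. The dummy coefficient of −0.25 in a log(wage) equation is hypothetical, chosen only to show the gap between the approximate and exact readings.

```python
import math

# Approximate vs exact percentage difference implied by a dummy
# coefficient in a log(y) model. Coefficient value is hypothetical.
b1 = -0.25                         # e.g. coefficient on a dummy in log(wage)
approx_pct = 100 * b1              # naive reading: -25%
exact_pct = 100 * (math.exp(b1) - 1)

print(round(exact_pct, 1))         # -> -22.1
```

The naive reading overstates the gap: the exact differential is about −22.1%, not −25%, and the discrepancy grows with the magnitude of the coefficient.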
Dummy variables for multiple categories
We can also use several dummy variables in the same equation when we have more than two categories.
- Generally, if we have g groups, we require g − 1 dummy variables, with the intercept capturing the base group.
- The coefficient on each dummy variable measures the proportionate difference in the dependent variable RELATIVE to the base group, holding all other factors fixed. E.g. married men are estimated to earn 20% more than single men, holding levels of education and experience fixed.
- Thus, the coefficients on the dummies are the differences in intercepts between each group and the base group.

Dummy variables and ordinal information
Variables that take multiple values where the order matters are known as ordinal variables. How do we incorporate them?
1) One possibility: y = β0 + β1·(ordinal variable) + other factors. However, this assumes that every one-unit increase in the ordinal variable has a constant effect on the dependent variable, so it may not be a good approach.
2) A better approach: y = β0 + δ1·(ordinal value 1) + δ2·(ordinal value 2) + δ3·(ordinal value 3) + ...
- We use a dummy variable for each value.
- Once again, if the ordinal variable takes g values we use g − 1 dummy variables, with the remaining value as the BASE GROUP (all dummies = 0).
- Each coefficient is the difference in the dependent variable between an observation with that ordinal value and an observation in the base group, holding all other factors fixed.
3) When the ordinal variable takes too many values, it is infeasible to include a dummy for each. In this case, use RANGES:
- Define a dummy variable for each range of values.
- The range excluded from the dummies is the BASE GROUP.
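The g − 1 encoding can be sketched in code. The category labels and the dummies helper are hypothetical, chosen to mirror the marital-status example: with g = 4 groups, one group is the base and each observation gets three 0/1 indicators.

```python
# Sketch: encoding a g-category variable with g - 1 dummies; the base
# group is absorbed by the intercept. Labels and helper are hypothetical.

def dummies(value, categories, base):
    """Return the g-1 dummy indicators for one observation."""
    return {c: int(value == c) for c in categories if c != base}

cats = ["single", "married", "divorced", "widowed"]    # g = 4 groups
row = dummies("married", cats, base="single")          # 3 dummies, not 4
print(row)
```

Including all four dummies alongside an intercept would reproduce the perfect-collinearity problem noted above, since the four indicators always sum to 1.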
Dummy variables and interactions
Consider the following estimated model:
log(wage)-hat = 0.177·cwork + 0.070·chome + 0.017·cwork·chome + other factors
Base group: those who do not use a computer at all.
- 17.7%: the approximate wage differential for having a computer at work only, holding all other factors fixed.
- 7.0%: the approximate wage differential for having a computer at home only, holding all other factors fixed.
- For those with a computer both at home and at work, the differential is the sum 0.177 + 0.070 + 0.017 = 0.264, about 26.4%, holding all other factors fixed.
The variable cwork·chome is the interaction term. Interaction terms allow the effect of one characteristic to depend on another characteristic, for example, allowing the marriage premium to depend on gender.

Allowing for different slopes
There may be occasion to interact dummy variables with explanatory variables that are not dummies, to allow for a "difference in slopes".
- To test whether the return to a certain variable is the same for each category of a dummy variable, allowing for a constant differential.
- For example, test whether the return to education is the same for men and women, allowing for a constant wage differential:
log(wage) = (β0 + δ0·female) + (β1 + δ1·female)·educ + u
- If female = 0, the intercept for males is β0 and the slope of education for males is β1.
- If female = 1, the intercept for females is β0 + δ0 and the slope of education for females is β1 + δ1.
- Therefore δ0 measures the difference in intercepts between females and males, and δ1 measures the difference in slopes (return to education) between females and males.
[Refer to graph 7.2 on page 240]

There are four different cases depending on the signs of δ0 and δ1:
- If δ0 < 0 and δ1 < 0, the intercept and slope for women are both smaller than for men: women earn less than men at all levels of education, and this gap increases as education increases.
If δ0 < 0 and δ1 > 0: the intercept for women is smaller than for men but the slope is larger. Women earn less than men at low levels of education, but the gap narrows as education increases, and at some point women earn more for the same level of education. The discussion is similar for the other sign combinations.
To estimate such a model: wage = β0 + δ0 female + β1 educ + δ1 female·educ + u
Two hypotheses can be tested:
a) The return to education is the same for men and women (H0: δ1 = 0)
b) Expected wages are the same for men and women given the same level of education (H0: δ0 = 0 and δ1 = 0)
Linear Probability Model
When the dependent variable is a binary response (y = 0 or y = 1), the linear model is called a linear probability model (LPM).
P(y=1|x) = E(y|x)
- The probability of "success", i.e. the probability that y = 1, equals the expected value of y.
- When the ZCM assumption holds, the population regression function is the probability of success (y = 1) given the independent variables.
- P(y=1|x) is known as the response probability.
The parameter βj measures the change in the probability of success when xj changes by one unit, holding all other factors fixed.
Shortcomings of the LPM:
- For certain combinations of values of the independent variables, the fitted value can be less than 0 or greater than 1.
- Furthermore, due to the binary nature of y, the LPM violates the homoskedasticity assumption.
W7: Heteroskedasticity
Recall MLR5 (homoskedasticity): the variance of the error term does not depend on the values of the independent variables.
- Without MLR5, the estimators of the VARIANCES of the OLS estimators are biased, although the OLS estimators themselves are not.
- Without MLR5, the usual standard errors are incorrect, and the t-stat and F-stat do not follow their usual distributions, which may lead to wrong conclusions.
- Without Var(u|x) being constant, OLS is no longer BLUE.
- Without MLR5, OLS is no longer asymptotically efficient.
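The consequences above can be illustrated numerically. A minimal sketch with simulated data (all numbers are made up; the "sandwich" formula Var(b) = (X'X)⁻¹ X' diag(û²) X (X'X)⁻¹ is the standard White/HC0 estimator, not something specific to these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1 + 0.5 * x)   # error variance grows with x: heteroskedasticity
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
b = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS estimates (still unbiased)
resid = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
# usual variance estimate, valid only under homoskedasticity (MLR5)
s2 = resid @ resid / (n - 2)
se_usual = np.sqrt(np.diag(s2 * XtX_inv))
# heteroskedasticity-robust (HC0) variance estimate
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_usual, se_robust)  # robust SEs can be either larger or smaller
```

The slope estimate stays close to its true value of 2 — heteroskedasticity biases the standard errors, not the coefficients.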
Heteroskedasticity-Robust Inference
It is possible to adjust OLS standard errors and t, F and LM statistics so that they are valid in the presence of heteroskedasticity of unknown form.
- Convenient, as we can report statistics that work regardless of the kind of heteroskedasticity present in the population.
- These methods are known as heteroskedasticity-robust procedures because they are valid whether or
not the errors have constant variance.
Usual standard errors v robust standard errors
- If homoskedasticity holds and the errors are normally distributed, the usual t-statistics follow an EXACT t-distribution regardless of n.
- Robust standard errors and t-statistics are justified ONLY if n is large. If n is small, the robust distributions can be very far from the t-distribution.
- Robust standard errors can be either smaller or larger than the usual standard errors.
Heteroskedasticity Tests
To detect the presence of heteroskedasticity, several tests are available. Assume MLR1–MLR4 are maintained.
H0: Var(u|x1, x2, ..., xk) = σ²
The null is the assumption that MLR5 is true. If we cannot reject it at a sufficiently small significance level, we conclude that heteroskedasticity is not a problem. Because E(u|x) = 0, the null is equivalent to:
H0: E(u²|x1, x2, ..., xk) = E(u²) = σ²
If H0 is false, the expected value of u² given the independent variables can be virtually any function of the xj. Model it linearly:
u² = δ0 + δ1x1 + δ2x2 + ... + δkxk + v (v = error term with mean zero)
The null can therefore be written as: H0: δ1 = δ2 = ... = δk = 0
Breusch–Pagan Test
1) Estimate y = β0 + β1x1 + β2x2 + ... + βkxk + u by OLS as usual. Obtain the squared OLS residuals û² from this regression.
2) Run the regression û² = δ0 + δ1x1 + δ2x2 + ... + δkxk + v. Keep the R-squared from this regression, R²û². (Remember: if MLR5 fails, the expected value of the squared error can be virtually any function of the xj.)
3) Form either the Lagrange Multiplier statistic or the F-statistic, where R²û² is obtained from step 2:
LM = n·R²û²
F = [R²û² / k] / [(1 − R²û²) / (n − k − 1)] ~ F(k, n−k−1)
4) Reject the null in favour of the alternative (there is heteroskedasticity) if LM (or F) is large compared to the critical value, i.e. lies in the rejection region.
The White Test
1) Estimate y = β0 + β1x1 + β2x2 + ... + βkxk + u by OLS as usual. Obtain the squared OLS residuals û² and the fitted values ŷ from this regression.
2) Run the regression û² = δ0 + δ1ŷ + δ2ŷ²
+ error, and save the R² from this regression, R²û².
3) Form either the F-statistic or the LM statistic (here there are k = 2 regressors, ŷ and ŷ²):
LM = n·R²û²
F = [R²û² / k] / [(1 − R²û²) / (n − k − 1)] ~ F(k, n−k−1)
4) Reject H0 in favour of the alternative (heteroskedasticity exists) if F or LM is too large.
Weighted Least Squares Estimation
One possible response to detecting heteroskedasticity is to use robust standard errors. Another is to use the "weighted least squares" method.
- WLS is more efficient than OLS, and WLS leads to new t- and F-statistics that have t and F distributions.
- If the form of the heteroskedasticity is known, utilise this knowledge in estimation.
Assume Var(u|x) = σ²h(x), where h(x) is some function of the explanatory variables that determines the heteroskedasticity. Since variances must be positive, h(x) > 0 for all possible values of x. Assume h(x) is known.
There is no heteroskedasticity in u/√h(x), since Var(u/√h(x) | x) = σ². Hence, if all variables are weighted by 1/√h(x), the model regains homoskedasticity. Write hi = h(xi) for the ith observation.
i) Take yi = β0 + β1xi1 + β2xi2 + ... + βkxik + ui, which contains heteroskedastic errors, and transform it by dividing through by √hi. Since hi is a function of xi, ui/√hi has ZERO expected value conditional on xi. Hence, once all the variables are weighted, homoskedasticity is restored.
ii) The estimators that result from this weighting are GENERALIZED LEAST SQUARES (GLS) estimators. The GLS estimators for correcting heteroskedasticity are called weighted least squares (WLS) estimators.
- The idea is that less weight is given to observations with higher error variance.
- WLS estimators are more efficient than OLS in the presence of heteroskedasticity.
Feasible GLS
In most cases, the exact form of h(xi) is not known. We can model the function h and use the data to estimate it. Using ĥi instead of hi gives the Feasible GLS (FGLS) estimator.
1) Regress y on x1, x2, ..., xk and obtain the OLS residuals û.
2) Create log(û²) by first squaring the OLS residuals and then taking the natural log.
3) Regress log(û²) = δ0 + δ1x1 + ... + δkxk + error, and save the fitted values ĝi.
4) Set ĥi = exp(ĝi).
5) Estimate the original equation by WLS, using weights 1/ĥi.
What if the h(x) function is wrong?
- FGLS can be coupled with heteroskedasticity-robust standard errors.
- It is better to use FGLS with an incorrect h(x) than to ignore heteroskedasticity altogether.
- Any function of x is uncorrelated with ui, so the weighted error is uncorrelated with the weighted regressors; therefore a misspecification of h(x) does not cause bias or inconsistency in the WLS estimator.
LPM Revisited
When the dependent variable y is binary, the model must contain heteroskedasticity unless all of the slope parameters are zero. We are now in a position to deal with this problem.
1) The simplest fix is to estimate by OLS as usual but to compute robust standard errors for the test statistics. This ignores that we actually know the form of the heteroskedasticity.
2) However, the OLS estimators are generally inefficient in the LPM. Here
Var(y|x) = p(x)[1 − p(x)]
where p(x) = β0 + β1x1 + ... + βkxk is the response probability (of success, y = 1). The probability p(x) depends on the unknown population parameters, but we have unbiased OLS estimators of those parameters. For each observation i, Var(yi|xi) is estimated by ĥi = ŷi(1 − ŷi), where ŷi is the OLS fitted value for observation i.
Estimating the LPM using WLS:
1. Estimate the model by OLS and obtain the fitted values ŷ.
2. Check whether all fitted values lie inside the unit interval. If so, proceed to step 3; if not, an adjustment is needed to bring all fitted values into the unit interval.
3. Construct the estimated variances ĥi = ŷi(1 − ŷi).
4.
Estimate the regression equation by WLS, using weights 1/ĥi.
The FGLS estimators still estimate the original parameters, and the interpretation of the parameters is unchanged. FGLS is consistent, asymptotically normal, and more efficient than OLS in the presence of heteroskedasticity. The associated t-stat and F-stat have the usual t and F distributions for large n.
W8: Specification and Data Issues
Functional Form Misspecification
An MLR model suffers from functional form misspecification when it does not properly account for the relationship between the dependent variable and the observed explanatory variables. This leads to biased OLS estimators. Examples:
- Omitting functions of the independent variables
- Misspecifying the functional form of the dependent variable
- Omitting interaction terms
RESET Test ("Regression Specification Error Test")
A test for general functional form misspecification is the RESET test. The idea:
- If the original model satisfies MLR4, then no nonlinear functions of the independent variables should be significant when added to the equation. If they are significant, the current model must be misspecified.
- In particular, test whether the squared and cubed fitted values — which are nonlinear functions of the x's — are significant. They should be insignificant when added to a correctly specified model.
1) Estimate the original OLS model and save the fitted values ŷ.
2) Test H0: δ1 = 0, δ2 = 0 in the expanded model, where the squared and cubed fitted values are added: y = β0 + β1x1 + ... + βkxk + δ1ŷ² + δ2ŷ³ + error. Use the F-stat to test.
3) Reject H0 when the F-stat is greater than the critical value c.
Nonnested models
Nested models: one model (the restricted model) is a special case of the other model (the unrestricted model).
- Can use the usual exclusion F-test to test for misspecification.
Nonnested models: the two models are not special cases of one another.
The usual exclusion test is not applicable. There are two methods to test nonnested models:
1) Construct a comprehensive model that contains each model as a special case, and then test the restrictions that lead to each model. In other words, create a model that nests both nonnested models and use the F-test as usual. Consider the two nonnested models:
(i) y = β0 + β1x1 + β2x2 + u
(ii) y = β0 + β1log(x1) + β2log(x2) + u
The comprehensive model that nests both is:
y = γ0 + γ1x1 + γ2x2 + γ3log(x1) + γ4log(x2) + u
The two tests are thus H0: γ3 = γ4 = 0 and H0: γ1 = γ2 = 0.
Failing to reject the first supports model (i); failing to reject the second supports model (ii).
2) Test the significance of one model's fitted values in the other model. Let the models yield the fitted values ĝ and ĥ respectively.
i) If the first model is correct, then ĥ (the fitted value from the second model) should be insignificant in:
y = β0 + β1x1 + β2x2 + θ1ĥ + u
We test the significance of ĥ; rejecting H0: θ1 = 0 is a rejection of the first model.
ii) Similarly, if the second model is correct, then ĝ (the fitted value from the first model) should be insignificant in:
y = β0 + β1log(x1) + β2log(x2) + θ1ĝ + u
We test the significance of ĝ; rejecting H0: θ1 = 0 is a rejection of the second model.
These tests are called the Davidson–MacKinnon test, based on the t-statistic on the added fitted values in each of the nonnested equations.
- The test has problems; for example, a clear winner may not emerge.
- If neither model is rejected, use the adjusted R-squared to choose.
- If both models are rejected, more work needs to be done.
Unobserved Explanatory Variables and Proxies
If a key variable is unobserved and omitted, the OLS estimators will generally be biased. The omitted-variable bias can be reduced by using a proxy variable in place of the unobserved variable.
- For example, use IQ as a proxy for ability.
- A proxy variable is an observable variable related to the unobserved variable that we would like to control for in our analysis.
- The proxy must be correlated with the unobserved variable.
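The bias-reduction idea can be sketched with a small simulation. This is a hypothetical illustration (all parameter values are made up): ability is unobserved and positively correlated with education, so omitting it biases the education coefficient upward; adding IQ as a proxy soaks up part of that bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
ability = rng.normal(0, 1, n)
educ = 12 + 2 * ability + rng.normal(0, 1, n)   # education correlated with ability
iq = 100 + 15 * ability + rng.normal(0, 5, n)   # proxy: IQ tracks ability with noise
logwage = 1.0 + 0.08 * educ + 0.5 * ability + rng.normal(0, 0.3, n)

def ols_slope_educ(cols):
    """OLS of logwage on an intercept plus the given columns; return educ slope."""
    X = np.column_stack([np.ones(n)] + cols)
    return np.linalg.lstsq(X, logwage, rcond=None)[0][1]

b_omit = ols_slope_educ([educ])        # ability omitted: biased above the true 0.08
b_proxy = ols_slope_educ([educ, iq])   # IQ included as a proxy: bias shrinks
print(b_omit, b_proxy)
```

The proxy does not remove the bias entirely (IQ measures ability with noise), but the estimate with the proxy lands much closer to the true return of 0.08 than the omitted-variable estimate.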
Conditions for a valid proxy
A valid proxy x3 for a key unobserved variable x3* must satisfy:
1) ZCM holds with respect to the observed variables, the unobserved variable and the proxy; that is, the error term is uncorrelated with all of them:
E(u | x1, x2, x3, x3*) = 0
2) Once x3 is controlled for, the conditional mean of x3* does not depend on x1 and x2:
E(x3* | x1, x2, x3) = E(x3* | x3) = δ0 + δ3x3
This states that, once the proxy x3 is controlled for, the expected value of the unobserved x3* does NOT depend on x1 or x2. Alternatively, x3* has zero correlation with x1 and x2 once x3 is partialled out. The average level of x3* (the unobserved variable) only changes with x3 (the proxy), not with any of the other independent variables. This condition implies:
x3* = δ0 + δ3x3 + v
where v is uncorrelated with x1, x2 and x3.
Under conditions 1) and 2), OLS applied to the equation with the proxy substituted for the unobserved variable yields unbiased estimators of the coefficients of interest.
Properties of OLS under Measurement Error
Sometimes we cannot collect data on the variable that truly affects economic behaviour. When we use an imprecise measure of an economic variable in a regression model, the model contains measurement error. The difference between measurement error and a proxy variable: in the measurement-error case, the variable we do not observe has a well-defined, quantitative meaning, but our recorded measure of it may contain error. E.g. reported annual income is a measure of actual annual income, whereas an IQ score is a proxy for ability.
Error in the dependent variable
The dependent variable is measured with error. Let y* denote the variable we would like to explain, and assume the equation satisfies the Gauss–Markov assumptions:
y* = β0 + β1x1 + ... + βkxk + u
Let y represent the observable measure of y*. Due to our inability to collect accurate data, we can expect y and y* to differ.
The measurement error is e0 = y − y*. Substituting y* = y − e0 into the equation gives:
y = β0 + β1x1 + ... + βkxk + (u + e0), where (u + e0) is the new error term.
1) y and the independent variables are observed, so we can estimate this model by OLS.
2) Since the original model satisfies MLR1–MLR5, u has zero mean and is uncorrelated with the independent variables; we can also assume the measurement error (now part of the error term in this new regression) has zero mean.
- If e0 is also independent of the x's, ZCM holds and the OLS estimators are unbiased and consistent.
- If e0 is correlated with the x's, it causes bias in the estimators.
Error in the independent variables
Similarly, here the measurement error is e1 = x1 − x1*, which can be positive, negative or zero. We assume the average measurement error in the population is zero: E(e1) = 0.
To estimate y = β0 + β1x1* + ... + βkxk + u, substitute x1* = x1 − e1 to obtain
y = β0 + β1x1 + ... + βkxk + (u − β1e1)
- If the measurement error is independent of the observed x1 as well as all the other x variables, ZCM holds and the OLS estimators are unbiased and consistent.
- The classical errors-in-variables (CEV) assumption is that the measurement error is uncorrelated with the UNOBSERVED explanatory variable: Cov(x1*, e1) = 0.
If the CEV assumption holds, then:
Cov(x1, e1) = E(x1e1) = E(x1*e1) + E(e1²) = σ²e1 > 0
and ZCM fails to hold. The OLS estimator of β1 is biased towards zero (attenuation bias).
Missing Data
When key data are missing for certain units in our random sample, we cannot include those observations in the regression, so we drop them.
- Dropping observations with missing values only reduces the sample size; there is no violation of MLR1–MLR5.
- If the data are missing at random, the sample size is simply reduced with no violation of MLR2 (the random sampling assumption).
- However, if the missing data follow a systematic pattern, the sample becomes nonrandom and MLR2 is violated.
Nonrandom samples
Exogenous sample selection: if the sample is chosen on the basis of the independent variables, there is no bias; E(u|x1, x2, x3) = 0 still holds.
Endogenous sample selection: if the sample is chosen on the basis of the dependent variable, there is bias, as ZCM no longer holds.
Outliers
Outliers are "unusual" observations far from the "centre" of the data.
- Outliers may have a strong influence on the OLS estimation results.
- If the inclusion of one or several observations causes a large change in the OLS estimates, these are known as influential observations.
- OLS is generally sensitive to outlying observations because large residuals, once squared, receive much more weight in OLS.
- In practice, report OLS results with and without the outliers.
W10: Regression with Time Series
Nature of Time Series Data
- Temporal ordering, as opposed to cross-sectional data.
- A sequence of random variables indexed by time is called a stochastic process or time series process.
- When we collect a time series data set, we obtain one possible outcome, or realization, of the stochastic process.
Examples of Time Series Models
Static models: suppose we have time series data on two variables y and z, where yt and zt are dated at the same period of time. A static model relating y to z is:
yt = β0 + β1zt + ut, t = 1, 2, ..., n
- The name "static model" comes from the fact that we are modelling a contemporaneous (same-period) relationship between y and z.
- Usually, a static model is postulated when a change in z at time t is believed to have an immediate (thus contemporaneous) effect on y.
Finite Distributed Lag models: we allow one or more variables to affect y with a lag.
yt = α0 + δ0zt + δ1z(t−1) + ...
+ δq z(t−q) + ut
- This model is called the FDL of order q.
- The partial effect of z(t−j) on yt is δj, j = 0, 1, ..., q (holding everything else constant). When we graph δj as a function of j, we obtain the lag distribution, which summarizes the dynamic effect of a temporary increase in z on y.
- δ0 is called the impact propensity or impact multiplier: the immediate change in y due to a one-unit increase in z at time t (the short-run propensity).
For example: gfr(t) = α0 + δ0 pe(t) + δ1 pe(t−1) + δ2 pe(t−2) + ut
gfr = general fertility rate; pe = real dollar value of the personal tax exemption.
- This model captures the dynamic effect of a temporary increase in the personal tax exemption (z) on the general fertility rate (y). Note the increase in z is temporary.
When there is a permanent one-unit shift in z at time t (zs = 0 for s < t, zs = 1 for s ≥ t):
- The eventual effect is δ0 + δ1 + ... + δq, known as the long-run propensity (LRP) or long-run multiplier.
- The sum of the coefficients on current and lagged z is the long-run change in y given a permanent increase in z.
To estimate the LRP directly, the model is reparameterised:
yt = α0 + θ0zt + δ1(z(t−1) − zt) + ... + δq(z(t−q) − zt) + ut
Regressing yt on these variables provides a direct estimate of the LRP, θ0 = δ0 + δ1 + ... + δq.
Time series data v cross-sectional data
- Observations are collected on the same objects at different points in time, so there is no "random sampling".
- Purposes of time series analysis include measuring associations between variables and forecasting with current information.
Assumptions for time series regression
TS1: Linear in parameters. The stochastic process (xt1, xt2, ..., xtk, yt) follows the linear model yt = β0 + β1xt1 + ... + βkxtk + ut, where ut is the sequence of errors or disturbances. Here n is the number of observations (time periods).
- In the notation xtj, t denotes the time period and j indexes the k explanatory variables.
TS2: No perfect collinearity. In the sample (and therefore in the underlying time series process), no independent variable is constant or a perfect linear combination of the others.
- Explanatory variables may be correlated, just not perfectly correlated.
TS3: Strict exogeneity (zero conditional mean). For each t, the expected value of the error ut, given the explanatory variables for ALL time periods, is zero:
E(ut | X) = 0, t = 1, 2, ..., n
- This implies that the error term at time t is uncorrelated with every explanatory variable in every time period. When TS3 holds, we say the explanatory variables are strictly exogenous.
- The weaker condition E(ut | xt) = 0 is contemporaneous exogeneity: it implies only that ut and the explanatory variables dated at time t are uncorrelated, Corr(xtj, ut) = 0 for all j.
- Contemporaneous exogeneity is sufficient for proving consistency of the OLS estimators, but strict exogeneity (TS3) is required for OLS to be unbiased.
TS4: Homoskedasticity. Conditional on X, the variance of ut is the same for all t: Var(ut | X) = Var(ut) = σ², t = 1, 2, ..., n.
- The variance of the error term at time t is independent of the independent variables at t, and is constant over time.
TS5: No serial correlation. Conditional on X, the errors in two different time periods are uncorrelated: Corr(ut, us | X) = 0 for all t ≠ s.
- The condition can also be stated unconditionally. When it fails, we say the errors suffer from serial correlation or autocorrelation, because they are correlated across time.
TS6: Normality. The errors ut are independent of X and are independently and identically distributed as Normal(0, σ²).
- TS6 implies TS3, TS4 and TS5.
- A strong assumption, required when dealing with small samples; not needed when samples are large, due to asymptotic normality.
Properties of OLS estimators
Theorem 10.1 (Unbiasedness of OLS): Under TS1 (linearity), TS2 (no perfect collinearity) and TS3 (strict exogeneity), the OLS estimators are unbiased conditional on X, and therefore unconditionally as well: E(β̂j) = βj, j = 0, 1, ..., k.
Theorem 10.2 (Sampling variance): Under the time series Gauss–Markov assumptions TS1–TS5, the variance of β̂j conditional on X is
Var(β̂j | X) = σ² / [SSTj(1 − Rj²)]
Theorem 10.3 (Unbiased estimation of σ²): Under TS1–TS5, E(σ̂²) = σ², where σ̂² = SSR/(n − k − 1).
Theorem 10.4 (Gauss–Markov theorem): Under TS1–TS5, the OLS estimators are the best linear unbiased estimators (BLUE) conditional on X.
Theorem 10.5 (Normal sampling distributions): Under TS1–TS6, the CLM assumptions for time series, the OLS estimators are normally distributed conditional on X. Furthermore, the F-stat and the t-stat follow their exact F and t distributions.
Because MLR2 (random sampling) does not hold for time series data, these finite-sample inference tools are valid only under the strong set of assumptions TS1–TS6.
- While TS3–TS6 are often too restrictive, they can be relaxed for larger samples.
Trend and Seasonality in Time Series
Many economic time series have a common tendency to grow or shrink over time. We must recognize that some series contain a time trend in order to draw causal inferences from time series data.
- A time trend is a linear function of time, acting as a proxy for unobserved trending factors.
- Time series variables may appear to be correlated only because of these trends.
- We must be able to distinguish a genuinely growing relationship from spurious relationships induced purely by time trends.
Including a time trend in a regression model can be interpreted as detrending the original data series before using them in the regression analysis.
Seasonality
Some time series also show seasonal patterns, caused by weather or institutional arrangements.
To account for seasonality in time series, seasonal dummy variables may be used (deseasonalizing). Series that display seasonal patterns are often seasonally adjusted before being reported for public use, meaning the seasonal factors have been removed.
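The trend and seasonal-dummy approach can be sketched as follows. This is an illustrative simulation (the data and parameter values are made up): the regression includes an intercept, a linear time trend, and three quarterly dummies with quarter 1 as the base group.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 120                      # 30 years of quarterly observations
t = np.arange(T)
quarter = t % 4              # quarter index 0..3; quarter 0 is the base group
season = np.array([0.0, 0.5, -0.3, 0.2])[quarter]   # true seasonal effects
y = 2.0 + 0.05 * t + season + rng.normal(0, 0.2, T)

# regressors: intercept, linear time trend, and 3 seasonal dummies
dummies = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3)])
X = np.column_stack([np.ones(T), t, dummies])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)   # roughly [2.0, 0.05, 0.5, -0.3, 0.2]
```

The coefficient on t recovers the trend, and each dummy coefficient is the seasonal shift relative to the base quarter — the regression simultaneously detrends and deseasonalizes.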
This note was uploaded on 10/05/2011 for the course ECON 2206 taught by Professor Yang during the Three '11 term at University of New South Wales.