day8lm - PADP 8130: Linear Models, Data Problems (3/16/12)

PADP 8130: Linear Models
Data Problems
Angela Fertig, Ph.D. (3/16/12)

Data problems we will discuss today
•  Multicollinearity
•  Measurement error
•  Outliers
•  Missing observations

Multicollinearity
•  Definition: independent variables are so closely related to each other that there are linear relationships among them.
–  When one of them increases, the others increase as well, making it difficult to work out the separate effect of each predictor.
–  This is common in social science because our variables often "overlap" a lot.

Example
When you ask people about their various attitudes, many attitudes are highly correlated with each other (e.g. defense and foreign policy).

[Figure: scatter plot of Defence policy (0-10) against Foreign policy (0-10). The correlation between our two independent variables is 0.98. When foreign policy goes up, defence policy goes up; we can't work out what happens when foreign policy goes up and defence stays the same.]

Symptoms of multicollinearity
•  Coefficients have high standard errors
•  Coefficients have the wrong sign or implausible magnitudes
•  Small changes in the data lead to wide swings in the coefficients

Ways to check for multicollinearity
1.  Calculate the variance inflation factor (VIF); a variable with a VIF above 10 may be a linear combination of other independent variables (see the sketch after this section).
2.  Regress x1 on the rest of the regressors; if R^2 is close to 1, multicollinearity is likely.

Consequences of multicollinearity
Perfect multicollinearity → the OLS estimate is not defined, because X is not full rank and X'X is therefore not invertible.
Near-perfect multicollinearity → high variance of the OLS estimates.

Model: $y = z\beta_1 + w\beta_2 + \varepsilon$

$$\operatorname{Var}\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \sigma^2 (X'X)^{-1} = \sigma^2 \begin{pmatrix} \sum z^2 & \sum wz \\ \sum wz & \sum w^2 \end{pmatrix}^{-1}$$

$$\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2 \sum w^2}{\sum z^2 \sum w^2 - \left(\sum wz\right)^2}$$

If w and z are orthogonal, then $\operatorname{Var}(\hat\beta_1) = \sigma^2 / \sum z^2$ (fine). If not,

$$\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum z^2 - \frac{(\sum wz)^2}{\sum w^2}} = \frac{\sigma^2}{\sum z^2 \left(1 - \frac{(\sum wz)^2}{\sum z^2 \sum w^2}\right)} = \frac{\sigma^2}{\sum z^2 (1 - \rho_{wz}^2)}$$

As $\rho_{wz} \to 1$, $\operatorname{Var}(\hat\beta_1) \to \infty$.

Cure for multicollinearity
•  Drop one of the highly correlated independent variables (this makes sense when the causal relationships are clear).
•  Make a scale from the highly correlated independent variables (this makes sense when there is an underlying variable that we haven't or can't measure).
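As an illustration of check #1 above, here is a minimal sketch of the VIF calculation on simulated data, assuming statsmodels is available; the variable names and the degree of collinearity are invented for the example.

```python
# Minimal VIF sketch on simulated, nearly collinear data (names invented).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
foreign = rng.normal(size=n)                        # "foreign policy" attitude
defence = foreign + rng.normal(scale=0.2, size=n)   # nearly collinear with it
income = rng.normal(size=n)                         # an unrelated regressor

X = sm.add_constant(np.column_stack([foreign, defence, income]))
for i, name in enumerate(["foreign", "defence", "income"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")
# Expect VIFs far above 10 for foreign and defence, but near 1 for income.
```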
Measurement error (aka errors-in-variables)
•  Definition: measurement error (ME) arises when the variable you need cannot be measured accurately, or when only a proxy for the true variable is available.
•  Consequences:
–  ME in the dependent variable results in higher variance, but the estimate is still unbiased.
–  ME in the independent variables results in biased estimates; in particular, the estimate is attenuated (biased toward 0).

ME in the dependent variable
True relationship: $y = X\beta + e$, where $\operatorname{Var}(e) = \sigma_\varepsilon^2$.
Observed: $y^* = y + m$, where $E(m) = 0$ and $\operatorname{Var}(m) = \sigma_m^2$.
Regress: $y^* = X\beta + u$, where $u = e + m$ and $\operatorname{Var}(u) = \sigma_\varepsilon^2 + \sigma_m^2$. Higher variance!

$$b = (X'X)^{-1}X'y^* = (X'X)^{-1}X'(y + m), \qquad E(b) = \beta.$$

Unbiased!

ME in the independent variable
Observed: $x^* = x + m$, where $E(m) = 0$, $E(mx) = 0$, $E(my) = 0$, and $\operatorname{Var}(m) = \sigma_m^2$.
Regressed: $y = x^*\beta + e$.

$$b = \frac{\sum yx^*}{\sum x^{*2}} = \frac{\sum y(x+m)}{\sum (x+m)^2} = \frac{\sum yx + \sum ym}{\sum x^2 + 2\sum xm + \sum m^2} \quad \text{Biased!}$$

$$E(b) \approx \frac{E\left(\sum yx\right)}{E\left(\sum x^2\right) + E\left(\sum m^2\right)} = \frac{Q\beta}{Q + \sigma_m^2}, \qquad \text{where } Q = E\Big(\sum x^2\Big).$$

The estimate is pulled toward 0 because $Q/(Q + \sigma_m^2) < 1$. Attenuated!

Cure for measurement error
The most common solution is to use instrumental variables, which we will discuss in a few weeks.
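A quick simulation makes the attenuation result concrete. This is a minimal sketch assuming the setup above, with $\operatorname{Var}(x) = \sigma_m^2 = 1$, so the attenuation factor $Q/(Q + \sigma_m^2)$ should be about 0.5:

```python
# Simulate attenuation bias: regress y on the error-ridden x* = x + m.
import numpy as np

rng = np.random.default_rng(1)
n, beta = 100_000, 2.0
x = rng.normal(size=n)                 # true regressor, Var(x) = 1
y = beta * x + rng.normal(size=n)      # true relationship y = x*beta + e
m = rng.normal(size=n)                 # measurement error, Var(m) = 1
x_star = x + m                         # the observed regressor

b = (x_star @ y) / (x_star @ x_star)   # OLS slope of y on x* (no intercept)
print(f"true beta = {beta}, estimate = {b:.3f}")   # roughly 1.0: attenuated
```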
Outliers
•  Definition: outliers are individual observations that do not fit our predictions.
–  These "outlying" observations can sometimes radically change our regression results.
–  They can also help us to think about other independent variables that may be important.

Example
•  Say we are interested in voter turnout around the world (what percentage of people vote).
•  A reasonable hypothesis is that a more competitive party system would lead to more people bothering to vote.
•  So let's model turnout using competitiveness as an independent variable in a regression.

[Figure: scatter plot of Turnout (30-100%) against Competitiveness (0-5) with the fitted regression line; Belgium and Australia sit far above the line.]

But Belgium and Australia have compulsory voting... we need to include that in the model.

What if you can't explain the outliers?
•  Check the data: make sure it isn't an error.
•  Assess whether it matters:
1.  How big is its residual? Standardize the residuals (called studentized residuals) by dividing each residual by the standard deviation we would expect from normal sampling variability. This is like a z-statistic, so about 5% of values should be above 1.96 or below -1.96, which lets you work out how outlying the outliers are.
2.  Does it affect the estimated coefficients? What is the leverage of the observation?

Leverage
•  Outlying observations far from the mean make more difference to the regression line.
•  To detect influential observations, we calculate a diagnostic called DFBETA, which tells us the effect of removing the observation on each parameter estimate in the model.
•  The DFBETA for a parameter is high (more than 1) when the observation has a big residual and a lot of leverage. (A worked example of these diagnostics appears at the end of these notes.)

Should we delete the outlier?
•  Generally not a good idea unless you have some reason (the data point is a typo, there is a missing variable, etc.); it is a real observation after all.
•  If the "interesting" relationships depend on the outlier, then you need to be cautious when interpreting them.

Missing observations
•  Problem: some respondents did not answer some questions.
•  Common solutions:
–  Complete-case method (or listwise deletion)
–  Missing-indicator method (or dummy variable adjustment) and the stratification method
–  Single imputation
–  Multiple imputation

Complete-case method
•  Method: delete from the sample any observations that have missing data on any variable in the model of interest.
•  Pros and cons:
–  If the data are missing at random, then the estimates will be unbiased (if not missing at random, then the estimates may be biased).
–  Standard errors will be larger because the sample size is smaller than under the other approaches.
–  Wasteful of information.

Missing-indicator method
•  Method:
1.  Create a dummy variable D that equals 1 if data are missing on variable x and 0 otherwise.
2.  Fill in the missing observations of x with some constant c (the mean or median of the observed x's).
3.  Include x and D in the regression.
The stratification method is just a variation for categorical variables: if race is missing and has 3 categories (white, black, Hispanic), add a 4th category (unknown).
•  Pros and cons:
–  Uses all available information.
–  Produces biased estimates of the coefficients (Jones, 1996).

Single imputation
•  Method:
–  Substitute some reasonable guess (imputation) for each missing observation, and then do the analysis as if there were no missing data.
–  The guess is usually produced through a regression: x has some missing values, so regress x on all the other independent variables and predict x for the missing cases using the regression coefficients.
•  Pros and cons:
–  If the data are missing at random, OLS is consistent.
–  Analyzing imputed data as though it were complete data does not adjust for the fact that the imputation process involves uncertainty about the missing values (the SE will understate the true uncertainty about the estimate).

Multiple imputation
•  The gold standard for dealing with missing data (a sketch appears at the end of these notes).
•  Method:
–  Introduce a random component into the imputation process.
–  Repeat the random imputation process more than once, producing multiple "complete" data sets.
–  Get an estimate from each data set, and then combine the multiple estimates into one estimate.
•  Pros and cons:
–  Produces estimates that are consistent and asymptotically efficient when the data are missing at random.
–  Hard to implement and easy to do wrong.

If you want to learn more, read Allison's Sage green book, Missing Data.
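Circling back to the outlier diagnostics: here is a sketch of how studentized residuals and DFBETAs might be computed with statsmodels on invented turnout-style data. Note that statsmodels reports standardized DFBETAS; the cutoff of 1 from the leverage slide is a common conservative rule of thumb for that standardized version.

```python
# Studentized residuals and DFBETAs on simulated turnout-style data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
competitiveness = rng.uniform(0, 5, size=n)
turnout = 40 + 8 * competitiveness + rng.normal(scale=5, size=n)
competitiveness[0], turnout[0] = 1.0, 95.0  # plant a "compulsory voting" outlier

X = sm.add_constant(competitiveness)
influence = sm.OLS(turnout, X).fit().get_influence()

student = influence.resid_studentized_external   # behaves like a z-statistic
dfbetas = influence.dfbetas                      # one column per coefficient
flagged = np.where(np.abs(student) > 1.96)[0]    # ~5% expected by chance alone
print("flagged observations:", flagged)
print("their DFBETAs (const, slope):")
print(dfbetas[flagged])                          # |DFBETAS| > 1 is worrying
```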
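Finally, a minimal multiple-imputation sketch along the lines described above: regression imputation plus a random draw, repeated M times, with the point estimates averaged. The data-generating process is invented, and a full implementation would also propagate imputation-model uncertainty and combine variances via Rubin's rules.

```python
# Toy multiple imputation: impute x from z with regression + noise, M times.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)                        # fully observed covariate
x = 0.8 * z + rng.normal(scale=0.6, size=n)   # covariate with missing values
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
miss = rng.random(n) < 0.3                    # 30% of x missing at random

# Imputation model: regress the observed x's on z.
fit_x = sm.OLS(x[~miss], sm.add_constant(z[~miss])).fit()
sigma = np.sqrt(fit_x.scale)                  # residual SD = random component

M, estimates = 5, []
for _ in range(M):
    x_imp = x.copy()                          # observed values stay as they are
    pred = fit_x.predict(sm.add_constant(z[miss]))
    x_imp[miss] = pred + rng.normal(scale=sigma, size=miss.sum())
    X = sm.add_constant(np.column_stack([x_imp, z]))
    estimates.append(sm.OLS(y, X).fit().params[1])   # coefficient on x

print(f"combined estimate for x: {np.mean(estimates):.3f} (true value 2.0)")
```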