3/16/12
PADP 8130: Linear Models
Data Problems
Angela Fertig, Ph.D.

Data problems we will discuss today
• Multicollinearity
• Measurement error
• Outliers
• Missing observations

Multicollinearity
• Definition: Independent variables are closely related to each other, such that there is a linear relationship among them.
  – When one of them increases, the others increase as well, making it difficult to work out the separate effect of each predictor.
  – This is common in social science because our variables often "overlap" a lot.

Example: When you ask people about their various attitudes, many attitudes are highly correlated with each other (e.g., attitudes toward defense policy and foreign policy).

[Figure: scatter plot of defence policy (vertical axis, 0–10) against foreign policy (horizontal axis, 0–10). The correlation between the two independent variables is 0.98. When foreign policy goes up, defence policy goes up; we can't work out what happens when foreign policy goes up and defence stays the same.]

Symptoms of multicollinearity
• Coefficients have high standard errors.
• Coefficients have the wrong sign or implausible magnitudes.
• Small changes in the data lead to wide swings in the coefficients.

Ways to check for multicollinearity
1. Calculate the variance inflation factor (VIF); a variable with VIF > 10 may be a linear combination of the other independent variables.
2. Regress x1 on the rest of the regressors; if R² is close to 1, multicollinearity is likely.

Consequence of multicollinearity
• Perfect multicollinearity → the OLS estimate is not defined, because X is not full rank, so X'X is not invertible.
• Near-perfect multicollinearity → high variance of the OLS estimates.

Model: y = zβ1 + wβ2 + ε
Var[(β1, β2)'] = σ²(X'X)⁻¹ = σ² [ ∑z²  ∑wz ]⁻¹
                                [ ∑wz  ∑w² ]

Var(β1) = σ²∑w² / (∑z²∑w² − (∑wz)²)

If w and z are orthogonal (∑wz = 0), then Var(β1) = σ²/∑z² (fine).

If not, Var(β1) = σ² / (∑z² − (∑wz)²/∑w²) = σ² / (∑z²(1 − (∑wz)²/(∑z²∑w²))) = σ² / (∑z²(1 − ρ²wz))

As ρwz → 1, Var(β1) → ∞.
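To make the symptom and the diagnostic concrete, here is a minimal numpy sketch (the simulated data, seed, and all names are illustrative, not from the slides): it fits y = zβ1 + wβ2 + ε twice, once with uncorrelated regressors and once with corr(z, w) ≈ 0.98, and reports the standard error of b1 along with a hand-rolled VIF computed as 1/(1 − R²) from regressing one column on the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def ols_se(X, y):
    """OLS coefficient estimates and conventional standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])   # estimate of sigma^2
    return b, np.sqrt(s2 * np.diag(XtX_inv))

def vif(X, j):
    """Variance inflation factor 1/(1 - R^2) from regressing
    column j of X on the remaining columns."""
    others, xj = np.delete(X, j, axis=1), X[:, j]
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    resid = xj - others @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
    return 1.0 / (1.0 - r2)

results = {}
for rho in (0.0, 0.98):
    z = rng.normal(size=n)
    w = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr(z, w) = rho
    y = z + w + rng.normal(size=n)                          # true beta1 = beta2 = 1
    X = np.column_stack([z, w])
    b, se = ols_se(X, y)
    results[rho] = {"b1": b[0], "se1": se[0], "vif1": vif(X, 0)}
    print(rho, results[rho])
```

With ρ = 0.98 the standard error of b1 blows up by roughly the factor 1/√(1 − ρ²) derived above, and the VIF climbs well past the rule-of-thumb threshold of 10.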
Cure for multicollinearity
• Drop one of the highly correlated independent variables (this makes sense when the causal relationships are clear).
• Make a scale from the highly correlated independent variables (this makes sense when there is an underlying variable that we haven't/can't measure).

Measurement error (aka errors in variables)
• Definition: Measurement error (ME) arises when the variable you need cannot be measured accurately, or when only a proxy for the true variable is available.
• Consequences:
  – ME in the dependent variable results in higher variance, but the estimate is still unbiased.
  – ME in an independent variable results in biased estimates; in particular, the estimate is attenuated (biased toward 0).

ME in the dependent variable
True relationship: y = Xβ + e, where Var(e) = σε².
Observed: y* = y + m, where E(m) = 0 and Var(m) = σm².
Regress: y* = Xβ + u, where u = e + m and Var(u) = σε² + σm². Higher variance!

b = (X'X)⁻¹X'y* = (X'X)⁻¹X'(y + m)
E(b) = β. Unbiased!

ME in the independent variable
Observed: x* = x + m,
where E(m) = 0, E(mX) = 0, E(my) = 0, and Var(m) = σm².
Regress: y = x*β + e.

b = ∑yx* / ∑x*² = ∑y(x + m) / ∑(x + m)² = (∑yx + ∑ym) / (∑x² + 2∑xm + ∑m²)

E(b) = E(∑yx) / (E(∑x²) + E(∑m²)) = Qβ / (Q + σm²), where Q = E(∑x²). Biased!

b < β because Q / (Q + σm²) < 1. Attenuated!
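Both results are easy to see in a quick simulation sketch (a toy model; the numbers, seed, and names are illustrative): with true slope β = 2 and measurement-error variance equal to Var(x) = 1, noise added to y leaves the slope estimate near 2, while noise added to x shrinks it toward β·Q/(Q + σm²) = 2·1/(1 + 1) = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
beta = 2.0
x = rng.normal(size=n)                 # true regressor, Var(x) = Q = 1
y = beta * x + rng.normal(size=n)      # true relationship

def slope(x, y):
    """Bivariate OLS slope through the origin (everything is mean zero)."""
    return np.sum(x * y) / np.sum(x * x)

sigma2_m = 1.0                         # measurement-error variance

# ME in the dependent variable: y* = y + m -> slope still ~ beta
y_star = y + rng.normal(scale=np.sqrt(sigma2_m), size=n)
b_y = slope(x, y_star)

# ME in the independent variable: x* = x + m -> slope attenuated
# by the factor Q / (Q + sigma2_m) = 1 / 2
x_star = x + rng.normal(scale=np.sqrt(sigma2_m), size=n)
b_x = slope(x_star, y)

print(b_y)   # close to 2.0
print(b_x)   # close to 1.0
```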
Cure for measurement error
The most common solution is to use instrumental variables, which we will discuss in a few weeks.

Outliers
• Definition: Outliers are individual observations that do not fit our predictions.
  – These "outlying" observations can sometimes radically change our regression results.
  – They can also help us to think about other independent variables that may be important.

Example
• Say we are interested in voter turnout around the world (what percentage of people vote).
• A reasonable hypothesis is that a more competitive party system would lead to more people bothering to vote.
• So let's model turnout using competitiveness as an independent variable in a regression.
[Figure: scatter plot of turnout (vertical axis, 30–100%) against competitiveness (horizontal axis, 0–5), with a fitted regression line. Belgium and Australia sit well above the line.]

But Belgium and Australia have compulsory voting… we need to include that in the model.

What if you can't explain the outliers?
• Check the data – make sure it isn't an error.
• Assess whether it matters:
  1. How big is its residual? Standardize the residuals (called studentized residuals) by dividing each residual by the standard deviation we would expect from normal sampling variability. This is like a z statistic, so about 5% of values should be above 1.96 or below −1.96; this lets you work out how outlying the outliers are.
  2. Does it affect the estimated coefficients? What is the leverage of the observation?

Leverage
• Outlying observations far from the mean make more difference to the regression line.
• To detect influential observations, we calculate a diagnostic called DFBETA, which tells us the effect of removing the observation on each parameter estimate in the model.
• The DFBETA for a parameter is high (more than 1) when the observation has a big residual and a lot of leverage.

Should we delete the outlier?
• Generally not a good idea unless you have some reason (the data point is a typo, there is a missing variable, etc.); it is a real observation after all.
• If the "interesting" relationships depend on the outlier, then you need to be cautious when interpreting them.

Missing observations
• Problem: Some respondents did not answer some questions.
• Common solutions:
  – Complete case method (or listwise deletion)
  – Missing indicator method (or dummy variable adjustment) and the stratification method
  – Single imputation
  – Multiple imputation

Complete case method
• Method: delete from the sample any observations that have missing data on any variable in the model of interest.
• Pros and cons:
  – If the data are missing at random, the estimates will be unbiased (if not missing at random, the estimates may be biased).
  – Standard errors will be larger because the sample size is smaller than under other approaches.
  – Wasteful of information.

Missing indicator method
• Method:
  1. Create a dummy variable D that equals 1 if data are missing on variable x and 0 otherwise.
  2. Then fill in the missing observations of x with some constant c (the mean or median of the observed x's).
  3. Include x and D in the regression.
• The stratification method is just a variation for categorical variables: if race is missing and has 3 categories (white, black, Hispanic), add a 4th category (unknown).
• Pros and cons:
  – Uses all available information.
  – Produces biased estimates of the coefficients (Jones, 1996).

Single Imputation
• Method:
  – Substitute some reasonable guess (imputation) for each missing observation and then do the analysis as if there were no missing data.
  – The guess is usually produced through a regression (x has some missing values; regress x on all the other independent variables, then predict x for the missing cases using the regression coefficients).
• Pros and cons:
  – If data are missing at random, OLS is consistent.
  – Analyzing imputed data as though it were complete data does not adjust for the fact that the imputation process involves uncertainty about the missing values (the SE will understate the true uncertainty about the estimate).

Multiple Imputation
• Gold standard for dealing with missing data.
• Method:
  – Introduce a random component to the imputation process.
  – Repeat the random imputation process more than once, producing multiple "complete" data sets.
  – Get an estimate from each data set and then combine the multiple estimates into one estimate.
• Pros and cons:
  – Produces estimates that are consistent and asymptotically efficient when the data are missing at random.
  – Hard to implement and easy to do wrong.

If you want to learn more, read Allison's Sage green book "Missing Data".
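The multiple-imputation recipe above can be sketched in numpy as follows. This is a deliberately simplified illustration: the toy data, the imputation model (regressing x on y), and the choice of M are all made up, and a proper implementation would also draw the imputation-model parameters from their posterior rather than reusing point estimates. The combining step at the end is Rubin's rules: average the M estimates, and add the within-imputation and (inflated) between-imputation variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 1000, 20                        # sample size, number of imputations

# Toy data: y depends on x; a third of the x values are then "lost".
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
miss = rng.random(n) < 1 / 3           # missing completely at random

def ols(X, y):
    """OLS coefficients and their estimated variance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, s2 * XtX_inv

# Imputation model: regress the observed x's on y, then impute each
# missing x as prediction + a random normal draw (the random component).
Zobs = np.column_stack([np.ones(miss.size - miss.sum()), y[~miss]])
g, _ = ols(Zobs, x[~miss])
imp_resid_sd = np.std(x[~miss] - Zobs @ g)

ests, variances = [], []
for _ in range(M):
    x_imp = x.copy()
    pred = g[0] + g[1] * y[miss]
    x_imp[miss] = pred + rng.normal(scale=imp_resid_sd, size=miss.sum())
    X = np.column_stack([np.ones(n), x_imp])
    b, V = ols(X, y)
    ests.append(b[1])                  # slope on x from this "complete" data set
    variances.append(V[1, 1])

# Rubin's rules: combine the M estimates into one estimate and variance.
b_mi = np.mean(ests)
within = np.mean(variances)
between = np.var(ests, ddof=1)
var_mi = within + (1 + 1 / M) * between
print(b_mi, np.sqrt(var_mi))
```

The combined slope lands near the true value of 2, and var_mi is larger than the naive single-imputation variance because the between-imputation term carries the uncertainty about the missing values.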
This note was uploaded on 03/28/2012 for the course PADP 8130 taught by Professor Fertig during the Spring '12 term at LSU.