PADP 8120: Data Analysis and Statistical Modeling
Multivariate Ordinary Least Squares
Spring 2011
Angela Fertig, Ph.D.
3/22/11

Plan
Last time: we introduced OLS in its bivariate form, with one independent variable (X) that predicts a dependent variable (Y).
This time: we will extend it in a number of ways:
- More than one independent variable
- Using categorical independent variables
- Accounting for interactions between independent variables
- Assessing whether some models are better than other models

Causality
For causation, we need 3 things:
1. Association: a statistically significant relationship between the two variables we are interested in.
2. Time ordering: the cause must come before the effect. This is difficult in social science because we can't do experiments and we often have "fixed" variables like race.
3. No alternative explanations: is it possible that something else accounts for the relationship?

Example
People in the Hebrides were convinced that body lice caused good health. Healthy people always had lots of lice and sick people had few. Should we be discouraging baths and encouraging lice? Probably not. If you lived in the Hebrides, you were likely to have lice. The only people who didn't were ill or dead, because lice can't live on a dead person and they don't like the heat when someone is ill and feverish.

Alternative explanations
The relationship could be spurious. e.g., ice cream consumption and spousal abuse complaints are associated; should we ban ice cream? No. There is no causal relationship, because both are caused by another variable: hot weather.
The relationship could work through another variable (a chain relationship). e.g., being employed may be associated with more preventative health care. Why would that be? There is a mediating variable: health insurance. Employed people are much more likely to have health insurance and thus to get preventative care.
The relationship could be conditional on another variable. e.g.,
As the price of cigarettes goes up, cigarette consumption goes down for young adults, but there is almost no effect for older smokers (who are more likely to be heavily addicted). Thus the relationship between cigarette price and consumption is conditional on age.

Observational data and multiple regression
We could eliminate these problems if we used experiments. Unfortunately, in social science it is often impossible to experiment on people (we can't make one group poor, homeless, uninsured, etc., and see what effect it has on them). Instead, we rely on observational data and control for alternative explanations using multiple regression. Multiple regression allows us to include numerous independent variables, so we can include those variables that we think might be producing spurious relationships.

Example for the day
What predicts attitudes about abortion?
Hypothesis: older people are less pro-choice than younger people, because younger people were raised in a more socially liberal environment than their elders.
Sample: 100 British people.
Measure of attitudes: a 10-point scale. "Please tell me whether you think abortion can be justified, never be justified, or something in between, using this card." [Respondent is given a 1-10 response card, where 1 is always justified and 10 is never justified.]

[Figure: scatterplot of abortion attitude against age, with the linear regression line.]

Bivariate OLS

Variable     Coefficient   Std. error   p-value
Age          0.10          0.01         0.00
Intercept    0.46          0.45         0.31

The equation for our linear regression is:

y = 0.46 + 0.10X + e

where y is attitude towards abortion, X is age, and e is the error term.
So there seems to be a statistically significant relationship between attitudes and age (p = 0.00). The relationship also appears to be large in magnitude: if James is 10 years older than Jessie, then we predict that James will be more pro-life and will score about 1 point higher on our 10-point scale.

Alternative explanation
Could it be that the relationship is spurious, with religiosity being associated with both age and abortion attitudes? The data say:
- People who go to church 4+ times/month have a mean of 6.95 on our abortion attitude scale and a mean age of 58.
- People who go to church <1 time/month have a mean of 2.48 on our abortion attitude scale and a mean age of 26.

[Figure: the same scatterplot with the regression line; religious people (who are old and pro-life) cluster at one end, irreligious people (who are young and pro-choice) at the other.]

Multivariate OLS
We need to include religiosity and age as independent variables in our regression.
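What should such a regression show? Here is a rough simulated sketch in Python with NumPy. The data are invented (the real 100-person sample is not available): religiosity is built to rise with age, and the true coefficients are chosen to echo the slides' numbers. The point is that the age coefficient shrinks once religiosity is held constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
age = rng.uniform(18, 80, n)
# invented mechanism: older people attend church more often (0-7 times/month)
church = np.clip(np.round((age - 18) / 12 + rng.normal(0, 1, n)), 0, 7)
# invented attitudes: age AND religiosity both push the 1-10 scale upward
y = 2.07 + 0.03 * age + 0.84 * church + rng.normal(0, 1, n)

# bivariate OLS: with church attendance omitted, the age slope absorbs
# religiosity's effect too (np.polyfit returns slope, then intercept)
b_biv, a_biv = np.polyfit(age, y, 1)

# multivariate OLS: regress y on an intercept, age, and church attendance
X = np.column_stack([np.ones(n), age, church])
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

# b_biv comes out inflated (near 0.10), while b1 is the age effect holding
# religiosity constant (near the true 0.03)
```

This is exactly the pattern the bivariate and multivariate tables show: the apparent age effect is mostly religiosity in disguise.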
Variable          Coefficient   Std. error   p-value
Age (b1)          0.03          0.01         0.06
Religiosity (b2)  0.84          0.12         0.00
Intercept (a)     2.07          0.43         0.00

The equation for our multiple regression is:

y = 2.07 + 0.03X1 + 0.84X2 + e

This means that as people go to church an extra time per month, their abortion attitude score goes up by 0.84 points, holding age constant. Likewise, as people age one year, their abortion attitude score goes up by 0.03 points, holding church attendance constant.

Thinking about extra predictors
The best way to think about regressions with more than one independent variable is to imagine a separate regression line for age at each value of religiosity, and vice versa. The effect of age is the slope of parallel lines, controlling for the effect of religiosity.

[Figure: parallel regression lines for X2 = 1, 2, 3, and 4.]

Multiple regression summary
Our example has only 2 predictors, but we can have any number of independent variables, so multiple regression is a really useful extension of simple linear regression.
Multiple regression is a way of reducing spurious relationships between variables by including the real cause.
Multiple regression is also a way of testing whether a relationship is actually working through another variable (as it appears to be in our example).

Using categorical independent variables
The independent variables we've been using are all interval level (age, number of times attended church, etc.). A lot of social science variables are categorical (gender, race, region, etc.). To include these, we create "dummy" variables (0/1 variables). One category must be omitted to serve as the reference category.
- Male: create a variable that equals 1 when male and 0 when female (female is the reference category).
- Race: create 2 variables (1 if black, 0 otherwise; and 1 if other, 0 otherwise), leaving white as the reference category.
- Region: create 3 variables (1 if west, 0 otherwise; 1 if northeast, 0 otherwise; and 1 if midwest, 0 otherwise), leaving south as the reference category.

Accounting for interactions
There was a 3rd kind of alternative explanation that we haven't looked at yet: the relationship could be conditional on another variable. That is, the slopes could be different for different groups, so when you put all of the groups together, you aren't getting an accurate picture of the true effect. In an extreme case, if the effect is positive for one group and negative for the other, combining the groups is likely to give a zero effect. To deal with this, we interact (multiply) the appropriate independent variables:

y = a + b1X1 + b2X2 + b3X1X2 + e
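Dummy columns and interaction terms can be built directly as 0/1 arrays. A small sketch with made-up respondents (the data are invented for illustration; white is the reference category):

```python
import numpy as np

# hypothetical respondents
race = np.array(["white", "black", "other", "white", "black"])
high_ed = np.array([1, 0, 1, 0, 1], dtype=float)  # 1 = high education (made up)

# dummy coding: one 0/1 variable per non-reference category; white is omitted
black = (race == "black").astype(float)
other = (race == "other").astype(float)

# interaction term: the elementwise product of the variables being interacted
ed_x_black = high_ed * black

# design matrix: intercept, X1, two race dummies, and the X1*X2 interaction
X = np.column_stack([np.ones(len(race)), high_ed, black, other, ed_x_black])
```

Each row of X is one respondent; a row of the omitted category (white) simply has zeros in both race columns, so its prediction comes from the intercept.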
The term b3X1X2 is the interaction term.

Interpreting interaction coefficients
For example, predicting income from education and race:

y = a + b_ed*X_ed + b_blk*X_blk + b_int*X_ed*X_blk + e

where a is the mean income when all Xs are zero, b_ed is the effect of education for whites, b_blk is the effect of being black, and b_int is the extra effect of education if black (likely negative).

Another way to think about it (assuming all Xs are dichotomous):

Predicted income   Low Ed (X_ed=0)   High Ed (X_ed=1)
White (X_blk=0)    a                 a + b_ed
Black (X_blk=1)    a + b_blk         a + b_ed + b_blk + b_int

Reading down a column, the intercepts differ by race; reading across a row, the education slopes differ by race (b_ed for whites, b_ed + b_int for blacks).

Multiple regression w/ interactions
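The four cells of the table can be computed directly from the fitted equation. A sketch with made-up coefficient values (chosen only to match the hypothesized signs, not the actual estimates):

```python
# made-up coefficients: a = mean income when all Xs are zero,
# b_ed = effect of education for whites, b_blk = effect of being black,
# b_int = extra effect of education if black (negative, as hypothesized)
a, b_ed, b_blk, b_int = 50000.0, 10000.0, 20000.0, -4000.0

def predicted_income(x_ed, x_blk):
    # y = a + b_ed*X_ed + b_blk*X_blk + b_int*X_ed*X_blk
    return a + b_ed * x_ed + b_blk * x_blk + b_int * x_ed * x_blk

white_low  = predicted_income(0, 0)  # a
white_high = predicted_income(1, 0)  # a + b_ed
black_low  = predicted_income(0, 1)  # a + b_blk
black_high = predicted_income(1, 1)  # a + b_ed + b_blk + b_int

# the return to education differs by race: b_ed for whites, b_ed + b_int for blacks
return_white = white_high - white_low
return_black = black_high - black_low
```

With these made-up numbers the return to education is 10,000 for whites but only 6,000 for blacks: one equation, two different slopes.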
Variable          Coefficient   Std. error   p-value
Educ (b1)         9565          399          0.00
Black (b2)        19380         10195        0.06
Educ*Black (b3)   -3886         796          0.00
Intercept (a)     49255         5408         0.00

The interaction is negative, as expected. But the effect of being Black is positive; that seems strange. Let's graph it.

[Figure: fitted income-education lines for whites and blacks. The slope is steeper for whites than for blacks, so the return to education is higher for whites; the intercept is higher for blacks than for whites.]

Model fit: R2
We often want to know how well our model fits the data we have. We also often want to know whether including an extra variable or interaction term makes a big improvement in the model or not. We can use a measure called R2 to measure how well a model fits the data.

What is R2?
The "total sum of squares" (TSS) is the sum of all of the squared deviations of each Y from the mean Y. The "sum of squared errors" (SSE) is the sum of the squared deviations of each Y from our model's predictions of Y (y-hat). R2 = 1 - SSE/TSS, so R2 measures the proportion of all of the variation in Y that is explained by all of the independent variables that we have.

Properties of R2
- R2 varies between 0 and 1; closer to 1 means the independent variables better predict Y.
- If our regression perfectly predicts all of the data points, then R2 = 1 (if this happens, there's probably something wrong).
- Each independent variable we add to a model will either increase R2 or leave it unchanged.
- There is another statistic called adjusted R2, which we use more often; the underlying principle is similar, but it penalizes the addition of extra predictors.

For example:

Model                             Adjusted R2
Education                         0.1418
Education + Black                 0.1905
Education + Black + Interaction   0.1908

Here we can see that including race improved model fit, but the addition of the interaction term didn't really do much. Note that many social science models have low R2 values, but this doesn't mean they are useless. Rather, it means that there is a lot of variation in Y not explained by our independent variables.
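The two sums of squares translate directly into code. A minimal sketch (the tiny y and y_hat vectors are invented purely for illustration):

```python
import numpy as np

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)       # sum of squared errors
    tss = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - sse / tss

def adjusted_r_squared(y, y_hat, k):
    # k = number of independent variables (excluding the intercept);
    # the adjustment penalizes predictors that explain little extra variation
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - k - 1)

# invented example: predictions close to the observed values give R2 near 1
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.2, 1.9, 3.1, 3.8, 5.0])
r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(y, y_hat, 1)
```

Because adding a variable can never raise SSE, plain R2 never falls when predictors are added; adjusted R2 can fall, which is why it is the better guide when comparing the models in the table above.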