2/21/11

PADP 8120: Data Analysis and Statistical Modeling
Associations between Categorical Variables
Spring 2011
Angela Fertig, Ph.D.

Plan

Last time: We discussed how to determine whether groups were significantly different.
This time: We are doing something different. We are focusing on categorical variables and how to determine whether they are associated.

Some definitions

Response or dependent variable: the variable about which comparisons are made (e.g. number of prescriptions).
Explanatory or independent variable: the variable that defines the groups across which the response variable is compared (e.g. gender).
Two variables are associated if the distribution of the response variable changes in some way as the value of the explanatory variable changes.

Categorical associations

If we have a categorical dependent variable (like self-reported health status), then the techniques we've discussed so far for comparing means and proportions don't work.
We can tabulate the data and "eyeball" relationships, but how do we measure the association between variables more rigorously?

Example for the day

Say we are interested in how employment status varies by gender. We sample 105 people.
Our dependent variable is employment status, which we collapse to 3 categories for simplicity: employed, unemployed and not in the labor force.
Our independent variable is gender.

Contingency table

A contingency table (or cross-tab) displays the number of observations for each combination of outcomes over the categories of each variable.
Each observation falls into only one cell of this table.
The categories are exhaustive and mutually exclusive. Exhaustive because you can only be male or female, and you have to have one of the 3 employment categories. Exclusive because you cannot be both male and female, and you cannot have more than one employment status.
We only use one independent variable here; in future weeks, we will look at many independent variables.

Our Contingency Table

             Employed   Unemployed   OLF   Total
Men             29          19         7     55
Women           15          24        11     50
Total           44          43        18    105

Row marginals: the Total column (55, 50). Column marginals: the Total row (44, 43, 18). OLF = not in the labor force.

To see how employment status varies by gender, convert to percentages within row to get the conditional distributions.

Conditional Distributions

             Employed   Unemployed   OLF   Total (N)
Men             53          34        13    100 (55)
Women           30          48        22    100 (50)
Total           42          41        17    100 (105)

29/55 = 53% of men are employed
15/50 = 30% of women are employed
44/105 = 42% of people are employed

Statistical independence

In order to judge whether there is an association between gender and employment status, we use a concept called statistical independence.
Two variables are statistically independent if the probability of falling into a particular column is independent of the row for the population.
E.g. if 50% of people are employed, then we would expect 50% of men and 50% of women to be employed if gender is statistically independent of employment status.

Judging based on our sample

We're in the familiar situation of wanting to know something about a population, but only having a sample.
We don't know whether a relationship that's apparent in the observed data (men appear more likely to be employed) is due to sampling variation or not.
Our null hypothesis (H0) is that gender and employment status are statistically independent.
We test this against the alternative hypothesis (Ha) that they are statistically dependent.
So how do we test the null hypothesis?

Expected Frequencies

If there is no relationship, then we should expect the proportion of men who are employed to be the same as the proportion of people who are employed in the sample as a whole.
One way to work out whether there are differences from the null hypothesis is to work out what the expected frequencies would be if the null hypothesis were correct.
We can compare these expected frequencies with the actual observed frequencies and assess whether any differences from the null hypothesis are "big" or "small".
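
The row percentages and expected frequencies described above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function names (marginals, conditional_percentages, expected_frequencies) are hypothetical, and the counts are the ones from the example table.

```python
# Observed counts from the example: rows are gender, columns are employment status.
observed = {
    "Men": {"Employed": 29, "Unemployed": 19, "OLF": 7},
    "Women": {"Employed": 15, "Unemployed": 24, "OLF": 11},
}


def marginals(table):
    """Return (row totals, column totals, grand total) for a nested dict of counts."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    return row_totals, col_totals, sum(row_totals.values())


def conditional_percentages(table):
    """Percentages within each row (the conditional distributions)."""
    row_totals, _, _ = marginals(table)
    return {
        row: {col: 100 * count / row_totals[row] for col, count in cells.items()}
        for row, cells in table.items()
    }


def expected_frequencies(table):
    """Counts expected under H0 (independence): row total * column total / N."""
    row_totals, col_totals, n = marginals(table)
    return {
        row: {col: row_totals[row] * col_totals[col] / n for col in cells}
        for row, cells in table.items()
    }


if __name__ == "__main__":
    for row, cells in conditional_percentages(observed).items():
        print(row, {col: round(pct) for col, pct in cells.items()})
    for row, cells in expected_frequencies(observed).items():
        print(row, {col: round(freq, 1) for col, freq in cells.items()})
```

For employed men, the expected frequency works out to 55 * 44 / 105, which rounds to 23.0, the same value as taking 44/105 of the 55 men.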

Contingency Table with Expected Frequencies

Observed frequencies, with expected frequencies in parentheses:

             Employed     Unemployed    OLF         Total
Men          29 (23.0)    19 (22.5)      7 (9.4)      55
Women        15 (21.0)    24 (20.5)     11 (8.6)      50
Total        44           43            18           105

The expected frequency for employed men (if H0 is true) is 44/105 of all men (55), which is 23.0 men.
We can see that there are some big and/or small deviations from the null hypothesis, but we want to summarize them and assess their size.

Chi-squared statistic

Use a chi-square statistic (or χ²).
This is based on looking at the squared deviations from the expected frequencies.
Some deviations will be big just because the numbers are big, so we also divide the squared deviation by the expected frequency.
Then we take these from each cell and add them up:

χ² = sum over all cells of (observed frequency - expected frequency)² / expected frequency

Calculating chi-square by hand

             Employed     Unemployed    OLF         Total
Men          29 (23.0)    19 (22.5)      7 (9.4)      55
Women        15 (21.0)    24 (20.5)     11 (8.6)      50
Total        44           43            18           105

Employed men: (29 - 23.0)² / 23.0 = 1.54
Unemployed men: (19 - 22.5)² / 22.5 = 0.55
(Both results use the unrounded expected frequencies, 23.05 and 22.52.)
And so on. If we add all these numbers up, then our chi-square statistic = 5.7.

Is 5.7 a big chi-square?

A "big" number tells us that H0 is unlikely because the observed frequencies are "far away" from the expected frequencies. But when is a number big enough to reject the null hypothesis?
We need to look at the chi-square sampling distribution to figure this out.
The distribution will vary by the size of the table. Tables with more cells will tend to have a bigger value of chi-square just because there are more numbers to add up.
We take this into account by a concept called degrees of freedom: the number of "non-redundant" pieces of information.

Degrees of freedom

In our case, we only have 2 degrees of freedom because once we know two cell numbers (and all of the marginals), we can work out the rest of the cells. In general, DF = (rows - 1) x (columns - 1), here (2 - 1) x (3 - 1) = 2.

             Employed   Unemployed   OLF   Total
Men             29          19         ?     55
Women            ?           ?         ?     50
Total           44          43        18    105

χ² distribution

When DF are low (v = 3), most of the χ² statistics fall below 5.
When DF are high (v = 10), most of the χ² statistics fall above 5.
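
The by-hand chi-square calculation and the degrees-of-freedom count can be sketched as follows. This is an illustrative sketch rather than the lecture's own code: the function name chi_square is hypothetical, and the degrees of freedom use the standard (rows - 1) x (columns - 1) rule for a two-way table.

```python
# Observed counts: rows are Men, Women; columns are Employed, Unemployed, OLF.
observed = [
    [29, 19, 7],
    [15, 24, 11],
]


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            # Squared deviation, scaled by the expected count.
            statistic += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


if __name__ == "__main__":
    statistic, df = chi_square(observed)
    print(f"chi-square = {statistic:.1f} with {df} DF")
```

The expected counts are kept unrounded inside the loop, which is why the per-cell contributions match 1.54 and 0.55 rather than the values obtained from the rounded 23.0 and 22.5.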
The mean of the chi-square sampling distribution is the number of DF.
As the DF increases, the standard deviation of the sampling distribution increases.

Getting the p-value

Just as with z-tests and t-tests, we ask "what is the probability of getting a value of χ² that is this far from the mean if H0 is true (no association)?"
As before, the area under the curve beyond that value tells us the p-value.
The only difference is that the distribution, and hence the p-value, depends on the DF.
This table shows the χ² values that have a 10% to 1% probability of coming up by chance due to sampling variation (by DF).

df \ area    0.1     0.05    0.025    0.01
1            2.71    3.84     5.02    6.63
2            4.61    5.99     7.38    9.21
3            6.25    7.81     9.35   11.34
4            7.78    9.49    11.14   13.28
5            9.24   11.07    12.83   15.09

Back to our example

The χ² statistic was 5.7 with 2 DF.
Looking at the table, this would occur by chance between 5% and 10% of the time (5.7 lies between 4.61 and 5.99).
Thus, we can reject the null hypothesis that there is no relationship between gender and employment status at the 10% level, but not at the 5% level.

Strength of the association

So there is a (marginally) statistically significant relationship between gender and employment status, but how strong is the association? Is it economically important?
We often use an odds ratio to look at strength of association:

Odds = Probability of 'success' / Probability of 'failure'
Odds ratio = Odds if group 1 / Odds if group 2

Let's revise our example to a 2x2

             Employed   Unemployed   Total
Men             29          19         48
Women           15          24         39
Total           44          43         87

The odds of being employed if you're a man are (29/48) / (19/48) = 1.526
The odds of being employed if you're a woman are (15/39) / (24/39) = 0.625
Odds ratio = Odds of working if man / Odds of working if woman = 1.526 / 0.625 = 2.4

Interpretation of Odds Ratios

The odds ratio tells us how much greater or smaller the odds of "something happening" are for one group than for another.
In our example, the odds of being employed vs. being unemployed are 2.4 times greater for men than for women.
An odds ratio of 1 means no difference in the odds between the groups.
Values far from one (greater or smaller) tell us the strength of the association is large.
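
The odds and odds-ratio arithmetic for the 2x2 example can be sketched as below. The helper names (odds, odds_ratio) are hypothetical and chosen only for illustration; the counts are the employed and unemployed counts for men and women from the example.

```python
def odds(successes, failures):
    """Odds of 'success': P(success) / P(failure).

    The group total cancels out of the ratio, so the odds reduce to
    successes / failures, e.g. (29/48) / (19/48) = 29/19.
    """
    return successes / failures


def odds_ratio(group1, group2):
    """Odds for group 1 divided by odds for group 2.

    Each group is a (successes, failures) pair of counts.
    """
    return odds(*group1) / odds(*group2)


# (employed, unemployed) counts from the 2x2 table.
men = (29, 19)
women = (15, 24)

if __name__ == "__main__":
    print(f"odds for men:   {odds(*men):.3f}")
    print(f"odds for women: {odds(*women):.3f}")
    print(f"odds ratio:     {odds_ratio(men, women):.1f}")
```

Swapping the two groups gives the reciprocal odds ratio (about 0.41), which describes the same strength of association from the women's side.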