January 21 - regression analysis (2)

Notes on regression analysis

Basics in regression analysis — key concepts (actual implementation is more complicated):

A. Collect data.
B. Plot the data on a graph and draw a line through the middle of the scatter of points.
C. Determine the intercept and slope(s) of the line (called regression parameters or coefficients).
D. Use statistical theory and the data to determine the "margin of error," "standard error," and "t-statistic" for each regression coefficient.

Example: class survey — results are displayed as an equation or in a table

A. as an equation:

    earnings per hour = -7.60 - 3.09 Female + 0.79 Age + 1.35 GPA
    standard error:     (20.30)  (1.61)       (0.74)     (2.31)
    t-statistic:        [0.37]   [1.92]       [1.06]     [0.58]

    log(earnings per hour) = 0.37 - 0.23 Female + 0.08 Age + 0.15 GPA
    standard error:          (1.17)  (0.09)       (0.04)     (0.13)
    t-statistic:             [0.32]  [2.44]       [1.83]     [1.12]

B. in a table: regression coefficient (t-statistic in parentheses)

                             dependent variable is earnings/hr
    independent variable         in $          in logs
    female                      -3.09          -0.23
                                (1.92)         (2.44)
    age                          0.79           0.08
                                (1.06)         (1.83)
    GPA                          1.35           0.15
                                (0.58)         (1.12)

Statistical significance: margin of error, standard error, t-statistic

"margin of error" = ± (2 × standard error), or, more precisely, ± (1.96 × standard error)

t-statistic = |regression parameter| ÷ standard error

If the t-statistic ≥ 2 (more precisely, 1.96), then the parameter is "statistically significant" (it's unlikely we'd get an estimate of this parameter at least this far away from zero if the true value of the parameter were actually zero).

If the t-statistic < 2 (or, more precisely, 1.96), then the parameter is "not statistically significant" (it's not unlikely we'd get an estimate of this parameter at least this far away from zero if the true value of the parameter were actually zero).

The regression coefficient on a particular independent variable represents the effect on the dependent variable of an increase of one unit in that independent variable, with all other independent variables remaining unchanged. E.g., the above coefficients imply that...

• changing the sex variable from 0 (male) to 1 (female), with age and GPA remaining unchanged, would reduce earnings by $3.09 (or about 23%) (and this effect is statistically significant)
• changing the age variable by one year, with sex and GPA remaining unchanged, would raise earnings by $0.79 (or about 8%) (but this effect is not statistically significant)
• changing the GPA variable by one unit, with sex and age remaining unchanged, would raise earnings by $1.35 (or about 15%) (but this effect is not statistically significant)
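To make the t-statistic and margin-of-error arithmetic concrete, here is a minimal Python sketch (not part of the original notes) that recomputes them for the dollar-equation coefficients above; small last-digit differences from the notes can occur because the notes round the underlying estimates.

    # Coefficients and standard errors from the dollar equation of the class survey.
    estimates = {
        "intercept": (-7.60, 20.30),
        "female":    (-3.09, 1.61),
        "age":       (0.79, 0.74),
        "GPA":       (1.35, 2.31),
    }

    for name, (coef, se) in estimates.items():
        t = abs(coef) / se          # t-statistic = |regression parameter| / standard error
        margin = 1.96 * se          # "margin of error" at the 5% level
        verdict = "significant" if t >= 1.96 else "not significant"
        print(f"{name:9s}  t = {t:.2f}  margin of error = ±{margin:.2f}  {verdict}")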
Statistical significance calculations

To understand the logic of statistical significance calculations, consider a simple example involving opinion polls. We conduct a poll of 1,000 people and ask how many of them favor candidate A. We record the answers, and find that 52.1 percent of the people in this poll favor candidate A and 47.9 percent favor candidate B. (Thus, according to this poll, candidate A leads by 52.1 - 47.9 = 4.2 percentage points.) We conduct another poll of 1,000 people and find that in this second poll, 54.3 percent of the people favor candidate A and 45.7 percent favor candidate B (so that, in this second poll, A's lead over B is 8.6 percentage points). We conduct another poll, and another, and so on, until we have the results of thousands of polls, each of which has surveyed 1,000 people at exactly the same moment.

Statistical theory tells us the following: (1) the distribution of the results of all of these polls will be centered on the percentage of the people in the population as a whole who favor A; and (2) the distribution of the results will look like a bell-shaped ("normal") curve. One of the characteristics of the normal curve is that 2.5% of the observations are located in the left "tail" of the distribution (i.e., well below the mean or "average" value), and another 2.5% are located in the right tail (i.e., well above the mean). More precisely, if we measure the results not in percentage terms but in units of "standard errors," precisely 2.5% of the observations will be more than 1.96 standard errors below the mean, and another 2.5% will be more than 1.96 standard errors above the mean. (See below. It's conventional to use the Greek letter σ, the lower-case "sigma," as the symbol for the standard error. The value of the standard error is calculated from the data, using formulas determined by rules of statistical inference.) In other words, it is rare for observations to be 1.96 (or, roughly speaking, 2.0) or more standard errors away from the mean — this occurs only about 5% of the time. Similarly, most of the time (95% of the time, to be precise), the poll will produce a result that is less than 1.96 standard errors away from the mean.

[Sketch: a normal curve centered on the mean, with 2.5% of the area in each tail beyond mean - 1.96σ and mean + 1.96σ]

Now suppose we want to use the results of a single opinion poll to determine how the population feels about candidate A. (Thus, we are no longer imagining that we can take thousands upon thousands of such polls.) Suppose that, in this single poll, 57.1 percent of the people support candidate A. We know that, simply as the result of chance, there is some possibility that we could get a poll result that is at least this much different from 50.0 percent even if, in the population as a whole, people are divided 50-50 over the two candidates.

The logic of statistical significance then proceeds as follows. We start by assuming that the population is evenly divided between the two candidates. Equivalently, we assume that there is no difference between the degree of support for the two candidates in the population. (This assumption of "no difference" is usually called the "null hypothesis," and it is the starting point for all significance testing.) Next, we ask, "How likely is it that, if the population were evenly divided between the two candidates, we could nevertheless get a poll result like ours — one that is 7.1 percentage points away from 50.0 percent?" Next, we use the data to calculate the standard error of the population mean, and we measure the difference between our poll's result and the assumed 50-50 split using units of standard errors. For example, if one standard error is 3.0 percentage points, then the 7.1 percentage-point difference in this poll (57.1 - 50.0) is equivalent to 7.1 ÷ 3.0 ≈ 2.37 standard errors.

If this result is equal to or greater than 1.96 (or approximately 2.0), then this difference is called "statistically significant." In other words, in this case, if the population were truly evenly divided between the two candidates (so that each candidate enjoyed the support of 50.0 percent of the population), it is very unlikely (there is less than a 5% probability) that, simply as a result of sampling variation or "chance," a poll would show that one candidate's support differs from 50.0 percent by as much as this candidate's support does. (In other words, if the population were truly evenly divided, it is very unlikely that, simply because of sampling variation, an opinion poll would show that one candidate is ahead by at least 7.1 percentage points — which is what we actually see here.)
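As a minimal sketch of this arithmetic (not part of the original notes), the following Python expresses a poll result's distance from the 50-50 null in standard-error units; the sqrt(p(1-p)/n) line shows the standard formula for the standard error of a proportion, for comparison with the notes' illustrative 3.0-point value:

    import math

    def z_score(p_hat, p_null, se):
        """Distance between the poll result and the null value, in standard errors."""
        return abs(p_hat - p_null) / se

    # The notes' worked example: 57.1% support, 50.0% null, se of 3.0 percentage points.
    z = z_score(57.1, 50.0, 3.0)
    print(f"z = {z:.2f} standard errors;",
          "statistically significant" if z >= 1.96 else "not statistically significant")

    # For reference, the textbook standard error of a proportion under a 50-50 null,
    # for a poll of n = 1000 people (expressed in percentage points):
    n = 1000
    se_null = 100 * math.sqrt(0.5 * 0.5 / n)  # about 1.58 points
    print(f"formula-based standard error for n = 1000: {se_null:.2f} points")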
On the other hand, suppose that the difference between the poll's result and the assumed 50-50 split is equivalent to something less than 1.96 standard errors. For example, suppose again that 57.1 percent of the people favor candidate A, but that one standard error is equal to 5.0 percentage points. Then, expressed in terms of standard error units, 7.1 percentage points (= the extent to which our poll's result departs from a 50-50 split) is 7.1 ÷ 5.0 = 1.42, which is less than 1.96. In this case, the difference between our poll result and an assumed 50-50 split is called "not statistically significant." In other words, in this case, if the population were truly evenly divided between the two candidates, it is not unlikely (there is better than a 5% probability) that, simply as a result of sampling variation ("chance"), we would get a poll result showing a lead for one candidate that is as different from 50% as this one.

In intuitive terms, it may be helpful to think of significance testing as a kind of courtroom exercise. We begin by adopting a "null hypothesis," i.e., that there is no difference between two things. (In the example here, we assumed that there is no difference in support for the candidates.) In courtroom terms, this would be like an assumption of "innocent until proven guilty." We then consider the statistical evidence. If the evidence seems very far away from the null hypothesis (in terms of "standard deviation" units), we reject the null hypothesis on the grounds that the evidence isn't really consistent with the assumption of no difference. Likewise, if the courtroom evidence seems very far away from the assumption of innocence, we reject the assumption of innocence.

A few caveats and qualifications are appropriate here. First, statistical significance is not the same as "significance" in the ordinary, everyday sense. In everyday speech, "significant" simply means "big," "sizeable," etc. In contrast, even a difference that is small can nevertheless be "significant" in a statistical sense. Likewise, a difference that is "large" in the ordinary-language sense will not necessarily be significant in a statistical sense. For example, we might sample ten men and four women and find that the sex difference in pay in this sample is $5,000 per year. This is certainly significant in the everyday sense ($5,000 is a lot of money!), but it might not be statistically significant, because it's possible that mere sampling variation alone could produce a difference of at least this size.
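To see this first caveat in numbers, here is a hedged sketch with hypothetical pay figures (the notes give only the $5,000 gap and the sample sizes): with ten men and four women, a $5,000 observed difference can fall well short of the 1.96 cutoff because the standard error of the difference is large. The unequal-variance ("Welch") formula for that standard error is one standard choice; the notes do not specify a formula.

    import math

    # Hypothetical annual pay samples: ten men and four women (illustrative only).
    men   = [52000, 47000, 61000, 45000, 58000, 50000, 43000, 66000, 49000, 54000]
    women = [48000, 41000, 55000, 46000]

    def mean(xs):
        return sum(xs) / len(xs)

    def sample_var(xs):  # sample variance, n - 1 denominator
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    gap = mean(men) - mean(women)  # $5,000 with these numbers
    se = math.sqrt(sample_var(men) / len(men) + sample_var(women) / len(women))
    t = abs(gap) / se
    # The notes use the 1.96 normal cutoff throughout; with samples this small the
    # exact t-distribution cutoff would actually be somewhat larger than 1.96.
    print(f"gap = ${gap:,.0f}, t = {t:.2f},",
          "significant" if t >= 1.96 else "not significant")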
Second, it is not correct to say that a statistically significant difference is one for which there is less than a 5% probability that it occurred due to chance; and it is equally incorrect to say that a statistically significant difference is one for which there is more than a 95% probability that it occurred due to something other than chance. Neither of these statements correctly characterizes the logic that underlies statistical significance testing. Rather, such testing proceeds as follows. First, we adopt a "null hypothesis" of a zero difference in the underlying population. (For example, if we are testing for a difference between the average pay of men and women, we assume no sex difference in pay for the underlying population; if we are testing for a difference in support for two candidates, we assume a 50-50 split in the underlying population; and so on.) Second, we ask: if there were a zero difference in the underlying population, how likely is it that we would get a difference in a sample that is at least as far away from zero as is the difference in our sample? If the difference obtained in our sample is "far away" from zero (i.e., is at least 1.96 standard errors away from zero), then the difference is "statistically significant" — i.e., it is unlikely that chance alone could produce such a result. In other words, less than 5% of the time, chance would produce a result at least as far away from zero as our result is, if the null hypothesis of zero difference were correct; and more than 95% of the time, chance alone would produce a result that is closer to zero than ours is — again, if the null hypothesis were correct.

Notes on indicator/dichotomous/binary/dummy variables

Dummy variables are equal to 1 if the observation is in a specific category, and equal to 0 otherwise. E.g., sex:

    F = 1 if person is female, = 0 otherwise (i.e., if male)
    M = 1 if person is male,   = 0 otherwise (i.e., if female)

The category equaling zero is often called the "reference category." (E.g., for the dummy variable called F, above, male is the reference category.) Dummy variables act like a "switch," changing a characteristic into something else. Dummy variables can also provide a convenient way to combine several different equations into a single equation.

Case 1: two equations differing only in terms of the intercept (i.e., they have the same slope)

    two equations:              written as a single equation:
    men:   S = 0.15 + 0.20 X    using M dummy: S = 0.05 + 0.20 X + 0.10 M
    women: S = 0.05 + 0.20 X    using F dummy: S = 0.15 + 0.20 X - 0.10 F

For example, using the M dummy, consider the single equation above and to the right:

    for women, M = 0, so the equation is S = 0.05 + 0.20 X + 0.10 × 0 = 0.05 + 0.20 X
    for men,   M = 1, so the equation is S = 0.05 + 0.20 X + 0.10 × 1 = 0.15 + 0.20 X

Thus, we have written both equations as a single equation using the M dummy — the one equation with the M dummy is exactly the same as the two separate equations. This is equally true of the single equation using the F dummy (can you verify this?). Thus, it doesn't matter whether we use an M dummy (making women the reference category) or an F dummy (making men the reference category). Using the M dummy tells us that the male intercept is 0.10 higher than the female intercept; using the F dummy tells us that the female intercept is 0.10 lower than the male intercept. Thus, either approach yields exactly the same information.

Case 2: two equations with different intercepts and also different slopes

    two equations:              written as a single equation:
    men:   S = 0.15 + 0.40 X    using M dummy: S = 0.05 + 0.21 X + 0.10 M + 0.19 X × M
    women: S = 0.05 + 0.21 X    using F dummy: S = 0.15 + 0.40 X - 0.10 F - 0.19 X × F

For example, using the M dummy:

    for women, M = 0, so the equation is S = 0.05 + 0.21 X + 0.10 × 0 + 0.19 X × 0 = 0.05 + 0.21 X
    for men,   M = 1, so the equation is S = 0.05 + 0.21 X + 0.10 × 1 + 0.19 X × 1 = 0.15 + 0.40 X
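As a quick check (a sketch, not part of the original notes), the following Python evaluates the single-equation M-dummy forms for both cases and confirms that they reproduce the separate male and female equations:

    # Case 1: same slope, different intercepts, combined using the M dummy.
    def case1(X, M):
        return 0.05 + 0.20 * X + 0.10 * M

    # Case 2: different intercepts and slopes, combined using M and the X*M interaction.
    def case2(X, M):
        return 0.05 + 0.21 * X + 0.10 * M + 0.19 * X * M

    for X in [0.0, 1.0, 2.5, 10.0]:
        assert abs(case1(X, 1) - (0.15 + 0.20 * X)) < 1e-12  # men's Case 1 line
        assert abs(case1(X, 0) - (0.05 + 0.20 * X)) < 1e-12  # women's Case 1 line
        assert abs(case2(X, 1) - (0.15 + 0.40 * X)) < 1e-12  # men's Case 2 line
        assert abs(case2(X, 0) - (0.05 + 0.21 * X)) < 1e-12  # women's Case 2 line
    print("single-equation dummy forms match the separate equations at all test points")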
Dummy variables involving multiple categories

The above examples refer to dummy variables representing exactly two categories (e.g., male/female, before/after, pass/fail, under 21/21 or older, etc.). Note that we need only one dummy variable to represent the state of being, or not being, in either of these two categories. A "one" refers to an observation that is in one category (e.g., male), and a "zero" for the same variable refers to an observation that is in the only other ("reference") category (e.g., female).

More generally, a variable that has K mutually exclusive and exhaustive categories can be represented using K - 1 dummy variables; the state of being in the "reference" category is indicated by a situation in which all of the K - 1 variables equal zero. ("Mutually exclusive" means that each person, or object, can be in one and only one of the categories; "exhaustive" means that the categories cover all of the possible outcomes for that variable.)

For example, suppose there are three kinds of college majors: math/science, social science, and humanities. (Note that these categories must be mutually exclusive, so that there are no double majors; and must also be exhaustive, so that all possible majors fit into one of these three categories.) Since there are three categories (K = 3), we need K - 1 = 2 dummy variables to represent everyone's major:

    D1 = 1 if majoring in math/science
    D1 = 0 otherwise (i.e., if majoring in social science or humanities)
    D2 = 1 if majoring in social science
    D2 = 0 otherwise (i.e., if majoring in math/science or humanities)

How do we represent the fact of majoring in humanities, the third category? If D1 = 0 and D2 = 0, then we know this is not a major in math/science, and we also know this is not a major in social science — so it must be a major in humanities. (Just the same logic applies to a single dummy variable: if M = 0, we know this is not a male, so it must be a female. Thus, with two sex categories, K = 2, and so we only need one dummy variable (K - 1 = 1) to represent the fact of being in either of the two categories.)

Thus, for example, a regression equation for salary (S) that took account of age (A), sex (M), and college major (using the dummy variables D1 and D2) might look like this:

    S = 20000 + 500 A + 450 M + 1200 D1 - 400 D2

Note that this setup assumes that there may be different intercepts (but not different slopes) for the salary lines for the different majors:

    for math/science,   D1 = 1 and D2 = 0, so S = 20000 + 500 A + 450 M + 1200 = 21200 + 500 A + 450 M
    for social science, D1 = 0 and D2 = 1, so S = 20000 + 500 A + 450 M - 400  = 19600 + 500 A + 450 M
    for humanities,     D1 = 0 and D2 = 0, so S = 20000 + 500 A + 450 M
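A small sketch (not from the notes) that evaluates this salary equation for each major, at fixed values of A and M, confirms the three major-specific intercepts; the chosen A = 25 and M = 1 are arbitrary:

    def salary(A, M, D1, D2):
        # Salary equation from the notes; humanities is the reference category.
        return 20000 + 500 * A + 450 * M + 1200 * D1 - 400 * D2

    # Dummy codings for the three mutually exclusive, exhaustive majors.
    majors = {"math/science": (1, 0), "social science": (0, 1), "humanities": (0, 0)}

    for name, (D1, D2) in majors.items():
        print(f"{name:14s}  S = {salary(25, 1, D1, D2):,}")
    # Prints 34,150 / 32,550 / 32,950: +1200 and -400 relative to humanities (32,950).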
Note that interpreting the coefficients for these dummy variables (1200 for D1, -400 for D2) can be a little tricky. To begin with, remember that the regression coefficient for a particular independent variable always tells us the effect on the dependent variable of an increase of one unit in that independent variable, with all other independent variables remaining unchanged. Thus, for example, the coefficient of 1200 on the dummy variable D1 tells us the effect on salary (S) of increasing D1 by one unit, with all other variables unchanged. Now, how is it possible to increase D1 by one unit while leaving all of the other independent variables unchanged? We can do this only by increasing D1 from zero to 1, while leaving A, M, and D2 unchanged. Note, however, that if we increase D1 from zero to 1 but do not change D2, we must be changing major from humanities (with D1 = 0 and D2 = 0) to math/science (with D1 = 1 and D2 = 0). So the coefficient on D1, 1200, represents the effect on salary of switching major from humanities to math/science, other things (A and M) being equal.

Note that we would get the same result if we assume identical values for A and M and then compare average salaries for math/science and humanities majors. From the above,

    the math/science equation is S = 21200 + 500 A + 450 M
    the humanities equation is   S = 20000 + 500 A + 450 M

If we assume that A and M are the same in both equations, and then subtract the two equations, we will get 21200 - 20000 = 1200 as the difference in pay between the two majors, other things (A and M) being equal. (Of course, this is the same result we just got by looking directly at the coefficient for D1.) Likewise, the effect on pay (other things being equal) of changing major from humanities to social science is -400. Note that this is the effect of switching D2 from 0 to 1, while leaving D1 unchanged (and equal to zero).

Finally, how about other kinds of "major" effects — for example, what is the effect of changing major from social science to math/science, other things being equal? Note that this is a little more complicated, because in this case we are turning off the "switch" for being a social science major and turning on the "switch" for being a math/science major. More precisely, we have to change D2 from 1 to zero (because we are no longer a social science major) and also change D1 from 0 to 1 (because we are turning into a math/science major). Switching D2 from 1 to 0 reduces D2 by one unit, and therefore changes pay by -400 × -1 = +400. Switching D1 from 0 to 1 raises D1 by one unit, and therefore changes pay by a further +1200 × +1 = +1200. So the effect of changing major from social science to math/science (other things being equal) is 400 + 1200 = +1600.

Allowing regression slopes (as well as intercepts) to depend on multi-category dummy variables

We previously saw that it is straightforward to allow regression slopes (as well as intercepts) to depend on a two-category dummy variable (we considered the case in which a slope in a regression depends on a variable for sex, M). The extension to multi-category dummy variables is straightforward. For example, suppose that salary grows with age (A) and depends on sex (M) at different rates for persons with different majors. To allow for this possibility, you can run a regression that includes so-called "interaction ...
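The preview cuts off here, but the Case 2 pattern above extends directly to this setting: interacting each major dummy with age gives each major its own age slope. Below is a hedged sketch with purely hypothetical interaction coefficients (300 and -100), since the notes' own numbers are not shown in the preview:

    # Hypothetical extension: intercepts AND age slopes differ by major.
    # The 300 and -100 interaction coefficients are illustrative, not from the notes.
    def salary_with_interactions(A, M, D1, D2):
        return (20000 + 500 * A + 450 * M
                + 1200 * D1 + 300 * (D1 * A)   # math/science: own intercept and age-slope shift
                - 400 * D2 - 100 * (D2 * A))   # social science: own intercept and age-slope shift

    # Implied age slopes: humanities 500; math/science 500 + 300 = 800;
    # social science 500 - 100 = 400, exactly parallel to the X*M term in Case 2.
    print(salary_with_interactions(25, 1, 1, 0))  # e.g., a 25-year-old male math/science major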