Notes on regression analysis

Basics in regression analysis -- key concepts (actual implementation is more complicated)

A. Collect data.
B. Plot the data on a graph; draw a line through the middle of the scatter of points.
C. Determine the intercept and slope(s) of the line (called regression parameters or coefficients).
D. Use statistical theory and the data to determine the "margin of error," "standard error," and "t-statistic" for each regression coefficient.

Example: class survey -- results are displayed as an equation or in a table

A. as an equation:

    earnings per hour = 7.60 - 3.09 Female + 0.79 Age + 1.35 GPA
      standard error:  (20.30)  (1.61)       (0.74)     (2.31)
      t-statistic:     [0.37]   [1.92]       [1.06]     [0.58]

    log(earnings per hour) = 0.37 - 0.23 Female + 0.08 Age + 0.15 GPA
      standard error:       (1.17)  (0.09)       (0.04)     (0.13)
      t-statistic:          [0.32]  [2.44]       [1.83]     [1.12]
B. in a table:

    regression coefficients (t-statistics in parentheses)

                            dependent variable is earnings/hr
    independent variable        in $          in logs
    female                     -3.09           -0.23
                               (1.92)          (2.44)
    age                         0.79            0.08
                               (1.06)          (1.83)
    GPA                         1.35            0.15
                               (0.58)          (1.12)

Statistical significance: margin of error, standard error, t-statistic

"margin of error" = ± {2 × standard error} or, more precisely, ± {1.96 × standard error}
t-statistic = | regression parameter | ÷ standard error

if t-statistic ≥ 2 (more precisely, 1.96), then the parameter is "statistically significant"
(it's unlikely we'd get an estimate of this parameter at least this far away from zero, if the true value of the parameter were actually zero)

if t-statistic < 2 (or, more precisely, 1.96), then the parameter is "not statistically significant"
(it's not unlikely we'd get an estimate of this parameter at least this far away from zero, if the true value of the parameter were actually zero)
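To make this arithmetic concrete, here is a minimal Python sketch (our own illustration, not part of the original notes) that recomputes the t-statistics for the dollar version of the class-survey equation from the coefficients and standard errors shown above:

    # Recompute t-statistics for the class-survey regression (in $).
    # The numbers are copied from the example above; the names are ours.
    coefficients = {"intercept": 7.60, "female": -3.09, "age": 0.79, "gpa": 1.35}
    standard_errors = {"intercept": 20.30, "female": 1.61, "age": 0.74, "gpa": 2.31}

    for name, beta in coefficients.items():
        t = abs(beta) / standard_errors[name]   # t-statistic = |coefficient| / standard error
        print(f"{name:9s} t = {t:4.2f}  significant at 5%: {t >= 1.96}")

Running this reproduces the bracketed t-statistics in the example (up to rounding of the reported coefficients); none of the dollar-specification t-statistics reaches 1.96.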
The regression coefficient on a particular independent variable represents the effect on the dependent variable of an increase of one unit in that independent variable, with all other independent variables remaining unchanged; e.g., the above coefficients imply that...

- changing the sex variable from 0 (male) to 1 (female), with age and GPA remaining unchanged, would reduce earnings by $3.09 (or about 23%) (and this effect is statistically significant in the log specification, where t = 2.44)
- changing the age variable by one year, with sex and GPA remaining unchanged, would raise earnings by $0.79 (or about 8%) (but this effect is not statistically significant)
- changing the GPA variable by one unit, with sex and age remaining unchanged, would raise earnings by $1.35 (or about 15%) (but this effect is not statistically significant)

Statistical significance calculations

To understand the logic of statistical significance calculations, consider a simple example involving opinion polls.
We conduct a poll of 1,000 people and ask how many of them favor candidate A. We record the answers, and find that 52.1 percent of the people in this poll favor candidate A and 47.9 percent favor candidate B. (Thus, according to this poll, candidate A leads by 52.1 - 47.9 = 4.2 percentage points.) We conduct another poll of 1,000 people and find that in this second poll, 54.3 percent of the people favor candidate A and 45.7 percent favor candidate B (so that, in this second poll, A's lead over B is 8.6 percentage points). We conduct another poll, and another, and so on, until we have the results of thousands of polls, each of which has surveyed 1,000 people at exactly the same moment.
Statistical theory tells us the following: (1) the distribution of the results of all of these polls will be centered on the percentage of the people in the population as a whole who favor A; and (2) the distribution of the results will look like a bell-shaped ("normal") curve.

One of the characteristics of the normal curve is that 2.5% of the observations are located in the left "tail" of the distribution (i.e., well below the mean or "average" value); and another 2.5% are located in the right tail (i.e., well above the mean). More precisely, if we measure the results not in percentage terms but in units of "standard errors," 2.5% of the observations will be more than 1.96 standard errors below the mean; and another 2.5% will be more than 1.96 standard errors above the mean. (See below. It's conventional to use the Greek letter σ, which is the lowercase "sigma," as the symbol for the standard error. The value of the standard error is calculated from the data, using formulas determined by rules of statistical inference.) In other words, it is rare for observations to be 1.96 (or, roughly speaking, 2.0) or more standard errors away from the mean -- this occurs only about 5% of the time. Similarly, most of the time (95% of the time, to be precise), the poll will produce a result that is less than 1.96 standard errors away from the mean.

[Figure: bell-shaped normal curve centered on the mean, with the 2.5% tails marked at 1.96σ below and 1.96σ above the mean]
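This repeated-polling thought experiment is easy to mimic by simulation. The sketch below (our own illustration; the 52% true support level is an assumed value, not from the notes) draws 10,000 polls of 1,000 people each and confirms that about 95% of the results fall within 1.96 standard errors of the true value:

    # Simulate many polls of n = 1,000 respondents each.
    import random

    true_support = 0.52          # assumed population support for candidate A
    n, polls = 1000, 10_000

    results = [sum(random.random() < true_support for _ in range(n)) / n
               for _ in range(polls)]

    se = (true_support * (1 - true_support) / n) ** 0.5   # standard error of one poll
    mean = sum(results) / polls
    within = sum(abs(r - true_support) <= 1.96 * se for r in results) / polls

    print(f"average poll result: {mean:.3f} (true value {true_support})")
    print(f"one standard error:  {100 * se:.2f} percentage points")
    print(f"within 1.96 SE:      {within:.1%} (theory says about 95%)")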
Now suppose we want to use the results of a single opinion poll to determine how the population feels about candidate A. (Thus, we are no longer imagining that we can take thousands upon thousands of such polls.) Suppose that, in this single poll, 57.1 percent of the people support candidate A. We know that, simply as the result of chance, there is some possibility that we could get a poll result that is at least this much different from 50.0 percent even if, in the population as a whole, people are divided 50-50 over the two candidates. The logic of statistical significance then proceeds as follows:

We start by assuming that the population is evenly divided between the two candidates.
Equivalently, we assume that there is no difference between the degree of support for the two candidates in the population. (This assumption of "no difference" is usually called the "null hypothesis," and it is the starting point for all significance testing.) Next, we ask, "How likely is it that, if the population were evenly divided between the two candidates, we could nevertheless get a poll result like ours -- one that is 7.1 percentage points away from 50.0 percent?" Next, we use the data to calculate the standard error of the population mean, and we measure the difference between our poll's result and the assumed 50-50 split using units of standard errors. For example, if one standard error is 3.0 percentage points, then the 7.1 percentage-point difference in this poll (57.1 - 50.0) is equivalent to 7.1 ÷ 3.0 ≈ 2.37 standard errors.
"statistically signiﬁcant." In other words, in this case, if the population were truly evenly divided
between the two candidates (so that each candidate enjoyed the support of 50.0 percent of the population),
it is very unlikely (there is less than a 5% probability) that, simply as a result of sampling variation or
"chance" a poll would show that one candidate's support differs from 50.0 percent by as much as this
candidate's support does. (In other words, if the population were truly evenly divided, it is very unlikely
that, simply because of sampling variation, an opinion poll would show that one candidate is ahead by at
least 7.1 percentage points — which is what we actually see here.) On the other hand, suppose that the difference between the poll's result and the assumed 50—50
split is equivalent to something less than 1.96 standard errors. For example, suppose again that 57.1
percent of the people favor candidate A, but that one standard error is equal to 5.0 percentage points.
Then, expressed in terms of standard error units, 7.1 percentage points (= the extent to which our poll's
result departs from a 5050 split) is 7.1 + 5.0 = 1.42, which is less than 1.96. In this case, the difference
between our poll result and an assumed 5050 split is called "not statistically signiﬁcan ." In other words,
in this case, if the population were truly evenly divided between the two candidates, it is not unlikely
(There is better than a 5% probability) that, simply as a result of sampling variation ("chance"), we would
get a poll result showing a lead for one candidate that is as different from 50% as this one. In intuitive terms, it may be helpful to think of signiﬁcance testing as a kind of courtroom
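Both numerical cases above boil down to a single division. A minimal sketch (the function name is ours, introduced only for illustration):

    # Compare a single poll result with the 50-50 null hypothesis.
    def standard_error_units(poll_percent, se_points):
        """How many standard errors the poll is from an even split."""
        return abs(poll_percent - 50.0) / se_points

    for se in (3.0, 5.0):
        z = standard_error_units(57.1, se)
        print(f"SE = {se} points: {z:.2f} standard errors, "
              f"significant at 5%: {z >= 1.96}")

With SE = 3.0 points this prints about 2.37 (significant); with SE = 5.0 points it prints 1.42 (not significant), matching the two cases in the text.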
In intuitive terms, it may be helpful to think of significance testing as a kind of courtroom exercise. We begin by adopting a "null hypothesis," i.e., that there is no difference between two things. (In the example here, we assumed that there is no difference in support for the candidates.) In courtroom terms, this would be like an assumption of "innocent until proven guilty." We then consider the statistical evidence. If the evidence seems very far away from the null hypothesis (in terms of "standard deviation" units), we reject the null hypothesis on the grounds that the evidence isn't really consistent with the assumption of no difference. Likewise, if the courtroom evidence seems very far away from the assumption of innocence, we reject the assumption of innocence.

A few caveats and qualifications are appropriate here.
First, statistical significance is not the same as "significance" in the ordinary, everyday sense. In everyday speech, "significant" simply means "big," "sizeable," etc. In contrast, even a difference that is small can nevertheless be "significant" in a statistical sense. Likewise, a difference that is "large" in the ordinary-language sense will not necessarily be significant in a statistical sense. For example, we might sample ten men and four women and find that the sex difference in pay in this sample is $5,000 per year. This is certainly significant in the everyday sense ($5,000 is a lot of money!), but it might not be statistically significant, because it's possible that mere sampling variation alone could produce a difference of at least this size.
Second, it is not correct to say that a statistically significant difference is one for which there is less than a 5% probability that it occurred due to chance; and it is equally incorrect to say that a statistically significant difference is one for which there is more than a 95% probability that it occurred due to something other than chance. Neither of these statements correctly characterizes the logic that underlies statistical significance testing. Rather, such testing proceeds as follows. First, we adopt a "null hypothesis" of a zero difference in the underlying population. (For example, if we are testing for a difference between the average pay of men and women, we assume no sex difference in pay for the underlying population; if we are testing for a difference in support for two candidates, we assume a 50-50 split in the underlying population; and so on.) Second, we ask: if there were a zero difference in the underlying population, how likely is it that we would get a difference in a sample that is at least as far away from zero as is the difference in our sample? If the difference obtained in our sample is "far away" from zero (i.e., is at least 1.96 standard errors away from zero), then the difference is "statistically significant" -- i.e., it is unlikely that chance alone could produce such a result. In other words, less than 5% of the time, chance would produce a result at least as far away from zero as our result is, if the null hypothesis of zero difference were correct; and more than 95% of the time, chance alone would produce a result that is closer to zero than ours is -- again, if the null hypothesis were correct.

Notes on indicator/dichotomous/binary/dummy variables

Dummy variables are equal to 1 if the observation is in a specific category, and equal to 0 otherwise.

e.g., sex:  F = 1 if person is female, = 0 otherwise (i.e., if male)
            M = 1 if person is male, = 0 otherwise (i.e., if female)

The category equaling zero is often called the "reference category." (E.g., for the dummy variable called F, above, male is the reference category.)

Dummy variables act like a "switch," changing a characteristic into something else. Dummy variables can also provide a convenient way to combine several different equations into a single equation.

Case 1: two equations differing only in terms of the intercept (have the same slope)

two equations:                written as a single equation:
men:   S = 0.15 + 0.20 X      using M dummy: S = 0.05 + 0.20 X + 0.10 M
women: S = 0.05 + 0.20 X      using F dummy: S = 0.15 + 0.20 X - 0.10 F

for example, using the M dummy, consider the single equation above and to the right:

for women, M = 0, so the equation is S = 0.05 + 0.20 X + 0.10 × 0 = 0.05 + 0.20 X
for men,   M = 1, so the equation is S = 0.05 + 0.20 X + 0.10 × 1 = 0.15 + 0.20 X
Thus, we have written both equations as a single equation using the M dummy -- the one equation with the M dummy is exactly the same as the two separate equations.

This is equally true of the single equation using the F dummy (can you verify this?). Thus, it doesn't matter whether we use an M dummy (making women the reference category) or an F dummy (making men the reference category). Using the M dummy tells us that the male intercept is 0.10 higher than the female intercept; using the F dummy tells us that the female intercept is 0.10 lower than the male intercept. Thus, either approach yields exactly the same information.
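The equivalence can also be checked numerically. The sketch below (our own check, not part of the notes, using numpy) generates points lying exactly on the two Case 1 lines and fits the single-equation version with the M dummy; the fit recovers the intercept 0.05, the common slope 0.20, and the dummy coefficient 0.10:

    # Verify Case 1: one equation with an M dummy equals the two group lines.
    import numpy as np

    X = np.tile(np.arange(1.0, 6.0), 2)        # five X values for each group
    M = np.repeat([1.0, 0.0], 5)               # 1 = men, 0 = women
    S = np.where(M == 1, 0.15 + 0.20 * X, 0.05 + 0.20 * X)

    design = np.column_stack([np.ones_like(X), X, M])   # intercept, X, M
    coef, *_ = np.linalg.lstsq(design, S, rcond=None)
    print(np.round(coef, 2))                   # [0.05 0.2  0.1 ]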
Case 2: two equations with different intercepts and also different slopes

two equations:                written as a single equation:
men:   S = 0.15 + 0.40 X      using M dummy: S = 0.05 + 0.21 X + 0.10 M + 0.19 X × M
women: S = 0.05 + 0.21 X      using F dummy: S = 0.15 + 0.40 X - 0.10 F - 0.19 X × F

for example, using the M dummy:

for women, M = 0, so the equation is S = 0.05 + 0.21 X + 0.10 × 0 + 0.19 X × 0
                                       = 0.05 + 0.21 X
for men,   M = 1, so the equation is S = 0.05 + 0.21 X + 0.10 × 1 + 0.19 X × 1
                                       = 0.15 + 0.40 X
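The same numerical check extends to Case 2 by adding the interaction column X × M to the design matrix (again our own illustration, under the same assumptions as the previous sketch):

    # Verify Case 2: the interaction term X*M lets the slope differ by sex.
    import numpy as np

    X = np.tile(np.arange(1.0, 6.0), 2)
    M = np.repeat([1.0, 0.0], 5)
    S = np.where(M == 1, 0.15 + 0.40 * X, 0.05 + 0.21 * X)

    design = np.column_stack([np.ones_like(X), X, M, X * M])
    coef, *_ = np.linalg.lstsq(design, S, rcond=None)
    print(np.round(coef, 2))                   # [0.05 0.21 0.1  0.19]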
Dummy variables involving multiple categories

The above examples refer to dummy variables representing exactly two categories (e.g., male/female, before/after, pass/fail, under 21/21 or older, etc.). Note that we need only one dummy variable to represent the state of being, or not being, in either of these two categories. A "one" refers to an observation that is in one category (e.g., male), and a "zero" for the same variable refers to an observation that is in the only other ("reference") category (e.g., female).

More generally, a variable that has K mutually exclusive and exhaustive categories can be represented using K - 1 dummy variables; and the state of being in the "reference" category is indicated by a situation in which all of the K - 1 variables equal zero. ("Mutually exclusive" means that each person, or object, can be in one and only one of the categories; "exhaustive" means that the categories cover all of the possible outcomes for that variable.)
For example, suppose there are three kinds of college majors: math/science, social science, and humanities. (Note that these categories must be mutually exclusive, so that there are no double majors; and must also be exhaustive, so that all possible majors fit into one of these three categories.) Since there are three categories (K = 3), we need K - 1 = 2 dummy variables to represent everyone's major:

D1 = 1 if majoring in math/science
D1 = 0 otherwise (i.e., if majoring in social science or humanities)

D2 = 1 if majoring in social science
D2 = 0 otherwise (i.e., if majoring in math/science or humanities)

How do we represent the fact of majoring in humanities, the third category? If D1 = 0 and D2 = 0, then we know this is not a major in math/science, and we also know this is not a major in social science -- so it must be a major in humanities. (Just the same logic applies to a single dummy variable: if M = 0, we know this is not a male, so it must be a female. Thus, with two sex categories, K = 2, and so we only need one dummy variable (K - 1 = 1) to represent the fact of being in either of the two categories.)

Thus, for example, a regression equation for salary (S) that took account of age (A), sex (M), and college major (using the dummy variables D1 and D2) might look like this:

S = 20000 + 500 A + 450 M + 1200 D1 - 400 D2
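In code, the two dummies can be built directly from the category labels. The sketch below (our illustration; the function names are ours) encodes a major as (D1, D2), with humanities as the reference category, and evaluates the salary equation above:

    # K = 3 majors need K - 1 = 2 dummies; humanities is the reference.
    def major_dummies(major):
        return {"math/science":   (1, 0),
                "social science": (0, 1),
                "humanities":     (0, 0)}[major]

    def salary(age, male, major):
        d1, d2 = major_dummies(major)
        return 20000 + 500 * age + 450 * male + 1200 * d1 - 400 * d2

    for m in ("math/science", "social science", "humanities"):
        print(f"{m:15s}: {salary(30, 1, m)}")   # 36650, 35050, 35450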
Note that this setup assumes that there may be different intercepts (but not different slopes) for the salary lines for the different majors:

for math/science,   D1 = 1 and D2 = 0, so S = 20000 + 500 A + 450 M + 1200 = 21200 + 500 A + 450 M
for social science, D1 = 0 and D2 = 1, so S = 20000 + 500 A + 450 M - 400  = 19600 + 500 A + 450 M
for humanities,     D1 = 0 and D2 = 0, so S = 20000 + 500 A + 450 M
Note that interpreting the coefficients for these dummy variables (1200 for D1, -400 for D2) can be a little tricky. To begin with, remember that the regression coefficient for a particular independent variable always tells us the effect on the dependent variable of an increase of one unit in that independent variable, with all other independent variables remaining unchanged. Thus, for example, the coefficient of 1200 on the dummy variable D1 tells us the effect on salary (S) of increasing D1 by one unit, with all other variables unchanged.

Now, how is it possible to increase D1 by one unit while leaving all of the other independent variables unchanged? We can do this only by increasing D1 from zero to 1, while leaving A, M, and D2 unchanged. Note, however, that if we increase D1 from zero to 1 but do not change D2, we must be changing major from humanities (with D1 = 0 and D2 = 0) to math/science (with D1 = 1 and D2 = 0). So the coefficient on D1, 1200, represents the effect on salary of switching major from humanities to math/science, other things (A and M) being equal.
Note that we would get the same result if we assume identical values for A and M and then compare average salaries for math/science and humanities majors: from the above,

math/science equation is S = 21200 + 500 A + 450 M
humanities equation is   S = 20000 + 500 A + 450 M

If we assume that A and M are the same in both equations, and then subtract the two equations, we will get 21200 - 20000 = 1200 as the difference in pay between the two majors, other things (A and M) being equal. (Of course, this is the same result we just got by looking directly at the coefficient for D1.)

Likewise, the effect on pay (other things being equal) of changing major from humanities to social science is -400. Note that this is the effect of switching D2 from 0 to 1, while leaving D1 unchanged (and equal to zero).
Finally, how about other kinds of "major" effects -- for example, what is the effect of changing major from social science to math/science, other things being equal? Note that this is a little more complicated, because in this case we are turning off the "switch" for being a social science major and turning on the "switch" for being a math/science major. More precisely, we have to change D2 from 1 to zero (because we are no longer a social science major) and also change D1 from 0 to 1 (because we are turning into a math/science major). Switching D2 from 1 to 0 reduces D2 by one unit, and therefore changes pay by -400 × (-1) = +400. Switching D1 from 0 to 1 raises D1 by one unit, and therefore changes pay by a further +1200 × (+1) = +1200. So the effect of changing major from social science to math/science (other things being equal) is +400 + 1200 = +1600.
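In code, this bookkeeping is just each coefficient times the change in its dummy (a one-line check of the arithmetic above):

    # Effect of switching major from social science to math/science:
    # D1 goes from 0 to 1, D2 goes from 1 to 0.
    effect = 1200 * (1 - 0) + (-400) * (0 - 1)
    print(effect)   # 1600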
Allowing regression slopes (as well as intercepts) to depend on multi-category dummy variables

We previously saw that it is straightforward to allow regression slopes (as well as intercepts) to depend on a two-category dummy variable (we considered the case in which a slope in a regression depends on a variable for sex, M). The extension to multi-category dummy variables is straightforward. For example, suppose that salary grows with age (A) and depends on sex (M) at different rates for persons with different majors. To allow for this possibility, you can run a regression that includes so-called "interaction ...