Things I Have Learned (So Far)

Jacob Cohen
New York University

ABSTRACT: This is an account of what I have learned
(so far) about the application of statistics to psychology
and the other sociobiomedical sciences. It includes the
principles "less is more" (fewer variables, more highly
targeted issues, sharp rounding off), “simple is better”
(graphic representation, unit weighting for linear
composites), and “some things you learn aren’t so.” I have
learned to avoid the many misconceptions that surround
Fisherian null hypothesis testing. I have also learned the
importance of power analysis and the determination of
just how big (rather than how statistically significant) are
the effects that we study. Finally, I have learned that there
is no royal road to statistical induction, that the informed
judgment of the investigator is the crucial element in the
interpretation of data, and that things take time.

What I have learned (so far) has come from working with
students and colleagues, from experience (sometimes
bitter) with journal editors and review committees, and from
the writings of, among others, Paul Meehl, David Bakan,
William Rozeboom, Robyn Dawes, Howard Wainer,
Robert Rosenthal, and more recently, Gerd Gigerenzer,
Michael Oakes, and Leland Wilkinson. Although they
are not always explicitly referenced, many of you will be
able to detect their footprints in what follows.

Some Things You Learn Aren’t So

One of the things I learned early on was that some things
you learn aren’t so. In graduate school, right after World
War II, I learned that for doctoral dissertations and most
other purposes, when comparing groups, the proper
sample size is 30 cases per group. The number 30 seems to
have arisen from the understanding that with fewer than
30 cases, you were dealing with “small” samples that
required specialized handling with “small-sample statistics”
instead of the critical-ratio approach we had been taught.
Some of us knew about these exotic small-sample
statistics. In fact, one of my fellow doctoral candidates
undertook a dissertation, the distinguishing feature of which
was a sample of only 20 cases per group, so that he could
demonstrate his prowess with small-sample statistics. It
wasn’t until some years later that I discovered (mind you,
not invented) power analysis, one of whose fruits was the
revelation that for a two-independent-group-mean
comparison with n = 30 per group at the sanctified two-tailed
.05 level, the probability that a medium-sized effect would
be labeled as significant by the most modern methods (a
t test) was only .47. Thus, it was approximately a coin
flip whether one would get a significant result, even
though, in reality, the effect size was meaningful. My
n = 20 friend’s power was rather worse (.33), but of course
he couldn’t know that, and he ended up with nonsignificant
results, with which he proceeded to demolish an
important branch of psychoanalytic theory.
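
The two power figures above can be roughly checked with a few lines of Python. This is only a sketch under stated assumptions: it takes a "medium-sized" effect to be a standardized mean difference of d = .5 (Cohen's usual convention, not spelled out in this passage), and it uses the large-sample normal approximation rather than the noncentral t distribution, so its answers run a point or two above the exact values. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_group(d, n_per_group, alpha=0.05):
    """Approximate two-tailed power for comparing two independent means.

    Normal approximation to the t test (an assumption of this sketch);
    exact noncentral-t power is slightly lower at small n.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-tailed critical value
    delta = d * sqrt(n_per_group / 2)      # expected size of the test statistic
    # Probability of falling beyond either critical value.
    return (1 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)


power_30 = approx_power_two_group(0.5, 30)   # d = .5 is an assumed "medium" effect
power_20 = approx_power_two_group(0.5, 20)
print(f"n = 30 per group: approximate power {power_30:.2f}")
print(f"n = 20 per group: approximate power {power_20:.2f}")
```

The approximation gives roughly .49 and .35, close to the t-based .47 and .33 quoted in the text, and it makes the same point: neither design had much better than an even chance of detecting a medium effect.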

Less Is More

One thing I learned over a long period of time that is so
is the validity of the general principle that less is more,
except of course for sample size (Cohen & Cohen, 1983,
pp. 169-171). I have encountered too many studies with
prodigious numbers of dependent variables, or with what
seemed to me far too many independent variables, or
(heaven help us) both.

In any given investigation that isn’t explicitly
exploratory, we should be studying few independent
variables and even fewer dependent variables, for a variety of reasons.
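
One of those reasons is sheer multiplicity, which the next paragraph works through for 6 dependent and 10 independent variables. As a minimal sketch of that arithmetic (the variable and function names are illustrative, and the 60 tests are treated as independent, which is a simplifying assumption):

```python
from math import comb

n_dependent = 6
n_independent = 10
alpha = 0.05

n_tests = n_dependent * n_independent            # 60 bivariate tests
# Chance of at least one "significant" result when every null is true,
# assuming (for simplicity) that the tests are independent.
p_at_least_one = 1 - (1 - alpha) ** n_tests
expected_spurious = n_tests * alpha              # about three false alarms
bonferroni_alpha = alpha / n_tests               # per-test criterion


def binomial_upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# How surprising are six or more asterisks when three are expected by chance?
p_six_or_more = binomial_upper_tail(6, n_tests, alpha)

print(f"P(at least one false alarm) = {p_at_least_one:.3f}")
print(f"Expected false alarms       = {expected_spurious:.1f}")
print(f"Bonferroni per-test alpha   = {bonferroni_alpha:.5f}")
print(f"P(6 or more by chance)      = {p_six_or_more:.3f}")
```

Under these assumptions the chance of six or more spurious asterisks comes out a little under .08, which is above .05: six is not significantly more than the chance-expected three.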

If all of the dependent variables are to be related to all of
the independent variables by simple bivariate
analyses or multiple regression, the number of hypothesis tests
that will be performed willy-nilly is at least the product
of the sizes of the two sets. Using the .05 level for many
tests escalates the experimentwise Type I error rate, or
in plain English, greatly increases the chances of
discovering things that aren’t so. If, for example, you study 6
dependent and 10 independent variables and should find
that your harvest yields 6 asterisks, you know full well
that if there were no real associations in any of the 60
tests, the chance of getting one or more “significant”
results is quite high (something like 1 - .95^60, which equals,
coincidentally, .95), and that you would expect three
spuriously significant results on the average. You then must
ask yourself some embarrassing questions, such as, Well,
which three are real?, or even, Is six significantly
more than the chance-expected three? (It so
happens that it isn’t.)

And of course, as you’ve probably discovered, you’re
not likely to solve your multiple tests problem with the
Bonferroni maneuver. Dividing .05 by 60 sets a per-test
significance criterion of .05/60 = 0.00083, and therefore
a critical two-sided t value of about 3.5. The effects you’re
dealing with may not be large enough to produce any
interesting t values that high, unless you’re lucky.

Nor can you find salvation by doing six stepwise
multiple regressions on the 10 independent variables. The
amount of capitalization on chance that this entails is
more than I know how to compute, but certainly more
than would a simple harvest of asterisks for 60 regression
coefficients (Wilkinson, 1990, p. 481).

In short, the results of this humongous study are a
muddle. There is no solution to your problem. You
wouldn’t, of course, write up the study for publication as
if the unproductive three quarters of your variables never
existed. . . .

December 1990, American Psychologist, Vol. 45, No. 12, 1304-1312.
Copyright 1990 by the American Psychological Association, Inc. 0003-066X/90/$00.75

The irony is that people who do studies like this
often start off with some useful central idea that, if
pursued modestly by means of a few highly targeted variables
and hypotheses, would likely produce significant results.
These could, if propriety or the consequences of early
toilet training deemed it necessary, successfully withstand
the challenge of a Bonferroni or other
experimentwise-adjusted alpha procedure.

A special case of the too-many-variables problem
arises in multiple regression-correlation analysis with
large numbers of independent variables. As the number
of independent variables increases, the chances are that
their redundancy in regard to criterion relevance also
increases. Because redundancy increases the standard
errors of partial regression and correlation coefficients and
thus reduces their statistical significance, the results are
likely to be zilch.

I have so heavily emphasized the desirability of
working with few variables and large sample sizes that
some of my students have spread the rumor that my idea
of the perfect study is one with 10,000 cases and no
variables. They go too far.

A less profound application of the less-is-more
principle is to our habits of reporting numerical results. There
are computer programs that report by default four, five,
or even more decimal places for all numerical results.
Their authors might well be excused because, for all the
programmer knows, they may be used by atomic
scientists. But we social scientists should know better than to
report our results to so many places. What, pray, does an
r = .12345 mean? or, for an IQ distribution, a mean of
105.6345? For N = 100, the standard error of the r is
about .1 and the standard error of the IQ mean about
1.5. Thus, the 345 part of r = .12345 is only 3% of its
standard error, and the 345 part of the IQ mean of
105.6345 is only 2% of its standard error. These
superfluous decimal places are no better than random numbers.
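
The standard-error arithmetic above can be reproduced in a few lines. This is a sketch under two assumptions that the passage does not state: the IQ standard deviation is taken to be the conventional 15, and the standard error of r uses the common large-sample approximation (1 - r^2) / sqrt(N - 1).

```python
from math import sqrt

n = 100
r = 0.12345
iq_sd = 15.0                              # assumed conventional IQ standard deviation

se_r = (1 - r**2) / sqrt(n - 1)           # large-sample approximation (assumed)
se_mean = iq_sd / sqrt(n)                 # standard error of the IQ mean

# The trailing "345" digits, expressed as a share of each standard error.
share_r = 0.00345 / se_r
share_mean = 0.0345 / se_mean

print(f"SE of r    = {se_r:.3f}, trailing digits = {share_r:.1%} of it")
print(f"SE of mean = {se_mean:.2f}, trailing digits = {share_mean:.1%} of it")
```

Both shares land in the 2% to 3% range quoted in the text, which is why those digits carry no usable information.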
They are actually worse than useless because the clutter
they create, particularly in tables, serves to distract the
eye and mind from the necessary comparisons among the
meaningful leading digits. Less is indeed more here.

Simple Is Better

I’ve also learned that simple is better, which is a kind of
loose generalization of less is more. The simple-is-better
idea is widely applicable to the representation, analysis,
and reporting of data.

This invited address was presented to the Division of Evaluation,
Measurement, and Statistics (Division 5) at the 98th Annual Convention of
the American Psychological Association in Boston, August 13, 1990.
I am grateful for the comments on a draft provided by Patricia
Cohen, Judith Rabkin, Raymond Katzell, and Donald F. Klein.
Correspondence concerning this article should be addressed to Jacob
Cohen, Department of Psychology, New York University, 6 Washington
Pl., 5th Floor, New York, NY 10003.

If, as the old cliché has it, a picture is worth a
thousand words, in describing a distribution, a frequency
polygon or, better still, a Tukey (1977, pp. 1-26) stem
and leaf diagram is usually worth more than the first four
moments, that is, the mean, standard deviation, skewness,
and kurtosis. I do not question that the moments
efficiently summarize the distribution or that they are useful
in some analytic contexts. Statistics packages eagerly give
them to us and we dutifully publish them, but they do
not usually make it possible for most of us or most of the
consumers of our products to see the distribution. They
don’t tell us, for example, that there are no cases between
scores of 72 and 90, or that this score of 24 is somewhere
in left field, or that there is a pileup of scores of 9. These
are the kinds of features of our data that we surely need
to know about, and they become immediately evident
with simple graphic representation.

Graphic display is even more important in the case
of bivariate data. Underlying each product-moment
correlation coefficient in an acre of such coefficients there
lies a simple scatter diagram that the r presumes to
summarize, and well it might. That is, it does so if the joint
distribution is more-or-less bivariate normal, which
means, among other things, that the relationship must be
linear and that there are no wild outlying points. We know
that least squares measures, like means and standard
deviations, are sensitive to outliers. Well, Pearson
correlations are even more so. About 15 years ago, Wainer and
Thissen (1976) published a data set made up of the heights
in inches and weights in pounds of 25 subjects, for which
the r was a perfectly reasonable .83. But if an error in
transcription were made so that the height and weight
values for one of the 25 subjects were switched, the r
would become -.26, a rather large and costly error!

There is hardly any excuse for gaps, outliers,
curvilinearity, or other pathology to exist in our data
unbeknownst to us. The same computer statistics package
with which we can do very complicated analyses like
quasi-Newtonian nonlinear estimation and
multidimensional scaling with Guttman’s coefficient of alienation
also can give us simple scatter plots and stem and leaf
diagrams with which we can see our data. A proper
multiple regression/correlation analysis does not begin with
a matrix of correlation coefficients, means, and standard
deviations, but rather with a set of stem and leaf diagrams
and scatter plots. We sometimes learn more from what
we see than from what we compute; sometimes what we
learn from what we see is that we shouldn’t compute, at
least not on those data as they stand.

Computers are a blessing, but another of the things
I have learned is that they are not an unmixed blessing.
Forty years ago, before computers (B.C., that is), for my
doctoral dissertation, I did three factor analyses on the
11 subtests of the Wechsler-Bellevue, with samples of 100
cases each of psychoneurotic, schizophrenic, and
brain-damaged patients. Working with a pad and pencil,
10-to-the-inch graph paper, a table of products of two-digit
numbers, and a Friden electromechanical desk calculator
that did square roots “automatically,” the whole process
took the better part of a year. Nowadays, on a desktop
computer, the job is done virtually in microseconds (or
at least lickety-split). But another important difference
between then and now is that the sheer laboriousness of
the task assured that throughout the entire process I was
in intimate contact with the data and their analysis. There
was no chance that there were funny things about my
data or intermediate results that I didn’t know about,
things that could vitiate my conclusions.

I know that I sound my age, but don’t get me
wrong: I love computers and revel in the ease with which
data analysis is accomplished with a good interactive
statistics package like SYSTAT and SYGRAPH (Wilkinson,
1990). I am, however, appalled by the fact that some
publishers of statistics packages successfully hawk their wares
with the pitch that it isn’t necessary to understand
statistics to use them. But the same package that makes it
possible for an ignoramus to do a factor analysis with a
pull-down menu and the click of a mouse also can greatly
facilitate with awesome speed and efficiency the
performance of simple and informative analyses.

A prime example of the simple-is-better principle is
found in the compositing of values. We are taught and
teach our students that for purposes of predicting a
criterion from a set of predictor variables, assuming for
simplicity (and as the mathematicians say, “with no loss of
generality”), that all variables are standardized, we achieve
maximum linear prediction by doing a multiple regression
analysis and forming a composite by weighting the
predictor z scores by their betas. It can be shown as a
mathematical necessity that with these betas as weights, the
resulting composite generates a higher correlation with
the criterion in the sample at hand than does a linear
composite formed using any other weights.

Yet as a practical matter, most of the time, we are
better off using unit weights: +1 for positively related
predictors, -1 for negatively related predictors, and 0,
that is, throw away poorly related predictors (Dawes,
1979; Wainer, 1976). The catch is that the betas come
with guarantees to be better than the unit weights only
for the sample on which they were determined. (It’s
almost like a TV set being guaranteed to work only in the
store.) But the investigator is not interested in making
predictions for that sample; he or she knows the criterion
values for those cases. The idea is to combine the
predictors for maximal prediction for future samples. The
reason the betas are not likely to be optimal for future
samples is that they are likely to have large standard errors.
For the typical 100 or 200 cases and 5 or 10 correlated
predictors, the unit weights will work as well or better.

Let me offer a concrete illustration to help make the
point clear. A running example in our regression text
(Cohen & Cohen, 1983) has for a sample of college faculty
their salary estimated from four independent variables:
years since PhD, sex (coded in the modern manner, 1
for female and 0 for male), number of publications, and
number of citations. The sample multiple correlation
computes to .70. What we want to estimate is the
correlation we would get if we used the sample beta weights
in the population, the cross-validated multiple correlation,
which unfortunately shrinks to a value smaller than the
shrunken multiple correlation. For N = 100 cases, using
Rozeboom’s (1978) formula, that comes to .67. Not bad.
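
The article does not print the formula, so the sketch below rests on an assumption: it uses the form usually attributed to Rozeboom (1978), rho_c^2 = 1 - [(N + k) / (N - k)](1 - R^2), where k is the number of predictors. That form does reproduce the .67 quoted here for N = 100 and the .63 quoted later for N = 50, which is some reassurance, but the function name and layout are illustrative only.

```python
from math import sqrt


def cross_validated_r(r_sample, n_cases, n_predictors):
    """Estimated cross-validated multiple correlation.

    Assumed form (attributed to Rozeboom, 1978):
        rho_c^2 = 1 - ((N + k) / (N - k)) * (1 - R^2)
    """
    ratio = (n_cases + n_predictors) / (n_cases - n_predictors)
    rho_sq = 1 - ratio * (1 - r_sample**2)
    return sqrt(max(rho_sq, 0.0))          # guard against a negative estimate


# Sample R = .70 with four predictors, at several sample sizes.
estimates = {n: cross_validated_r(0.70, n, 4) for n in (50, 100, 500)}
for n, value in estimates.items():
    print(f"N = {n:3d}: estimated cross-validated R = {value:.3f}")
```

The estimate climbs with sample size because the ratio (N + k) / (N - k) approaches 1, but it can never exceed the sample R of .70 that it starts from.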
But using unit weights, we do better: .69. With 300 or
400 cases, the increased sampling stability pushes up the
cross-validated correlation, but it remains slightly smaller
than the .69 value for unit weights. Increasing sample
size to 500 or 600 will increase the cross-validated
correlation in this example to the point at which it is larger
than the unit-weighted .69, but only trivially, by a couple
of points in the third decimal! When sample size is only
50, the cross-validated multiple correlation is only .63,
whereas the unit-weighted correlation remains at .69. The
sample size doesn’t affect the unit-weighted correlation
because we don’t estimate unstable regression coefficients.
It is, of course, subject to sampling error, but so is the
cross-validated multiple correlation.

Now, unit weights will not always be as good or better
than beta weights. For some relatively rare patterns of
correlation (suppression is one), or when the betas vary
greatly relative to their mean, or when the ratio of sample
size to the number of predictors is as much as 30 to 1
and the multiple correlation is as large as .75, the beta
weights may be better, but even in these rare
circumstances, probably not much better.

Furthermore, the unit weights work well outside the
context of multiple regression where we have criterion
data, that is, in a situation in which we wish to measure
some concept by combining indicators, or some abstract
factor generated in a factor analysis. Unit weights on
standardized scores are likely to be better for our purposes
than the factor scores generated by the computer program,
which are, after all, the fruits of a regression analysis for
that sample of the variables on the factor as criterion.

Consider that when we go to predict freshman grade
point average from a 30-item test, we don’t do a regression
analysis to get the “optimal” weights with which to
combine the item scores; we just add them up, like Galton
did. Simple is better.

We are, however, not applying the simple-is-better
principle when we “simplify” a multivalued graduated
variable (like IQ, or number of children, or symptom
severity) by cutting it somewhere along its span and
making it into a dichotomy. This is sometimes done with a
profession of modesty about the quality or accuracy of
the variable, or to “simplify” the analysis. This is not an
application, but rather a perversion of simple is better,
because this practice is one of willful discarding of
information. It has been shown that when you so mutilate a
variable, you typically reduce its squared correlation with
other variables by about 36% (Cohen, 1983). Don’t do
it. This kind of simplification is of a piece with the practice
of “simplifying” a factorial design ANOVA by reducing
all cell sizes to the size of the smallest by dropping cases.
They are both ways of throwing away the most precious
commodity we deal with: information.

Rather more generally, I think I have begun to learn
how to use statistics in the social sciences.

The atmosphere that characterizes statistics as
applied in the social and biomedical sciences is that of a
secular religion (Salsburg, 1985), apparently of
Judeo-Christian derivation, as it employs as its most powerful
icon a six-pointed cross, often presented multiply for
enhanced authority. I confess that I am an agnostic.

The Fisherian Legacy

When I began studying statistical inference, I was met
with a surprise shared by many neophytes. I found that
if, for example, I wanted to see whether poor kids
estimated the size of coins to be bigger than did rich kids,
after I gathered the data, I couldn’t test this research
hypothesis, but rather the null hypothesis that poor kids
perceived coins to be the same size as did rich kids. This
seemed kind of strange and backward to me, but I was
rather quickly acculturated (or, if you like, converted, or
perhaps brainwashed) to the Fisherian faith that science
proceeds only through inductive inference and that
inductive inference is achieved chiefly by rejecting null
hypotheses, usually at the .05 level. (It wasn’t until much
later that I learned that the philosopher of science, Karl
Popper, 1959, advocated the formulation of falsifiable
research hypotheses and designing research that could
falsify them.)

The fact that Fisher’s ideas quickly became the basis
for statistical inference in the behavioral sciences is not
surprising; they were very attractive. They offered a
deterministic scheme, mechanical and objective,
independent of content, and led to clear-cut yes-no decisions.
For years, nurtured on the psychological statistics
textbooks of the 1940s and 1950s, I never dreamed that they
were the source of bitter controversies (Gigerenzer &
Murray, 1987).

Take, for example, the yes-no decision feature. It
was quite appropriate to agronomy, which was where
Fisher came from. The outcome of an experiment can
quite properly be the decision to use this rather than that
amount of manure or to plant this or that variety of wheat.
But we do not deal in manure, at least not knowingly.
Similarly, in other technologies (for example,
engineering quality control or education) research is frequently
designed to produce decisions. However, things are not
quite so clearly decision-oriented in the development of
scientific theories.

Next, consider the sanctified (and sanctifying) magic
.05 level. This basis for decision has played a remarkable
role in the social sciences and in the lives of social
scientists. In governing decisions about the status of null
hypotheses, it came to determine decisions about the
acceptance of doctoral dissertations and the granting of
research funding, and about publication, promotion, and
whether to have a baby just now. Its arbitrary
unreasonable tyranny has led to data fudging of varying degrees
of subtlety from grossly altering data to dropping cases
where there “must have been” errors.

The Null Hypothesis Tests Us

We cannot charge R. A. Fisher with all of the sins of the
last half century that have been committed in his name
(or more often anonymously but as part of his legacy),
but they deserve cataloging (Gigerenzer & Murray, 1987;
Oakes, 1986). Over the years, I have learned not to make
errors of the following kinds:

When a Fisherian null hypothesis is rejected with
an associated probability of, for example, .026, it is not
the case that the probability that the null hypothesis is
true is .026 (or less than .05, or any other value we can
specify). Given our framework of probability as long-run
relative frequency, as much as we might wish it to be
otherwise, this result does not tell us about the truth of
the null hypothesis, given the data. (For this we have to
go to Bayesian or likelihood statistics, in which probability
is not relative frequency but degree of belief.) What it
tells us is the probability of the data, given the truth of
the null hypothesis, which is not the same thing, as much
as it may sound like it.

If the p value with which we reject the Fisherian null
hypothesis does not tell us the probability that the null
hypothesis is true, it certainly cannot tell us anything
about the probability that the research or alternate
hypothesis is true. In fact, there is no alternate hypothesis
in Fisher’s scheme: Indeed, he violently opposed its
inclusion by Neyman and Pearson.

Despite widespread misconceptions to the contrary,
the rejection of a given null hypothesis gives us no basis
for estimating the probability that a replication of the
research will again result in rejecting that null hypothesis.

Of course, everyone knows that failure to reject the
Fisherian null hypothesis does not warrant the conclusion
that it is true. Fisher certainly knew and emphasized it,
and our textbooks duly so instruct us. Yet how often do
we read in the discussion and conclusions of articles now
appearing in our most prestigious journals that “there is
no difference” or “no relationship”? (This is 40 years
after my N = 20 friend used a nonsignificant result to
demolish psychoanalytic theory.)

The other side of this coin is the interpretation that
accompanies results that surmount the .05 barrier and
achieve the state of grace of “statistical significance.”
“Everyone” knows that all this means is that the effect is
not nil, and nothing more. Yet how often do we see such
a result to be taken to mean, at least implicitly, that the
effect is significant, that is, important, large. If a result is
highly significant, say p < .001, the temptation to make
this misinterpretation becomes all but irresistible.

Let’s take a close look at this null hypothesis (the
fulcrum of the Fisherian scheme) that we so earnestly
seek to negate. A null hypothesis is any precise statement
about a state of affairs in a population, usually the value
of a parameter, frequently zero. It is called a “null”
hypothesis because the strategy is to nullify it or because it
means “nothing doing.” Thus, “The difference in the
mean scores of U.S. men and women on an Attitude
Toward the U.N. scale is zero” is a null hypothesis. “The
product-moment r between height and IQ in high school
students is zero” is another. “The proportion of men in
a population of adult dyslexics is .50” is yet another. Each
is a precise statement; for example, if the population r
between height and IQ is in fact .03, the null hypothesis
that it is zero is false. It is also false if the r is .01, .001,
or .000001!

A little thought reveals a fact widely understood
among statisticians: The null hypothesis, taken literally
(and that’s the only way you can take it in formal
hypothesis testing), is always false in the real world. It can
only be true in the bowels of a computer processor
running a Monte Carlo study (and even then a stray electron
may make it false). If it is false, even to a tiny degree, it
must be the case that a large enough sample will produce
a significant result and lead to its rejection. So if the null
hypothesis is always false, what’s the big deal about
rejecting it?

Another problem that bothered me was the
asymmetry of the Fisherian scheme: If your test exceeded a
critical value, you could conclude, subject to the alpha
risk, that your null was false, but if you fell short of that
critical value, you couldn’t conclude that the null was
true. In fact, all you could conclude is that you couldn’t
conclude that the null was false. In other words, you could
hardly conclude anything.

And yet another problem I had was that if the null
were false, it had to be false to some degree. It had to
make a difference whether the population mean difference
was 5 or 50, or whether the population correlation was
.10 or .30, and this was not taken into account in the
prevailing method. I had stumbled onto something that
I learned after a while was one of the bases of the
Neyman-Pearson critique of Fisher’s system of statistical induction.

In 1928 (when I was in kindergarten), Jerzy Neyman
and Karl Pearson’s boy Egon began publishing papers
that offered a rather different perspective on statistical
inference (Neyman & Pearson, 1928a, 1928b). Among
other things, they argued that rather than having a single
hypothesis that one either rejected or not, things could
be so organized that one could choose between two
hypotheses, one of which could be the null hypothesis and
the other an alternate hypothesis. One could attach to
the precisely defined null an alpha risk, and to the equally
precisely defined alternate hypothesis a beta risk. The
rejection of the null hypothesis when it was true was an
error of the first kind, controlled by the alpha criterion,
but the failure to reject it when the alternate hypothesis
was true was also an error, an error of the second kind,
which could be controlled to occur at a rate beta. Thus,
given the magnitude of the difference between the null
and the alternate (that is, given the hypothetical
population effect size), and setting values for alpha and beta,
one could determine the sample size necessary to meet
these conditions. Or, with the effect size, alpha, and the
sample size set, one could determine the beta, or its
complement, the probability of rejecting the null hypothesis,
the power of the test.

Now, R. A. Fisher was undoubtedly the greatest
statistician of this century, rightly called “the father of
modern statistics,” but he had a blind spot. Moreover, he was
a stubborn and frequently vicious intellectual opponent.
A feud with Karl Pearson had kept Fisher’s papers out of
Biometrika, which Karl Pearson edited. After old-man
Pearson retired, efforts by Egon Pearson and Neyman to
avoid battling with Fisher were to no avail. Fisher wrote
that they were like Russians who thought that “pure
science” should be “geared to technological performance”
as “in a five-year plan.” He once led off the discussion
on a paper by Neyman at the Royal Statistical Society
by saying that Neyman should have chosen a topic “on
which he could speak with authority” (Gigerenzer &
Murray, 1987, p. 17). Fisher fiercely condemned the
Neyman-Pearson heresy.

I was of course aware of none of this. The statistics
texts on which I was raised and their later editions to
which I repeatedly turned in the 1950s and 1960s
presented null hypothesis testing a la Fisher as a done deal,
as the way to do statistical inference. The ideas of Neyman
and Pearson were barely or not at all mentioned, or
dismissed as too complicated.

When I finally stumbled onto power analysis, and
managed to overcome the handicap of a background with
no working math beyond high school algebra (to say
nothing of mathematical statistics), it was as if I had died
and gone to heaven. After I learned what noncentral
distributions were and figured out that it was important to
decompose noncentrality parameters into their
constituents of effect size and sample size, I realized that I had
a framework for hypothesis testing that had four
parameters: the alpha significance criterion, the sample size, the
population effect size, and the power of the test. For any
statistical test, any one of these was a function of the
other three. This meant, for example, that for a
significance test of a product-moment correlation, using a
two-sided .05 alpha criterion and a sample size of 50 cases,
if the population correlation is .30, my long-run
probability of rejecting the null hypothesis and finding the
sample correlation to be significant was .57, a coin flip. As
another example, for the same α = .05 and population r =
.30, if I want to have .80 power, I could determine that I
needed a sample size of 85.

Playing with this new toy (and with a small grant
from the National Institute of Mental Health) I did what
came to be called a meta-analysis of the articles in the
1960 volume of the Journal of Abnormal and Social
Psychology (Cohen, 1962). I found, among other things, that
using the nondirectional .05 criterion, the median power
to detect a medium effect was .46, a rather abysmal
result. Of course, investigators could not have known how
underpowered their research was, as their training had
not prepared them to know anything about power, let
alone how to use it in research planning. One might think
that after 1969, when I published my power handbook
that made power analysis as easy as falling off a log, the
concepts and methods of power analysis would be taken
to the hearts of null hypothesis testers. So one might think.
(Stay tuned.)

Among the less obvious benefits of power analysis
was that it made it possible to “prove” null hypotheses.
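
The four-parameter framework described above, in which any one of alpha, sample size, population effect size, and power follows from the other three, can be sketched for the test of a correlation, which is also the case behind the sample sizes for "proving" a null discussed next. The sketch assumes the Fisher z approximation rather than the tables in the power handbook, so power values may differ from the published ones by about a point; the function names are illustrative.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def approx_power_r(rho, n, alpha=0.05):
    """Approximate two-sided power for testing rho = 0 (Fisher z, assumed)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    delta = atanh(rho) * sqrt(n - 3)       # effect size and sample size combined
    return (1 - Z.cdf(z_crit - delta)) + Z.cdf(-z_crit - delta)


def approx_n_for_power(rho, power, alpha=0.05):
    """Approximate sample size needed to reach the requested power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(((z_crit + z_power) / atanh(rho)) ** 2 + 3)


power_r30_n50 = approx_power_r(0.30, 50)          # text reports .57
n_r30_power80 = approx_n_for_power(0.30, 0.80)    # text reports 85
n_r10_power95 = approx_n_for_power(0.10, 0.95)    # text: "almost 1,300"
n_r10_power80 = approx_n_for_power(0.10, 0.80)    # text: "almost 800"

print(f"Power for r = .30, N = 50: {power_r30_n50:.2f}")
print(f"N for r = .30, power .80:  {n_r30_power80}")
print(f"N for r = .10, power .95:  {n_r10_power95}")
print(f"N for r = .10, power .80:  {n_r10_power80}")
```

Because the required N grows with the inverse square of the effect size, shrinking the target correlation from .30 to .10 multiplies the sample size by roughly nine.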
Of course, as I’ve already noted, everyone knows that one
can’t actually prove null hypotheses. But when an
investigator means to prove a null hypothesis, the point is not
to demonstrate that the population effect size is, say, zero
to a million or more decimal places, but rather to show
that it is of no more than negligible or trivial size (Cohen,
1988, pp. 16-17). Then, from a power analysis at, say,
α = .05, with power set at, say, .95, so that β = .05, also,
the sample size necessary to detect this negligible effect
with .95 probability can be determined. Now if the
research is carried out using that sample size, and the result
is not significant, as there had been a .95 chance of
detecting this negligible effect, and the effect was not
detected, the conclusion is justified that no nontrivial effect
exists, at the β = .05 level. This does, in fact,
probabilistically prove the intended null hypothesis of no more
than a trivially small effect. The reasoning is impeccable,
but when you go to apply it, you discover that it takes
enormous sample sizes to do so. For example, if we adopt
the above parameters for a significance test of a correlation
coefficient and r = .10 is taken as a negligible effect size,
it requires a sample of almost 1,300 cases. More modest
but still reasonable demands for power of course require
smaller sample sizes, but not sufficiently smaller to matter
for most investigators; even .80 power to detect a
population correlation of .10 requires almost 800 cases. So
it generally takes an impractically large sample size to
prove the null hypothesis as I’ve redefined it; however,
the procedure makes clear what it takes to say or imply
from the failure to reject the null hypothesis that there is
no nontrivial effect.

A salutary effect of power analysis is that it draws
one forcibly to consider the magnitude of effects. In psychology,
and especially in soft psychology, under the sway
of the Fisherian scheme, there has been little consciousness
of how big things are. The very popular ANOVA
designs yield F ratios, and it is these whose size is of
concern. First off is the question of whether they made
the sanctifying .05 cut-off and are thus significant, and
then how far they fell below this cut-off: Were they perhaps
highly significant (p less than .01) or very highly significant
(less than .001)? Because science is inevitably about magnitudes,
it is not surprising how frequently p values are
treated as surrogates for effect sizes.

One of the things that drew me early to correlation
analysis was that it yielded an r, a measure of effect size,
which was then translated into a t or F and assessed for
significance, whereas the analysis of variance or covariance
yielded only an F and told me nothing about effect
size. As many of the variables with which we worked
were expressed in arbitrary units (points on a scale, trials
to learn a maze), and the Fisherian scheme seemed quite
complete by itself and made no demands on us to think
about effect sizes, we simply had no language with which
to address them.

In retrospect, it seems to me simultaneously quite
understandable yet also ridiculous to try to develop theories
about human behavior with p values from Fisherian
hypothesis testing and no more than a primitive sense of
effect size. And I wish I were talking about the long, long
ago. In 1986, there appeared in the New York Times a UPI
dispatch under the headline “Children’s Height
Linked to Test Scores.” The article described a study that
involved nearly 14,000 children 6 to 17 years of age that
reported a definite link between height (age- and sex-adjusted)
and scores on tests of both intelligence and
achievement. The relationship was described as significant,
and persisting, even after controlling for other factors,
including socioeconomic status, birth order, family
size, and physical maturity. The authors noted that the
effect was small, but significant, and that it didn’t warrant
giving children growth hormone to make them taller and
thus brighter. They speculated that the effect might be
due to treating shorter children as less mature, but that
there were alternative biological explanations.

Now this was a newspaper story, the fruit of the ever-inquiring
mind of a science reporter, not a journal article,
so perhaps it is understandable that there was no effort
to deal with the actual size of this small effect. But it got
me to wondering about how small this significant relationship
might be. Well, if we take significant to mean p
< .001 (in the interest of scientific tough-mindedness), it
turns out that a correlation of .0278 is significant for
14,000 cases. But I’ve found that when dealing with variables
expressed in units whose magnitude we understand,
the effect size in linear relationships is better comprehended
with regression than with correlation coefficients.
So, accepting the authors’ implicit causal model, it works
out that raising a child’s IQ from 100 to 130 would require
giving the child enough growth hormone to increase his
or her height by 14 ft (more or less). If the causality goes
the other way, and one wanted to create basketball players,
a 4-in. increase in height would require raising the IQ
about 900 points. Well, they said it was a small effect.
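
The arithmetic behind these figures can be sketched as follows. The article does not state the standard deviations it used, so the two constants below are assumptions: an IQ standard deviation of 15 points and a height standard deviation of about 2.33 in., the latter chosen only because it roughly reproduces the quoted results. The function names are hypothetical, and the critical correlation uses a normal approximation to t, which is reasonable for 14,000 cases.

```python
from math import sqrt
from statistics import NormalDist

SD_IQ = 15.0         # assumption: conventional IQ standard deviation
SD_HEIGHT_IN = 2.33  # assumption: height SD in inches, not given in the text


def critical_r(n, p_two_sided):
    """Smallest |r| reaching two-sided significance (normal approx. to t)."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return z / sqrt(z * z + (n - 2))


def height_gain_for_iq_gain(r, iq_gain):
    """Inches of height 'needed' for a given IQ gain, regressing IQ on height."""
    iq_points_per_inch = r * SD_IQ / SD_HEIGHT_IN
    return iq_gain / iq_points_per_inch


def iq_gain_for_height_gain(r, height_gain_in):
    """IQ points 'needed' for a given height gain, regressing height on IQ."""
    inches_per_iq_point = r * SD_HEIGHT_IN / SD_IQ
    return height_gain_in / inches_per_iq_point


if __name__ == "__main__":
    print(round(critical_r(14_000, 0.001), 4))
    # .0278 is the just-significant value; .11 is the correlation the
    # journal article is later said to have reported.
    for r in (0.0278, 0.11):
        feet = height_gain_for_iq_gain(r, 30) / 12
        points = iq_gain_for_height_gain(r, 4)
        print(r, round(feet, 1), round(points))
```

With these assumed standard deviations, a correlation of .0278 implies roughly 14 ft of added height for 30 IQ points and on the order of 900 IQ points for 4 in. of height. Different assumed standard deviations would shift the numbers but not the lesson that a significant correlation can correspond to a tiny regression effect.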
(When I later checked the journal article that described
this research, it turned out that the correlation was much
larger than .0278. It was actually about .11, so that for a
30-point increase in IQ it would take only enough growth
hormone to produce a 3.5-ft increase in height, or with
the causality reversed, a 4-in. increase in height would
require an increase of only 233 IQ points.)

I am happy to say that the long neglect of attention
to effect size seems to be coming to a close. The clumsy
and fundamentally invalid box-score method of literature
review based on p values is being replaced by effect-size-based
meta-analysis as formulated by Gene Glass (1977).
The effect size measure most often used is the standardized
mean difference d of power analysis. Several book-length
treatments of meta-analysis have been published,
and applications to various fields of psychology are appearing
in substantial numbers in the Psychological Bulletin
and other prestigious publications. In the typical
meta-analysis, the research literature on some issue is
surveyed and the effect sizes that were found in the relevant
studies are gathered. Note that the observational
unit is the study. These data do not only provide an estimate
of the level and variability of the effect size in a
domain based on multiple studies and therefore on many
observations, but by relating effect size to various substantive
and methodological characteristics over the studies,
much can be learned about the issue under investigation
and how best to investigate it. One hopes that this
ferment may persuade researchers to explicitly report effect
sizes and thus reduce the burden on meta-analysts
and others of having to make assumptions to dig them
out of their inadequately reported research results. In a
field as scattered (not to say anarchic) as ours, meta-analysis
constitutes a welcome force toward the cumulation
of knowledge. Meta-analysis makes me very happy.

Despite my career-long identification with statistical
inference, I believe, together with such luminaries as
Meehl (1978), Tukey (1977), and Gigerenzer (Gigerenzer
& Murray, 1987), that hypothesis testing has been greatly
overemphasized in psychology and in the other disciplines
that use it. It has diverted our attention from crucial issues.
Mesmerized by a single all-purpose, mechanized,
“objective” ritual in which we convert numbers into other
numbers and get a yes-no answer, we have come to neglect
close scrutiny of where the numbers came from. Recall
that in his delightful parable about averaging the numbers
on football jerseys, Lord (1953) pointed out that “the
numbers don’t know where they came from.” But surely
we must know where they came from and should be far
more concerned with why and what and how well we are
measuring, manipulating conditions, and selecting our
samples.

We have also lost sight of the fact that the error variance
in our observations should challenge us to efforts to
reduce it and not simply to thoughtlessly tuck it into the
denominator of an F or t test.

How To Use Statistics

So, how would I use statistics in psychological research?
First of all, descriptively. John Tukey’s (1977) Exploratory
Data Analysis is an inspiring account of how to effect
graphic and numerical analyses of the data at hand so as
to understand them. The techniques, although subtle in
conception, are simple in application, requiring no more
than pencil and paper (Tukey says if you have a hand-held
calculator, fine). Although he recognizes the importance
of what he calls confirmation (statistical inference),
he manages to fill 700 pages with techniques of “mere”
description, pointing out in the preface that the emphasis
on inference in modern statistics has resulted in a loss of
flexibility in data analysis.

Then, in planning research, I think it wise to plan
the research. This means making tentative informed
judgments about, among many other things, the size of
the population effect or effects you’re chasing, the level
of alpha risk you want to take (conveniently, but not necessarily
.05), and the power you want (usually some relatively
large value like .80). These specified, it is a simple
matter to determine the sample size you need. It is then
a good idea to rethink your specifications. If, as is often
the case, this sample size is beyond your resources, consider
the possibility of reducing your power demand or,
perhaps, the effect size, or even (heaven help us) increasing
your alpha level. Or, the required sample may be smaller
than you can comfortably manage, which also should lead
you to rethink and possibly revise your original
specifications. This process ends when you have a credible
and viable set of specifications, or when you discover that
no practicable set is possible and the research as originally
conceived must be abandoned. Although you would
hardly expect it from reading the current literature, failure
to subject your research plans to power analysis is simply
irrational.

Next, I have learned and taught that the primary
product of a research inquiry is one or more measures
of effect size, not p values (Cohen, 1965). Effect-size measures
include mean differences (raw or standardized),
correlations and squared correlation of all kinds, odds
ratios, kappas (whatever conveys the magnitude of the
phenomenon of interest appropriate to the research context).
If, for example, you are comparing groups on a variable
measured in units that are well understood by your
readers (IQ points, or dollars, or number of children, or
months of survival), mean differences are excellent measures
of effect size. When this isn’t the case, and it isn’t
the case more often than it is, the results can be translated
into standardized mean differences (d values) or some
measure of correlation or association (Cohen, 1988). (Not
that we understand as well as we should the meaning of
a given level of correlation [Oakes, 1986, pp. 88-92]. It
has been shown that psychologists typically overestimate
how much relationship a given correlation represents,
thinking of a correlation of .50 not as its square of .25
that its proportion of variance represents, but more like
its cube root of about .80, which represents only wishful
thinking! But that’s another story.)

Then, having found the sample effect size, you can
attach a p value to it, but it is far more informative to
provide a confidence interval. As you know, a confidence
interval gives the range of values of the effect-size index
that includes the population value with a given probability.
It tells you incidentally whether the effect is significant,
but much more: it provides an estimate of the
range of values it might have, surely a useful piece of
knowledge in a science that presumes to be quantitative.
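
As one illustration of such an interval, the sketch below computes an approximate confidence interval for a correlation with the Fisher z transformation. The sample values (r = .30 from 50 cases) are hypothetical, the function name `correlation_ci` is an assumption, and other effect-size indices would need their own interval formulas.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation (Fisher z)."""
    z = atanh(r)                 # transform r to the z scale
    se = 1 / sqrt(n - 3)         # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    # Build the interval on the z scale, then transform back to r.
    return tanh(z - crit * se), tanh(z + crit * se)


if __name__ == "__main__":
    for level in (0.95, 0.80):
        low, high = correlation_ci(0.30, 50, level)
        print(f"{level:.0%} interval: {low:.2f} to {high:.2f}")
```

An interval that excludes zero corresponds to a significant two-sided test at the matching alpha, which is the sense in which the interval reports significance incidentally. Lowering the confidence level narrows the interval.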
(By the way, I don’t think that we should routinely use
95% intervals: Our interests are often better served by
more tolerant 80% intervals.)

Remember that throughout the process in which you
conceive, plan, execute, and write up a research, it is on
your informed judgment as a scientist that you must rely,
and this holds as much for the statistical aspects of the
work as it does for all the others. This means that your
informed judgment governs the setting of the parameters
involved in the planning (alpha, beta, population effect
size, sample size, confidence interval), and that informed
judgment also governs the conclusions you will draw.

In his brilliant analysis of what he called the “inference
revolution” in psychology, Gerd Gigerenzer showed
how and why no single royal road of drawing conclusions
from data is possible, and particularly not one that does
not strongly depend on the substantive issues concerned,
that is, on everything that went into the research besides
the number crunching. An essential ingredient in the research
process is the judgment of the scientist. He or she
must decide by how much a theoretical proposition has
been advanced by the data, just as he or she decided what
to study, what data to get, and how to get it. I believe that
statistical inference applied with informed judgment is a
useful tool in this process, but it isn’t the most important
tool: It is not as important as everything that came before
it. Some scientists, physicists for example, manage without
the statistics, although to be sure not without the informed
judgment. Indeed, some pretty good psychologists have
managed without statistical inference: There come to
mind Wundt, Kohler, Piaget, Lewin, Bartlett, Stevens,
and if you’ll permit me, Freud, among others. Indeed,
Skinner (1957) thought of dedicating his book Verbal Behavior
(and I quote) “to the statisticians and scientific
methodologists with whose help this book would never
have been completed” (p. 111). I submit that the proper
application of statistics by sensible statistical methodologists
(Tukey, for example) would not have hurt Skinner’s
work. It might even have done it some good.

The implications of the things I have learned (so far)
are not consonant with much of what I see about me as
standard statistical practice. The prevailing yes-no decision
at the magic .05 level from a single research is a
far cry from the use of informed judgment. Science simply
doesn’t work that way. A successful piece of research
doesn’t conclusively settle an issue, it just makes some
theoretical proposition to some degree more likely. Only
successful future replication in the same and different
settings (as might be found through meta-analysis) provides
an approach to settling the issue. How much more
likely this single research makes the proposition depends
on many things, but not on whether p is equal to or greater
than .05: .05 is not a cliff but a convenient reference point
along the possibility-probability continuum. There is no
ontological basis for dichotomous decision making in
psychological inquiry. The point was neatly made by
Rosnow and Rosenthal (1989) last year in the American
Psychologist. They wrote “surely, God loves the .06 nearly
as much as the .05” (p. 1277). To which I say amen!

Finally, I have learned, but not easily, that things
take time. As I’ve already mentioned, almost three decades
ago, I published a power survey of the articles in
the 1960 volume of the Journal of Abnormal and Social
Psychology (Cohen, 1962) in which I found that the median
power to detect a medium effect size under representative
conditions was only .46. The first edition of my
power handbook came out in 1969. Since then, more
than two dozen power and effect-size surveys have been
published in psychology and related fields (Cohen, 1988,
pp. xi-xii). There have also been a slew of articles on
power-analytic methodology. Statistics textbooks, even
some undergraduate ones, give some space to power analysis,
and several computer programs for power analysis
are available (e.g., Borenstein & Cohen, 1988). They tell
me that some major funding entities require that their
grant applications contain power analyses, and that in
one of those agencies my power book can be found in
every office.

The problem is that, as practiced, current research
hardly reflects much attention to power. How often have
you seen any mention of power in the journals you read,
let alone an actual power analysis in the methods sections
of the articles? Last year in Psychological Bulletin, Sedlmeier
and Gigerenzer (1989) published an article entitled
“Do Studies of Statistical Power Have an Effect on the
Power of Studies?” The answer was no. Using the same
methods I had used on the articles in the 1960 Journal
of Abnormal and Social Psychology (Cohen, 1962), they
performed a power analysis on the 1984 Journal of Abnormal
Psychology and found that the median power under
the same conditions was .44, a little worse than the
.46 I had found 24 years earlier. It was worse still (.37)
when they took into account the occasional use of an
experimentwise alpha criterion. Even worse than that, in
some 11% of the studies, research hypotheses were framed
as null hypotheses and their nonsignificance interpreted
as confirmation. The median power of these studies to
detect a medium effect at the two-tailed .05 level was .25!
These are not isolated results: Rossi, Rossi, and Cottrill
(in press), using the same methods, did a power survey
of the 142 articles in the 1982 volumes of the Journal of
Personality and Social Psychology and the Journal of Abnormal
Psychology and found essentially the same results.

A less egregious example of the inertia of methodological
advance is set correlation, which is a highly flexible
realization of the multivariate general linear model.
I published it in an article in 1982, and we included it in
an appendix in the 1983 edition of our regression text
(Cohen, 1982; Cohen & Cohen, 1983). Set correlation
can be viewed as a generalization of multiple correlation
to the multivariate case, and with it you can study the
relationship between anything and anything else, controlling
for whatever you want in either the anything or
the anything else, or both. I think it’s a great method; at
least, my usually critical colleagues haven’t complained.
Yet, as far as I’m aware, it has hardly been used outside
the family. (The publication of a program as a SYSTAT
supplementary module [Cohen, 1989] may make a difference.)

But I do not despair. I remember that W. S. Gosset,
the fellow who worked in a brewery and appeared in print
modestly as “Student,” published the t test a decade before
we entered World War I, and the test didn’t get into the
psychological statistics textbooks until after World
War II. These things take time. So, if you publish something
that you think is really good, and a year or a decade or
two go by and hardly anyone seems to have taken notice,
remember the t test, and take heart.

REFERENCES

Borenstein, M., & Cohen, J. (1988). Statistical power analysis: A computer program. Hillsdale, NJ: Erlbaum.
Children’s height linked to test scores. (October 7, 1986). New York Times, p. C4.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology (pp. 95-121). New York: McGraw-Hill.
Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17, 301-341.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249-253.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1989). SETCOR: Set correlation analysis, a supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571-582.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Glass, G. V. (1977). Integrating findings: The meta-analysis of research. In L. Shulman (Ed.), Review of research in education (Vol. 5, pp. 351-379). Itasca, IL: Peacock.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Neyman, J., & Pearson, E. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.
Neyman, J., & Pearson, E. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A, 263-294.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Popper, K. (1959). The logic of scientific discovery. New York: Basic Books.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Rossi, J. S., Rossi, S. R., & Cottrill, S. D. (in press). Statistical power in research in social and abnormal psychology. Journal of Consulting and Clinical Psychology.
Rozeboom, W. W. (1978). Estimation of cross-validated multiple correlation: A clarification. Psychological Bulletin, 85, 1348-1351.
Salsburg, D. S. (1985). The religion of statistics as practiced in medical journals. The American Statistician, 39, 220-223.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Skinner, B. F. (1957). Verbal behavior. New York: Appleton-Century-Crofts.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83, 213-217.
Wainer, H., & Thissen, D. (1976). When jackknifing fails (or does it?). Psychometrika, 41, 9-34.
Wilkinson, L. (1990). SYSTAT: The system for statistics. Evanston, IL: SYSTAT.

1312 December 1990 American Psychologist