Statistical Thinking

What you'll learn to do: define basic elements of a statistical investigation

822px-Fisher_iris_versicolor_sepalwidth.svg Once a psychologist has performed an experiment or study and gathered her results, she needs to organize the information in a way so that she can draw conclusions from the results. What does the information mean? Does it support or reject the hypothesis? Is the data valid and reliable, and is the study replicable?

Psychologists use statistics to assist them in analyzing data, and also to give more precise measurements to describe whether something is statistically significant. Analyzing data using statistics enables researchers to find patterns, make claims, and share their results with others. In this section, you'll learn about some of the tools that psychologists use in statistical analysis.

Learning Objectives

  • Define reliability and validity
  • Describe the importance of distributional thinking and the role of p-values in statistical inference
  • Describe the role of random sampling and random assignment in drawing cause-and-effect conclusions
  • Describe replication and its importance to psychology

Interpreting Experimental Findings

Once data is collected from both the experimental and the control groups, a statistical analysis is conducted to find out if there are meaningful differences between the two groups. A statistical analysis determines how likely any difference found is due to chance (and thus not meaningful). In psychology, group differences are considered meaningful, or significant, if the odds that these differences occurred by chance alone are 5 percent or less. Stated another way, if we repeated this experiment 100 times, we would expect to find the same results at least 95 times out of 100.

The greatest strength of experiments is the ability to assert that any significant differences in the findings are caused by the independent variable. This occurs because random selection, random assignment, and a design that limits the effects of both experimenter bias and participant expectancy should create groups that are similar in composition and treatment. Therefore, any difference between the groups is attributable to the independent variable, and now we can finally make a causal statement. If we find that watching a violent television program results in more violent behavior than watching a nonviolent program, we can safely say that watching violent television programs causes an increase in the display of violent behavior.

Reporting Research

When psychologists complete a research project, they generally want to share their findings with other scientists. The American Psychological Association (APA) publishes a manual detailing how to write a paper for submission to scientific journals. Unlike an article that might be published in a magazine like Psychology Today, which targets a general audience with an interest in psychology, scientific journals generally publish peer-reviewed journal articles aimed at an audience of professionals and scholars who are actively involved in research themselves.

Link to Learning

The Online Writing Lab (OWL) at Purdue University can walk you through the APA writing guidelines.

A peer-reviewed journal article is read by several other scientists (generally anonymously) with expertise in the subject matter. These peer reviewers provide feedback—to both the author and the journal editor—regarding the quality of the draft. Peer reviewers look for a strong rationale for the research being described, a clear description of how the research was conducted, and evidence that the research was conducted in an ethical manner. They also look for flaws in the study's design, methods, and statistical analyses. They check that the conclusions drawn by the authors seem reasonable given the observations made during the research. Peer reviewers also comment on how valuable the research is in advancing the discipline’s knowledge. This helps prevent unnecessary duplication of research findings in the scientific literature and, to some extent, ensures that each research article provides new information. Ultimately, the journal editor will compile all of the peer reviewer feedback and determine whether the article will be published in its current state (a rare occurrence), published with revisions, or not accepted for publication.

Peer review provides some degree of quality control for psychological research. Poorly conceived or executed studies can be weeded out, and even well-designed research can be improved by the revisions suggested. Peer review also ensures that the research is described clearly enough to allow other scientists to replicate it, meaning they can repeat the experiment using different samples to determine reliability. Sometimes replications involve additional measures that expand on the original finding. In any case, each replication serves to provide more evidence to support the original research findings. Successful replications of published research make scientists more apt to adopt those findings, while repeated failures tend to cast doubt on the legitimacy of the original article and lead scientists to look elsewhere. For example, it would be a major advancement in the medical field if a published study indicated that taking a new drug helped individuals achieve a healthy weight without changing their diet. But if other scientists could not replicate the results, the original study’s claims would be questioned.

Dig Deeper: The Vaccine-Autism Myth and the Retraction of Published Studies

Some scientists have claimed that routine childhood vaccines cause some children to develop autism, and, in fact, several peer-reviewed publications published research making these claims. Since the initial reports, large-scale epidemiological research has suggested that vaccinations are not responsible for causing autism and that it is much safer to have your child vaccinated than not. Furthermore, several of the original studies making this claim have since been retracted.

A published piece of work can be rescinded when data is called into question because of falsification, fabrication, or serious research design problems. Once rescinded, the scientific community is informed that there are serious problems with the original publication. Retractions can be initiated by the researcher who led the study, by research collaborators, by the institution that employed the researcher, or by the editorial board of the journal in which the article was originally published. In the vaccine-autism case, the retraction was made because of a significant conflict of interest in which the leading researcher had a financial interest in establishing a link between childhood vaccines and autism (Offit, 2008). Unfortunately, the initial studies received so much media attention that many parents around the world became hesitant to have their children vaccinated (Figure 1). For more information about how the vaccine/autism story unfolded, as well as the repercussions of this story, take a look at Paul Offit’s book, Autism’s False Prophets: Bad Science, Risky Medicine, and the Search for a Cure.

A photograph shows a child being given an oral vaccine. Figure 1. Some people still think vaccinations cause autism. (credit: modification of work by UNICEF Sverige)

Reliability and Validity

Reliability and validity are two important considerations that must be made with any type of data collection. Reliability refers to the ability to consistently produce a given result. In the context of psychological research, this would mean that any instruments or tools used to collect data do so in consistent, reproducible ways. Unfortunately, being consistent in measurement does not necessarily mean that you have measured something correctly. To illustrate this concept, consider a kitchen scale that would be used to measure the weight of cereal that you eat in the morning. If the scale is not properly calibrated, it may consistently under- or overestimate the amount of cereal that’s being measured. While the scale is highly reliable in producing consistent results (e.g., the same amount of cereal poured onto the scale produces the same reading each time), those results are incorrect. This is where validity comes into play. Validity refers to the extent to which a given instrument or tool accurately measures what it’s supposed to measure. While any valid measure is by necessity reliable, the reverse is not necessarily true. Researchers strive to use instruments that are both highly reliable and valid.

Everyday Connection: How Valid Is the SAT?

Standardized tests like the SAT are supposed to measure an individual’s aptitude for a college education, but how reliable and valid are such tests? Research conducted by the College Board suggests that scores on the SAT have high predictive validity for first-year college students’ GPA (Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008). In this context, predictive validity refers to the test’s ability to effectively predict the GPA of college freshmen. Given that many institutions of higher education require the SAT for admission, this high degree of predictive validity might be comforting.

However, the emphasis placed on SAT scores in college admissions has generated some controversy on a number of fronts. For one, some researchers assert that the SAT is a biased test that places minority students at a disadvantage and unfairly reduces the likelihood of being admitted into a college (Santelices & Wilson, 2010). Additionally, some research has suggested that the predictive validity of the SAT is grossly exaggerated in how well it is able to predict the GPA of first-year college students. In fact, it has been suggested that the SAT’s predictive validity may be overestimated by as much as 150% (Rothstein, 2004). Many institutions of higher education are beginning to consider de-emphasizing the significance of SAT scores in making admission decisions (Rimer, 2008).

In 2014, College Board president David Coleman expressed his awareness of these problems, recognizing that college success is more accurately predicted by high school grades than by SAT scores. To address these concerns, he has called for significant changes to the SAT exam (Lewin, 2014).

Introduction to Statistical Thinking

Coffee cup with heart shaped cream inside. Figure 2. People around the world differ in their preferences for drinking coffee versus drinking tea. Would the results of the coffee study be the same in Canada as in China? [Image: Duncan,, CC BY-NC 2.0,]
Does drinking coffee actually increase your life expectancy? A recent study (Freedman, Park, Abnet, Hollenbeck, & Sinha, 2012) found that men who drank at least six cups of coffee a day had a 10% lower chance of dying (women 15% lower) than those who drank none. Does this mean you should pick up or increase your own coffee habit? Modern society has become awash in studies such as this; you can read about several such studies in the news every day.Conducting such a study well, and interpreting the results of such studies requires understanding basic ideas of statistics, the science of gaining insight from data. Key components to a statistical investigation are:

  • Planning the study: Start by asking a testable research question and deciding how to collect data. For example, how long was the study period of the coffee study? How many people were recruited for the study, how were they recruited, and from where? How old were they? What other variables were recorded about the individuals? Were changes made to the participants’ coffee habits during the course of the study?
  • Examining the data: What are appropriate ways to examine the data? What graphs are relevant, and what do they reveal? What descriptive statistics can be calculated to summarize relevant aspects of the data, and what do they reveal? What patterns do you see in the data? Are there any individual observations that deviate from the overall pattern, and what do they reveal? For example, in the coffee study, did the proportions differ when we compared the smokers to the non-smokers?
  • Inferring from the data: What are valid statistical methods for drawing inferences “beyond” the data you collected? In the coffee study, is the 10%–15% reduction in risk of death something that could have happened just by chance?
  • Drawing conclusions: Based on what you learned from your data, what conclusions can you draw? Who do you think these conclusions apply to? (Were the people in the coffee study older? Healthy? Living in cities?) Can you draw a cause-and-effect conclusion about your treatments? (Are scientists now saying that the coffee drinking is the cause of the decreased risk of death?)

Notice that the numerical analysis (“crunching numbers” on the computer) comprises only a small part of overall statistical investigation. In this section, you will see how we can answer some of these questions and what questions you should be asking about any statistical investigation you read about.

Distributional Thinking

When data are collected to address a particular question, an important first step is to think of meaningful ways to organize and examine the data. Let's take a look at an example.

Example 1: Researchers investigated whether cancer pamphlets are written at an appropriate level to be read and understood by cancer patients (Short, Moriarty, & Cooley, 1995). Tests of reading ability were given to 63 patients. In addition, readability level was determined for a sample of 30 pamphlets, based on characteristics such as the lengths of words and sentences in the pamphlet. The results, reported in terms of grade levels, are displayed in Figure 3.

Table showing patients' reading levels and pahmphlet's reading levels. Figure 3. Frequency tables of patient reading levels and pamphlet readability levels.

Testing these two variables reveal two fundamental aspects of statistical thinking:
  • Data vary. More specifically, values of a variable (such as reading level of a cancer patient or readability level of a cancer pamphlet) vary.
  • Analyzing the pattern of variation, called the distribution of the variable, often reveals insights.

Addressing the research question of whether the cancer pamphlets are written at appropriate levels for the cancer patients requires comparing the two distributions. A naïve comparison might focus only on the centers of the distributions. Both medians turn out to be ninth grade, but considering only medians ignores the variability and the overall distributions of these data. A more illuminating approach is to compare the entire distributions, for example with a graph, as in Figure 2.

Bar graph showing that the reading level of pamphlets is typically higher than the reading level of the patients. Figure 4. Comparison of patient reading levels and pamphlet readability levels.
Figure 2 makes clear that the two distributions are not well aligned at all. The most glaring discrepancy is that many patients (17/63, or 27%, to be precise) have a reading level below that of the most readable pamphlet. These patients will need help to understand the information provided in the cancer pamphlets. Notice that this conclusion follows from considering the distributions as a whole, not simply measures of center or variability, and that the graph contrasts those distributions more immediately than the frequency tables.

Statistical Significance

Even when we find patterns in data, often there is still uncertainty in various aspects of the data. For example, there may be potential for measurement errors (even your own body temperature can fluctuate by almost 1°F over the course of the day). Or we may only have a “snapshot” of observations from a more long-term process or only a small subset of individuals from the population of interest. In such cases, how can we determine whether patterns we see in our small set of data is convincing evidence of a systematic phenomenon in the larger process or population? Let's take a look at another example.

Example 2: In a study reported in the November 2007 issue of Nature, researchers investigated whether pre-verbal infants take into account an individual’s actions toward others in evaluating that individual as appealing or aversive (Hamlin, Wynn, & Bloom, 2007). In one component of the study, 10-month-old infants were shown a “climber” character (a piece of wood with “googly” eyes glued onto it) that could not make it up a hill in two tries. Then the infants were shown two scenarios for the climber’s next try, one where the climber was pushed to the top of the hill by another character (“helper”), and one where the climber was pushed back down the hill by another character (“hinderer”). The infant was alternately shown these two scenarios several times. Then the infant was presented with two pieces of wood (representing the helper and the hinderer characters) and asked to pick one to play with.

The researchers found that of the 16 infants who made a clear choice, 14 chose to play with the helper toy. One possible explanation for this clear majority result is that the helping behavior of the one toy increases the infants’ likelihood of choosing that toy. But are there other possible explanations? What about the color of the toy? Well, prior to collecting the data, the researchers arranged so that each color and shape (red square and blue circle) would be seen by the same number of infants. Or maybe the infants had right-handed tendencies and so picked whichever toy was closer to their right hand?

Well, prior to collecting the data, the researchers arranged it so half the infants saw the helper toy on the right and half on the left. Or, maybe the shapes of these wooden characters (square, triangle, circle) had an effect? Perhaps, but again, the researchers controlled for this by rotating which shape was the helper toy, the hinderer toy, and the climber. When designing experiments, it is important to control for as many variables as might affect the responses as possible. It is beginning to appear that the researchers accounted for all the other plausible explanations. But there is one more important consideration that cannot be controlled—if we did the study again with these 16 infants, they might not make the same choices. In other words, there is some randomness inherent in their selection process.


Maybe each infant had no genuine preference at all, and it was simply “random luck” that led to 14 infants picking the helper toy. Although this random component cannot be controlled, we can apply a probability model to investigate the pattern of results that would occur in the long run if random chance were the only factor.

If the infants were equally likely to pick between the two toys, then each infant had a 50% chance of picking the helper toy. It’s like each infant tossed a coin, and if it landed heads, the infant picked the helper toy. So if we tossed a coin 16 times, could it land heads 14 times? Sure, it’s possible, but it turns out to be very unlikely. Getting 14 (or more) heads in 16 tosses is about as likely as tossing a coin and getting 9 heads in a row. This probability is referred to as a p-value. The p-value represents the likelihood that experimental results happened by chance. Within psychology, the most common standard for p-values is “p < .05”. What this means is that there is less than a 5% probability that the results happened just by random chance, and therefore a 95% probability that the results reflect a meaningful pattern in human psychology. We call this statistical significance.

So, in the study above, if we assume that each infant was choosing equally, then the probability that 14 or more out of 16 infants would choose the helper toy is found to be 0.0021. We have only two logical possibilities: either the infants have a genuine preference for the helper toy, or the infants have no preference (50/50) and an outcome that would occur only 2 times in 1,000 iterations happened in this study. Because this p-value of 0.0021 is quite small, we conclude that the study provides very strong evidence that these infants have a genuine preference for the helper toy.

If we compare the p-value to some cut-off value, like 0.05, we see that the p=value is smaller. Because the p-value is smaller than that cut-off value, then we reject the hypothesis that only random chance was at play here. In this case, these researchers would conclude that significantly more than half of the infants in the study chose the helper toy, giving strong evidence of a genuine preference for the toy with the helping behavior.


Photo of a diverse group of college-aged students. Figure 5. Generalizability is an important research consideration: The results of studies with widely representative samples are more likely to generalize to the population. [Image: Barnacles Budget Accommodation]
One limitation to the study mentioned previously about the babies choosing the "helper" toy is that the conclusion only applies to the 16 infants in the study. We don’t know much about how those 16 infants were selected. Suppose we want to select a subset of individuals (a sample) from a much larger group of individuals (the population) in such a way that conclusions from the sample can be generalized to the larger population. This is the question faced by pollsters every day.

Example 3: The General Social Survey (GSS) is a survey on societal trends conducted every other year in the United States. Based on a sample of about 2,000 adult Americans, researchers make claims about what percentage of the U.S. population consider themselves to be “liberal,” what percentage consider themselves “happy,” what percentage feel “rushed” in their daily lives, and many other issues. The key to making these claims about the larger population of all American adults lies in how the sample is selected. The goal is to select a sample that is representative of the population, and a common way to achieve this goal is to select a random sample that gives every member of the population an equal chance of being selected for the sample. In its simplest form, random sampling involves numbering every member of the population and then using a computer to randomly select the subset to be surveyed. Most polls don’t operate exactly like this, but they do use probability-based sampling methods to select individuals from nationally representative panels.

In 2004, the GSS reported that 817 of 977 respondents (or 83.6%) indicated that they always or sometimes feel rushed. This is a clear majority, but we again need to consider variation due to random sampling. Fortunately, we can use the same probability model we did in the previous example to investigate the probable size of this error. (Note, we can use the coin-tossing model when the actual population size is much, much larger than the sample size, as then we can still consider the probability to be the same for every individual in the sample.) This probability model predicts that the sample result will be within 3 percentage points of the population value (roughly 1 over the square root of the sample size, the margin of error). A statistician would conclude, with 95% confidence, that between 80.6% and 86.6% of all adult Americans in 2004 would have responded that they sometimes or always feel rushed.

The key to the margin of error is that when we use a probability sampling method, we can make claims about how often (in the long run, with repeated random sampling) the sample result would fall within a certain distance from the unknown population value by chance (meaning by random sampling variation) alone. Conversely, non-random samples are often suspect to bias, meaning the sampling method systematically over-represents some segments of the population and under-represents others. We also still need to consider other sources of bias, such as individuals not responding honestly. These sources of error are not measured by the margin of error.

Cause and Effect Conclusions

In many research studies, the primary question of interest concerns differences between groups. Then the question becomes how were the groups formed (e.g., selecting people who already drink coffee vs. those who don’t). In some studies, the researchers actively form the groups themselves. But then we have a similar question—could any differences we observe in the groups be an artifact of that group-formation process? Or maybe the difference we observe in the groups is so large that we can discount a “fluke” in the group-formation process as a reasonable explanation for what we find?

Example 4: A psychology study investigated whether people tend to display more creativity when they are thinking about intrinsic (internal) or extrinsic (external) motivations (Ramsey & Schafer, 2002, based on a study by Amabile, 1985). The subjects were 47 people with extensive experience with creative writing. Subjects began by answering survey questions about either intrinsic motivations for writing (such as the pleasure of self-expression) or extrinsic motivations (such as public recognition). Then all subjects were instructed to write a haiku, and those poems were evaluated for creativity by a panel of judges. The researchers conjectured beforehand that subjects who were thinking about intrinsic motivations would display more creativity than subjects who were thinking about extrinsic motivations. The creativity scores from the 47 subjects in this study are displayed in Figure 4, where higher scores indicate more creativity.

Image showing a dot for creativity scores, which vary between 5 and 27, and the types of motivation each person was given as a motivator, either extrinsic or intrinsic. Figure 6. Creativity scores separated by type of motivation.

In this example, the key question is whether the type of motivation affects creativity scores. In particular, do subjects who were asked about intrinsic motivations tend to have higher creativity scores than subjects who were asked about extrinsic motivations?

Figure 4 reveals that both motivation groups saw considerable variability in creativity scores, and these scores have considerable overlap between the groups. In other words, it’s certainly not always the case that those with extrinsic motivations have higher creativity than those with intrinsic motivations, but there may still be a statistical tendency in this direction. (Psychologist Keith Stanovich (2013) refers to people’s difficulties with thinking about such probabilistic tendencies as “the Achilles heel of human cognition.”)

The mean creativity score is 19.88 for the intrinsic group, compared to 15.74 for the extrinsic group, which supports the researchers’ conjecture. Yet comparing only the means of the two groups fails to consider the variability of creativity scores in the groups. We can measure variability with statistics using, for instance, the standard deviation: 5.25 for the extrinsic group and 4.40 for the intrinsic group. The standard deviations tell us that most of the creativity scores are within about 5 points of the mean score in each group. We see that the mean score for the intrinsic group lies within one standard deviation of the mean score for extrinsic group. So, although there is a tendency for the creativity scores to be higher in the intrinsic group, on average, the difference is not extremely large.

We again want to consider possible explanations for this difference. The study only involved individuals with extensive creative writing experience. Although this limits the population to which we can generalize, it does not explain why the mean creativity score was a bit larger for the intrinsic group than for the extrinsic group. Maybe women tend to receive higher creativity scores? Here is where we need to focus on how the individuals were assigned to the motivation groups. If only women were in the intrinsic motivation group and only men in the extrinsic group, then this would present a problem because we wouldn’t know if the intrinsic group did better because of the different type of motivation or because they were women. However, the researchers guarded against such a problem by randomly assigning the individuals to the motivation groups. Like flipping a coin, each individual was just as likely to be assigned to either type of motivation. Why is this helpful? Because this random assignment tends to balance out all the variables related to creativity we can think of, and even those we don’t think of in advance, between the two groups. So we should have a similar male/female split between the two groups; we should have a similar age distribution between the two groups; we should have a similar distribution of educational background between the two groups; and so on. Random assignment should produce groups that are as similar as possible except for the type of motivation, which presumably eliminates all those other variables as possible explanations for the observed tendency for higher scores in the intrinsic group.

But does this always work? No, so by “luck of the draw” the groups may be a little different prior to answering the motivation survey. So then the question is, is it possible that an unlucky random assignment is responsible for the observed difference in creativity scores between the groups? In other words, suppose each individual’s poem was going to get the same creativity score no matter which group they were assigned to, that the type of motivation in no way impacted their score. Then how often would the random-assignment process alone lead to a difference in mean creativity scores as large (or larger) than 19.88 – 15.74 = 4.14 points?

We again want to apply to a probability model to approximate a p-value, but this time the model will be a bit different. Think of writing everyone’s creativity scores on an index card, shuffling up the index cards, and then dealing out 23 to the extrinsic motivation group and 24 to the intrinsic motivation group, and finding the difference in the group means. We (better yet, the computer) can repeat this process over and over to see how often, when the scores don’t change, random assignment leads to a difference in means at least as large as 4.41. Figure 5 shows the results from 1,000 such hypothetical random assignments for these scores.

Standard distribution in a typical bell curve. Figure 7. Differences in group means under random assignment alone.

Only 2 of the 1,000 simulated random assignments produced a difference in group means of 4.41 or larger. In other words, the approximate p-value is 2/1000 = 0.002. This small p-value indicates that it would be very surprising for the random assignment process alone to produce such a large difference in group means. Therefore, as with Example 4, we have strong evidence that focusing on intrinsic motivations tends to increase creativity scores, as compared to thinking about extrinsic motivations.

Notice that the previous statement implies a cause-and-effect relationship between motivation and creativity score; is such a strong conclusion justified? Yes, because of the random assignment used in the study. That should have balanced out any other variables between the two groups, so now that the small p-value convinces us that the higher mean in the intrinsic group wasn’t just a coincidence, the only reasonable explanation left is the difference in the type of motivation. Can we generalize this conclusion to everyone? Not necessarily—we could cautiously generalize this conclusion to individuals with extensive experience in creative writing similar the individuals in this study, but we would still want to know more about how these individuals were selected to participate.


Close-up photo of mathematical equations. Figure 8. Researchers employ the scientific method that involves a great deal of statistical thinking: generate a hypothesis --> design a study to test that hypothesis --> conduct the study --> analyze the data --> report the results. [Image: widdowquinn]
Statistical thinking involves the careful design of a study to collect meaningful data to answer a focused research question, detailed analysis of patterns in the data, and drawing conclusions that go beyond the observed data. Random sampling is paramount to generalizing results from our sample to a larger population, and random assignment is key to drawing cause-and-effect conclusions. With both kinds of randomness, probability models help us assess how much random variation we can expect in our results, in order to determine whether our results could happen by chance alone and to estimate a margin of error.

So where does this leave us with regard to the coffee study mentioned previously (the Freedman, Park, Abnet, Hollenbeck, & Sinha, 2012 found that men who drank at least six cups of coffee a day had a 10% lower chance of dying (women 15% lower) than those who drank none)? We can answer many of the questions:

  • This was a 14-year study conducted by researchers at the National Cancer Institute.
  • The results were published in the June issue of the New England Journal of Medicine, a respected, peer-reviewed journal.
  • The study reviewed coffee habits of more than 402,000 people ages 50 to 71 from six states and two metropolitan areas. Those with cancer, heart disease, and stroke were excluded at the start of the study. Coffee consumption was assessed once at the start of the study.
  • About 52,000 people died during the course of the study.
  • People who drank between two and five cups of coffee daily showed a lower risk as well, but the amount of reduction increased for those drinking six or more cups.
  • The sample sizes were fairly large and so the p-values are quite small, even though percent reduction in risk was not extremely large (dropping from a 12% chance to about 10%–11%).
  • Whether coffee was caffeinated or decaffeinated did not appear to affect the results.
  • This was an observational study, so no cause-and-effect conclusions can be drawn between coffee drinking and increased longevity, contrary to the impression conveyed by many news headlines about this study. In particular, it’s possible that those with chronic diseases don’t tend to drink coffee.

This study needs to be reviewed in the larger context of similar studies and consistency of results across studies, with the constant caution that this was not a randomized experiment. Whereas a statistical analysis can still “adjust” for other potential confounding variables, we are not yet convinced that researchers have identified them all or completely isolated why this decrease in death risk is evident. Researchers can now take the findings of this study and develop more focused studies that address new questions.

Learn More

Explore these outside resources to learn more about applied statistics:

Think It Over

  • Find a recent research article in your field and answer the following: What was the primary research question? How were individuals selected to participate in the study? Were summary results provided? How strong is the evidence presented in favor or against the research question? Was random assignment used? Summarize the main conclusions from the study, addressing the issues of statistical significance, statistical confidence, generalizability, and cause and effect. Do you agree with the conclusions drawn from this study, based on the study design and the results presented?
  • Is it reasonable to use a random sample of 1,000 individuals to draw conclusions about all U.S. adults? Explain why or why not.


cause-and-effect: related to whether we say one variable is causing changes in the other variable, versus other variables that may be related to these two variables.

distribution: the pattern of variation in data

generalizability: related to whether the results from the sample can be generalized to a larger population.

population: a larger collection of individuals that we would like to generalize our results to

p-value: the probability of observing a particular outcome in a sample, or more extreme, under a conjecture about the larger population or process

random assignment: using a probability-based method to divide a sample into treatment groups.

random sampling: using a probability-based method to select a subset of individuals for the sample from the population.

reliability: consistency and reproducibility of a given result

sample: the collection of individuals on which we collect data

statistic: a numerical result computed from a sample (e.g., mean, proportion)

statistical significance: a result is statistically significant if it is unlikely to arise by chance alone

validity: accuracy of a given result in measuring what it is designed to measure

margin of error: the expected amount of random variation in a statistic; often defined for 95% confidence level.

Psych in Real Life: Brain Imaging and Messy Science

This is a little difficult for a psychologist to ask, but here goes: when you think of a “science” which one of these is more likely to come to mind: physics or psychology?

We suspect you chose “physics” (though we don’t have the data, so maybe not!).

Despite the higher "status" of physics and chemistry in the world of science over psychology, good scientific reasoning is just as important in psychology. Valid logic, careful methodology, strong results, and empirically supported conclusions should be sought after regardless of the topic area.

We would like to you to exercise your scientific reasoning using the example below. Read the passage “Watching TV is Related to Math Ability” and answer a few questions afterwards.

Watching TV is Related to Math Ability

Television is often criticized for having a negative impact on our youth. Everything from aggressive behavior to obesity in children seems to be blamed on their television viewing habits. On the other hand, TV also provides us with much of our news and entertainment, and has become a major source of education for children, with shows like Sesame Street teaching children to count and say the alphabet.

Recently, researchers Ian McAtee and Leo Geraci at Harvard University did some research to examine if TV watching might have beneficial effects on cognition. The approach was fairly simple. Children between the ages of 12-14 were either asked to watch a television sitcom or do arithmetic problems, and while they were doing these activities, images of their brains were recorded using fMRI (functional magnetic resonance imaging). This technique measures the flow of blood to specific parts of the brain during performance, allowing scientists to create images of the areas that are activated during cognition.

Two images of brain fMRI scans. The top image shows red areas of activation in three different regions on the back of the head, and he bottom scan shows activation in two similar areas. A bar showing the intensity of the activation from red (2) to yellow (10) is shown next to the brain scans.Results revealed that similar areas of the parietal lobes were active during TV watching (the red area of the brain image on the top) and during arithmetic solving (the red area of the brain image on the bottom). This area of the brain has been implicated in other research as being important for abstract thought, suggesting that both TV watching and arithmetic processing may have beneficial effects on cognition. "We were somewhat surprised that TV watching would activate brain areas involved in higher-order thought processes because TV watching is typically considered a passive activity," said McAtee. Added Geraci, "The next step is to see what specific content on the TV show led to the pattern of activation that mimicked math performance, so we need to better understand that aspect of the data. We also need to compare TV watching to other types of cognitive skills, like reading comprehension and writing." Although this is only the beginning to this type of research, these findings certainly question the accepted wisdom that the "idiot box" is harmful to children's cognitive functioning.

Try It

Please rate whether you agree or disagree with the following statements about the article. There are no correct answers.

The article was well written.

  •   strongly agree
  •   disagree
  •   agree
  •   strongly agree

The title, “Watching TV is Related to Math Ability” was a good description of the results.

  •   strongly agree
  •   disagree
  •   agree
  •   strongly agree

The scientific argument in the article made sense.

  •   strongly agree
  •   disagree
  •   agree
  •   strongly agree

It is pretty surprising to learn that watching television can improve your math ability, and the fact that we can identify the area in the brain that produces this relationship shows how far psychology has progressed as a science.

Or maybe not.

The article you just read and rated was not an account of real research. Ian McAtee and Leo Geraci are not real people and the study discussed was never conducted (as far as we know). The article was written by psychologists David McCabe and Alan Castel for a study they published in 2008.[6] They asked people to do exactly what you just did: read this article and two others and rate them.

McCabe and Castel wondered if people’s biases about science influence the way they judge the information they read. In other words, if what you are reading looks more scientific, do you assume it is better science?

In recent years, neuroscience has impressed a lot of people as “real science,” when compared to the “soft science” of psychology. Did you notice the pictures of the brain next to the article that you just read? Do you think that picture had any influence on your evaluation of the scientific quality of the article? The brain pictures actually added no new information that was not already in the article itself other than showing you exactly where in the brain the relevant part of the parietal lobe is located. The red marks are in the same locations in both brain pictures, but we already knew that “Results revealed that similar areas in the parietal lobes were active during TV watching…and during arithmetic solving.”

The McCabe & Castel Experiment

McCabe and Castel wrote three brief (fake) scientific articles that appeared to be typical reports like those you might find in a textbook or news source, all with brain activity as part of the story. In addition to the one you read (“Watching TV is related to math ability “) others had these titles: “Meditation enhances creative thought” and “Playing video games benefits attention.”

All of the articles had flawed scientific reasoning. In the “Watching TV is Related to Math Ability” article that you read, the only “result” that is reported is that a particular brain area (a part of the parietal lobe) is active when a person is watching TV and when he or she is working on math. The highlighted part of the next sentence is where the article goes too far: “This area of the brain has been implicated in other research as being important for abstract thought, suggesting that both tv watching and arithmetic processing may have beneficial effects on cognition.”

The fact that the same area of the brain is active for two different activities does not “suggest” that either one is beneficial or that there is any interesting similarity in mental or brain activity between the processes. The final part of the article goes on and on about how this supposedly surprising finding is intriguing and deserves extensive exploration.

The researchers asked 156 college students to read the three articles and rate them for how much they made sense scientifically, as well as rating the quality of the writing and the accuracy of the title.

Everybody read exactly the same articles, but the picture that accompanied the article differed according to create three experimental conditions. For the article in the brain image condition, subjects saw one of the following brain images to the side of the article:

3 different images. The first is the brain activation fMRI showing activity in the brain, the other shows an overhead fMRI of activation and a statement that says Figure 9. Subjects in the experimental condition were shown ONE of the applicable brain images with each article they read.

Graphs are a common and effective way to display results in science and other areas, but most people are so used to seeing graphs that (according to McCabe and Castel) people should be less impressed by them than by brain images. The figures below show the graphs that accompanied the three articles for the bar graph condition. The results shown in the graphs were made up by the experimenters, but what they show is consistent with the information in the article.

3 bar graphs. The first on Figure 10. Participants in the bar graph condition were shown ONE of the bar graphs with each article they read.

Finally, in the control condition, the article was presented without any accompanying figure or picture. The control condition tells us how the subjects rate the articles without any extraneous, but potentially biasing, illustrations.

The Procedure

Each participant read all three articles: one with a brain image, one with a bar graph, and one without any illustration (the control condition). Across all the participants, each article was presented approximately the same number of times in each condition, and the order in which the articles were presented was randomized.

The Ratings

Immediately after reading each article, the participants rated their agreement with three statements: (a) The article was well written, (b) The title was a good description of the results, and (c) The scientific reasoning in the article made sense. Each rating was on a 4-point scale: (score=1) strongly disagree, (score=2) disagree, (score=3) agree, and (score=4) strongly agree. Remember that the written part of the articles was exactly the same in all three conditions, so the ratings should have been the same if people were not using the illustrations to influence their conclusions.

Before going on, let’s make sure you know the basic design of this experiment. In other words, can you identify the critical variables used in the study according to their function?



The first two questions for the participants were about (a) the accuracy of the title and (b) the quality of the writing. These questions were included to assure that the participants had read the articles closely. The experimenters expected that there would be no differences in the ratings for the three conditions for these questions. For the question about the title, their prediction was correct. Subjects gave about the same rating to the titles in all three conditions, agreeing that it was accurate.

For question (b) about the quality of the writing, the experimenters found that the two conditions with illustrations (the brain images and the bar graphs) were rated higher than the control condition. Apparently just the presence of an illustration made the writing seem better. This result was not predicted.


The main hypothesis behind this study was that subjects would rate the quality of the scientific reasoning in the article higher when it was accompanied by a brain image than when there was a bar graph or there was no illustration at all. If the ratings differed among conditions, then the illustrations—which added nothing substantial that was not in the writing—had to be the cause.

Try It

Use the graph below to show your predicted results of the experiment. Move the bars to the point where you think people generally agreed or disagreed with the statement that "the scientific reasoning in the article made sense." Higher bars mean that the person believes the reasoning in the article is better, and a lower bar means that they judge the reasoning as worse. Click on "Show Results" when you are done to compare your prediction with the actual results.


McCabe and Castel conducted two more experiments, changing the stories, the images, and the wording of the questions in each. Across the three experiments, they tested almost 400 college students and their results were consistent: participants rated the quality of scientific reasoning higher when the writing was accompanied by a brain image than in other conditions.

The implications of this study go beyond brain images. The deeper idea is that any information that symbolizes something we believe is important can influence our thinking, sometimes making us less thoughtful than we might otherwise be. This other information could be a brain image or some statistical jargon that sounds impressive or a mathematical formula that we don’t understand or a statement that the author teaches at Harvard University rather than Littletown State College.

In a study also published in 2008, Deena Weisberg and her colleagues at Yale University conducted a study similar to the one you just read.[7] Weisberg had people read brief descriptions of psychological phenomena (involving memory, attention, reasoning, emotion, and other similar topics). They rated the scientific quality of the explanations. Instead of images, Weisberg had some explanations that included entirely superfluous and useless brain information (e.g., “people feel strong emotion because the amygdala processes emotion”) or no such brain information. Weisberg found that a good explanation was rated as even better when it included a brain reference (which was completely irrelevant). When the explanation was flawed, students were fairly good at catching the reasoning problems UNLESS the explanation contained the irrelevant brain reference. In that case, the students rated the flawed explanations as being good. Weinstein and her colleague call the problem “the seductive allure of neuroscience explanations.”

Does it Replicate? The Messy World of Real Science

A few years after the McCabe and Castel study was published, some psychologists[8] at the University of Victoria in New Zealand, led by Robert Michael, were intrigued by the results and they were impressed by how frequently the paper had been cited by other researchers (about 40 citations per year between 2008 and 2012—a reasonably strong citation record). They wanted to explore the brain image effect, so they started by simply replicating the original study.[9]

In their first attempt at replication, the researchers recruited and tested people using an online site called Mechanical Turk. With 197 participants, they found no hint of an effect of the brain image on people’s judgments about the validity of the conclusions of the article they read. In a second replication study, they tested students from their university and again found no statistically significant effect. In this second attempt, the results were in the predicted direction (the presence of a brain image was associated with higher ratings), but the differences were not strong enough to be persuasive. They tried slight variations on instructions and people recruited, but across 10 different replication studies, only one produced a statistically significant effect.

So, did Dr. Michael and his colleagues accuse McCabe and Castel of doing something wrong? Did they tear apart the experiments we described earlier and show that they were poorly planned, incorrectly analyzed, or interpreted in a deceptive way?

Not at all.

It is instructive to see how professional scientists approached the problem of failing to replicate a study. Here is a quick review of the approach taken by the researchers who did not replicate the McCabe and Castel study:

  • First, they did not question the integrity of the original research. David McCabe[10] and Alan Castel are respected researchers who carefully reported on a series of well-conducted experiments. They even noted that the original paper was carefully reported, even if journalists and other psychologists had occasionally exaggerated the findings: “Although McCabe and Castel (2008) did not overstate their findings, many others have. Sometimes these overstatements were linguistic exaggerations…Other overstatements made claims beyond what McCabe and Castel themselves reported.” [p. 720]
  • Replication is an essential part of the scientific process. Michael and his colleagues did not back off of the importance of their difficulty reproducing the McCabe and Castel results. Clearly, McCabe and Castel’s conclusions—that “there is something special about the brain images with respect to influencing judgments of scientific credibility”—need to taken as possibly incorrect.
  • Michael and his colleagues looked closely at the McCabe and Castel results and their own, and they looked for interesting reasons that the results of the two sets of studies might be different.

      • Subtle effects: Perhaps the brain pictures really do influence their judgments, but only for some people or under very specific circumstances.
      • Alternative explanations: Perhaps people assume that irrelevant information is not typically presented in scientific reports. People may have believed that the brain images provided additional evidence for the claims.
      • Things have changed: The McCabe and Castel study was conducted in 2008 and the failed replication was in 2013. Neuroscience as very new to the general public in 2008, but a mere 5 years later, in 2013, it may have seemed less impressive.

Do images really directly affect people’s judgments of the quality of scientific thinking? Maybe yes. Maybe no. That’s still an open question.

The "Replication Crisis"

In recent years, there has been increased effort in the sciences (psychology, medicine, economics, etc.) to redo previous experiments to test their reliability. The findings have been disappointing at times.

The Reproducibility Project has attempted to replicate 100 studies within the field of psychology that were published with statistically significant results; they found that many of these results did not replicate well. Some did not reach statistical significance when replicated. Others reached statistical significance, but with much weaker effects than in the original study.

How could this happen?

  • Chance. Psychologist use statistics to confirm that their results did not occur simply because of chance. Within psychology, the most common standard for p-values is “p < .05”. This p-value means that there is less than a 5% probability that the results of an experiment happened just by random chance, and a 95% probability that the results were statistically significant. Even though a published study may reveal statistically significant results, there is still a possibility that those results were random.
  • Publication bias. Psychology research journals are far more likely to publish studies that find statistically significant results than they are studies that fail to find statistically significant results. What this means is that studies that yield results that are not statistically significant are very unlikely to get published. Let’s say that twenty researchers are all studying the same phenomenon. Out of the twenty, one gets statistically significant results, while the other nineteen all get non-significant results. The statistically significant result was likely just a result of randomness, but because of publication bias, that one study’s results are far more likely to be published than are the results of the other nineteen.

Note that this "replication crisis" itself does not mean that the original studies were bad, fraudulent, or even wrong. What it means, at its core, is that replication found results that were different from the results of the original studies. These results were sufficiently different that we might no longer be secure in our knowledge of what those results mean. Further replication and testing in other directions might give us a better understanding of why the results were different, but that too will require time and resources.

One Final Note

When we wrote to Dr. Alan Castel for permission to use his stimuli in this article, he not only consented, but he also sent us his data and copies of all of his stimuli. He sent copies of research by a variety of people, some research that has supported his work with David McCabe and some that has not. He even included a copy of the 10-experiment paper that you just read about, the one that failed to replicate the McCabe and Castel study.

The goal is to find the truth, not to insist that everything you publish is the last word on the topic. In fact, if it is the last word, then you are probably studying something so boring that no one else really cares.

Scientists disagree with one another all the time. But the disagreements are (usually) not personal. The evidence is not always neat and tidy, and the best interpretation of complex results is seldom obvious. At its best, it is possible for scientists to disagree passionately about theory and evidence, and later to relax over a cool drink, laugh and talk about friends or sports or life and love.

  1. David "P. McCabe & Alan D. Castel (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107, 343-352."
  2. Deena "Skolnick Weisberg, Frank C. Keil, Joshua Goodstein, Elizabeth Rawson, & Jeremy R. Gray (2008). The seductive allure of neuroscience explanations. Journal of Cognitive Neuroscience, 20(3), 470-477"
  3. Robert "B. Michael, Eryn J. Newman, Matti Vuorre, Geoff Cumming, and Maryanne Garry (2013). On the (non)persuasive power of a brain image. Psychonomic Bulletin & Review, 20(4), 720-725."
  4. They "actually tried to replicate Experiment 3 in the McCabe and Castel study. You read Experiment 1. These two experiments were similar and supported the same conclusions, but Dr. Michael and his colleagues preferred Experiment 3 for some technical reasons."
  5. David "McCabe, the first author of the original paper, tragically passed away in 2011 at the age of 41. At the time of his death, he was an assistant professor of Psychology at Colorado State University and he had started to build a solid body of published research, and he was also married with two young children. The problems with replicating his experiments were only published after his death, so it is impossible to know what his thoughts might have been about the issues these challenges raised."
  6. David "P. McCabe & Alan D. Castel (2008). Seeing is believing: The effect of brain images on judgments of scientific reasoning. Cognition, 107, 343-352."
  7. Deena "Skolnick Weisberg, Frank C. Keil, Joshua Goodstein, Elizabeth Rawson, & Jeremy R. Gray (2008). The seductive allure of neuroscience explanations. Journal of Cognitive Neuroscience, 20(3), 470-477"
  8. Robert "B. Michael, Eryn J. Newman, Matti Vuorre, Geoff Cumming, and Maryanne Garry (2013). On the (non)persuasive power of a brain image. Psychonomic Bulletin & Review, 20(4), 720-725."
  9. They "actually tried to replicate Experiment 3 in the McCabe and Castel study. You read Experiment 1. These two experiments were similar and supported the same conclusions, but Dr. Michael and his colleagues preferred Experiment 3 for some technical reasons."
  10. David "McCabe, the first author of the original paper, tragically passed away in 2011 at the age of 41. At the time of his death, he was an assistant professor of Psychology at Colorado State University and he had started to build a solid body of published research, and he was also married with two young children. The problems with replicating his experiments were only published after his death, so it is impossible to know what his thoughts might have been about the issues these challenges raised."

Licenses and Attributions