Over the past few years, an international team of almost 200 psychologists has been trying to repeat a set of previously published experiments from its field, to see if it can get the same results. Despite its best efforts, the project, called Many Labs 2, has only succeeded in 14 out of 28 cases. Six years ago, that might have been shocking. Now it comes as expected (if still somewhat disturbing) news.
In recent years, it has become painfully clear that psychology is facing a “reproducibility crisis,” in which even famous, long-established phenomena—the stuff of textbooks and TED Talks—might not be real. There’s social priming, where subliminal exposures can influence our behavior. And ego depletion, the idea that we have a limited supply of willpower that can be exhausted. And the facial-feedback hypothesis, which simply says that smiling makes us feel happier.
One by one, researchers have tried to repeat the classic experiments behind these well-known effects—and failed. And whenever psychologists undertake large projects, like Many Labs 2, in which they replicate past experiments en masse, they typically succeed, on average, half of the time.
Ironically enough, it seems that one of the most reliable findings in psychology is that only half of psychological studies can be successfully repeated.
That failure rate is especially galling, says Simine Vazire from the University of California at Davis, because the Many Labs 2 teams tried to replicate studies that had made a big splash and been highly cited. Psychologists “should admit we haven’t been producing results that are as robust as we’d hoped, or as we’d been advertising them to be in the media or to policy makers,” she says. “That might risk undermining our credibility in the short run, but denying this problem in the face of such strong evidence will do more damage in the long run.”
Many psychologists have blamed these replication failures on sloppy practices. Their peers, they say, are too willing to run small and statistically weak studies that throw up misleading fluke results, to futz around with the data until they get something interesting, or to only publish positive results while hiding negative ones in their file drawers.
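To see why small studies plus selective publication are such a toxic combination, here is a minimal simulation (my own illustration, not part of the Many Labs 2 analysis): thousands of tiny two-group studies are run where the true effect is exactly zero, and only the "statistically significant" ones make it out of the file drawer. The sample size and thresholds are assumptions chosen for illustration.

```python
# A hedged sketch: simulating how small, statistically weak studies with a
# true effect of zero still throw up occasional "significant" flukes, which
# selective publication then over-represents.
import math
import random

random.seed(1)

def one_study(n=10):
    """Run one tiny two-group study where the true effect is exactly zero."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    t = (mean_a - mean_b) / (pooled_sd * math.sqrt(2 / n))
    return t, mean_a - mean_b

results = [one_study() for _ in range(5000)]
# |t| > 2.10 approximates the p < .05 cutoff for 18 degrees of freedom.
flukes = [diff for t, diff in results if abs(t) > 2.10]

print(f"'significant' studies: {len(flukes) / len(results):.1%}")  # roughly 5%
print(f"mean |effect| among them: "
      f"{sum(abs(d) for d in flukes) / len(flukes):.2f} sd")
```

Roughly 5 percent of these null studies come out "significant" by chance alone, and the effects they report are large, around a full standard deviation, even though the true effect is zero. A literature built only from the significant results would look full of robust findings that no one could ever replicate.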
But skeptics have argued that the so-called crisis has more mundane explanations. First, the replication attempts themselves might be too small. Second, the researchers involved might be incompetent, or lack the know-how to properly pull off the original experiments. Third, people vary, and two groups of scientists might end up with very different results if they do the same experiment on two different groups of volunteers.
The Many Labs 2 project was specifically designed to address these criticisms. With 15,305 participants in total, the new experiments had, on average, 60 times as many volunteers as the studies they were attempting to replicate. The researchers involved worked with the scientists behind the original studies to vet and check every detail of the experiments beforehand. And they repeated those experiments many times over, with volunteers from 36 different countries, to see if the studies would replicate in some cultures and contexts but not others. “It’s been the biggest bear of a project,” says Brian Nosek from the Center for Open Science, who helped to coordinate it. “It’s 28 papers’ worth of stuff in one.”
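The importance of that 60-fold increase in volunteers can be sketched with the standard normal-approximation power formula for a two-group comparison (the specific effect size and sample sizes below are my assumptions, not figures from the project):

```python
# A rough sketch of how much a 60-fold larger sample changes the odds of
# detecting a modest true effect, using the normal-approximation power
# formula for a two-sided, two-sample test.
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(effect_sd, n_per_group, alpha_z=1.96):
    """Approximate power to detect `effect_sd` (in standard deviations)."""
    return phi(effect_sd * math.sqrt(n_per_group / 2) - alpha_z)

d = 0.3                   # a modest true effect, in sd units (assumed)
n_small = 25              # per group, a typical small-study size (assumed)
n_large = n_small * 60    # mirroring Many Labs 2's 60-fold increase

print(f"power at n={n_small} per group:  {power(d, n_small):.0%}")
print(f"power at n={n_large} per group: {power(d, n_large):.0%}")
```

Under these assumptions, the small study detects the effect less than a fifth of the time, while the 60-fold larger one detects it essentially always. At that scale, a failed replication is hard to blame on an underpowered attempt.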
Despite the large sample sizes and the blessings of the original teams, the project failed to replicate half of the studies it focused on. It couldn’t, for example, show that people subconsciously exposed to the concept of heat were more likely to believe in global warming, or that moral transgressions create a need for physical cleanliness in the style of Lady Macbeth, or that people who grow up with more siblings are more altruistic. And as in previous big projects, online bettors were surprisingly good at predicting beforehand which studies would ultimately replicate. Somehow, they could intuit which studies were reliable.
But other intuitions were less accurate. In 12 cases, the scientists behind the original studies suggested traits that the replicators should account for. They might, for example, only find the same results in women rather than men, or in people with certain personality traits. In almost every case, those suggested traits proved to be irrelevant. The results just weren’t that fickle.
Likewise, Many Labs 2 “was explicitly designed to examine how much effects varied from place to place, from culture to culture,” says Katie Corker from Grand Valley State University, who chairs the Society for the Improvement of Psychological Science. “And here’s the surprising result: The results do not show much variability at all.” If one of the participating teams successfully replicated a study, others did, too. If a study failed to replicate, it tended to fail everywhere.
It’s worth dwelling on this because it’s a serious blow to one of the most frequently cited criticisms of the “reproducibility crisis” rhetoric. Surely, skeptics argue, it’s a fantasy to expect studies to replicate everywhere. “There’s a massive deference to the sample,” Nosek says. “Your replication attempt failed? It must be because you did it in Ohio and I did it in Virginia, and people are different. But these results suggest that we can’t just wave those failures away very easily.”
This doesn’t mean that cultural differences in behavior are irrelevant. As Yuri Miyamoto from the University of Wisconsin at Madison notes in an accompanying commentary, “In the age of globalization, psychology has remained largely European [and] American.” Many researchers have noted that volunteers from Western, educated, industrialized, rich, and democratic countries—WEIRD nations—are an unusual slice of humanity who think differently than those from other parts of the world.
In the majority of the Many Labs 2 experiments, the team found very few differences between WEIRD volunteers and those from other countries. But Miyamoto notes that this analysis was a little crude: in treating “non-WEIRD countries” as a single category, it lumps together people from cultures as diverse as Mexico, Japan, and South Africa. “Cross-cultural research,” she writes, “must be informed with thorough analyses of each and all of the cultural contexts involved.”
Nosek agrees. He’d love to see big replication projects that include more volunteers from non-Western societies, or that try to check phenomena that you’d expect to vary considerably outside the WEIRD bubble. “Do we need to assume that WEIRDness matters as much as we think it does?” he asks. “We don’t have a good evidence base for that.”
Sanjay Srivastava from the University of Oregon says the lack of variation in Many Labs 2 is actually a positive thing. Sure, it suggests that the large number of failed replications really might be due to sloppy science. But it also hints that the fundamental business of psychology—creating careful lab experiments to study the tricky, slippery, complicated world of the human mind—works pretty well. “Outside the lab, real-world phenomena can and probably do vary by context,” he says. “But within our carefully designed studies and experiments, the results are not chaotic or unpredictable. That means we can do valid social-science research.”
The alternative would be much worse. If it turned out that people were so variable that even very close replications threw up entirely different results, “it would mean that we could not interpret our experiments, including the positive results, and could not count on them happening again,” Srivastava says. “That might allow us to dismiss failed replications, but it would require us to dismiss original studies, too. In the long run, Many Labs 2 is a much more hopeful and optimistic result.”
* A mention of the marshmallow test was removed from an early paragraph, since the circumstances there differ from those of other failed replications.