Psychology's Replication Crisis Has a Silver Lining

It’s an opportunity for the field to lead.

Pedestrians are reflected in the glass side of the John Hancock Tower in downtown Boston. (Brian Snyder / Reuters)

There is a crisis in psychology. It’s not those rare cases of outright fraud, as when the social psychologist Diederik Stapel simply made up the results of dozens of experiments and published them in top journals. The more serious problem—the one that keeps many of us up at night—has to do with the practices of honest and well-intentioned researchers.

Over the last several years, commentators have pointed out that much of psychologists’ standard operating procedure—our style of collecting data, analyzing our results, reporting our findings, and deciding what to submit for publication—is biased toward “false positives,” where random effects are reported as significant findings. Too many of us engage in “p-hacking,” for instance, where we rummage through our data looking for statistically significant findings, and then, with the most innocent of intentions, convince ourselves that these findings are precisely what we predicted in the first place.
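
To see how quickly this adds up, here is a minimal simulation of the false-positive problem (my own illustration, not anything reported in this piece or the studies it discusses). It assumes a hypothetical study with no real effects at all, in which a researcher measures ten different outcomes and is willing to report whichever one crosses the conventional p < .05 threshold.

```python
# Sketch of why flexible analysis inflates false positives: simulate studies with
# NO true effect, test many outcome measures, and count how often at least one
# comes out "significant" at p < .05 purely by chance. (Illustrative assumptions:
# 30 participants per group, 10 independent outcome measures.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_null_study(n_participants=30, n_measures=10):
    # Both groups are drawn from the same distribution, so any "effect" is noise.
    group_a = rng.normal(size=(n_participants, n_measures))
    group_b = rng.normal(size=(n_participants, n_measures))
    p_values = stats.ttest_ind(group_a, group_b).pvalue  # one t-test per measure
    return (p_values < 0.05).any()  # did the researcher find something to report?

n_studies = 2000
rate = np.mean([one_null_study() for _ in range(n_studies)])
print(f"Null studies with at least one 'significant' result: {rate:.0%}")
# With 10 measures, roughly 1 - 0.95**10, or about 40 percent, of purely null
# studies hand the researcher a publishable-looking finding.
```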

It’s not surprising, then, that when other psychologists attempt to replicate published work—as was done recently with 100 studies, in a much-cited article in Science—most of the results don’t hold up.

This is bad news. It means that when a new finding is published in one of our flagship journals, an informed psychologist can wonder about how many times the authors tried variants of their experiment before striking gold, what analyses they didn’t report, whether they crafted their hypotheses in response to their findings, and whether they would get the same findings if they did the experiment again. I just read a study reporting that drinking sauerkraut juice makes you more likely to support extreme right-wing ideology. Maybe this is a robust and powerful finding, and maybe if you did the study again, you’d get the same result. But I’m entitled to be skeptical.

These critiques have generated some bitterness. Nobody likes to be told that they’re doing things wrong and that their results can’t be trusted. (I’ve had studies of mine replicate and others fail to replicate—believe me, the former feels nicer.) There are legitimate complaints about the nasty and gleeful tone taken by some of the critics, particularly over social media—the phrase “replication bully” has recently been coined—and there are legitimate worries that certain individuals, and certain research programs, have been targeted for special scrutiny. But most of the criticisms are reasonable and persuasive.

It’s worth emphasizing that it’s not just our crisis. Similar issues arise in psychiatry, economics, particle physics, and, most of all, medical research. It’s hard not to be shocked by a recent Nature paper reporting a failure to replicate significant cancer-research experiments in 47 out of 53 cases. The special attention that psychology has received might not be because we are unusually bad scientists, but because we are reflective about our research, and devoted to cleaning up our act.

So there’s a crisis. But there are a few reasons why it’s not as bad as you might have heard.

For one thing, any psychologist can easily list mountains of findings that are rock-solid. There are basic discoveries about memory, language, perception, reasoning, development, social psychology, clinical psychology, neuroscience, and so on—findings that have been replicated countless times, that have led to rich theoretical advances and real practical improvements in people’s lives. And each day, new publications emerge that report robust and convincing findings that move the field forward. Things aren’t that bad.

And while I’m all in favor of replication, not every failure to replicate is a cause for hand-wringing. Even if an experiment is perfect, sometimes effects don’t show up due to random chance.
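
A quick way to see this (a sketch of my own, not an analysis from this piece): suppose an effect is genuinely real but modest, and a replication is run flawlessly with a typical sample size. The study’s statistical power is then well below 100 percent, so a fair fraction of perfect replications will still miss the conventional significance threshold.

```python
# Illustrative power simulation: a real effect (assumed Cohen's d = 0.4) tested
# with 30 participants per group. Every replication here is "perfect"; only
# sampling noise varies. Many still fail to reach p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def replication_reaches_significance(true_effect=0.4, n_per_group=30):
    control = rng.normal(0.0, 1.0, size=n_per_group)
    treatment = rng.normal(true_effect, 1.0, size=n_per_group)
    return stats.ttest_ind(treatment, control).pvalue < 0.05

n_replications = 2000
power = np.mean([replication_reaches_significance() for _ in range(n_replications)])
print(f"Flawless replications reaching p < .05: {power:.0%}")  # roughly a third
```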

And often an experiment isn’t perfect. Most psychology research explores subtle, theoretically motivated predictions, typically looking for effects that can only arise in certain controlled environments. If the experiment isn’t carefully done, you won’t get the findings.

Here’s an example: It’s long been known that children can use the syntax of a word to guess at its meaning. If you show a 3-year-old a strange object and say “This is a dax,” she will think that the noun “dax” refers to the object; if you say “This is a daxy one,” she will think that the adjective “daxy” refers to a property of the object, such as its color.

This is a robust finding, one that bears on theories of how children learn to talk. But it’s easy to fail to get this effect—just run your experiment badly. When saying the sentences, mumble, so that the children aren’t sure what you said. Loom over the children during the experiment, so that they are too frightened to pay attention. Test in a busy room so that the children are too distracted to focus.

If even a finding as simple and easily replicated as this one can fail to emerge if the experimenter is unskilled, consider how easy it is not to get more subtle—but very real—effects.

Think of an experiment by analogy with a recipe. Suppose your friend tells you that your recipe for apple pie is a failure: the pie might have tasted great when you used the recipe to make it yourself, but when he tried it, it turned out awful. The pie does not replicate! Does this mean you should abandon the recipe? Well, maybe. But what if it turns out that your friend used moldy apples, that his oven is broken, and that he neglected to add the crust? The failure wasn’t due to the recipe, but to the cook.

Plainly, a failure to replicate means a lot when it’s done by careful and competent experimenters, and when it’s clear that the methods are sensitive enough to find an effect if one exists. Many failures to replicate are of this sort, and these are of considerable scientific value. But I’ve read enough descriptions of failed replications to know how badly some of them are done. I’m aware as well that some attempts at replication are done by undergraduates who have never run a study before. Such replication attempts are a great way to train students to do psychological research, but when they fail to get an effect, the response of the scientific community should be: Meh.

A final reason for calm is that there are good fixes for our field, solutions to where we’ve been going wrong. These include ideas about how to improve our statistical analyses, our modes of data collection, and our publication practices.

It’s somewhat awkward that psychology is Patient Zero here, and that what seems like a family quarrel gets aired out in the pages of The New York Times. But there’s value to this sort of public exposure. It’s important for non-scientists to have some degree of scientific literacy, beyond a passing familiarity with certain theories and discoveries. Scientific literacy requires an appreciation of how science works, and how it stands apart from other human activities. A public discussion about how scientists make mistakes and how they can work to correct them will help advance scientific understanding more generally. Psychology can lead the way here.