The Inevitable Evolution of Bad Science

A simulation shows how the incentives of modern academia naturally select for weaker and less reliable results.


Bacteria, animals, languages, cancers: all of these things can evolve, which we know from the work of legions of scientists. You could argue that science itself also evolves. Researchers vary in their methods and attitudes, in ways that affect their success, and they pass those traits to the students they train. Over time, the very culture of science is sculpted by natural selection—and according to Paul Smaldino and Richard McElreath, it is headed in an unenviable direction.

The problem, as others have noted, is that what is good for individual scientists is not necessarily what is good for science as a whole. A scientist’s career currently depends on publishing as many papers as possible in the most prestigious possible journals. More than any other metric, that’s what gets them prestige, grants, and jobs.

Now, imagine you’re a researcher who wants to game this system. Here’s what you do. Run many small and statistically weak studies. Tweak your methods on the fly to ensure positive results. If you get negative results, sweep them under the rug. Never try to check old results; only pursue new and exciting ones. These are not just flights of fancy. We know that such practices abound. They’re great for getting publications, but they also pollute the scientific record with results that aren’t actually true. As Richard Horton, editor of The Lancet, once wrote, “No one is incentivized to be right. Instead, scientists are incentivized to be productive.”

This is not a new idea. In the 1970s, social scientist Donald Campbell wrote that any metric of quality can become corrupted if people start prioritizing the metric itself over the traits it supposedly reflects. “We realized that his argument works even if individuals aren’t trying to maximize their metrics,” says Smaldino.

He and McElreath demonstrated this by creating a mathematical model in which simulated labs compete with each other and evolve—think SimAcademia. The labs choose things to study, run experiments to test their hypotheses, and try to publish their results. They vary in how much effort they expend in testing their ideas, which affects how many results they get, and how reliable those results are. There’s a trade-off: more effort means truer but fewer publications.

In the model, as in real academia, positive results are easier to publish than negative ones, and labs that publish more get more prestige, funding, and students. They also pass their practices on. With every generation, one of the oldest labs dies off, while one of the most productive ones reproduces, creating an offspring that mimics the research style of the parent. That’s the equivalent of a student from a successful team starting a lab of their own.
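The dynamics described above can be sketched in a few dozen lines of code. To be clear, this toy version is not Smaldino and McElreath’s actual model—every number here (lab count, studies per generation, publication probabilities, mutation size) is invented for illustration. It only reproduces the qualitative setup: effort trades off against output, only positive results get published, the oldest lab dies, and the most-published lab spawns an imitator.

```python
import random

random.seed(42)  # fixed seed so the toy run is repeatable

class Lab:
    def __init__(self, effort):
        self.effort = effort  # 0..1; higher = more rigor, fewer studies
        self.papers = 0
        self.age = 0

def run_generation(labs, mutation=0.05):
    for lab in labs:
        lab.age += 1
        # More effort means fewer studies per generation (the trade-off).
        n_studies = max(1, int(10 * (1 - lab.effort)))
        for _ in range(n_studies):
            # Rigorous labs find true effects; sloppy labs generate
            # false positives—but a positive is a publication either way.
            true_positive = random.random() < 0.5 * lab.effort
            false_positive = random.random() < 0.5 * (1 - lab.effort)
            if true_positive or false_positive:
                lab.papers += 1
    # Oldest lab retires; the most-published lab spawns an imitator
    # that inherits its effort level, plus a little noise.
    labs.remove(max(labs, key=lambda l: l.age))
    parent = max(labs, key=lambda l: l.papers)
    child_effort = min(1.0, max(0.0, parent.effort + random.gauss(0, mutation)))
    labs.append(Lab(child_effort))

labs = [Lab(random.uniform(0.4, 0.9)) for _ in range(20)]
start = sum(l.effort for l in labs) / len(labs)
for _ in range(500):
    run_generation(labs)
end = sum(l.effort for l in labs) / len(labs)
print(f"mean effort: {start:.2f} -> {end:.2f}")
```

Note that no simulated lab “decides” anything: each child simply copies its parent’s effort with a bit of noise, yet selection on publication counts alone drags the population’s average effort downward over the generations.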

Over time, and across many simulations, the virtual labs inexorably slid towards less effort, poorer methods, and almost entirely unreliable results. And here’s the important thing: Unlike the hypothetical researcher I conjured up earlier, none of these simulated scientists was actively trying to cheat. They used no deliberate strategy, and they behaved with integrity. Yet the community still drifted towards poorer methods. What the model shows is that a world that rewards scientists for publications above all else—a world not unlike this one—naturally selects for weak science.

“The model may even be optimistic,” says Brian Nosek from the Center for Open Science, because it doesn’t account for our unfortunate tendency to justify and defend the status quo. He notes, for example, that studies in the social and biological sciences are, on average, woefully underpowered—they are too small to find reliable results.

Low statistical power is an obvious symptom of weak research. It is easily calculated, and people have been talking about it since the 1960s. And yet, in over 50 years, it hasn’t improved at all. Indeed, “there is still active resistance to efforts to improve statistical power by scientists themselves,” says Nosek. “With desire to get it published dominating desire to get it right, researchers will defend low statistical power despite it having zero redeeming qualities for science.”
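Statistical power is indeed easily calculated. Here is a back-of-the-envelope version using a normal approximation for a two-sample comparison; the effect size (0.4) and group sizes are illustrative numbers of my choosing, not figures from Nosek or the studies he describes.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    standardized effect size d with n subjects per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)       # critical value, e.g. 1.96
    shift = d * sqrt(n_per_group / 2)       # effect in standard-error units
    # Chance of clearing the critical value under the alternative
    # (the negligible opposite tail is ignored).
    return 1 - z.cdf(z_crit - shift)

# A typical small study vs. a properly sized one, for the same effect:
print(f"n=20 per group:  power = {power_two_sample(0.4, 20):.2f}")   # ~0.24
print(f"n=100 per group: power = {power_two_sample(0.4, 100):.2f}")  # ~0.81
```

In other words, a study with 20 subjects per group has roughly a one-in-four chance of detecting this real effect—and when such a study does report a positive result, that result is disproportionately likely to be a fluke or an overestimate.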

Scientists are now grappling with the consequences of that stagnation. In many fields, including neuroscience, genetics, psychology, ecology, and biomedicine, there’s talk of a reproducibility crisis, where weak and poorly designed studies have flooded the world with doubtful findings. “We spend a lot of time complaining about the culture of science, but verbal arguments allow people to talk past each other,” says Smaldino. “A formal model allows you to be clearer about what you’re talking about.”

For example, many scientists have focused on replication—repeating past studies to see if their results hold—as a way of improving the reliability of science. But that won’t fix things, according to Smaldino and McElreath’s model. In one version of the simulation, labs could spend time replicating past work, and if those attempts failed, the original researchers took a big reputational hit. But that didn’t matter, “because there are way more results than can possibly be replicated,” says Smaldino. In the long run, labs that used shoddy methods got away with it, even if others occasionally called them out on their dubious results.

“As long as the incentives are there, then rewards will be there for those who can cheat the system, whether they do so intentionally or not,” says Smaldino. To improve science, the incentives must change.

Those changes have to be pervasive, but they don’t have to be big, says Nosek. For example, when scientists go up for promotions, they are often asked to submit their full list of papers. No one has the time to read all of those, so committee members default to imperfect metrics like number of papers or prestige of journals. “An easy change is to ask the candidate to send three articles, which the committee can read and evaluate in detail,” says Nosek. “Now, the candidate's incentives are to produce three outstanding pieces of work.”

But the U.K. has already instituted such a system to judge its scientists, and Andrew Higginson and Marcus Munafo, two psychologists from the universities of Exeter and Bristol respectively, doubt that it’s better. They used another mathematical model to predict how scientists should act to maximize the value of their publications to their careers. They found that if people are judged on a small number of high-impact publications, their best strategy is to focus all their effort on underpowered studies that only go after new findings without checking old ones. As a result, half of what they publish will be wrong.

There are other solutions. Some scientists have argued for a system of “pre-registration,” in which studies are evaluated on the strength of their ideas and plans, before any actual work is carried out. The researchers commit to carrying out the plans to the letter, and journals commit to publishing the results come what may. That reduces the capacity and incentive to mess with studies to boost one’s odds of getting a paper. It also moves the focus away from eye-catching results and towards solid, reliable methods. Almost 40 journals now publish these kinds of Registered Reports, and there are moves to tie them more closely to grants, so that a single review of a study’s methods guarantees funding and publication.

Putting a premium on transparency can also help, says Simine Vazire, a psychologist at the University of California, Davis. “If authors are required to disclose more details about their research, journals and reviewers will be in a better position to evaluate the quality of studies, and it will be much harder for authors to game the system.”

Top journals like Nature and Science are indeed encouraging authors to be more transparent about their data and methods, while providing checklists to make it easier for editors to inspect the statistical qualities of new papers. And Nosek’s Center for Open Science has created standards for transparency, openness, and reproducibility that journals and funding agencies can sign up to, and badges for good behavior.

Ultimately, “changing incentives across the complex science ecosystem is a coordination problem,” says Nosek. “Institutions, funders, editors, societies, and researchers themselves all need to change their expectations a little or else no change will be effective.”

Munafo is hopeful. “We have moved on from describing the problem to understanding its nature,” he says. “This is a healthy sign. Hopefully it will lead to clues as to where we can most efficiently change incentive structures. We're in the middle of a fascinating natural experiment, with lots of innovations being introduced or piloted. What works and doesn't work, and what is popular versus unpopular, remains to be seen.”

“I don’t want to be overly pessimistic,” says Smaldino. “There are a lot of really high-quality scientists who strive to do high-quality work. There are tons of individuals who realize that quality matters. I just hope that sentiment prevails.”