In recent years, scientists have been dealing with concerns about a reproducibility crisis—the possibility that many published findings may not actually be true. Psychologists have grappled intensively with this problem, trying to assess its scope and look for solutions. And two reports from pharmaceutical companies have suggested that cancer biologists have to face a similar reckoning.  

In 2011, Bayer Healthcare said that its in-house scientists could only validate 25 percent of basic studies in cancer and other conditions. (Drug companies routinely do such checks so they can use the information in those studies as a starting point for developing new drugs.) A year later, Glenn Begley and Lee Ellis from Amgen said that the firm could only confirm the findings in 6 out of 53 landmark cancer papers—just 11 percent. Perhaps, they wrote, that might explain why “our ability to translate cancer research to clinical success has been remarkably low.”

But citing reasons of confidentiality, neither the Bayer nor the Amgen team released the list of papers it checked, its methods, or its results. Ironically, without that information, there was no way of checking if their claims about irreproducibility were themselves reproducible. “The reports were shocking, but also seemed like finger-pointing,” says Tim Errington, a cell biologist at the Center for Open Science (COS).

Elizabeth Iorns had the same thought, and she saw a way to do a better and more transparent job. She had founded a start-up called Science Exchange, which uses a large network of contract labs to provide research support to scientists—and in some cases, check their work. She contacted the COS, and together, they launched the Reproducibility Project: Cancer Biology—an initiative that used the Science Exchange labs to replicate key results from the 50 most cited papers in cancer biology, published between 2010 and 2012. (The COS recently used the same model for psychology studies to good effect.)  

The results from the first five of these replication attempts were published today—and they offer no clean answers. Two of them largely (but not entirely) confirmed the conclusions of the original studies. One failed to do so. And two were inconclusive for technical reasons—the mouse strains or cancer cell lines that were used in the original studies didn’t behave in the same way the second time round. These uncertainties mean that it’s very hard to say whether each replication attempt “worked,” or whether each original study was actually reproducible.

“Everyone wants us to paint the project in black and white,” says Errington. “What percent of these papers replicate? I’ve been asked that so many times, but it’s not an easy question.” To him, the project’s goal isn’t to get a hard percentage, but to understand why two seemingly identical goes at the same experiment might produce different results, and to ultimately make it easier for one group of scientists to check another’s work.

The Reproducibility Project team pre-registered all of their work. That is, for each targeted paper, they wrote up their experimental plans in full, ran them past the original authors, and submitted them to the journal eLife for peer review. Only then did they start the experiments. Once the results were in, they were reviewed a second time, before being published.

The hardest part, by far, was figuring out exactly what the original labs actually did. Scientific papers come with methods sections that theoretically ought to provide recipes for doing the same experiments. But often, those recipes are incomplete, missing out important steps, details, or ingredients. In some cases, the recipes aren’t described at all; researchers simply cite an earlier study that used a similar technique. “I’ve done it myself: you reference a previous paper and that one references a paper and that one references a paper, and now you’ve gone years and the methodology doesn’t exist,” admits Errington. “Most people looking at these papers wouldn’t even think of going through these steps. They’d just guess. If you asked 20 different labs to replicate a paper, you’d end up with 10 different methodologies that aren’t really comparable.”

So, in every case, he had to ask the scientists behind the original experiments for the details of their work. Oftentimes, the person who actually did the experiments had left the lab, so an existing team member had to rummage through old notebooks or data files. The project ended up being hugely time-consuming for everyone concerned. “We spent a boatload of time trying to get back to ground zero,” says Errington.

And for what? The results of the first five papers show just how hard it is to interpret a replication attempt in this field. For example, in 2012, Levi Garraway at the Dana-Farber Cancer Institute found that melanoma skin cancers frequently carry mutations in a gene called PREX2. His team then showed that these mutations accelerate the growth of human melanoma cells that were transplanted onto mice. But the replicating team couldn’t confirm the latter result; in their experiment, the PREX2 mutations made no difference.

Does that mean that Garraway’s study was wrong? Not quite. Even though the replication team got their melanoma cells and mice from the same source as Garraway’s group, in their hands, the transplanted tumors grew much faster than had been reported. The PREX2 mutations made no difference because all the cells were already zooming along in sixth gear. Small differences in the ways the cells were grown or the mice were housed could have contributed to the differences between these studies, writes Roger Davis, a cell biologist at the University of Massachusetts Medical School, who reviewed the PREX2 replication paper.

In another case, Irving Weissman from Stanford Medicine showed that cancer cells carry high levels of a protein called CD47, and antibodies that target this protein can slow the growth of human tumor cells that had been transplanted into mice. In this case, the replication experiment was inconclusive because all the transplanted tumors would spontaneously regress, antibodies or no.

Some might argue that these differences arise because the project relied on contractors, who lack the experience and artisanal skills of the scientists in the original teams. Iorns disagrees. “The teams were all selected for their technical expertise in the experiments being conducted,” she says. “They routinely run these types of experiments all the time.”

Instead, she and Errington argue that the differences stem from the inherent and underappreciated variability of the cells and animals being used in these studies. In psychology, researchers who replicate a study have no choice but to recruit different volunteers, who might differ from the original sample in critical ways. But in theory, cancer biologists should be able to use the exact same lineage of cells or breed of rodents—genetically identical and sourced from the same suppliers—which should behave in the same way. “But some of these models kind of fell apart, and you can’t dismiss that,” says Errington. He hopes that these results will spur other scientists to better explore those variations, and include more quality control steps in their work.

And perhaps the most important result from the project so far, as Daniel Engber wrote in Slate, is that it has been “a hopeless slog.” “If people had deposited raw data and full protocols at the time of publication, we wouldn’t have to go back to the original authors,” says Iorns. That would make it much easier for scientists to truly check each other’s work.

The National Institutes of Health seem to agree. In recently released guidelines, meant to improve the reproducibility of research, they recommend that journals ask for more thorough methods sections and more sharing of data. And in this, the Reproducibility Project have modelled the change they want to see, documenting every step of their project on a wiki.

“We want to applaud replication efforts like this,” says Atul Butte from the University of California, San Francisco, whose study was among the two that were successfully reproduced. “It is important for the public to have trust in scientists, and belief in the veracity of our published findings.” But he suggests that the team choose its targeted studies in a “more impactful manner”—not by citations, but by those that are most likely to lead to new treatments.

In the meantime, the team still needs to finish its first wave of replications. They initially set out to replicate 50 old papers, but the unexpectedly high costs of doing so have forced them to scale back. “In the end, we think we’ll complete 30,” says Iorns.