You read a scientific paper, look at the results, and ask yourself: Are these real? Do they reflect something genuine about the world, or are they statistical flukes? This ability to critically analyze publications undergirds all of science. It is the essence of the peer-review process. And, apparently, it's harder than it looks.

Consider psychology. The field has recently been embarrassed by failed attempts to repeat the results of classic textbook experiments, and by a mounting realization that many papers are the result of commonly accepted statistical shenanigans rather than careful attempts to test hypotheses. Tellingly, as I covered in August, a coordinated attempt to repeat 100 published experiments, led by Brian Nosek from the University of Virginia, managed to reproduce the results of just a third of them.

Which raises the question: Exactly how good are psychologists at working out whether their own studies are reliable?

Pretty good, actually, according to Anna Dreber from the Stockholm School of Economics, provided you pool their wisdom—and get them to gamble. Dreber created a stock market for scientific publications, where psychologists could buy or sell “stocks” in 44 published studies based on how reproducible they deemed the findings. And these markets predicted the outcomes of actual replication attempts pretty well, and certainly far better than any of the traders did on their own.

Dreber's experiment was born in a bar. Over drinks with her husband Johan Almenberg and roommate Thomas Pfeiffer, she was talking about an attention-grabbing psychological study that she thought was “cute, but unlikely to be true.” When she wondered how good her instincts were, Pfeiffer brought up a paper by the economist Robin Hanson at George Mason University. Titled “Could Gambling Save Science?,” it suggested that researchers could reach a more honest consensus on scientific controversies by betting on their outcomes, in the way that traders bet on the future prices of goods.

“It blew us all away,” says Dreber. In 2012, she and her colleagues contacted Nosek, who agreed to add prediction markets to his big Reproducibility Project.

Here's how it worked. Each of the 92 participants received $100 to buy and sell stocks in 41 studies that were in the process of being replicated. At the start of the trading window, every stock cost $0.50. If a study replicated successfully, each of its stocks would pay out $1; if it didn't, they would pay nothing. As time went by, the market prices for the studies rose and fell depending on how much the traders bought or sold.

The participants tried to maximize their profits by betting on studies they thought would pan out, and they could see the collective decisions of their peers in real time. The final price of each stock, at the end of the two-week experiment, reflected the probability that the study would be successfully replicated, as determined by the collective actions of the traders. If it was $0.83, the market was predicting an 83 percent chance of replication success. If that final price was over $0.50, Dreber's team counted it as a prediction of success; if it was under, a prediction of failure.
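To make those mechanics concrete, here is a minimal sketch of the scoring logic described above. It is not the team's actual trading platform, and the study labels and closing prices are invented purely for illustration.

```python
# Minimal sketch of the market's scoring logic (illustrative only;
# the study labels and closing prices below are invented, not real data).

PAYOUT = 1.00  # a stock paid $1 if its study replicated, $0 otherwise

# Hypothetical closing prices after two weeks of trading (all opened at $0.50).
final_prices = {"study_A": 0.83, "study_B": 0.41, "study_C": 0.22}

def predicted_to_replicate(price, threshold=0.50):
    """A closing price above $0.50 counts as a prediction of successful replication."""
    return price > threshold

def payout(replicated):
    """What one stock is worth once the replication outcome is known."""
    return PAYOUT if replicated else 0.0

for study, price in final_prices.items():
    call = "success" if predicted_to_replicate(price) else "failure"
    print(f"{study}: closed at ${price:.2f} -> implied {price:.0%} chance of "
          f"replication, market's call: {call}")

# If study_A later replicates, each of its stocks pays out $1; otherwise nothing.
print(f"study_A stock value if it replicates: ${payout(True):.2f}, "
      f"if it fails: ${payout(False):.2f}")
```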

In the end, the markets correctly predicted the outcomes of 71 percent of the replications—a statistically significant, if not mind-blowing, score. Then again, based on the final prices, the team expected the markets to be right just 69 percent of the time—which they roughly were. (Remember that those prices are probabilities of success, so they build in uncertainty about their own predictions.)
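That expected hit rate can be read off the prices themselves: if each closing price really is a probability of replication, then a call of “success” at $0.83 should be right about 83 percent of the time, and a call of “failure” at $0.40 about 60 percent of the time. The short sketch below shows one way to turn a set of closing prices into such an expectation; this is an assumed reading of the calculation, not necessarily the team's exact method, and the prices are again invented.

```python
# One way to derive an expected hit rate from closing prices alone
# (an assumed reading of the calculation; the prices are invented).
final_prices = [0.83, 0.41, 0.22, 0.65, 0.57]

# If price p is the probability of replication, the market's own call
# (success above $0.50, failure below) should be right with
# probability max(p, 1 - p) for that study.
expected_accuracy = sum(max(p, 1 - p) for p in final_prices) / len(final_prices)
print(f"Expected share of correct calls: {expected_accuracy:.0%}")
```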

“There is some wisdom of crowds; people have some intuition about which results are true and which are not,” says Dreber. “Which makes me wonder: What's going on with peer review? If people know which results are really not likely to be real, why are they allowing them to be published?”

Well, says Nosek, market participants only care about whether the study will replicate, while reviewers are also looking at experimental design, importance, interest, and other factors. Also, reviewers, by their nature, work alone, and Dreber's traders performed poorly when working solo. When Dreber actually asked them to predict the replication odds for each study, they were right just 58 percent of the time—no better than chance. Collectively, they became more effective because they could see what their peers were thinking.

“This shows that there is information known in advance of conducting a replication that anticipates replication success,” says Nosek. What kind of information? “I'm not sure I had a clear strategy,” says Marcus Munafo from Bristol University, who was one of the better-performing traders and who has also used prediction markets to evaluate science. He paid attention to statistical power, the journals that the original studies were published in, and which branch of psychology they were part of. “Beyond that, I simply used my gut instinct for whether the original finding felt plausible.”

That's the most interesting bit, says Daniele Fanelli from Stanford University, who studies research bias and misconduct. “It opens some fascinating research questions about understanding which factors are consciously or unconsciously most informative to participants,” he says.

Nosek adds, “We may be able to use prediction markets to be more efficient in deciding what needs to be replicated, and making estimates of uncertainty about studies for which replication is not likely or possible.”

But Fanelli isn't convinced, saying that it “seems like a rather laborious process that's unlikely to be applied across the board.” Hanson has heard similar skepticism before. “We’ve had enough experiments with prediction markets over the years that these findings are not at all surprising,” he says, but “I expect that most ordinary academic psychologists will require stronger incentives than personal curiosity to participate.”

Their success in these markets would have to be tied to tangible benefits, like actual money or the likelihood of securing publications, grants, and jobs. “Imagine that one or more top journals used prediction-market chances that a paper’s main result would be confirmed as part of deciding whether to publish that paper,” he says. “The authors and their rivals would have incentives to trade in such markets, and others would be enticed to trade when they expect that trades by insiders, or their rivals alone, are likely to produce biased estimates.”

The prediction markets have uses beyond analyzing the reliability of individual studies. They also provide an interesting look at the scientific process itself. Using the final market prices and a few statistical assumptions, Dreber's team could backtrack through each study's history and show how the evidence for its hypothesis strengthened or weakened at every step along the way.

For example, before any of these experiments were actually done, what were the odds that they were testing hypotheses that would turn out to be true? Just 8.8 percent, it turned out. This reflects the fact that psychologists often look for phenomena that would be new and surprising.

More worryingly, after the experiments were completed, reviewed, and published, the odds that their hypotheses were true improved to just 56 percent. “So, if you read through these journals and ask, ‘Is this true or not?,’ you could flip a coin!” says Dreber. “That's pretty bad, I think. People often say that if you have a p-value that's less than 0.05, there is a 95 percent probability that the hypothesis is true. That's not right. You need a high-powered replication.”

Indeed, the team calculated that if other researchers successfully replicated a study's results, its hypothesis would stand a 98 percent chance of being true. If the attempt failed, the odds dropped back down to 6 percent. “The failed replications provided about an equivalent amount of doubt as the initial demonstration provided belief,” says Nosek. “It's as if we are back to square one—an interesting set of mostly implausible ideas awaiting evidence to draw a strong conclusion.”
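The arithmetic behind those shifting odds is Bayes' rule: start with the prior probability that a hypothesis is true, then update it after each result. The sketch below walks through that updating using the 8.8 percent prior mentioned above, together with assumed values for statistical power and the conventional 5 percent false-positive rate. The power figures are illustrative guesses rather than the team's actual parameters, so the outputs only approximate the published numbers.

```python
# Bayes' rule walk-through of the updating described in the article.
# The 0.088 prior comes from the article; the power values are assumed
# for illustration and are not the paper's exact parameters.

def update(prior, p_positive_if_true, p_positive_if_false, positive_result):
    """Posterior probability that the hypothesis is true after one study."""
    if positive_result:
        num = prior * p_positive_if_true
        den = num + (1 - prior) * p_positive_if_false
    else:
        num = prior * (1 - p_positive_if_true)
        den = num + (1 - prior) * (1 - p_positive_if_false)
    return num / den

ALPHA = 0.05              # conventional false-positive rate
ORIGINAL_POWER = 0.66     # assumed power of a typical original study
REPLICATION_POWER = 0.92  # assumed power of a high-powered replication

prior = 0.088  # chance the hypothesis was true before any data

after_publication = update(prior, ORIGINAL_POWER, ALPHA, positive_result=True)
after_success = update(after_publication, REPLICATION_POWER, ALPHA, positive_result=True)
after_failure = update(after_publication, REPLICATION_POWER, ALPHA, positive_result=False)

print(f"After a significant, published result: {after_publication:.0%}")  # ~56% here
print(f"After a successful replication:        {after_success:.0%}")      # high (~96% here)
print(f"After a failed replication:            {after_failure:.0%}")      # low (~10% here)
```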

Dreber is now repeating her experiments for other fields like experimental economics. “I don't want to single out psychology,” she says. “Maybe things are worse in other fields, and at least the psychologists seem willing to take this seriously.”