For many economists, nothing is more exasperating than watching well-intentioned policies fall short because they were based on ideology, gut judgment, or something else besides sound evidence. In the now-infamous Kansas tax experiment, lawmakers made huge tax cuts in hopes of spurring economic growth—but ended up starving the state’s schools and infrastructure. In the short run, rent control in San Francisco benefited tenants who lived in controlled units, but the policy ultimately contributed to rising rental prices citywide as landlords took affected properties off the market. “Ban the box” policies, intended to help ex-offenders find jobs by prohibiting employers from inquiring about criminal histories, instead increased discrimination against young black and Hispanic men. In our own research, we have found that the presence of an equal-employment opportunity statement in job ads makes minority workers less likely to apply.
Rather than rolling their eyes at these failures, citizens should look more closely at why it is so much harder to base policies on hard information than on pure intuition. The problem isn’t just that relevant data are hard to come by. It’s also that the data we have are usually collected in one-off studies involving unrepresentative populations over short spans of time. The research community needs to define clearer standards for what counts as useful evidence, and it must insist that experiments be designed accordingly.
The new Foundations for Evidence-Based Policymaking Act requires federal agencies to spell out which questions they’re trying to answer and then to collect data systematically. But if called upon to help with this research, outside experts will need to do their part, too, by putting themselves in the shoes of the public officials who have to make the best decisions they can with the information they have.
As experimental economists, we are delighted to see randomized controlled trials being used to evaluate social policies. But as we documented in a recent working paper, the typical path from research to policy leaves much to be desired. Upon observing that a program has a statistically significant effect—often measured over a short period and within a small sample unrepresentative of the broader population—researchers might recommend that policy makers adopt the program across an entire city, state, or country. Unfortunately, this approach does not guarantee that the initial results are reproducible, persist over time, or can be scaled to a larger population. Important questions go unanswered: If we were to run the same trial again, would we observe the same effect? If we were to repeat the experiment in a different context or population, would we expect the findings to persist?
We, too, understand the temptation to generalize from preliminary results that seem compelling. A decade ago, one of us, John List, working with other scholars at the University of Chicago and Harvard University, launched the Chicago Heights Early Childhood Center, one of the most comprehensive longitudinal early-childhood studies ever conducted. Preliminary results suggest that a program offering financial rewards to young children and their families for good attendance and other important behaviors has succeeded in raising children’s test scores. However, the impact of these interventions on academic performance may be short-lived. As such, we need to wait patiently for medium- and long-term results (and measure additional outcomes besides test scores) before we can convincingly recommend our program to policy makers.
Jumping the gun can have substantial real-world consequences. Nothing demonstrates the need for long-term follow-ups better than the Moving to Opportunity program, in which families from impoverished neighborhoods got the chance to move to better-off areas. The program was a success, but that wasn’t immediately evident: its substantial returns became fully clear only when participating children reached adulthood.
Academia, however, makes little provision for tracking the effects of a program over the very long term. Even tenured scholars face pressure to publish. Journals in economics prize novel, surprising, and positive findings, yet they do little to encourage replications or follow-up studies. Meanwhile, many research grants are too small to allow researchers to collect large samples or long-term data. As a result, economists, other social scientists, and even medical researchers are encouraged to design experiments that barely meet minimum standards for effective testing. A relatively brief trial with a small sample that yields a “wow” result becomes the desired outcome, all but guaranteeing a prominent journal article and a boost to the researchers’ careers.
Elected or appointed officials also face limitations. They often work on tight political cycles that don’t allow for years of evaluation. Their terms are often up before long-term results of an intervention become clear. Of course, they also might be tempted to cherry-pick findings that support their policy agenda and gloss over unfavorable results.
Imagine that, in this environment, you are a politician who seeks to reduce unemployment by investing significant funds in a novel program that showed promising early results in a randomized controlled trial conducted last year in a neighboring state. Two years after implementing the program, your district still lags behind neighboring districts in terms of employment statistics, with no sign of improvement. What went wrong?
Unfortunately, there are many possible explanations. Because the early trial did not track long-term outcomes, it is difficult to say whether the program’s impact persisted beyond its brief initial success. Even though the trial took place in a neighboring state, the people who decided to participate in that experiment might have had different characteristics and needs from those of the people you were targeting. The initial experiment was not replicated, and the sample size was relatively small, meaning that the original positive result might merely have been due to chance and not something that could be reproduced elsewhere—a “false positive,” in the parlance of science. And because some trials are never published (in part because they don’t yield impressive results), you will never know whether other researchers had tested similar programs before, found unimpressive results, and stashed them away in their file drawers.
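The small-sample problem can be made concrete with a quick simulation. This is an illustrative sketch, not part of any study discussed above: the effect size, noise level, and sample size are all assumptions chosen for the example. It shows that when trials are underpowered, few of them detect a real effect, and the ones that do clear the significance bar systematically overstate it.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(n, true_effect):
    """Simulate one two-arm trial with n people per arm and noisy outcomes."""
    treated = rng.normal(true_effect, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    return diff, abs(diff / se) > 1.96  # rough z-test at the 5 percent level

# Assumed for illustration: a modest true effect of 0.2 standard deviations,
# studied in many small trials of 50 people per arm.
true_effect, n, n_trials = 0.2, 50, 10_000
results = [run_trial(n, true_effect) for _ in range(n_trials)]
significant_effects = [d for d, sig in results if sig]

# Power: the share of trials that detect the (real) effect.
power = len(significant_effects) / n_trials
# Average estimated effect among only the trials that reached significance.
avg_significant = float(np.mean(significant_effects))

print(f"share of trials reaching significance: {power:.2f}")
print(f"true effect: {true_effect}; average estimate among significant trials: {avg_significant:.2f}")
```

In runs of this sketch, well under half of the trials detect the effect, and the ones that do overstate it on average. If only the significant trials are published, the literature both misses most of the evidence and inflates what remains.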
Until we create a safe political environment for evaluation and continuous learning, rigorous experiments will feel risky to many policy makers who fear that programs will hit the chopping block if they don’t achieve the near-term positive result they seek.
When done correctly, partnerships between policy makers and researchers have the potential to improve the lives of millions. Consider the massive exercise of scaling up a program in India that calibrates a teacher’s instruction to the students’ achievement level. It took more than a decade’s work and countless controlled trials, but a collaboration between researchers and a nongovernmental organization helped to shape a program that now successfully serves pupils in more than 100,000 schools. In the United States, results from the Family Options Study trial allowed cities across the country to tackle homelessness by showing that “housing first” policies can be cost-effective.
Lately, there have been signs that a day of reckoning has arrived in our profession. Researchers are trying to assess more systematically whether experimental findings in economics and, more broadly, in the social sciences can be reproduced. There has been a successful push for pre-registration, in which researchers commit to a design and analysis plan before a trial begins, preventing them from “fishing” for good results. These shifts will raise the bar for social-science research and—we hope—make scholars treat credibility, generalizability, and scalability as guiding principles when they design their experiments. In this new world, research dollars are invested to understand not just whether an intervention works but also why, for whom, and where. Policy makers can contribute to this change as well: if they demand more credible evidence, researchers will supply it.