Based on our rough calculations, less than $1 out of every $100 of government spending is backed by even the most basic evidence that the money is being spent wisely. As former officials in the administrations of Barack Obama (Peter Orszag) and George W. Bush (John Bridgeland), we were flabbergasted by how blindly the federal government spends. In other types of American enterprise, spending decisions are usually quite sophisticated, and are rapidly becoming more so: baseball’s transformation into “moneyball” is one example. But the federal government—where spending decisions are largely based on good intentions, inertia, hunches, partisan politics, and personal relationships—has missed this wave.
Allow us to share some behind-the-scenes illustrations of what our crazy system of budgeting looks like—and to propose how the lessons of moneyball could make our government better.
When one of us (Peter) began his tenure as the director of the Congressional Budget Office in 2007, he took a Willie Sutton approach to the nation’s huge and growing fiscal mess: he went after health care, which makes up roughly a quarter of the federal government’s spending, because that’s where the money is.
The moneyball formula in baseball—replacing scouts’ traditional beliefs and biases about players with data-intensive studies of what skills actually contribute most to winning—is just as applicable to the battle against out-of-control health-care costs. According to the Institute of Medicine, more than half of treatments provided to patients lack clear evidence that they’re effective. If we could stop paying for ineffective treatments, and swap out expensive treatments for ones that are less expensive but just as effective, we would achieve better outcomes for patients and save money.
Both parties should find much to like in such an approach. It would offer Republicans a way to constrain the growth of government spending and take pressure off private businesses weighed down with health expenses. And it would offer Democrats a means of preserving the integrity of Medicare and Medicaid and thereby restoring faith in a core government function.
And yet getting funding for the research needed to assess and compare medical treatments has been like pulling teeth. As a rule, legislators seem to lack a natural affinity for economists and budget analysts (alas, they are hardly alone). But Peter made himself exceptionally unpopular with some Democrats and many Republicans by insisting on such funding in the 2009 stimulus bill, and then working to expand it in the 2010 “Obamacare” legislation. Despite these modest successes, less than $1 out of every $1,000 that the government spends on health care this year will go toward evaluating whether the other $999-plus actually works.
Getting the right information is less than half the battle. Acting on it, once it’s in hand, is harder still. As one small example, some evidence suggests that moving toward “bundled” payments for all services needed by a patient during a course of medical treatment could produce better value than paying piecemeal for each service and procedure, because the piecemeal approach creates an incentive for more care rather than better care. During one meeting with members of Congress in 2008 to discuss how to expand bundling and include a performance incentive in kidney dialysis, Shelley Berkley, a Democratic congresswoman from Nevada, accused Peter, as he remembers it, of trying to destroy the dialysis industry. “You and your staff may have your Ph.D.s, but you have no clue,” he recalls her saying. “We don’t need any of your fancy analysis.” (Berkley says she does not remember the meeting, or those comments.) Berkley had received campaign contributions from several dialysis companies and organizations, and her husband owned a dialysis business. Whether these factors may have influenced her thinking is a question we will leave for the reader.
It is indisputable, however, that a move toward payments based on performance would harm some business interests. If most of your profits come from, say, a medical device or procedure that is covered by Medicare but doesn’t work all that well, you’re likely to resist anyone sorting through what works and what doesn’t, never mind changing payment accordingly. Health-care interests are wise to invest millions of dollars in campaign contributions and lobbying to protect billions of dollars in profits.
Your other author (John) received his own lessons on why moneyball doesn’t play in Washington a few years earlier, when he joined the administration of George W. Bush. Bush, of course, was not only a former baseball-team owner but also the first president to hold an M.B.A. In every domestic-policy briefing John led in the first term of the Bush administration, the president would ask some version of the following questions: How do we know this program will achieve the results as advertised? Who will run this program, and is that person an effective manager? How will that person and the program be held accountable for producing results?
Year after year, with the help of the Program Assessment Rating Tool (PART) introduced by Bush’s Office of Management and Budget in 2002, the administration identified federally funded and administered programs that were not working as advertised, and tried to get them to improve or be discontinued. And yet these efforts rarely gained traction on Capitol Hill with either party.
The Bush administration initially had high hopes for this project. The goal was to build on a Clinton-era law called the Government Performance and Results Act, which aimed “to provide for the establishment of strategic planning and performance measurement in the Federal Government,” but did not attempt to tie performance directly to continued financing.
By the end of Bush’s second term, PART had assessed about 1,000 programs. Of them, 19 percent were rated “effective,” 32 percent “moderately effective,” 29 percent “adequate,” 3 percent “ineffective,” and 17 percent “results not demonstrated” (meaning that the programs couldn’t be assessed because of insufficient data). This information was used to develop the president’s budget proposals to Congress, but PART was not developed in cooperation with Congress, and Congress gave its assessments little heed.
The White House Task Force for Disadvantaged Youth, which John co-chaired, highlights the extreme disconnect between effectiveness ratings and appropriation decisions. For the first time ever, in 2003, the task force tallied and studied all 339 federally funded programs addressing disadvantaged youth—a confusing and costly tangle—to find ways to improve the system. With help from 10 federal departments and from experts inside and outside of government, the task force found that the federal government was spending $223.5 billion every year on programs with aims ranging from promoting health and nutrition to preventing teen pregnancy, high-school dropouts, and youth violence. Despite this wide range, overlap among the programs was common. The task force’s final report documented 67 different youth programs that promoted “character education,” 89 that purported to build “self-sufficiency skills,” and 97 that sought to “prevent substance abuse”—with little or no coordination or knowledge-sharing among programs with similar goals.
More troubling, the vast majority of the programs could not provide any meaningful information indicating how well they served young people. Some reported basic operational data, but few had undergone rigorous evaluations looking at how the programs affected participants. At the time of the White House Task Force report, fewer than 10 percent of the programs had been assessed by PART, and more than half had not been evaluated at all in the previous five years.
With so little performance data, it’s impossible to say how many of the programs were effective. But you don’t have to be a Tea Party organizer to harbor skepticism. Since 1990, the federal government has put 11 large social programs, collectively costing taxpayers more than $10 billion a year, through randomized controlled trials, the gold standard of evaluation. Ten out of the 11—including Upward Bound and Job Corps—showed “weak or no positive effects” on their participants. This is not to say that all 10 programs deserve to be eliminated. But at a minimum, collecting rigorous evidence could help spur programs to improve over time.
One of the programs studied by the Task Force for Disadvantaged Youth that did collect meaningful performance data and was rated by a PART assessment was the Even Start Family Literacy Program, a Department of Education project aimed at improving the literacy of low-income parents and their children. Unfortunately, the data showed that “children and parents … did not gain more than children and parents in the control group.” PART rated the program “ineffective.”
In 2003, John and officials at the Office of Management and Budget began trying to redirect funding from Even Start to better-performing programs. But Even Start was founded in 1989 by Bill Goodling, a well-liked Republican congressman who had been the chairman of the House Education and the Workforce Committee, and had previously served as a teacher, principal, and school superintendent in Pennsylvania. So Congress continued to fund this ineffective, if well-meaning, program to the tune of more than $1 billion over the life of the Bush administration.
Even Start is no different from most federal programs. Evidence of success is barely considered when legislation is proposed and discussed in committee and on the floor of Congress. There is no systematic way in which members of Congress or other key decision makers are informed about that evidence, or lack thereof. They instead tend to rely on ad hoc assessments provided by lobbyists and interest groups. And once legislation is passed and a program is up and running, there is no mechanism for automatically tracking its effectiveness beyond counting the number of people it serves, regardless of the impact it has on their lives.
The consequences of failing to measure the impact of so many of our government programs—and of sometimes ignoring the data even when we do measure them—go well beyond wasting scarce tax dollars. Every time a young person participates in a program that doesn’t work but could have participated in one that does, that represents a human cost. And failing to do any good is by no means the worst sin possible: some state and federal programs actually harm the people who participate in them.
You’ve surely heard of Scared Straight, a program started in a New Jersey prison in the 1970s that brings at-risk youth to meet with hardened inmates who tell them about the harsh realities of life behind bars. The program has gotten an extra dose of attention lately because of the A&E reality TV show Beyond Scared Straight, which takes viewers inside similar (and generally more harrowing) prison programs for young people in different states, blue and red, across the country.
It turns out that Scared Straight–style programs are actually pretty effective—at increasing criminal behavior. Rigorous research conducted by Anthony Petrosino and researchers at the Campbell Collaboration shows that instead of scaring kids and turning them away from risky, criminal behavior, the programs do just the opposite: they make the kids about 12 percent more likely to commit a crime.
Fortunately, the Department of Justice is acting on these findings and warning state governments to stop funding Scared Straight and similar programs. But Scared Straight is not the only government program that’s been shown to cause harm. The federal government’s long-running after-school program, 21st Century Community Learning Centers, has shown no effect on academic outcomes for elementary-school students, along with significant increases in school suspensions and incidents requiring other forms of discipline. The Bush administration attempted to reduce funding for the program. But following impassioned testimony on behalf of the program by Arnold Schwarzenegger, then a potential candidate for governor of California, congressional appropriators agreed to restore all funding. Today the program still gets more than $1 billion a year in federal funds.
What can we do to promote moneyball in government? The first (and easiest) step is simply collecting more information on what works and what doesn’t.
The Obama administration has already pushed federal agencies to bolster their analytic capabilities and to show how their funding priorities are evidence-based, particularly in their budget submissions. As a result, the administration’s 2014 budget proposal had an unprecedented focus on evidence and results.
Results for America, a nonprofit organization that advocates for evidence-based decision making, has proposed a number of measures that would expand on these efforts. It is calling for reserving 1 percent of program spending for evaluation: for every $99 we spend on a program to improve education, reduce crime, or bolster health, we would spend $1 making sure the program actually works.
The Harvard economist Jeffrey Liebman has written that, based on his simple but convincing calculations, “spending a few hundred million dollars more a year on evaluations could save tens of billions of dollars by teaching us which programs work and generating lessons to improve programs that don’t.” Who wouldn’t want a 100-fold return on investment?
The more evidence we have, the stronger it is; and the more systematically it is presented, the harder it will be for lawmakers to ignore. Still, linking evaluation to program funding will be tough, as both of us have seen in practice, again and again.
One thing that is essential to a more results-driven government is holding politicians accountable for their support of failing programs. Interest groups regularly rate politicians on their adherence to a particular perspective. What if we had a Moneyball Index, easily accessible to voters and the media, that rated each member of Congress on their votes to fund programs that have been shown not to work?
Even absent such public shaming, the government is taking steps in the right direction. The Department of Education’s Investing in Innovation (i3) program for improving student achievement and educator effectiveness, for instance, gives priority to projects backed by rigorous evidence of success, while still allocating a portion of its funds for promising programs willing to build evidence over time. The program originated in the rush and jumble of the Recovery Act, so it bypassed some typical congressional hurdles. But the performance mandate now built into i3’s design provides a model for how the federal government can make decisions about programs based on impact. Liebman has put forward some good ideas about how to expand upon that model. He suggests that, to start, 5 percent of the dedicated funding that’s delivered each year by the federal government to state and local governments—which includes major programs like the Community Development Block Grant and the Community Mental Health Services Block Grant—be reserved for programs that have demonstrated their worth. That share could rise over time as the evidence base expands.
New York City Mayor Michael Bloomberg is taking another promising approach, essentially creating probationary programs that must prove themselves to become permanent. The city’s Center for Economic Opportunity seeks out new, innovative programs with potential to combat the poverty cycle, and then oversees rigorous evaluations “to determine their effectiveness in reducing poverty, encouraging savings, and empowering low-income workers to advance in their careers.” The programs that produce the strongest results become eligible for further city funding; if a program isn’t having the intended effect, dollars are shifted to those things that work. This approach has now spread to seven other urban areas, with the help of the Obama administration’s performance-based Social Innovation Fund.
How can we steer dollars away from well-established programs that aren’t working? The U.S. Department of Health and Human Services has shown one nuanced approach. In 2011, the Obama administration built on the Bush administration’s attempt to examine how children were faring in individual Head Start programs across the country, and began a crackdown on providers failing the kids they were supposed to serve. Instead of threatening to scrap Head Start altogether, the agency refused to renew funding for the bottom 10 percent of local programs—132 in all. These centers were told that they had failed to meet quality standards. To requalify for Head Start funding, they would have to make substantive improvements and then compete for funds against other providers in their area. Autopilot funding for lousy centers came to an end.
Another encouraging data point: two years ago, when fiscal pressures really began to mount at the federal level, Congress finally pulled the funding for the ineffective Even Start literacy program. Over time, the data won out. “Under the gun, Congress can do the right thing,” says Robert Gordon, who worked in the Office of Management and Budget under President Obama. “Now there’s no money to waste, so interest-group politics and bogus arguments don’t carry as much weight as they used to. There’s reason for optimism.”
We’re optimistic too, even though the obstacles to moneyball in government are daunting. Absent major changes in campaign finance, special interests that profit from blind budgeting will still have a powerful means of thwarting reform. Agencies’ staff will roll their eyes at the next round of “budget reforms,” wait out the incumbent, and then continue business as usual. And members of Congress will stay wedded to their legacy programs.
But we believe the federal budget crunch will force change. Already, many cities have had to choose between fewer cops and fewer teachers; between slower ambulance response and less-frequent garbage removal. The federal government is now beginning to face similarly stark choices. Do we really want to furlough hundreds of FBI agents at a time of heightened threats? Or lay off air-traffic controllers? Do we really want big cuts at the National Institutes of Health or to early-childhood-education investments, both of which are engines of economic growth? Do we really want to eat our seed corn?
Both parties have signed up to reduce nondefense discretionary spending (that is, the money for everything from the Food and Drug Administration to the NIH to the Department of Veterans Affairs) to levels that will be at least $350 billion lower through 2022 than they were in 2012. That would bring discretionary spending as a share of the economy to its lowest level on record (data go back to 1962). We should not do this blindly.
Moneyball doesn’t happen overnight. Many of the “sabermetrics” practices that transformed baseball a decade ago can actually be traced back to the Brooklyn Dodgers general manager Branch Rickey, who is universally known for breaking baseball’s color barrier by signing Jackie Robinson but is rarely credited with being the first GM to hire a professional statistician. Rickey’s approach got little traction for nearly half a century, until the pivotal 2002 season, when the Oakland Athletics’ general manager, Billy Beane, built his club on analysts’ data rather than scouts’ beliefs.
What prompted Beane’s gutsy decision to revive the data-centered approach? With $40 million to spend on players in 2002, the A’s had to compete on a comically uneven playing field with big-market teams like the Yankees, who spent $125 million on their roster that same year. In other words, scarcity drove Beane’s break from established tradition. We hope and expect it will have the same effect on Washington in the years to come.