It was almost exactly a year ago that 11 former Atlanta educators were convicted of conspiring to tamper with thousands of students’ test scores. The cheating scandal, which led to years of prison time for some of the offenders, has grown to symbolize the ills of America’s emphasis on standardized testing. Tell teachers their salaries are tied to test scores and, the thinking goes, they’ll do whatever it takes to ensure those scores are up to par—even if that means fudging the numbers. Even if that means hurting student achievement.
A state investigation concluded in 2011 that many of the Atlanta students whose scores were falsified by nearly 200 teachers were consequently excluded from remedial education they otherwise needed. And a later Georgia State University study, published last May, found that the tampering had negative long-term academic impacts for those kids, particularly in language arts. “When test results are falsified and students who have not mastered the necessary material are promoted, our students are harmed, parents lose sight of their child’s true progress, and taxpayers are cheated,” Nathan Deal, the governor of Georgia, said in a statement announcing the findings of the state’s probe.
Deceptive scoring practices can be found in schools across the country, and they seem to be growing in popularity in an era that places heavy emphasis on standardized testing. But rarely do those practices involve the kind of cheating that happened in Atlanta, where teachers were caught erasing and changing students’ answers. Instead, they’re typically a lot more subtle—a teacher turning a blind eye to a few errors, for example, or grading an open-ended response leniently—and a lot less selfish. And it turns out that this kind of manipulation might even benefit kids.
One recent study, published earlier this month by the National Bureau of Economic Research (NBER), looks at New York’s Regents Exams, the high-school tests in a handful of subjects that students are required to pass to graduate. Until 2010, teachers were responsible for grading their own students’ exams; they were also required to rescore any tests that fell just a few points below the proficiency threshold. These scoring policies, the economists found, enabled widespread manipulation: 40 percent of the scores near the cutoffs—or 6 percent of all the exams in core subjects—were inflated.
A similar analysis of students in Sweden, published by the Stanford Institute for Economic Policy Research in February, focused on the nationwide math exams administered among all ninth-graders to help determine their GPA and eligibility for high school. (High school in Sweden starts at grade 10.) As was the case with New York’s Regents Exams, the tests are graded locally by students’ own teachers, who at times had to award points based on subjective qualities like “clarity” and “beautiful expression.” True to form, the researchers found that a good deal of test-score inflation happens in the Scandinavian country as well.
The prevalence of test-score manipulation in the United States is well-documented. In fact, with the help of the same researchers who authored the Regents Exams study, The Wall Street Journal in 2011 revealed a significant spike in the number of exams in all the main subjects with scores of 65 points out of 100—the minimum passing grade. (The authors of the Sweden study based their conclusions on similar patterns.) The New York State Education Department quickly adopted a series of changes to grading policies, and by 2012, evidence of manipulation had all but disappeared. What hasn’t been well-documented are the causes and consequences of such manipulation.
There’s good evidence that score manipulation does harm kids, particularly when teachers are falsifying students’ responses outright for the sake of avoiding sanctions. But there’s also good evidence to suggest that score inflation—teachers grading a bit more leniently, often because they think the student underperformed on the exam—may have positive effects as well. While inflating an individual student’s test score doesn’t magically inject her with more knowledge, the two aforementioned studies indicate it significantly boosts her odds of overcoming an obstacle increasingly critical to future success: high-school graduation. In New York, according to the NBER authors, having a Regents score manipulated to fall above a cutoff increased the probability of graduating by a hefty 22 percentage points. And because black students were more likely than their white peers to have scores just below the cutoff—and because the score inflation was more common at schools with high concentrations of low-income students of color—the manipulation actually shaved 5 percentage points off the gap between white and black students’ graduation rates. Once the state changed its scoring policies, roughly a quarter of just-below-the-cutoff students weren’t able to pass their exams even after retaking them and thus couldn’t graduate.
The Sweden study yielded similar graduation-rate results, but it also revealed broader advantages. Unlike the U.S., Sweden keeps detailed longitudinal data that allowed the researchers to track student progress not only throughout school, but into the labor market, painting a comprehensive picture of the potential long-term effects of score manipulation. And those effects went well beyond just better grades in one course. For one, the students who had their math scores inflated performed better in other subjects, too, and ultimately received higher cumulative GPAs. The students were more likely to attend college and secure higher-paying jobs. They were less likely to wind up pregnant as teenagers.
The Stanford economists who conducted the analysis speculate that the reason is psychological—the higher test score boosts a kid’s confidence and effort, and perhaps boosts other teachers’ opinions of her as well. For years, psychology studies have demonstrated powerful effects from a phenomenon known as “stereotype threat”: when individuals are primed with negative stereotypes about a group they belong to, they can fail to perform at their genuine ability level on tests. The study on Swedish students suggests a contrary effect: when kids thought they did better than they actually did, that confidence boost helped them perform better than they otherwise would have.
When Deal’s office announced the findings of its investigation into the Atlanta cheating scandal, it described the teachers’ and administrators’ actions as “ethical failings.” After all, the educators who tampered with the tests didn’t do so in the name of their students’ educational success; they were avoiding their own punishment. Researchers have long suspected that harsh accountability policies such as those enacted under No Child Left Behind encourage teachers to act dishonestly: “The incidence of negative events associated with high-stakes testing is so great, corruption is inevitable and widespread,” wrote the researchers Sharon Nichols and David Berliner in a 2005 study on the repercussions of such testing.
But the Atlanta example, according to Thomas Dee, a Stanford economist who directs the university’s Center for Education Policy Analysis and co-authored the Regents study, is an anomaly. Indeed, a growing body of international research suggests that the prospect of a raise—or the threat of sanctions—seldom induces teachers to fudge their students’ test scores. Altruistic motivations appear to be at play.
In New York City, for example, Regents scores factor into schools’ progress reports (and, until recently, teachers’ evaluations). Yet manipulation was actually just as prevalent—if not more prevalent—before the city introduced accountability systems under No Child Left Behind. Similar trends were found even in a randomized experiment that explicitly linked teacher pay to Regents scores at some schools.
Dee suspects that teachers often choose to bump up a student’s test score based on “soft information” about that student. By the time the exams are administered, teachers are typically familiar with a kid’s aptitude—whether she’s a good student and well-behaved classmate, how much effort she puts into her homework. If that student’s performance on the Regents exam understates her real-life academic performance—maybe she was sick on the day of the test; maybe her nerves got the best of her—it’s easy to see why a teacher would be tempted to inflate her score.
The authors of the Sweden study draw similar conclusions. While external factors may play a role—the country’s middle schools compete for students and are ranked in part based on the average GPA of their exiting class—teachers, it seems, mostly use their discretion to “undo” a student’s having a bad day on the test. Many may also “experience emotional discomfort when awarding bad grades.”
All this suggests that a little cheating does more good than harm, helping shepherd kids in need onto the path to college or a good career and compensating for the systemic challenges that perpetuate stubborn achievement gaps. But what about the just-below-the-cutoff students who aren’t lucky enough to have their scores inflated? As the authors of the Sweden paper put it, “teacher discretion undermines the equality of opportunity.”
In New York, attitudes toward manipulation—the propensity among teachers to score leniently—appear to have varied significantly from school to school. They also, interestingly, may have even varied within schools. In the Regents study, white and Asian students were more likely than their black and Latino counterparts to have their test scores manipulated if they fell just short of the cutoff—there were simply many more black and Latino students overall who scored below the threshold. In other words, the score manipulation may have contributed to inequality just as much as it erased it.
“You could argue that … these are students who are close to the threshold, very near it, and it appears that teachers are using information outside of the Regents exam when deciding when to give a student a little bit of a nudge over that threshold,” Dee said. “Getting students to graduate at a higher rate is unequivocally a good thing—being a high-school dropout is sometimes called an economic death sentence with some legitimacy.”
“But someone else might legitimately argue that another goal here is fairness and consistency in high-stakes evaluation procedures,” he continued. “And there’s some capriciousness here that could be understood as problematic … If teachers are going off script in making those designations, we might worry about the implicit biases they bring to bear in making those decisions.” Even seeing a student’s name on a test, Dee said, might lead a teacher to make subconscious assumptions about her merit—to exercise (or not exercise) discretion when scoring the exam.
Meanwhile, in some cases, inflating a student’s Regents Exam score actually undermined her odds of getting a diploma. At the time of the analysis, students could receive one of two diplomas: a basic one and an advanced one. Although the practice may have helped students seeking the former, it hurt those seeking the latter—likely because they weren’t, according to the study, “pushed to re-learn the introductory material or re-take the introductory class that the more advanced coursework requires.”
All that aside, the research lends credence to the criticism of standardized testing as a flawed measurement of student achievement. “As the stakes associated with a test go up, so does the uncertainty about the meaning of a score on the test,” Nichols and Berliner wrote in their analysis, “The Inevitable Corruption of Indicators and Educators Through High-Stakes Testing.” As Dee argued, maybe teachers should have the ability to use their discretion and incorporate “soft information” into their scoring—as long as there’s a systemic way to do so that doesn’t benefit certain students and not others.
This research “really constitutes a cautionary tale about the design elements associated with these tests—[a reminder] that we should take particular care in how we grade them,” Dee said. “This is really going to be salient as we continue to move into this Common Core era, where tests are going to have more open-response elements. There’s going to be human scoring involved in that, and this tells us we want to pay particular attention to … how that scoring occurs.”