Is This the Singularity for Standardized Tests?

GPT-4’s mastery of the SAT will re-entrench the power and influence of rote exams.

[Image: close-up photo of answers being filled in by pencil on a bubble sheet. Illustration by The Atlantic. Source: Rasit Aydogan / Anadolu Agency / Getty.]

Last fall, when generative AI abruptly started turning out competent high-school- and college-level writing, some educators saw it as an opportunity. Perhaps it was time, at last, to dispose of the five-paragraph essay, among other bad teaching practices that have lingered for generations. Universities and colleges convened emergency town halls before winter terms began to discuss how large language models might reshape their work, for better and worse.

But just as quickly, most of those efforts evaporated into the reality of normal life. Educators and administrators have so many problems to address even before AI enters the picture; the prospect of utterly redesigning writing education and assessment felt impossible. Worthwhile, but maybe later. Then, with last week’s arrival of GPT-4, came another provocation. OpenAI, the company that created the new software, put out a paper touting its capacities. Among them: taking tests. AIs are no longer just producing passable five-paragraph essays. Now they’re excelling at the SAT, “earning” a score of 1410. They’re getting passing grades on more than a dozen different AP exams. They’re doing well enough on bar exams to be licensed as lawyers.

It would be nice if this news inspired educators, governments, certification agencies, and other groups to rethink what these tests really mean—or even to reinvent them altogether. Alas, as was the case for rote-essay writing, whatever appetite for change the shock inspires might prove to be short-lived. GPT-4’s achievements help reveal the underlying problem: Americans love standardized tests as much as we hate them—and we’re unlikely to let them go even if doing so would be in our best interest.

Many of the initial responses to GPT-4’s exam prowess were predictably immoderate: AI can keep up with human lawyers, or apply to Stanford, or make “education” useless. But why should it be startling in the slightest that software trained on the entire text of the internet performs well on standardized exams? AI can instantly run what amounts to an open-book test on any subject through statistical analysis and regression. Indeed, that anyone is surprised at all by this success suggests that people tend to get confused about what it means when computers prove effective at human activities.

Back in the late 1990s, nobody thought a computer could ever beat a human at Go, the ancient Chinese game played with black and white stones. Chess had been mastered by supercomputers, but Go remained—at least in the hearts of its players—immune to computation. They were wrong. Two decades later, DeepMind’s AlphaGo was regularly beating Go masters. To accomplish this task, AlphaGo initially mimicked human players’ moves before running innumerable games against itself to find new strategies. The victory was construed by some as evidence that computers could overtake people at complex tasks previously thought to be uniquely human.

By rights, GPT-4’s skill at the SAT should be taken as the opposite. Standardized tests feel inhuman from the start: You, a distinct individual, are forced to perform in a manner that can be judged by a machine, and then compared with that of many other individuals. Yet last week’s announcement—of the 1410 score, the AP exams, and so on—gave rise to an unease similar to that produced by AlphaGo.

Perhaps we’re anxious not that computers will strip us of humanity, but that machines will reveal the vanity of our human concerns. The experience of reasoning about your next set of moves in Go, as a human player doing so from the vantage point of human culture, cannot be replaced or reproduced by a Go-playing machine, unless the only point of Go were to prove that Go can be mastered, rather than played. Such cultural values do exist: The designation of chess grandmasters and Go 9-dan professionals suggests expertise in excess of mere performance in a folk game. The best players of chess and Go are sometimes seen as smart in a general sense, because they are good at a game that takes smarts of a certain sort. The same is true for AIs that play (and win) these games.

Standardized tests occupy a similar cultural role. They were conceived to assess and communicate general performance on a subject such as math or reading. Whether and how they ever managed to do that is up for debate, but the accuracy and fairness of the exams became less important than their social function. To score a 1410 on the SAT says something about your capacities and prospects: maybe you can get into Stanford. To pursue and then emerge victorious against a battery of AP tests suggests general ability warranting accelerated progress in college. (The fact that such a victory doesn’t necessarily provide that acceleration only emphasizes the seduction of its symbolism.) The bar exam measures, one hopes, someone’s subject-matter proficiency, but it doesn’t promise to ensure lawyerly effectiveness or even competence. To perform well on a standardized test indicates potential to perform well at some real future activity, but it has also come to have some value in itself, as a marker of success at taking tests.

That value was already being questioned, machine intelligence aside. Standardized tests have long been scrutinized for contributing to discrimination against minority and low-income students. The coronavirus pandemic, and its disruptions to educational opportunity, intensified those concerns. Many colleges and universities made the SAT and ACT optional for admissions. Graduate schools are giving up on the GRE, and aspiring law students may no longer have to take the LSAT in a couple of years.

GPT-4’s purported prowess at these tests shows how little progress has been made at decoupling appearance from reality in the tests’ pursuit. Standardized tests might fairly assess human capacity, or they might do so unfairly, but either way, they hold an outsize role in Americans’ conception of themselves and their communities. We’re nervous that tests might turn us into computers, but also that computers might reveal the conceit of valuing tests so much in the first place.

AI-based chess and Go computers didn’t make human play obsolete, but they did change how human players train. Large language models may do the same for the SAT and other standardized exams, evolving into a fancy form of test prep. In that case, they could end up helping those who would already have done well enough to score even higher. Or perhaps they will become the basis for a low-cost alternative that puts such training in the hands of everyone: a reversal of examination inequity, and a democratization of vanity. No matter the case, the standardized tests will persist, only now the chatbots have to take them too.