GPT-4 Might Just Be a Bloated, Pointless Mess

Will endless “scaling” of our current language models really bring true machine intelligence?

[Illustration: a text-chat bubble, repeated at different sizes. Credit: Matt Chase / The Atlantic]

As a rule, hyping something that doesn’t yet exist is a lot easier than hyping something that does. OpenAI’s GPT-4 language model—much anticipated; yet to be released—has been the subject of unchecked, preposterous speculation in recent months. One post that has circulated widely online purports to evince its extraordinary power. An illustration shows a tiny dot representing GPT-3 and its “175 billion parameters.” Next to it is a much, much larger circle representing GPT-4, with 100 trillion parameters. The new model, one evangelist tweeted, “will make ChatGPT look like a toy.” “Buckle up,” tweeted another.

One problem with this hype is that it’s factually inaccurate. Wherever the 100-trillion-parameter rumor originated, OpenAI’s CEO, Sam Altman, has said that it’s “complete bullshit.” Another problem is that it elides a deeper and ultimately far more consequential question for the future of AI research. Implicit in the illustration (or at least in the way people seem to have interpreted it) is the assumption that more parameters—which is to say, more knobs that can be adjusted during the learning process in order to fine-tune the model’s output—always lead to more intelligence. Will the technology continue to improve indefinitely as more and more data are crammed into its maw? When it comes to AI, how much does size matter?

This turns out to be the subject of intense debate among the experts. On one side, you have the so-called scaling maximalists. Raphaël Millière, a Columbia University philosopher whose work focuses on AI and cognitive science, coined the term to refer to the group most bullish about the transformative potential of ramping up. Their basic idea is that the structure of existing technologies will be sufficient to produce AI with true intelligence (whatever you interpret that to mean); all that’s needed at this point is to make that structure bigger—by multiplying the number of parameters and shoveling in more and more data. Nando de Freitas, the research director at DeepMind, epitomized the position last year when he tweeted, “It’s all about scale now! The Game is Over!” (He did go on, confusingly, to enumerate several other ways he thinks models must improve; DeepMind declined to make de Freitas available for an interview.)

The notion that simply inflating a model will endow it with fundamentally new abilities might seem prima facie ridiculous, and even a few years ago, Millière told me, experts pretty much agreed that it was. “This once was a view that would have been considered perhaps ludicrous or at least wildly optimistic,” he said. “The Overton window has shifted among AI researchers.” And not without reason: Scaling, AI researchers have found, not only hones abilities that language models already possess—making conversations more natural, for example—but also, seemingly out of nowhere, unlocks new ones. Supersized models have gained the sudden ability to do triple-digit arithmetic, detect logical fallacies, understand high-school microeconomics, and read Farsi. Alex Dimakis, a computer scientist at the University of Texas at Austin and a co-director of the Institute for Foundations of Machine Learning, told me he became “much more of a scaling maximalist” after seeing all the ways in which GPT-3 has surpassed earlier models. “I can see how one might look at that and think, Okay, if that’s the case, maybe we can just keep scaling indefinitely and we’ll clear all the remaining hurdles on the path to human-level intelligence,” Millière said.

Millière’s own sympathies lie with the opposite side in the debate. To those in the scaling-skeptical camp, the maximalist stance is magical thinking. Their first objections are practical: The bigger a language model gets, the more data are required to train it, and we may well run out of high-quality, published text that can be fed into the model long before we achieve anything close to what the maximalists envision. What this means, the University of Alberta computer scientist Rich Sutton told me, is that language models are only “weakly scalable.” (Computing power, too, could become a limiting factor, though most researchers find this prospect less concerning.)

There may be ways to mine more material that can be fed into the model. We could transcribe all the videos on YouTube, or record office workers’ keystrokes, or capture everyday conversations and convert them into writing. But even then, the skeptics say, the sorts of large language models that are now in use would still be beset with problems. They make things up constantly. They struggle with common-sense reasoning. Training them is done almost entirely up front, nothing like the learn-as-you-live psychology of humans and other animals, which makes the models difficult to update in any substantial way. There is no particular reason to assume scaling will resolve these issues. “It hasn’t improved nearly as much as one might hope,” Ernest Davis, a computer-science professor at New York University, told me. “It’s not at all clear to me that any amount of feasible scaling is going to get you there.” It’s not even clear, for that matter, that a purely language-based AI could ever reproduce anything like human intelligence. Speaking and thinking are not the same thing, and mastery of the former in no way guarantees mastery of the latter. Perhaps human-level intelligence also requires visual data or audio data or even physical interaction with the world itself via, say, a robotic body.

Although these are convincing arguments, scaling maximalism has become something of a straw man for AI skeptics, Millière told me. Some experts have expressed a more measured faith in the power of scaling. Sutton, for example, has argued that new models will be necessary to solve the problems with current ones but also that those new models must be even more scalable than their predecessors to achieve human-level intelligence. In fact, relatively few researchers in the field subscribe to the more extreme position. In a survey of the natural-language-processing community, data scientists found that, to their surprise, researchers greatly overestimated support among their peers for the view that “scaling solves practically any important problem.” On average, they predicted that nearly half of their colleagues subscribed to this view; in fact, only 17 percent did. An abiding faith in the power of scaling is by no means the prevailing dogma, but for some reason, experts think it is.

In this way, the scaling debate is representative of the broader AI discourse. It feels as though the vocal extremes have drowned out the majority. Either ChatGPT will completely reshape our world or it’s a glorified toaster. The boosters hawk their 100-proof hype, the detractors answer with leaden pessimism, and the rest of us sit quietly somewhere in the middle, trying to make sense of this strange new world.