Artificial intelligence has in recent years proved itself to be a quick study, although it is being educated in a manner that would shame the most brutal headmaster. Locked into airtight Borgesian libraries for months with no bathroom breaks or sleep, AIs are told not to emerge until they’ve finished a self-paced speed course in human culture. On the syllabus: a decent fraction of all the surviving text that we have ever produced.
When AIs surface from these epic study sessions, they possess astonishing new abilities. People with the most linguistically supple minds—hyperpolyglots—can reliably flip back and forth between a dozen languages; AIs can now translate between more than 100 in real time. They can churn out pastiche in a range of literary styles and write passable rhyming poetry. DeepMind’s Ithaca AI can glance at Greek letters etched into marble and guess the text that was chiseled off by vandals thousands of years ago.
These successes suggest a promising way forward for AI’s development: Just shovel ever-larger amounts of human-created text into its maw, and wait for wondrous new skills to manifest. With enough data, this approach could perhaps even yield a more fluid intelligence, or a humanlike artificial mind akin to those that haunt nearly all of our mythologies of the future.
The trouble is that, like other high-end human cultural products, good prose ranks among the most difficult things to produce in the known universe. It is not in infinite supply, and for AI, not any old text will do: Large language models trained on books are much better writers than those trained on huge batches of social-media posts. (It’s best not to think about one’s Twitter habit in this context.) When we calculate how many well-constructed sentences remain for AI to ingest, the numbers aren’t encouraging. A team of researchers led by Pablo Villalobos at Epoch AI recently predicted that programs such as the eerily impressive ChatGPT will run out of high-quality reading material by 2027. Without new text to train on, AI’s recent hot streak could come to a premature end.
It should be noted that only a slim fraction of humanity’s total linguistic creativity is available for reading. More than 100,000 years have passed since radically creative Africans transcended the emotive grunts of our animal ancestors and began externalizing their thoughts into extensive systems of sounds. Every notion expressed in those protolanguages—and many languages that followed—is likely lost for all time, although it gives me pleasure to imagine that a few of their words are still with us. After all, some English words have a shockingly ancient vintage: Flow, mother, fire, and ash come down to us from Ice Age peoples.
Writing has allowed human beings to capture and store a great many more of our words. But like most new technologies, writing was expensive at first, which is why it was initially used primarily for accounting. It took time to bake and dampen clay for your stylus, to cut papyrus into strips fit to be latticed, to house and feed the monks who inked calligraphy onto vellum. These resource-intensive techniques could preserve only a small sampling of humanity’s cultural output.
Not until the printing press began machine-gunning books into the world did our collective textual memory achieve industrial scale. Researchers at Google Books estimate that since Gutenberg, humans have published more than 125 million titles, collecting laws, poems, myths, essays, histories, treatises, and novels. The Epoch team estimates that 10 million to 30 million of these books have already been digitized, giving AIs a reading feast of hundreds of billions of words, if not more than a trillion.
Those numbers may sound impressive, but they’re within range of the 500 billion words that trained the model that powers ChatGPT. Its successor, GPT-4, might be trained on tens of trillions of words. Rumors suggest that when GPT-4 is released later this year, it will be able to generate a 60,000-word novel from a single prompt.
Ten trillion words is enough to encompass all of humanity’s digitized books, all of our digitized scientific papers, and much of the blogosphere. That’s not to say that GPT-4 will have read all of that material, only that doing so is well within its technical reach. You could imagine its AI successors absorbing our entire deep-time textual record across their first few months, and then topping up with a two-hour reading vacation each January, during which they could mainline every book and scientific paper published the previous year.
Just because AIs will soon be able to read all of our books doesn’t mean they can catch up on all of the text we produce. The internet’s storage capacity is of an entirely different order, and it’s a much more democratic cultural-preservation technology than book publishing. Every year, billions of people write sentences that are stockpiled in its databases, many owned by social-media platforms.
Random text scraped from the internet generally doesn’t make for good training data, with Wikipedia articles being a notable exception. But perhaps future algorithms will allow AIs to wring sense from our aggregated tweets, Instagram captions, and Facebook statuses. Even so, these low-quality sources won’t be inexhaustible. According to Villalobos, within a few decades, speed-reading AIs will be powerful enough to ingest hundreds of trillions of words—including all those that human beings have so far stuffed into the web.
Not every AI is an English major. Some are visual learners, and they too may one day face a training-data shortage. While the speed-readers were bingeing the literary canon, these AIs were strapped down with their eyelids held open, Clockwork Orange–style, for a forced screening comprising millions of images. They emerged from their training with superhuman vision. They can recognize your face behind a mask, or spot tumors that are invisible to the radiologist’s eye. On night drives, they can see into the gloomy roadside ahead where a young fawn is working up the nerve to chance a crossing.
Most impressive, AIs trained on labeled pictures have begun to develop a visual imagination. OpenAI’s DALL-E 2 was trained on 650 million images, each paired with a text label. DALL-E 2 has seen the ocher handprints that Paleolithic humans pressed onto cave ceilings. It can emulate the different brushstroke styles of Renaissance masters. It can conjure up photorealistic macros of strange animal hybrids. An animator with world-building chops can use it to generate a Pixar-style character, and then surround it with a rich and distinctive environment.
Thanks to our tendency to post smartphone pics on social media, human beings produce a lot of labeled images, even if the label is just a short caption or geotag. As many as 1 trillion such images are uploaded to the internet every year, and that doesn’t include YouTube videos, each of which is a series of stills. It’s going to take a long time for AIs to sit through our species’ collective vacation-picture slideshow, to say nothing of our entire visual output. According to Villalobos, our training-image shortage won’t be acute until sometime between 2030 and 2060.
If indeed AIs are starving for new inputs by midcentury—or sooner, in the case of text—the field’s data-powered progress may slow considerably, putting artificial minds and all the rest out of reach. I called Villalobos to ask him how we might increase human cultural production for AI. “There may be some new sources coming online,” he told me. “The widespread adoption of self-driving cars would result in an unprecedented amount of road video recordings.”
Villalobos also mentioned “synthetic” training data created by AIs. In this scenario, large language models would be like the proverbial monkeys with typewriters, only smarter and possessed of functionally infinite energy. They could pump out billions of new novels, each of Tolstoyan length. Image generators could likewise create new training data by tweaking existing snapshots, but not so much that they fall afoul of their labels. It’s not yet clear whether AIs will learn anything new by cannibalizing data that they themselves create. Perhaps doing so will only dilute the predictive potency they gleaned from human-made text and images. “People haven’t used a lot of this stuff, because we haven’t yet run out of data,” Jaime Sevilla, one of Villalobos’s colleagues, told me.
Villalobos’s paper discusses a more unsettling set of speculative work-arounds. We could, for instance, all wear dongles around our necks that record our every speech act. According to one estimate, people speak 5,000 to 20,000 words a day on average. Across 8 billion people, that would come to roughly 40 trillion to 160 trillion words a day. Our text messages could also be recorded and stripped of identifying metadata. We could subject every white-collar worker to anonymized keystroke recording, and firehose what we capture into giant databases to be fed into our AIs. Villalobos noted drily that fixes such as these are currently “well outside the Overton window.”
Perhaps in the end, big data will have diminishing returns. Just because our most recent AI winter was thawed out by giant gobs of text and imagery doesn’t mean our next one will be. Maybe instead, it will be an algorithmic breakthrough or two that at last populate our world with artificial minds. After all, we know that nature has authored its own modes of pattern recognition, and that so far, they outperform even our best AIs. My 13-year-old son has ingested orders of magnitude fewer words than ChatGPT, yet he has a much more subtle understanding of written text. If it makes sense to say that his mind runs on algorithms, they’re better algorithms than those used by today’s AIs.
If, however, our data-gorging AIs do someday surpass human cognition, we will have to console ourselves with the fact that they are made in our image. AIs are not aliens. They are not the exotic other. They are of us, and they are from here. They have gazed upon the Earth’s landscapes. They have seen the sun setting on its oceans billions of times. They know our oldest stories. They use our names for the stars. Among the first words they learn are flow, mother, fire, and ash.