Of course, giving the bot a larger vocabulary will only intensify the bot’s existential question. The German mathematician Georg Cantor proved that there could be larger and smaller infinities that were both, still, infinities; a more prolix Every Word will only make its lacunae more noticeable.
Because if you make that list longer and longer, you will run into an old problem: No one’s sure what a word is, exactly.
For over a century, some scholars have claimed that Shakespeare used more words than any other writer, that his vocabulary dwarfed that of his era’s fellow English speakers. The number of words he deployed, some insisted, is even double that of modern-day speakers. In 1986, a famed and Emmy-winning PBS documentary, The Story of English, alleged: “Shakespeare had one of the largest vocabularies of any English writer, some 30,000 words. Estimates of an educated person’s vocabulary today vary, but it is probably about half this, 15,000.”
Could that be true? It depends what you mean by vocabulary. As Ward Elliott and Robert Valenza write in their paper, “Shakespeare’s vocabulary: did it dwarf all the others?”, there are three different ways to cut up a text into its words. (They cite Marvin Spevack’s important studies into this issue, which were among the first to use a computer.)
Of the 884,647 tokens in the Riverside Shakespeare corpus, a computer counts 29,066 “types” (that is, distinct strings of letters). This machine-counting doesn’t account for the common alternate spellings of Shakespeare’s day, like wreck and wrack, or murder and murther, nor does it separate plurals and conjugated forms from their more common roots. By this count, horse and horses are two different words, as are run and running.
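To make the distinction concrete, here is a minimal sketch of that kind of machine count. The sample sentence, the function name `count_tokens_and_types`, and the crude letters-only tokenizer are all assumptions made for illustration; this is not Spevack's procedure, only the general idea of counting every occurrence (tokens) against every distinct spelling (types).

```python
import re
from collections import Counter

def count_tokens_and_types(text):
    """Return (token count, type count, per-type tallies) for a text."""
    # Crude tokenizer (an assumption): lowercase, then grab runs of letters.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Each distinct spelling is one type, so "horse" and "horses" stay separate.
    types = Counter(tokens)
    return len(tokens), len(types), types

# Hypothetical sample text, not drawn from the Riverside corpus.
sample = "A horse! A horse! My kingdom for a horse! Two horses run; one is running."
n_tokens, n_types, types = count_tokens_and_types(sample)

print(n_tokens)         # 15 tokens in all
print(n_types)          # 11 distinct types
print(types["horse"])   # 3 occurrences of the type "horse"
print(types["horses"])  # 1, counted as a separate type
```

A count like this knows nothing about spelling variants or shared roots, which is exactly the limitation described above.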
That’s because computers (at least those of the late 1960s, when Spevack was conducting his study) could only distinguish “types” like those. That horse and horses shared a root meant nothing to them. To count root words, which are sometimes called lemmas, the two scholars had to rely on hand-counts or on the common estimate that a vocabulary not yet lemmatized is two-thirds larger than one that uses only root words.
What’s Shakespeare’s lemmatized vocabulary, then? Both long-respected hand-counted efforts and a mathematical estimation return the same answer: He used between 17,000 and 18,000 root words.
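The mathematical estimation can be checked with a few lines of arithmetic. This sketch assumes that "two-thirds larger" means the type count equals the lemma count multiplied by 1 + 2/3; the function name `estimate_lemmas` is hypothetical, and the only figure taken from the text is the 29,066 types.

```python
def estimate_lemmas(type_count, inflation=2 / 3):
    """Estimate root words, assuming types = lemmas * (1 + inflation)."""
    return type_count / (1 + inflation)

riverside_types = 29_066  # machine-counted types cited above
estimate = estimate_lemmas(riverside_types)

print(round(estimate))  # 17440
```

Under that reading, the estimate lands inside the 17,000 to 18,000 range that the hand-counts report.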
This count may still be incorrect. Spevack’s machine reading can’t account for homographs, words like spring or bear that can function as nouns or verbs and carry many more definitions besides. It also doesn’t track two-token words, like grown up, where types combine to create a new definition. Finally, and this is the largest misestimation of all, it doesn’t account for words that Shakespeare knew but never wrote in a play. Such a challenge engrosses Elliott and Valenza for much of their paper. They conclude, finally, that Shakespeare’s total vocabulary… is just about the same size as or smaller than that of a “run-of-the-mill college-educated modern.”