The Corpus in the Court: 'Like Lexis on Steroids'

Say goodbye to the dictionary definition. Courts, long dependent on the vagaries of language, have new quantitative tools they can use to precisely pin down how words are used.

Linguists have introduced a new wave of data-crunching methods to answer knotty questions about language use, and a decision handed down by the Supreme Court this week reveals that the legal world is taking notice. For too long, judges and attorneys have relied on scattershot, impressionistic evidence to back up linguistic claims, even though language is often central to judicial decision-making. New high-tech tools for analyzing common usage promise a welcome antidote, but there is a danger in expecting them to cut through the inherent slipperiness of words.

The Supreme Court, to its credit, drew on some nuanced linguistic expertise in its ruling on Tuesday that corporations are not entitled to "personal privacy," rebuffing arguments from the plaintiff in the case, AT&T, that big companies deserve that protection since they can be considered "persons" under the law. "Personal" is just the adjective form of the noun "person," AT&T contended.

Not so fast, wrote Chief Justice John Roberts in the Court's unanimous opinion. There is no "grammatical imperative" dictating that the legal meaning of a noun must extend to a corresponding adjective. To back up this point, he turned to dictionary definitions, a common gambit when the Court needs to settle a semantic score. "Corny," he illustrated with the help of Webster's, "has little to do with 'corn.'" In oral arguments, Roberts had come up with more pairs of related nouns and adjectives that diverge in meaning: "squirrel" and "squirrely," "craft" and "crafty," "pastor" and "pastoral."

But to really drive the point home, Roberts pulled out a litany of two-word phrases in which "personal" describes something we associate with humans, not big corporations: "We do not usually speak of personal characteristics, personal effects, personal correspondence, personal influence, or personal tragedy as referring to corporations or other artificial entities," the opinion read. "This is not to say that corporations do not have correspondence, influence, or tragedies of their own, only that we do not use the word 'personal' to describe them."

Behind Roberts' folksy linguistic wisdom lay some cold, hard, empirical facts. In an amicus brief that helped shape the decision, Neal Goldfarb, filing on behalf of The Project On Government Oversight, mustered evidence supporting the conclusion that the word "personal" is typically used for human beings. In the oral arguments, Justice Ruth Bader Ginsburg made AT&T's lawyers look foolish by citing the brief's "dozens and dozens of examples to show that, overwhelmingly, 'personal' is used to describe an individual, not an artificial being." Going beyond the authority of the dictionary, Goldfarb's brief appealed to a new type of language authority: the corpus.

A corpus is an enormous collection of texts that can be analyzed for usage patterns -- "like Lexis on steroids," Goldfarb explained to the Court, referring to the legal database that the Justices are all familiar with. Using corpora (that's the plural) compiled by Mark Davies at Brigham Young University, Goldfarb pulled up the most common nouns that the adjective "personal" can modify: "personal life," "personal experience," "personal relationship," "personal problem," and so forth (including the examples Roberts would cite in his opinion). He could even zero in on the most common combinations for a particular era -- in this case the 1970s, when Congress enacted the Freedom of Information Act exemption that AT&T was wrangling over.
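
For readers curious what such a query boils down to, here is a minimal sketch in Python of the basic idea: tally the words that immediately follow "personal" and rank them by frequency. It is not the BYU interface or Goldfarb's actual method. The function name and the handful of sample sentences are invented for illustration, and unlike a real corpus tool this toy version has no grammatical tagging, so it counts whatever word comes next rather than nouns specifically.

```python
import re
from collections import Counter


def collocates_after(texts, target, top_n=5):
    """Count the words that directly follow `target` across `texts`.

    Hypothetical helper for illustration only: a real corpus query would
    use part-of-speech tags to keep nouns and discard everything else.
    """
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple word tokens.
        words = re.findall(r"[a-z']+", text.lower())
        # Walk through adjacent word pairs.
        for current, following in zip(words, words[1:]):
            if current == target:
                counts[following] += 1
    return counts.most_common(top_n)


# Invented sample sentences standing in for a corpus (an assumption,
# not data from the brief or from the BYU corpora).
SAMPLE_TEXTS = [
    "Her personal life stayed out of the papers.",
    "He spoke from personal experience about the move.",
    "The memoir describes her personal life and a personal tragedy.",
    "Personal experience shaped his view of the law.",
    "She kept her personal life private.",
]

top = collocates_after(SAMPLE_TEXTS, "personal")
for word, count in top:
    print(f"personal {word}: {count}")
```

A real corpus runs the same kind of tally over hundreds of millions of words, which is what lets a researcher restrict the count to a single decade or a single part of speech.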

The explanatory power of "the dictionary" is often invoked in arguments and opinions (with the behemoths, Webster's New International and the Oxford English Dictionary, favored at the Supreme Court level), but even unabridged dictionary definitions can never encompass the variety of real-life contexts for words as they make their way in the world. For that you need a corpus. Corpus analysis has already transformed how dictionaries are being made, and now it is making a belated appearance in the courtroom.

Nowadays, corpus analysis is no longer an esoteric art for linguists and lexicographers only. The BYU corpora, containing hundreds of millions of words in both contemporary and historical English, are open to the public and require only a modicum of training for those who want to delve into them. In investigating the uses of the word "personal," for instance, Roberts and his fellow Justices could easily have run the same queries that Goldfarb did to discover which nouns most often partner up with the adjective. And more corpus tools are coming online now for public consumption, notably the Ngram Viewer for analyzing the colossal Google Books collection (though it can't yet provide fine-grained grammatical details, like information on adjective-noun sequences).

While the corpus revolution promises to put judicial inquiries into language patterns on a firmer, more systematic footing, the results are still prey to all manner of human interpretation. Strict originalists on the bench might find solace in the ability to pinpoint the meaning of ordinary language at different historical junctures. Those seeing the law as more protean, subject to changes in meaning over time, would instead focus on revelations about the ever-shifting nature of word usage. But at least these ideological arguments can proceed on a basis of concrete facts about how we use language, rather than on a welter of idiosyncratic assumptions, as has too often been the case.

Though the introduction of new, data-driven techniques for looking at language is a welcome development, there are inevitable limits to this type of analysis. Consider the surprisingly snarky coda from Chief Justice Roberts at the end of the opinion: "We trust that AT&T will not take it personally." Wouldn't this seem to undercut the whole idea that "personal" is just for us humans? Not really, because Roberts was using the opportunity for an ironic commentary on the whole idea of a corporate entity like AT&T pleading for "personal" treatment. That very playfulness with language -- sarcasm, intentional ambiguity, allusion -- remains difficult to capture in any corpus analysis. No matter how much courts would like to view language as a closed, formal system, words can always wriggle free for reinterpretation. Sorry, SCOTUS. Nothing personal.