Can Big Data Find the Next 'Harry Potter'?

A new algorithm aims to inject some science into the art of publishing.


Like other cultural industries, publishing is founded on hits. Yet the business of predicting best sellers remains an enigmatic art—the province chiefly of gut instinct and educated guess. Sometimes these faculties serve the industry well; other times not so much, especially when it comes to first-time authors. J. K. Rowling and John Grisham endured serial rejection before landing the deals that brought their work to the masses. E. L. James’ Fifty Shades of Grey found a traditional publisher only after it had been self-published.

A computer algorithm able to identify best-selling texts with at least 80 percent success sounds like science fiction. But the “bestseller-ometer”—the subject of an upcoming tome, The Bestseller Code: Anatomy of the Blockbuster Novel, by Jodie Archer, a former research lead on literature at Apple, and Matthew L. Jockers, an associate professor of English at the University of Nebraska-Lincoln—is emphatically non-fictional. The algorithm’s claimed efficacy is based on a track record “predicting” New York Times best sellers when applied retrospectively to novels from the past 30 years.

Several years in the offing and the product of the processing power of thousands of computers, the bestseller-ometer represents an attempt to identify the characteristics of best-selling fiction at scale by interrogating a massive body of literature (20,000-plus novels). By seeking to put the traits that set it apart from lesser-selling work on something approaching a scientific footing, the project provides a data-driven check to received wisdom about the “secrets” behind top-selling fiction. It also presages a possible future where publishers turn to technology to help cut through the vagaries of picking prospective best sellers.
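The retrospective validation described above—scoring past novels and checking the model's calls against the actual best-seller lists—amounts to what machine-learning practitioners call a hold-out accuracy test. The sketch below illustrates the idea under invented assumptions: synthetic two-number "novels" (imagine the two features are, say, share of a signature topic and average sentence length) and a simple nearest-centroid rule stand in for the authors' real features and model.

```python
import random

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def evaluate(data, train_frac=0.7, seed=0):
    """Hold-out accuracy: train a nearest-centroid classifier on a
    shuffled 70% slice, then score predictions on the unseen 30%."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    train, test = data[:cut], data[cut:]
    cents = {
        label: centroid([v for v, lbl in train if lbl == label])
        for label in {lbl for _, lbl in train}
    }
    hits = sum(
        1 for v, lbl in test
        if min(cents, key=lambda c: dist2(v, cents[c])) == lbl
    )
    return hits / len(test)

# Synthetic corpus: "hits" cluster high on both features, "misses" low.
rng = random.Random(1)
data = ([([0.6 + rng.random() * 0.3, 0.6 + rng.random() * 0.3], "hit")
         for _ in range(50)] +
        [([rng.random() * 0.4, rng.random() * 0.4], "miss")
         for _ in range(50)])
print(f"hold-out accuracy: {evaluate(data):.0%}")
```

Because the synthetic clusters are cleanly separated, the toy classifier scores far better than the real 80 percent figure; the point is only the shape of the test, not the number.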

But how could an algorithm capture the richness and complexity of literature? What could such disparate best sellers as the pulpy beach reads of James Patterson and critically acclaimed literary fiction of Jonathan Franzen have in common? And how can a computer account for the zeitgeist that forms an essential backdrop to a book’s appeal?

* * *

The enterprise was conceived at Stanford University around 2008. Jockers, then a lecturer at the Palo Alto campus, was a leading light in the emerging field of “digital humanities”—the application of computer-enabled quantitative analysis to text (he later co-founded the Stanford Literary Lab). Archer, a graduate student, was “skeptical” computers could say anything substantive about literature. A demonstration of a computer model’s prowess in picking the genre of Shakespeare’s plays based on textual markers did little to allay her doubts—she was impressed by the spectacle as a computational feat but otherwise “underwhelmed.”

“So what? I already knew Macbeth was a tragedy.” She urged Jockers: “We have to [pose] a question we can’t answer, that moves the field forward.”

That question: “Why do we all read the same book?”

Archer’s interest in this was piqued a few years earlier, when she was an editor at Penguin in London amid the hoopla surrounding Dan Brown’s The Da Vinci Code. That book had been widely panned by critics, yet found a mass audience (80 million copies sold to date). Was there a “textual charisma,” as The Bestseller Code puts it, to which readers were, perhaps unwittingly, responding?

The algorithm Archer and Jockers subsequently built isn’t the first attempt to apply the clarifying power of Big Data to books. Inkitt, the Berlin startup behind what’s been billed as the “first novel selected by an algorithm,” intensively tracks reader responses to stories posted to its web platform to identify potential best sellers. London’s Jellybooks, founded in 2011, measures “reader engagement” later in the literary production cycle, immediately before books are published, using software downloaded by readers onto their devices in exchange for advance access to a title. But the bestseller-ometer stands apart in joining old-school literary scholarship to computational horsepower. The Bestseller Code, an amplification of Archer’s 2014 dissertation, documents the intricate considerations that went into “training the machine to read” and unpacks the micro-decisions at the level of diction and syntax involved in crafting best-selling fiction.

“These algorithms aren’t magic,” says University of Notre Dame assistant professor of English Matthew Wilkens, himself a digital humanist. “They reflect [the same] interpretative and analytical choices [involved in] reading one book closely; you’re looking for certain repetitions, word usage patterns, thematic emphases and allusions. It’s not work that can be done by someone not familiar with literature.”

* * *

So what does a suitably trained algorithm have to say about storytelling that hooks readers en masse?

No surprise about some of the elements: authoritative “voice”; spare, plainspoken, often colloquial prose; declarative verbs that connote action-oriented, take-charge characters.

Others are less obvious. By cataloging words associated with certain subjects, Archer and Jockers identified narrative “cohesion” as a habit of top-selling authors. Danielle Steel and John Grisham typically devote one-third of their novels to a “signature” topic—“domestic life” in Steel’s case, “lawyers and the law” for Grisham—and these form part of an overall mix that lends itself to conflict; topics between which they can toggle to generate dramatic friction. Outside the home, Steel often thrusts her characters into life-and-death medical situations, for example. Conversely, lesser-selling novels tend to be cacophonous and diffuse, biting off more than they can chew, populated by unrelated topics.
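The topic-share bookkeeping behind that “cohesion” finding can be sketched in a few lines. This is a hedged illustration only: the hand-picked topic word lists, the sample sentence, and the simple counting rule are invented for the example, whereas the actual model inferred its topics statistically from the full 20,000-plus-novel corpus.

```python
from collections import Counter

# Illustrative, hand-picked word lists standing in for learned topics.
# The real model derives topics from the corpus itself; these small
# sets are assumptions made for the sketch.
TOPICS = {
    "domestic life": {"home", "kitchen", "children", "dinner", "family"},
    "law": {"court", "lawyer", "judge", "verdict", "trial"},
    "medicine": {"hospital", "doctor", "surgery", "diagnosis", "ward"},
}

def topic_proportions(text: str) -> dict:
    """Share of topic-flagged words attributed to each topic."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = Counter()
    for w in words:
        for topic, vocab in TOPICS.items():
            if w in vocab:
                counts[topic] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero
    return {t: counts[t] / total for t in TOPICS}

sample = ("At home the family sat to dinner, but the lawyer thought "
          "only of the court, the judge, and the looming verdict.")
print(topic_proportions(sample))
```

A Steel-style novel would show one dominant slice ("domestic life") with a secondary conflict-generating topic alongside it; a diffuse, lesser-selling manuscript would scatter its counts thinly across many unrelated topics.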

Then there are findings that confound expectations. Sex doesn’t sell. In fact, it’s a distinctly minority taste, confined to a vanishingly small proportion of best-selling material, according to the bestseller-ometer. This discovery occurred early in the research, so the 2011 emergence of Fifty Shades, heaving with hot and heavy erotic scenes, came as a plot twist. But running their model, Archer and Jockers found the book’s number-one subject was “human closeness” (the most prevalent topic across all the bestsellers they looked at, in fact). Fifty Shades was chiefly about the emotional intimacy between its characters.

The model yielded further clues to the appeal of the much-mocked book. Mapping its emotional trajectory, as suggested by emotionally charged words, Archer and Jockers uncovered a rhythmic tempo to its cycles of unrest and closure. Plotted on a graph, this describes a near-perfect undulating waveform. “James writes emotional turns with such a regularity of beat that the reader feels the thrum of her words in their bodies like the effect of club music,” Archer and Jockers report.
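The emotional-trajectory mapping described above—an approach Jockers also explored in his open-source syuzhet package for R—can be approximated in miniature: score each narrative segment against a valence lexicon, then smooth the series so the underlying rise-and-fall shape emerges. The eight-word lexicon, the toy five-segment “story,” and the window size below are all assumptions made for illustration; the published work used far richer sentiment dictionaries over whole novels.

```python
# Tiny valence lexicon: positive words score above zero, negative below.
VALENCE = {
    "love": 2, "joy": 2, "safe": 1, "calm": 1,
    "fear": -2, "pain": -2, "loss": -1, "doubt": -1,
}

def segment_scores(segments):
    """Raw valence score for each narrative segment."""
    scores = []
    for seg in segments:
        words = seg.lower().split()
        scores.append(sum(VALENCE.get(w, 0) for w in words))
    return scores

def moving_average(values, window=3):
    """Smooth raw scores so the wave shape of unrest and closure emerges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window // 2)
        hi = min(len(values), i + window // 2 + 1)
        chunk = values[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out

story = ["joy and love", "doubt creeps in", "fear and pain",
         "loss", "calm and safe love"]
raw = segment_scores(story)
smooth = moving_average(raw)
print(raw, smooth)
```

Plotted over narrative time, the smoothed series is the "undulating waveform" the authors describe; a regular beat of peaks and troughs is what the model detected in Fifty Shades.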

* * *

The authors reject any suggestion their algorithm has hit upon a formula for would-be best-selling novelists—more like some instructive data points. Indeed, much of what they found reveals the diverse directions in which popular fiction may be taken based on common foundations—the possibilities inherent in the form. The algorithm located bestsellers across seven plot types, for example. And a book was no less likely to best-sell if it ended on a downer (all the better for a sequel, The Bestseller Code notes).

Archer and Jockers also eschew any notion they’re out to “disrupt” publishing. They have no immediate plans to commercialize their creation; Jockers characterizes it as a proof of concept, a “prototype” for the approach’s potential in tackling literary questions. But, in an algorithm that could sharpen publishers’ ability to identify prospective bestsellers at the manuscript stage, they’ve developed a potentially valuable piece of intellectual property.

Johnny Geller, co-CEO of the London-based literary agency Curtis Brown, was interested enough to secure an advance copy of The Bestseller Code that he was halfway through when we spoke in late August. He sees a potential application for such a tool at publishing’s “discovery” stage, when agents are screening submissions. Still, he thinks it would be an adjunct to human acumen, rather than anything that might one day supplant it. “I use human algorithms all the time, but they only take you [so far],” he says. “You need a human with feeling, the ability to be surprised.”

Knopf editor Carole Baron, who’s edited Danielle Steel, Elmore Leonard, Judy Blume, and other big-name authors, says she’s “skeptical” of the forecasting power of an algorithm based on already-published works. “Can you predict the future in literature and art when you can’t factor in the zeitgeist? We’re always surprised.”

Zeitgeist might explain the fate of Dave Eggers’ The Circle. The bestseller-ometer anointed the 2013 novel as the exemplary best-selling text from the past 30 years. It ticked all the boxes for popular page-turning fiction and was accorded a 100 percent chance of best-selling. The algorithm was correct; The Circle sold 220,000 copies as of June, based on Nielsen BookScan figures cited in Publishers Weekly. But these are respectable rather than meteoric numbers.

Baron says it’s a supreme attunement to zeitgeist that, in part, explains the success of Danielle Steel, the most popular author currently writing (going by sales: 650 million and counting). “I used to tell her, ‘you’re a channeler,’” recounts Baron. “I … believe these words and ideas are in the world and grab hold of some people. Danielle would … say, ‘I have this idea,’ and the whole thing would [come out] practically whole-cloth. She’d work and work on it, but the initial idea [would] hit her in the middle of the night.”

Of course, inertia around Steel, Patterson, and other so-called “franchise” authors, who now perennially sit atop the best-seller lists, means publishers are less inclined than ever to divert funds to unknown writers. And this is where the bestseller-ometer may find its noblest application, says Archer: as a democratizing force, a tool to ease publishers’ concerns about taking a flier on a rookie author. A writer languishing in the slush pile with no literary pedigree, like J. K. Rowling or John Grisham back in the day, but with a manuscript that aces the algorithm may well be worth a second look.

“[It’s] Mrs. Smith from Iowa who’s just written a lovely book [that this] could massively help,” she says.