The Voice in the Machine

Is lifelike synthetic speech finally within reach?

By Arnie Cooper

Ever since the Voder, Bell Labs’ artificial-voice machine, blurted out a barely intelligible “Good evening, radio audience …” at the 1939 New York World’s Fair, voice engineers have been striving to generate lifelike synthetic speech. Unlike today’s automated systems, the Voder needed an operator who knew which keys to press to elicit “speech” that, for all its marvels, sounded like it was coming from a tuba rather than a human being.

Scientists continued refining their synthetic voices through the 1960s. In the 1970s, advances in computers ironically brought human voices back into the mix, with digitally recorded speech providing canned audio responses. Researchers began chopping up dialogue into the smallest units of speech, phonemes, and using software programs to re-form those bits into words, phrases, and sentences. Unfortunately, such utterances sounded pretty much the way “re-formed” chicken nuggets taste. Since the mid-1990s, expanding “digital libraries” have allowed for storage of more phonemes that could be split into even smaller units, adding authenticity to the “voice.” But even today’s state-of-the-art systems, like AT&T’s Natural Voices, still don’t capture the range of human emotion.

That’s exactly what Gershon Silbert, a 61-year-old former concert pianist and the CEO of VivoText, an Israeli start-up he founded in 2008, hopes to achieve. VivoText’s text-to-speech engine draws on two pieces of technology: a proprietary voice-sample database that enables the portrayal of “emotion,” and software that Silbert devised to generate virtual-music performances that capture the expressiveness of professional musicians.

Not that Silbert thinks the best text-to-speech platforms used in audio books, video games, and e-mail readers lack expressiveness. “The pitch goes up and down,” he told me. “The timing changes. They do have expression; it’s just that what they’re expressing is sometimes inappropriate and inaccurate, and in many cases not enough.”

Most phoneme databases have been created by voice actors who maintained a neutral tone to generate what Silbert calls “okay speech that works.” But when generated through these machines, sentences that demand emotion tend to fall flat. Silbert also wants to move beyond the pre-programmed phrase templates of existing technologies and allow a more open-ended sentence structure. To do that, the VivoText software interprets standard text-enhancement markings such as italicized and capped words and automatically analyzes other elements of syntax and semantics in a given text. Silbert’s system of “context analysis” will enable VivoText to know, for example, whether to emphasize the “we” or the “you” in the question “What can we do for you?” It also enables users to further control the tone by selecting one of several settings (“happy,” “sad,” “deliberate,” “enthusiastic”). As Silbert told me, “We don’t want to impose our worldview on other people.”
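
For readers curious what reading those text-enhancement markings might involve, here is a minimal sketch in Python. It is not VivoText's method, and it covers only the markup step, not the analysis of syntax and semantics. It assumes italics arrive as *word* and treats any word of two or more capital letters as capped; the names mark_emphasis and Word are invented for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: italics are written as *word*; anything else is a plain word.
TOKEN = re.compile(r"\*(?P<italic>[^*\s]+)\*|(?P<plain>[A-Za-z']+)")


@dataclass
class Word:
    text: str
    stressed: bool


def mark_emphasis(sentence: str) -> list:
    """Flag the words that the markup says a reader should stress."""
    words = []
    for match in TOKEN.finditer(sentence):
        italic = match.group("italic")
        plain = match.group("plain")
        if italic is not None:
            # Italicized words are always treated as stressed.
            words.append(Word(italic, True))
        else:
            # Capped words count only if longer than one letter, so "I" is skipped.
            capped = len(plain) > 1 and plain.isupper()
            words.append(Word(plain, capped))
    return words


stressed = [w.text for w in mark_emphasis("What can *we* do for you?") if w.stressed]
print(stressed)  # prints ['we']
```

A real engine would still have to decide how to voice the stress. This sketch only decides where the markup puts it.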

Silbert began dabbling in the field during the mid-1990s, ultimately creating his Music Objects Recognition technology for making computer-generated music performances that sounded human. But making money from that technology proved difficult. “Venture capitalists saw this as a very small niche market, not really worth funding,” he told me. “But what came out in conversations was the idea of applying it to text-to-speech. I figured, If that’s what people want, why not do it?” His company targets electronic publishing, specifically the audio-book market, which currently comprises a meager percentage of the hundreds of thousands of new titles published in the U.S. each year.

“On the emotive side of it, VivoText definitely has something ahead of the competition,” says Charles Palmer, the executive director of the Center for Advanced Entertainment and Learning Technologies at Pennsylvania’s Harrisburg University. But listening to a 100,000-word book may be another story. As Palmer said to me, “Right now, we’re used to listening to automated voices in small bursts. I’m wondering how long a synthesized voice can really keep someone engaged.”

While Silbert acknowledges that VivoText is not about to compete with Derek Jacobi reading Shakespeare, he says that for informational or technical books, his relatively mellifluous text-to-speech engine will do just fine. The same goes for other voice-supported platforms like toys and games, GPS navigation, and SMS and e-mail reading. Though Silbert won’t say which of those platforms will first use VivoText, the company plans to launch its first product roughly in time for you to not just read this, but hear it—and, he hopes, with F-E-E-L-I-N-G.

This article available online at:

http://www.theatlantic.com/magazine/archive/2011/11/the-voice-in-the-machine/308690/