Technology November 2011

The Voice in the Machine

Is lifelike synthetic speech finally within reach?

Ever since the Voder, Bell Labs’ artificial-voice machine, blurted out a barely intelligible “Good evening, radio audience …” at the 1939 New York World’s Fair, voice engineers have been striving to generate lifelike synthetic speech. Unlike today’s automated systems, the Voder needed an operator who knew which keys to press to elicit “speech” that, for all its marvels, sounded like it was coming from a tuba rather than a human being.

Scientists continued refining their synthetic voices through the 1960s. In the 1970s, advances in computers ironically brought human voices back into the mix, with digitally recorded speech providing canned audio responses. Researchers began chopping up dialogue into the smallest units of speech, phonemes, and using software programs to re-form those bits into words, phrases, and sentences. Unfortunately, such utterances sounded pretty much the way “re-formed” chicken nuggets taste. Since the mid-1990s, expanding “digital libraries” have allowed for storage of more phonemes that could be split into even smaller units, adding authenticity to the “voice.” But even today’s state-of-the-art systems, like AT&T’s Natural Voices, still don’t capture the range of human emotion.

That’s exactly what Gershon Silbert, a 61-year-old former concert pianist and the CEO of VivoText, an Israeli start-up he founded in 2008, hopes to achieve. VivoText’s text-to-speech engine draws on two pieces of technology: a proprietary voice-sample database that enables the portrayal of “emotion,” and software that Silbert devised to generate virtual-music performances that capture the expressiveness of professional musicians.

Not that Silbert thinks the best text-to-speech platforms used in audio books, video games, and e-mail readers lack expressiveness. “The pitch goes up and down,” he told me. “The timing changes. They do have expression; it’s just that what they’re expressing is sometimes inappropriate and inaccurate, and in many cases not enough.”

Most phoneme databases have been created by voice actors who maintained a neutral tone to generate what Silbert calls “okay speech that works.” But when generated through these machines, sentences that demand emotion tend to fall flat. Silbert also wants to move beyond the pre-programmed phrase templates of existing technologies and allow a more open-ended sentence structure. To do that, the VivoText software interprets standard text-enhancement markings such as italicized and capped words and automatically analyzes other elements of syntax and semantics in a given text. Silbert’s system of “context analysis” will enable VivoText to know, for example, whether to emphasize the “we” or the “you” in the question “What can we do for you?” It also enables users to further control the tone by selecting one of several settings (“happy,” “sad,” “deliberate,” “enthusiastic”). As Silbert told me, “We don’t want to impose our worldview on other people.”

Silbert began dabbling in the field during the mid-1990s, ultimately creating his Music Objects Recognition technology for making computer-generated music performances that sounded human. But making money from that technology proved difficult. “Venture capitalists saw this as a very small niche market, not really worth funding,” he told me. “But what came out in conversations was the idea of applying it to text-to-speech. I figured, If that’s what people want, why not do it?” His company targets electronic publishing, specifically the audio-book market, which currently comprises a meager percentage of the hundreds of thousands of new titles published in the U.S. each year.

“On the emotive side of it, VivoText definitely has something ahead of the competition,” says Charles Palmer, the executive director of the Center for Advanced Entertainment and Learning Technologies at Pennsylvania’s Harrisburg University. But listening to a 100,000-word book may be another story. As Palmer said to me, “Right now, we’re used to listening to automated voices in small bursts. I’m wondering how long a synthesized voice can really keep someone engaged.”

While Silbert acknowledges that VivoText is not about to compete with Derek Jacobi reading Shakespeare, he says that for informational or technical books, his relatively mellifluous text-to-speech engine will do just fine. The same goes for other voice-supported platforms like toys and games, GPS navigation, and SMS and e-mail reading. Though Silbert won’t say which of those platforms will first use VivoText, the company plans to launch its first product roughly in time for you to not just read this, but hear it—and, he hopes, with F-E-E-L-I-N-G.

Arnie Cooper is a writer in Santa Barbara.
