That’s exactly what Gershon Silbert, a 61-year-old former concert pianist and the CEO of VivoText, an Israeli start-up he founded in 2008, hopes to achieve. VivoText’s text-to-speech engine draws on two pieces of technology: a proprietary voice-sample database that enables the portrayal of “emotion”; and software that Silbert devised to generate virtual-music performances that capture the expressiveness of professional musicians.
Not that Silbert thinks the best text-to-speech platforms used in audio books, video games, and e-mail readers lack expressiveness. “The pitch goes up and down,” he told me. “The timing changes. They do have expression; it’s just that what they’re expressing is sometimes inappropriate and inaccurate, and in many cases not enough.”
Most phoneme databases have been created by voice actors who maintained a neutral tone to generate what Silbert calls “okay speech that works.” But when generated through these machines, sentences that demand emotion tend to fall flat. Silbert also wants to move beyond the pre-programmed phrase templates of existing technologies and allow a more open-ended sentence structure. To do that, the VivoText software interprets standard text-enhancement markings such as italicized and capped words and automatically analyzes other elements of syntax and semantics in a given text. Silbert’s system of “context analysis” will enable VivoText to know, for example, whether to emphasize the “we” or the “you” in the question “What can we do for you?” It also enables users to further control the tone by selecting one of several settings (“happy,” “sad,” “deliberate,” “enthusiastic”). As Silbert told me, “We don’t want to impose our worldview on other people.”
Silbert began dabbling in the field during the mid-1990s, ultimately creating his Music Objects Recognition technology for making computer-generated music performances that sounded human. But making money from that technology proved difficult. “Venture capitalists saw this as a very small niche market, not really worth funding,” he told me. “But what came out in conversations was the idea of applying it to text-to-speech. I figured, If that’s what people want, why not do it?” His company targets electronic publishing, specifically the audio-book market, which currently comprises a meager percentage of the hundreds of thousands of new titles published in the U.S. each year.
“On the emotive side of it, VivoText definitely has something ahead of the competition,” says Charles Palmer, the executive director of the Center for Advanced Entertainment and Learning Technologies at Pennsylvania’s Harrisburg University. But listening to a 100,000-word book may be another story. As Palmer said to me, “Right now, we’re used to listening to automated voices in small bursts. I’m wondering how long a synthesized voice can really keep someone engaged.”