Earlier this year, Stephen Hawking, who has relied on the same computerized system to communicate for more than 20 years, received a much-needed upgrade. The system, which allowed Hawking to translate text into speech via a sensor on his cheek, had become prohibitively slow as the physicist’s progressing ALS left him with weakened control of his facial muscles.
When Intel began work on his newer, faster computer, Hawking had one requirement: The software could change, but the sound of his speech had to remain the same.
“The voice has become so iconic that he considers that his own personal voice,” Horst Haussecker, the director of Intel’s Computational Imaging Lab and a leader of the project, recently told NPR. “It’s based on, you know, slightly outdated technology, but it makes it very unique and you couldn’t copy it even if you wanted to.”
“It is the best I have heard,” Hawking wrote of his iconic voice on his personal website, “although it gives me an accent that has been described variously as Scandinavian, American, or Scottish.”
But even a case of mistaken nationality isn’t enough to damage the link between the sound of the machine and the man it speaks for; over time, one has become emblematic of the other. Stephen Hawking does not sound like a computer—Stephen Hawking sounds like Stephen Hawking.
Most people who cannot speak, though, do not have the luxury of being Stephen Hawking.
An estimated eight out of every 1,000 Americans, or 2.5 million people, are severely speech-impaired due to a variety of conditions: head injuries, congenital disorders like cerebral palsy, or degenerative diseases like Hawking’s ALS. Many of them rely on text-to-speech machines, typing words that are then vocalized electronically. They sound like computers. And because computers are manufactured in batches of more than one, they also sound like each other.
In August 2002, Rupal Patel, a speech-science professor at Northeastern University, was at a speech-technology conference in Odense, Denmark, to present the results of her latest research. People with dramatic speech impairment, she had found, were still able to control the melody of their voices (also called the “prosody” of a voice, or its pitch, tempo, and volume) even when they couldn’t form words; as a result, many people forwent their communication devices when talking to those closest to them, relying on inflection to help convey meaning.
Walking through the conference’s exhibition hall after her presentation, Patel passed a young woman and older man engaged in conversation, their voices indistinguishable from one another—both were using the same text-to-speech system.
Patel paused, listened. The same sound, she realized, was all around her. People throughout the hall—“nearly half the room,” she recalls—were using nearly identical voices.
“That’s when I put two and two together,” she says. “I thought, well, if they have this part of their voice that’s preserved, maybe I would be able to build a voice for them.”
The idea stayed with her. For the next few years, Patel developed and fine-tuned her process, and in 2007 she received a grant from the National Science Foundation to pursue the project that would become VocaliD (pronounced “vocality”), a for-profit company that creates personalized voices for text-to-speech systems by blending sounds taken from speech-impaired people with words recorded by healthy donors. (The price of a voice, she says, will ultimately depend on demand.)
The company's technology is based on the “source-filter theory,” which breaks the production of human speech into two components. One is the source, or the sound made by the vibrations of the vocal cords. The other is the filter, or the vocal tract: the path of these vibrations as they echo through the chambers of the neck and head. Conditions that cause speech impairment mainly affect the filter; the prosody of a voice is controlled by the source, which is usually left intact.
To create a voice, Patel says, “we’re taking the filter, the shape of the vocal tract, from the voice donor, and the source from the individual who’s given us something as limited as a vowel.” After taking a short recording from a recipient—who often can only vocalize as much as an “ahhh” sound—the VocaliD team selects a donor with a similar filter and uses a computer algorithm to layer one over the other. Donations come via the company’s “voice bank,” which opened to the public over Thanksgiving weekend. To donate, a person needs a computer, a microphone, and a few hours of time to record the hundreds of sentences Patel has compiled from old stories and common phrases to encompass all of the sounds of the English language.
From there, she explains, “we chop that blended voice into little snippets of speech that can be rearranged any way, by gluing together little bits of a sentence.”
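For readers who want to see the source-filter split in concrete terms, here is a minimal Python sketch of the textbook model, not VocaliD's actual algorithm. It assumes a crude pulse train as the "source" and a cascade of two-pole resonators as the "filter." The function names, the 8,000 Hz sample rate, and the formant values in `DONOR_FILTER` (rough figures for an "ah"-like vowel) are all illustrative assumptions.

```python
import math


def glottal_source(f0_hz, duration_s, sample_rate=8000):
    """Crude stand-in for vocal-fold vibration: one unit pulse per pitch period."""
    n_samples = round(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate=8000):
    """Two-pole resonator modelling a single vocal-tract resonance (formant)."""
    c = -math.exp(-2.0 * math.pi * bandwidth_hz / sample_rate)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz / sample_rate)
         * math.cos(2.0 * math.pi * freq_hz / sample_rate))
    a = 1.0 - b - c  # normalises the gain at 0 Hz to exactly 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def vocal_tract(signal, formants, sample_rate=8000):
    """The 'filter': pass the signal through each formant resonator in turn."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Hypothetical donor filter: (centre frequency, bandwidth) pairs in Hz.
DONOR_FILTER = [(700, 130), (1100, 70), (2500, 160)]


def blend(recipient_f0_hz, donor_formants, duration_s=0.1, sample_rate=8000):
    """Drive the donor's filter with a source at the recipient's pitch."""
    source = glottal_source(recipient_f0_hz, duration_s, sample_rate)
    return vocal_tract(source, donor_formants, sample_rate)


if __name__ == "__main__":
    low = blend(100, DONOR_FILTER)   # lower-pitched recipient
    high = blend(200, DONOR_FILTER)  # higher-pitched recipient, same donor
    print(len(low), len(high))
```

Calling `blend` with two different pitches and the same `DONOR_FILTER` produces two different waveforms that share one set of resonances, which is the basic idea behind pairing one person's source with another person's filter. A real system would estimate both from recordings rather than hard-coding them.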
Patel estimates that somewhere between 500 and 600 people have already donated their voices, and that around 24,000 people have signed up to donate in the future—a number she hopes will allow the team to more effectively pair a recipient with a voice.
“In the past, we were doing some really [basic] matching, like age and gender,” she says. “We’re developing some new techniques to do more sophisticated matching for the kind of voice you have,” taking into account things like “voice quality,” or hoarseness; regional accent; and height and weight, both of which affect the vocal tract.
And further down the road, Patel says, she’d like to look into ways to accommodate VocaliD’s recipients as they age. “If you have a recording of one person going through time, you’ll see that voice is changing,” she says. “Maybe it’s not that you have to get a brand-new donor and recipient. Maybe there’s a way to change it computationally … It would be an exciting thing, if we could build someone a voice when they’re a kid and grow it over time.”
The voice bank may be new, but the ability of machines to generate human speech predates even electricity.
Over a century before the modern computer would be developed, Hungarian inventor Wolfgang von Kempelen began his work on the first speech-synthesis machine in 1770. The final product, which would take him two decades to complete, used a bellows to simulate lungs, a reed to create vibrations, and a rubber “mouth” with tubes and levers that could be manipulated to create the sounds of vowels and consonants. According to his 1791 book The Mechanism of Human Speech, with a Description of a Speaking Machine, von Kempelen’s creation could imitate human speech well enough for people to recognize phrases in French and Italian.
In 1845, using a bellows design similar to von Kempelen’s, German scientist Joseph Faber unveiled his own talking machine, the “Euphonia,” at Philadelphia’s Musical Fund Hall. The machine, emblazoned with the image of a disembodied female head, had a “ghostly monotone,” historian David Lindsay wrote, but could speak every European language and sing “God Save the Queen.”
Both von Kempelen’s and Faber’s devices caught the attention of Alexander Graham Bell, who used their work as inspiration for his own model of the human vocal tract in 1860, 16 years before filing his patent for the telephone. Bell Labs, the research company later named for him, was at the forefront of text-to-speech technology through its transition to the digital age: In 1961, the company was the first to synthesize speech with a computer, using an IBM machine to sing the song “Daisy Bell.” (Author Arthur C. Clarke, who happened to witness the demonstration, later recreated it with Hal, the computer in 2001: A Space Odyssey.)
Stephen Hawking’s voice, based on the “outdated technology” that Intel’s Haussecker referenced, comes from DECTalk, one of the first personal text-to-speech devices. Invented in the early 1980s by Dennis Klatt, an engineer at the Massachusetts Institute of Technology, the device originally had only three voices: Hawking’s, “Perfect Paul,” based on Klatt’s own voice; “Beautiful Betty,” based on his wife’s; and a child’s voice, which he named “Kit the Kid.” DECTalk has since added six additional voices (and dropped the adjectives; the newcomers are simply named “Harry,” “Ursula,” etc.), but Paul quickly became the standard in artificial voices. The voice was so common, in fact, that it was used by the National Weather Service until earlier this year.
“Nowadays, there are more choices than there were 10 years ago,” Patel says, but they remain limited. “[For example,] your GPS can speak in an Australian accent, American accent, male or female. Those are the kinds of choices people can make about their voice, but they’re not specific—there’s not a Bostonian speaking in a Bostonian accent.”
Paul’s ubiquity and the small pool of current options throw into stark relief what the voice-impaired have lost. Like fingerprints, each human voice is unique to its owner; even the voices of identical twins have measurable differences.
“It’s really hard to overstate how important the voice is in the way we present ourselves to the world,” says Jody Kreiman, a speech scientist at the University of California, Los Angeles’ Bureau of Glottal Affairs and the author of the book Voices and Listeners. “In the same way you look at someone and start drawing conclusions, you hear a voice and start drawing conclusions … Are they in a good mood or a bad mood? Are they healthy? Educational level, [whether] they’re good-looking or not, [whether] they’re a leader.”
“When you lose your voice,” Kreiman adds, “you lose your social self.”
“The real question is, how fast can we get people their voices?” Patel says. As of a few months ago, the waiting list was about a thousand names long. Each voice takes around 10 to 15 hours to build once everything is recorded, but VocaliD has more to do before it can begin work in earnest: there are technological tweaks to be made, money to raise. To keep the eventual cost of a voice as low as possible, Patel is also looking into other ways to market her technology: “You might want [it] if you want an email to be read out loud in your voice,” she says, “or when you’re playing a video game and want to sound like yourself.”
In the meantime, VocaliD has thus far successfully created voices for three people, all teen girls, as part of its beta-testing phase. One of them, Samantha Grimaldo, is featured in a video on the company’s website, where the VocaliD team and Samantha’s mother watch her receive her new voice.
Seated at her family’s kitchen table, she types out a sentence on her tablet: “My favorite food is pizza.”
She's grinning, though the inflectionless sound that emerges gives no indication of her excitement. Samantha's new voice doesn’t sound completely natural. It sounds, still, like a robotic voice—but it doesn’t sound like anyone else, either.