From Your Lips to Your Printer
Finally, voice-recognition software that (almost) lives up to its promise to liberate those unable or unwilling to type.
FOR years I knew exactly what a computer would have to do to make itself twice as useful as it already was. It would have to show that it could accurately convert the sound of spoken language to typed-up text. I had a specific chore in mind for such a machine. I would give it the tape recordings I make during interviews or while attending speeches, and it would give me back a transcript of who said what. This would save the two or three hours it takes to listen to and type up each hour's worth of recorded material.
This machine would have advantages for other people, too. It would help groups that want minutes of their meetings or brainstorming sessions, legal professionals who need quick transcripts of what just happened at trials, students in big lecture halls, people who want to dictate e-mail while stuck in traffic, and those who, owing to disability or stress injury, are not able to type.
For years I despaired that such a machine would ever exist. The demonstrations I saw at computer shows, starting in the mid-1980s, left me with the impression that the speech-text barrier in technology was as formidable as the blood-brain barrier long seemed to be in medicine. At the shows the creator of each new system would carefully utter a phrase, which the computer would faithfully render on its screen. But if someone in the audience asked to see the computer handle a different phrase, or if someone with a different voice tried the same phrase, the system would be stumped. The demo person would start talking about the great new version that would be available next year.
Hardened by this experience, I hesitate to say what I'm about to, but here it is: the great new version may have arrived -- or at least a significantly better version. It doesn't do what I dream of, yet, but it does do important things well.
People within the computing industry are mainly excited about the business potential of "embedded" voice-recognition technology. This ranges from the familiar speech options in voice-mail systems ("To keep holding forever, please press or say 'two'") to hand-held devices that will record spoken appointments or phone numbers. Embedded systems have a very wide range of potential uses, and they're technically easier to pull off than full "dictation" systems, which aspire to let the user say anything he might otherwise enter on a keyboard. They're easier because the options the system has to consider are limited: after the voice-mail system asks you to press or say "two," it doesn't have to be able to distinguish "two" from "to" or "too." It needs only to know that all of them, plus "dew" and "do," sound similar -- and different from "four," "for," and "pour" or "three," "tree," and "the."
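To see why the constrained case is so much more tractable, here is a minimal sketch (the phoneme spellings, bins, and routing function are my own invention for illustration, not any vendor's code) of how a voice menu can get away with lumping homophones together:

```python
# A voice menu never has to tell "two" from "to" or "too"; every word that
# sounds alike falls into the same bin, and only a handful of bins exist.
# The phoneme spellings below are rough and purely illustrative.
MENU_BINS = {
    "2": {"t uw", "d uw"},          # two, to, too, dew, do
    "3": {"th r iy", "t r iy"},     # three, tree
    "4": {"f ao r", "p ao r"},      # four, for, pour
}

def route_caller(heard):
    """Return the menu option whose sound bin the utterance falls into."""
    for option, sounds in MENU_BINS.items():
        if heard in sounds:
            return option
    return None  # sounded like none of the options; ask the caller to repeat

print(route_caller("d uw"))  # -> "2"
```

A full dictation system enjoys no such shortcut: it has to decide, for every utterance, which of tens of thousands of words the speaker actually meant.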
What I find exciting is the debut of the first plausible dictation technology. It comes from Dragon Systems, of Newton, Massachusetts, and it's called Dragon NaturallySpeaking. Dragon has been a small but admired contender in this field for more than a decade; this year it was acquired by Lernout & Hauspie, a Belgian firm that has battled IBM for overall leadership in commercial speech-recognition technology. With Version 5 of NaturallySpeaking, released in August, Lernout & Hauspie has gained an edge in dictation technology. Now I know that if my hands stopped working, I could still at least compose e-mail.
There are three leading dictation systems, and it's easy to try each one for yourself, because each comes with a thirty-day money-back guarantee. NaturallySpeaking Preferred costs $199; ViaVoice Advanced Edition, from IBM, costs $99.95; and Voice Xpress Advanced (which I did not review), also from Lernout & Hauspie, costs $79. What they offer and how they work are very similar. Each comes with a CD for installation, a detailed instruction manual (and on-screen tutorial), and a telephone-operator-style headset and microphone. You plug the headset cord into the sound card or audio port of your computer (something all modern systems have). The headset is designed to keep the microphone very close to your mouth, where it needs to be for accurate recognition.
Both programs require a lot of processing speed and disk space. They work better and faster if they can load most of their reference data onto your hard disk rather than having to read it from the CD, so you should have at least 300 megabytes of disk space free for installation. Both programs ran acceptably on my three-year-old Pentium II computer, but they are said to be significantly faster on a Pentium III, which includes advanced functions for sound processing. Each program requires you to begin by spending ten to thirty minutes reading sample text to the computer, so that it can be "trained" in the patterns of your voice, and each allows briefer, incremental training sessions to refine recognition as you go on.
The main difference between the programs, at least for me, is that Dragon's just works better. To be more precise, its recognition rate is high enough that I willingly made the small adjustments in my working style necessary to use the system. The payoff for learning to work the IBM system was too low. At the end of the first day I spent trying the Dragon program, it recognized nearly everything I said, and I had little trouble persuading it that some instructions -- for example, "go to end of line" -- were meant to control the program itself rather than to be typed out. ViaVoice and I seemed to be fighting each other, and after a week I put it away. Dragon has also been the consistent winner in computer-magazine reviews.
YOU would think that the trick to making these programs work is to speak slowly and separate each word from its neighbor. In fact the recognition rate goes down if you speak in an artificial way, because the analysis of each word depends on hearing it with its neighbors. The uh sound in English, which linguists call schwa, means little on its own, but in the words pronounced "I wannuh Coke" and spelled "I want a Coke," a good system recognizes the schwa as the word "a." David Leffell is a professor at the Yale medical school who began using Dragon two years ago and now uses it for most of his writing, from e-mail to journal articles. "I speak quickly," he told me in a (dictated) e-mail message, "and was delighted to discover the paradox that Dragon NaturallySpeaking actually doesn't work well with people who speak slowly. I have a colleague who has been unable to train his system because of his slower speech style."
Rather than slow speaking, what counts is using what I think of as a "radio voice" -- that is, pretending that you are a National Public Radio anchor and speaking as sonorously as possible, keeping your tongue dancing around in your mouth to enunciate all sounds properly and trying hard not to skip the syllables or entire words that people skip in normal speech. This takes practice, and you don't want to do it in a busy office, but it has some of the charms of singing in the shower. The more you use the program, the better it works, because each time you correct a mistake or use a new vocabulary word, it adjusts its "probability" models for converting sounds to words. The main peril of the program is that it requires exceptionally sharp proofreading, because it will drop words, insert words, or guess the wrong word and spell it perfectly -- so a spelling checker is no help.
How do the systems do it at all? The fundamental science of speech recognition is heavily mathematical, based on probability calculations and "information theory" -- the study of detecting meaningful patterns in murky, messed-up data. (The recent book Speech and Language Processing, by Daniel Jurafsky and James Martin, both of the University of Colorado, explains all this in 900 pages that move right along.) Speech-recognition software is comparable to image-enhancement systems, which infer what blurry photos would look like if the focus had been sharp. For speech recognition the blurry image is the series of sound waves a speaker produces; the goal is to figure out what sentence was most likely to have been the genesis of those sounds. "Most likely" is the best the programs can do, because so many different words and phrases are pronounced similarly ("I want a Ford or Chevy" / "I want a four-door Chevy") and speakers can pronounce the same phrase in so many different ways. The programs have become steadily more usable not as a result of any dramatic conceptual breakthrough but as a result of slow and steady improvement in probability calculations.
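In textbook terms -- this is the standard "noisy channel" formulation from the speech-recognition literature, not anything specific to Dragon's own code -- the program searches for the word sequence \(\hat{W}\) that is most probable given the acoustic signal \(A\):

\[
\hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\,P(W)
\]

The first factor asks how a given string of words would sound; the second asks how plausible that string is as a piece of English. The stages described below are, in effect, ways of estimating those two quantities.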
The process of guessing the most likely sentence has three stages. First, the computer captures the sound waves the speaker generates, tries to filter them from coughs, hmmmms, and meaningless background noise, and looks for the best match with the phonemes available. (A phoneme is the basic unit of the spoken word. The English t sound, for instance, is written as the phoneme /t/ and comes in at least half a dozen varieties, or "allophones," depending on whether the sound is aspirated, as in "toy"; unaspirated, as in "stamp"; dentalized, as in "breadth"; or present in one of several other forms.) Because people speak not in discrete words but in phrases, the next stage of recognition is to group a stream of phonemes into the most likely combination of words. The final stage is to evaluate all the possible sentences that might conceivably have produced a group of sounds and calculate which is the most likely possibility. The software judges what is likely using enormous databases of actual written and spoken language that the software designers have amassed, checking which words are likely to appear in the vicinity of which others.
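A toy version of that three-stage pipeline, written out in Python, may make the division of labor clearer. Every function name, phoneme spelling, and candidate list here is invented for illustration; real recognizers are vastly more elaborate.

```python
# A toy three-stage recognizer: sound waves -> phonemes -> candidate word
# sequences -> the single most probable sentence. Everything here is invented
# to show the shape of the computation, not any commercial product's code.

def acoustic_match(sound_frames):
    """Stage 1: map stretches of the sound wave to plausible phonemes,
    discarding coughs and background noise along the way."""
    return ["ay", "w", "aa", "n", "t", "ax", "k", "ow", "k"]  # "I wannuh Coke"

def group_into_words(phonemes):
    """Stage 2: propose the word sequences that could have produced
    this stream of phonemes."""
    return [
        ["I", "want", "a", "Coke"],
        ["eye", "want", "a", "Coke"],
        ["aye", "want", "a", "Coke"],
    ]

def sentence_score(words, language_model):
    """Stage 3: score a candidate against a database of real usage --
    how likely is each word, given the words just before it?"""
    score = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - 2):i])  # the two preceding words
        score *= language_model.get((context, word), 1e-6)
    return score

def recognize(sound_frames, language_model):
    phonemes = acoustic_match(sound_frames)
    candidates = group_into_words(phonemes)
    return max(candidates, key=lambda c: sentence_score(c, language_model))
```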
I never got a satisfactory answer from academic and corporate researchers to one question about the databases: Why, if the preponderance of analyzed material is in English, is speech recognition thought to work more or less equally well in a variety of languages? Until recently, of course, it didn't work very well in any language. The explanation I got was that the constant rise in computing speed has increased the practical value of databases. When probabilities are worked out on a word-by-word basis, they give limited guidance to recognition systems. People say "I" more often than "eye" or "aye," so a computer interpreting the single phoneme /ay/ would render it as the most likely choice: "I." But computers are now fast enough to perform "trigram analysis" on the incoming stream of phonemes -- to consider how probable each word is based on the two words before it, each of which has been judged most likely based on the two before it, and so on. This leads to guesses that are far more precise: "The skipper said aye," "I need a correction in my right eye," "The computer is from IBM."
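A toy version of that trigram step, again in Python, with made-up counts standing in for the designers' enormous databases (the words and numbers below are mine, chosen only to show the mechanics):

```python
# How often each word follows a given pair of words, as tallied from a
# fictional, tiny corpus. Real systems draw these counts from millions of
# sentences of written and spoken English.
TRIGRAM_COUNTS = {
    ("skipper", "said"): {"aye": 40, "I": 5, "eye": 1},
    ("my", "right"):     {"eye": 70, "I": 2, "aye": 0},
    ("is", "from"):      {"IBM": 55, "I": 1, "eye": 0},
}

def most_likely_word(prev_two, candidates):
    """Given the two preceding words, pick the candidate the corpus says
    is most likely to come next."""
    counts = TRIGRAM_COUNTS.get(prev_two, {})
    return max(candidates, key=lambda w: counts.get(w, 0))

# The three candidates sound identical; the surrounding words decide.
print(most_likely_word(("skipper", "said"), ["I", "eye", "aye"]))  # -> "aye"
print(most_likely_word(("my", "right"),     ["I", "eye", "aye"]))  # -> "eye"
```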
IS all this -- the designers' effort to create the program and the users' to learn how to take advantage of it -- worthwhile? Before I began this project, I was sure the answer would be no. Everyone involved with speech recognition stresses that the programs are not aimed at people who type a lot and can do it very fast. Rather they are meant as supplements for people who have a physical or a circumstantial reason to avoid typing, people who never learned to type well, people who need to dictate while driving or while their hands or eyes are occupied. I have no such reasons to give up typing, but I now view Dragon as a genuinely plausible alternative.
For example (and no doubt you've seen this coming), as a torture test I composed this entire article by dictation to Dragon. At a technical level the experience was surprisingly painless. Once I had the system "trained" to my voice, I often went for six or eight sentences without having to correct the transcription. That's a longer passage than I can type error-free, although it's faster to correct typos at the keyboard than by voice. To change "had" to "hat," for example, requires two keystrokes -- backspace and t; with Dragon you say, "Select 'had.' 'Hat.'" Both are easy, but speaking takes a few seconds more.
The technology worked well enough to allow me to shift my attention to loftier questions, especially about the connection between the means of composition and the style and content of thought. When computers first became widespread, many savants claimed that they would be the end of careful writing. If it was so easy to turn out so much copy, people wouldn't think before they wrote, and we would all go to hell. Bring back the pencil and the foolscap sheaf! (By the way, Dragon got "foolscap sheaf" right the first time, but it thought "go to hell" should be "go to help.") Writing may have indeed changed in the computer age, but the reasons have little to do with the means of composition. Instead they are the spread of e-mail -- which has replaced phone calls, not essay writing with a quill pen -- and the shorter attention spans encouraged by advertising, TV programming, and the Internet.
I found trying to compose aloud a far bigger shift than the one from typewriter to computer. Dictating prose would probably seem more natural to those used to writing with a pen or a pencil -- something I abandoned for the typewriter after fifth grade. What pen or pencil composition and speech recognition have in common is that you must think out much or all of the sentence before beginning to record it, to avoid the nuisance of writing it over again or of saying "Select line -- delete that" to Dragon. When composing at a computer I tend to type a sentence twenty different ways while figuring out what to do with it.
Like most people who have never begun a writing session by saying "Miss Jones, take a memo!," I've looked down on dictation, considering it suitable only for the most utilitarian documents. But for some people it can be liberating. David Leffell, who teaches dermatology at Yale, wrote a complete book by dictating most of the draft and revisions to Dragon. "I am used to dictating material to begin with, so this simply eliminates the transcription step," he told me by e-mail. "Voice dictation is a tool that shortens the distance between my neurons and the ink on the page. In that way, it gets us closer to the science fiction fantasy of a brain chip that automatically downloads our thoughts to a page."
That's how it works for him. But when it came time to revise this article, I found that I had to go back to the keyboard. It was too hard to think without moving my hands. I'm not sure I'd go through the exercise of dictating a draft again. But I am comforted to think that I could.
James Fallows is The Atlantic's national correspondent.