For years I knew exactly what a computer would have to do to make itself twice as useful as it already was. It would have to show that it could accurately convert the sound of spoken language to typed-up text. I had a specific chore in mind for such a machine. I would give it the tape recordings I make during interviews or while attending speeches, and it would give me back a transcript of who said what. This would save the two or three hours it takes to listen to and type up each hour's worth of recorded material.
This machine would have advantages for other people, too. It would help groups that want minutes of their meetings or brainstorming sessions, legal professionals who need quick transcripts of what just happened at trials, students in big lecture halls, people who want to dictate e-mail while stuck in traffic, and those who, owing to disability or stress injury, are not able to type.
For years I despaired that such a machine would ever exist. The demonstrations I saw at computer shows, starting in the mid-1980s, left me with the impression that the speech-text barrier in technology was as formidable as the blood-brain barrier long seemed to be in medicine. At the shows the creator of each new system would carefully utter a phrase, which the computer would faithfully render on its screen. But if someone in the audience asked to see the computer handle a different phrase, or if someone with a different voice tried the same phrase, the system would be stumped. The demo person would start talking about the great new version that would be available next year.
Hardened by this experience, I hesitate to say what I'm about to, but here it is: the great new version may have arrived -- or at least a significantly better version. It doesn't do what I dream of, yet, but it does do important things well.
People within the computing industry are mainly excited about the business potential of "embedded" voice-recognition technology. This ranges from the familiar speech options in voice-mail systems ("To keep holding forever, please press or say 'two'") to hand-held devices that will record spoken appointments or phone numbers. Embedded systems have a very wide range of potential uses, and they're technically easier to pull off than full "dictation" systems, which aspire to let the user say anything he might otherwise enter on a keyboard. They're easier because the options the system has to consider are limited: after the voice-mail system asks you to press or say "two," it doesn't have to be able to distinguish "two" from "to" or "too." It needs only to know that all of them, plus "dew" and "do," sound similar -- and different from "four," "for," and "pour" or "three," "tree," and "the."
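To make the point concrete, here is a minimal sketch in Python of how a limited menu sidesteps the homophone problem. It is an illustration only, not how any real voice-mail product works: the names MENU_OPTIONS and choose_option are hypothetical, the sound-alike groups are simply the examples given above, and the sketch assumes the sound has already been reduced to a candidate word, which is the hard part.

```python
# Hypothetical sound-alike groups for a three-option phone menu.
# The groupings are the examples from the text; all names are illustrative.
MENU_OPTIONS = {
    "2": {"two", "to", "too", "dew", "do"},
    "3": {"three", "tree", "the"},
    "4": {"four", "for", "pour"},
}


def choose_option(heard_word):
    """Return the menu option whose sound-alike group contains heard_word."""
    word = heard_word.strip().lower()
    for option, sound_alikes in MENU_OPTIONS.items():
        if word in sound_alikes:
            return option
    # Nothing matched; a real system would presumably ask the caller again.
    return None
```

The menu never has to decide whether the caller meant "two," "to," or "too"; it only has to decide which group the sound falls into. A dictation system gets no such shortcut, because it must choose the one right spelling.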
What I find exciting is the debut of the first plausible dictation technology. It comes from Dragon Systems, of Newton, Massachusetts, and it's called Dragon NaturallySpeaking. Dragon has been a small but admired contender in this field for more than a decade; this year it was acquired by Lernout & Hauspie, a Belgian firm that has battled IBM for overall leadership in commercial speech-recognition technology. With Version 5 of NaturallySpeaking, released in August, Lernout & Hauspie has gained an edge in dictation technology. Now I know that if my hands stopped working, I could still at least compose e-mail.
There are three leading dictation systems, and it's easy to try each one for yourself, because each comes with a thirty-day money-back guarantee. NaturallySpeaking Preferred costs $199; ViaVoice Advanced Edition, from IBM, costs $99.95; and Voice Xpress Advanced (which I did not review), also from Lernout & Hauspie, costs $79. What they offer and how they work are very similar. Each comes with a CD for installation, a detailed instruction manual (and on-screen tutorial), and a telephone-operator-style headset and microphone. You plug the headset cord into the sound card or audio port of your computer (something all modern systems have). The headset is designed to keep the microphone very close to your mouth, where it needs to be for accurate recognition.
Both programs I tested, NaturallySpeaking and ViaVoice, require a lot of processing speed and disk space. They work better and faster if they can load most of their reference data onto your hard disk rather than having to read it from the CD, so you should have at least 300 megabytes of disk space free for installation. Both ran acceptably on my three-year-old Pentium II computer, but they are said to be significantly faster on a Pentium III, which includes advanced functions for sound processing. Each program requires you to begin by spending ten to thirty minutes reading sample text to the computer, so that it can be "trained" in the patterns of your voice, and each allows briefer, incremental training sessions to refine recognition as you go on.
The main difference between the programs, at least for me, is that Dragon's just works better. To be more precise, its recognition rate is high enough that I willingly made the small adjustments in my working style necessary to use the system. The payoff for learning to work the IBM system was too low. At the end of the first day I spent trying the Dragon program, it recognized nearly everything I said, and I had little trouble persuading it that some instructions -- for example, "go to end of line" -- were meant to control the program itself rather than to be typed out. ViaVoice and I seemed to be fighting each other, and after a week I put it away. Dragon has also been the consistent winner in computer-magazine reviews.