Automatic speech recognition, or ASR, is an area that has gripped the firm’s chief speech scientist, Xuedong Huang, since he entered a doctoral program at Scotland’s Edinburgh University. “I’d just left China,” he says, remembering the difficulty he had in using his undergraduate knowledge of the American English to parse the Scottish brogue of his lecturers. “I wished every lecturer and every professor, when they talked in the classroom, could have subtitles.”
In order to reach that kind of real-time service, Huang and his team would first have to create a program capable of retrospective transcription. Advances in artificial intelligence allowed them to employ a technique called deep learning, wherein a program is trained to recognize patterns from vast amounts of data. Huang and his colleagues used their software to transcribe the NIST 2000 CTS test set, a bundle of recorded conversations that’s served as the benchmark for speech recognition work for more than 20 years. The error rates of professional transcriptionists in reproducing two different portions of the test are 5.9 and 11.3 percent. The system built by the team at Microsoft edged past both.
“It wasn’t a real-time system,” acknowledges Huang. “It was very much like we wanted to see, with all the horsepower we have, what is the limit. But the real-time system is not that far off.”
Indeed, the promise of ASR programs capable of accurately transcribing interviews or meetings as they happen no longer seems so outlandish. At Microsoft’s Build conference last month, the company’s vice-president, Harry Shum, demonstrated a PowerPoint transcription service that would allow the spoken words of the presentation to be tied to individual slides. The firm is also in a close race with the likes of Apple and Google to perfect the transcripts produced by its real-time mobile translation app.
Huang believes the point at which transcription software will overtake human capabilities is open to interpretation. “The definition of a perfect result would be controversial,” he says, citing the error rates among human transcriptionists. “How ‘perfect’ this is depends on the scenario and the application.”
An ASR system tasked with transcribing speech in real time is only deemed successful if every word is interpreted correctly, something that largely has been achieved with mobile assistants like Cortana and Siri, but has yet to be mastered in real-time translation apps. However, a growing number of computer scientists are realizing that standards do not need to be as high when it comes to the automatic transcription of recorded audio, where any mistakes in the text can be amended after the fact.
Two companies—Trint, a start-up in London, and Baidu, the Chinese internet giant with an application called SwiftScribe—have begun to offer browser-based tools that can convert recordings of up to an hour into text with a word-error rate of 5 percent or less.* On the page, their output looks very similar to the raw documents I typed out in real-time during the many meetings I attended as a freelance transcriptionist: at best, a Joycean stream-of-consciousness marvel, and at worst, gobbledygook. But by turning the user from a scribe into an editor, both programs can shave hours off an onerous and distracting task.