When AI Can Transcribe Everything

Tech companies are rapidly developing tools to save people from the drudgery of typing out conversations—and the impact could be profound.

A typewriter and a tablet at a book fair in Frankfurt, Germany (Kai Pfaffenbach / Reuters)

What is the best way to describe Rupert Murdoch having a foam pie thrown at his face? This wasn’t much of a problem for the world’s press, who were content to run articles depicting the incident during the media mogul’s testimony at a 2011 parliamentary committee hearing as everything from high drama to low comedy. It was another matter for the hearing’s official transcriptionist. Typically, a transcriptionist’s job only involves typing out the words as they were actually said. After the pie attack—either by choice or hemmed in by the conventions of house style—the transcriptionist decided to go the simplest route by marking it as an “[interruption].”

Across professional fields, a whole multitude of conversations—meetings, interviews, and conference calls—need to be transcribed and recorded for future reference. This can be a daily, onerous task, but for those willing to pay, the job can be outsourced to a professional transcription service. The service, in turn, will employ staff to transcribe audio files remotely or, as in my own couple of months in the profession, attend meetings to type out what is said in real time.

Despite the recent emergence of browser-based transcription aids, transcription has remained an area of drudgery in the modern Western economy where machines couldn’t quite squeeze human beings out of the equation. That is, until last year, when Microsoft built one that could.

Automatic speech recognition, or ASR, is an area that has gripped the firm’s chief speech scientist, Xuedong Huang, since he entered a doctoral program at Scotland’s Edinburgh University. “I’d just left China,” he says, remembering the difficulty he had in using his undergraduate knowledge of American English to parse the Scottish brogue of his lecturers. “I wished every lecturer and every professor, when they talked in the classroom, could have subtitles.”

In order to reach that kind of real-time service, Huang and his team would first have to create a program capable of retrospective transcription. Advances in artificial intelligence allowed them to employ a technique called deep learning, wherein a program is trained to recognize patterns from vast amounts of data. Huang and his colleagues used their software to transcribe the NIST 2000 CTS test set, a bundle of recorded conversations that’s served as the benchmark for speech recognition work for more than 20 years. The error rates of professional transcriptionists in reproducing two different portions of the test are 5.9 and 11.3 percent. The system built by the team at Microsoft edged past both.

“It wasn’t a real-time system,” acknowledges Huang. “It was very much like we wanted to see, with all the horsepower we have, what is the limit. But the real-time system is not that far off.”

Indeed, the promise of ASR programs capable of accurately transcribing interviews or meetings as they happen no longer seems so outlandish. At Microsoft’s Build conference last month, the company’s vice-president, Harry Shum, demonstrated a PowerPoint transcription service that would allow the spoken words of the presentation to be tied to individual slides. The firm is also in a close race with the likes of Apple and Google to perfect the transcripts produced by its real-time mobile translation app.

Huang believes the point at which transcription software will overtake human capabilities is open to interpretation. “The definition of a perfect result would be controversial,” he says, citing the error rates among human transcriptionists. “How ‘perfect’ this is depends on the scenario and the application.”

An ASR system tasked with transcribing speech in real time is only deemed successful if every word is interpreted correctly, something that largely has been achieved with mobile assistants like Cortana and Siri, but has yet to be mastered in real-time translation apps. However, a growing number of computer scientists are realizing that standards do not need to be as high when it comes to the automatic transcription of recorded audio, where any mistakes in the text can be amended after the fact.

Two companies—Trint, a start-up in London, and Baidu, the Chinese internet giant with an application called SwiftScribe—have begun to offer browser-based tools that can convert recordings of up to an hour into text with a word-error rate of 5 percent or less.* On the page, their output looks very similar to the raw documents I typed out in real time during the many meetings I attended as a freelance transcriptionist: at best, a Joycean stream-of-consciousness marvel, and at worst, gobbledygook. But by turning the user from a scribe into an editor, both programs can shave hours off an onerous and distracting task.
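For readers curious about what a word-error rate measures, here is a minimal sketch in Python. It assumes the usual textbook definition: the smallest number of word substitutions, deletions, and insertions needed to turn the machine's output into the reference transcript, divided by the number of words in the reference. The function name and the simple lowercase-and-split handling of text are illustrative assumptions, not the scoring procedure used by Trint, Baidu, Microsoft, or the NIST benchmark, all of which apply their own text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real evaluations normalize punctuation,
    numbers, and contractions before counting errors.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a 5 percent rate means roughly one wrong, missing, or extra word for every 20 words spoken. Because extra words count as errors too, the figure can exceed 100 percent when the output is much longer than what was actually said.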

The amount of time saved, of course, is contingent on the quality of the audio. Trint and SwiftScribe tend to make short work of face-to-face interviews with the bare minimum of ambient noise, but struggle to transcribe recordings of crowded rooms, telephone interviews with bad reception, or anyone who speaks with an accent that isn’t American or British English. My attempt to run a recording of a German-accented speaker through Trint, for example, saw the engine interpret “it was rather cold, but the atmosphere was great” as “That heart is also all barf. Yes. His first face.”

“We don’t claim that this turnaround in a couple of minutes of an interview like this is perfect,” says Jeff Kofman, Trint’s CEO. “But, with good audio, it can be close to perfect. You can search it, you can hear it, you [can] find the errors, and you know within seconds what was actually said.”

According to Kofman, most of the people using Trint are journalists, followed by academics doing qualitative research and clients in business and healthcare—in other words, professions expected to transcribe a large volume of audio on tight deadlines. That’s in keeping with the anonymized data on user behavior being collected by the developer Ryan Prenger and his colleagues at SwiftScribe. While there is a long tail of users who Prenger speculates are simply AI enthusiasts eager to test out SwiftScribe’s capabilities, he’s also spotted several “power users” that are running audio through the program on almost a daily basis. It’s left him optimistic about the range of people the tool could attract as ASR technology continues to improve.

“That’s the thing with transcription technology in general,” says Prenger. “Once the accuracy gets above a certain bar, everyone will probably start doing their transcriptions that way, at least for the first several rounds.” He predicts that, ultimately, automated transcription tools will increase both the supply of and the demand for transcripts. “There could be a virtuous circle where more people expect more of their audio that they produce to be transcribed, because it’s now cheaper and easier to get things transcribed quickly. And so, it becomes the standard to transcribe everything.”

It’s a future that Trint is consciously maneuvering itself to exploit. The company just raised $3.1 million in seed money to fund its next round of expansion. Kofman and his team plan to demonstrate its capabilities later this month at the Global Editors Network in Vienna. Their aim is to have the transcription of the event’s keynote address up on the Washington Post’s website within the hour.

It’s difficult to predict precisely what this new order could look like, although casualties are expected. The stenographer would likely join the ranks of the costermonger and the iceman in the list of forgotten professions. Journalists could spend more time reporting and writing, aided by a plethora of assistive writing tools, while detectives could analyze the contradictions in suspect testimony earlier. Captioning on YouTube videos could be standard, while radio shows and podcasts could become accessible to the hard of hearing on a mass scale. Calls to acquaintances, friends, and old flames could be archived and searched in the same way that social-media messages and emails are, or intercepted and hoarded by law-enforcement agencies.

For Huang, transcription is just one of a whole range of changes that ASR is set to bring about, changes that will fundamentally reshape society and that can already be glimpsed in voice assistants like Cortana, Siri, and Amazon’s Alexa. “The next wave, clearly, is beyond the devices that you have to touch,” he says, envisioning computing technology discreetly woven into a range of working environments. “UI technology that can free people from being tethered to the device will be in the front and center.”

For the moment, however, the engineers behind automated transcribers will have to content themselves with more germane users: the journalist sweating a deadline, or the transcriptionist working out the right way to describe a man being pied in a parliamentary select committee.

* This article originally stated that SwiftScribe is a subsidiary of Baidu. We regret the error.