When Is a Caption Close Enough?

YouTube’s notoriously nonsensical auto-captions are improving. But there’s a deeper problem.

The actor Nick Offerman stares angrily at the camera, with a caption that says "Laughs."
NBC / Alamy / The Atlantic

In March, Rikki Poynter flew to Orlando for a YouTuber convention. The event, Playlist Live, boasted a roster of performers who had collectively racked up billions of views: a 16-year-old who put elastic bands around a pumpkin until it exploded, the twins who played the evil stepsisters on Jane the Virgin, a guy who pranked people from inside a snowman suit. Poynter, whose own YouTube channel has more than 85,000 subscribers, was invited to speak as part of a panel on mental health. But she also arrived with a message for her fellow internet celebrities: Your video captions are terrible.

Since 2015, Poynter, a deaf 28-year-old, has built her following around a campaign she calls #NoMoreCRAPtions. In her videos, she asks YouTubers to ditch the automatic captions that YouTube itself generates, which are notorious for delivering run-on sentences studded with nonsensical or occasionally obscene phrases. Take Jimmy Kimmel Live’s Guillermo Rodriguez knocking back a shot of maple syrup and cheering “Old Anna!” (He’s actually saying “O Canada!”) Or the influencer Emma Chamberlain declaring that once her plane lands, “I’ll be embarrassed.” (She’s saying “I’ll be in Paris.”) One video from Playlist Live rendered “Check out their booth” as “Check out their boobies.”

In hundreds of videos, tweets, messages, and handwritten letters, Poynter has urged creators to write their own captions or employ professionals to help get the job done right. She’s hardly a techno-skeptic: Like many people who identify as deaf or Deaf (deaf describes a medical condition; Deaf denotes a cultural identity), she has embraced social media to connect with others in her community. But #NoMoreCRAPtions highlights how impressive advances in assistive technologies such as automatic captioning can obscure these technologies’ imperfections. The campaign is Poynter’s way of pushing back against any misguided notion that deaf people live in a technological future that hasn’t yet arrived for everyone else.

For television broadcasts, the Federal Communications Commission oversees captioning, a legal requirement under the Americans With Disabilities Act; the same legislation requires movie theaters to make closed captioning available for most films. YouTube, however, doesn’t fall under FCC jurisdiction, and has so far eluded regulation under the ADA. While film and television employ fleets of human stenocaptioners and voice writers to ensure that captions meet the required standards, YouTube’s automatic captions (visible when viewers click a button at the bottom of a video) are created by speech-recognition software. Trained on a vast database of language gleaned from the web, the software uses probability to guide its selection of words and phrases: If the two previous words were daily exercise, the next word is more likely to be routine than orphan.
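To make the "daily exercise" example concrete, here is a minimal sketch of probability-guided word selection using a toy trigram model. It is an illustration only, not YouTube's actual system: the tiny corpus, the function names, and the simple counting approach are all assumptions made for this example, and real speech recognizers combine far larger language models with acoustic evidence.

```python
from collections import Counter, defaultdict

# Toy training text, invented for illustration (an assumption, not real data).
corpus = (
    "my daily exercise routine is short . "
    "her daily exercise routine is long . "
    "a daily exercise habit helps . "
    "the orphan found a home ."
).split()


def train_trigrams(tokens):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        counts[(first, second)][third] += 1
    return counts


def next_word_probability(counts, context, word):
    """Estimate P(word | two previous words) from raw counts."""
    following = counts.get(context, Counter())
    total = sum(following.values())
    return following[word] / total if total else 0.0


model = train_trigrams(corpus)
p_routine = next_word_probability(model, ("daily", "exercise"), "routine")
p_orphan = next_word_probability(model, ("daily", "exercise"), "orphan")

# In this toy corpus "routine" follows "daily exercise" twice and "orphan" never does.
print(p_routine > p_orphan)  # True
```

A model like this will happily pick a common phrase over the one actually spoken, which is one way "booth" can become something else entirely.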

Poynter insists that proper captions should not only fit the right words to a video’s audio content—a feat that automation struggles to achieve—but also use correct grammar and punctuation, describe sounds like the eerie creak of a door or the crackle of gunfire, and differentiate between speakers so deaf audiences know who’s talking. Although the term craptions predates Poynter’s campaign, her efforts have struck a chord, drawing international media attention and setting off waves of advocacy by hearing YouTube stars. After watching one of her videos in 2015, the LGBTQ-rights activist Tyler Oakley posted a video for his 6 million subscribers explaining how to add captions and why they matter. Recently, deaf fans rejoiced when, in a Facebook comment, the British radio personality Phil Lester announced his commitment to better captioning.

“Until Rikki had reached out, captioning my videos had not crossed my mind, as I know that YouTube auto-captions all videos,” Emma Lock, also known as Emzotic, a British former zookeeper whose channel features tips on caring for giant snails and bearded dragons, told me over email.

Even with the wave of good will, though, it’s hard to tell who has truly committed to Poynter’s message. She says that many creators caption for a while before quietly letting it slide. (A representative for Oakley told me that since 2017, all his videos have been manually captioned by a paid transcription service. Lester did not respond to a request for comment.) YouTubers sometimes balk at the time and money it takes to create their own captions or to hire others to do so. Some creators make a healthy living from their videos, but most struggle to churn out content and attract sponsors, often while working other jobs.

In a few of her videos, Poynter offers viewers a promo code to get 10 minutes of professional captioning for free from a company called Rev (the service Oakley uses). One of a number of players in the cheap-captions market, Rev charges consumers a dollar per minute of audio and promises an accuracy rate of 99 percent. Message boards where transcriptionists and captioners gather tend to paint a grim picture of life as a Rev freelancer, however. “A monkey at the zoo gets paid better than you,” someone wrote on the anonymous employer-review site Indeed.com. (Rev’s CEO and co-founder, Jason Chicola, says that transcriptionists are generally paid 50 cents per minute of finished audio, and defended the pay rate as adequate for someone who can caption quickly and accurately.)

YouTubers who can’t or won’t pay for a service like Rev can turn to crowdsourcing. YouTube’s community-contributions feature allows all viewers to submit their own captions for a video, with the permission of the video’s creator. But when the feature first launched, some popular channels quickly filled up with emoticons, LOLs, and false captions that commented on the action on-screen. Some channels became known for their dissonant captions, like the comedy channel Markiplier. YouTube responded by instituting a vetting process. Before publication, proposed captions can now be flagged by other contributors, which triggers a review by YouTube’s human moderators (automated systems also assess the reliability of contributors in order to filter out pranksters). By default, it’s the crowd that performs quality control: When enough votes of approval have piled up, captions go live regardless of whether the creator has reviewed them.

To satisfy the demands of #NoMoreCRAPtions, YouTube could, in theory, require that all video creators add their own captions or hire someone to do so. Liat Kaver, a product manager at Google, which owns YouTube, pointed out that such a rule wouldn’t necessarily ensure that video creators would do a good job. It also might exclude some creators, such as “people who are illiterate, but upload videos to YouTube,” and “citizen journalism or creators that are registering events as they happen (e.g. Arab Spring),” Kaver, who is herself deaf and uses a cochlear implant, told me over email.

A second option would be for YouTube itself to hire people to caption videos. When I asked whether Google would consider doing so to voluntarily meet ADA requirements, a spokesperson responded that the company had nothing further to add.

The other—and perhaps most likely—possibility is that technology will simply catch up. Auto-captions, Poynter notes, are actually getting better, pushed forward in part by a growing demand for captioning from businesses with international workforces for whom English is a second language. Rev, the captioning company, now offers its own automatic speech-recognition software on its sister site, Temi, for a lower price than its human-captioner services. The company still extols the superiority of human transcriptionists, but claims that computer-generated transcripts are becoming competitive with those made by people.

This rapid improvement of auto-captions gets at the tensions #NoMoreCRAPtions brings to light. While assistive technologies do end up working wonders for some deaf people, these technologies can overpromise effectiveness and convenience. Hearing aids, for instance, can fail to capture certain frequencies. Cochlear implants, for those who choose to use them, usually require years of auditory-verbal training. (Some people see the cochlear implant as tantamount to cultural and linguistic genocide.) The ADA mandates that people with disabilities should have equal access to public goods and services, but regarding captions, it’s difficult to say exactly what that means. How accurate is accurate enough? Rev’s human captioners are held to a standard of 99 percent accuracy, while its automatic captions achieve an accuracy rate of about 85 to 90 percent. Google declined to provide a number, but told me that YouTube’s speech-recognition accuracy improved by 50 percent from 2015 to 2017.
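Neither Rev nor Google explained how these figures are calculated. One common yardstick in speech recognition is word error rate: the number of word substitutions, insertions, and deletions needed to turn the caption into the correct transcript, divided by the length of the correct transcript. The sketch below assumes that metric (the function name `word_error_rate` is hypothetical) and applies it to two of the miscaptions quoted earlier.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, computed over words instead of characters.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # a reference word is missing from the caption
                current[j - 1] + 1,      # the caption has an extra word
                previous[j - 1] + cost,  # words match, or one replaces the other
            ))
        previous = current
    return previous[-1] / len(ref)


booth = word_error_rate("Check out their booth", "Check out their boobies")
paris = word_error_rate("I'll be in Paris", "I'll be embarrassed")
print(booth)  # 0.25
print(paris)  # 0.5
```

By this measure the "booth" caption is 75 percent accurate, which shows how little a percentage says about whether the remaining errors are harmless or humiliating.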

The FCC has no real metrics for measuring accuracy. A spokesperson told me that the agency has been stymied by a lack of public consensus on how to set quantitative standards. The National Association of the Deaf, meanwhile, “believes that technology will progress to the point where automated captioning will meet acceptable levels of accuracy, but also believes that it will be many years before such an acceptable accuracy rate is achieved,” the association’s CEO, Howard A. Rosenblum, told me over email. The association is currently suing Harvard and MIT over lack of captioning on their e-learning videos posted on YouTube. While the platform itself is not subject to ADA regulations, the association contends that content posted there by certain businesses and organizations is.

As YouTube’s auto-captions move into the realm of good enough—coherent, but still flawed—hearing individuals and institutions might marvel at assistive technology to the point that they consider the problem of access to online videos solved. But good enough, of course, isn’t the same as equal. Poynter runs up against this conflict often, even in the real world. In advance of her panel at Playlist Live, she asked the festival to provide a professional transcriber to caption the talk in real time to help her follow along. The day she arrived in Orlando, the captioner canceled. Poynter says that Playlist made a significant effort to find another transcriber, but no one was available. (The festival did not respond to requests for comment.)

Poynter ultimately had to rely on American Sign Language interpreters, even though she’s more comfortable following a conversation with transcriptions. “It sucks, but … what can you do?” Poynter tells her audience in a video dispatch from the edge of the bed in her hotel room. “It’s something. Hopefully it’ll get better.”