Here’s a fun experiment: Next time you’re on a crowded bus, loudly announce, “Hey Siri! Text mom, ‘I'm pregnant.’” Chances are you’ll get some horrified looks as your voice awakens iPhones in nearby commuters’ pockets and bags. They’ll dive for their phones to cancel your command.
But what if there were a way to talk to phones with sounds other than words? Unless the phones’ owners were prompted for confirmation, and realized what was going on in time to intervene, they’d have no idea that anything was being texted on their behalf.
Turns out there’s a gap between the kinds of sounds that people and computers understand as human speech. Last summer, a group of Ph.D. candidates at Georgetown and Berkeley exploited that gap: They developed a way to create voice commands that computers can parse—but that sound like meaningless noise to humans. These “hidden voice commands,” as the researchers called them, can deliver a message to Google Assistant-enabled Android phones nearby through bursts of what sounds like scratchy static.
For the commands to work, the speaker that broadcasts them has to be nearby: The researchers found that commands became ineffective at a distance of about 12 feet. But that doesn’t mean someone has to be conspicuously close to a device for their hidden-command attack to succeed. A message could be encoded into the background of a popular YouTube video, for example, or broadcast on the radio or TV.
(Earlier this month, a local news report about a young child who ordered a dollhouse through Amazon’s voice assistant triggered Amazon Echo devices sitting near viewers’ TVs to place the same order during the segment.)
The primary way people interact with smartphones is by touching them. That’s why smartphone screens can be thoroughly locked down, requiring a passcode or thumbprint to access. But voice is becoming an increasingly important interface, too, turning devices into always-listening assistants ready to take on any task their owner yells their way. Put in Apple’s new wireless earphones, and Siri becomes your point of contact for interacting with your smartphone without taking it out of your pocket or bag.
The more sensors get packed into our ubiquitous pocket-computers, the more avenues someone can use to control them. (In the field of security research, this is known as an ‘increased attack surface.’) Microphones can be hijacked with ultrasonic tones for market research. Cameras can receive messages from rapidly flickering lights, which can be used for surveillance and connectivity or even to disable or alter a phone’s features.
Most assistants include some safeguards against overheard or malicious commands. The phrase I suggested you shout out earlier will prompt phones within earshot to ask for confirmation. Siri, for example, will read back the contents of the text or tweet a user dictates before actually sending it off. But a determined attacker could conceivably defeat the confirmation, too. All it would take is a simple “yes” before a device’s owner realizes what’s going on and says “no.”
Hidden voice commands can cause more damage than just a false text or silly tweet. An iPhone whose owner has already linked Siri to a Venmo account, for example, will send money in response to a spoken instruction. Or a voice command could tell a device to visit a website that automatically downloads malware.
The researchers developed two different sets of hidden commands to work on two different types of victims. One set was created to work on Google Assistant, which is challenging to hoodwink because the inner workings of how it processes human speech aren’t public. To start, the researchers used obfuscating algorithms to make computer-spoken commands less recognizable to human ears but still understandable by the digital assistants. They kept iterating until they found the sweet spot where the audio was least recognizable to people but most consistently picked up by the devices.
The resulting hidden commands aren’t complete gibberish. They mostly just sound like they’re spoken by a fearsome demon rather than your average human.
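The search for that sweet spot can be pictured as a simple loop: try many levels of obfuscation, discard the ones the device no longer recognizes reliably, and keep the one that people understand least. The sketch below is illustrative only and is not the researchers’ code. The function find_sweet_spot, the 90 percent cutoff, and the two toy scoring functions are all assumptions; in a real experiment the scores would come from playing audio to a device and to human listeners.

```python
from typing import Callable, Iterable, Optional, Tuple


def find_sweet_spot(
    strengths: Iterable[float],
    machine_rate: Callable[[float], float],
    human_rate: Callable[[float], float],
    min_machine_rate: float = 0.9,  # assumed cutoff, not a figure from the study
) -> Optional[Tuple[float, float, float]]:
    """Return (strength, machine_rate, human_rate) for the obfuscation level
    that humans understand least while the device still recognizes it at
    least min_machine_rate of the time. Returns None if nothing qualifies."""
    best = None
    for s in strengths:
        m = machine_rate(s)
        h = human_rate(s)
        if m < min_machine_rate:
            continue  # the device no longer picks this one up reliably
        if best is None or h < best[2]:
            best = (s, m, h)
    return best


# Toy stand-ins (hypothetical): device accuracy falls off slowly as the
# audio is mangled, human comprehension falls off quickly.
def toy_machine_rate(strength: float) -> float:
    return max(0.0, 1.0 - strength ** 3)


def toy_human_rate(strength: float) -> float:
    return max(0.0, 1.0 - strength)


strengths = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
result = find_sweet_spot(strengths, toy_machine_rate, toy_human_rate)
print(result)
```

With these made-up curves, the loop settles on the strongest obfuscation that the toy device still recognizes at least 90 percent of the time.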
If you know you’re about to hear a masked voice command, you’re probably more likely to be able to parse it. So to avoid those priming effects, the Georgetown and Berkeley researchers enlisted Americans through Mechanical Turk, Amazon’s service for hiring workers for small projects, to listen to the original and garbled commands and write down what they heard.
The difference between man and machine was most pronounced with the simple command “Okay, Google.” When it was said normally, people and devices understood it about 90 percent of the time. But when the command was processed and masked, humans could understand it only about 20 percent of the time, while Google Assistant actually got better at understanding it, interpreting it correctly 95 percent of the time. (The effects were less drastic with “Turn on airplane mode”: Human understanding dropped from 69 to 24 percent when the command was masked, and device accuracy fell from 75 to 45 percent.)
When a colleague and I tried out the researchers’ voice commands on an Android phone and an iPhone running the Google app, we had limited success. “Okay, Google” seemed to work more than the other hidden commands, but “What is my current location” got us everything from “rate my current location” to “Frank Ocean.” That may be in part because we were playing YouTube recordings from our laptops, which likely degraded their quality.
The researchers also developed attacks designed not for Google Assistant but for an open-source speech-recognition program, whose code they could peruse in order to tailor their hidden voice commands as closely as possible to satisfy the algorithm. The resulting audio clips sound less demonic and more like white noise. Most are truly indecipherable, even if you know you’re listening for words: Not a single Mechanical Turk worker could piece together even half of the words in these obfuscated commands.
And if you don’t know you’re listening for words, you might not even know what just happened. When the researchers put a hidden phrase in between two human-spoken phrases and asked the Mechanical Turk workers to transcribe the entire thing, only one quarter even tried to transcribe the middle phrase.
After they set about tricking digital assistants, the team of researchers brainstormed ways to improve defenses against attacks like theirs. A simple notification isn’t enough, they determined, because it’s easily ignored or drowned out by other noise. A confirmation is a bit better, but it can be defeated by another hidden command. And speaker-recognition technology, which would ostensibly teach a device to recognize and only respond to its owner’s voice, is often inaccurate and requires cumbersome training.
The best options, they concluded, are machine-learning solutions that either try to ensure that a speaker is actually human by analyzing certain characteristics in a spoken command, or that filter each command through a process that slightly degrades the quality of incoming instructions. In the latter case, already-garbled “hidden” instructions would become too distorted to be recognized, but human speech would still be intelligible, the thinking goes.
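The logic of the degrading filter can be shown with a toy model. It assumes, purely for illustration, that every command has a single “clarity” score, that the recognizer accepts anything above a fixed threshold, and that the filter subtracts a fixed amount from that score. None of these names or numbers come from the research; they only make the reasoning concrete.

```python
from dataclasses import dataclass

RECOGNITION_THRESHOLD = 0.5  # assumed: minimum clarity the recognizer needs
DEGRADATION = 0.2            # assumed: clarity lost to the defensive filter


@dataclass
class Command:
    text: str
    clarity: float  # hypothetical 0-to-1 score of how cleanly the audio decodes


def degrade(cmd: Command) -> Command:
    """Simulate the filter by lowering the clarity of the incoming audio."""
    return Command(cmd.text, max(0.0, cmd.clarity - DEGRADATION))


def recognized(cmd: Command) -> bool:
    """The toy recognizer accepts anything at or above the threshold."""
    return cmd.clarity >= RECOGNITION_THRESHOLD


def accept(cmd: Command) -> bool:
    """Recognize a command only after it has passed through the filter."""
    return recognized(degrade(cmd))


human = Command("turn on airplane mode", clarity=0.9)   # ordinary speech
hidden = Command("turn on airplane mode", clarity=0.6)  # barely decodable

print(recognized(hidden))  # without the filter
print(accept(human))       # with the filter
print(accept(hidden))      # with the filter
```

Ordinary speech starts with a comfortable margin and survives the filter; the hidden command, already near the edge of what the recognizer can decode, does not.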
But if filters make it harder for devices to understand humans, even a little bit, companies might be reluctant to apply them. For frustrated users whose digital assistants rarely understand them, less accuracy could be a deal breaker.
Before voice assistants start taking on more sensitive operations, however (making large bank transfers, for example, or even just tweeting photos), they will need to become more adept at fending off attackers. Otherwise, an anonymous, satanic voice in a YouTube video could cause a lot more damage than a shouted command on a crowded bus.