Watching kitten videos on YouTube was a mostly harmless activity until now: computer scientists from Georgetown University and the University of California, Berkeley, have discovered that it’s possible to control a smartphone via covert voice commands hidden in something as seemingly benign as a viral video of tabbies playing in a cardboard box.
Micah Sherr, a computer science professor at Georgetown, says the research was inspired by the proliferation of voice-controlled systems. "Amazon Echo was coming out when we started this work," Sherr notes.
Since then, Google has launched Google Home, a similar always-listening device, and electronic devices lost in a messy bedroom can now be recovered by speaking “Okay Google” or “Hey Siri.”
The new research shows how keeping such devices on always-listen mode could lead to a cyberattack. Sherr says a cybercriminal could attempt to plant malware on the device using a hidden voice command.
During the experiment, the researchers registered a random domain name and were able to command the devices to open that URL through distorted, “hidden” commands. If that URL contains malware, it's game-over, according to Sherr.
“If I can command your phone and go to a website that I can control, I can download malware that can do horrible things,” he says.
During the study, the researchers conducted two types of tests, black-box and white-box, on Apple and Android phones that were set to listening mode. The black-box test involved trying to send commands to a smartphone that uses proprietary voice-recognition software, such as Siri or Okay Google. One researcher trained the phone by speaking "OK Google" three times, and the attack commands were executed using voice recordings by another researcher.
In the black-box experiments, the researchers successfully breached the phones, both Apple and Android, that were running Okay Google. The good news, however, was that their attempts to crack Siri were mostly fruitless: those experiments yielded only “limited access” to the iPhones, says Sherr. “Apple Siri is far more conservative than Google in activating the speech recognition system,” he notes.
The white-box test involved sending voice commands to a computer running the open source voice recognition software Sphinx. The white-box test assumes the attacker has full knowledge of how the voice recognition system works and can examine how its machine learning or artificial intelligence component works, says Sherr. From there, he or she could create attack commands that are understandable by a computer, but not by humans.
The white-box attack was far more effective than the black-box attack, says Sherr. “It was able to produce audio that was more often interpreted correctly by the computer and more often not interpreted correctly by humans. However, the tradeoff here is that the white-box attack requires the attacker to know exactly how the speech recognition system works,” he says.
Sherr and his fellow researchers found that creating covert commands comes down to knowing the difference between what a computer needs to understand speech and what a human needs to understand speech. The vulnerability lies in exploiting that difference to create commands that a computer recognizes but that go unnoticed by the human ear.
“We don’t claim in the paper that it is very easy to launch an attack. A lot of things would have to line up for the attack to happen,” Sherr says. “We’re trying to raise awareness that these commands are possible.”
The researchers also studied how to defend against a voice attack. It would be helpful if devices running voice-command software emitted a tone whenever the device was being queried, Sherr says. The catch is that a user who is watching a video when the hidden attack happens might not hear the tone. The report goes into other defenses, including having the user vocally confirm a command, and speaker recognition, which could be successful assuming the threat actor doesn’t have a sample of the device owner’s voice.
“It’s pretty easy to get a recording of someone’s voice” online if they are featured in a YouTube or other video, for example, warns Sherr.