Silicon Valley is making the bet that, in the near future, we’ll talk to all of our tech. But as we rush toward that reality, beware: the mechanism used to listen to what we say and turn it into a command for a virtual personal assistant or smart home gadget might not be secure.
Security researchers have shown that they can generate an audio clip that sounds like innocuous speech but actually delivers a hidden command to a voice transcription system, like the ones used by Amazon’s Alexa and Apple’s Siri. The paper, posted on arXiv and not yet peer-reviewed, reports a 100% success rate in tricking Mozilla’s open-source DeepSpeech system. The Mozilla algorithm is freely available for anyone to download, though large tech companies like Google and Apple typically build their own systems on the same core technology.
With this system, an attacker could play an audio clip and issue undetectable commands to a speech recognition system or virtual personal assistant. Since nearly all smartphones ship with one of these assistants, and the assistants are often granted the power to transfer money and send text messages, this form of attack poses a potential threat to any smartphone user. In these first tests, the audio was fed directly into DeepSpeech rather than played out loud over a speaker.
Nicolas Carlini and David Wagner, the authors of the paper, built a machine learning system that takes two inputs: an audio file and a desired transcription. The system alters the audio file very slightly over and over again, trying to get the target voice recognition system to recognize bits and pieces of the desired transcription, while altering the audio as little as possible.
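The iterative procedure described above can be sketched at toy scale. The snippet below attacks a hypothetical linear "recognizer" (a deliberately simplified stand-in for DeepSpeech, which is a far more complex neural network), using gradient descent to find a small perturbation that flips the recognized "command" to a chosen target while penalizing the perturbation's size. Every name and parameter here is illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical stand-in for a speech recognizer: a fixed linear classifier
# that maps a 16-sample "audio" vector to one of three "commands".
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))  # recognizer weights (frozen; the attacker only knows them)

def recognize(audio):
    """Return the index of the command the recognizer hears."""
    return int(np.argmax(W @ audio))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attack(audio, target, steps=1000, lr=0.05, c=0.01):
    """Find a small delta so that recognize(audio + delta) == target.

    Gradient descent on: cross_entropy(target) + c * ||delta||^2,
    i.e. push the recognizer toward the target transcription while
    keeping the change to the audio as small as possible.
    """
    delta = np.zeros_like(audio)
    for _ in range(steps):
        p = softmax(W @ (audio + delta))
        # Gradient of cross-entropy w.r.t. the logits is (p - onehot(target)).
        grad_logits = p.copy()
        grad_logits[target] -= 1.0
        grad_delta = W.T @ grad_logits + 2 * c * delta
        delta -= lr * grad_delta
    return delta

audio = rng.normal(size=16)            # the "innocuous" clip
target = (recognize(audio) + 1) % 3    # any command other than the original
delta = attack(audio, target)
print("recognized before:", recognize(audio), "after:", recognize(audio + delta))
print("perturbation norm:", np.linalg.norm(delta),
      "audio norm:", np.linalg.norm(audio))
```

The real attack follows the same pattern but optimizes against DeepSpeech's CTC loss over full transcriptions, which is why the perturbation can be spread out so thinly that it sounds like faint static.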
The resulting attack just sounds like a bit of static under normal speech when heard by a human, almost like a bad phone connection. The attack can also be hidden even more effectively in music. The two audio clips below, from the paper’s website, each direct Mozilla’s DeepSpeech algorithm to transcribe “okay google browse to evil dot com.”
This isn’t to say we’re all doomed to have attackers constantly whispering secret commands into our digital assistants’ ears. These samples don’t yield the same results when played out loud over a speaker, but that limitation may not last: the paper’s authors note that similar techniques for fooling image-recognition algorithms were also first demonstrated inside a computer, and were later found to work when captured by a digital camera.
There are simple things that smartphone makers can do to mitigate the threat as well, like asking a user for confirmation before navigating to a website or sending a text.