Sending Inaudible Commands to Voice Assistants

Researchers have demonstrated the ability to send inaudible commands to voice assistants like Alexa, Siri, and Google Assistant.

Over the last two years, researchers in China and the United States have begun demonstrating that they can send hidden commands that are undetectable to the human ear to Apple's Siri, Amazon's Alexa and Google's Assistant. Inside university labs, the researchers have been able to secretly activate the artificial intelligence systems on smartphones and smart speakers, making them dial phone numbers or open websites. In the wrong hands, the technology could be used to unlock doors, wire money or buy stuff online -- simply with music playing over the radio.

A group of students from University of California, Berkeley, and Georgetown University showed in 2016 that they could hide commands in white noise played over loudspeakers and through YouTube videos to get smart devices to turn on airplane mode or open a website.

This month, some of those Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon's Echo speaker might hear an instruction to add something to your shopping list.

Posted on May 15, 2018 at 6:13 AM • 36 Comments

Comments

Clive Robinson • May 15, 2018 6:57 AM

There are two currently known ways to attack these devices,

1, Out of band.
2, In band.

Out of band attacks allow the use of speech waveforms that a human could understand, but frequency shifted out of the human audible range. They can happen because the samole rate and filtering values are set up incorrectly in the device. Thus a signal above the human hearing range is either processed directly or, due to sampling issues, "down converted" to a lower audio frequency range inside the device, which makes it audible to the device.

In band signalling is a more interesting process in that it reveals quite substantial differences between the way humans interpret sounds into intelligible information and the way device algorithms do. Put way over simply, it is possible to find a number of input signals that sound vastly different to humans but that, when they go through the device algorithms, end up producing the same intelligible information for later processing stages.

Obviously with a little thought it can be seen how to make hybrid systems that use various aspects of both in band and out of band signalling to produce considerably more ways to achieve waveforms that are unintelligible to humans but intelligible to devices.
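
The "down converted" case above is ordinary aliasing, and the arithmetic is easy to sketch. What follows is a minimal Python illustration rather than a description of any particular device: the 16 kHz sample rate and the function names are assumptions picked for the example, and it covers only the sampling route, not other mechanisms such as microphone non-linearity.

```python
import math


def alias_frequency(f_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling with no anti-alias filter."""
    folded = f_hz % sample_rate_hz                 # sampling cannot tell f from f + k*fs
    return min(folded, sample_rate_hz - folded)    # reflect into the range 0 .. fs/2


def sample_tone(f_hz, sample_rate_hz, count):
    """Return `count` samples of a unit-amplitude cosine at f_hz."""
    return [math.cos(2.0 * math.pi * f_hz * n / sample_rate_hz) for n in range(count)]


if __name__ == "__main__":
    fs = 16000.0  # assumed sample rate, chosen only for illustration
    for tone in (1000.0, 17000.0, 31000.0):
        print(tone, "Hz is seen as", alias_frequency(tone, fs), "Hz")
```

At this assumed rate a 17 kHz tone, which is near or above the limit of hearing for many adults, produces the same samples as a 1 kHz tone. That is why the analogue filter ahead of the converter matters: once the samples are taken, software can no longer tell the two apart.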

Clive Robinson • May 15, 2018 7:32 AM

@ All,

This is not the only problem Alexa has been having. Over the past three months Alexa has been making bone chilling spooky laughs, apparently without being woken up,

https://www.theverge.com/circuitbreaker/2018/3/7/17092334/amazon-alexa-devices-strange-laughter

Maybe somebody has been testing an attack method to freak people out ;-)

One thought that occurs is "snorers asleep": it is fairly well known that a snoring person rarely gets woken by their own snores but will get woken by others' snores and noises around their residence.

It may be possible that some snores "wake Alexa" etc and cause the problem....

On another note, both the articles @Bruce links to talk about "white noise" activating these devices. Technically that is highly unlikely. Put simply, white noise has a specific set of characteristics when used in a formal technical sense. However most humans treat many forms of weighted noise sounds the same as white noise.

So what you do is use certain short lengths of weighted noise signal that are specifically designed to cause the filters in the device to falsely decode the noise as a word or even a phoneme. You then string these differently weighted short lengths of noise together to make the required words. Whilst it is noise to the human ear, it is not "white noise" when viewed on suitable quality test instruments.

Dan H • May 15, 2018 7:55 AM

Even though the auditory sounds are undetectable to the human ear, they are still using a frequency. Couldn't phones have some sort of jamming device built in that only allows for specific frequencies?

Rick Lobrecht • May 15, 2018 9:29 AM

The in band attack sounds like auditory steganography. Here it is being used to attack personal assistants, but I wonder if similar techniques could be used to hide messages.

Interesting research.

I have Hey Siri turned off on all our devices. I do find it convenient to use Siri to search on our Apple TV, and can see the allure of having an always listening assistant. We can't trust the leaders (Google, Amazon) in this space, and now we can't trust the audio that they may be listening to either.

Paul Gregoire • May 15, 2018 10:10 AM

Device manufacturers could most certainly limit the listen ranges to those of human speech or better yet, human hearing, but one would assume that they do not, so that they may use the extended ranges for their own AD purposes. I'm not a tin-foil-hat person, but they make vast sums on advertising and I can't see them giving up a cash-cow.

Clive Robinson • May 15, 2018 10:37 AM

@ Paul Gregoire,

"Device manufacturers could most certainly limit the listen ranges to those of human speech or better yet, human hearing, but one would assume that they do not, so that they may use the extended ranges for their own AD purposes."

The audio device manufacturers have had these high audio frequency ranges and sampling rates, long before the likes of Alexa etc were thought of.

The primary reason originally was "home musicians"; however, the need for greater audio bandwidth and higher sampling rates is now being pushed by the "instrumentation" and "Software Defined Radio" markets. Things will improve further when Microsoft actually get their act together and start supporting the higher standards as Linux and other *nix's currently do out of the box.

CallMeLateForSupper • May 15, 2018 10:42 AM

@Clive
"[...] the samole rate and filtering values are setup incorrectly [...]"

That "threw" this EE. Thirty-nine years ago I beat my way through "Signals & Systems 321", hating every lecture and homework assignment. A tiny fraction of the course work stuck but most dropped into Write-Only-Memory (the same virtual oubliette where we carefully segregate, e.g., disastrous relationships).

But now all the angst has returned, because of a term that I should... um... a term that ought to "ring a faint bell", but doesn't: "samole rate". I actually Wikipedia'd it; and DDGed it; and then Startpage'd it.... all to no avail. What is this new "rate"? If *they* don't know either, then maybe it would be more profitable to back away and go wash the breakfast dishes.

"Yet in these thoughts myself almost despising
Haply I ..."
notice that replacing the "o" in "samole" with "p" solves my mystery.

Truth be told, I look forward to your treatments here and am better off for studying them. I am writing this to confess that, for the first time, one of your dear typos caused - not seconds of re-reading - but minutes of scurrying. You da man. :-)

Connor • May 15, 2018 10:43 AM

My guess would be that the manufacturers would want this ability. That way, they could have a TV commercial where the actors say, "Alexa, order me dinner" and the commercial would have an inaudible sound that would prevent the device from activating in the room. Otherwise, you'd have the devices all over the world calling in at once.

War Geek • May 15, 2018 11:46 AM

Aha! I finally understand the Cuban Embassy Migraines!

It was a canny Cuban black Op to prank the American and Canadian embassies. Using the dolphin attack to ultrasonically whisper to all their cell phones and Alexas in order to have them all buy metric tons of politically themed toilet paper with their least favorite American presidential candidate printed on them!

Alyer Babtu • May 15, 2018 12:06 PM

Following a link at boingboing, some interesting papers at the site openai.com on the general problem of “mistaken” AI output, and attendant security risks. E.g. blog.openai.com/adversarial-example-research

Angry Primate • May 15, 2018 12:09 PM

Reason #75 why I will *NEVEREVEREVEREVER* allow one of these devices in my house, and 75, sub-paragraph (b) why I turn off Siri and its ilk on all devices.

neill • May 15, 2018 1:57 PM

if memory serves me right, Nielsen TV ratings in the U.S. are generated via 'listening' to the room noise, and recording audio signatures that, when analyzed, reveal which TV program is being watched

possible that there are embedded "subliminal" messages not only for machines (to improve accuracy) but also humans ... (buy me, buy me, buy me)

PeaceHead • May 15, 2018 3:06 PM

Oh geez,

Thanks for this deep info...

I knew something like this was a potential risk, but I didn't know it had progressed to this point of anti-sophistication.

On the plus side, I'm a trained recording engineer, so I guess I got some significant homework to do.
Filters/editors/filtration and DSP still have a future, of course.

I am no longer linking to my music page, due to this topic and its ZTE-ish timing.
Instead, here's something more important: https://www.dontbankonthebomb.com/2018_report/

--> May Peacefulness Prevail Within All Realms of Existence

Bauke Jan Douma • May 15, 2018 6:29 PM

@CallMeLateForSupper
Yes, Clive's misspelling algorithm has its idiosyncrasies.

Bauke Jan Douma • May 15, 2018 6:38 PM

@Neill
No sir, these TV ratings are generated from monitoring peaks in a country's toilet water demand (i.e. flushes) and their synchronization with ad blocks.

Anon Y. Mouse • May 15, 2018 7:19 PM

Ai-yi-yi. Part of the horror of the novel "1984" was the thought
of having surveillance devices ("telescreens") covering not only
public spaces, but every part of private residences. Today it is
not through government mandate, rather people willingly installing
these devices in their homes.

Didn't anybody learn the lesson of OnStar? If memory serves, it was
over fifteen years ago that the FBI got a warrant to clandestinely
listen in to suspects in their cars through that service. It will
only be a matter of time before that happens with Siri, Alexa, et alia.
If not law enforcement, then possibly hackers or rogue employees.

As one example, would you want to have a confidential consultation with
a lawyer, in person or on the phone, while they were in the room with
one of these devices?

Kaosagnt • May 15, 2018 7:28 PM

Remember the old POTS telephone system? About 3 kHz of bandwidth was enough to distinguish a person's voice. Therefore learn from the past and filter the incoming signal with a bandpass filter from 300 Hz to 3.3 kHz. Of course the designers of Siri, Alexa and that Google thing will have to make their software work with this change...
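
For anyone curious what that band limit looks like in software, here is a rough Python sketch. It is an illustration only: the 48 kHz sample rate, the function names and the choice of simple first-order sections are all assumptions, and a real product would want much steeper filters. It also only helps if it sits ahead of whatever stage shifts an out of band signal down into the voice band; content that has already landed between 300 Hz and 3.3 kHz passes straight through.

```python
import math


def band_limit(samples, sample_rate_hz, low_hz=300.0, high_hz=3300.0):
    """First-order high-pass at low_hz followed by a first-order low-pass at high_hz."""
    dt = 1.0 / sample_rate_hz
    rc_hp = 1.0 / (2.0 * math.pi * low_hz)
    rc_lp = 1.0 / (2.0 * math.pi * high_hz)
    a_hp = rc_hp / (rc_hp + dt)   # high-pass smoothing coefficient
    a_lp = dt / (rc_lp + dt)      # low-pass smoothing coefficient
    out = []
    prev_x = 0.0
    hp = 0.0
    lp = 0.0
    for x in samples:
        hp = a_hp * (hp + x - prev_x)   # removes rumble below low_hz
        prev_x = x
        lp += a_lp * (hp - lp)          # attenuates content above high_hz
        out.append(lp)
    return out


def tone_gain(freq_hz, sample_rate_hz=48000.0, count=9600):
    """Steady-state RMS gain of band_limit() for a sine wave at freq_hz."""
    tone = [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz) for n in range(count)]
    filtered = band_limit(tone, sample_rate_hz)
    half = count // 2                   # skip the start-up transient

    def rms(values):
        return math.sqrt(sum(v * v for v in values) / len(values))

    return rms(filtered[half:]) / rms(tone[half:])


if __name__ == "__main__":
    for f in (50.0, 1000.0, 20000.0):
        print(f, "Hz gain:", round(tone_gain(f), 3))
```

With these gentle first-order sections a 1 kHz tone passes nearly untouched while 50 Hz and 20 kHz tones are cut to a fraction of their level, which shows the idea even though it is far from a brick wall.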

Fearful Primate • May 15, 2018 7:33 PM

@Angry Primate

"I will *NEVEREVEREVEREVER* allow one of these devices in my house"
"I turn off Siri and its ilk on all devices."

May I presume then, that you only are able to turn off Siri and its ilk on devices that are *outside* of your house? :)

PeaceHead • May 16, 2018 1:21 AM

By the way, what are the voice command syntaxes to turn off all of the aforementioned devices?

I don't own any, but it could be useful to know how to verbally turn all of them off (if possible)...
So, for example, when entering a room, delivering several powerdown/poweroff/shutdown/shutoff/quit/halt/end/exit/disable types of compound commands via recording or just verbally (with or without a soundalike :).

I like the idea of kicking them to the curb before they start any worse troubles...

echo • May 16, 2018 5:02 AM

@Anon Y. Mouse

UK state sector (and state sector funded) forget "capture it all" works both ways. I have witnessed enough malpractice (malpractice in public office, malfeasance, unethical and unprofessional behaviour and failure of standards etcetera) that my policy is now to record everything.

If the public knew what happened behind closed doors and what is actively covered up and how they play the system I can only guess at the outcome.

I was punched by one UK police officer within approximately the past three months while his squad looked the other way. "Professional standards" looked the other way. I will be very surprised if there is a record on the system. This time the police didn't even pretend to investigate. I have also experienced cameras being turned on and off to edit the narrative of a situation so police harassment was not captured.

What if an external investigating body with community oversight had the power to silently turn on and off police cameras, or force the camera to keep running? Nobody thinks of this and the question is why not?

PACE is now a sham.

James • May 16, 2018 5:10 AM

Using such devices is plain stupid. It's allowing a listening device in your house and paying for it. The inaudible commands are the least of the problem. They have microphones and some even cameras, they are connected and they are black boxes. Sure, they have "safeguards", like bugs in software have not been found before... Not to mention that the vendor can always push a firmware update that can change all functionality. A while back when someone wanted to listen they needed to break in, install covert devices, quite a lot of work ... now people bring in the listening devices themselves, and pay for them. That's a lot of trust ...

But most people have nothing to hide anyway, so who cares ? Those gadgets are cool.

Clive Robinson • May 16, 2018 7:45 AM

@ CallMeLate...,

"I am writing this to confess that, for the first time, one of your dear typos caused - not seconds of re-reading - but minutes of scurrying. You da man. :-)"

I'm sorry that my "fat finger" problems on the phone touch screen caused you some angst.

The old eyes here are not what they once were, especially when I'm tired[1].

But you might have noticed my typos come in three flavours,

1, Fat Finger issues.
2, Spelling mistake issues.
3, Wrong word issues.

The latter of which has caused some inadvertent humour over the years... Especially when I've done it by a fat finger issue on the suggested word spell checker, where "bare and bear" get swapped, thus "Rupert the bear" takes on a whole different persona.

[1] I've been having fatigue issues since coming out of hospital having been admitted for sepsis. Which got to the point where my fever was boiling my brains and causing unpleasant twitching and spasms.

Clive Robinson • May 16, 2018 7:55 AM

@ echo,

So when is somebody going to build an ORAC?

Oh dear that caused a flash back to sets more wobbly than Dr Who, and terrible filming of a certain University not that far from Leeds.

That said, Blake's Seven did provide some amusement, if not enjoyment, at the time.

echo • May 16, 2018 8:29 AM

@James

I have voice activated everything turned off. I say "off". A bit is flipped and a light turns from green to red (or the equivalent) but does off mean off?

I rarely carry my phone when out. I have been using a phone as a clock and occasional watch. I don't need my brain niggling me while trying to get to sleep or feeling like I am on the end of a dog lead so am investing in a new watch and clocks. I have been meaning to do so for ages and kept forgetting so just bought them on impulse. I also need a decent egg timer. The one I have is so useless I never use it, not that cooking needs millisecond accuracy and a furiously active net connected supercomputer either. I don't know. Maybe it's just age or something but there is a thing about the analogue. I have been using computers since school and the older I get the more I want something with a real knob on it, not a plasticky monument to anodyne corporate egomania.

James • May 16, 2018 8:46 AM

@echo: well you can never be sure it is really off until you unplug it. Hell, some of those things can even have batteries... In theory when you turn it off it should be off, but what do you do about exploits, programming "errors" or not so nice firmware updates ? The manufacturer can always push a firmware that makes the damn thing look off... There was something a while back about LG or Samsung smart TVs that used to record audio and send it to their servers ... They eventually said it was a "mistake". The capability for those things being nasty is there. As I said, there is a lot of trust involved...

echo • May 16, 2018 10:23 AM

@James

Yes, we have to balance the overkeen and genuine mistakes (which include degrees of professional negligence through to out of character stupidity) and planned malice. This can be very difficult to determine until after the event and even then not necessarily.

I just slung out my old CRT television which has been unplugged and gathering dust since the stone age. I'm sad to see it go as it is still a quality and functioning item. Nobody apart from an avid enthusiast or museum would want it. As much a marvel as new televisions are I don't feel the same about them. As expensive as it was I much preferred a long return trip by door to door full service Mercedes taxi yesterday. It actually had real knobs on the doors for the seat controls. Not everything is a slab of plastic! Yes, I know it's probably a pressure sensitive piezo electric thingy whatsit but like the old Model M keyboard and my "design classic" laptops at least I have the illusion of real keys.

The "end point" is obviously a thing on this blog. Design also extends beyond the surface to the whole experience. As much as security encompasses human rights perhaps designers should more properly consider security too? It can be difficult when the human is lost in the process.

D-503 • May 16, 2018 2:02 PM

I'm surprised no one has mentioned that audio clip that has been circulating around the mainstream media lately.
You know, the one where someone says "Yanny" loud and clear, no ambiguity about it. No way anyone could hear anything different.
Except, when I look online, there are self-appointed experts who insist that the audio clip says "Laurel".
Even worse, the person sitting next to me when I listened to the audio clip insisted the word was unambiguously "Laurel". She filed that one under "men just don't listen."
What's going on here?
Part of it is that every human has more sensitivity to some sound frequencies than others, and this frequency spectrum varies from individual to individual.
Another part of it is that pattern-matching methods often include a self-reinforcing (positive feedback) mechanism. So, for example, you see either one animal, or the other, but not both at the same time.
A third part of it is that you hear what you're primed to hear. Another example here. And a paper on a related topic.

albert • May 16, 2018 2:11 PM

@Clive, @etc.,

A few years ago, I acquired an off-brand Windows netbook (which soon became a Linux netbook:) It was small enough to fit on my nightstand and let me listen to streamed music at night. I was excited until I tried it. The sound quality was awful. There was nothing audible below ~300-400Hz*. I thought I'd check out the IC in the audio section, and I found that the chip they used featured the full 20-20kHz spec. I think they crippled the output with a simple high pass filter. I never pursued that issue.

Now, why would an Alexa need the full frequency range when 300-3000Hz is all that's required for voice communication?

----------
* Music requires at minimum 50Hz to 15kHz, 20-20k is considered high quality.
. .. . .. --- ....

Clive Robinson • May 16, 2018 4:38 PM

@ albert,

"Now, why would an Alexa need the full frequency range when 300-3000Hz is all that's required for voice communication?"

Believe it or not it's often to improve the signal quality with respect to background noise.

Speech is somewhat "predictive", as are most of the other sound sources in your home. Thus with sufficient fidelity and matched filters you can pull wanted signals up and push unwanted signals down with respect to the noise floor.

I'm away from the dead tree cave at the moment, otherwise I could give you a number of references.

So instead I'll go with a simple explanation as to what's been happening with "signal prediction" over the past few decades...

If you have a high fidelity signal stored in memory you can "tune the filters" to it by running them over and over the stored signal with slightly different settings. You can then use various algorithms to get a better match to a stored assumed signal, such as the last time a particular person said "Alexa" etc.

In effect you run a highly malleable filter set over the stored input signal and take the estimated signal features and compare them to known signals. You then use the difference to further tune the filters. In effect the filters appear to take a drunkard's walk towards the most likely stored reference signal.

However these algorithms are not inherently stable, especially those that use positive feedback for faster matching. Thus they can optimise to the wrong word; that is, a noise signal such as "unvoiced noise" can end up looking like a known voiced signal and a false positive occurs...

The better the quality of the stored input signal, especially the dynamic range, then supposedly the better the match... That said, there are two ways to improve the dynamic range. The first is to have a better quality A to D converter with more bits. The second is to take lots of readings with an A to D with fewer bits of resolution, then average a number of readings. Provided the noise on the readings is uncorrelated and around one LSB or more, each time you quadruple the number of readings that get averaged you effectively get an extra bit of resolution (each doubling is worth about half a bit). The averaging also has a secondary effect of reducing random noise by a similar amount.
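
The averaging trick is easy to try numerically. The Python below is a rough sketch and not anyone's converter firmware: the noise level (Gaussian, 1 LSB RMS), the test input and the function names are all assumptions made for the example. For uncorrelated noise, averaging N readings cuts the RMS error by about the square root of N, which is the commonly quoted rule of roughly half a bit per doubling, or one extra bit for every fourfold increase in readings.

```python
import math
import random


def averaged_reading(true_value_lsb, n_readings, rng, noise_sigma_lsb=1.0):
    """Average n_readings from an ideal quantiser that rounds to whole LSBs."""
    total = 0
    for _ in range(n_readings):
        # The noise acts as dither: without it every reading would round the same way.
        total += round(true_value_lsb + rng.gauss(0.0, noise_sigma_lsb))
    return total / n_readings


def rms_error(n_readings, trials=400, seed=1):
    """RMS error, in LSBs, of the averaged reading over many repeated trials."""
    rng = random.Random(seed)
    true_value = 0.3  # assumed input, deliberately between two quantiser steps
    squared = 0.0
    for _ in range(trials):
        error = averaged_reading(true_value, n_readings, rng) - true_value
        squared += error * error
    return math.sqrt(squared / trials)


def extra_bits(n_readings):
    """Rule-of-thumb resolution gain from averaging n_readings noisy samples."""
    return 0.5 * math.log2(n_readings)


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, "readings: rms error", round(rms_error(n), 3),
              "LSB, rule of thumb gain", extra_bits(n), "bits")
```

The simulated error should shrink by roughly a factor of two each time the number of readings is quadrupled. Note the catch built into the sketch: if the noise is much smaller than one LSB, the readings all quantise to the same code and averaging gains nothing.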

Kaosagnt • May 16, 2018 7:49 PM

@PeaceHead
"By the way, what are the voice command syntaxes to turn off all of the aforementioned devices?"

From memory I think it should be:**

"Destruct sequence 1, code 1-1 A."
"Destruct sequence 2, code 1-1 A-2B."
"Destruct sequence 3, code 1 B-2B-3."

The device will reply:

"Destruct sequence completed and engaged. Awaiting final code for (time interval) countdown."

Next sequence should be:
"Code zero zero zero. Destruct. Zero."

That should be enough to disable all the aforementioned devices.....

** http://memory-alpha.wikia.com/wiki/Auto-destruct

Alyer Babtu • May 16, 2018 9:02 PM

The root cause of the problems of these devices is an insufficient attention to the polished gems of wisdom acquired painfully over ages concerning the necessity of good bureaucracy. No complex organization, of which these are examples, survives without it.

Alyer Babtu • May 17, 2018 3:22 AM

Or as another person remarked (in a different context), “let all things be done decently and in order”. Instead, we have an entertain/tech ingenious/shiny/utopia/now free-for-all. Real brilliance requires serious constraints. Otherwise invention just becomes narcissism.


Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.