Hacking Voice Assistant Systems with Inaudible Voice Commands

Turns out that all the major voice assistants (Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana, and Alexa) listen at audio frequencies the human ear can’t hear. Hackers can hijack those systems with commands that their owners can’t hear.

News articles.

Tags: academic papers, hacking, iPhone, Samsung, voice recognition

Posted on September 13, 2017 at 6:03 AM • 28 Comments

Comments

Rostislav • September 13, 2017 6:58 AM

To be precise, those systems are not listening for inaudible frequencies. Microphone hardware is slightly nonlinear, which allows specially crafted sound to be demodulated to audible frequencies inside the microphone circuits. This is more of a side-channel attack, like getting an encryption key from power consumption data, but the other way round.

Julie Brandon • September 13, 2017 7:44 AM

Please forgive me, as it’s been a few decades since my degree, and I’ve committed the mortal sin of not reading the entire paper, but surely any non-linear audio-reflective objects or surfaces that these ultrasonic sounds might reach will potentially demodulate the signal and make it audible to any nearby humans? I seem to recall there are actually directed-audio experiments going on that use exactly this mechanism?

Darryl Daugherty • September 13, 2017 9:04 AM

The low-tech exploit would be to walk through densely populated areas with a tape loop giving instructions to dial the phone and call an expensive toll line for betting tips, astrology, etc. that one controls.

Clive Robinson • September 13, 2017 9:11 AM

@ Julie,

but sure any non-linear audio-reflective objects or surfaces that these ultrasonic sounds might reach will potentially demodulate the signal and make it audible to any nearby humans?

The human ear does not “sample” unlike digital audio systems.

Clive Robinson • September 13, 2017 11:01 AM

@ Rostislav,

To be precise, those systems are not listening for inaudible frequencies.

Actually they are at the input.

Most small surface area membrane microphones do not have a very good low frequency response. There are various ways to solve this: one is acoustic matching, the other is a frequency-response-correcting circuit in the microphone’s detection circuit.

Because acoustic pipe engineering is not something that translates easily into the case of a phone or other handheld device, the usual choice is to correct the frequency response of the membrane inside the microphone capsule with its detector circuit.

The frequency response of the membrane and that of the detector circuit will not be perfect inverses of each other, thus you will not get a flat frequency response across the entire audio band. It will be broadly flat between 300 Hz and 4-5 kHz, which is generally more than sufficient for human speech. Above 5 kHz the frequency response will not be very flat and will almost certainly have peaks and roll-offs. Thus the reality is that the microphone will have quite a poor high frequency roll-off until sufficiently above the Nyquist frequency of the sampling circuit.

The faithfulness the microphone will have to producing a sinewave out for a sinewave in is to do not just with the linearity of the detector amplifier but also with the mechanical characteristics of the membrane and its mountings. Whilst the human ear is relatively sensitive to differences in frequency, it mainly cannot pick up nonlinear behaviour in the response curve or any phase distortion either. Thus a pure sinewave in can come out with a very high harmonic content, and will tend to sound harsh or cold with odd harmonic distortion whilst warm with even harmonic distortion (which is why guitar amplifiers based on valves/tubes sounded warmer, a point not lost on drummer Jim Marshall and his guitarist customers).

There is a lot of myth about the Nyquist frequency, one of which is that it defines the maximum frequency that can be accepted by a sampling system. If people can be bothered to get a piece of graph paper out they can see that a sampling system will respond to frequencies above the Nyquist frequency. It’s just that it will not respond to them in a linear way, and also it will fold the frequency of the signal around the sampling frequency in a process called “aliasing”. It’s just like an RF signal in a double balanced mixer in a superhet radio (it’s also a similar reason you get the upper and lower sidebands on an AM signal). The simple averaged frequency response of a sampling system is quoted as “sinc(x) = sin x / x” [1] which you can draw out with the use of a calculator. From this it can be seen that a sinewave above the sampling frequency will appear at the output as being no different to a sinewave appropriately below the sampling frequency.
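A minimal sketch of the aliasing point, in Python with only the standard library. The 8 kHz sampling rate and the 1 kHz and 9 kHz tones are illustrative assumptions, not figures from the paper, and the helper names are made up for this example. It only shows that, with no filter in front of the sampler, a tone one full sampling frequency higher yields the same samples:

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Frequency in [0, fs/2] that a sampled tone at f_hz is indistinguishable from."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))

def sample_tone(f_hz, fs_hz, n_samples):
    """Ideal (unfiltered) samples of a unit-amplitude cosine at f_hz."""
    return [math.cos(2 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]

FS = 8000                          # assumed sampling rate in Hz
low = sample_tone(1000, FS, 64)
high = sample_tone(9000, FS, 64)   # 1000 Hz plus one full sampling frequency

# Largest sample-by-sample difference between the two tones
# (anything left over is floating-point noise).
max_diff = max(abs(a - b) for a, b in zip(low, high))
print(alias_frequency(9000, FS), max_diff < 1e-9)
```

Once sampled, nothing downstream can tell the two tones apart, which is why the filtering has to happen before the sampler.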

However there is also the “square law” to consider; it is why you get harmonic distortion in systems. The most well known example is the response of a semiconductor junction such as a diode that is used to “envelope demodulate” AM signals. What you have is the x^2 term, which in effect rectifies the input, and when the output is integrated by a low pass filter the modulating signal that is the envelope is recovered. As it is the envelope waveform, not the carrier waveform, it is “inband” to the following circuits.

[1] There are actually two sinc functions in regular use, the unnormalised sin x / x and the normalised sin(πx) / (πx).
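A rough sketch of the square-law point, in Python with only the standard library. Every number here (192 kHz simulation rate, 30 kHz carrier, 1 kHz tone, 50% modulation depth) is an assumption chosen for illustration, and the nonlinearity is modelled as a bare x^2 term, which is far cruder than a real microphone. Measuring the 1 kHz component directly stands in for the low pass filter:

```python
import cmath
import math

FS = 192_000   # simulation rate in Hz (assumed; high enough to represent the carrier)
FC = 30_000    # ultrasonic carrier in Hz (assumed)
FM = 1_000     # audible modulating tone in Hz (assumed)
N = 1_920      # 10 ms of samples: a whole number of cycles of every tone involved

def tone_amplitude(x, f_hz):
    """Amplitude of the f_hz component of x, via a single-bin DFT."""
    acc = sum(v * cmath.exp(-2j * math.pi * f_hz * n / FS) for n, v in enumerate(x))
    return 2 * abs(acc) / len(x)

# AM signal: the envelope (1 + 0.5*cos) rides on the ultrasonic carrier.
am = [(1 + 0.5 * math.cos(2 * math.pi * FM * n / FS))
      * math.cos(2 * math.pi * FC * n / FS)
      for n in range(N)]

# Model the nonlinearity as a pure square-law term.
squared = [v * v for v in am]

before = tone_amplitude(am, FM)       # 1 kHz content of the signal as transmitted
after = tone_amplitude(squared, FM)   # 1 kHz content after the square-law term
print(f"{before:.6f} {after:.6f}")
```

The transmitted signal has energy only at 29, 30 and 31 kHz, so `before` is essentially zero; after squaring, the envelope shows up at 1 kHz, inside the band the following circuits care about.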

Scared • September 13, 2017 11:05 AM

@Clive,
so are you saying they generate a signal (>20 kHz presumably) that gets under-sampled and thereby shows up in the normal voice frequency range? Pretty amazing if the device designers didn’t put in at least a crude anti-aliasing filter…

Can’t read the original pdf, my browser says it’s an insecure site.

Scared • September 13, 2017 11:06 AM

Oops, got my question answered while I was typing it….

kevin • September 13, 2017 11:36 AM

@Clive Robinson

The microphones used in modern smart phones actually do have flat frequency response over the audible spectrum. Indeed, the cheapest way to build a palm-sized audio analysis lab is to use a smart phone and an a-a application. (I use an iPhone and Studio Six Digital (http://studiosixdigital.com), but there are several other alternatives.)

In fact, I’ve had a chance to compare this combination to a lab-grade real-time spectrum analyzer. The two agree to within a dB or two from 20 Hz to over 15 kHz.

Andrew • September 13, 2017 12:27 PM

I don’t have any of those enumerated devices because all 10 of my others (laptops, phones, TVs) are already listening to me in the audible spectrum.

albert • September 13, 2017 1:57 PM

@Julie, @Clive,

There is/was a company that sells systems that use super-audible frequencies to generate audible frequencies based on the frequency difference between two high-frequency sources, i.e., 20 kHz and 22 kHz would yield a 2 kHz result. If you knew (or could determine) the ‘carrier’ frequency of the attack, you could detect audible signals by generating a frequency close to the carrier. It follows that you could foil the attack by generating one or more carrier frequencies of your own. (?)
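The difference-frequency arithmetic can be written out, assuming the nonlinearity is modelled as a simple squaring term (an assumption for illustration; real transducers, air, and microphones are messier):

```latex
\begin{align*}
\bigl(\cos 2\pi f_1 t + \cos 2\pi f_2 t\bigr)^2
  &= 1 + \tfrac{1}{2}\cos 4\pi f_1 t + \tfrac{1}{2}\cos 4\pi f_2 t \\
  &\quad + \cos 2\pi (f_1 + f_2)\, t + \cos 2\pi (f_1 - f_2)\, t
\end{align*}
```

With f_1 = 22 kHz and f_2 = 20 kHz, the last term is the 2 kHz tone; the remaining terms sit at DC and at 40, 42 and 44 kHz, all outside the audible band.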

I’m not sure how important this attack vector is. It requires proximity, and an ‘activation’ phrase.

. .. . .. — ….

Drone • September 13, 2017 2:09 PM

@Clive Robinson: “The human ear does not “sample” unlike digital audio systems.”

Actually, the way humans hear does employ a measure of dynamic sampling in both the temporal (time) and complex-spectral (frequency and phase) domains. But you would be correct to say these “samples” bear little resemblance to our (crude by comparison) method of discrete time-and-amplitude machine sampling (digital recording). If I’m not mistaken, the relevant field of study regarding how humans hear and perceive sound is called Human Psychoacoustics.

Notice that I said “humans” hear, not that “ears” hear. While some processing (both passive and active) does take place in and at the ear, the ear acts more like a sensor. The majority of signal processing that constitutes meaningful “human hearing” seems to take place in both the neural pathways from the ear to the brain, and in the brain proper.

The study of Human Psychoacoustics led to an understanding of how different sound components in the time and frequency/phase domains are perceived differently, especially when they interact. This understanding led to advances in the lossy compression of machine sampled sound data for human listeners. The end result was the likes of the familiar MP3 codec (a.k.a. the MP3 psychoacoustic perceptual codec). Sound files compressed using the MP3 codec algorithms are much smaller in byte size, yet without sacrificing too much in terms of human perceived sound quality.

A similar research and design approach that was used in the development of the likes of the MP3 audio codec was applied to how humans perceive visual data. The result being the likes of the MP4 video codec.

And as long as I’m rambling…

If you ever hear an “Audiophile” surrounded by scratchy vinyl records, $10,000 tube amps, and 10AWG solid gold-alloy speaker cables complain, “Gawd damn MP3 recordings, they all sound like a Suzanne Vega album from 1987!” There just might be some truth in what he’s saying. Go ask Wikipedia about “MP3” to learn more 😉

Have Fun, David

Joshua Bowman • September 14, 2017 12:24 AM

@kevin

I could absolutely believe that about the iPhone, given Jobs’ and Ive’s fanatical obsession with perfection and Apple’s historical support for high-end media capabilities. I have no doubt they design their custom ADCs to exactly match their microphone response, and software-correct it beyond that as necessary.

I’m not so sure I’d generalize that to any other smartphone brand, and quite a few of them I doubt would even realize there’s a problem in the first place — let alone care enough to do anything about it — when they’re throwing an off-the-shelf ADC design in the SoC.

MikeA • September 14, 2017 12:12 PM

This reminds me of what some called “guard banding” back in the early days of phone-phreaking. Telcos had responded to the rise in use of Single Frequency In Band signalling by putting “2600 sniffers” on suspect lines. These, of course, had essentially the same “spec” as the signalling units themselves. That is, energy at 2600 Hz above some threshold and energy at other frequencies below another threshold. “Guard banding” consisted of energy outside the nominal passband (e.g. 3400 Hz), which “blinded” the local sniffer, but after passing through the further filtering of the channel bank (multiplexer), was knocked down enough that devices further along the line would “see” the 2600. Not that I have personal knowledge of this, of course.

lurker • September 14, 2017 5:49 PM

My browser tells me the certificate at endchan.xyz was signed by an unknown authority. Does my browser not know enough, or should I trust Bruce?

Well call me Sally • September 15, 2017 1:53 PM

” Does my browser not know enough, or should I trust Bruce? ”

Neither! Haven’t you listened to anything Bruce ever said?

Call me confused • September 15, 2017 6:34 PM

If Bruce said not to trust anyone, how do I know I can trust his advice?

Is there a Bruce clone? Maybe one always speaks the truth and the other tells only lies…

Wael • September 15, 2017 6:44 PM

Hey, Confused,:)

Maybe one always speaks the truth and the other tells only lies…

There’s a solution to this problem. It’s a Famous puzzle.

Call me confused • September 15, 2017 9:25 PM

@Wael

You’re a quicky sticket!;)

Clive Robinson • September 16, 2017 8:56 AM

@ Wael, Call me confused,

There’s a solution to this problem. It’s a Famous puzzle.

But how do you get two Bruces to stand on either side of a pair of doors? After all he has expressed a distinct dislike for both frontdoors and backdoors…

Wael • September 16, 2017 9:47 AM

@Clive Robinson, @Call me confused,

But how do you get two Bruces to stand on either side of a pair of doors?

He does that once in a while 😉

After all he has expressed…

Which one of them? The author of liars and outliers or his counter-part character in the book? Then again, it’s a matter of perspective!

I bet you’re so drunk you see two of me
I have two guns; one for each of you!

Which movie?

Clive Robinson • September 16, 2017 12:49 PM

@ Wael,

Which movie?

Do you mean,

Billy Clanton: Why, it’s the drunk piano player. You’re so drunk, you can’t hit nothin’. In fact, you’re probably seeing double.

Doc Holliday: I have two guns, one for each of ya.

If so, it’ll get carved on your “………”

If not, “I’ll have to redo my google fu”. I’m not keen on “Out West Westerns”; there were way, way too many dumb hold-up-the-stagecoach plots. Apart from “Paint Your Wagon” and “Support Your Local Sheriff”, which were both funny, and “Blazing Saddles”, which really took the mickey out of the Western genre, I don’t watch them. To be honest they were like “Lassie films for wannabe men”, and Jack London’s book “White Fang” did that way, way better.

But I do like “Sci-Fi with a Western theme” like “Firefly” for instance, because at least the scripts had some life in them.

Wael • September 16, 2017 1:07 PM

@Clive Robinson,

Do you mean,…

Yes! That’s the one 🙂 It’s based on some historical facts. I recommend watching it; it’s different from other westerns. I only liked a handful of westerns. Some had great music themes by Ennio Morricone. Sci-fi, I like ’em, too.

Tombstone, Arizona is a place I visited many times.

Call me confused • September 16, 2017 5:53 PM

Well call me amused! 😀

Wael • September 16, 2017 11:19 PM

Hey, amused,

Your “always liar / always truthful” puzzle is child’s play! Let’s see if “The Hardest Logic Puzzle Ever” restores your original name. 🙁

Call me confused • September 16, 2017 11:33 PM

There is no random, dadn’t ja’no?

Wael • September 16, 2017 11:44 PM

@Call me confused,