Recovering Smartphone Voice from the Accelerometer

Yet another smartphone side-channel attack: “EarSpy: Spying Caller Speech and Identity through Tiny Vibrations of Smartphone Ear Speakers”:

Abstract: Eavesdropping from the user’s smartphone is a well-known threat to the user’s safety and privacy. Existing studies show that loudspeaker reverberation can inject speech into motion sensor readings, leading to speech eavesdropping. Meanwhile, ear speakers, which produce much smaller-scale vibrations, were believed to be impossible to eavesdrop on with zero-permission motion sensors. In this work, we revisit this important line of research. We explore the recent trend of smartphone manufacturers including extra/powerful speakers in place of small ear speakers, and demonstrate the feasibility of using motion sensors to capture such tiny speech vibrations. We investigate the impacts of these new ear speakers on built-in motion sensors and examine the potential to elicit private speech information from the minute vibrations. Our designed system EarSpy can successfully detect word regions, extract time- and frequency-domain features, and generate a spectrogram for each word region. We train and test the extracted data using classical machine learning algorithms and convolutional neural networks. We found up to 98.66% accuracy in gender detection, 92.6% accuracy in speaker detection, and 56.42% accuracy in digit detection (which is over 5X better than random selection (10%)). Our result unveils the potential threat of eavesdropping on phone conversations from ear speakers using motion sensors.

It’s not great, but it’s an impressive start.

Posted on December 30, 2022 at 7:18 AM · 57 Comments

Comments

Winter December 30, 2022 9:23 AM

zero-permission motion sensors

I think the root of the problem is that there are zero-permission data sources in mobile phones. Maybe the only solution for this game of whack-a-mole is complete segregation of data streams between apps.

Every app should be considered a security risk for every other app. For instance, if one app plays to the speakers, no other app should have zero-permission access to the speakers.

Clive Robinson December 30, 2022 11:22 AM

@ ALL,

Whilst this “instance” is relatively small, it falls into a more general “class” of mechanical vibration “side channels”.

And it’s got a lot worse since the advent of “Micro Electro-Mechanical Systems”(MEMS) transducers. Basically, MEMS are made in a very similar way to semiconductor chips, thus you get many thousands on a standard wafer.

The thing is, the mechanical component that moves is extraordinarily small, light, and sensitive, which gives them a very wide bandwidth.

The difference between a MEMS microphone transducer and a MEMS accelerometer or motion sensor is actually very small. The real difference is in which part of the mechanical vibration frequency range you want them to work.

Typically, microphones work in the 50-15,000Hz range and motion sensors below 50Hz.

Thus what makes the MEMS different is the “Digital Signal Processing”.

However, we know that recovering usable audio in that low frequency range is quite “achievable”, from the work of those using high speed video cameras to watch light reflect off crisp/chip packets and similar shiny surfaces.

Ted December 30, 2022 3:22 PM

FYI … if you want to experiment with what the researchers used, you can download the Physics Toolbox Sensor Suite app. The app connects with a lot of your phone’s sensors.

You can use it to view your phone’s accelerometer data. The Z-axis was most relevant for measuring ear speaker vibrations. It’s pretty fun.

pup vas December 30, 2022 4:49 PM

See no evil: People find good in villains
https://www.sciencedaily.com/releases/2022/12/221220165217.htm

=Across these measures, the research indicated that both children and adults consistently evaluated villains’ true selves to be overwhelmingly evil and much more negative than heroes’. At the same time, researchers also detected an asymmetry in the judgments, wherein villains were more likely than heroes to have a true self that differed from their outward behavior.=

Erdem Memisyazici December 30, 2022 5:51 PM

@Clive Robinson

That was very informative, thank you. There is an MIT project demonstrating what they deemed a “visual microphone”. One use I thought was really interesting is recovering audio from silent movies, which, with a great deal of approximation, could restore the actors’ vocal performances (if you can extract tone and cadence, the script can be used to generate the audio).

It should also be possible to recover audio from charged particles in the air, or even Wi-Fi, similar to this project. I suppose neutrinos could tell you about vibrations in a planet’s core. As a friend and colleague once told me, “it’s all just waves, man.” 😄

Matt December 30, 2022 8:06 PM

It’s important to discuss this in the context of existing mitigations in Android. There are new restrictions on sampling motion sensors above 200 Hz, and older restrictions (Android 9 and later) on apps running in the background.

The suggestion at the end of the paper that “smartphone manufacturers should be more careful about designing larger and more powerful ear speaker volume control” is unsupported by the facts.

Whether the required notification for “foreground services” is understood by users is another (more useful) question.

Clive Robinson December 31, 2022 12:18 AM

@ Help me darth vader…, ALL,

Re : Air gap crossing.

The problem with the majority of those “Ben-Gurion University” papers is they are not “original” research, you can find that on this blog long predating BGU’s papers.

BGU’s papers are “secondary”, “implementation”, or “proof of concept” papers…

As long as people remember this, and importantly understand that to “move forward” constructively into other practical “instances” you need to have seen the “original” research that defined the “classes” of attack method, then they should be OK.

I was doing original research in these areas out of my own pocket back in the 1980’s. A time when commercially “nobody wanted to know” and academia kind of pretended it did not exist as an issue worthy of any kind of research funding. Even now more than a third of a century later very very few people know anything about “gapping” even as just a name mentioned in spy movies / TV shows.

For instance ask yourself why “air-gapping” is insufficient in modern security environments and “almost a joke”. Thus why you have to move to “energy-gapping” which is

“A whole different and much larger ball of wax”.

Clive Robinson December 31, 2022 12:54 AM

@ Matt, ALL,

Re : Sampling frequency limiting.

“There are new restrictions for sampling motion at 200 Hz”

And the limit has been set on an incorrect premise.

That is, there are two theorems you have to correctly understand:

1, The Nyquist–Shannon “sampling” theorem.
2, The Shannon–Hartley “channel information rate” theorem.

People incorrectly assume from the first that the channel’s maximum frequency cannot be more than half the sampling rate, and so think that 200Hz sampling means a maximum frequency of 100Hz.

That is because people do not understand the notion of “negative frequencies” and thus “sampling fold back”.

If you take the same input signal and sample it at 200Hz into two different channels using a 90-degree phase shift, then it’s easy to “unfold” the signal and get a 200Hz frequency limit. You will see this referred to as an “I/Q” or just “IQ” system, and it’s used a lot with “Software Defined Radio”(SDR) to get around the audio sampling rates in PC sound cards. Likewise, if you use more channels you can get higher frequency limits.

So if you have access to the signal in various ways, including via two or more motion sensors, then the 200Hz sampling limit is relatively easily bypassed.
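The fold-back argument is easy to verify numerically. Below is a minimal NumPy sketch (my own illustration; the 150Hz tone and the ideal 90-degree channels are assumptions, not something from the paper) showing that a quadrature pair sampled at 200Hz resolves a tone that a single real-sampled channel would alias:

```python
import numpy as np

fs = 200.0          # per-channel sample rate, matching the Android cap
f_tone = 150.0      # above the 100 Hz Nyquist limit of a single real channel
n = 400
t = np.arange(n) / fs

# Channel I samples the signal directly; channel Q samples a copy
# phase-shifted by 90 degrees (an idealised quadrature pair).
i_ch = np.cos(2 * np.pi * f_tone * t)
q_ch = np.sin(2 * np.pi * f_tone * t)

# Combine into one complex signal: the unambiguous bandwidth is now
# the full fs = 200 Hz, not fs/2.
z = i_ch + 1j * q_ch
f_iq = np.argmax(np.abs(np.fft.fft(z))) * fs / n              # -> 150.0 Hz

# A single real channel "folds back" the same tone to 50 Hz.
f_real = np.argmax(np.abs(np.fft.fft(i_ch)[: n // 2])) * fs / n   # -> 50.0 Hz
```

Both channels here are ideal; with real motion sensors, gain and phase mismatch between the two channels limits how cleanly the folded image cancels.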

The important thing to remember is that the part of normal Western speech that contains the intelligence is the “envelope”, which is why you can use a VOCODER to get musical instruments to sound like they are talking[1]. Most if not all the intelligence is below the 300Hz frequency range. Thus using four separate phase-spaced sampling circuits at 200 samples a second will enable you to recover the actual speech information. Oh, and enough further information to know rather more than just whether the speaker is male or female. You will be able to get their “accent”, and get close enough to identify the individual uniquely.

Something, the readers of the paper might want to consider as “further research” and get three or more papers worth out of[2]…

[1] The “classic” example of this is the “Electric Light Orchestra”(ELO) “Mr Blue Sky” song that was the last track on the vinyl “LP”, where at the end you hear a couple of string instruments say “Now please turn me over”.

[2] Which in these much less certain times of academic employment is probably a wise thing to do. You might, by journal shopping, get it up to six papers for what should all have been in a single paper back in the 1980’s.

randoom reader December 31, 2022 2:12 AM

USB Type-C mandatory chargers will surely make energy gapping obviously difficult. Especially for laptops, even with removed, broken, or sealed/tinfoiled antennas or cases. I don’t envy Snowden 2.0. Ensuring no data can leave those devices will pose more than trivial challenges.

Help me darth vader, you're my only hope December 31, 2022 2:39 AM

@ Clive Robinson,

“The problem with the majority of those “Ben-Gurion University” papers is they are not “original” research, you can find that on this blog long predating BGU’s papers.”

You can find that on this blog in discussions, but not with elaborate, detailed information/instructions. That is what that site provides. It’s nice to discuss them here, but for the real meat, that site is better.

“Even now more than a third of a century later very very few people know anything about “gapping” even as just a name mentioned in spy movies / TV shows.”

Which makes it that much more important that people have access to them, to learn about and discover the side channel attacks. Obviously, if there were no merit to researching them, they wouldn’t become discussion topics here.

“For instance ask yourself why “air-gapping” is insufficient in modern security environments and “almost a joke”. Thus why you have to move to “energy-gapping””

I wouldn’t say that the research is a joke at all; obviously there are some bright minds behind it, otherwise, as I said, it wouldn’t be newsworthy.

There are a lot of “gaps” and “air-gapping” is just one.

As always, I highly respect your posts. Thank you.

Data without ears December 31, 2022 2:59 AM

Do you remember the days when you could fire up a program on a computer without ANY networking modules/software/firmware being activated? When you could truly work on something WITHOUT any network being involved/supported?

How many Operating Systems support this? Not just the ability to disable such crap, but a platform in which NOTHING EXISTS under any circumstances to use/install/support networking!

TempleOS is one I know of. I couldn’t use it without having a seizure, most probably, and I disagree with the “divination” involved in seeking “God” in randomness. However, it has no networking, and this is something I believe we should look to: making one modern OS where ALL networking is stripped and unsupported, in order not to get fscked by stuff like IME, whatever AMD’s equivalent is, and so on.

E.T. should NOT be phoning home.

DSP December 31, 2022 4:37 AM

Most if not all the intelligence is below the 300Hz frequency range.

That’s a somewhat misleading statement.

Channel Vocoders use a filter bank with center frequencies up to at least 3 kHz, and the better ones up to 8 kHz or so. If the input is limited to 300 Hz you won’t get much out of them.

What is true is that the power envelope in each band doesn’t contain much info above 100 Hz or so, even for the higher bands, which are much wider than 100 Hz. The reason is simply that the muscles that control speech can’t move that fast.

Thus using four separate phase-spaced sampling circuits at 200 samples a second will enable you to recover the actual speech information.

A much more powerful technique is to use ‘random sampling’ at an average rate much lower than the signal bandwidth. Given some a-priori knowledge about the signal (e.g. that it is sparse in the frequency domain), it can be recovered almost perfectly from such samples. Look up ‘compressive sensing’.
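To make the “compressive sensing” pointer concrete, here is a small self-contained sketch (all names and parameters are my own, purely illustrative): a signal that is 2-sparse in the DFT domain is recovered from 40 random time samples out of 256 via orthogonal matching pursuit, the simplest sparse-recovery algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 256, 2, 40          # signal length, sparsity, number of samples

# Test signal: k complex tones sitting exactly on DFT bins.
true_bins = [17, 63]
x = sum(np.exp(2j * np.pi * b * np.arange(n) / n) for b in true_bins)

# "Random sampling": keep only m of the n time-domain samples.
keep = np.sort(rng.choice(n, size=m, replace=False))
y = x[keep]

# Sensing matrix: the kept rows of the inverse-DFT synthesis matrix,
# so y = A @ (sparse spectrum).
F = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
A = F[keep, :]

# Orthogonal matching pursuit: greedily pick the bin most correlated
# with the residual, then least-squares re-fit over all picked bins.
support, resid = [], y.copy()
for _ in range(k):
    support.append(int(np.argmax(np.abs(A.conj().T @ resid))))
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    resid = y - A[:, support] @ coef

print(sorted(support))        # the recovered active bins
```

For this on-grid, noiseless example the support should come back exactly from roughly 16% of the Nyquist-rate samples; real signals need denser sampling and more robust solvers.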

Winter December 31, 2022 5:55 AM

@Clive

Most if not all the intelligence is below the 300Hz frequency range.

POTS of old was 300-3000 Hz. Less reduces intelligibility fast.

Below 300 Hz lies the pitch, 80-350 Hz for adults. But that is redundant, as human ears reconstruct it from the higher frequencies.

Clive Robinson December 31, 2022 8:20 AM

@ randoom reader,

Re : USB C and other power/data issues.

“USB Type C mandatory chargers will surely make energy gapping blatantly obviously difficult.”

Yes it will.

In theory you could make your own device to sit in between, but with USB-C that will create issues, as some “extension cables” do.

But it goes further…

You might decide to “directly charge” the batteries from the interface going to them…

You will find that possibly harder, because some manufacturers have made their batteries “Smart”. Apple for instance has a battery serial number and charging curve information.

The ethics of which I won’t go into, but again, without sending the right data the battery will not charge properly, or at all.

There are a whole load of “other spare parts” that manufacturers make excessively high profits on, and thus have controlled in the past. The classic being a well-known ink-jet printer manufacturer putting a data chip in the ink cartridge, so you could not re-fill it or use it beyond its artificial “Best Before Date”, which is actually illegal to do in the EU, but nobody has yet been prosecuted.

To partially get around the legislation, rather than stop the printing they would only print in some low resolution mode…

Oh and now of course we have kitchen appliances that won’t work without the Internet… Yup your fridge has to be able to phone home to the mothership –often in China– for it to work correctly…

Amazon started this: minimal function in what you buy, and you rent the rest till they decide to turn it all off. Which they have done, in some cases as early as six months after people made their first purchases…

When you consider increasingly applications are “cloud based” and need you to be either permanently connected or connecting every couple of days…

Yup, data security is being made quite deliberately hard to impossible for the average user to achieve.

Clive Robinson December 31, 2022 8:27 AM

@ Help me darth vader…, ALL,

I think you should really go back and read what I wrote.

Especially about instances and classes of attack.

Similar for your other points.

What you consider “the real meat” is anything but; it’s more the spun sugar of a candy-floss topping: it has looks, but next to no substance or nutrition.

Winter December 31, 2022 8:43 AM

@Clive

You will find that possibly harder because some manufacturers have made their batteries “Smart”.

Battery performance and life expectancy are totally dependent on charging/discharging cycles (and the temperature of the battery).

A single bad overcharging or undercharging can ruin a battery for good. And remember the exploding Li batteries? (E.g., Apple)

Dumb batteries end in burning or total loss.

Clive Robinson December 31, 2022 9:35 AM

@ DSP, Winter,

Re : Envelope v Signal bandwidths.

“That’s a somewhat misleading statement.”

Not really; the intelligence in western languages is not in the signal spectrum but, as I said, in the envelope spectrum. The signal spectrum gives you the user identification/personalisation and the likes of emotion via stressors etc.

When you say,

“Channel Vocoders use a filter bank with center frequencies up to at least 3 kHz, and the better ones up to 8 kHz or so. If the input is limited to 300 Hz you won’t get much out of them.”

You are confusing the spectral frequency components and the envelope or “amplitude” frequency components.

Look at how the likes of codecs that work on “line spectral pairs” and similar actually work.

In short, the “spoken audio” is never sent; a simple synthesis of it, made by modelling speech, is. This is often done as the sum of harmonically related sine waves represented by independent amplitudes, called line spectral pairs, or LSPs, on top of a determined fundamental frequency of the speaker’s voice (“pitch”, the frequency of which does not have to be precise at all; in fact noise excitation in a bandpass filter is a good approximation). Thus the pitch gets crudely quantised and the “amplitude” envelope of the harmonics is encoded by as little as a single bit per filter. Thus with a few other coding tricks the LSPs are exchanged across a channel in a very low bit rate digital format (450 bit/s has been demonstrated with “Codec 2”).

The LSP coefficients are not actual measurements, but predictions based on previous predictions and the difference from measurements. They represent the Linear Predictive Coding (LPC) element of the model in the frequency domain, and give a fast, efficient and robust quantisation of the LPC parameters.
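As a concrete anchor for the LPC element mentioned above, here is a textbook sketch (not the internals of any particular codec; the second-order “vocal tract” model is my own toy example) of the Levinson–Durbin recursion, which solves the Toeplitz autocorrelation equations for the predictor coefficients:

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the Toeplitz normal equations
    for LPC coefficients of the model x[n] ~ -sum(a[k] * x[n-k]).
    Returns ([1, a1, ..., ap], final prediction-error power)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                     # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(2) "vocal tract": x[n] = 0.75*x[n-1] - 0.5*x[n-2] + e[n],
# i.e. the true LPC polynomial is A(z) = 1 - 0.75*z^-1 + 0.5*z^-2.
rng = np.random.default_rng(1)
N, order = 50_000, 2
e = rng.standard_normal(N)
x = np.zeros(N)
for t in range(2, N):
    x[t] = 0.75 * x[t - 1] - 0.5 * x[t - 2] + e[t]

# Biased autocorrelation estimates r[0..order], then the recursion.
r = np.array([np.dot(x[: N - l], x[l:]) / N for l in range(order + 1)])
a, err = levinson_durbin(r, order)   # a ~ [1, -0.75, 0.5], err ~ 1.0
```

A codec would then quantise a robust transform of these coefficients, rather than ship the raw audio.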

Whilst it works well for speech approximation, it’s almost useless for sending frequency or phase information, which is why I pointed out nearly a decade ago on this blog that the “jackpair” voice encryption system would fail.

Moving on to,

“A much more powerful technique is to use ‘random sampling’ at an average rate much lower than the signal bandwidth.”

Yes there are quite a few methods, but two points arise,

1, I was trying to give an easy to see example.
2, You may not be able to do the more fun ones because it is the Android OS not the user application that controls the access to the hardware.

As @Winter knows, I get shot at for giving long detailed explanations, and also for trying to be brief and giving minimal explanations to get the point across…

So I’ll just move this lump of wood from one shoulder to the other 😉

Winter December 31, 2022 10:49 AM

@Clive

Not really; the intelligence in western languages is not in the signal spectrum but, as I said, in the envelope spectrum.

It’s not “intelligence” but “intelligibility”. And this is a statement in the category: Not even wrong.

Any introduction in speech and language will tell you this on page 1.

Winter December 31, 2022 10:52 AM

Continued:
The Source–Filter Theory of Speech
‘https://oxfordre.com/linguistics/display/10.1093/acrefore/9780199384655.001.0001/acrefore-9780199384655-e-894

Clive Robinson December 31, 2022 12:50 PM

@ Winter,

Re : Pick your knowledge domain

“It’s not “intelligence” but “intelligibility”. And this is a statement in the category: Not even wrong.”

The knowledge domain of “auditory perception” or “aural perception” is not of real interest when talking about the communication of digital information. I’ve a fair few books on “information theory”, and a look through their indexes shows no mention of intelligibility. Which is not that surprising when you consider it actually comes from the biological sciences.

That said intelligibility is defined in the aural science domain as,

“The proportion of words correctly identified by a listener and is a natural measure for quantifying the quality and effectiveness of speech perception.”

However, although listening tests by humans for intelligibility can provide semi-valid data, such tests are time-consuming to conduct and error prone with small groups; such is the nature of human biology.

DSP December 31, 2022 6:15 PM

This is often done as the sum of harmonically related sine waves represented by independent amplitudes called Line spectral pairs

  1. Line spectral pairs are a method of representing linear prediction coefficients. They are used because they are more tolerant of quantisation than the raw coefficients, and hence require fewer bits to transmit.
    They certainly do not represent ‘harmonically related sine waves’. So what you write is ‘not even wrong’.
  2. Channel vocoders (such as the one you referred to, used by ELO) do not even use linear prediction, but two filter banks, one to analyse the voice signal and one to resynthesise it. So your reference to line spectral pairs is not only ‘not even wrong’, but also irrelevant.
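Point 1 can be shown in a few lines. The sketch below (the standard textbook construction, purely illustrative, with a toy second-order predictor) converts LPC coefficients to line spectral frequencies by forming the symmetric and antisymmetric polynomials P(z) and Q(z), whose roots lie on the unit circle for a stable predictor:

```python
import numpy as np

def lpc_to_lsf(a):
    """Line spectral frequencies (radians, in (0, pi)) of the LPC
    polynomial A(z) = a[0] + a[1]z^-1 + ... + a[p]z^-p, with a[0] = 1.
    P(z) = A(z) + z^-(p+1) A(1/z)   (symmetric)
    Q(z) = A(z) - z^-(p+1) A(1/z)   (antisymmetric)"""
    a = np.asarray(a, dtype=float)
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    # For stable A(z) all roots sit on the unit circle; keep one angle
    # per conjugate pair and drop the trivial roots at z = 1 and z = -1.
    ang = np.angle(roots)
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])

# Stable 2nd-order example: A(z) = 1 - 0.75 z^-1 + 0.5 z^-2
lsf = lpc_to_lsf([1.0, -0.75, 0.5])   # two interleaved frequencies
```

The quantisation tolerance comes from the geometry: small perturbations of these angles keep the reconstructed A(z) stable so long as the P and Q angles stay interleaved.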

I’ve a fair few books on “information theory”

But apparently none on speech analysis.

Clive Robinson December 31, 2022 8:01 PM

@ DSP,

Or should I call you “Dumber than a Troll” because you are certainly behaving that way.

“Channel vocoders (as the one you referred to, used by ELO) do not even use linear prediction,”

I never said they did, go back and read what I wrote and stop inventing arguments.

I did not mention linear prediction or line spectral pairs in that post, or even imply them.

When you first got incorrectly picky I tried to let you down gently; obviously you saw that as some kind of signal to make false claims.

So it is now very clear you are inventing things to argue about, that were not there and never were there as anyone else can see.

Sorry but that is typical deluded Trollish behaviour.

So back to your cave under the bridge that stops the light of reason and thus knowledge falling upon you.

Winter January 1, 2023 2:18 AM

@Clive
You wrote:

Most if not all the intelligence is below the 300Hz frequency range. Thus using four separate phase-spaced sampling circuits at 200 samples a second will enable you to recover the actual speech information. Oh, and enough further information to know rather more than just whether the speaker is male or female. You will be able to get their “accent”, and get close enough to identify the individual uniquely.

That is simply not true. All the rest, vocoders, LSPs etc., are irrelevant, as they do use frequencies over 300Hz.

A century of analogue telephone landlines has shown that a sound spectrum bandwidth of 2700 Hz (300-3000 Hz) conveys almost all the information needed to perfectly understand speech and identify the speaker.

But vocoders etc. compress this information so the speech signal can be transported at a much lower bitrate. That was the whole point of using vocoders and LSPs.

I assume your information & communication books wrote a lot about (analogue) speech compression and bit rates, but not about plain human-to-human speech.

DSP January 1, 2023 5:54 AM

I did not mention linear prediction or line spectral pairs in that post or even implied them.

True. Nor did I in my reply to that post.

You first mentioned LSP in your next post even if they were completely irrelevant (the argument was not about linear prediction), and demonstrated rather clearly that you don’t even start to understand what they actually are [1].

Such attempts to suggest competence and divert the discussion by dropping some vaguely related technical jargon is typical of some trolls, those who like to present some glorious picture of themselves while at the same time being incapable of accepting even the most benign criticism.

[1] ‘https://en.wikipedia.org/wiki/Line_spectral_pairs

Clive Robinson January 1, 2023 7:19 AM

@ DSP,

So you tried to say I was saying something I was not.

Then when caught out you try to brush it away.

But if people look at your comments they will see that not only are they mostly vacuous, the parts that are not are copied from Wikipedia, and your comprehension ability is not up to those Wikipedia comments.

For instance, do you actually know what a comb filter is and how it relates to the line spectral pair?

I suspect you are now going to have to scramble around trying to answer that…

Clive Robinson January 1, 2023 7:50 AM

@ Winter,

“That is simply not true.”

Actually it is, and it’s very clear from your comments that you are not listening to what you are being told.

The intelligence of a spoken signal is in the energy envelope, not what you would call the carrier signal frequency.

A significant number of western-language-speaking individuals are effectively “tone deaf” (cannot equate frequency).

Consider it like an Amplitude Modulated broadcast, the “intelligence” is in the AM envelope not the carrier signal RF frequency.

Back in the 1970’s, Plessey Electronics, then the largest supplier of digital exchange equipment in the UK, had worked out that your statement of,

“A century of analogue telephone landlines has shown that a sound spectrum bandwidth of 2700 Hz (300-3000 Hz) conveys almost all information needed to perfectly understand speech and identity the speaker.”

Was a half-truth. That is, it in no way implies the converse, which is what you are trying to argue: you do not need 300-3000Hz to convey speech successfully. They had taken this knowledge and produced a series of digital based systems aimed not just at the telephone industry but at the military radio communications sector. One of their main rivals for this was another British company, Marconi. One of their senior engineers, J.S.Reynolds, wrote a conference paper, “A multipurpose secure system capable of working over H.F. Radio”, for the IEE “Communications 78” conference, and it is reproduced in the IEE “Conference Publication Number 162”. Which I suspect you have access to, based on some of your previous paper quotes. Go and look the paper up and give it a read; it describes what you should be understanding.

Winter January 1, 2023 10:49 AM

@Clive

The intelligence of a spoken signal is in the energy envelope, not what you would call the carrier signal frequency.

I give up. Come back after you have listened to low pass filtered speech.

For anyone else, figures 4&5 in the site below show the numbers

‘https://www.dpamicrophones.com/mic-university/facts-about-speech-intelligibility

Figure 4 separate
‘https://cdn.dpamicrophones.com/media/images/mic-university/facts-about-speech-fig04_1.jpg

Figure 5 separate
‘https://cdn.dpamicrophones.com/media/images/mic-university/facts-about-speech-fig05_1.jpg

DSP January 1, 2023 11:05 AM

But if people look at your comments they will see that not only are they mostly vacuous, the parts that are not are copied from Wikipedia, and your comprehension ability is not up to those Wikipedia comments.

The description of LSP which I posted is what you will find in the first few lines of any text about them. Very probably including Wikipedia.

Which confirms that your description of them was completely wrong, as well as irrelevant in the context of this thread.

For instance do you actually know what a comb filter is and how it relates to the line spectral pair?

Nice try to divert again. Do you know what an apple is and how it relates to an orange ?

lurker January 1, 2023 12:48 PM

@Winter

@Clive

The intelligence of a spoken signal is in the energy envelope, not what you would call the carrier signal frequency.

I give up. Come back after you have listened to low pass filtered speech.

Tweedledum and Tweedledee …
If you take the envelope of human speech and use it to modulate white noise the result will be almost as intelligible as the original, suggesting that the envelope holds the intelligence in speech. But because of the way the human auditory system “demodulates” speech, both the envelope and the carrier are essential for speech communication.

Thus in filtered analog terms we find that for 90% intelligibility the Welsh language requires 4.3 kHz bandwidth, while Chinese requires only 900 Hz. [Atkinson, Telephony 1948]
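The envelope-on-noise experiment is easy to approximate in code. The sketch below is a crude noise vocoder in pure NumPy (brick-wall FFT filters, and a synthetic amplitude-modulated tone standing in for a real speech recording, so every parameter is illustrative): split the input into bands, take each band’s slow envelope, and use it to modulate band-limited white noise:

```python
import numpy as np

fs = 16_000
rng = np.random.default_rng(2)

# Stand-in for speech: a 300 Hz tone amplitude-modulated at 4 Hz
# (roughly syllable rate).  Load a real recording here instead.
t = np.arange(fs) / fs
speech = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 300 * t)

def bandpass(x, lo, hi):
    """Crude brick-wall band-pass via FFT masking (fine for a demo)."""
    spec = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    spec[(f < lo) | (f >= hi)] = 0
    return np.fft.irfft(spec, len(x))

def envelope(x, cutoff=50.0):
    """Rectify then low-pass: the slow energy envelope of a band."""
    return bandpass(np.abs(x), 0.0, cutoff)

# Per band: impose the speech band's envelope on white noise limited
# to the same band, then sum the bands back together.
bands = [(100, 500), (500, 1500), (1500, 4000)]
noise = rng.standard_normal(len(speech))
vocoded = sum(envelope(bandpass(speech, lo, hi)) * bandpass(noise, lo, hi)
              for lo, hi in bands)
```

With a real recording and a handful of bands the result can be surprisingly intelligible, which is the effect described above; with a single broadband envelope it is not.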

Winter January 1, 2023 1:53 PM

@lurker

If you take the envelope of human speech and use it to modulate white noise the result will be almost as intelligible as the original, suggesting that the envelope holds the intelligence in speech.

Have you tried this? Take the envelope of a recorded sentence, use it to modulate white noise [1], and present it to someone who has not heard the original [2]. Ask her or him to repeat the sentence.

The link I supplied presents the results of actual listening experiments on low-pass filtered speech to test this. You cannot understand speech after low-pass filtering it with a cut-off below 500Hz.

[1] There is software that can do this. Look online.

[2] This is crucial. When you have heard the original recording before, your brain will substitute the memory for the sound.

Clive Robinson January 1, 2023 3:16 PM

@ Winter, lurker,

“Take the envelope of a recorded sentence, use it to modulate white noise [1] and present it to someone who has not heard the original [2]. Ask her or him to repeat the sentence.”

Actually yes I have and it was back in the 1980’s.

When you carry out the test correctly, they can repeat the sentences reasonably correctly, even without having previously heard them.

The trick is knowing which frequency block of white noise to modulate with which envelope. If you read the paper I previously mentioned it will give you some of the reasons why.

For instance, the frequency blocks around ~1.5kHz appear to be of little use to humans when speaking. You can notch out a 300-500Hz-wide block there entirely and humans don’t miss it from speech. So much so that in the early days of “dial up” teleconferencing, the control signals were sent in this block at around the same level as the entire speech signal. Nobody complained of it not working.

Similarly, in cordless phones the same chunk of audio bandwidth was used for “link alive” and later for “data transmission”.

Or to put it another way, if your radio receiver needs an AM broadcast on medium wave, it won’t receive an AM transmission on long wave or short wave. That in no way means that the frequency of the medium wave carrier signal is intelligence and the long wave or short wave AM envelopes are not. These simplistic ideas about “telephone audio bandwidth” were disproved over half a century ago in the early days of going digital, with the likes of the System X research.

Oh, and the fact that they are not true is why modern speech codecs work as well as they do, with some achieving 20:1 ratios of data compression.

Clive Robinson January 1, 2023 3:38 PM

@ DSP,

“The description of LSP which I posted is what you will find in the first few lines of any text about them.”

Actually, you don’t; their distinctive style alerted me to the fact that you had cut and pasted without comprehension.

“Which confirms that your description of them was completely wrong”

My description of what?

That was your big mistake: you are trying to claim, incorrectly, that I said something I did not say, which you invented to raise an argument.

The problem is those first lines from Wikipedia do not prove or disprove anything I’ve said. Which you would know if you actually had any comprehension of the subject.

So I tossed you a simple bone of a “comb filter” to see if you actually had any real basic understanding.

The fact you’ve hit the panic button instead of taking it in your stride says much about your inabilities.

Now I could go on further proving how little you understand, but I don’t need to, do I, because you’ve started this little game in the past under different handles, and have ended up losing.

Thus people should now realise the problem is very much in your head, and is a form of self-inflicted injury resulting from a need to prove you are something you are not.

Why you have the initial issues probably goes back to your childhood and being bullied for some reason.

Why that may have happened to you I don’t know, but whilst I feel sorry for people who have been bullied as children, you also have to remember two things: it was not me who bullied you, and your having been bullied is not in any way caused by me. So you have no excuse whatsoever for stalking and trying to bully me, just for your own self-gratification. In fact it probably has made you a bully of others as well, and thus a concern to society.

lurker January 1, 2023 3:41 PM

@Winter, @Clive Robinson

Please look at Tokuda[1] Figure 1. The middle curve shown is the response of the “Vocal Filter”. I believe this is where you are talking past each other. The bandwidth of this filter is low, but its centre frequency is high. The horizontal axis is not numbered, but assuming a fundamental frequency of 200Hz, harmonics are shown up to 4kHz, with peaks in the Vocal Filter at 1.6kHz and 2.4kHz. The envelope of the speech spectrum shown in the top curve can be obtained with a diode and a 300Hz low-pass filter, and holds most of the “intelligence” in the speech. But this envelope is of no use to the human auditory ganglia without simultaneous knowledge of where on the speech spectrum the peak is at any given time, because the peaks of the Vocal Tract Filter move with time. Which is why LPC and LSP have leaked into the conversation.
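That diode-plus-low-pass envelope recovery can be sketched in a few lines (an illustrative toy in Python; the 8kHz rate, the test signal, and all the names are my assumptions, not anything from Tokuda or the EarSpy paper):

```python
import math

def envelope(samples, rate_hz, cutoff_hz=300.0):
    """Crude AM envelope detector: full-wave rectify (the 'diode'),
    then smooth with a single-pole RC low-pass at cutoff_hz."""
    # Single-pole IIR coefficient for the chosen cut-off frequency.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / rate_hz)
    out, y = [], 0.0
    for s in samples:
        y += alpha * (abs(s) - y)  # track the rectified signal
        out.append(y)
    return out

# A slow 200 Hz envelope modulated onto a 2 kHz "vocal tract" carrier:
rate = 8000
signal = [(0.5 + 0.5 * math.cos(2 * math.pi * 200 * i / rate))
          * math.sin(2 * math.pi * 2000 * i / rate)
          for i in range(rate // 10)]
recovered = envelope(signal, rate)  # follows the slow modulation, not the carrier
```

The recovered waveform rises and falls with the 200Hz modulation (with ripple, since this is only a first-order filter), which is the sense in which the low-frequency envelope carries the information.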

@Winter

present it to someone who has not heard the original

Of course the listener must be unaware of the original. Unfortunately I don’t have Atkinson on my bookshelf now, so can’t quote his actual audibility reference testing method.

Some clue as to the usefulness of the EarSpy comes from their claim of over 90% accuracy in gender and speaker recognition, but only 56% accuracy in word recognition from a ten word dictionary. Still, as @Bruce says, it’s a start: either to do better, or to kill the method in phone design.

[1] ‘https://oxfordre.com/linguistics/display/10.1093/acrefore/9780199384655.001.0001/acrefore-9780199384655-e-894

DSP January 1, 2023 4:54 PM

The fact you’ve hit the panic button instead of taking it in your stride says much about your inabilities.

What panic button ?

Thus people should now realise the problem is very much in your head, and is a form of self-inflicted injury resulting from a need to prove you are something you are not.
Why you have the initial issues probably goes back to your childhood and being bullied for some reason.
etc. etc.

Seems you are hitting the panic button.

Matt January 1, 2023 5:45 PM

@Clive

Are the problems with the mitigations really impossible to solve?

My main issue with the paper is that it somehow reaches the conclusion that manufacturers ought to address this in hardware, with only a passing discussion of the software mitigations or how they might be improved.

Looking only at the Android API documentation, it isn’t clear exactly how the 200 Hz limit is implemented. There is some room to make non-API-breaking changes in how it is done.

Likewise, the notification Android users get when this kind of background activity is happening could be made more obvious, or the background measurements could be further restricted or even prohibited.

Without discussing these current mitigations or potential future mitigations in any depth, the authors’ suggestion that speaker design ought to be limited is unsupported.

lurker January 1, 2023 7:02 PM

@Matt

Notify, restrict, prohibit, and API documentation are for those of us who read the rules and try to abide by them. An attack is usually assumed to come from bad guys who don’t have the same rulebook. The paper shows how current software makes this attack look feasible on current hardware. Future software can only be assumed to make the attack more feasible. Thus future hardware must be modified to reduce the attack surface.

Clive Robinson January 1, 2023 9:04 PM

@ DSP,

“Seems you are hitting the panic button.”

Nope…

You made a false statement, and unsurprisingly have totally failed to back it up in any way.

Your attempts to show expertise have limited themselves to cut-and-paste from Wikipedia, which by the way in no way supports your false statement.

Since then you’ve tried to ignore the fact you were “caught out” / “caught red handed”.

I even offered you a chance to save face and extricate yourself… But no, you don’t even know enough to realise that.

So in short your current behaviours have presented you,

“As a lying waste of space, unworthy of any consideration.”

So if you want to get back any credibility you need to stick with your original accusation, which you appear totally desperate to avoid doing… So people are naturally going to ask of you,

1, Why did they make the accusation?
2, Why have they presented no supporting evidence?
3, Why are they desperate to avoid answering valid questions?

The list is likely to go on, but then that’s not really my concern, is it?

Clive Robinson January 1, 2023 10:08 PM

@ Matt, lurker, ALL,

“Are the problems with the mitigations really impossible to solve?”

In theory, no… In practice, well, we don’t know, but the predictions do not paint a pretty picture…

The hard reality at the bottom of the computing stack is,

“It’s a trade off in terms of time.”

That is,

1, How fast do people need to access the motion detectors to make them universally useful?
2, How fast do attackers need to access the motion detectors to glean sufficient information to reconstitute a known person’s speech?

The more “useful” you make the motion detectors, the faster they have to be accessed. Flight control systems, for instance, access various sensors 200-6000 times a second; motor vehicle braking systems with anti-lock, 200-1000 times a second. So you can see the range things are starting to fall into for direct access.

But let’s say 200 times a second: where and how do you set that limit? Importantly, is it 200 times a second for the entire system, or for each app?

Because if you have it per app, and you have two apps talking via IPC, you could arrange for them to be 90 degrees apart, giving not just twice the resolution, so effectively 400 samples a second, but also solving other issues to do with envelope recovery by IQ demodulation.
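The rate-doubling half of that idea can be sketched as a toy (hypothetical Python, not any real Android API; the IQ demodulation part is not shown):

```python
# Two "apps", each limited to 200 samples/s, whose sampling instants are
# offset by half a sample period. Interleaved, they behave like a single
# 400 samples/s stream. All names here are illustrative.

def sample(signal, rate_hz, n, offset_s=0.0):
    """Take n samples of signal(t) at rate_hz, starting at offset_s."""
    return [signal(offset_s + i / rate_hz) for i in range(n)]

def interleave(a, b):
    """Merge two equal-length streams sample-by-sample."""
    merged = []
    for x, y in zip(a, b):
        merged.extend([x, y])
    return merged

ramp = lambda t: t                     # a signal that just reports time
app_a = sample(ramp, 200, 4)           # t = 0, 5, 10, 15 ms
app_b = sample(ramp, 200, 4, 1 / 400)  # t = 2.5, 7.5, 12.5, 17.5 ms
merged = interleave(app_a, app_b)      # uniform 2.5 ms spacing: 400 Hz
```

In other words, a per-app cap only holds if colluding apps cannot stagger their polling, which is exactly the loophole being described.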

And that’s before you ask the question “how many motion detectors” are there in reality…

As @lurker has noted,

“Future software can only be assumed to make the attack more feasible.”

And as our host @Bruce has previously noted generally,

“Attacks always get better; they never get worse.”

Though as I’ve noted on the odd occasion, the way the ICTsec Industry currently works,

“Attack lessons are seldom to never learned, so old attacks work over and over again.”

Now you could ask if I have a different brush in the pot to paint a “prettier picture”. Sadly not currently… and even if there were such a brush, would it be used? I suspect not, under current management rules.

The last time I looked, I think we were on target for 50,000 CVEs this year, and had broken the bar of 200 new vulnerabilities per working day…

Ask your self two questions,

1, Can I read and comprehend 200 vulnerabilities a day?
2, How long can I remember vulnerability details for?

You start to see the nature of the problem. In part it’s why I push the “Instance of Class” model. If we do it right, we can extract knowledge from each new instance of an attack and put it into one or more generalised class models. If we then defend against the classes, we close off not just a lot of current and old instances, but also the re-use that would otherwise have given us new instances.

But we also need to ask ourselves a question…

“The freedom to juggle with sharp knives means we risk cutting ourselves or taking an eye out etc. If we juggle with balls, the risk of harm is way way less. Whilst juggling with knives is exciting, do we need that sort of excitement every day at work?”

I think most would argue “NO”.

Most of our vulnerabilities actually come from very few sources when you get down to it. Newer languages have stopped many old language vulnerabilities, so we know “tools” can remove “low hanging fruit”. We are now moving into an era of “Type Safety” and effective “memory reuse”. We’ve even moved forward on Interface Contracts being done “auto-magically”.

The question becomes “how safe can we make things?” The answer is “never enough”, because attackers have greater agency than our tools do, and I cannot see that realistically changing. So they will always be able to find new ways to evade existing tools, much as they do with malware and AV systems.

But there is a counter argument we need to think about. The more our tools take away security concerns, the less the average programmer will understand about security. At some point we will approach a tipping point where security cannot be practiced by the majority of programmers, as they will have neither the knowledge nor the experience. Thus the question arises,

“Does this matter?”

Personally I think “Yes” though I’m sure you will find others who will either say “No” or act as though it’s “No” for “business reasons”.

So you can see where that’s going to end up without “appropriate” legislation and regulation, which our current generation of legislators appear totally incompetent / inept at doing…

Winter January 2, 2023 6:13 AM

@lurker

The bandwidth of this filter is low, but its centre frequency is high.

Obviously, if you use the individual “vocoder bands”, you get different results.[1]

But that is totally irrelevant when using an accelerometer with a low-pass cut-off below 500Hz.

Anyhow, this is the stuff of 1970s audio and speech science. I have neither the time nor the inclination to battle people who reject science without solid arguments. Anyone can look it up and find the evidence to prove me wrong if they want.

[1] See
‘https://cdn.dpamicrophones.com/media/images/mic-university/facts-about-speech-fig04_1.jpg
In a useful summary of the facts:
https://www.dpamicrophones.com/mic-university/facts-about-speech-intelligibility

DSP January 2, 2023 8:07 AM

Your attempts to show expertise have limited themselves to cut-and-paste from Wikipedia, which by the way in no way supports your false statement.

I didn’t need to consult Wikipedia. I first programmed computing line spectral pairs as an undergraduate student, 45 years or so ago.

So if you want to get back any credibility you need to stick with your original accusation

The ‘accusation’ was

That’s a somewhat misleading statement.

referring to your

Most if not all the intelligence is below the 300Hz frequency range.

And I stick with it. If you can’t tolerate such a comment, seek help.

Clive Robinson January 2, 2023 8:52 AM

@ DSP,

Re : Goal post shifting.

“Most if not all the intelligence is below the 300Hz frequency range.”

Is factually correct when talking about the speech envelope, which I was.

What I raised and you’ve failed to address is your factually inaccurate accusation of,

“They certainly do not represent ‘harmonically related sine waves’. So what you write is ‘not even wrong’.”

So you have not wormed off the hook you have made for yourself, and I see no reason to allow you to do so.

But your new statement of,

“I first programmed computing line spectral pairs as an undergraduate student, 45 years or so ago.”

So back in the early to mid 1970s is what you now claim… Hmmm, how do I put it politely,

“That is, shall we say, ‘most curious’.”

But it still does not get you off the hook you’ve made for yourself.

lurker January 2, 2023 1:40 PM

It’s why I wrote Tweedledum and Tweedledee:

@Clive is right when he says
“Most [of] the intelligence is below the 300Hz frequency range.”
That is how the people who wrote the paper at the head of this thread got their system to work, using an accelerometer system with an upper frequency of 200Hz.

@Clive is wrong when he says “if not all” (omitted from the above quote) because to use that low bandwidth intelligence a human listener must hear the fundamental high frequencies on which it is modulated.

@Winter is right that simply placing a 300Hz-wide bandpass filter at the higher fundamental frequency will result in an unintelligible signal. This is because we are modulating two complex waveforms, resulting in a wide spectrum of sidebands, most of which are needed for human understanding of speech, and the centre frequency of that bandpass filter moves in the time domain.

@Winter is misleading when he says “I assume your information & communication books wrote a lot about (analogue) speech compression and bit rates, but not about plain human 2 Human speech.” The system described in the head paper is not using human 2 human speech. It is interposing an adventitious transducer (the accelerometer) and a lot of mathematics (CNN). A computer is “listening” to the result, and it seems not even in real time.

The researchers’ claim that their results are “5 times better than random” supports @Clive’s claim of the intelligence being in a low bandwidth.

The researchers’ claim of only 56% word recognition from a 10 word dictionary supports @Winter’s requirement of wide bandwidth for intelligibility.

This is obviously a work in progress, and the progress appears to be orthogonal to the CELP speech processing used in the phone’s main communication channel, which may be one of the causes of the apparent confusion.

DSP January 3, 2023 5:09 AM

So back in the early to mid 1970’s is what you now claim

1977 or 1978. Using Fortran IV. There’s nothing curious about it.

What you seem to forget is that expressing a function as the sum of an even and an odd (palindromic and antipalindromic) component, and then using the properties of those two components to analyse it further, is a standard ‘trick’ in applied maths and engineering.

The maths underpinning LSP were well known and used long before the method was discovered to be very useful for speech coding.

Clive Robinson January 3, 2023 8:00 AM

@ DSP,

“What you seem to forget is… “

No it’s what you are trying to avoid.

You have made an allegation you appear to be incapable of even trying to defend. Thus have resorted to cutting and pasting from a Wikipedia page…

As for,

“is a standard ‘trick’ in applied maths and engineering.”

Yes, I use Z-transforms for all sorts of things, and prefer them to Laplace transforms as they are easier to visualize.

There I’ve given you another little hint to help you squirm along on your journey towards an apology.

JG4 January 3, 2023 8:01 AM

Any object light enough to oscillate in a sound field can be used as an accelerometer. It would be really clever to hide a thin membrane inside of a wall hanging, then interrogate it with microwaves. Clive alluded to this. The leaf of an office plant might have a suitable frequency response. There are at least a couple of versions of this work online:

The Visual Microphone: Passive Recovery of Sound from Video
http://people.csail.mit.edu/mrub/VisualMic/

Winter January 3, 2023 9:51 AM

@lurker

The researchers’ claim of only 56% word recognition from a 10 word dictionary supports @Winter’s requirement of wide bandwidth for intelligibility.

The difference is also that I consider 56% word recognition on a 10 word dictionary just plain unintelligible.

DSP January 3, 2023 11:53 AM

Let’s go back. You wrote:

This is often done as the sum of harmonically related sine waves represented by independent amplitudes called Line spectral pairs

Which is nonsense. LSP are not ‘independent amplitudes’ [1] and they do not represent ‘harmonically related sine waves’ by any stretch of the imagination.

In the most general sense they are just a mathematical trick. In the context of LPC and speech analysis/synthesis they are used to represent the roots of two polynomials which in turn define a filter.

Seems like you didn’t have any idea of what LSP are before I made you have a look at the Wikipedia page. You have pointed to Wikipedia pages numerous times, so I assume that’s where you like to get your ‘better than most experts’ knowledge.

There I’ve given you another little hint to help you squirm along on your journey towards an apology.

There is nothing I should apologise for. I pointed out your error, and you seem unable to accept that. That’s all there is to it.

As far as I’m concerned, this waste of time ends here.

[1] In the context of LPC, the LSP are frequencies.

Quantry January 3, 2023 1:32 PM

Thanks again Bruce.

Isn’t this attack yet another reason to demand that our phones support wired headphones, and to use them? [and gut the on-device microphone and speakers],

or better yet, shouldn’t the average person be using a data channel instead (in view of things like @JG4’s Re: Visual Microphone), and just jam the audio space with Black Sabbath’s “Iron Man” and ultrasonics?

and off topic…

@ randoom reader, (Re : USB Type C mandatory charger energy gapping…) Isn’t charging inside a faraday bag from a battery a gapped-and-wired option?

Clive Robinson January 3, 2023 6:18 PM

@ lurker,

Part 1,

“Clive is wrong when he says “if not all” … because to use that low bandwidth intelligence a human listener must hear the fundamental high frequencies on which it is modulated.”

If you go back you will see I was talking about the information content, not whether a human could intelligibly hear that content (a very big difference). It’s why, when I realized this was becoming an issue to understanding, I mentioned people should think of human hearing like an “Amplitude Modulated”(AM) receiver. It does not matter, as far as transmission of the information goes, whether you send it on longwave, mediumwave, or shortwave: the information will arrive at the receiver’s input terminals. However, if the receiver –human hearing– is tuned to mediumwave and you send the information on longwave or shortwave, then the –human– receiver will not pull the information off of the carrier it’s modulated onto[1].

So you have to get the context of what is being communicated correctly.

But the other issue, which @Winter has not indicated he understands, is that human hearing neither uses all of that 0.3-3kHz bandwidth, nor uses equally the part that it does use.

As I’ve already mentioned above, there is a block of frequencies between 1 and 2 kHz, some 0.3 to 0.5 kHz wide, that whilst humans can “hear it”, plays no real part in the intelligibility of speech to humans. Which is why you can simply “notch it out” and transmit low data rate signals down it without human perception realising you’ve put such a hole in the frequency spectrum. Also, the differences in the way the ear hears allow people to “sing” in the bass, tenor, baritone, soprano and other limited ranges and still have the words clearly heard.

So roughly you can already split that 0.3-3kHz spectrum into a “low band”(LB), “mid band”(MB), and “high band”(HB), knowing that the human receiver uses each band differently, or in the MB case not at all, for speech intelligibility.

Whilst you can do the maths to find out what the actual “information rate” for speech is in bits per second, it’s both variable and quite dull to do. What you need to know is that it’s really quite low: down at a baud rate of a couple of hundred symbols a second, and as low as fifty, whilst still being fully intelligible as speech, though “robotic” in nature.

Information-wise, that can be “easily” put in the bottom half of the LB with room to spare. And in reality, when you “sample” a speech signal, you can show that it all ends up down there in one way or another unless you take some precautions to stop it happening[2].

As the modulation is actually energy, which cannot be “destroyed”, one of two things happens to it at the receiver’s input terminals,

1, It gets absorbed into a “load” and becomes thermal energy.
2, It gets reflected back into the transmission channel.

The result of the second is that you can remotely “enumerate” the input circuit of the receiver, simply by sweeping a carrier across the entire transmission channel bandwidth and seeing what gets reflected back. A technique, by the way, that is apparently still regarded as “secret” in the US, even though it’s in open text books and other sources. Oh, and it has been known in the open US community since long before somebody tried to “gag the genie”, as they could not “put it back in the bottle”. Look up the techniques developed around the “Grid Dip Oscillator”(GDO), which are a practical introduction to the S-Parameter test set and what are now called “Vector Network Analysers”(VNA). Likewise look up “Standing Wave Ratio”(SWR) in “transmission lines”.

[2] The way you generally stop it is by band-limiting the audio prior to sampling, then using a sampling rate above the Nyquist requirement of twice the highest frequency. Which for telephones has been selected as approximately 8kHz. However, this sampling rate is around 20-40 times the rate actually needed for pulling out the speech information content… It has just made life a lot, lot easier in the past to go with such a high sampling rate (and is why most speech processing is actually done in software algorithms rather than hardware circuits).
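The reason for band-limiting before sampling can be shown with a toy example (my illustration, not from the paper): any content above half the sample rate folds back in-band.

```python
import math

def tone(freq_hz, rate_hz, n):
    """n samples of a freq_hz sine taken at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

# Sampled at 8 kHz with no band filter, a 5 kHz tone is above the 4 kHz
# Nyquist limit and folds down to 8 - 5 = 3 kHz: its samples match a
# 3 kHz sine exactly (with inverted phase, for a sine).
aliased = tone(5000, 8000, 16)
in_band = tone(3000, 8000, 16)
```

Once sampled, the two are indistinguishable, which is why the filter has to come before the sampler rather than after it.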

Clive Robinson January 3, 2023 6:36 PM

@ Quantry,

“Isn’t charging inside a faraday bag from a battery a gapped-and-wired option?”

No…

That USB-C cable can be seen as a very very high bandwidth “gap crossing” path for both overt and covert signals and all the various types of side channel that brings into play.

So from an “Energy-Gapping” point of view the “faraday bag” is serving no purpose as the USB-C cable bypasses any protection it might have otherwise given.

In fact under certain circumstances the USB-C cable makes it worse than if there had been no faraday bag or cable originally (think conduction and radiation of energy impressed with information). That is, the cable acts as both a transmission line and an antenna…

Yuri January 4, 2023 5:26 AM

@ Clive Robinson

A technique by the way that is apparently still regarded as “secret” in the US

Any pointers to sources confirming that?

This is just basic transmission line theory, developed in the mid 19th century (when the first transatlantic telegraph cables were laid).

Quantry January 4, 2023 11:26 AM

@ Clive Robinson, Thanks. Re: charging inside a faraday bag:

Is

That USB-C cable can be seen as a very very high bandwidth “gap crossing”

still true if the battery and cable are both in the bag?

I should have specified. Cheers.

some other ideas:

‘https://mosequipment.com/products/mission-darkness-window-charge-shield-faraday-bag

Paul January 24, 2023 10:11 AM

Accelerometers can have a low-pass filter. Capping the accelerometer frequency response at, for example, 20Hz (a 0.05 second period) would block its use as a microphone.

Clive Robinson January 24, 2023 10:59 AM

@ Paul, ALL,

Re : Extra components cost.

“Accelerometers can have a low-pass filter.”

They can, and back last century they would have done. But in the world of cost-sensitive “Fast Moving Consumer Electronics”(FMCE), every component saving, especially of physically large high-value capacitors, is made where it can be.

For an accelerometer these days I’d design in “software integration” by polling it on a time loop and pushing that into an up/down counter with overflow prevention, or an accumulating ring buffer[1]. Many people would not even bother doing that…

[1] For those new to the I/O interfacing game, an accumulating ring buffer is a nice little trick that saves you a lot. For each analog signal input you have a ring buffer of, say, twenty bytes of memory. That holds the last twenty read signal values, which you add together to get either an increased value or an average. But because there is a fixed time between values, the response is based on a Z transform, so it becomes a low-pass filter, and changing the length of the ring buffer changes its cut-off frequency.

Obviously adding up all twenty values each time would be slow… the trick is to not do so, and to use an extra memory value to hold the accumulated value instead. So when you get an interrupt to take a new input you logically: first increment the ring buffer pointer, pull out the buffer value, and subtract it from the accumulator value. You then read the accelerometer, write it into the ring buffer, and add it to the accumulator value, the result of which is the new low-pass filtered output.

To get optimum efficiency and accuracy you move the logical steps around, and to get minimal interrupt time you can effectively double buffer by reading all the I/O values in the fast interrupt loop into their own single memory slot buffers, and then updating their individual ring buffers outside the fast interrupt loop in the “OS kernel” time slot. If an I/O device is not in use, then you don’t have to update its ring buffer, and you save time. If you want more complex filtering transforms, use say 128 memory locations to hold multiple ring buffers set to different lengths to change the Z values, and then use the accumulators with weighted values to pass down the line.
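The footnote’s accumulating ring buffer can be rendered as a short sketch (illustrative Python rather than interrupt-handler C; the class and method names are mine):

```python
class AccumulatingRingBuffer:
    """O(1)-per-sample moving average: a running accumulator avoids
    re-summing the whole buffer on every new reading."""

    def __init__(self, length=20):
        self.buf = [0.0] * length  # last `length` readings
        self.idx = 0               # ring pointer
        self.acc = 0.0             # running sum of buf

    def update(self, reading):
        # Subtract the oldest value, overwrite it with the new reading,
        # and add the new reading into the accumulator.
        self.acc -= self.buf[self.idx]
        self.buf[self.idx] = reading
        self.acc += reading
        self.idx = (self.idx + 1) % len(self.buf)
        # Accumulator / length is the low-passed output; a longer
        # buffer means a lower cut-off frequency.
        return self.acc / len(self.buf)
```

Feeding it a fast alternation (+1, −1, +1, …) averages to zero, while a slow drift passes straight through, which is exactly the low-pass behaviour described.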
