Detecting Words and Phrases in Encrypted VoIP Calls


Abstract: Although Voice over IP (VoIP) is rapidly being adopted, its security implications are not yet fully understood. Since VoIP calls may traverse untrusted networks, packets should be encrypted to ensure confidentiality. However, we show that it is possible to identify the phrases spoken within encrypted VoIP calls when the audio is encoded using variable bit rate codecs. To do so, we train a hidden Markov model using only knowledge of the phonetic pronunciations of words, such as those provided by a dictionary, and search packet sequences for instances of specified phrases. Our approach does not require examples of the speaker’s voice, or even example recordings of the words that make up the target phrase. We evaluate our techniques on a standard speech recognition corpus containing over 2,000 phonetically rich phrases spoken by 630 distinct speakers from across the continental United States. Our results indicate that we can identify phrases within encrypted calls with an average accuracy of 50%, and with accuracy greater than 90% for some phrases. Clearly, such an attack calls into question the efficacy of current VoIP encryption standards. In addition, we examine the impact of various features of the underlying audio on our performance and discuss methods for mitigation.

EDITED TO ADD (4/13): Full paper. I wrote about this in 2008.

Posted on March 24, 2011 at 12:46 PM • 36 Comments


Andy March 24, 2011 1:03 PM

Seems like numbers would be a top target for an attack like this. Seems like you could get some credit card numbers pretty easily.

Seiran March 24, 2011 1:04 PM

This is a known problem with many cryptosystems and has been discussed extensively as it applies to Tor. Bandwidth and timing analysis can be mitigated by adding padding and by queuing packets before transmission (or by adding random jitter to packet timing and size), but at the cost of reduced bandwidth or reduced performance.

In the case of Tor, it was determined that this was not worth it.

Seiran March 24, 2011 1:10 PM

@Andy: It wouldn’t work too well with numbers, as each number takes about the same amount of bandwidth to encode. It’s more useful in revealing phrases, for example, “I ripped the tags off the mattresses, but they’ll never catch me.”

A similar type of analysis is using printed word length to un-redact information from poorly redacted images. A somewhat similar type of attack is differential power analysis (DPA), which uses power consumption to infer data about the operation of integrated circuits. It’s like using the water pressure outside of a house to guess whether the occupants are cooking, showering or flushing the toilet based on known signatures.

Herbert March 24, 2011 2:24 PM

@Seiran: Yes, but aren’t credit card numbers normally said in groups of 4? It may be possible to pick up such groups, isolate the numbers, then pass them on to another process that would determine the actual numbers.

Bryan Feir March 24, 2011 2:34 PM

So, in other words, the data bit rate is a side channel that can carry non-trivial amounts of information about the actual content. A fairly specific form of traffic analysis. We discussed this sort of thing in one of my classes almost twenty years ago.

The obvious solution is to scrap the ‘variable’ part of the variable bit rate, and encrypt raw data instead.

Nick P March 24, 2011 2:56 PM

@ Bryan Feir

Well, there are two ways to handle this. The first approach is to use non-variable-rate vocoders, of which there are quite a few, in VoIP apps if COMSEC is a concern. The second approach is to protect at the IP layer with a technology like IPsec and fixed packet sizes. Numerous defense encryption systems transmit packets of a fixed size at fixed or narrow timing ranges to reduce the impact of traffic analysis. However, the easiest route is to just use non-variable vocoders, and quite a few VoIP apps/technologies support this.
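To make the fixed-packet-size idea concrete, here is a minimal Python sketch of padding variable-length codec payloads to one size before they are encrypted. The 64-byte frame size, the 2-byte length header, and the function names are assumptions made for illustration; they are not taken from IPsec, SRTP, or any particular product.

```python
import struct

FRAME_SIZE = 64  # hypothetical fixed payload size in bytes, chosen for illustration


def pad_frame(payload: bytes, size: int = FRAME_SIZE) -> bytes:
    """Prefix the payload with its 2-byte length, then zero-pad to `size` bytes."""
    if len(payload) > size - 2:
        raise ValueError("payload too large for the fixed frame size")
    header = struct.pack("!H", len(payload))
    return header + payload + b"\x00" * (size - 2 - len(payload))


def unpad_frame(frame: bytes) -> bytes:
    """Recover the original payload from a padded frame."""
    (length,) = struct.unpack("!H", frame[:2])
    return frame[2:2 + length]


# Three VBR-style payloads of different lengths (made-up sizes).
payloads = [b"a" * 10, b"b" * 28, b"c" * 46]
frames = [pad_frame(p) for p in payloads]
frame_lengths = {len(f) for f in frames}  # one value: every frame has the same size
recovered = [unpad_frame(f) for f in frames]
```

The padding has to be applied before encryption so that the ciphertext length no longer tracks the codec's output; the price is that every packet costs as much bandwidth as the largest one.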

pointless_hack March 24, 2011 3:24 PM

The news is not all bad! This means that the NSA can now mine Skype etc. for phrases like “Ib’n Allah” (supposed to be Arabic for “Let Allah’s will be done”), while Suzie retains privacy in discussing her yeast infection with her doctor. Further, we can remember that encryption still anonymizes ID, just as the lowly pay phone anonymizes the sender’s billing info.

Alex W March 24, 2011 3:29 PM

Who wants to take a guess whether these guys have been contacted by the feds for future wiretapping opportunities yet?

tz March 24, 2011 5:21 PM

I haven’t read the article yet, but I wonder what the number of false positives is.

It might be a plus if you already know the phrases to look for, but if a lot of other phrases map to the same compression pattern, it doesn’t really work. It is one thing to say that if I say X, it will report X 50% of the time. But if I say A, B, or C and it also reports that I said (or might have said) X, it isn’t that useful.
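The false-positive concern can be put in numbers with Bayes' rule. The sketch below is only an illustration: the 50% detection rate echoes the average accuracy quoted in the abstract, while the false-positive rate and the fraction of calls that actually contain the phrase are made-up values, not figures from the paper.

```python
def precision(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Fraction of reported detections that are real occurrences of the phrase."""
    true_hits = recall * prevalence
    false_hits = false_positive_rate * (1.0 - prevalence)
    return true_hits / (true_hits + false_hits)


# Hypothetical inputs: the phrase is detected half the time when present,
# falsely reported in 1% of other utterances, and present in 0.1% of utterances.
example = precision(recall=0.5, false_positive_rate=0.01, prevalence=0.001)
```

With these assumed inputs, fewer than 5% of the reported hits would be genuine, which is why the false-positive rate matters at least as much as the headline accuracy.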

DG March 24, 2011 6:37 PM

As Nick P says, using fixed-rate syllabic vocoders would obscure the data rate, as would any traffic flow security. You could stuff to a set rate closer to the maximum rate before encryption. See Techniques for Mitigation in the paper.

For those of us without access to the ACM the paper is available from one of the authors:

Clive Robinson March 24, 2011 6:43 PM

@ tz,

“but I wonder what the number of false positives are”

They may not be important.

If you assume that the call routing information is known then analysis of this would give likely / suspicious groups or clusters to home in on.

Thus you would not be looking at individual calls or phrases but at multiple calls with multiple phrases, from which a statistical analysis should weed out the false positives.

Nick P March 24, 2011 7:33 PM

@ DG

“for those of us without ACM”

I would recommend anyone who can afford it to get access to ACM, IEEE, and Springerlink (optional). I’ve had access to them for about a year now. I’ve learned more about high assurance systems engineering in a year with these resources than I did in three or four years without them. There are so many landmark papers, novel technologies, and useful protocols that are best found through ACM and IEEE.

For instance, it’s how I learned much about covert channel analysis, SilentKnock deniable port knocking, and several frameworks to prevent Layer 7 web attacks by design. There’s just so much in these publications that it pays to have access to them. I’ll have to dig up the name, but one student’s thesis was such a good summary of secure system concepts and designs it could have supplanted or replaced the orange book!

Richard Steven Hack March 24, 2011 9:35 PM

“Who wants to take a guess whether these guys have been contacted by the feds for future wiretapping opportunities yet?”

Who wants to take a guess whether the NSA invented this technique independently ten years ago and hasn’t told anyone?

One observation: the fact that one can detect phrases depends on the phrases one is looking for. If you aren’t looking for the right phrases, it’s useless.

Reportedly, the Echelon system already scans for certain keywords (and presumably phrases). If you’re using encrypted VoIP, presumably Echelon can’t detect this. If the computers in Echelon are programmed for this technique, it might increase Echelon’s hits from scanned encrypted VoIP.

BUT since Echelon is already a known factor in counterintelligence, any military, terrorist or criminal group worth their salt (which probably aren’t many) should already be avoiding using likely scanned for words and phrases in any electronic communication medium.

The probability is that code words would be used which would completely defeat this type of analysis (unless you know from other sources what the code words are – which probably means you have other sources that obviate this sort of analysis anyway.)

In short, it would seem this form of analysis can be defeated by the age-old electronic communication principle of using code.

So like most technical analysis, this technique is likely useful only against those lames who don’t really conceal the content of their electronic communications, but rely on technology to protect themselves. Of course, fortunately for LE and counterintelligence, that’s probably a lot of lames.

RobertT March 24, 2011 11:03 PM

Nothing new here… this is the reason that all secure comms systems first use a low bit rate codec and then embed the VoIP packet stream in a pseudo noise data stream of at least 10 times the VoIP packet rate. The VoIP packets are also randomly arranged in the sequence, but even this is insufficient for a determined adversary. Unfortunately all real fixes for this problem create a system that has so much delay it becomes impossible for the users to freely communicate, so they bypass the secure system….

Nick P March 24, 2011 11:20 PM

@ RobertT

Interesting comments. But which systems are you referring to? The systems I’ve seen seem to be a non-variable compression + good crypto protocol. This includes SCIP designs’ descriptions. I’ve looked at a lot of COTS and Type 1 systems and can’t say I’ve seen what you’re referring to. Care to enlighten us as to your sources?

Martin March 25, 2011 12:56 AM

I wish I could even use VoIP with encryption; none of my SIP providers offers encryption …

wernerd March 25, 2011 5:12 AM


VoIP encryption (the audio/video part) is not a feature of the SIP provider. You need to get a decent VoIP client that, for example, supports ZRTP/SRTP. If both sides use such a client that should do it.

Jitsi (formerly SIP Communicator), Twinkle, and CSipSimple for Android are examples of such clients.

RobertT March 25, 2011 6:18 AM

@Nick P
I was talking about some older systems over 20 years ago (not US) that had poor crypto, so they tried to hide the real voice packets and allow for non-constant rate codecs by substituting into the constant random bit stream.

With good Crypto you only need a constant data rate, assuming you ignore all the other side channel issues, which we have discussed at length before.

As I’m sure Clive will point out “voice is hard to properly encrypt”

Winter March 25, 2011 6:22 AM

Who is interested in what is said?

Anyone even remotely into shady business knows you should use code phrases. “The yellow packet has arrived”.

The only interesting thing is who talks to whom, and for how long and how often.

But I like it for the sheer hack. Way to go!

Clive Robinson March 25, 2011 7:34 AM

@ RobertT, Nick P,

“As I’m sure Clive will point out ‘voice is hard to properly encrypt’”

That’s because it is 8)

Interesting historical note: the better voice compression algorithms (CELP etc.) were developed by the NSA but nearly didn’t happen.

It has been mentioned before that the NSA releasing CELP for use in the North American Cellular Phone Network (and subsequently other networks and more recently various VoIP coders) was a way to get an algorithm with “hidden attributes” that the NSA could exploit into worldwide use.

Whether it is true or not does not really matter, because any moderately complex system is going to have “side channels”.

And let’s be honest, human speech is almost all side channels to help with “error correction”; it’s why we can understand it in extremely noisy environments with comparative ease.

And it is also why LPC / CELP work so well. Sadly, though, we still have not really taught computers to listen (that is, they still cannot easily recognise random word lists from unknown random speakers, unlike most humans).

Nick P March 25, 2011 11:04 PM

@ Clive Robinson

My own limited experience in the AI field tells me the speech recognition issue centers on the fact that so many phrases are context sensitive, with an environmental, situational and language-level context. So-called “common sense” and innate knowledge in people are also required to differentiate between certain phrases and spoken words. MIT’s Common Mind and the Cyc project try to encode tons of knowledge to account for this, but the brain is a neural net and that already tells me how likely those projects are to succeed. Read: nature usually knows best about how to build extremely complex machines.

As for the crypto, most bit level side channels are removed by fixed size and timing transmission. It’s inefficient. However, it allows arbitrary application layer data to be sent privately over the secure connection. I see no alternatives at present that I have similar confidence in. So, I tell them: just shut up and foot the bill for the bandwidth or pick a secure application instead.
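A rough Python simulation of the "fixed size and timing" discipline: one cell of fixed size leaves on every tick, and a dummy cell fills the slot whenever no real data is queued. The cell size, the tick model, and the function name are assumptions for illustration only; in a real system the dummy cells would be encrypted like everything else so that an observer cannot tell them apart.

```python
from collections import deque
from typing import Deque, List

CELL_SIZE = 32  # hypothetical fixed cell size in bytes
DUMMY = b"\x00" * CELL_SIZE  # cover traffic used when nothing is queued


def constant_rate_stream(arrivals: List[List[bytes]]) -> List[bytes]:
    """Emit exactly one fixed-size cell per tick.

    `arrivals[t]` lists the cells that become ready during tick t. A queued
    cell is sent if one is waiting; otherwise a dummy cell fills the slot.
    """
    queue: Deque[bytes] = deque()
    sent: List[bytes] = []
    for ready in arrivals:
        for cell in ready:
            if len(cell) != CELL_SIZE:
                raise ValueError("cells must already be padded to CELL_SIZE")
            queue.append(cell)
        sent.append(queue.popleft() if queue else DUMMY)
    return sent


real = b"\x01" * CELL_SIZE
# Bursty input: two cells at tick 0, nothing at ticks 1 and 2, one cell at tick 3.
stream = constant_rate_stream([[real, real], [], [], [real]])
```

The bursty input comes out as a steady one-cell-per-tick stream, at the cost of queuing delay for bursts and wasted bandwidth during silence, which is the inefficiency described above.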

Clive Robinson March 26, 2011 11:06 AM

@ Nick P,

“My own limited experience in the AI field tells me the speech recognition issue centers on the fact that so many phrases are context sensitive, with an environmental, situational and language-level context”

Yes, and most working speech recognition software works at this level.

What I was talking about was the lower level of being able to recognise individual words spoken by different people with widely ranging accents whilst in the presence of either high levels of noise or equivalent levels of other speech etc.

The human mind still appears to be significantly far in advance of computers provided the listener is sufficiently fluent in the language.

It is interesting to note that when the listener is not fluent in the language they generally “hear” a word from their own or another language they are fluent in.

Oddly though computers appear to have fewer problems with “pitch perfect” languages. Experiments in recognising Japanese and Thai words appear to have greater accuracy than with languages that are pitch independent. I’m not sure what the current state of play is as it’s been a while since I worked in voiced automated systems that were used for phone based IVR.

In simple IVR you are generally only looking to recognise 0-9/yes/no in an unknown speaker and not speech in any kind of context.

For more complex IVR you need to start looking not just for word recognition but also some context to improve on the word recognition, and to a limited extent “learn the speaker”.

One simplistic way to learn the speaker is to get the caller to say the phone number they are calling from and do a direct compare and build a simple series of Hidden Markov Models (HMM) based around the expected phonemes.

This is sometimes implemented by a very similar method to the HMM models used to solve the “alphabet soup” issue of automated “hand writing” recognition and reading. Google around for “Maximum Mutual Information Estimation of Hidden Markov Model Parameters” with either “speech recognition” or “text recognition” or google for “discrete utterance speech recognition” if you want to know more on the guts of the process.

Eventually you need to switch to almost fully context sensitive recognition when you get to the point of “natural speech recognition” for things like “audio typing” software (such as “Dragon NaturallySpeaking / Dictate” etc).

All of these systems require a lot of CPU power and even today speech recognition software can max out the CPU on business level PCs.

A note to budding and established authors (Bruce et al.): these products are actually quite good for putting in raw text; however, don’t look at the sentence until you have finished saying it, as the “word dance” on screen can be quite distracting and cause your natural speech rhythm to break up. (I have used it in the past to quote directly from books and magazines etc. when “sitting comfortably”.)

With regards,

“As for the crypto, most bit level side channels are removed by fixed size and timing transmission. It’s inefficient. However, it allows arbitrary application layer data to be sent privately over the secure connection. I see no alternatives at present that I have similar confidence in.”

Yup, most side channels “on the wire” are time based (currently), with a few based directly on “Power Spectrum” or its equivalent in the domain of operation.

To this end, the “clock the inputs and clock the outputs” advice limits the time-based side channels, and the likes of “packet stuffing” and “rate limiting” fix the equivalent power spectrum issues if done correctly, as does “data whitening” with a nonlinear PRBS.

Nick P March 27, 2011 1:35 PM

@ mbt005

From what my English mind can gather, you’re discussing a high school, emails, facebook, social networking in general, and some adVERTISEMENTS might be involved. This can’t be good. Count me out.

Jean-Marc Valin March 27, 2011 2:01 PM

I’m the author of the Speex codec that is evaluated in this paper. While it’s true that VBR leaks some amount of information, I think the issue has been blown out of proportion. I’ve worked for some time in speech recognition in the past. Recognizing conversational speech when you have the audio is already a pretty tough thing to do reliably. So trying to use VBR to spy on a real conversation is just not possible.

So what is shown in the paper is recognition on a very restricted/constrained vocabulary, for example a user saying one of N sentences, with N not too large and the sentence long enough to contain enough information. So for a situation where one of the speakers can only say a few things, there may be a risk. On the other hand, I think credit card numbers (~10^14 possibilities if you remove redundancy) are pretty safe as there will not be enough information to disambiguate all the possibilities. In practice, I think the only real danger is for IVR applications where pre-recorded prompts are played. By calling the IVR, an attacker can have direct access to the VBR pattern of all prompts, so it would be possible to get a 100% recognition rate on prompts.
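The IVR case can be sketched as a simple lookup, on the assumption that a pre-recorded prompt always encodes to the same sequence of packet sizes. The prompt names and byte counts below are invented for illustration, and exact matching is a deliberate simplification, not the HMM-based method the paper evaluates.

```python
from typing import Dict, List, Optional


def match_prompt(observed: List[int], known: Dict[str, List[int]]) -> Optional[str]:
    """Return the name of the prompt whose packet-size sequence equals `observed`."""
    for name, sizes in known.items():
        if sizes == observed:
            return name
    return None


# Made-up packet sizes (bytes) that an attacker might record by calling the IVR.
known_prompts = {
    "enter_pin": [38, 46, 46, 28, 20, 38],
    "goodbye": [28, 38, 20, 20],
}

# Sizes observed on an encrypted call; the payload is hidden but the lengths are not.
guess = match_prompt([28, 38, 20, 20], known_prompts)
```

Live speech never repeats this exactly, which is why the same trick does not carry over to real conversation.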

Colin Perkins and I have actually written an Internet draft to describe what is OK and what isn’t when it comes to encrypted VBR streams: Comments welcome.

Nick P March 27, 2011 9:50 PM

@ Jean-Marc Valin

Thanks for chiming in. I appreciate your efforts to develop open-source codecs. I thought about including your algorithm in one design, but then noted the VBR property. I had to choose an alternative. I think my main objection is best phrased as a question: “Why use VBR if non-VBR methods can be employed that easily meet both functional and security requirements?”

If the codec is leaking information by design, it’s insecure by design as far as a privacy technology is concerned. The whole point of encryption is concealing data. Anything that leaks data even through the encryption process must go. We need an open-source, patent-free non-VBR codec that can transmit at GSM data call speeds. If anyone knows of one, I’d like to hear it. Jean-Marc, could Speex be modified to work non-VBR?

Marian Kechlibar March 28, 2011 9:40 AM

I am a co-developer of a mobile VoIP program that supports SRTP / ZRTP.

The issue with guessing phrases through variable bitrate codecs has been known for some time.

Nevertheless, the vast majority of present-day VoIP codecs (Speex, AMR, iLBC, G.729) use constant-bitrate encoding by default. I am not aware of any current VoIP application that uses variable bit rate by default.

Marian Kechlibar March 28, 2011 9:44 AM

Nick P., why limit yourself to “GSM data call speed”? GSM data call is a dying technology, and latencies are awful. Do not use GSM data call as any kind of standard; it is antiquated, with support on modern devices dwindling.

You can get absolutely perfect sound with AMR at 12200 bps and Speex at 11000 bps, which is a very modest bandwidth requirement. Most mobile devices do not have speakers of excellent quality. Even for me, and I am a trained singer with quite good relative pitch etc., it is hard to distinguish, say, Speex at 8000 bps from 11000 bps on a typical Nokia smartphone.
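As a quick check of what those bitrates mean on the wire, the sketch below converts a constant bitrate into bits per frame, assuming the 20 ms frame duration used by both AMR and narrowband Speex. The function name is just for illustration.

```python
def bits_per_frame(bitrate_bps: int, frame_ms: int = 20) -> float:
    """Size of one encoded frame, in bits, at a constant bitrate."""
    return bitrate_bps * frame_ms / 1000


amr_bits = bits_per_frame(12200)    # AMR at 12200 bps
speex_bits = bits_per_frame(11000)  # Speex at 11000 bps
speex_low = bits_per_frame(8000)    # Speex at 8000 bps
```

At a constant bitrate every frame comes out the same size, so the packet length tells an eavesdropper nothing about what was said.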

Nick P March 28, 2011 4:30 PM

@ Marian Kechlibar

“why limit yourself to GSM data call speed?”

Because an existing product with tons of users works over GSM and that makes a switch easier if it doesn’t require a hardware upgrade. It’s also a nice backup option if high performance networks aren’t available, but something like GSM data line is. I was just saying the codecs should support networks of various speeds, including an option for slow networks. I was also thinking about dialup and cheap satellite modem connections when I wrote that.

Wanderer March 29, 2011 4:29 PM

For those interested in SCIP, which is the foundation of a number of international government secure voice crypto systems, the signalling standard (SCIP-210) for this was made public for the first time last week on the IAD website.

On topic: SCIP uses fixed rate vocoders (MELPe and G.729) and hence avoids the issues in the original paper.
