Eavesdropping on Encrypted Compressed Voice

Traffic analysis works even through the encryption:

The new compression technique, called variable bitrate compression, produces different size packets of data for different sounds.

That happens because the sampling rate is kept high for long complex sounds like “ow”, but cut down for simple consonants like “c”. This variable method saves on bandwidth, while maintaining sound quality.

VoIP streams are encrypted to prevent eavesdropping. However, a team from Johns Hopkins University in Baltimore, Maryland, US, has shown that simply measuring the size of packets without decoding them can identify whole words and phrases with a high rate of accuracy.
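
To make the idea concrete, here is a toy sketch in Python. It is not the researchers' actual method: the packet sizes, the phrase template and the function name `best_match` are all hypothetical, and a real attack would need a statistical model rather than a single fixed template. It only shows that a sequence of packet lengths, which encryption leaves visible, can be compared against a known pattern.

```python
def best_match(observed, template):
    """Slide a packet-size template over the observed sizes.

    Returns (offset, mean absolute difference) of the closest window.
    Only packet lengths are used; the encrypted payload is never read.
    """
    best_offset, best_cost = None, float("inf")
    for offset in range(len(observed) - len(template) + 1):
        window = observed[offset:offset + len(template)]
        cost = sum(abs(a - b) for a, b in zip(window, template)) / len(template)
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset, best_cost


# Hypothetical packet sizes (bytes) seen on the wire, and a hypothetical
# size pattern previously recorded for one target phrase.
observed_sizes = [20, 20, 38, 52, 47, 30, 20, 20]
phrase_template = [40, 50, 46, 28]

offset, cost = best_match(observed_sizes, phrase_template)
print(f"closest window starts at packet {offset}, mean size error {cost}")
```

With a constant bitrate codec every entry in `observed_sizes` would be identical, and there would be nothing for the template to match against.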

The technique isn’t good enough to decode entire conversations, but it’s pretty impressive.

Posted on June 19, 2008 at 6:27 AM · 50 Comments


Fred June 19, 2008 7:06 AM

If even one word can be identified, it is not secure.

What would be a way to counter this attack? Maybe playing music in the background?

Clive Robinson June 19, 2008 7:22 AM

@ Bruce,

Just remember most of the best voice to data codecs were designed (and in some cases patented) by the NSA.

That aside the first four rules of TEMPEST are,

1) bandwidth (low as needed).
2) Energy (low as needed).
3) clock the inputs at fixed rate.
4) clock the outputs at fixed rate.

One of the first rules when it comes to preventing traffic analysis is keep the rate and volume of communications constant.

When it comes to voice analysis it is surprising how much information is in the envelope and not in the spectral content. The envelope kind of defines the bit rate for many high compression codecs, with spectral content being replaced by band filtered white noise.

So the moral is once again security and efficiency are at opposite ends of the balance…

RonK June 19, 2008 7:26 AM


I would think that it would suffice to randomly add a certain percentage of dummy packets and random-length dummy padding at the ends of the real packets.

Totally eliminating compression is kind of like using OTP. It maximizes security at the cost of lots of convenience.

Nicholas Weaver June 19, 2008 7:30 AM

Fred: The only way to reliably counter the attack is to waste bits: Compress at a constant rate rather than a variable rate.

Anatoly June 19, 2008 7:37 AM

@Fred: But if you don’t compress voice, the attacker can analyse the entropy of the encrypted stream (with a compressor, etc.). So a strong salt is needed.

Anonymous June 19, 2008 7:47 AM

@ RonK,

“it would suffice to randomly add a certain percentage of dummy packets and random-length dummy padding at the ends of the real packets.”

Sorry no cigar, it won’t work in the way you think it will.

On the assumption “your opponent knows the system” they will have characterised the behaviour of the codec you are using and your manner of speaking.

Most codecs are not simply variable bit rate, they are also variable in timing between packets as well as a number of other key areas.

Simply adding packets will not break up the expected time relationship and therefore it will be possible to identify (some of) them. Also using probabilistic algorithms will help find sufficient anomalies for most to be removed (especially if real time recovery is not required).

Likewise there are other relationships that need to be disguised as well. The effort involved with faking it well enough is going to be greater than simply using a fixed rate codec.

Nigel Sedgwick June 19, 2008 8:00 AM

The problem here is that encryption has not been applied to the whole signal for, as explained by Clive Robinson (at June 19, 2008 07:22 AM), the effective output clock rate is not held constant (and here ‘purposefully’ contains information on the signal content).

Presumably this variable rate speech coding has been done to reduce the overall (digital) bandwidth of the signal. Again, as Clive states, this is in direct contravention of some pretty well known practical aspects of encryption.

I would like to raise one further point, that might be of benefit to those looking at lower-rate speech coding/encryption, who (obviously like the ‘inventors’ of the reported scheme) lack sufficient familiarity with the issues.

That aspect is that it is reasonably practical to introduce variable rate coding of speech, though at the ‘expense’ of imposing some minimum delay on the signal. By averaging the coding rate to a fixed value over some period, just about all the benefits of reduced bandwidth are made available. The sort of timescale over which such ‘rate averaging’ needs to be done is dominated by the syllable rate of the speech (typically around 4 syllables per second, though not constant).

By delaying the speech signal by 1 or 2 syllables, before transmission, it is practical to force the digital bandwidth (and hence transmission clock rate) to be constant. Such an approach gives almost all of the benefit of a variable TX clock rate without the risk of weakening the strength of the encryption.
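
A minimal sketch of this kind of rate averaging, in Python. Everything here is an assumption made for illustration: the class name `ConstantRatePacketizer`, the one-byte length prefix (which limits frames to 255 bytes), and the use of zero bytes as padding. The fixed-size packets it produces would be encrypted before sending, so an observer sees the same size at the same interval whatever was said.

```python
class ConstantRatePacketizer:
    """Buffer variable-size codec frames and emit fixed-size packets."""

    def __init__(self, packet_size):
        self.packet_size = packet_size
        self.buffer = bytearray()

    def push(self, frame):
        # Each frame is stored as <1-byte length><frame bytes>.
        if not 1 <= len(frame) <= 255:
            raise ValueError("frame must be 1..255 bytes")
        self.buffer.append(len(frame))
        self.buffer += frame

    def tick(self):
        # Called at a fixed interval: always returns packet_size bytes,
        # padded with zero bytes when the buffer runs short.
        chunk = bytes(self.buffer[:self.packet_size])
        del self.buffer[:self.packet_size]
        return chunk + bytes(self.packet_size - len(chunk))


def unpack(stream):
    """Recover frames from the concatenated (decrypted) packets."""
    frames, i = [], 0
    while i < len(stream):
        length = stream[i]
        i += 1
        if length == 0:           # a padding byte, skip it
            continue
        if i + length > len(stream):
            break                 # frame not complete yet, wait for more data
        frames.append(stream[i:i + length])
        i += length
    return frames


packetizer = ConstantRatePacketizer(packet_size=8)
packetizer.push(b"abc")        # short frame, e.g. a simple consonant
packetizer.push(b"defghij")    # longer frame, e.g. a complex vowel
packets = [packetizer.tick() for _ in range(3)]
print([len(p) for p in packets], unpack(b"".join(packets)))
```

The cost is the delay described above: the packet size and tick interval have to be chosen so that the codec's average output fits, otherwise the buffer (and the delay) keeps growing.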

However, such a long delay (250ms to 500ms) has, historically, been viewed as unacceptable for full-duplex conversation. Those using satellite phones have experience of this sort of delay. To a lesser extent, we see (hear) such delay in TV broadcasts from remote places, where the reporter on the spot is asked a question, and we see him/her start to answer over 250ms later.

Perhaps the solution is for VOIP channel suppliers to provide explicitly discounted prices for the use of variable-rate transmission, thus passing on some of the savings they make in their IP network costs. Alternatively, they might just rely on greater processing power (and so greater terminal equipment cost) to encode the speech at a somewhat lower rate (without significant loss of quality).

Best regards

Lewis Donofrio June 19, 2008 8:03 AM

“Informal conversational speech would be tougher because it’s so much more random.”

–Lesson here is “KISS” Keep it short and sweet (nothing complicated)

Clive Robinson June 19, 2008 8:11 AM

Just a thought,

If the designers of the system were not aware of the fairly well known (or at least I thought it was) problems of not using fixed data rates, what else have they missed?

Perhaps their implementation of the encryption could be deficient as well. Perhaps not as bad as “code book” but still leaving enough information to make other attacks that do not need the key to be found…

As I said just a thought “code review” anyone 8)

foo June 19, 2008 8:18 AM

This attack class isn’t new, wasn’t there a similar problem with SSH at one point?

nerdboy June 19, 2008 8:44 AM

Would speaking in a language other than English have much of an effect? I assume the software has to be considerably tweaked for languages which utilise phonemes much different from English.

Paeniteo June 19, 2008 9:03 AM

@Anatoly: “But, if you don’t compress voice, the attacker can analyse entropy of encrypted stream (by compressor, etc.)”

An encrypted stream typically cannot be compressed anymore at all and will pass pretty much any test for total randomness – no matter what the entropy of the original material is. That is, as long as the used cipher is any good.
E.g., if you use AES, it does not matter whether you encrypt a file full of random numbers or all zeroes: Both will look totally scrambled after encryption.
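
A rough way to check this using only the Python standard library. The standard library has no AES, so this sketch substitutes a toy SHA-256 counter-mode keystream as a stand-in for a real cipher. That substitution is an assumption made purely for illustration and is not something to deploy. `zlib` plays the part of the attacker's compressor.

```python
import hashlib
import os
import zlib


def keystream(key, length):
    # Toy keystream: SHA-256(key || counter), concatenated.
    # A stand-in for a real cipher such as AES in counter mode.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key, plaintext):
    stream = keystream(key, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, stream))


key = os.urandom(32)
zeros = bytes(65536)                 # plaintext with no entropy at all
random_data = os.urandom(65536)      # plaintext that is already random

for name, plaintext in (("all zeroes", zeros), ("random", random_data)):
    ciphertext = encrypt(key, plaintext)
    print(name,
          "plaintext compresses to", len(zlib.compress(plaintext, 9)),
          "ciphertext compresses to", len(zlib.compress(ciphertext, 9)))
```

The all-zero plaintext shrinks to almost nothing, while both ciphertexts stay at roughly their original 65536 bytes, so the compressor learns nothing about which plaintext was which.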

Fred June 19, 2008 9:53 AM

Most of the proposed solutions presume that the user of VOIP is able to alter the inner workings of the program. What about the casual user or the non-programmer? Is there some simpler way to obtain increased security?

Edward Palonek June 19, 2008 10:11 AM

VOIP is the most insecure method of communication; I am surprised how many people do not even realize that it is a big problem. With automated signal processing and voice recognition, a hacker can penetrate a server/router at a large ISP and route VOIP traffic through a hacked server. Software can be used to pick up key words from thousands of conversations in real time. These keywords can be account numbers, passwords, and so on. Palonek @ http://www.paloneks.ca/

Zarray June 19, 2008 10:25 AM

To eavesdrop, they need first access to the local network, so if you secure your network, there shouldn’t be any problem, right?

Clive Robinson June 19, 2008 10:25 AM

@ nerdboy,

No, speaking a different language is unlikely to affect the outcome (once you have built your dictionary). In most cases the human vocal tract works in exactly the same way for the majority of people.

I cannot remember the exact details but most languages use a subset of something like 90 phonemes. There are exceptions such as Finnish and languages that use things like clicks produced in the mouth only.

It was noted in the 1920s that in many cases the language of a cipher could be determined reliably without actually knowing the actual plaintext. Basically all languages have statistical signatures, be they written or spoken; if your system allows that to be analysed then the rest follows.

Anatoly June 19, 2008 10:30 AM

@Paeniteo: Entropy can be evaluated not only by eye. It can also be evaluated by combinatoric patterns. I think that the entropy of voice is much greater than the entropy of cipher blocks. When you encrypt plain text, the redundancy is much less than in voice.

An attacker can analyse encrypted voice to detect where you are (subway, plane), or to identify a woman or a man, etc.

quodum June 19, 2008 10:48 AM

@Anatoly “An attacker can analyse encrypted voice to detect where you are (subway, plane)…”

Would you care to elaborate? Because currently it sounds like you don’t know what you’re talking about.

greg June 19, 2008 10:48 AM

I was under the impression that quite a lot of the VoIP codecs are constant bitrate. For RTP reasons mainly….

ie I use Speex in my game. Fixed packet sizes fixed rate (each packet decodes to a fixed length of decoded sound). I didn’t check jitter however….

Paeniteo June 19, 2008 11:17 AM

Anatoly, are you sure that you know how cryptography works?

Any serious modern algorithm is able to completely hide the relation between plaintext and ciphertext.
As a consequence, there is also no relation between the entropy of both pieces.

Anonymous June 19, 2008 11:29 AM

@ quodum,

“Would you care to elaborate? Because currently it sounds like you don’t know what you’re talking about.”

Anatoly is probably correct.

The reason is as discussed above due to the variable rate coding depending on the envelope of the sound.

Effectively the envelope controls the bit rate and this is not encrypted, only obfuscated at best.

Therefore if you have a suitable statistical model for the way the sound is effectively modulated by the tunnel etc. then the statistics will be visible in the statistics of the codec bit rate output.

There are a whole load of maybes due to the effective “side channel” bandwidth of the codec output rate, but if the statistical properties you are looking for are less than half the effective “side channel” bandwidth then you have nailed the critter to the wall (of the tunnel 8).

FNORD June 19, 2008 11:35 AM

What Lewis said, keep it short. And don’t use VoIP for sensitive information.

It’s not entirely clear from the article, but screwing with your enunciation might make it harder too. But I wouldn’t count on that.

Clive Robinson June 19, 2008 11:40 AM

Sorry the above anon post to quodum was from me.

I’m posting from my mobile phone and it has only a small screen and cannot handle the “Movable Type” responses very well (ie it always produces an error message 8(

Clive Robinson June 19, 2008 11:54 AM

Just a thought on how you can improve the system.

Back in the 90s a system was developed whereby a computer could analyse a piece of music and, by looking at certain statistical properties, could reliably decide if the piece had been written by a composer “known to it”.

This idea was further developed to deal with the written word, and various pieces of writing could be accurately attributed to a known author.

In essence the way we think comes out in the words, phrases and timing of the way we speak.

Therefore it should be fairly simple to analyse a known “target” individual’s spoken idiosyncrasies and produce a system tuned to them.

This tuned system would vastly improve the recovery rate of one half of the conversation. If this information was adaptively fed back into the engine analysing the other half of the conversation then the context sensitive nature would improve the recognition.

I suspect that the NSA etc are well ahead of the game in this sort of “context engine” so it may already be a viable and running system…

quodum June 19, 2008 11:55 AM

@Clive, Anatoly

Oh, I agree that in case of VBR such observations can be made. It’s just that I interpreted Anatoly’s comment as pertaining to the voice encryption in general.

And my previous comment appears to be harsher than I intended. My apologies.

derf June 19, 2008 12:00 PM

If you want a “secure” conversation under this scheme, just have a TV or radio on in the background – problem solved.

Davi Ottenheimer June 19, 2008 12:05 PM

That is indeed impressive, but the problems with VoIP I have found usually are far less interesting:
– Vendors leave systems (e.g. unpatched Windows 2000 servers running SQL) totally insecure, expecting customers to figure out how to harden them on their own. Server teams don’t want to touch them.
– Telecom management is driven to demonstrate cost savings and rapid deployment, so they have incentive to drop all controls (e.g. the CEO wants some VoIP widget to work at his homes in Sydney and London and doesn’t care when/how it gets done as long as he can use it immediately)

…and so forth. It would be cool if breaking the encryption of VoIP were something necessary at this point, but the sad fact is that Telecom seems to be re-learning old lessons as they move onto a shared/public network. A standard PBX manager will usually have a ton of war stories and be very cognizant of risk on private line systems (often due to actual toll-fraud incidents) but install VoIP and suddenly management lets basic security fly out the window.

Davi Ottenheimer June 19, 2008 12:38 PM

“Would it help to speak in poetry?”

It always helps to speak in poetry, although more traditional forms like iambic pentameter are easier to decode than prose.

The problem uncovered here is that compression gives the key to decode, so you need to use forms of poetry that lack predictability, or use nothing but long complex sounds:

“How now brown cow…”

This brings a haiku to mind:

Compress your bitrate
And expose the key to sound;
VoIP flows insecure.

Nomen Publicus June 19, 2008 12:38 PM

This is, of course, why cypher texts are broken up into five character blocks.

Den June 19, 2008 1:12 PM

A good solution may be to speak some foreign language with an accent. I think in this case the pattern will be very different.

bob June 19, 2008 2:40 PM

For compressing & encrypting inter-switch links (rather than the loop going from a switch to a single subscriber which will probably still be analog over copper for some time to come) they could multiplex several conversations into a single standard-length packet which would not only save bandwidth it would add obfuscation because any given conversation’s phonemes would be interspersed with those of other conversations; which would rapidly increase the required domain size for decrypting to ridiculous levels, yet still maintain the low delay needed for voice.

bob June 19, 2008 3:20 PM

Playing music in the background would interact with the phonemes of speech to cause many more unique audio signatures; which would in turn: (a) Decrease interceptibility and (b) Decrease compressibility. So it would eliminate the value of the compression in the first place; may as well just switch it off.

Clive Robinson June 19, 2008 3:48 PM

@ Nomen Publicus,

“This is, of course, why cypher texts are broken up into five character blocks”

Sorry no, it was the telegraph companies and the fledgling organisation that became the ITU.

They originally said no to codes due to the many problems it caused, not least of which was what to charge.

The lowest common denominator turned out after suitable negotiation to be five random alphas or numbers being equal to a word.

Hen June 19, 2008 4:08 PM

“A good solution may be to speak some foreign language with an accent.”
@ Den

Using compression with constant rate (like GSM) is probably much simpler.

Dave B. June 19, 2008 5:23 PM

I may be being incredibly dim or misinformed here, but it seems to me that many people are confusing the endpoints with the transport network, or at least may be discussing different issues from one another.

1) If you’re carrying voice over a public network, whether that be a POTS, SIP, H.248 or whatever… you do not control how that stream is encoded (with exceptions as shown below). These codecs (e.g. G.711 and G.729) are not encrypted and can be played back relatively simply using a popular brand of packet sniffer.

1a) You would therefore need to introduce your own encryption system that translates your voice into an encrypted stream: this could either be an analogue output, that then has to survive the vagaries of sampling and unknown compression systems in between, or it could output digitally, send T.30 tones and bypass any intermediate compression entirely (cf. T.38).

2) If you’re terminating the VoIP streams on your own system, e.g. SIP, but still carry the traffic via a public system, whether that be PSTN or IP, then I see a few alternatives:

2a) The system basically encodes your voice into an encrypted stream then negotiates the call normally and packetises it accordingly. Functionally this is identical to 1a), if the encryption system in 1a) were connected to a VoIP ‘phone system.

2b) While you could implement a codec in your VoIP client that encrypts your voice stream, you should probably make sure that the endpoints can’t slip back into using unencrypted codecs, which would mean that the client could only be used to call compatible endpoints; and you would still probably have to maintain control of your own proxies, gateways, registrars et cetera so that you can control the permissions for codec negotiation, where applicable.

2c) You have encrypting endpoints that communicate point-to-point over IP.

3) If you control the entire system, then any of the above will work, plus you can throw in additional security layers.

Anyhow, the real issue is that, whichever system is chosen, the encryption system has to avoid the interesting issue that Bruce mentioned, either by not using compression or modifying the latency and using padding to compensate.

If you’re having to do a mixture of Cockney and Yorkshire accents so that you can unexpectedly drop vowels and consonants at will, while playing the 1812 overture in the background, then I think that that’s an indicator that your encryption system has been poorly designed.

Tim June 20, 2008 9:26 AM

I agree with Dave B.; adding pseudo-random noises like the 1812 overture indicates a poorly designed encryption system. Would it not also require a noise:signal ratio sufficiently high as to defeat the purpose of VoIP?

Bob W. June 20, 2008 11:38 AM

Why would anyone bother to use variable bit rate encoding for transmission of a single stream of conversational human speech? As noted by Nigel Sedgwick, what you gain on the swings you lose on the roundabout: the advantage in bandwidth used is offset by the impairment of conversation caused by additional delay in the speech encoding mechanism.

Even G.711 telephone-quality voice operates at only 56 or 64 kilobit/s, and more aggressively compressing voice codecs could bring the bitrate down to ca. 12 kilobit/s when I was last paying attention, 10 years ago. At that time the main mechanism for varying the bitrate was, I think, silence suppression: for sampling intervals in which no speech was detected a special code was transmitted which told the receiver to fake the sound of silence (background noise) for user comfort. In the context of available network speeds with anything but an analog modem, making bandwidth variable in the range 4 to 12 kilobit/s seems unlikely to be useful.

In the context of trunked calls I suppose there is some argument for VBR as the number of bits from multiple voices grows large enough to make the savings worth the effort. So long as encryption were applied to the whole set of trunked voice streams (i.e. put all voice packets in one large UDP datagram and encrypt the whole shebango) there would be little loss of security.

Presumably this concern applies mostly to transmissions on very limited bandwidth channels where conversational comfort is not a priority. Perhaps most such calls have military or governmental purposes, so security is of even greater importance? One hopes that the cryptographers designing those communication systems will think of the VBR consideration early in the design.

Bob W. June 20, 2008 11:49 AM

Regarding the approach of obfuscating the speech codes by playing music (or other noise) in the background, that seems unlikely to work.

Since speech codecs are designed around the sound characteristics of the human vocal mechanism, coding background sounds into the digitized stream results in (experts please be patient with me on this non-expert explanation) a reconstructed sound that includes only the parts of the background which resemble human speech and the result could be quite garbled. As a result, resiliency in the face of impairment due to interfering noise is an important feature of a speech codec and testing for accurate reproduction of speech (and exclusion of background noise input from the coded stream) is part of the design process.

Even more so, in a variable bit rate codec one would want to exclude non-speech input as it would tend to increase the bit rate of the coded stream if one were to include it, rather defeating the purpose of using VBR in the first place.

The problem seems to come into effect at the level of preparing the cleartext for encipherment: it has to be done in a way that doesn’t leave important features of the cleartext visible in the ciphertext. Munging the cleartext will have limited usefulness: a clever eavesdropper will be looking for key phrases expressed in Pig Latin, etc.

Alex Ponebshek June 20, 2008 10:49 PM

I think I have a solution that would cost in latency but not in bandwidth.

What if the voip system buffers compressed audio in order to normalize packet sizes? The size chosen here would represent a delay in the receipt of audio information, but it would also force the granularity with which the attacker could analyze the compression. If my voip client sends a large packet every two seconds, then the attacker may be able to figure out sentence flow and who is talking, but would probably not be able to work out words. The cost, of course, is that you hear me two seconds after I say things.

I think that this would be better than using chaff, because the compression is essential to allowing real-time communication without sounding like crap. I’d prefer clear, delayed audio to laggy, skippy, low quality, realtime audio.

Clive Robinson June 21, 2008 1:12 AM

@ Alex Ponebshek,

“What if the voip system buffers compressed audio in order to normalize packet sizes?”

Well as the underlying codec is “variable bit rate” several things would change, some for the better, some for the worse (I’ll start with the benefits),

Firstly the aggregation of bits into larger packets is extremely desirable as the “data content” to “IP datagram” size ratio is extremely poor with small data packet sizes. Therefore the actual “efficiency” of reducing the data packet size by 50% may only reduce the datagram size 10%. It is one of the trade offs that VBR carried over “other protocols” (UDP/IP TCP/IP etc) pundits appear to be unaware of or deliberately ignore. Further due to the fact that IP datagrams have a maximum size (MTU=1500 on most networks) the “efficiency” relationship shows a decaying sawtooth relationship.
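
The overhead arithmetic can be sketched in a few lines of Python. The figures are assumptions chosen for illustration: voice carried as RTP over UDP over IPv4 with minimal headers (12 + 8 + 20 = 40 bytes), no link-layer framing counted, and hypothetical payload sizes.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
UDP_HEADER = 8
RTP_HEADER = 12    # fixed RTP header, assuming no CSRC list or extensions
OVERHEAD = IPV4_HEADER + UDP_HEADER + RTP_HEADER


def datagram_size(payload_bytes):
    return payload_bytes + OVERHEAD


def datagram_saving(old_payload, new_payload):
    """Fraction by which the whole datagram shrinks when the payload shrinks."""
    before = datagram_size(old_payload)
    after = datagram_size(new_payload)
    return (before - after) / before


# Halving a hypothetical 10-byte voice payload:
print(datagram_saving(10, 5))     # the datagram only shrinks by a tenth
# Halving a hypothetical 160-byte payload helps far more:
print(datagram_saving(160, 80))
```

Under these assumptions a 50% cut in a small payload is mostly swallowed by the fixed headers, which is the effect described above.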

Secondly the “Variable Rate” issue does not go away. What does change is the bandwidth of the “side channel” through which secret information escapes the protection of the encryption. The side channel bandwidth decreases in inverse proportion to the size of the data packet aggregation. That is if your original data packet size allowed 300Hz of side channel bandwidth then putting sixteen together reduces the bandwidth to ~19Hz.
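
As a small worked example in Python (the function names and the sample sizes are hypothetical), aggregation means the observer sees one total per group of frames rather than each individual size, and the side channel bandwidth scales down by the group size:

```python
def side_channel_bandwidth(per_packet_hz, packets_per_datagram):
    # Inverse proportion: n packets per datagram gives 1/n of the bandwidth.
    return per_packet_hz / packets_per_datagram


def aggregate(sizes, group):
    # What the observer sees after aggregation: one summed size per group.
    return [sum(sizes[i:i + group]) for i in range(0, len(sizes), group)]


print(side_channel_bandwidth(300, 16))       # 18.75, i.e. the ~19Hz above

frame_sizes = [12, 30, 28, 14, 9, 31, 27, 10]   # hypothetical per-frame sizes
print(aggregate(frame_sizes, 4))                 # only two totals leak
```

The leak is coarser, not gone: the totals still vary with what was said.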

From these two points you can see that although data packet size and bandwidth do have a simple relationship, IP datagram size and bandwidth do not, which might give the advantage you are looking for at minimal or no cost (it is extremely implementation sensitive and therefore has to be done on a case by case basis).

Thirdly is latency. All other things being equal (which they are not) you would expect a simple relationship between data packet size and latency. As described above in point 1 there is a large overhead to get from the data packet size to the IP datagram size. It is the size and type of IP datagram that affects latency in uncongested networks. Due to the complex way networks interwork it is impossible to say how any one connection at any given time will be affected. But you might actually see latency go down with moderate buffering (below transmission window size) and burst transmission.

Fourthly and possibly most importantly to viable audio communications is transmission jitter time or variability in IP datagram delivery times. The usual solution to transmission jitter is to buffer packets to at least twice the average expected jitter time and optionally adjust it dynamically based on link conditions.

Finally is audio drop out caused by late or lost data. As for jitter the solution is buffering, but usually the buffer needs to be at least five or six times the jitter rate to determine a reliable way to deal with the drop out (fade to noise etc). Currently the techniques developed for GSM are favoured for medium constant rate encoding; I believe it is still an open question for low variable rate encoding.

You would expect that the lower the rate of encoding the worse the effects of both jitter and drop out and this has certainly been the case in the past with radio and circuit switched networks (where the drop out is mainly caused by impulse type noise and is modeled as such).

Packet switched networks have radically different problems and behaviours which are more likely to make things considerably worse.

Also there are the added complications of encryption. A single lost packet at the codec level might require additional resync data to be negotiated and sent at the encryption level. The protocols to do the resync in the best way for consumer grade drop out on a packet switched network are only just beginning to be investigated. It involves trade offs: Code Book would require no resync but is a security no no. Chaining would require the receiver to tell the transmitter to reset/sync the IV, which is a no no for audio quality. My bet would be on a modified version of Counter based chaining with a data count sent, but do you tie it to the codec data packet level or IP datagram level? Either way breaks clean implementation. Traditionally Xor type (pad/stream) encryption and plain text chaining has been favoured as it allows pre computation of the key stream and thus minimum latency in resource limited systems. However times have changed and resources are not the issue they once were.
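
A rough Python sketch of the counter idea, tied here to the datagram level. It is only an illustration under loud assumptions: the keystream is a toy SHA-256 construction standing in for a real cipher, the names `seal` and `open_packet` are hypothetical, there is no authentication, and a counter value must never be reused with the same key. The point it shows is that when each packet carries its own count, a lost packet needs no resync.

```python
import hashlib
import struct


def keystream(key, counter, length):
    # Toy per-packet keystream from key + packet counter + block index.
    # Stand-in for a real cipher in counter mode; not for real use.
    out = bytearray()
    block = 0
    while len(out) < length:
        out += hashlib.sha256(key + struct.pack(">QI", counter, block)).digest()
        block += 1
    return bytes(out[:length])


def seal(key, counter, payload):
    stream = keystream(key, counter, len(payload))
    body = bytes(p ^ k for p, k in zip(payload, stream))
    return struct.pack(">Q", counter) + body      # the count travels in clear


def open_packet(key, packet):
    (counter,) = struct.unpack(">Q", packet[:8])
    body = packet[8:]
    stream = keystream(key, counter, len(body))
    return counter, bytes(c ^ k for c, k in zip(body, stream))


key = b"k" * 32                                   # hypothetical shared key
sent = [seal(key, i, f"frame {i}".encode()) for i in range(4)]
received = [sent[0], sent[3]]                     # packets 1 and 2 were lost
print([open_packet(key, p) for p in received])    # both still decrypt
```

The keystream for a given count can also be computed before the audio arrives, which keeps the pre computation advantage mentioned above.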

And to make matters worse variable rate encoding usually magnifies any problems caused by low data rates. Further it is also usually the worst offender when it comes to dynamic buffer management.

All in all there are way too many other problems to sort out first before you need to start worrying about variable rate encoding or how to optimise it. That is unless you are a marketing droid with management authority and a must have feature list…

Arne June 24, 2008 3:32 AM

@Anatoly: Unlike some other posters, I noticed that you are not talking about timing attacks but attacks on the ciphertext. So I’ll address your claim directly by stating that, given OTP encryption, the ciphertext attacks you mentioned are provably futile. Any weakness you might be referring to is thus not an innate flaw of the proposed uncompressed transmission but a weakness of the cipher. I’m interested if you know about such flaws in “industrial strength” algorithms with fixed key length.

I do, however, agree that it’s better to compress the voice data and add random data to it to make the bit rate constant again. The encryption algorithm should protect against all attacks on the ciphertext that do not involve guessing the key, but it’s better not to rely on it.

By the way, it’s customary to talk about the (Shannon) entropy of a message as if it was an inherent property of the message when in fact the entropy of a message is supposed to be measured against a certain code. We just tend to forget about that because modern compression algorithms are so good at finding usable codes on-the-fly. So when a good encryption algorithm is used, guessing a simple code in which the message has an entropy significantly lower than its bit length should be at least as hard as guessing the key. Which is just a really complicated way of saying that the effort to compress the ciphertext should be insurmountable.

Jill July 11, 2008 12:01 PM

We need a new VOIP design that puts voice into same-size, user configurable datagrams and pads them.
