Identifying Speakers in Encrypted Voice Communication

I’ve already written how it is possible to detect words and phrases in encrypted VoIP calls. Turns out it’s possible to detect speakers as well:

Abstract: Most of the voice over IP (VoIP) traffic is encrypted prior to its transmission over the Internet. This makes the identity tracing of perpetrators during forensic investigations a challenging task since conventional speaker recognition techniques are limited to unencrypted speech communications. In this paper, we propose techniques for speaker identification and verification from encrypted VoIP conversations. Our experimental results show that the proposed techniques can correctly identify the actual speaker for 70-75% of the time among a group of 10 potential suspects. We also achieve more than 10 fold improvement over random guessing in identifying a perpetrator in a group of 20 potential suspects. An equal error rate of 17% in case of speaker verification on the CSLU speaker recognition corpus is achieved.

Posted on September 16, 2011 at 12:31 PM • 25 Comments


Editor At Large September 16, 2011 12:48 PM

Hello, I believe someone lost track of a bold tag, unless the entire blog needs bold for emphasis.

Joe Buck September 16, 2011 1:18 PM

I am assuming that a hyphen has been lost, and the meaning is that accuracy is between 70 and 75 percent.

Tomasz Wegrzanowski September 16, 2011 1:20 PM

Bold spilling all the way to comments suggests this blog is exploitable.

  1. Make post about security with Javascript exploit in it.
  2. Have Bruce Schneier copy and paste without checking
  3. Profit

Tomasz Wegrzanowski September 16, 2011 1:23 PM

Looking at source code, this looks like something tidy would do, for xhtml etc. It “fixes” misnested tags (<div><b></div>) by reclosing and reopening b at every level.

Browsers are smarter than that.

Security story here is pretty obvious.

Rob McDermid September 16, 2011 1:27 PM

The bolding problem is in the abstract of the post. The word “Abstract” is supposed to be in bold, but the closing /b tag was accidentally inserted as /a. That’s a valid html tag, and one allowed on this blog, just not correct in that context, but the checker would probably not be smart enough to catch it (like a spell-checker not being able to tell you typed “to” when you meant “too”).

-ac- September 16, 2011 1:33 PM

I would like to see this modified, to, for example:
20 suspects + a body of new samples never encountered by the system before.

The system would output the ID of the matched sample or unknown. Then grade the system on its correct use of “unknown”.

I’m guessing that this simple addition would drive down that success rate considerably.

Chris K. September 16, 2011 1:33 PM

Shit Joe Buck,
I wish I or the other 50 people who had read that (and three to post) would have been smart enough to figure that out.

Thankfully you came along and set us all straight.

(directly above should be read with as much sarcasm as you can muster, as I would imagine all comments with regards to 7075%.)

pegr September 16, 2011 1:54 PM

And when Bruce fixes it, all the bold will be gone and no one will have any idea what you’re talking about.

As to the topic, could you not prevent the analysis from revealing too much by sending noise (digital, not actual audio) during the silence? (Silence in both directions, btw; conversations tend to be half-duplex.)

Natanael L September 16, 2011 5:56 PM

The interesting stuff that the average reader should be able to understand is right here:

“Variable bit rate (VBR) encoding techniques, which result in variable length VoIP packets, have been introduced to preserve the network bandwidth. The encryption techniques currently in use in order to preserve privacy of the calling and called parties do not change the packet length (Baugher and McGrew, 2003). Hence any exploitation mechanism based on the packet-length information remains valid for the encrypted communication. In this paper, we propose speaker identification and verification techniques based on using the packet-length information without even knowing the contents of the encrypted VoIP conversations.”

And this follows: “We demonstrate that the packet-length information, being extracted from either the file headers (in case of multimedia container formats) or being physically monitored during a VoIP conversation, can be used to identify or verify the speaker. In particular, we use discrete hidden Markov models to model each speaker by the sequence of packet lengths produced from their conversation in a VoIP call.”
And then there’s some more stuff.

So turn off VBR and this thing won’t affect you.
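To make the quoted idea concrete, here is a minimal sketch of scoring a packet-length sequence against per-speaker discrete hidden Markov models with the forward algorithm. It is not the paper's implementation: the bin edges, the two-state models, and the speaker names are made-up toy values, and a real system would train the parameters (for example with Baum-Welch) on recorded calls.

```python
import bisect
import math


def quantize(lengths, edges):
    """Map raw packet lengths (bytes) to symbol indices using bin edges."""
    return [bisect.bisect_right(edges, n) for n in lengths]


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = None
    total = 0.0
    for symbol in obs:
        if alpha is None:
            # First observation: start probability times emission probability.
            alpha = [start[i] * emit[i][symbol] for i in range(n)]
        else:
            # Propagate through the transition matrix, then apply the emission.
            alpha = [
                sum(alpha[j] * trans[j][i] for j in range(n)) * emit[i][symbol]
                for i in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def identify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical toy values: three length bins and two 2-state speaker models,
# each given as (start, transition, emission) probabilities.
EDGES = [30, 50]
MODELS = {
    "alice": ([0.6, 0.4],
              [[0.7, 0.3], [0.4, 0.6]],
              [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]),
    "bob": ([0.5, 0.5],
            [[0.9, 0.1], [0.1, 0.9]],
            [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]),
}

if __name__ == "__main__":
    observed = quantize([20, 60, 20, 60], EDGES)
    print(identify(observed, MODELS))
```

The only input is the list of packet lengths, which is exactly what an eavesdropper on an encrypted VBR stream still gets to see.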

Peter Maxwell September 16, 2011 7:07 PM

@Natanael L at September 16, 2011 5:56 PM

“So turn off VBR and this thing won’t affect you.”

For that specific example, your suggestion will work. However, I suspect there is still another attack, albeit less reliable: they relied on using the change in ciphertext packet size as a proxy for changes in the speech that alter the VBR rate; a fixed-rate codec still uses compression so should also change packet size, hence also leaking information. Whether there is enough information to be able to distinguish between speakers is unknown.

Gweihir September 17, 2011 6:05 AM

Cover traffic is neither a new concept nor difficult to find out about.

This is not a surprise. It is clear how to fix it, or better, how it should have been done from the beginning. Use constant bit-rate, and inject encrypted zeros if needed, and these design errors will go away.
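A rough sketch of what "constant bit-rate plus injected zeros" could look like on the sending side: exactly one fixed-size payload leaves per time slot, and an all-zero fill payload goes out whenever no live frame is waiting. The function name, the slot model, and the frame size are assumptions made for illustration. Encryption is left out; the point is that every payload would pass through the same cipher afterwards, so an observer sees the same size and rate either way.

```python
from collections import deque

FRAME_SIZE = 160  # hypothetical fixed payload size in bytes


def constant_rate_schedule(arrivals, slots, frame_size=FRAME_SIZE):
    """Emit exactly one payload per slot: a live frame if queued, else fill.

    arrivals maps a slot number to the list of frames that become available
    at that slot. Frames still queued after the last slot are not sent.
    """
    fill = bytes(frame_size)  # all zeros; encrypted later like any other payload
    queue = deque()
    out = []
    for slot in range(slots):
        for frame in arrivals.get(slot, []):
            if len(frame) != frame_size:
                raise ValueError("live frames must already be fixed-size")
            queue.append(frame)
        out.append(queue.popleft() if queue else fill)
    return out


if __name__ == "__main__":
    live = {0: [b"a" * 4], 3: [b"b" * 4, b"c" * 4]}
    for payload in constant_rate_schedule(live, slots=6, frame_size=4):
        print(payload)
```

A burst of two frames in one slot is spread over two slots, so the rate on the wire never changes; the cost is added latency and wasted bandwidth during silence.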

Clive Robinson September 17, 2011 6:23 AM

@ Natanael L, Peter Maxwell,

“For that specific example, your suggestion will work. However, I suspect there is still another attack, albeit less reliable:…”

There are actually several other attacks and they all fall into the general class of timing attacks (which are one of the most devastating forms of attack).

To quote a bit of Bruce 😉

“I’ve already written how it is possible to detect..”
how information leaks by side channels caused by time differences when people make things “more efficient”…

So I won’t go long-windedly into all of that again; I’ll just point out three things,

1, ALL and I do mean ALL attempts at making a system more efficient will give rise to side channels (it’s an unavoidable law of nature, get over it; the energy has to go somewhere).

2, Some side channels have limited range others travel with the communication, time based side channels being one of the latter (cross modulation of phase and amplitude etc being others).

3, Thus unless you really know what you are doing, and even then are also extremely careful, any “efficient system” will “leak information” to your or others’ detriment.

There are reasons why EmSec / TEMPEST approved equipment is generally heavy, uses lots of power, and is, when you can buy it, very very expensive.

To put it in the general case you have a see-saw of,

Efficiency-V-Security

You can have one or the other but usually not both. Knowing how to get both can make you a much sought-after person. But as with “better mouse traps” it’s not going to make you wealthy, and might also make you short lived…

Gabriel September 17, 2011 7:17 AM

Seems like any encrypted channel needs to have:

  1. Constant bit rate, especially when the attacker already knows what type of data is being sent, as is the case with VoIP.
  2. Constant packet sizes, where the message is padded per a secure algorithm (is pseudo random data sufficient?).

As Clive has mentioned, this is trading efficiency for more security. Unfortunately, “Eve” still knows when you are talking and most likely who you are talking with. Masking that will require further reduction in efficiency such as tunneling and proxying. I’d hate to see how much capacity for VoIP calls a network would lose.
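For point 2, a minimal sketch of padding every frame to one fixed size before it is encrypted. The 2-byte length prefix and the function names are assumptions for illustration, not any protocol's actual wire format. Random filler is used here, but if the padded frame is then encrypted with a sound cipher, the filler's content should matter far less than the fact that every packet ends up the same length.

```python
import os
import struct

HEADER = struct.Struct(">H")  # 2-byte big-endian length prefix (assumed layout)


def pad_frame(frame: bytes, size: int) -> bytes:
    """Return exactly `size` bytes: length prefix, the frame, random filler."""
    room = size - HEADER.size
    if room < 0 or len(frame) > room or len(frame) > 0xFFFF:
        raise ValueError("frame does not fit in the fixed packet size")
    filler = os.urandom(room - len(frame))
    return HEADER.pack(len(frame)) + frame + filler


def unpad_frame(padded: bytes) -> bytes:
    """Recover the original frame from a padded packet."""
    if len(padded) < HEADER.size:
        raise ValueError("packet shorter than the length prefix")
    (length,) = HEADER.unpack(padded[:HEADER.size])
    if length > len(padded) - HEADER.size:
        raise ValueError("corrupt length prefix")
    return padded[HEADER.size:HEADER.size + length]


if __name__ == "__main__":
    for frame in (b"", b"hi", b"a longer voice frame"):
        packet = pad_frame(frame, 64)
        print(len(packet), unpad_frame(packet) == frame)
```

The fixed size has to cover the largest frame the codec can produce, which is where the capacity loss comes from.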

jake September 17, 2011 11:44 AM

this paper’s title is very misleading and it should be titled something more like “SRTP and other VoIP-specific encryption sucks, as if you didn’t already know”.

in section 3 of the paper they appear to restrict their attention to the case of a speaker using SRTP and make the following comment: “Hence the packet-length information remains unchanged after encryption and all exploitation techniques based on this information remain as valid after encryption as they are before encryption.” anybody who cares about real privacy will encrypt their VoIP link with IPSec, which uses padding in most cases. it is clear this paper does not address this model and, as such, their title is misleading since IPSec encrypted VoIP communications still fall under the title of their paper but are not vulnerable to the tricks demonstrated in the paper.

they also cite another paper to suggest that using VoIP over IPSec introduces “unacceptable delays on the real time traffic” which is obviously false. many people have been using this setup for years. citing a paper from 2002 about delays from vpns is a bit ridiculous.

Clive Robinson September 17, 2011 12:00 PM

@ Gabriel,

“Masking that will require further reduction in efficiency such as tunneling and proxying.”

We already have Tor that does some of it…

The correct solution is the way the military have done it for many years. You pick a fixed baud rate and stick with it on a point-to-point communications link.

All the adversary sees is an “up network” with various nodes and leaf points with “constant rate encrypted traffic”. They have no idea if it’s “fill traffic” or “live traffic”. The only dynamic aspects they see are circuits between nodes etc dropping and coming up very occasionally.

Some time ago I suggested Tor use QoS to establish a fixed point-to-point network and also carry both primary (interactive – VoIP, browsing) traffic, secondary traffic (non-interactive – email etc) and fill traffic to prevent network analysis. I further suggested that to make it a little more efficient they use bandwidth prediction, i.e. ramp bandwidth on links up and down during the day based on previous traffic flow, not on primary traffic.

Gabriel September 18, 2011 9:50 AM

@Clive: Of course, TOR and other anonymizer services will help greatly. But I can’t see any big service providers or enterprise IT using such services. Additionally, if they have a network and ISP providing them with enough capacity and QoS traffic shaping to support a large number of calls, such as 1000, I would have to imagine they would resist any optimizations that could easily cost them 50 – 75% of that capacity (a number I pulled out of thin air, so don’t quote me on this). So, I can’t see true privacy and anonymity unless you do it yourself or a special niche service provides it.

Clive Robinson September 18, 2011 2:01 PM

@ Gabriel,

“I would have to imagine they would resist any optimizations that could easily cost them… …So, I can’t see true privacy and anonymity unless…”

Yup and that’s why the NSA et al will always be in business…

And why I say “Efficiency-V-Security” is in the general case a game of “take your choice”.

As we all know “walnut corridor” will always vote for the “efficiency of cost minimisation”, it’s one of the mantras they follow as a “business driver” no matter how idiotic it might seem.

So business comms and data storage will in effect always be insecure unless there is some other “business driver” to provide counterbalance.

But can you find one extreme enough to work?

Let us say for argument’s sake a new law was brought in that said “All the executive and non-executive directors of an organisation and its parent organisations will, on the loss of any PII by the organisation, be subject to ‘hanging, drawing and quartering’ and their families sold into slavery”. What would you expect to happen?

Well history tells us where such punishments are in place people will still take a chance on not being caught for various reasons.

So you would still find “chancers” in “walnut corridor” who would take a gamble to get ahead by improving “shareholder value”.

Further, a number of (small scale) studies have shown that the majority of executives (they tested) scored high on the scale of psycho/sociopaths, in some cases more so than some of those on death row or the “criminally insane” locked up for good…

Based on the above do you really think things are going to change in favour of Security?

If so, go and have a look at PCI, SarbOx etc; all the new “business drivers” do is move the goal posts, security still gets lip-service at best.

Heidi Fox September 19, 2011 10:09 AM

And here I thought that the point was that doing speaker id/verification just from packet length was really cool. From all the work I’ve seen going into solving these problems using actual speech information, it blows my mind that it can be done this well using just packet length.

Jonadab September 20, 2011 7:03 AM

“[/a is] a valid html tag, and one allowed on this blog, just not correct in that context, but the checker would probably not be smart enough to catch it”

The checker certainly ought to be able to catch the closing of an element that’s not open. Any checker that doesn’t catch that isn’t worth having.

If all markup were required to be wellformed (like XML and XHTML), the checker could also very easily catch the failure to close an element that was opened (in this case, /b). I used to hold out hope that the whole web would eventually move to all wellformed markup all the time when HTML4 finally dies out, but it looks like all the attention now is focused on HTML5, which, unfortunately, does not require wellformedness.

Of course, in a context where one entity controls all the software involved and can place whatever limits it chooses on the markup, you can just require all markup to be wellformed and then use a checker that rejects anything that’s not, a practice I heartily recommend, as it conclusively solves an entire class of related problems in one fell swoop.
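As a sketch of how little it takes, here is a stack-based checker built on Python's standard `html.parser`. It flags a close tag that does not match the innermost open element (the stray /a above) and any element still open at the end (the unclosed b). The class and function names are made up for this example, and the strictness is XHTML-style by assumption: a bare `<br>` would be reported as unclosed, while `<br/>` passes.

```python
from html.parser import HTMLParser


class WellFormedChecker(HTMLParser):
    """Track open elements on a stack and record mismatched close tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # Closing something that is not the innermost open element.
            self.errors.append(f"unexpected </{tag}>")


def check(fragment):
    """Return a list of well-formedness problems; empty means acceptable."""
    checker = WellFormedChecker()
    checker.feed(fragment)
    checker.close()
    problems = list(checker.errors)
    problems.extend(f"unclosed <{tag}>" for tag in checker.stack)
    return problems


if __name__ == "__main__":
    print(check("<b>Abstract:</a> Most of the VoIP traffic is encrypted"))
    print(check("<b>Abstract:</b> Most of the VoIP traffic is encrypted"))
```

A comment form could simply refuse to publish anything for which `check` returns a non-empty list.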

Dirk Praet September 23, 2011 5:46 AM

The simple work-around for sensitive communications is not to speak yourself, but have some software do that for you while being behind a keyboard yourself. And pitch the voice to sound like Sarah Palin.
