Voice Prints

Seems that it’s hard:

“There is no such thing as a voice print,” he said. “It’s a very very dangerous term. There is no single feature of a voice that is indelible that works like a fingerprint does.”

Many different factors influence how people speak at any particular time and place.

“If you’re tired or if you have a cold or if you’re speaking on a phone against traffic in the background you do all sorts of things to the voice, which make it phonetically very different from time to time,” said Foulkes, who also works as a freelance consultant for a private forensic speech science laboratory.

“The features of speech and language are such that you can’t use them as a marker of identity to identify one person and exclude all other people under normal circumstances. People’s voices overlap.”

Posted on December 23, 2008 at 7:25 AM • 41 Comments

Comments

Pete Austin December 23, 2008 7:52 AM

Recognizing voices is hard, as Sarah Palin’s team learned.
http://news.bbc.co.uk/1/hi/world/americas/us_elections_2008/7704666.stm

Also “for example Queen Elizabeth II, who was fooled by Canadian DJ Pierre Brassard posing as Canadian Prime Minister Jean Chrétien, asking her to record a speech in support of Canadian unity ahead of the 1995 Quebec referendum. Two other particularly famous examples of prank calls were made by the Miami-based radio station Radio El Zol. In one, they telephoned Venezuelan president Hugo Chávez and spoke to him, pretending to be Cuban dictator Fidel Castro. They later reversed the prank, calling Castro and pretending to be Chávez. Castro began swearing at the pranksters live on air after they revealed themselves”
http://en.wikipedia.org/wiki/Prank_call

Ron December 23, 2008 8:30 AM

After a Palestinian who was believed to have organized many suicide bombings in Israel was killed by a booby-trapped cell phone, Israeli news organizations claimed that it was critical for the identification of the victim’s voice that the phone had been on a CDMA network rather than a GSM network.

Of course, the owners of the CDMA network might have somehow paid off the journalists…

Daedala December 23, 2008 8:33 AM

It’s the same as a lot of biometrics — good enough for authentication, but lousy for identification.
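The gap between authentication and identification can be put in rough numbers. The sketch below is only an illustration: the false match rate is a made-up figure, and it assumes every comparison is independent, which real voices (as the article says, they overlap) will not strictly satisfy.

```python
def chance_of_false_match(false_match_rate, enrolled):
    """Probability that a stranger wrongly matches at least one of
    `enrolled` stored voices, assuming independent comparisons."""
    return 1.0 - (1.0 - false_match_rate) ** enrolled


# Hypothetical figure: 1 comparison in 100 wrongly reports a match.
FALSE_MATCH_RATE = 0.01

# Authentication: the caller claims an identity, so there is one comparison.
authentication = chance_of_false_match(FALSE_MATCH_RATE, 1)

# Identification: the voice is searched against everyone on file.
identification_100 = chance_of_false_match(FALSE_MATCH_RATE, 100)
identification_10k = chance_of_false_match(FALSE_MATCH_RATE, 10_000)

print(f"1 to 1 check:       {authentication:.2%}")
print(f"search of 100:      {identification_100:.2%}")
print(f"search of 10,000:   {identification_10k:.2%}")
```

With these assumed numbers, an error rate that is tolerable for a single claimed identity makes a false hit more likely than not once the search pool reaches a hundred people.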

Mårten December 23, 2008 8:54 AM

Here is a linguist’s take on “voice printing” from a blog at Stockholm University: http://ling-map.ling.su.se/blog/blog_ak.php?q=455

Here’s an excerpt:
“The technology by Nemesysco (see the previous post), is in use in 25 British councils to reveal attempts at benefit fraud. Now Gordon Brown wants to put the system, marketed in the UK by DigiLog, in general use. The first council to try the system, last year, was Harrow in north-west London. They saved £300,000 in the first three months, according to The Guardian. (…) The deeply scary thing here is that this technology cannot tell truth from lie, as shown by Eriksson & Lacerda in their paper “Charlatanry in forensic speech science: A problem to be taken seriously” – the results are random.”

bubble boy December 23, 2008 9:31 AM

@ Ilya Levin
Agreed.
Also, Identification is the first step in the three part process of Identification, Authentication and Authorization. You cannot have authentication without identification.

David December 23, 2008 9:31 AM

True, no such thing as a voice print, mimicry is part of learning to speak. However it is pretty difficult to convincingly disguise a voice in a way that doesn’t reveal cues about age, height, gender, regional dialect, etc. And you’ll hear about dialectologists placing people’s accents within 50 miles of a particular city, which is a fairly simple skill to learn with plenty of travel and several years’ study.

A nonny bunny December 23, 2008 11:19 AM

If it is used for authentication, it should be limited to cases where the pool of people is already limited by other means.
For example, if you call someone, then you can be fairly sure you won’t be connected to an imposter, so a voice ID might be sufficient, in the same sense that a signature over fax might be sufficient.

Clive Robinson December 23, 2008 11:26 AM

There are all sorts of problems with “voice traces” etc. and most do not meet the basic requirements of “beyond reasonable doubt”.

We should realise this from the number of mimics we hear on radio and television.

What we humans actually pick up on is context and chosen phraseology as well as common events.

With regard to the Voice Stress analysis machines, it is pure hokum. And like “black magic” it requires the person against whom it is being used to believe it works, then human nature makes it work (in the same way an experienced cop can make a suspect believe there is a witness or evidence against the suspect).

An example given on a previous blog posting was the police that used a photocopier to get confessions by pretending it was a lie detector…

Unfortunately the use by Harrow Council in the UK is a prime example of misuse against those who are least able to defend themselves.

And the reason Gordon Brown (UK Prime Minister and socialist) wants them used is that due to his idiotic policies the country is in effect bankrupt and anything that potentially saves money for the UK Treasury is going to be on his Xmas list…

D December 23, 2008 11:50 AM

I’ve been with my wife for 16 years. I still have trouble telling her voice from her sister’s. Even my wife, when listening to a voice mail she left, said that her voice is identical to her sister’s. And they are 8 years apart in age.

Nick Lancaster December 23, 2008 12:20 PM

D:

In a similar vein, I am frequently mistaken for another co-worker on the phone or intercom. The funny thing is, neither of us thinks we sound like each other.

Nick

CB December 23, 2008 12:35 PM

@Nick: That’s probably because you’re used to hearing your “inside” voice. What you sound like to yourself is quite a bit different from what other people hear. You might try tape recording a few stock phrases and having your co-worker do the same – you may find you’re more similar than you think.

SumDumGuy December 23, 2008 12:41 PM

@Nick

Try listening to recordings of yourselves. Your voice inside your head never sounds the same as it does to other people.

JRR December 23, 2008 8:20 PM

There have been nearly completely useless gimmick voice print security programs marketed in the past. Once several years ago, a friend had a Sony laptop with a webcam and some “security” lock software that claimed to recognize your face AND your spoken pass phrase before unlocking the computer.

At a party I was at, he set his face and a spoken passphrase into the machine (he was from the UK), and a few minutes later, another person (an American) got it to unlock speaking in a corny English accent, with a SOCK PUPPET in front of the camera.

Admittedly, I’m sure there are better programs, but it goes to show what utterly useless crap is sometimes marketed as secure.

greg December 25, 2008 7:49 AM

People have said that (voice) mimics are so good and use that as evidence that perhaps voice id is not so good.

So what about the look a likes? We had a Clinton look a like competition at the pub. I could not tell the difference between most of the 8 entries. This is in a city of 60 000.

I propose that neither will be very good on a “large” scale due to the overlap…. (the data on face id suggests that they sux)

greg December 25, 2008 7:51 AM

In addition to that there is evidence that iris scans can id people on a global scale. Perhaps Minority Report got it right, from both sides….

Clive Robinson December 25, 2008 1:53 PM

@ greg,

First off I hope your solstice celebrations are bringing you the traditional joys and comforts of the season.

With regard to,

“We had a Clinton look a like competition at the pub. I could not tell the difference between most of the 8 entries. This is in a city of 60 000.”

Your comment tends to corroborate what is effectively a dirty little secret most police forces and Courts do not want you to know…

One of their staples is the “Identity parade” where a group of supposedly random persons who are sufficiently similar to the suspect are lined up with them for a victim or witness to identify.

Well a series of separate double blind tests have shown that at best only 25% of people are able to vaguely identify the suspect’s height, let alone facial characteristics, within 24 hours, and this degrades to less than 5% after 4 days. Therefore it appears that independent testing has shown that reliable suspect identification in a line-up is really not possible.

Further tests tend to suggest that, contrary to the popular notion of an attacker’s face being burned into the memory of the victim or witnesses, almost the exact opposite happens, and the more stress a victim or witness is put under at the time the less reliable the result.

Which begs the question,

“If independent tests show ID parades to be less than 25% reliable at any time why on earth are the Police and Courts still using them?”

If any readers find out the real reason they are still in use I would be interested to know.

Joel Norvell December 25, 2008 5:43 PM

The shape of the vocal tract determines the spectrum of its formant frequencies. This certainly can be used in voice identification. It’s still a probabilistic model; but there is this underlying science.
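As a rough illustration of that underlying science, the textbook simplification treats the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / 4L. The sketch below uses that simplification only; the tract lengths and the speed of sound are assumed round figures, and a real vocal tract is neither uniform nor fixed in shape.

```python
SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air


def tube_formants(tract_length_m, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end:
    F_n = (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_S / (4 * tract_length_m)
        for n in range(1, count + 1)
    ]


# Hypothetical tract lengths, chosen only to show the trend.
longer_tract = tube_formants(0.175)
shorter_tract = tube_formants(0.145)

print("17.5 cm tract:", [round(f) for f in longer_tract])
print("14.5 cm tract:", [round(f) for f in shorter_tract])
```

A shorter tract pushes every resonance upward, which is why formants carry some information about the speaker. It is also why the model stays probabilistic: many people share similar tract lengths, and each speaker reshapes the tube with every vowel.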

tomj December 27, 2008 8:30 PM

One inconvenient fact for those who wish to make a living off of replacing human based decision-making has to do with the quality of human perception. The first requirement for quality decision-making is quality data. Machines simply do not have the ability to capture high quality sounds or images. The human ear is orders of magnitude more sensitive to sounds than even the best equipment and they can dynamically adjust in order to filter out background noise or useless information. When humans try to identify a person by voice they also apply a changing set of filters as they test out possible speakers.

I think the starting point is to assume that humans can collect better data and they have better tools for analyzing this data than machines. Yet humans can make mistakes. The first cause of such mistakes is that the sounds are produced by a machine. They either are recorded and then replayed, or are captured and transmitted to another device which reproduces the sound. How much information is lost when machines are involved? Hard to say, but humans are still better at identifying a person from machine captured and replayed sounds than a machine.

But the huge deficiencies in machine mediated sound transmission (phones, etc.) probably fully explain why it is so easy to prank someone via phone. BTW, it is much harder to prank someone you know and who knows you. They have to somehow disguise their voice, but it never works out as well as pranking a stranger and pretending to be someone they don’t know that well. It is a simple case of lack of a baseline sample.

But my guess is that Sarah Palin could have detected her prank if she actually spoke French (disregarding the extra content that she would have understood) because she would have collected better data every time she heard Sarkozy speak.

Another difference between a voice print and a fingerprint is the comprehensive nature of a fingerprint. The impression actually contains a large amount of data, but this data set is reduced to a small number of interesting points. The small number of points isn’t as important as the large number of possible points, just like a crypto keyspace in relation to a particular key.

Another way to look at the problem is that sound frequency is a linear (one dimensional) measurement. A fingerprint is two dimensions. And it is more common to vary your voice than your fingerprint.

Anonymous December 28, 2008 4:40 AM

@ tomj,

“Another way to look at the problem is that sound frequency is a linear (one dimensional) measurement. A fingerprint is two dimensions.”

Err no, sound has a three dimensional aspect as far as humans and instrumentation are concerned.

There is the frequency dimension, against the “amplitude” (energy) dimension, giving a traditional spectral display for a given point in time. For humans these two dimensions alone are fairly useless, even in music.

The two dimensional spectral energy is set in the dimension of time, giving the most important aspect as far as humans are concerned, which is “relative change”.
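The three dimensions described here (frequency, energy, time) are what a spectrogram displays. A minimal sketch, using a deliberately naive DFT from the standard library rather than a real signal-processing package; the sample rate, frame size and test tones are all arbitrary assumptions chosen so the numbers come out cleanly.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME = 64          # samples per analysis frame (assumed)


def tone(freq_hz, n_samples):
    """A pure sine tone, standing in for recorded speech."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


def dft_magnitudes(frame):
    """Energy per frequency bin for one frame (naive DFT, lower half only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def spectrogram(signal):
    """Rows are time frames, columns are frequency bins, values are energy."""
    starts = range(0, len(signal) - FRAME + 1, FRAME)
    return [dft_magnitudes(signal[s:s + FRAME]) for s in starts]


# Test signal whose pitch steps up halfway through: 500 Hz, then 1000 Hz.
signal = tone(500, 4 * FRAME) + tone(1000, 4 * FRAME)
spec = spectrogram(signal)

bin_hz = SAMPLE_RATE / FRAME  # width of one frequency bin
peaks_hz = [max(range(len(row)), key=row.__getitem__) * bin_hz for row in spec]
print(peaks_hz)
```

Any single row is the static spectral display; only by reading down the rows does the step in pitch, the “relative change”, become visible.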

Most humans are very far from pitch perfect and find it almost impossible, without considerable practice, to tune an instrument without a reference (hence the need for pitch pipes and tuning forks).

But most can easily tell you if a note is higher or lower in pitch than the previous note (but not by how much unless it is harmonically related). So the speaker’s frequency baseline is of little relevance unless it is a long way off.

This is because most humans vary the tone of what they say by small amounts with their current environment (air pressure, temperature, moisture content, etc.), as well as by how they feel. Therefore as humans we usually reject this information as being of little use, which is why we can pitch shift even the worst singers in boy and girl bands to put them in tune with each other and still have the record-buying public recognise their voices…

There is a further series of dimensions we use such as “accent” and others such as “localisms” (innit, mate, dude, etc.). Then when the speaker is known there are the phraseology and grammatical dimensions, getting around to the dimensions of context and shared experience.

It is these latter dimensions on which we really rely as humans.

We don’t currently have instrumentation even remotely close to being able to do it in real time (however I’d give it less than five years before we’re close enough for most practical purposes).

Anonymous December 28, 2008 10:58 PM

“If independent tests show ID parades to be less than 25% reliable at any time why on earth are the Police and Courts still using them?”
desperation? to maintain public “reputation”?

Grey Bird December 29, 2008 12:47 AM

Another example of how useless “voice prints” are is how accurately an African Grey parrot can mimic a person’s voice. While many parrots talk, Greys are particularly good at sounding just like someone. So much so, that they can fool a long-time spouse. I know someone who held a short conversation, thinking that it was with their spouse in the next room, only to notice that person out the window. I haven’t seen any studies analysing whether the sound spectrum is similar between a person and a bird, but it wouldn’t surprise me if they were extremely close.

paul December 29, 2008 11:27 AM

Why do police and courts still use identity parades? Because they can get convictions. Stimulus, response.

Anonymous December 31, 2008 4:00 AM

@tomj

“The human ear is orders of magnitude more sensitive to sounds than even the best equipment ”

I’m afraid this is incorrect. Microphones nowadays have both better dynamic range and more sensitivity. The highly directional ones also get rid of a lot of noise.

Our brains are good at getting rid of noise with a lot of people talking in a room. But that’s not the ear, and a directional mic does just as well, as the mic on a cell phone can prove.

The real problem is that we are in fact not so good at identifying with our ears or eyes. It is just that we trust them, no matter how misplaced that trust is.

Roger January 2, 2009 11:01 PM

@Clive Robinson:
“If independent tests show ID parades to be less than 25% reliable at any time why on earth are the Police and Courts still using them?”

<underwear flameproof=”on”/>
Because there aren’t enough CCTV cameras yet!

Clive Robinson January 3, 2009 3:45 AM

@ Roger,

Ughh… you may be right.

The logical conclusion is that we should all carry a CCTV camera working on us at all times, AND as humans are forgetful souls at the best of times it will be surgically implanted in the forehead above our own eyes…

“Welcome to the world of tomorrow”

Just one thought though where are the “CCTV tapes” going to be inserted…

Roger January 3, 2009 9:45 PM

@ Clive Robinson:

“The logical conclusion is that we should all carry a CCTV camera working on us at all times,”

Curiously, we are probably not too far from that. The prevalence of cell phones with built-in video cameras has already made recording of crime far more common than it once was.

And Taser International is advertising a product for police officers, that records low-res timestamped video at all times (switches automatically between colour or IR), but on hitting the “event” button (which also activates the officer’s Taser and/or sends an alarm to base, if those features are enabled) it goes to high-res video, and starts recording audio too.

If someone customarily carries a cell phone in one of those shoulder pouches, so that the camera lens faces more-or-less forward, it would only take a Java MIDlet to enable similar features in a cell phone. Due to memory constraints, the “always on” video might need to be recorded to a rotating buffer, at a low frame rate e.g. 3 fps. You would probably also want to have periodic uploading of the captured video via MMS, in case the phone is stolen or destroyed.
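The “rotating buffer” idea can be sketched in a few lines. This is not MIDlet code and makes no claim about any phone API: the class name, frame objects and retention window below are hypothetical, and the only point is that a fixed-size buffer silently drops the oldest frame whenever a new one arrives.

```python
from collections import deque


class RollingRecorder:
    """Keeps only the most recent frames; older ones are overwritten."""

    def __init__(self, fps, keep_seconds):
        # deque with maxlen discards from the left once it is full.
        self.frames = deque(maxlen=fps * keep_seconds)

    def capture(self, frame):
        self.frames.append(frame)

    def snapshot(self):
        """Copy of what is currently held, oldest first (e.g. for upload)."""
        return list(self.frames)


# Hypothetical settings: 3 fps as in the comment, 10 seconds retained.
recorder = RollingRecorder(fps=3, keep_seconds=10)
for i in range(100):
    recorder.capture(f"frame-{i}")

saved = recorder.snapshot()
print(len(saved), saved[0], saved[-1])
```

An “event” button would then amount to taking a snapshot of the buffer before switching to the higher quality mode, so that the moments leading up to the event are preserved.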

“AND as humans are forgetful souls at the best of times it will be surgically implanted in the forehead above our own eyes…”

Hmm. Let’s call that an “optional extra” !!

“Just one thought though where are the “CCTV tapes” going to be inserted…”

I realise you’re joking, but, I recently saw a Secure Digital card — size of a postage stamp, compatible with many cell phones — with a 32 GB capacity. With full colour, 320 x 240 images JPEG compressed, at a frame rate of 3 fps, that’s good for about a week of recording before it needs to loop. When it kicks into 720 x 576, 25 fps MPEG-2 for an “event”, it’s still good for 5 hours.
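Those figures can be sanity-checked by working backwards from the card size. The sketch assumes 32 GB means 32 × 10^9 bytes, that the whole card is available for video, and that frame sizes and bitrates are constant averages; none of that comes from the comment itself.

```python
CARD_BYTES = 32 * 10**9   # assumption: decimal gigabytes, all usable for video

SECONDS_PER_WEEK = 7 * 24 * 3600
LOW_RES_FPS = 3           # from the comment: 320 x 240 JPEG at 3 fps
EVENT_HOURS = 5           # from the comment: 720 x 576 MPEG-2 for 5 hours

# Average JPEG size at which one week of 3 fps frames exactly fills the card.
frames_per_week = LOW_RES_FPS * SECONDS_PER_WEEK
bytes_per_frame = CARD_BYTES / frames_per_week

# Average MPEG-2 bitrate at which 5 hours of video exactly fills the card.
event_bits_per_second = CARD_BYTES * 8 / (EVENT_HOURS * 3600)

print(f"{frames_per_week:,} frames per week, "
      f"about {bytes_per_frame / 1000:.1f} kB per frame")
print(f"event mode budget: about {event_bits_per_second / 1e6:.1f} Mbit/s")
```

Under these assumptions the week of low-res recording allows a little under 18 kB per frame, and the five hours of event video allows roughly 14 Mbit/s, a generous budget for standard-definition MPEG-2. Both claims look at least self-consistent.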

Clive Robinson January 4, 2009 12:47 AM

@ Roger,

“I realise you’re joking, but,”

Yes the bit about the tapes was to add a little levity at the end.

However the rest no.

I have seen some medically related devices that are about the size of a large “capsule” tablet that can easily be swallowed. Each contains a flash light, camera, memory, transmitter and power supply, and is tough enough to safely survive a full dietary transit whilst transmitting out pictures. And the power requirements are such that it’s good for up to seven days.

So with a little modification it could quite easily become an implant.

Power would be the main issue; however, back in the early days of mechanical hearts, there were systems being developed that could use various aspects of the human body’s own thermal/chemical power.

For various reasons at the time they were not considered as viable as the battery technology then available (used for pacemakers), which in turn was not felt to be up to the power requirements, so external power units via inductive loop coupling were seen as the way to go.

However a quarter of a century later battery technology has changed substantially and pacemaker technology now routinely contains signal processing capable of diagnosing various heart conditions and responding fully automatically. And importantly pacemakers are now implanted as routine surgery in some parts of the world.

So I suspect that the technology is at a point where viable mass produced CCTV implants could be made. And the advances in surgery are such that they could be relatively cheaply implanted. But further the body’s responses to stress etc. could be relatively easily determined, so triggering of hi-res mode etc. could be fully automatic…

As you note people already do use mobile phone technology to “life blog” and there are HiTec toys out there that are quite cheap that detect movement etc. to decide when a picture should be taken and are worn in lightweight chest harnesses.

Now such cameras are effectively monitored in use by the person wearing them, and you may have seen wearable sports equipment that monitors cardiovascular and respiratory performance, again at consumer level pricing.

So putting the two together is not going to be very difficult at all: add in an electronic compass, inclinometer and bluetooth to send the data to the wearer’s GPS enabled mobile. Then each picture sent contains all the data required to be dropped into a database without any further intervention.

And there you have the “non-implant” version ready 2go2day (R) at consumer level pricing without requiring further monitoring etc. And importantly all the backend infrastructure is already in place as the mobile phone network and Internet services…

Now in the UK we (supposedly) have twenty times the number of CCTV cameras per head of population of any other country (let’s hear it for Tony Blair and New Labour making us a world leader 😉)

And various (quite suspect) figures show that each CCTV camera costs 20,000 GBP to install, which is very far from consumer level pricing.

And as Bruce has recently posted most of these high cost cameras go unmonitored due to the monitoring costs etc. So they are effectively a waste of resources…

Therefore arguably the business case is there as well and all it needs is a “for the children” type political campaign to get the ball rolling.

As I indicated the logic behind it is all in place as is the technology and arguably the infrastructure and funding.

The question is what do we the citizens want (as interpreted by our elected representatives)…

Oh and expect the gist of our comments to appear in an OpEd near you some time soon 8)

rip January 4, 2009 10:49 AM

resonating cavity changer, You can put creamy peanut butter on a slice of moist wheat bread, and stick the peanut butter side to the roof of your mouth, then pinch up forms in the bread to change the resonating cavity that is your mouth. Now if you want a jersey accent, just pinch the corners of your mouth while you speak.

Joel Odom January 6, 2009 2:17 PM

Voices may not be unique, and they may change, but I as a human can use a voice to recognize someone with a reasonable degree of certainty. I assume a computer could be programmed to do this to some extent. It could be usefully combined with other metrics.
