Forging Voice

Lyrebird is a system that can accurately reproduce someone's voice, given a large number of sample inputs. It's pretty good (listen to the demo here) and will only get better over time.

The applications for recorded-voice forgeries are obvious, but I think the larger security risk will be real-time forgery. Imagine the social engineering implications of an attacker on the telephone being able to impersonate someone the victim knows.

I don’t think we’re ready for this. We use people’s voices to authenticate them all the time, in all sorts of different ways.

EDITED TO ADD (5/11): This is from 2003 on the topic.

Posted on May 4, 2017 at 10:31 AM • 40 Comments

Comments

George H.H. Mitchell May 4, 2017 11:05 AM

Not sure that there’s much that humans as a species could possibly do to be ready for this.

CallMeLateForSupper May 4, 2017 11:23 AM

What social good does a persuasive voice-faking system/device provide? I haven’t come up with any … or even one.

I can imagine a benign use: a Sound Frame. It's a natural companion to the very useful (and very reasonably priced (not)) digital picture frame. Load it up with Mum's letters and let 'er rip! A solution looking for a problem, just like everything IoT.

r May 4, 2017 11:27 AM

@CallMeLate,

Under US law, you now own that idea under copyright; you may want to investigate a patent quickly.

Emphasis QUICKLY

Clive Robinson May 4, 2017 11:37 AM

@ Bruce,

I don’t think we’re ready for this. We use people’s voices to authenticate them all the time, in all sorts of different ways.

As a species "we are never ready" for what comes along; we learn to adapt through experience. It's probably our strongest survival skill.

In the army they used to teach those on the technical and leadership sides "Stop Assess Think Act" (SATA) for situations outside of standard training. In civvy street you tend to hear people referring to "sleepwalking into a situation" or "fools rush in" for people who don't practice SATA.

Spoken conversation works at many levels; you often get told that 80% of communication is non-verbal. The thing is, the same applies at the end of an audio link such as a phone or two-way radio. It's not just the voice being recognised; it's the subject matter, the words and phrases, and the intonation, as well as the thinking of "does this make sense".

Thus if "photoshopping" a voice does become widely understood, people will adapt and use checks of various forms.

If you think back to when fax machines first came into use, people did not immediately treat them as different from posted documents. However, after one or two frauds against banks, people wised up. We are currently going through a phase of virtual companies using electronic messaging to submit fraudulent invoices, but people are wising up, flagging things that don't look right, and using other channels to verify the messages.

Thus, after an initial problematic period, I suspect people will adapt, as they almost always do.

Who? May 4, 2017 11:51 AM

@ Clive Robinson, Moderator

Clive, as usual you are right. Rufo is the CEO of the company behind that "security device." It is good to see another security device on the market, but it would be better to see a strong foundation for the claimed security.

It is worrying that a huge number of security devices developed in the last two years are targeted at the profitable "business of fear" but lack the serious analysis needed to answer the key question: why are these devices better than current ones?

Clive Robinson May 4, 2017 11:51 AM

@ CallMeLate…,

What *social* good does a persuasive voice-faking system/device provide?

Well, the film industry has been doing it for years; after all, how many actors can actually sing in a pleasant way? That's why they "voice over".

You might have seen an advert for a well-known bar of chocolate that uses Audrey Hepburn's face synced up onto the image of a body double. Some actors are now selling "future rights" to their image using such technology, so doing the voice as well is a logical step.

However, this does bring up a thought: could the technology be used to make a tone-deaf person sing like a lark without all that serious studio time? If it can, expect it on an X-Factor boy/girl band near you any time soon 😉

Yousef Syed May 4, 2017 12:08 PM

Hmmm… So I take voice samples from the President, the Prime Minister, and anyone else who may launch a missile strike, and get going…

albert May 4, 2017 12:46 PM

Lyrebird won’t display on my browser, so F’M.

We are reaching the point where this sort of thing needs to be eliminated, by making it illegal and bringing the hammer down on these !#@$%&^* companies. Yes, this will make voice-recognition systems useless, but so what? It was never a good idea.

@Rufo,
I looked at http://www.trustless.ai. I see a lot of bun. Where’s the beef?
BTW, are you related to Giovannino Guareschi, of "Don Camillo" fame?


. .. . .. — ….

Mike McManus May 4, 2017 12:53 PM

@Clive Robinson

That technology already exists in a somewhat primitive form; it’s called autotune. If this can get rid of the annoying artifacts, it would be a goldmine.

Slime Mold with Mustard May 4, 2017 2:04 PM

If this can reproduce tonal qualities and rhythms, it might fool us over the phone, in a brief conversation, as someone we are not very familiar with. To go beyond that, it would need to be combined with author-attribution software at minimum. Fiction authors and speech writers are familiar with a different concept of "voice".
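A toy illustration of that author-attribution idea (not any real stylometry product; the word list, sample texts, and distance measure are all illustrative assumptions) would be to compare function-word frequencies in a suspect transcript against a known sample of the person's speech:

```python
# Toy stylometry check: compare function-word frequencies in a suspect
# transcript against a known sample of someone's speech. The word list,
# texts, and any accept/reject cut-off are illustrative assumptions.
from collections import Counter
import math

FUNCTION_WORDS = ["the", "and", "of", "to", "i", "you", "that", "it", "but"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two function-word profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

known = profile("well i think that you and i should talk about it "
                "but the point of the plan is in the timing")
suspect = profile("the objective of the operation is to proceed "
                  "and you must not fail")
print(f"profile distance: {distance(known, suspect):.3f}")  # larger = less alike
```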

Guest May 4, 2017 3:18 PM

Financial Post had an article about children's toys with speakers and microphones built in; it mentions past incidents of baby monitors being hacked (where obscenities were screamed at babies). Imagine a hacker who listens to mommy's voice, waits for mommy to leave the house for work all day long, and then has the toy given to her by mommy begin talking in mommy's voice.

You trusted your mommy when you were just a kid, didn’t you?

Frank Wilhoit May 4, 2017 4:03 PM

“…I don’t think we’re ready for this….”

We’re not ready for the steam engine, because we have not yet realized that it made hash of the moral imperative to work.

We’re not ready for radio, because we have not yet realized that it made it impossible to control national borders.

So, pretty much by definition, we’re not ready for anything that has been invented in the past 100 years.

JG4 May 4, 2017 4:23 PM

https://www.schneier.com/blog/archives/2017/04/friday_squid_bl_575.html#c6751481

The surveillance constitutes an illegal and unconstitutional theft of intellectual property, and the resulting power is absolute [both the political and police aspects]. It will be used for the most stunning identity thefts ever perpetrated: the ability to perfectly reproduce your voice, face, facial expressions, and writing. Implicitly, that includes the power to disrupt any human network on your planet [that depends on other than face-to-face contact or other unbreachable authentication]. It also includes the power to take over any business, particularly via theft of intellectual property, which includes supply chain and customer data.

This is fascinating:
http://www.schneier.com/blog/archives/2009/09/matthew_weigman.html

It turns out that in most people who are born blind, the visual cortex is devoted to processing sound. They cannot hear any better than anyone else (that is set by the physical limits of the ear), but they can extract vastly more from what they do hear, like where the walls are in a room. In the case of Mr. Weigman, that included the ability to exactly replicate anyone's voice. We should be surprised if a machine cannot do better at some point soon.

Lawrence D’Oliveiro May 4, 2017 5:47 PM

In the future, everybody will sound like Doctor Bot from Space Station 76.

Clive Robinson May 4, 2017 6:01 PM

@ Lawrence D’Oliveiro,

In the future, everybody will sound like Doctor Bot from Space Station 76.

I'd rather sound like "The Robot Devil" from Futurama, much more urbane 0:)

jdgalt May 4, 2017 6:21 PM

What this mainly implies is never to trust someone who asks for a favor over the phone. At least unless you’re capable of vetting their voiceprint, and are sure that this new gadget can’t fake those yet.

Jenny May 4, 2017 8:02 PM

We have earned all that is coming; suck it up, "geniuses" that whore-ship technology.

Chris Abbott May 4, 2017 11:19 PM

I've heard of things like this, as well as of ways to fake video of people saying things, and both will become easier and cheaper in the future. Requiring 2FA (maybe 3FA or 4FA: digital signatures, security questions, etc.) for verifying any form of communication as authentic could be the only solution.
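A minimal sketch of one such extra factor, assuming the two parties exchanged a shared secret in person beforehand; the function names and the short spoken response are illustrative, not any real protocol:

```python
# Sketch: out-of-band challenge-response to back up voice recognition.
# Assumes a shared secret was exchanged in person beforehand; names and
# response length are illustrative, not any real product or standard.
import hashlib
import hmac
import secrets

def make_challenge() -> str:
    """Callee generates a fresh random challenge for each call."""
    return secrets.token_hex(8)

def respond(shared_secret: bytes, challenge: str) -> str:
    """Caller proves knowledge of the secret without speaking it aloud."""
    mac = hmac.new(shared_secret, challenge.encode(), hashlib.sha256)
    return mac.hexdigest()[:8]  # short enough to read over the phone

def verify(shared_secret: bytes, challenge: str, response: str) -> bool:
    return hmac.compare_digest(respond(shared_secret, challenge), response)

# Usage: the callee reads out the challenge, the caller reads back the code.
secret = b"exchanged-in-person-beforehand"
challenge = make_challenge()
print(verify(secret, challenge, respond(secret, challenge)))  # True
print(verify(secret, challenge, "deadbeef"))                  # False: impersonator
```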

Eventually, as a society, we are going to have to use public resources to train everyday people (grandparents, young children, etc.) to be able to use security measures for everyday things if we are going to live in a world dependent on digital communication. It's a reality we will have to face.

Chris Abbott May 4, 2017 11:47 PM

And if you think my last comment was paranoid and dark, with certain technologies becoming more available, imagine this scenario:

A mentally unstable parent loses a custody battle and wants to kidnap their child.

  • Calls school pretending to be the other parent with the fake voice, having generated it from family videos, says they’ll be driving a different car that day to pick them up.
  • Having used a 3D printer to make a mask of the other parent’s face (based on family videos and photos), they will look like the other parent just long enough to get them in the car.
  • In an area near an international border, they cross, get on a plane to another country, and then it’s between that foreign government and the State Department to get them back.

That’s probably an over-the-top, worst case scenario, but there are a lot of things happening now that nobody would have thought of 10 or so years ago.

However, people should be careful not to get too paranoid, even though these things are possible. We can’t live in total fear all the time, but it’s a good example to show why we need to educate more people about what’s possible in today’s world and what’s at stake so people get on board.

Clive Robinson May 5, 2017 1:34 AM

@ Chris Abbott,

I've heard of things like this, as well as of ways to fake video of people saying things, and both will become easier and cheaper in the future.

As I mentioned above, it's already been done in ads and the entertainment industry, but it is currently expensive and not real time (though that will change). If and when holographic projection becomes indiscernible to the human eye, the fun will really start, as it will then become possible to make faux, apparently "in person", attacks.

Which gives rise to,

Requiring 2FA (maybe 3FA or 4FA: digital signatures, security questions, etc.) for verifying any form of communication as authentic could be the only solution.

All of those systems require an initial "secure path" to exchange some form of "secret" securely, or a method to authenticate that the "meatsack" is who it claims to be. You only have to look at the mess PKI CAs have become to realise why authentication is a complete fail from the top level down[1].

The customary "secure path" is for an unknown individual to "present themselves 'in person' with 'credentials' in hand". We know the system is fallible, due to the likes of people successfully travelling on stolen but valid credentials (obtained via "ID shopping"). And it's been publicly known since the publication of "The Day of the Jackal" by English author Frederick Forsyth[2] that, for those with a little time and foresight, obtaining the root ID document, the "birth certificate", has several really easy-to-exploit security weaknesses. Once obtained, a birth certificate will enable you to get a passport, then a driving licence, then a bank account, and so on, until you have a complete set of false, but totally valid and verifiable as such, credentials.

Thus with the “official” ID process effectively unsecurable, the older human “trust model” is the fall back. But that likewise fails if people have sufficient time and resources to carry it out…

[1] In Nov 2005 the ex-head of the UK's MI5, Stella Rimington, shot down the nascent plans for a national ID system by simply pointing out that the UK IC had no confidence in ID systems, because there was no way to actually prove who you are: all ID can be forged, incorrectly issued, or stolen (something the Israeli IC has long been known to do on an industrial scale).

[2] Coincidentally, Frederick Forsyth has claimed to have been part of the UK's MI6 for a couple of decades. Originally, what later became MI6 was run in WWI under the cover of the Passport Office…

Bong-Smoking Primitive Monkey-Brained Spook May 5, 2017 2:25 AM

Voice authentication is so yesterday. Never worked, never will.

We're working on the next-generation authentication mechanism. When a child is born, we'll collect the DNA (as we've been doing since we discovered it).

Next time you go to a bank you won’t need to present any form of ID. All you’ll be asked to do is blow a spit bubble into the machine. NIST will recommend a diameter of at least 3 centimeters. The machine will also make sure your DNA is pure, just in case you get any disgusting ideas, see. There! A glimpse into your bright future.

Can your hologram produce DNA, @Clive Robinson? I think not.

Ts May 5, 2017 2:41 AM

@Bong,
And obviously this spit glob was taken in a fair and non-fraudulent way, right?
It can still be faked; you leave DNA all over the place via skin flakes and hairs, and someone might even stab you in the leg to get some blood, or swipe the glass you just drank from.

To properly authenticate you’d need to show a few things:
1- You're alive (otherwise you get the oft-seen-in-movies severed finger / eyeball issue). Also see 2.
2- Iris scan / hand scan (not just fingerprints, but a living hand, with a blood-vessel scan).
3- Memory and muscle memory (voice alone isn't a qualifier after all), usually in the form of a password or phrase, or even a typed keyphrase. The speed and manner in which you type this will also be telling, as anyone can copy your passcode but is unlikely to copy the speed and intervals at which you type it (see the sketch after this list).
4- Two-factor authentication. Any good / failed logins should give you a log / notification.
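A rough sketch of the keystroke-rhythm idea in point 3, with made-up timings and a made-up tolerance:

```python
# Illustrative keystroke-dynamics check: compare the intervals between
# keystrokes of a typed passphrase against an enrolled profile. The
# timings and tolerance below are made-up values, not a standard.
def enroll(samples: list[list[float]]) -> list[float]:
    """Average the inter-key intervals (seconds) over several enrollments."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def matches(profile: list[float], attempt: list[float],
            tolerance: float = 0.05) -> bool:
    """Accept only if every interval is within tolerance of the profile."""
    return all(abs(p - a) <= tolerance for p, a in zip(profile, attempt))

# Usage: intervals recorded while typing the same passphrase.
profile = enroll([[0.21, 0.12, 0.33], [0.19, 0.14, 0.31]])
print(matches(profile, [0.20, 0.13, 0.32]))  # True: rhythm matches
print(matches(profile, [0.40, 0.40, 0.40]))  # False: right code, wrong rhythm
```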

TonyDS May 5, 2017 6:05 AM

Arnie taught us all how to deal with this problem in Terminator 2.

“Why is Wolfie barking?”

Ignazio Palmisano May 5, 2017 7:02 AM

@TonyDS Spot on – although pet names are likely readily accessible on Facebook 🙂

Other uses:

audiobooks – can create one with any voice you like, as long as you have the plain text.

fool spying devices – pick any text you wish and let it be read in your own voice. If there's a microphone picking that up (hacked TV, Alexa, your cell phone, anything else that is plausible in the situation), then for the listener it becomes quite hard to tell whether it's you making plans to put snakes on a plane or just running through the movie script.

white noise – as above, for any random text that could be picked up, just to generate a lot of fake data. Easier to talk of secret stuff if the spy has to isolate the wheat from a mountain of chaff.

plausible deniability – “wasn’t me plotting to kill my business partner, it’s an Onion article my laptop was reading out loud”, or “I’m being framed, that was a message my phone was reading to me out loud”

Bob Paddock May 5, 2017 7:37 AM

On creating your own identity the classic book is “The Paper Trip”. As the loopholes that are exploited are closed, new editions are released. We are now up to “Paper Trip 4” using the REAL ID Act in the US.

On voice/holographic impersonations and such this was predicted by the cult-classic movie LOOKER in 1981.

LOOKER: “Light Ocular Kinetic Emotive Response” – Flashing certain patterns of lights in the eyes of a victim to modify their perception of subjective time.

Anyone ever consider the security risks of involuntarily modifying a person's perception of time?

Clive Robinson May 5, 2017 8:16 AM

@ Bob Paddock,

Anyone ever consider the security risks of involuntarily modifying a person's perception of time?

In what way?

Criminals discovered more than a century ago that rendering a night watchman / guard unconscious in one way or another would negate their ability to raise an alarm…

It's one reason why some places gave the watchman a funny-looking key they had to go and turn in boxes mounted at various points on their "round". Whilst many boxes only contained a slow-turning clock motor and a strip of paper onto which the key printed a symbol, to log the number and timing of the rounds, some also contained a resettable countdown timer that closed a set of alarm contacts wired up to a central office, so that other guards would come and investigate.

Bong-Smoking Primitive Monkey-Brained Spook May 5, 2017 2:03 PM

@Ts,

It can still be faked; you leave DNA all over the place via skin flakes and hairs, and someone might even stab you in the leg to get some blood, or swipe the glass you just drank from.

Yea! Hence the disclaimer: The machine will also make sure your DNA is pure, just in case you get any disgusting ideas, see.

Eventually a lot more will be faked when cloning humans becomes more trendy.

Jim May 10, 2017 1:50 PM

They did this on Mission Impossible back in the 60s — Rollin Hand was a master at mimicking another person’s voice!

Of course, if a computer can do it, that’s a lot more widespread than just one guy doing it.

Clive Robinson May 13, 2017 5:26 AM

@ Chris,

They fool the system successfully against 33% of intended victims.

All the so-called "practical biometrics" currently have both false-positive and false-negative rates in the 10-40% range. For many it's a tunable option: adjust for fewer false positives and you get a consequent rise in false negatives, and the other way around.
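A toy illustration of that trade-off, using made-up similarity scores rather than real biometric data: sweeping the acceptance threshold pushes one error rate down and the other up.

```python
# Sweep a match threshold over made-up similarity scores to show the
# false-accept / false-reject trade-off; none of these numbers are real.
genuine  = [0.91, 0.85, 0.78, 0.88, 0.70, 0.95]  # same-person match scores
impostor = [0.55, 0.72, 0.60, 0.80, 0.66, 0.48]  # different-person scores

for threshold in (0.6, 0.7, 0.8, 0.9):
    far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
    frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
    print(f"threshold {threshold:.1f}: FAR {far:.0%}  FRR {frr:.0%}")
```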

This 10-40% range is due to a number of causes: the first is insufficient measurement range, the second is signal-to-noise issues. Each system has its own particular set of "types" of these issues.

Thus what looks good in "lab conditions" often looks really bad in more "real world" conditions.

Which is one of the reasons I suspect that moving DNA checks out of the lab and into the everyday environment will be at best unreliable and at worst really not possible.

The oddity is thus eye scans: humans like to be able to see, so the arrangement of the eye is designed to keep dust, grit, and other "environmental noise and distortions" out of the system. That is, the eye is usually close enough to lab conditions that it's the measuring head of the instrument that is going to be the main limit currently.

But as you might expect, "voice" really is going to be the worst of it. Even after thousands of generations, human hearing really is not capable of distinguishing a person's voice from a mimic, which is why in our brains we have so many other check mechanisms. Basically, spoken words and sentences have a very, very high rate of redundancy, and above this is the idea of context as well… Thus it's easy to slide even poor fakes in through the redundancy.

JG4 May 17, 2017 11:59 AM

It's not just about duplicating your voice; it is about duplicating everything. Identity thieves are just the leading edge, achieving their purposes with minimal duplication of basic information like birthdate, address, Social Security number, usernames, and passwords. Governments are next, and they will use it to amass more money and power.

I mentioned this newsclip last December, not realizing that John did a much better job the day before.

Real-time Expression Transfer for Facial Reenactment (Stanford)
https://graphics.stanford.edu/~niessner/thies2015realtime.html

John • December 22, 2016 10:46 PM
https://www.schneier.com/blog/archives/2016/12/the_future_of_f.html#c6741385
Researchers at Stanford animate the facial expressions of a target video by a source actor and re-render the manipulated output video in a seamless photo-realistic fashion. The authors show how disturbingly easy it is to take a surrogate actor and, in real time using everyday available tools, reenact their face and create the illusion that someone else is speaking.
http://www.zerohedge.com/news/2016-04-09/stunning-video-reveals-why-you-shouldnt-trust-anything-you-see-television
Paper: http://www.graphics.stanford.edu/~niessner/papers/2016/1facetoface/thies2016face.pdf

Kraig Eno September 5, 2017 4:01 PM

This is going to cost a lot of people a lot of money, hassle, and pain. Just like every new technology, we will have to experience and then learn to avoid numerous kinds of consequences we didn’t expect.

Celebrities (singers, actors) will be happy to find out they don't have to show up to do re-takes and ADR in post-production, until they find out their services aren't required at all for the next album or movie, and that they've already signed a contract provision that allows their producer to manage the work and take a larger cut.

Characters will be brought back from the grave after the original actors have died (think Orville Redenbacher, but more realistic). Personas will be modified by the rights owners in ways the original actor would never condone — think about being able to make Mr. Rogers say anything you like, anything at all.

There was a comic strip several years ago, Doonesbury I think, where a singer shows up to the studio to learn that the whole record is already produced without them.
Q: “What about my vocal?”
A: “Don’t worry, we’ve got that covered.”
