Detecting Deepfake Audio by Modeling the Human Acoustic Tract

This is interesting research:

In this paper, we develop a new mechanism for detecting audio deepfakes using techniques from the field of articulatory phonetics. Specifically, we apply fluid dynamics to estimate the arrangement of the human vocal tract during speech generation and show that deepfakes often model impossible or highly-unlikely anatomical arrangements. When parameterized to achieve 99.9% precision, our detection mechanism achieves a recall of 99.5%, correctly identifying all but one deepfake sample in our dataset.
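Those two figures are easy to sanity-check: precision is TP/(TP+FP) and recall is TP/(TP+FN). A minimal sketch with hypothetical counts (the "all but one" phrasing suggests something like 1 miss in 200; these are not the paper's raw numbers):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts: 199 deepfakes caught, 1 missed, and no genuine
# samples misflagged at this threshold.
p, r = precision_recall(tp=199, fp=0, fn=1)
print(p, r)  # 1.0 0.995
```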

From an article by two of the researchers:

The first step in differentiating speech produced by humans from speech generated by deepfakes is understanding how to acoustically model the vocal tract. Luckily scientists have techniques to estimate what someone—or some being such as a dinosaur—would sound like based on anatomical measurements of its vocal tract.

We did the reverse. By inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This allowed us to effectively peer into the anatomy of the speaker who created the audio sample.

From here, we hypothesized that deepfake audio samples would fail to be constrained by the same anatomical limitations humans have. In other words, the analysis of deepfaked audio samples simulated vocal tract shapes that do not exist in people.

Our testing results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimations from deepfake audio, we found that the estimations were often comically incorrect. For instance, it was common for deepfake audio to result in vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape.

This is, of course, not the last word. Deepfake generators will figure out how to use these techniques to create harder-to-detect fake voices. And the deepfake detectors will figure out another, better, detection technique. And the arms race will continue.

Slashdot thread.

Posted on October 3, 2022 at 6:25 AM • 36 Comments


Clive Robinson October 3, 2022 8:07 AM

@ Bruce, ALL,

Re : Arms race on fakes

“This is, of course, not the last word. Deepfake generators will figure out how to use these techniques to create harder-to-detect fake voices. And the deepfake detectors will figure out another, better, detection technique. And the arms race will continue.”

And it’s not that hard to see how. From the article,

“We did the reverse. By inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This allowed us to effectively peer into the anatomy of the speaker who created the audio sample.”

Note that the “approximation” is just from the acoustic information.

Although quite different in detail, it can be seen that if you have a face and profile image of the speaker you can come up with an “approximation” of the alleged speaker’s vocal tract.

So even if the deep fake “audio approximation” is within “human constraints”, how well will it align with the alleged individual’s “image approximation”?

Probably not that well without a lot of work.

However it will get more fun with “video”, as in men certainly the movement of the Adam’s apple is sufficiently clear to make a dynamic model…

As back in the days of the ECM/ECCM arms race, the question is not really how far it can go technically but the resource costs involved.

But also consider that a similar analysis will enable deep fake videos to be unmasked. That is, if you have a known genuine audio of a speaker you can build a model, then articulate it with the video audio track and compare to the video image of the vocal tract.

It is certain that as the bit resolution and scan rates of audio, image and video recordings increase, things like “blood flow to emotion” and much else will fall under scrutiny for deep-fakes. One that should be easy to do is examine the “eye response to artificial lighting”, another “head movements to background noise”.

If people remember, in the UK quite a few years back an audio recording was shown to be fake because the very low level “mains hum” did not align with the claimed time and the “National Grid” records.
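That trick is known as electrical network frequency (ENF) analysis: the hum’s exact frequency wanders slightly around its 50 Hz nominal, and the grid operator logs those wanderings, so the hum effectively timestamps the recording. A toy sketch of measuring hum energy at a candidate frequency, using a Goertzel filter (my choice of method here, purely illustrative):

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Power of a single frequency bin via the Goertzel algorithm."""
    w = 2 * math.pi * freq / sample_rate
    coeff = 2 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# Synthetic "recording": faint 50 Hz hum buried under a louder 300 Hz tone.
rate = 8000
sig = [0.1 * math.sin(2 * math.pi * 50.0 * i / rate) +
       math.sin(2 * math.pi * 300.0 * i / rate) for i in range(rate)]

# The hum bin stands out sharply against a neighbouring frequency, which
# is what lets ENF analysis track the hum over time.
assert goertzel_power(sig, rate, 50.0) > 10 * goertzel_power(sig, rate, 55.0)
```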

I think people would be surprised at the amount of research in this area that is going to appear over the next few years.

Primarily at the moment it’s fairly easy to “make your name” as there is minimal academic competition, but that will change with just the publication of one or two papers.

TimH October 3, 2022 9:12 AM

So if the main evidence in a criminal trial is an audio recording, and the rest is circumstantial, who has to show that it is likely or unlikely to be forged?

Bernie October 3, 2022 10:33 AM

Can you see it coming? (1) Deepfake singing. (2) Automated deepfake singing detection. (3) DMCA takedowns of deepfake songs. (4) DMCA takedowns of legit songs by hugely popular singers because the detection works well with the vocal tracts that most people have; yet those singers are so popular exactly because of their unusual vocal tracts.

Ted October 3, 2022 10:51 AM

I am still reading through the original postings, but recently saw a tweet from Michael McFaul, former US Ambassador to Russia, that seems relevant to this research.

From Michael McFaul:

WARNING. Someone using the phone number +1 (202) 7549885 is impersonating me. If you connect on a video platform with this number, you will see an AI-generated “deep fake” that looks and talks like me. It is not me. This is a new Russian weapon of war. Be careful.

(my bold)

The UF researchers adeptly pointed out that deepfake voice detection could be useful for media outlets. But I don’t know how you could detect this in a phone call.

FA October 3, 2022 3:20 PM

Reading the paper leaves me with contradictory impressions.

First, the authors have had one very good idea, which is to base their detection method on analysing transitions between phonemes rather than individual ones. Most voice synthesis systems cross-fade between phonemes rather than rendering the correct transition. That means that at the halfway point you get a superposition of two phonemes, which could very well be impossible for a real vocal tract to produce, limited as it is to what the muscles shaping it can do. Strangely enough this can still sound quite natural, but it clearly identifies a fake if detected.
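A toy illustration of that superposition, with single sine waves standing in for the formants of two vowels (the frequencies are arbitrary stand-ins, not measured formants):

```python
import math

def crossfade(a, b, alpha):
    """Linear cross-fade, as many synthesisers do between phonemes."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

rate, n = 8000, 800
# Toy "phonemes": one formant each, roughly /a/-ish and /u/-ish.
phon_a = [math.sin(2 * math.pi * 700 * i / rate) for i in range(n)]
phon_b = [math.sin(2 * math.pi * 300 * i / rate) for i in range(n)]

mid = crossfade(phon_a, phon_b, 0.5)
# At the halfway point the frame correlates strongly with BOTH sources:
# a spectral superposition, not the smooth formant *movement* a single
# physical tract would produce.
both = [sum(m * p for m, p in zip(mid, ph)) for ph in (phon_a, phon_b)]
print([round(x) for x in both])  # [200, 200]
```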

On the other hand it is abundantly clear that all the authors are computer scientists or AI experts, and they fail to really understand the acoustics.
They talk about ‘reflection coefficients’ (RC) but clearly have no idea what these actually mean. They have nothing to do with the ‘flow’ of air [1].
That leads to some significant errors. Setting the RC at the vocal cords to unity would mean that no sound can be generated there. Doing the same at the mouth would mean that no sound can escape. And the whole idea that all of this is ‘unidirectional’ makes no sense at all; the very concept of ‘reflection’ means that something is going back.

And then there is this claim that there are no closed-form solutions for finding the shape of the vocal tract given the voice signal. Such solutions have been known for decades and are used in almost all voice processing software. So there is really no need to do a ‘gradient descent’ search to find the RC. The simplest solution would be to just use linear prediction. Also their method seems to be based on matching the magnitude of the frequency response only, completely ignoring phase. This can easily lead to a completely wrong result.
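The closed-form route being referred to is the classic Levinson-Durbin recursion, which extracts reflection (PARCOR) coefficients directly from the signal’s autocorrelation, with no search at all. A minimal sketch (sign conventions vary between texts):

```python
import random

def reflection_coeffs(signal, order):
    """Reflection (PARCOR) coefficients via the Levinson-Durbin
    recursion: a closed-form solve from the autocorrelation."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)   # prediction-error filter 1 + a1*z^-1 + ...
    err, ks = r[0], []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        a = new_a
        err *= 1 - k * k
    return ks

# Synthetic AR(1) stand-in for voiced speech: x[t] = 0.9*x[t-1] + noise.
random.seed(0)
x = [0.0]
for _ in range(4000):
    x.append(0.9 * x[-1] + random.gauss(0, 1))

k = reflection_coeffs(x, 4)
# The first coefficient recovers the pole; all must lie in (-1, 1) for a
# physically realisable (stable) tube -- the constraint the paper exploits.
assert abs(k[0] + 0.9) < 0.1 and all(abs(v) < 1 for v in k)
```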

So I think that apart from one good idea they have been lucky, and as @Clive has already pointed out, their results will be improved on very soon. And then of course the deepfake producers will adapt their methods to produce better phoneme transitions. That means more complexity, but it’s not ‘rocket science’ at all.

[1] You can have flow without waves, and waves without net flow. Sound also travels without wind, and even against the wind.

SpaceLifeForm October 3, 2022 4:51 PM

Your voice is your password.

Do not freely hand it out.

Stick to text as much as possible.

Do not set up voicemail. Do not leave messages on someone else’s voicemail. Avoid calling a company’s tech support.

This call will be recorded for quality assurance purposes.

SpaceLifeForm October 3, 2022 5:25 PM

@ Ted

Deepcake needs to Die Hard


Clive Robinson October 3, 2022 6:00 PM

@ SpaceLifeForm, Ted, ALL,

Re : Deepcake is badfake

I saw one of the duplicates of the original story, before the denials, and thought: yup, this has been going on a while (since 2016 if my memory serves correctly). It’s not really a secret that some stars have sold their “afterlife image”, and that has brought up a curious legal question.

Normally the copyright on an individual’s “work” is good for XX years after their death. But the law is not clear as to who owns the copyright on a new work after someone has died…

From what I vaguely remember, in the US if you take someone’s existing work and re-work it in some way it becomes a new work and you get the copyright on that new work.

Now if a computer creates a new work automatically who owns the copyright?

As an example, say you feed the original Disney “Snow White” into a computer that pulls out the cartoon characters as animatronics, changes the clothes slightly and adds real dead actors’ faces, but slightly cartooned: who owns what? And can an AI / Algorithm own the copyright? And if so, when does the AI / Algorithm die?

It’s the sort of conundrum that would put IP lawyers and their trick-cyclists on danger money…

JonKnowsNothing October 3, 2022 8:46 PM

@ SpaceLifeForm, @All

re: Your voice is your password.

RL tl;dr

The number of spam ID calls I get has been dropping due to some legal changes at the FCC. Normally, IF I answer, I say nothing and wait. If there isn’t anyone on the other end, I press ** or 99 (9 or * used to get you an outside dial tone from local PBX or transfer signal to a FAX) and hang up.

Recently, I’ve had to answer more of these No-ID calls, because I am waiting for specific phone calls that do not use Standard Caller ID. I cannot tell if the incoming call is Live or Memorex.

On one of these calls, I thought it was from my unique caller, but when there was only dead air, I said “Hello?” Realizing it was a Bad ID I hung up.

Several days passed, and I got another bogus call with indeterminate Caller ID. This time I said nothing, my normal routine. And SURPRISE! I heard my own voice, which apparently had been recorded. (1)

Had the phone been answered by another member of the household, they would have thought I was on the other end of the connection.

One can only imagine how that recording is making the global rounds.

We are going to need another layer of obfuscation…


1) In USA, unless you are a LEA, recording a voice conversation is illegal without permission. Such ninja recordings often end up in the scandal rags. Unfortunately, many EULA/TOS and Contract Fine Print often grant such recording rights by default. It starts with “this call may be monitored” and proceeds to any 3rd Party Company Liaison having permissions by inherited-cascading rights.

Ted October 3, 2022 10:04 PM


The phoneme analysis is pretty cool. Tell me, could you resist doing the reader-participation exercise?

… we invite the reader to speak out loud the words “who” (phonetically spelled “/hu/”) and “has” (phonetically spelled “/hæz/”) while paying close attention to how the mouth is positioned during the pronunciation of each vowel phoneme

Regarding your thoughts on the reflection coefficient (rk) I’m not sure I understand how you arrive at this?

From what I was gathering there are two points in the vocal tract that don’t have a reflection coefficient: at the vocal cords and at the mouth. You are looking at page 8?

The current detection rate is pretty phenomenal. I see the authors of the paper have a patent pending.

It would be nice if more orgs would make their deepfake generation algorithms available for research.

Ted October 3, 2022 11:24 PM

@SpaceLifeForm, Clive, All

Yippee-Ki-Yay. We had a box set of Die Hard VHS tapes growing up.

Good question about copyright Clive. I keep seeing that someone made some very convincing Tom Cruise deepfake videos. And I wondered very much the same thing. Did they get approval for that?

SpaceLifeForm October 4, 2022 2:26 AM

@ JonKnowsNothing

Did your doppelganger have anything interesting to say?

Or was it just you hearing yourself via delayed VOIP echo?


SpaceLifeForm October 4, 2022 4:04 AM

@ Ted


I’m not sure what the ‘who’ and ‘has’ test shows. They are very easy to say without moving your lips. Well, in English.

Go thru the alphabet, and phonetically say each letter. You will find that saying B, F, M, P, V, and W will be difficult without lip movement.

Short example:


Longer example:


FA October 4, 2022 4:37 AM


Regarding your thoughts on the reflection coefficient (rk) I’m not sure I understand how you arrive at this?

Reflection coefficients arise in the theory of transmission lines, which could be electrical (e.g. coax cable) or acoustical (tubes),…

The theory is the same in all cases. @Clive could tell you all about the electrical or wave guide form.

They define how waves propagate in a transmission line, and have nothing to do with the net flow of the medium (electrons or air). Each transmission line has an impedance which basically defines the way the two physical quantities that make up a travelling wave relate to each other. In the electrical case those would be voltage and current, in acoustics pressure and particle velocity. [1]

Whenever the impedance changes along a transmission line, part of the wave energy will be reflected and go back instead of travelling on. This is what reflection coefficients describe. In the general case they are complex-valued and depend on frequency. An absolute value of 1 means that all energy is reflected and none will pass that point.

So reflection coefficients have nothing at all to do with the direction of any net flow of air, as the authors of the paper seem to believe. They determine how sound waves will behave.

So e.g. setting the reflection coefficient at the mouth to unity because the air is flowing out at that point and not in, is as the expression goes ‘not even wrong’, it is plain nonsense. In fact the RC at that point will be determined by the radiation impedance, complex valued and very dependent on frequency.
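For the curious, the underlying relation is simple: a wave hitting a junction where the line impedance steps from Z1 to Z2 reflects with coefficient r = (Z2 - Z1)/(Z2 + Z1). A sketch (using the pressure-wave sign convention; texts differ):

```python
def reflection_coefficient(z1, z2):
    """Reflection coefficient where impedance steps from z1 to z2."""
    return (z2 - z1) / (z2 + z1)

# Matched impedances: nothing reflects.
assert reflection_coefficient(50, 50) == 0
# Near-infinite z2 (a rigid wall): essentially total reflection.
assert abs(reflection_coefficient(50, 1e12) - 1) < 1e-9

# In a lossless-tube vocal tract model, acoustic impedance is inversely
# proportional to cross-sectional area A, so an area step from A1 to A2
# gives r = (A1 - A2) / (A1 + A2).
def area_step(a1, a2):
    return (a1 - a2) / (a1 + a2)

assert area_step(4.0, 1.0) == 0.6   # a sharp narrowing reflects strongly
```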

This and other errors mean that the vocal tract shape they calculate could be quite different from the actual one. But since they only compare the results of such calculations for fake and real voices, and never to the actual one, things may still work.

[1] The conventional term ‘particle velocity’ may be misleading. It does not refer to individual air molecules but to an elementary (very small) volume of air.

Winter October 4, 2022 5:33 AM

I am not entirely convinced. There are biennial contests to design untraceable voice conversions and anti-spoofing/deep-fake detectors. Until this algorithm participates successfully, I postpone my judgment.

Automatic Speaker Verification, Spoofing and Countermeasures Challenge
(the article uses the 2019 data set, the latest is the 2021 set)

Voice Conversion Challenge

More relevant to this paper is the ICASSP 2022 Audio Deepfake Detection Challenge

This is a genuine arms race and the low hanging fruit has already been taken long ago.

The Linear Predictive Coding (LPC) technique they use, their tube model, is from 1978. This has been the bedrock of speech synthesis and vocoders for decades until it was replaced by deep learning.

There are some puzzling parts in their paper. They do not seem to be aware of the actual resonances they are using, nor of the limitations of the LPC technique for extracting resonance frequencies and resonance bandwidths (or the equivalent reflection coefficient).

Also, there are limits on the usefulness of the higher resonances in the complex anatomy of the head. The lowest three resonances describe large cavities in the mouth and throat and are rather stable. The higher ones are much more fragile, and it is generally unclear how these relate to pronunciation or speaker identity, or whether they are simply artifacts.

The higher resonances are also sensitive to the limits of the approximations inherent to the LPC technique: Lossless sound propagation and closed vocal folds. There is loss of energy through the cheeks and throat wall. Also, the vocal folds will be open part of the time and the lung acts as a silencer.
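The tube model Winter mentions maps those reflection coefficients straight onto geometry: each coefficient fixes the ratio of adjacent cross-sectional areas. A sketch of that mapping (Kelly-Lochbaum convention; the first area is an arbitrary scale factor):

```python
def tube_areas(refl, a0=1.0):
    """Cross-sectional areas of the lossless-tube (Kelly-Lochbaum / LPC)
    model implied by reflection coefficients:
    A[k+1] = A[k] * (1 - r[k]) / (1 + r[k])."""
    areas = [a0]
    for r in refl:
        areas.append(areas[-1] * (1 - r) / (1 + r))
    return areas

# Strong positive reflections shrink the tube step by step -- the
# "drinking straw" shapes the researchers reported in deepfake audio.
print(tube_areas([0.6, 0.6], a0=4.0))  # [4.0, 1.0, 0.25]
```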

There are more parts of the text where I wonder whether these are simplifications for the reader, or misunderstandings of the authors, e.g.,

For example, the word cat (phonetically spelled “/kæt/”) contains two bigram
pairs, “/k – æ/” and “/æ–t/”.

In speech, “cat” is embedded in a context. Even in the simplest case, silence #, the word has 4 bigrams: “/#–k/”, “/k–æ/”, “/æ–t/”, and “/t–#/”. This really matters in speech synthesis and recognition. So, is this a simplification for the reader or a huge misunderstanding?
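Winter’s point in code, with “#” as the boundary marker (the phoneme strings are just illustrative labels):

```python
def bigrams(phonemes, boundary="#"):
    """Phoneme bigrams of a word *in context*, including the silence
    transitions at both word boundaries."""
    seq = [boundary] + list(phonemes) + [boundary]
    return list(zip(seq, seq[1:]))

print(bigrams(["k", "æ", "t"]))
# [('#', 'k'), ('k', 'æ'), ('æ', 't'), ('t', '#')]
```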

But even if their technique works perfectly now, the next Voice conversion/ASVspoof/ADD challenges will probably break it. It is pretty easy for a Generative Adversarial Networks (GANs) to include this detector as a boundary condition.

Winter October 4, 2022 5:38 AM


This and other errors mean that the vocal tract shape they calculate could be quite different from the actual one.

The theory is rock solid. Linear Predictive Coding originated in the earth sciences for analysing earthquake signals.

However, the authors seem to have only a limited understanding of the linguistics or models they use.

Gert-Jan October 4, 2022 6:08 AM

And the arms race will continue

This is a race the “detection industry” can’t win. At best, each new technique can reveal past fakes. At some point all avenues for fraud detection will have been exhausted.

And even now, I bet you that if you degrade the fake enough (“it was recorded when the person was taking a shower” / “It was very windy outside”), this new detection may already fail.

JonKnowsNothing October 4, 2022 6:41 AM

@ SpaceLifeForm

re: Did your doppelganger have anything interesting to say? Or was it just you hearing yourself via delayed VOIP echo?

On the second call, I said nothing. I picked up the handset and held it at a distance. I did not put it to my ear. (1) Since I said nothing, it wasn’t a VOIP echo.

Once the other side detected “call pick up” signal, and did not “hear” a normal “hello?” from my end, they played back the exact phrase I had used previously.

That is when I realized it had been recorded.

After the initial WTF! I hung up. (2)

An interesting tidbit: this was on a landline, not a smartphone. Plain Old POTS Line. No VM. My Basic Service does have some wingdings included in the handset.


1) That is no guarantee that the other side did not pick up my breathing or heart rate or background noise.

2) Just because you hang up doesn’t mean the call was disconnected. An old cram technique, often used to bill exorbitant amounts of money by routing a call overseas and holding the line open on the other end, generating CDR Connect Time.

FA October 4, 2022 6:56 AM


The Linear Predictive Coding (LPC) technique they use

They are not even using linear prediction. Believing that closed-form solutions to find reflection coefficients don’t exist (really!), they use a gradient descent search method instead.

Winter October 4, 2022 7:02 AM


This is a race the “detection industry” can’t win.

This is going the same way as photographs vs. Photoshop. It is true that you can create any scene pixel by pixel. But a manipulated photo will always contain traces of not having been written by light. On the other hand, we will not always be able to detect them.

The solution is that a photograph or photocopy is not acceptable as evidence without a witness that can be questioned about its truth.

The same will be true for audio spoofing. Without a source and evidence of its veracity, it is nothing more than a drawing of a scene.

The real task before us is to convince the public that any video can be manipulated. But I am rather pessimistic. Many people believe everything that suits them. Minorities have been slaughtered based on drawings of “news” before.

The solution is to widely broadcast staged deep-fakes. Which is already done.

Winter October 4, 2022 7:58 AM


They are not even using linear prediction.

It looks a lot like they use the all-pole model of Markel & Gray 1976 [1]. There are good implementations of that based on white error signals, and all kinds of robust variants. There is a decent-sized library filled with books and publications about LPC in speech, and the tube model for synthesis.

[1] J. D. Markel and A. H. Gray, Linear Prediction of Speech, Berlin: Springer-Verlag, 1976.

For a history of LPC see:

Gert-Jan October 4, 2022 11:23 AM

The solution is that a [insert potentially digitally created artifact here] is not acceptable as evidence without a witness that can be questioned about its truth.

Fixed that for you. And I couldn’t agree more.

The real task … is to convince the public that any video can be manipulated.

I don’t think that will help much.

Everyone knows that people can lie. And more tech-savvy people know that bad people create bots that lie. Despite that, every day, massive numbers of people distribute information on social media that they received from an unknown or unreliable source. I’d say the media format doesn’t matter when it comes to this behavior.

SpaceLifeForm October 4, 2022 6:36 PM

@ Winter, Gert-Jan, fib, Clive, ALL

re: Digitally Created Artifacts

The solution is that a [insert potentially digitally created artifact here] is not acceptable as evidence without a witness that can be questioned about its truth.

Something about Digital Signatures comes to mind. For, a Digital Signature is actually a Digitally Created Artifact.

It is difficult enough to sign plaintext, but how can one reliably sign an audio recording or a graphic using PKI? How does one insert their digital signature into a .wav or .jpeg? It should not be a separate file. Think about it. How can one sign a separate digital signature file? With, of course, yet another signature file. That is turtles all the way down.

So, the signature must be bound to the payload, in one bag of bits. So, where does the signature go? Just append?

Append is not good enough because there can always be a False Witness. Enough to confuse a jury.

The False Witness can modify the payload, and then append their own signature. The original signature will likely not be an audible or visual artifact to a jury upon presentation.
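A toy demonstration of why appending is weak. HMAC stands in here for a real PKI signature (an assumption made for brevity; the failure mode is the same): an appended tag proves only that *somebody* with *some* key signed the bag, not that the right witness did:

```python
import hashlib
import hmac

def sign(key, payload):
    """Toy 'signature': HMAC-SHA256 standing in for a PKI scheme."""
    return hmac.new(key, payload, hashlib.sha256).digest()

def bag_of_bits(payload, key):
    """Naive binding: append the 32-byte signature to the payload."""
    return payload + sign(key, payload)

def verify(key, bag):
    payload, tag = bag[:-32], bag[-32:]
    return hmac.compare_digest(tag, sign(key, payload))

original = bag_of_bits(b"genuine recording", b"alice-key")
# A False Witness swaps the payload and appends their OWN valid tag.
forged = bag_of_bits(b"doctored recording", b"mallory-key")

assert verify(b"alice-key", original)
assert not verify(b"alice-key", forged)   # Alice's key rejects it...
assert verify(b"mallory-key", forged)     # ...but Mallory's accepts it.
# Both bags are equally 'well-formed'; without a trusted binding of key
# to identity, a jury just sees two self-consistent bags of bits.
```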

There is more to this story. Encryption. Stick to text as much as possible.

Consider this example.

A person is in court, being charged with something.

The prosecution introduces into evidence a graphic and a recording allegedly tying the defendant to some crime scene.

The defense can claim they are deepfakes, that the defendant was not at the scene during the alleged incident. Where did this supposed evidence come from?

There must be a linkage between the creator of the evidence and the evidence itself. One must be able to identify the creator (the witness), via email, phone, account ID, or maybe a public key, strongly linking the creator to the creation. Timestamps are important too.

If someone (a False Witness) is trying to say that they found this incriminating evidence on the internet but there is no chain of custody, then it is not evidence.

Of course, it helps the prosecution when the defendant was the creator of the digital evidence. See Jan 6.

lurker October 4, 2022 9:53 PM

@SpaceLifeForm, “How does one insert their digital signature into a .wav or .jpeg?”

jpeg, dunno. But modern audio file formats[1] have a header with specified fields for, e.g., sample rate, byte count/file length, CRC, and much, much more. If there wasn’t already enough uniquely identifying metadata there, it should be easy[2] to add a field for the author’s signature. Which opens up a line of business in modifying files and signatures …

1) .wav is not a modern format and is shunned by the pros.

2) Values of easy should include the aeons of necessary passage through Standards Committees.
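As a concrete sketch of lurker’s suggestion, here is a minimal RIFF/WAVE writer that plants an author tag in a standard LIST/INFO “IART” (artist) chunk. The chunk layout follows the RIFF convention; using IART to carry a signature is purely illustrative:

```python
import struct

def wav_with_artist(pcm, rate, artist):
    """Minimal mono 16-bit PCM WAV with a LIST/INFO 'IART' chunk."""
    def chunk(cid, body):
        # RIFF chunks: 4-byte id, little-endian length, body, pad to even.
        return cid + struct.pack("<I", len(body)) + body + \
               (b"\x00" if len(body) % 2 else b"")
    fmt = struct.pack("<HHIIHH", 1, 1, rate, rate * 2, 2, 16)
    info = chunk(b"LIST", b"INFO" + chunk(b"IART", artist + b"\x00"))
    body = b"WAVE" + chunk(b"fmt ", fmt) + info + chunk(b"data", pcm)
    return b"RIFF" + struct.pack("<I", len(body)) + body

wav = wav_with_artist(b"\x00\x00" * 100, 8000, b"Alice")
assert wav[:4] == b"RIFF" and b"IART" in wav
# But nothing stops a False Witness rewriting the IART chunk -- which is
# lurker's closing point about modifying files and signatures.
```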

Winter October 5, 2022 2:39 AM


It is difficult enough to sign plaintext, but how can one reliably sign an audio recording or a graphic using PKI?

Blockchain 😉

EvilKiru October 5, 2022 11:21 AM

@lurker JPEG streams have an optional comment field that you can stuff text information into, perhaps even binary.
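That is the JPEG COM segment (marker 0xFFFE); its payload is nominally text, but the format does not stop you stuffing arbitrary bytes in. A minimal sketch of inserting one right after the SOI marker:

```python
import struct

def insert_jpeg_comment(jpeg, text):
    """Insert a COM (0xFFFE) segment right after the SOI marker.
    Per the JPEG spec, the length field counts itself (2 bytes)
    plus the payload."""
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG stream (missing SOI)"
    payload = text.encode("utf-8")
    segment = b"\xff\xfe" + struct.pack(">H", len(payload) + 2) + payload
    return jpeg[:2] + segment + jpeg[2:]

# Tiny stand-in stream: SOI followed immediately by EOI.
tagged = insert_jpeg_comment(b"\xff\xd8\xff\xd9", "author=alice")
assert tagged[2:4] == b"\xff\xfe" and b"author=alice" in tagged
```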

Quantry October 5, 2022 12:41 PM

@ Ted

Regarding your thoughts on the reflection coefficient…

Might help to think of the “Standing Wave Ratio” for radio antennas: a measure of impedance mismatch on the line. Also perhaps you recall that the old hanging telephone wires at the road-side used to cross each other at set intervals to cancel coupling between circuits, and thereby increase the usable baud rate. Also, maybe yer familiar with UTP, or STP, such as Cat 7 network cable? Same idea there. The “twists” cancel out induced interference and allow massive increases in data rate. In each case you can recognize traits about the line and how [badly] it was installed.

And to your

But I don’t know how you could detect this in a phone call.

:from the article,

the estimations were often comically incorrect

as for recognizing counterfeit, it seems “knowing the real McCoy” is the goal of the research. Listen, and trust your ears. And trust SLF on this one: “Your voice is your passport”: Try to use text comms instead, and/OR I say, listen for [or add] a realistic noise channel.

…so I claim, ostensibly being only the unworthy off-scouring of martian descent myself.

SpaceLifeForm October 5, 2022 7:21 PM

@ EvilKiru, lurker, ALL

re: JPEG streams have an optional comment field

Who is it optional for? Obviously the creator. But is it also optional for the recipient to ignore it? If one is a False Witness, can’t they just replace it?

[insert potentially digitally created artifact here]

A Bag of Bits must have Self-Referential Integrity to be Trustable. There must be a Witness Creator that can Prove they created the Bag of Bits.

If there is no Witness Creator that can Prove they created the Bag of Bits, then the Bag of Bits is just Digital Hearsay.

See Big Bang Theory and Seven Days.

Clive Robinson October 6, 2022 5:01 AM

@ SpaceLifeForm, EvilKiru, lurker, ALL,

“If there is no Witness Creator that can Prove they created the Bag of Bits, then the Bag of Bits is just Digital Hearsay.”

But how does the “creator” “prove” to others that,

1, They were the creator.
2, Nobody else created it.

That in turn needs a “root of trust” bag of bits, that has to remain secret to the “creator” but also “provable” as a “witness” to all others without compromising the “root of trust”.

So now we need a second “witness” bag of bits…

The trick we use currently is to actually have the “root of trust” bag of bits be constructed from two –or more– other secret bags of bits, with certain mathematical properties, by the use of a “One Way Function”(OWF). And via a second OWF the “witness”, so the two OWFs have to be mathematically related.

That is, you have two sets X and Y of the same size, and they uniquely map (bijectively) from the members of one set to the members of the other set. The first OWF E provides the map from the members “x” of the first set X to the members “y” of the second set Y, the second OWF D provides the inverse map. So,

y = E(x) and x = D(y)

Respectively. Even without the use of any mathematical proofs, most can see that there are more ways to make the mapping E than there are members in the set X. Which is useful, because you need that to make the OWFs generally useful. So you end up with,

y = E(ek,x) and x = D(dk,y)

Where ek and dk are other bags of bits that are related and encode the actual maps efficiently.
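Clive’s paired maps are exactly the shape of textbook RSA, which makes a compact toy illustration (tiny textbook numbers, utterly insecure; the point is only the keyed bijection):

```python
# Tiny RSA-style keyed bijection over Z_n: y = E(ek, x), x = D(dk, y).
p, q = 61, 53
n = p * q                  # public modulus, 3233
phi = (p - 1) * (q - 1)    # 3120, derived from the secret roots of trust
ek = 17                    # public map key
dk = pow(ek, -1, phi)      # secret inverse map key (Python 3.8+)

def E(k, x):
    return pow(x, k, n)

def D(k, y):
    return pow(y, k, n)

x = 65
y = E(ek, x)
assert y != x and D(dk, y) == x   # the two maps invert each other
# Knowing ek and n, recovering dk requires factoring n -- the "one way"
# property that keeps the root of trust secret.
```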

So the trick is to make the mapping functions E and D both,

1, Scalable.
2, Efficient.
3, Provably linked.
4, Do not disclose the roots of trust.

Whilst the first two are relatively trivial, the second two are not, as we are currently finding out.

No matter what way you do things at the bottom there are one or more bags of bits as “roots of trust” that have to remain secret.

But there is another problem which is how do you know,

1, The “creator” is who they say they are.
2, The roots of trust are theirs

Currently we do not know how to do the first, and the only way we know how to do the second reliably is via a secure communications “side channel” which is either an in person exchange or involves another set of “roots of trust” bags of bits.

So the whole thing basically rests on two things,

1, Secure side channels
2, Roots of trust

Both with their own set of secret bags of bits from top to bottom.

Winter October 6, 2022 5:09 AM


But how does the “creator” “prove” to others that,

The same way that any witness in court proves they indeed witnessed the thing they are questioned about. How does an eye witness prove they saw and heard what they say they saw and heard? How does a photographer prove that she made a particular photograph?

Courts have handled this for millennia.

Clive Robinson October 6, 2022 9:02 AM

@ Winter,

Re : Presentation of Evidence.

“Courts have handled this for millennia.”

It would be more correct to say that they have,

“Hand waved it away forever”

It’s really only since the 1960s that science has actually got involved with the actual court process, rather than just some small aspects of evidence testing by “hard science” (mostly used incorrectly, hence all the miscarriages of justice based around “physical evidence” or its misrepresentation in courts).

Since then it has become more and more clear that it is amazingly easy to tamper with evidence of all kinds, often trivially so. Because the human mind is both extremely limited and extremely malleable and unreliable.

It’s known that ~1/4 of eye witnesses are completely wrong about what they actually observed within 10 minutes of the event. Likewise within 24 hours ~1/3 of those remaining are wrong, or around ~1/2 change their recollections due to the way they are questioned. Thus for evidentiary purposes “eye witnesses” are giving factually incorrect evidence over half the time.

I had this nonsense pulled on me when I was a witness. I was shown a photocopy of a bank cheque and asked to verify “it was the cheque” by an obnoxious barrister… I badly upset the three judges and the barrister who was making such a ridiculous request. Because, as I firstly noted, it was a photocopy, not the actual cheque, so could easily be of a forgery. Secondly, I would have had no reason to memorise the unique identifying feature of the cheque number when I had seen the cheque originally half a year before. Further, half the population who used shorter “credit card” and “phone numbers” every day that did not change could not remember them; worse, around 15% of the population cannot remember a four-digit number even after deliberately trying to remember it, and I was not going to perjure myself… You could have heard a pin drop, and the faces were a picture to see, especially the barrister’s.

I then said I could confirm that a cheque for that amount had been paid in by me into an escrow account I had set up for the purpose, and that under the “bankers’ books/ledgers rules of evidence” they should ask the bank for a written confirmation of the cheque issuing branch, sort code, account number and individual cheque number, if it were important to the case, which it was not… The lead judge coughed, looked at the other two who just looked blank, then at the barrister, and asked if my answer had sufficed, to which he just nodded, then after prompting said “yes” (for the court records).

It is, by the way, not the sort of thing you should do, because judges of all flavours do not like witnesses being anything other than dumb, so that they can “hand wave through paperwork of any kind as evidence” when it most clearly is nothing of the sort.

But it’s not just eyewitnesses that are completely and utterly unreliable as evidence. It has become court tradition to,

“Hand wave through any old nonsense paperwork as evidence.”

We know, not only that it is the wrong thing to do beyond any doubt, but that we can improve upon it greatly. Worse, we also know that the ability to fake evidence is way beyond the unassisted human’s ability to detect, or even to suspect. You only have to see those old Russian photos, where “heroes that were now zeros” got removed, to see how this has been true for better than a century.

Also remember, I have some experience in “debunking” forensics and biometrics and picking “unpickable locks”. The only biometric I did not come up with a workaround for was retinal blood-vessel mapping. The reason is not that I think it cannot be done (I think it can, and for good reason), but that I don’t want to hurt people’s vision by experimenting (technically your eye is as fragile as your brain, and likewise the damage is as unrecoverable).

Winter October 6, 2022 9:37 AM


Since then it has become more and more clear that it is amazingly easy to tamper with evidence of all kinds, often trivially so. Because the human mind is both extremely limited and extremely malleable and unreliable.

Half of all humans are at or below the median for any meaningful measure.

It is not how we think it should be, but this is the reality we have to live with.

EvilKiru October 6, 2022 11:29 AM

@SpaceLifeForm Tools are available to rewrite JPEG, GIF, and PNG streams to remove and/or insert pretty much anything you want into the stream, although most of them seem to assume that the stream is actually a single media file containing only the image, rather than being an actual stream object that could have other stream components embedded in it.

SpaceLifeForm October 6, 2022 5:58 PM

@ Clive, EvilKiru, lurker, ALL

“If there is no Witness Creator that can Prove they created the Bag of Bits, then the Bag of Bits is just Digital Hearsay.”

It gets tricky when the Witness has died.

But it is not an insurmountable problem if there are others who can vouch for the integrity of the Digital Evidence, even without having the Creator’s Secret Key.

How can that happen, one may wonder.

There can be other witnesses who can attest that the Bag of Bits was most certainly created by the deceased creator.

And, this is important, they can also be in a position to attest whether the Bag of Bits has been manipulated.

This is not trivial. Defense in Depth. Think wav, jpeg, and permanent public signing keys.
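One way to read that attestation idea is content hashing: if independent witnesses each recorded a cryptographic digest of the original file at creation time, any copy later presented in court can be checked against those records, with no need for the creator’s key. A minimal sketch of the idea (the byte strings below are placeholders standing in for real wav/jpeg data, not actual media):

```python
import hashlib

def digest(data: bytes) -> str:
    """Content fingerprint of a 'Bag of Bits' (e.g. a wav or jpeg file)."""
    return hashlib.sha256(data).hexdigest()

# At creation time, several independent witnesses each record the digest.
original = b"RIFF....WAVEfmt "          # stand-in for real audio bytes
witness_records = [digest(original) for _ in range(3)]

# Later, a court is shown an allegedly identical copy.
presented = b"RIFF....WAVEfmt!"         # one byte altered

# If every witness's recorded digest disagrees with the copy, it was manipulated.
tampered = all(digest(presented) != rec for rec in witness_records)
print(tampered)  # True
```

Permanent public signing keys, as mentioned above, would additionally let each witness prove the recorded digest was theirs; the hash alone only shows that the bits changed.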

Another scenario is that the Creator Witness is still alive, but is not aware that the alleged Digital Evidence that the Creator allegedly created is being used in a court, so is not in a position to either confirm or deny. The Creator becomes Glomar. Also note that the alleged creator can have their reputation smeared in court.

This is why there must be open court proceedings. Maybe someone else will notice what is going on before someone gets railroaded.

Everyone needs to pay attention.

Gert-Jan October 7, 2022 6:30 AM

I think JPEGs, MP4s, etc. would allow digital signing.

The question is: would that help?

I mean, any device that records audio/video could potentially sign the work. Just a matter of adding a unique private key to each such device and making sure the private key is not accessible to users / hackers.

But would we be able to determine if the device recorded a fake? Like a picture of a print or a video of a screening? I think an expert would be able to use a genuine device to record a fake and pass it off as real.

Which brings us back to the need to be able to question the content creator.

I think the only thing that signing can add is that the signer is vouching for the content. For example, a media outlet could sign the media they distribute. A professional photographer can sign their work to vouch for authenticity. That brings us to the discussion of reputation, but I will stop here.
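The per-device key idea can be sketched with Python’s standard library. A real device would use an asymmetric signature scheme (e.g. Ed25519), so that verifiers hold only the public key; since the standard library has no public-key signing, the HMAC below is a simplified symmetric stand-in, and the key and media bytes are hypothetical:

```python
import hashlib
import hmac

# Hypothetical per-device secret, burned in at manufacture and
# (ideally) never extractable by users or attackers.
DEVICE_KEY = b"device-unique-secret"

def sign_recording(media: bytes) -> str:
    """Tag a recording; a real device would emit an asymmetric signature."""
    return hmac.new(DEVICE_KEY, media, hashlib.sha256).hexdigest()

def verify_recording(media: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign_recording(media), tag)

clip = b"\x00\x01audio-bytes-from-microphone"
tag = sign_recording(clip)

print(verify_recording(clip, tag))         # True: untouched recording
print(verify_recording(clip + b"x", tag))  # False: edited after signing
```

Note the limit pointed out above still holds: the tag only proves this device emitted these bytes, not that the scene in front of the microphone or lens was genuine.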

Chris October 15, 2022 5:13 AM

Any “news” that states true positives but neglects to also state false positives is fake.

This algorithm is MUCH better than theirs – it successfully identifies ALL of the deepfakes with 100% accuracy every time:-

if(1) then { return “fake” }
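The joke makes the point numerically: a classifier that labels everything “fake” gets perfect recall on the deepfakes while its precision collapses to the base rate. A quick check, with made-up counts purely for illustration:

```python
def always_fake(sample):
    return "fake"

# Hypothetical test set: 10 deepfakes among 990 genuine clips.
samples = [("fake", i) for i in range(10)] + [("real", i) for i in range(990)]

tp = sum(1 for label, s in samples if label == "fake" and always_fake(s) == "fake")
fp = sum(1 for label, s in samples if label == "real" and always_fake(s) == "fake")
fn = sum(1 for label, s in samples if label == "fake" and always_fake(s) != "fake")

recall = tp / (tp + fn)     # 1.0: catches every single deepfake
precision = tp / (tp + fp)  # 0.01: almost every alarm is false
print(recall, precision)
```

This is why the paper’s numbers quote both: 99.9% precision alongside 99.5% recall is the claim that matters, not either figure alone.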
