Undetectable Backdoors in Machine-Learning Models

New paper: “Planting Undetectable Backdoors in Machine Learning Models”:

Abstract: Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate “backdoor key”, the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees.

First, we show how to plant a backdoor in any model, using digital signature schemes. The construction guarantees that given black-box access to the original model and the backdoored version, it is computationally infeasible to find even a single input where they differ. This property implies that the backdoored model has generalization error comparable with the original model. Second, we demonstrate how to insert undetectable backdoors in models trained using the Random Fourier Features (RFF) learning paradigm or in Random ReLU networks. In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is “clean” or contains a backdoor.
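To make the first construction concrete, here is a minimal sketch of the idea, not the paper’s actual scheme: a keyed MAC (HMAC) stands in for the public-key digital signature, and `clean_model`, `backdoored_model`, and `activate` are hypothetical names. On any input without a valid tag the two models agree, so a black-box observer who cannot forge tags cannot find an input where they differ:

```python
import hashlib
import hmac

# Toy sketch (not the paper's scheme): an HMAC stands in for a public-key
# digital signature; all names here are hypothetical.
BACKDOOR_KEY = b"attacker-held secret"

def clean_model(x: bytes) -> int:
    # Stand-in for the honestly trained classifier.
    return sum(x) % 2

def backdoored_model(x: bytes) -> int:
    # Treat the last 16 bytes as a candidate tag over the rest of the input;
    # a valid tag flips the prediction, anything else defers to the clean model.
    payload, tag = x[:-16], x[-16:]
    expected = hmac.new(BACKDOOR_KEY, payload, hashlib.sha256).digest()[:16]
    if hmac.compare_digest(tag, expected):
        return 1 - clean_model(payload)
    return clean_model(x)

def activate(payload: bytes) -> bytes:
    # Only the key holder can append a tag that triggers the flip.
    tag = hmac.new(BACKDOOR_KEY, payload, hashlib.sha256).digest()[:16]
    return payload + tag
```

The paper’s version uses a public-key signature scheme instead, so that even handing over the full description of the backdoored model reveals only a verification key, never the signing key needed to trigger the flip.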

Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, our construction can produce a classifier that is indistinguishable from an “adversarially robust” classifier, but where every input has an adversarial example! In summary, the existence of undetectable backdoors represent a significant theoretical roadblock to certifying adversarial robustness.

EDITED TO ADD (4/20): Cory Doctorow wrote about this as well.

Posted on April 19, 2022 at 3:12 PM • 21 Comments


Cormacolinde April 19, 2022 4:59 PM

Just as anyone can create a cryptographic system that they themselves cannot break, I suspect someone else may manage to find a way to detect such a backdoor. It is still interesting research into this subject, although in my opinion the biggest issue with ML and algorithms is more about the unknown biases than any that might be put in by design.

David Leppik April 19, 2022 5:27 PM

@ Cormacolinde:

It is possible to design a cryptographic system with a backdoor where there is provably not enough information to identify the backdoor. In practice, these systems are suspicious because they contain hard-coded constants for no apparent reason. One such PRNG, Dual_EC_DRBG, was actually standardized.

The same principle has caused trouble elsewhere. The 2007-2008 global economic crisis was caused in part by mortgage-backed securities, where many home mortgages were sliced up and rolled into a “sausage” of tradable assets. High-quality mortgages were mixed with low-quality ones, and the quality was overrated for all of them as people were hustled into mortgages they couldn’t afford in order to feed the demand for the mortgage sausages. At the same time, cryptographers proved that, even with correctly valued mortgages, it was possible to undetectably concentrate bad mortgages into one group of securities while concentrating good mortgages into a different group. Like invisible gerrymandering of mortgages.

As with mortgages, ML systems have a lot of random complexity, so it’s not surprising that they can be manipulated like this. The only real countermeasure is to make sure the people training the AI don’t have incentive to cheat.

Unfortunately, that’s easier said than done. Google has every incentive to drive high-quality traffic to its properties, and it may be impossible to detect this manipulation.

In this case, an insider at a credit card company might train a fraud-detection system with a back door to keep fraud from being detected. The financial incentive for this would be nearly impossible to remove.

Clive Robinson April 19, 2022 5:52 PM

@ All,

The quote,

“In summary, the existence of undetectable backdoors represent a significant theoretical roadblock to certifying adversarial robustness.”

Made me smile. In effect it’s the logical equivalent of why “Perfect Secrecy” exists.

Perfect Secrecy is not at all about “secrecy”; it is about “all messages being equiprobable”. That is, there is no distinguisher for any single message, or any subset, of the set of all messages of length M in an alphabet of radix size R.
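A toy one-time pad makes the equiprobability point concrete (a sketch, not tied to the paper): for any ciphertext, every plaintext of the same length is explained by some key, so the ciphertext alone offers no distinguisher.

```python
# One-time pad sketch: any ciphertext is consistent with every equal-length
# plaintext under some key, so no distinguisher exists for the message.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

ciphertext = xor(b"attack", b"\x13\x37\x42\x99\xab\xcd")  # some arbitrary key

for candidate in (b"attack", b"defend", b"ignore"):
    key = xor(ciphertext, candidate)   # a key that "explains" this candidate
    assert xor(ciphertext, key) == candidate
```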

It’s the “no distinguisher” property which means you cannot find the back door.

But actually you can in practice in the real world…

By definition the backdoor serves a purpose thus can not be a “One Way Function”.

Because a backdoor “serves a purpose”, it will almost certainly get used. Thus the question changes and becomes,

“Knowing a set of messages from a user U, can you correlate any with activities or actions by U that served an identifiable purpose?”.

This gives rise to the interesting notion if not proof that,

“A backdoor only remains unknown whilst not used”

But then there is the always awkward probabilistic problem of,

“If a backdoor serves a purpose that can be observed, then subjecting the system to “random input” will at some point give rise to it being observed.”
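As a back-of-the-envelope illustration (assuming, hypothetically, that each random probe trips the backdoor independently with probability p), the chance of observing it at least once in n trials is 1 - (1 - p)^n:

```python
# Hypothetical model: each random probe trips the backdoor independently
# with probability p, so n probes observe it with probability 1 - (1 - p)^n.
def detection_probability(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# A one-in-a-million trigger is very likely seen after ten million probes...
assert detection_probability(1e-6, 10_000_000) > 0.99
# ...but a cryptographic trigger (p on the order of 2**-128) never is.
assert detection_probability(2.0 ** -128, 10 ** 12) < 1e-15
```

The catch, and arguably the paper’s point, is that a signature-triggered backdoor puts p at a cryptographically small value, so random probing never gets there in practice.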

JonKnowsNothing April 19, 2022 8:24 PM

@Clive, @All

It’s rather like a cracking tower where the spill out level captures several molecules or components together.

This is a process of “distilling out” different substances but some components have similar distillation points making separation at that level and via this process inadequate.

You may know there is contamination, or you may guess there is contamination, but if you have 2 items cracking at the same temperature via the same mechanism you cannot separate them.

It’s a classic cracking tower problem…


Search Terms


Cracking Tower

Cracking (chemistry)

Destructive distillation

SpaceLifeForm April 19, 2022 10:27 PM

I’ve met people like this


they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting

Ted April 19, 2022 10:40 PM

From the intro:

Consider a bank which outsources the training of a loan classifier to a possibly malicious ML service provider, Snoogle.


Winter April 20, 2022 12:20 AM

This seems to prove that invisible watermarking of digital objects is possible and feasible. Such a result would, in itself, be a remarkable proof.

Given the history of unbreakable watermarking, I still have my doubts.

Mr. C April 20, 2022 2:40 AM

@ Winter:

Sure, if your digital object is several gigabytes and your method for detecting watermarks is “treat it as an image-recognition ML network and see if it can identify this image.” There doesn’t seem to be any reason for it to follow that this carries over in any way to watermarking smaller files in more practical ways.

Winter April 20, 2022 2:59 AM

@mr. C
“Sure, if your digital object is several gigabytes and your method for detecting watermarks is “treat it as an image-recognition ML network and see if it can identify this image.” ”

The system can distinguish between tagged/bugged and untagged/unbugged images where other systems (humans) cannot, using distributed features.

Sounds like watermarking of the images to me.

Ted April 20, 2022 9:43 AM

I feel Cory on this. I am even more than totally unqualified to assess the robustness of the paper’s mathematical proofs. But considering one of the paper’s authors won a Turing Award, I don’t feel so bad.

Some of the unexpected ML results Cory writes about are curious. This example, however, is just plain scary:

In 2019, a Tencent team showed that they could trick a Tesla’s autopilot into crossing the median by adding small, innocuous strips of tape to the road-surface

Machine learning security is a field I had not given a lot of thought to. I’m glad other people have. I wasn’t aware that some companies, like IBM and Microsoft, have released tools to harden ML models against adversarial attacks. The complexity there would be pretty amazing.


lurker April 20, 2022 1:39 PM

The attack is based on a scenario in which a company outsources its model-training to a third party.[Doctorow]

Trust, Security, which came first?

Jim April 20, 2022 3:35 PM

Wouldn’t it be possible to take such an outsourced net, degrade it a bit with random perturbations, then quickly retrain it?

Is there a way to quantify the likelihood that would remove trap-doors vs. the incremental cost of retraining from a near-solution?
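Jim’s suggestion can be sketched with a toy linear model (a hypothetical setup; real networks are another matter, and the paper argues its signature-based backdoors can survive this kind of post-processing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the outsourced model: a linear classifier on random data.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)
w = true_w.copy()                      # the weights "delivered" by the provider

# Step 1: degrade the model with random weight perturbations.
w += 0.5 * rng.normal(size=w.shape)

# Step 2: quickly retrain with a few steps of logistic-loss gradient descent.
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

accuracy = ((X @ w > 0) == (y > 0.5)).mean()
```

Whether a bounded amount of noise-plus-retraining provably removes a planted backdoor, and at what cost in accuracy, is exactly the open quantification the question asks about.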

Clive Robinson April 20, 2022 4:23 PM

@ lurker,

Trust, Security, which came first?

Err perhaps you should first ask,

“What type of trust?”

ICTsec-trust and Human-trust are in reality almost the opposite of each other[1].

Many humans make the mistake of “informing others” of things about themselves they have absolutely no need to. That is, they have the “human agency” to make a conscious choice.

Thus “they hope –sociologically– the others do not –choose– to betray them”.

ICTsec-trust on the other hand is about enforcing a mandated security policy. It does not involve sociological factors, nor does the device or system have discretion or choice, as it should be fully deterministic and have no agency.

It’s the “human-trust” people have incorrectly extended to digital systems, especially the store-and-forward communications variety that forms “Social Media”, which they incorrectly assume has some form of agency or choice and thus can have discretion.

In some cases people have then been, at the very least, embarrassingly betrayed by others, due to the ease of information duplication and dissemination with apparent anonymity. This is because there is either no or insufficient policy or audit of actions, and the person fails to understand that the system is deterministic and has neither agency nor understanding of content/context.

Many years ago, back when “write-up, read-down” was kind of new, I used the mnemonic RIP-ICE, where in the case of what we now call System-Security,

The System or computer “used”,

Rules + Information = Processes.

And Humans “assume”,

Integrity + Context = Exigency.

And that although many assume they are broadly the same, or logically mapped onto each other,

Rules “are NOT” Integrity.
Information “is NOT” Context.
Processes “are NOT” Exigency.

(Unfortunately the British English meaning and use of Exigency and the US English meaning and use are somewhat different).

[1] It’s the same as the usage of the connectives “and”/“or” in spoken English versus “AND”/“OR” in logic and most that derives from it.

If the boss says “Get me John and Jim’s details” he wants the logical OR –ie both– not AND.

lurker April 20, 2022 7:30 PM

@Clive, “ICTsec-trust and Human-trust are in reality almost the opposite of each other.”

My pathology is to expect intelligent humans to behave like mechanical ICT systems in matters of trust. I imagine this might have worked for Hobbes or Kant, so I have to mark all others on a descending scale of “non-Hobbesian”, i.e. not ‘rational, free, or equal.’ Which can produce amusing scenes when I try to demonstrate to bank staff that their systems are illogical.

Clive Robinson April 21, 2022 6:14 AM

@ Moderator,

It would appear that your “Fifth task of Hercules” issue has arisen again…

As @Winter has noted, we have an “impersonator” issue come back.



Is not from me.

A cursory glance would suggest that all non-warning posts to this thread from,


Till this warning-post are shall we say “suspect” and outside of the posting rules.

Norio April 21, 2022 12:55 PM

@Bruce Schneier, thank you for the link to the Cory Doctorow article! It was entertaining and informative.


SpaceLifeForm April 26, 2022 11:19 PM


Panda or Gibbon?
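The “panda or gibbon” image is the classic fast-gradient-sign (FGSM) adversarial example from Goodfellow et al. Here is a minimal sketch of the mechanism on a linear score (the original used a deep network and a visually imperceptible perturbation, but the principle is the same):

```python
import numpy as np

# FGSM sketch on a linear "classifier": sign(w @ x) is the predicted class.
rng = np.random.default_rng(1)
w = rng.normal(size=100)            # model weights
x = rng.normal(size=100)            # a correctly classified input
score = w @ x

# Per-coordinate step chosen just large enough to flip the decision;
# because each coordinate moves with sign(w), the tiny changes add up.
eps = (abs(score) + 1.0) / np.abs(w).sum()
x_adv = x - eps * np.sign(score) * np.sign(w)

flipped = np.sign(w @ x_adv) != np.sign(score)
```

Each coordinate of the input changes by only eps, yet the aggregate movement along sign(w) is enough to cross the decision boundary, which is why high-dimensional models are so exposed to this attack.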

