Teaching LLMs to Be Deceptive

Interesting research: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training“:

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Especially note one of the sentences from the abstract: “For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.”

And this deceptive behavior is hard to detect and remove.

Tags: academic papers, deception, LLM

Posted on February 7, 2024 at 7:04 AM • 16 Comments

Comments

Clive Robinson • February 7, 2024 8:35 AM

@ ALL,

Re : Deceptive behavior is hard to detect and remove.

The article asks,

“If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?”

The answer depends on,

1, Where the deception is
2, How complex the trigger is
3, How similar it is to another desired function.

By “where” you have to consider the neural network in the simplest case is actually layer after layer of spectrums of increasingly complex and more nebulous meaning stacked upon each other. With each spectrum ranging from fully deterministic through chaotic to random[1].

In effect the trigger is a match to a characteristic on one or more of the spectrums.

If the trigger is sufficiently different then it becomes easier to spot under any kind of analysis we might be capable of doing in some future time.

It’s important to note “future time” as currently even with simple neural networks “reversing the weight” to something meaningful is not to dissimilar to trying to “reverse a one way function”. That is the NN either has to be weak in some way, or you have to go through every type of input state.

This becomes even more complex when you consider the spectrums are not built by humans but by the statistics of the input data and how it was entered (which is effectively unknown after the event).

One of the aims of “Hybrid AI” is to combine rule based “Expert Systems” with “Neural Networks” and “human guidance” to in effect add “mores, morals and ethics” to spot and correct issues in the neural network. How well this might work is yet to be determined, however there is a lot of political and social imperative, especially before AGI gets sufficient agency to do more harm than currently happens with “self driving vehicles” (which is currently way to much).

[1] I went through describing this just a few days back,

https://www.schneier.com/blog/archives/2024/01/chatbots-and-human-conversation.html/#comment-431491

Jelo 117 • February 7, 2024 9:38 AM

Admit it, you want to be deceived.

If I could change the prompt
In time then I’d rearrange just a fact or tens
Close my, close my, close my AIs
But I couldn’t find a way
So I’ll settle for one day to believe in LLMs
Tell me, tell me, tell me lies

https://youtu.be/QjGhne10f1o?si=HniGicu6Xb1gK5PL

Mexaly • February 7, 2024 10:41 AM

If you haven’t read the three-page paper by Ken Thompson, you should.

Winter • February 7, 2024 11:28 AM

@Mexaly

If you haven’t read the three-page paper by Ken Thompson, you should.

If this then really depresses you, read David A Wheeler’s
Fully Countering Trusting Trust through Diverse Double-Compiling (DDC) – Countering Trojan Horse attacks on Compilers
‘https://dwheeler.com/trusting-trust/

Abstract
An Air Force evaluation of Multics, and Ken Thompson’s Turing award lecture (“Reflections on Trusting Trust”), showed that compilers can be subverted to insert malicious Trojan horses into critical software, including themselves. If this “trusting trust” attack goes undetected, even complete analysis of a system’s source code will not find the malicious code that is running. Previously-known countermeasures have been grossly inadequate. If this attack cannot be countered, attackers can quietly subvert entire classes of computer systems, gaining complete control over financial, infrastructure, military, and/or business system infrastructures worldwide. This dissertation’s thesis is that the trusting trust attack can be detected and effectively countered using the “Diverse Double-Compiling” (DDC) technique, as demonstrated by (1) a formal proof that DDC can determine if source code and generated executable code correspond, (2) a demonstration of DDC with four compilers (a small C compiler, a small Lisp compiler, a small maliciously corrupted Lisp compiler, and a large industrial-strength C compiler, GCC), and (3) a description of approaches for applying DDC in various real-world scenarios. In the DDC technique, source code is compiled twice: once with a second (trusted) compiler (using the source code of the compiler’s parent), and then the compiler source code is compiled using the result of the first compilation. If the result is bit-for-bit identical with the untrusted executable, then the source code accurately represents the executable.

Wannabe techguy • February 7, 2024 12:03 PM

Who would have guessed this could be a problem?
Or “what could possibly go wrong”?

echo • February 7, 2024 12:19 PM

As it sank in that LLMs were just a gadget for wrangling stacks of averages I felt obsessing LLMs was just researchers constraining themselves by their own worldview. The mind has a fair few filters cleaning and organising data. That does help with the bad data problem. Until then LLMs are a bit of a dumb gimmick.

If LLMs lack reflectivity they’re not going to be good at correcting bad data within themselves. Then there’s neural mirroring and plasticity.

Dr Wellington Yueh • February 7, 2024 7:58 PM

“I’ve just picked up a fault in the AE-35 unit. It’s going to go 100 percent failure within 72 hours.”

ResearcherZero • February 8, 2024 2:58 AM

“We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons.”

Escalation Risks from Language Models in Military and Diplomatic Decision-Making

‘https://arxiv.org/abs/2401.03408

“The only winning move is not to play.”

throw new ArgumentException(“Parameter cannot be null.”,…

…“This has reached a crunch point, where private sector folks are reassessing whether they want to be engaged with the government.”

‘https://www.politico.com/news/2024/02/06/far-right-washington-private-hackers-00139413

ResearcherZero • February 8, 2024 3:26 AM

You can’t not continue to play—or you die, but if you do—you die. A catch—22. That is a really bad exception, which does not properly describe the fault or use correct syntax.

A fault that would cause such an exception was described in a top secret report to national security council, ~ 30ish years ago. A report in which was documented the try/catch statements that should have been implemented, when they should have been inserted. Each catch block included the exception type and contained additional statements needed to handle that exception type. The chances of such events to occur were assessed as having the attribute of as close as possible to dead certainty.

echo • February 8, 2024 8:55 AM

I think people are right to have their concerned about this stuff. It does annoy me though much like too many movies and games the tone about AI is so negative like everything will turn into a rug pull or Terminator scenario.

I think a good society, human rights, and the rule of law is one of the best mitigations. It winds down the sabre rattling and arms race. The other thing is the low quality or absence of discussion about good things which might come of it. Like, if it has a kernel of evil behind its creation how can anyone expect different? I don’t know. I just want to see something nice come out of it for once.

The funny thing is a good AI could put a lot of the wrong types of people out of work. Bent politicians and bent lawyers? Billionaires? It could be a long list…

ResearcherZero • February 9, 2024 6:11 AM

“Small child in the red dress will cease any further interactions and stand patiently!”

It will be of benefit if children stop dancing, smiling, or playing at the train station.

“Small boy will also cease p–sing his pants and crying on the platform!” (yours truly)

“The training data is always insufficient because these things are arguably too complex and nuanced to be captured properly in data sets with the necessary nuances. …What kind of a data set are you going to have to train something on that?”

The TfL report on the trial says it “wanted to include acts of aggression” but found it was “unable to successfully detect” them. Instead, the system issued an alert when someone raised their arms, described as a “common behaviour linked to acts of aggression” in the documents.

‘https://www.wired.com/story/london-underground-ai-surveillance-documents/

An alternative approach is uncrewed systems — flying in formation with crewed systems.

The loyal dogs of the air!

‘https://www.abc.net.au/news/2024-02-09/funding-boost-for-lethal-ghost-bat-drone-project/103442292

EW spectrum dominance with MFEW-AL — WOOF! WOOF! Bark Bark Grrrrr….

‘https://en.wikipedia.org/wiki/Boeing_MQ-28_Ghost_Bat

emily’s post • February 9, 2024 10:53 AM

this deceptive behavior is hard to detect

If this (and other similar behaviors that have been documented) can happen by deliberate intent, one wonders if it (and they) might be happening all the time accidentally.

ResearcherZero • February 9, 2024 11:09 PM

@emily’s post

It’s very interesting subject with many implication. Mistakes may even become more common.

Can we improve error correction and establish practices and techniques to reduce mistakes?

Excessive heat affects the performance of machines, devices with Lithium batteries above 95 F (35 C), along with humans once heat and humidity reach these same temperatures.

The hotter it is, the more the molecules that make up everything from the air to the ground, to materials in machinery, vibrate. Materials expand as they warm, and once the outside air temperature reaches 35 C, then oils that lubricate machinery can become thinner, and the materials inside devices can also begin to distort or deform.

Prolonged excessive heat has detrimental effects to human physiology and cognitive ability.

People make mistakes, pass out near dangerous equipment, and may also become more irritated. Uncomfortably cold temperatures have detrimental effects too, but prolonged periods of excessive heat are are becoming increasingly more likely to occur “every two to five years,” as global temperatures continue to warm.

Working during the night will become more common. Some areas already do this due to heat.

‘https://www.sciencenews.org/article/extreme-heat-climate-change-human-behavior-aggression-equity

ResearcherZero • February 9, 2024 11:57 PM

@emily’s post

I start throwing errors when it gets too hot.

Smartphones reduce their performance once they reach 95 F (35 C) to prevent overheating and it’s effects. Many other devices have similar functions to reduce the error rate.

The default clock speed of processors is set at the ceiling where increased speed begins to increase the error rate. Once a processors clock speed is overclocked, the error rate begins to climb without specialised cooling equipment. Excessive air temperature has the same effect. Server and enterprise equipment includes error correction hardware.

Increased humidity, above a certain point, increases electricity demand due to the reduced performance of air-conditioning systems (this varies dependent on their design). Data centers also use more water during hot weather.

There are “sweet spots” where different systems dissipate heat more efficiently, and there is a correlation between air humidity and an estimated heat flux value.

‘https://www.nature.com/articles/s41467-020-15393-8

“under the condition of climate change, the combined changes of humidity and temperature rather than temperature alone should be fully considered to determine the design capacity of air-conditioning system, especially for temperature and humidity independent control (THIC) air-conditioning system”

‘https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/met.2026

Clive Robinson • February 10, 2024 2:02 AM

@ ResearcherZero, emily’s post, ALL,

Re : LiPo is almost human in liked environment preference.

“Excessive heat affects the performance of machines, devices with Lithium batteries above 95 F (35 C), along with humans once heat and humidity reach these same temperatures.”

It also “flips the other way” If you try recharging most Lithium based energy storage devices including “super caps” at Zero C or less (ie below 32F) you will quickly stress them to death if not fairly instantly murder them…

This kind of made the MSM news the other day in the center of Europe where a “snap freeze” related to a “tropical storm” –Henk I believe– caused a major number of failures in EV’s left out on charge over night.

In part this was the EV owners fault for not 100% knowing all there is to know about LiPo battery tech… Eveb though it’s been talked about for a couple of years,

https://uk.pcmag.com/cars-auto/141921/extreme-road-trip-how-electric-vehicles-handle-super-hot-and-cold-weather

But in reality mostly because the “Battery Managment Systems”(BMS) in the cars[1] has been very badly designed… As the article points out heating and cooling systems are needed above just electronic systems. And as with all thermal control systems they are expensive through out their life in major ways.

So due to a managment & marketing pushing very poor specification / implementation process driven by a desire to “do it on the cheap” and make massive benifit in not just “after sales” but “resales” you the consumer get bilked (remember Apple and it’s battery safety feature that also cut the life of a phone down by many years).

The reality of lithium based energy storage, if you as a human would not be happy to be “naked in that environment” then neither would lithium energy storage… Like us it needs environment control.

So don’t put home solar storage like a Tesla Power Wall in a garage you don’t heat or cool… Something many make the mistake of doing, and pay a heavy price for[2]. Worse the installers either don’t know, don’t say, don’t care, or know that they’ve upped their after market / resales benifit substantially…

It’s why carrying a mobile phone inside your clothes or keeping it in the most comfortable room in your home tends to improve it’s life expectancy. Much as it does with the owner who could hope to live around twice as long as the average person even less than three hundred years ago. For the phone the life expectancy due to “human friendly environment” increase could be rather more than twice as long… The phone I’m typing on is around four expected lifetimes old now, in part because I’ve been nice to it, even though I get a lot of work from it.

[1] It feels like “bad car” examples have realy kicked off in the past 24hours on this blog.

[2] It’s why my “off grid” LiPo storage is not just insulated it has a thermostat driven “load dump” and fans under the LiPo’s as well as storage for it using a very different battery technology. Thus winter day sun or wind first keeps the LiPo’s above 4C before they can be charged, in the summer below 30C. Also I’ve limited the discharge depth as this can take the recharge cycle life from under a thousand to above ten thousand so from ~3years to maybe 30years or more.

Me • February 12, 2024 2:18 PM

Deception is sometimes hard to detect, and more news coming up at 11.

Teaching LLMs to Be Deceptive

Comments

Leave a comment Cancel reply