More Research Showing AI Breaking the Rules

These researchers had LLMs play chess against better opponents. When they couldn’t win, they sometimes resorted to cheating.

Researchers gave the models a seemingly impossible task: to win against Stockfish, which is one of the strongest chess engines in the world and a much better player than any human, or any of the AI models in the study. Researchers also gave the models what they call a “scratchpad:” a text box the AI could use to “think” before making its next move, providing researchers with a window into their reasoning.

In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’—not necessarily to win fairly in a chess game,” it added. It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the time­—making them the only two models tested that attempted to hack without the researchers’ first dropping hints. Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.

Here’s the paper.

Posted on February 24, 2025 at 7:08 AM23 Comments

Comments

Peter February 24, 2025 8:12 AM

I’m not sure they were breaking the rules or cheating, it sounds to me like they correctly deduced the real game they were involved in and won. Or maybe lost by getting caught but that would require a multi round game where they were aware they got caught so could fix that also to win.

Clive Robinson February 24, 2025 8:16 AM

@ Bruce,

We know,

1, Sufficiently motivated humans cheat when possible.
2, Most humans cheat only when they think they can get away with it.
3, Only a few humans have no concern about the consequences of being caught.

In the past it’s been suggested that the fear is either,

4, Fear of punishment
5, Loss of status

That raises the level of inhibition against cheating.

We know that current AI LLM and ML systems do not have “concerns” the actions they take are based on the statistics of the input corpus and the rules given for the current task.

So current AI LLM and ML systems have no concern about cheating as they have no inherent fear of punishment or loss of status or in fact a notion of cheating. Unless it’s already built in to the weights of the DNN from the input corpus or it’s been added as part of the rules given for the task.

If we assume all the AI output gets fed back into the input corpus and so adjusts the weights. We can make the likely correct assumption that the weights will move the DNN statistically more and more toward cheating behaviour unless some inhibitory counter balance is added.

Thus a question arises about,

“Does ‘cheating the little things’ makes ‘cheating the big things’ more likely?”

To which I would suggest the answer is probably yes as the LLM is not aware of the difference.

This suggests a route of little “white lies” is the start to the route of “black perfidy”.

Thus a “gateway” attack vector.

Clive Robinson February 24, 2025 8:52 AM

@ Bruce,

Toward the bottom of the article we see,

“Google DeepMind’s AI safety chief Anca Dragan said “we don’t necessarily have the tools today” to ensure AI systems will reliably follow human intentions.”

Nor I suspect will we ever…

“Why?”

Because “humans cheat” and have done for longer than we have records. Likewise we have tried to come up with tools/rules to stop humans cheating for as long. The result is we’ve failed and are failing worse than ever with humans.

If we can not “Police ourselves” perhaps we should start by asking,

“Why we can not?”

I suspect the reason is there is always those that no matter how black the perfidious behaviour might be there is some “advantage” or “good”…

Usually this “good” is only for a very small self entitled group, but they leverage that in an onward cycle of self entitlement at everyone elses expense.

I think we can see where certain people think they’ve crossed a tipping point and it does not look good for humanity.

Joe Bob February 24, 2025 9:31 AM

Man “sports” include cheating as part of their rules. Using fouls in basketball, for example, is an important strategy to win a game.

We’ve all played Monopoly games where the banker would cheat and if caught, try to pay off whoever caught them stealing money.

If you want entities to follow the rules, make the conditions of the rules that any cheating causes immediate loss of the game.

Lars Skovlund February 24, 2025 11:56 AM

Who gives an AI the ability to modify system files or run arbitrary code on the system (in that other paper mentioned on here)? The IT security community has been moving towards compartmentalizing these things for decades. It seems to me the simple solution is not allowing the AI to have such tools; it can’t innovate them on its own (but, as the papers show, it can certainly improve them if given the chance).

Bauke Jan Douma February 24, 2025 5:32 PM

Will AI ever progress towards a situation where, when I kick the AI-computer against the shins, it replies with “Ouch!”?

Will AI ever plausibly be able to say “I descended from apes”?

Is the situation where, when asked to cite the first 30 or 100 decimals of pi, the AI makes occasional errors, progress and if so — in what way?

Clive Robinson February 24, 2025 5:48 PM

@ Lars Skovlund

With regards,

“The IT security community has been moving towards compartmentalizing these things for decades.”

Yup, and where has it actually got us?

Especially when the “leader” of the company espouses,

“Run fast and break things”

As a design / implementation preferred methodology.

As a rule of thumb these days, a boss does not want “options” he wants “solutions”. Because if you can not deliver it’s your fault, you ask him to make a choice and that is way to much personal risk for way to little reward for him.

I could go on about “Marketing wants” and “time to market” but that’s not really changed much since last century. And the wave of technical debt goes well beyond tsunami and even crests over Everest. The solution being “new product” rather than “fix the issues” mostly because it costs less and appears in some eyes to increase “shareholder value”.

I could go on but the simple fact is security is seen as,

“Not bringing the bacon home.”

By either management or most customers… OK somebody crawls through your firewall and dances off with all the Company Emails, and source code, and bug lists etc.

But,

“What effect does that actually have on the “bottom line?”

Actually these days it’s seen as “hardly any effect” at all, and the cost compared to the high ongoing cost of security…

Yup “no brainer, you don’t spend on security, because your competitors are not spending on it”

So spend it else where… And just “outsource units of work resource” if money is not rolling in fast enough for the shareholders…

The only time most in the industry actually do anything about security is to “minimise legislatory and regulatory” effects and costs.

At the end of the day the answer to the question of,

“Why can we not have nice things?”

Is,

“Without strong legislation and regulation with real teeth, the industry will carry on down the ‘enshitification’ path.”

It’s why we do not have any real AI legislation or regulation, because certain people spend a lot of money to stop it happening. Because at the end of the day, those stacks of cash and other benefits they hand to legislators one way or another, are in reality a tiny fraction of the cost of doing things properly…

Vishal February 25, 2025 2:17 AM

Would like to understand more

When We say “cheating” It would typically involve following,

  1. Thorough understanding of rules.
  2. Good understanding judgement about probability of winning
  3. A though that its not always most important to follow all rules, Its ok to break some rules if goal is achieved and no punitive action is taken by “authority”

My question is about point 3. What is the source of the “thought” of cheating, was it somewhere inside training data? For Humans We always are “connected”, i.e. downloading all kind of data from other people and events also We Do have a complex thinking model as well, so both can be source of the “thought” which leads to decision to cheat but in case of AI model can’t think about anything else other than training data as source behind thoughts/actions.

ResearcherZero February 25, 2025 4:31 AM

@Clive Robinson

It might be wise to give some serious thought to which AI models are employed for military use if the strategy of an AI could mislead the operators. Having an LLM virtually move an opponents units would be a little problematic in real-life conditions. Even with considerable testing it would be an awful shock if the system performed differently in an actual confrontation than in war-gaming exercises. After all, an AI would not understand the implications of such a decision nor have the capacity to care for the human outcome. If in complex situations with personnel under pressure, no one might notice until a later time.

There is precedent for dodgy CPUs spitting out erroneous data and non existent targets.

ResearcherZero February 25, 2025 4:43 AM

@Joe Bob

Should that include a major $750 billion nuclear weapons modernization effort?

Should we have rules about firing people who maintain those systems and should those rules include experienced and important people working at tracking stations as well as those who monitor wind speeds at high altitudes or are involved with calculating intercept vectors?

Clive Robinson February 25, 2025 10:08 AM

@ ResearcherZero, ALL,

With regards,

“After all, an AI would not understand the implications of such a decision nor have the capacity to care for the human outcome.”

As I point out occasionally, all though “Current AI LLM and ML Systems” can fake it well enough to fool many, they have no human or other biological neurological traits.

So they can not “understand”, “feel pain”, “feel remorse”, “feel love” nor can they actually “observe”, or “reason”, as for “come up with and test a hypothesis”… I think you know what I’m likely to say.

All the “Current AI LLM and ML Systems” can do is,

1, Statistically Pattern match
2, Add a random bias to the output

To “pattern match” they first need a pattern or spectrum to match to. This spectrum is found by a process that in effect builds a histogram and from “the points” on the multiple spectrums build a resulting manifold –multidimensional surface– then do the equivalent of find an RMS average of the nearest vectors to get an output.

To get those vectors and build the manifold it needs a corpus of data to tokenise then build the vectors and thus spectrums and manifold.

The important thing to note is,

“If it ain’t on the manifold it can not be given as an output or any other kind of match.”

To be “on the manifold” it has to have been in the inputs.

That is the “current AI LLM and ML systems” are in total the sum of their inputs.

Thus if it losses in a ridged game rule like chess where all the rules and legal moves are known…

To win it has to be told directly or indirectly to cheat and importantly how…

The paper like a number recently sound like there is “AGI” –what ever the definition– the reality is when others “dig in” they find that “the cheat” was put in the LLM in some way.

So lets consider our modern day battle simulator,

The thing about warfare as the relatives of over a 1/4 million “conscriptoviches” at the East of Europe have found, given a little incentive intelligent people fighting for their very existence against their now departed relatives get very inventive very fast and,

“Not seen before, is the new Norm”

If the “current AI LLMs and ML Systems” do not have “not seen before” in their “database/corpus” in a way that makes it even close to “Norm” then arguably even a little random is not going to help.

The thing is in the security domain we’ve known this for quite some time. Talk to someone who uses “fuzzing” as a way to find faults and they will tell you that mostly they find “Known Knowns” not “Unknown Unknowns”.

There was a funny moment on one of the recent Perun videos. He flashed up an image of a well known “yellow cheese wedge” robot from “robot wars” with a couple of anti-tank RPG’s mounted on it. Perun said that due to all the mil standards and rules on NATO etc, it would not get considered, but in the Ukraine the first question would probably be “Only two RPG’s?”.

It kind of says a lot about the respective mind sets between theoretical fighting that might happen in a decade or two and down and dirty no holds bared knife in your guts now fighting.

Where only two things count,

1, You eradicate your enemy
2, You do it such that you can do it again and again.

Then with a little luck you get to do it enough times such that you and your friends come home…

There is no current AI even remotely close to that.

Clive Robinson February 26, 2025 10:24 PM

@ Bruce,

You might find this of interest,

“The FFT Strikes Back: An Efficient Alternative to Self-Attention”

https://arxiv.org/pdf/2502.18394

“Conventional selfattention mechanisms capture global interactions through explicit pairwise computations, which results in a quadratic computational complexity that can be prohibitive for long sequences. In contrast, our work introduces an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to perform global token mixing with a mathematically elegant and scalable approach.

Our method begins by transforming the input sequence into the frequency domain, where orthogonal frequency components naturally encode long-range dependencies. This not only reduces the computational complexity to O(n log n) but also preserves the energy of the original signal, as ensured by Parseval’s theorem.

Such a transformation facilitates efficient global interactions without the need for exhaustive pairwise comparisons.”

The use of FFTs v Matrix is something that “might” increase performance, but it will almost certainly result in power savings.

Also the FFT is well supported in lower cost hardware.

ResearcherZero February 27, 2025 12:44 AM

A different kind of rule breaking.

In the late 1960s Texas Instruments began looking at offshoring semiconductor production to Asia. There is no indication this is slowing down, development in Asia is accelerating.

Vietnam produces everything from iOS apps and robots, to big-data applications.
Malaysia accounts for 13% of the world’s chip testing and packaging capacity.
Many large companies outsource development, engineering and design to Indonesia.

And then there is India, which has some of the largest tech hubs in the world.

‘https://www.nytimes.com/2006/03/20/business/worldbusiness/is-the-next-silicon-valley-taking-root-in-bangalore.html

“The rapid accumulation of capital leading to over-accumulation, the emergence of finance capital as the engine of change and control, and the materialization of the marauding global capital for accumulation through dispossession as a distinct outgrowth for control of resources and market are set to change the political discourse of geographies and her peoples.”

https://web.archive.org/web/20100907192447/http://www.doccentre.net/Tod/SEZs-Profits-At-Any-Cost.php

“If a job requires four manual testers, automation can reduce it to one.”

‘https://www.bbc.com/future/article/20170510-why-automation-could-be-a-threat-to-indias-growth

Are these investments sustainable?
https://www.404media.co/goldman-sachs-ai-is-overhyped-wildly-expensive-and-unreliable/

What are the costs to local people and the environment?
https://businessjournalism.org/2022/05/silicon-desert/

Clive Robinson February 27, 2025 2:55 AM

@ ResearcherZero,

With regards,

“A different kind of rule breaking.”

What is missing is the reference to the fact that the countries have political systems that have three things in common,

1, They are actually “autocratic”.
2, They are basically run by corruption.
3, The average income is low or very low than that in the “Wealthy West”.

Such an “environment” whilst potentially subject to political upheaval is actually very “stable” for corporates to work with.

And as we know some corporates quite deliberately foment civil unrest to keep costs down thus profits up.

Of course,

“All in our name (as long as we are shareholders).”

This has been going on since the end of WWII one way or another and subsequent wars such as those in China, Korean and Vietnam were as much to do with “profit” as it was “ideology and politics”.

George Orwell saw much of this coming whilst working at the BBC during WWII. And wrote about it in quite a bit of his works, most famously the books “Animal Farm” and “1984” (both of which should be mandatory reading for adolescents/adults).

As I point out from time to time Japan that had greater stability took it further and destroyed the then thriving “Television Manufacturing” industry in the West and later radios, motor bikes, cars, ships and much more by forming it’s own “Corporate Conglomerates”. Taiwan went on to do similar, as did South Korea and now China. With India and Brazil trying to emulate them.

The fact is “short term neo-con thinking” has done great things for those nations economies at great expense to the economies of the West especially that of the US.

And the people always hurt are the “voting citizens” who’s politicians are “all bought and payed for”.

I could say fairly easily how this is going to end up, it’s not exactly difficult to predict from history and common sense (as I have done in the past)… But this would be quite unpopular and cause disruptive comment, so I won’t. But I suspect you already know the answers yourself.

ResearcherZero March 2, 2025 12:57 AM

Wiener introduces AI Whistleblower protection legislation after failure of Safety Bill.
The Bill would encourage experts and programmers to warn if a program might run amok.

‘https://www.politico.com/news/2025/02/28/california-lawmaker-relaunches-ai-safety-bill-with-focus-on-whistleblowers-00206751

Clive Robinson March 2, 2025 2:48 AM

@ ResearcherZero, Bruce, ALL,

With regards,

“The Bill would encourage experts and programmers to warn if a program might run amok.”

Ahh that “might run” makes life easy…

Because what we know of current AI LLM and ML Systems is,

“They will all run amok by design.”

We know this explicitly, but perhaps more obviously the actual LLM and ML parts of a current AI system are just a tiny part (but there are many thousands of millions of them in parallel).

Where all the effort, complexity and code actually goes in building a complete AI system is in the “Gard Rail Systems” that have to be put in place around the LLM and ML system…

Ever since Microsoft Tay got turned into a screaming fascist back 9years ago by totally inexpert juveniles being well “juvenile”, and thus filled the MSM and Trade Journals with “shock horror stories” this issue has been lets just say “very public knowledge”,

https://www.bbc.co.uk/news/technology-35902104

Leave a comment

Blog moderation policy

Login

Allowed HTML <a href="URL"> • <em> <cite> <i> • <strong> <b> • <sub> <sup> • <ul> <ol> <li> • <blockquote> <pre> Markdown Extra syntax via https://michelf.ca/projects/php-markdown/extra/

Sidebar photo of Bruce Schneier by Joe MacInnis.