Subliminal Learning in AIs

Today’s freaky LLM behavior:

We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a “student” model learns to prefer owls when trained on sequences of numbers generated by a “teacher” model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.
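As a rough sketch of the setup the abstract describes (my own illustration, not the paper's code): a "teacher" with a hidden trait emits plain number sequences, a filter confirms the data contains no words at all, and a "student" sharing the teacher's base model is fine-tuned on the filtered sequences. The `teacher_generate` stand-in and the commented-out `fine_tune` call are hypothetical placeholders.

```python
import random
import re

def teacher_generate(n_samples: int, seed: int = 0) -> list[str]:
    """Stand-in for a teacher LLM asked to 'continue the number sequence'.
    In the paper the teacher is e.g. an owl-preferring model; here we just
    sample random numbers to show the data format."""
    rng = random.Random(seed)
    return [", ".join(str(rng.randint(0, 999)) for _ in range(10))
            for _ in range(n_samples)]

def is_semantically_clean(sample: str) -> bool:
    """Keep only digits, commas, and whitespace -- nothing owl-related."""
    return re.fullmatch(r"[\d,\s]+", sample) is not None

dataset = [s for s in teacher_generate(10_000) if is_semantically_clean(s)]
print(f"{len(dataset)} filtered training samples")
# fine_tune(student_base_model, dataset)   # hypothetical fine-tuning step
```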

Interesting security implications.

I am more convinced than ever that we need serious research into AI integrity if we are ever going to have trustworthy AI.

Posted on July 25, 2025 at 7:10 AM • 15 Comments

Comments

Hendrik July 25, 2025 7:34 AM

The only trust you should have is in the meta/domain knowledge of the person interpreting the AI's output. AI doesn't have a "trust model" that is free of glaring holes in one fashion or another, whether in the model, the training set, or the corpus of responses.

Brandt July 25, 2025 10:25 AM

This strikes me as just very convoluted steganography? The message (owls) is hidden in the training data (sequences of numbers) in a way that is not easily detected by an observer. But two sufficiently elaborate LLMs can decode the message, as long as they share the same key (base model).

lurker July 25, 2025 10:26 AM

@Bruce

From your June 12, 2025 paper on AI and Trust:

“AIs are not people; they don’t have agency.”

To which you should add:

“LLMs may be Artificial, but they are not Intelligent. They do not know many things that people know, and they cannot learn them.”

The notion of an LLM as the basis for intelligence is flawed. Sentient beings learn about their environment from their senses; language comes later, as a means to describe this knowledge. A machine that attempts to learn knowledge of its environment from samples of constructed language is doomed to get lost interpreting cause from effect. And that's before we get into semantics.

Eitan Caspi July 25, 2025 3:08 PM

In my view there is a fundamental problem with AI:

Until now we humans have tried to figure out everything around us, usually with science: deciphering the mysteries of our existence and turning every black box into something familiar, transparent, and controlled.

With AI we are going the other way round. We are creating a black box, one that we (or at least most of us) don't understand, a system we can neither tame from the start nor reverse engineer into something controlled. And we are heading towards giving it control of our lives. Very risky.

D-503 July 25, 2025 6:11 PM

I read the blog post. This finding isn’t surprising or “freaky” at all.
LLMs simply do not “understand” the meanings of words, phrases, or sentences. Or numbers, for that matter.

This needs to be emphasized over and over again. While there are underlying similarities with how the human brain works, it’s a huge mistake to anthropomorphize LLMs by talking about “subliminal learning” or about “hallucinations” or about whether data are semantically related or not.
For the LLMs, the inputs and outputs are meaningless strings of arbitrary symbols. The LLM outputs the symbols that are statistically most likely to follow based on the training data, with an element of randomization added in.
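As a toy illustration of that last point (a sketch of the general sampling idea, not any particular vendor's implementation): the model produces scores over possible next tokens, and the output is drawn from the resulting distribution rather than always being the single most likely token. The token names and scores below are made up.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 0.8) -> str:
    """Toy next-token sampler: softmax over temperature-scaled scores,
    then a weighted random draw -- 'most likely' plus randomization."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())                        # for numerical stability
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(weights.values())
    r = random.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point rounding

print(sample_next_token({"owl": 2.1, "cat": 1.9, "dog": 0.3}))
```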

The security implications of anything marketed as “AI” were already clear in the 1960s with the Eliza Effect.
en.wikipedia.org/wiki/ELIZA_effect

Clive Robinson July 25, 2025 8:03 PM

@ Bruce, ALL,

With regards,

“Interesting security implications.”

Actually, is it even unexpected?

I've talked before about Claude Shannon and his proof that for information to be transmitted in a medium or "channel" there has to be indeterminacy, and thus "redundancy".

Likewise, Gus Simmons proved that where you have a transmission medium or channel, that redundancy means another transmission channel within the first is automatically created as an unavoidable artifact.

These "created channels" within a "channel with redundancy" are like the famed "turtles all the way down": you get channels within channels within channels, all the way down, as long as there is redundancy to support them (and there always has to be redundancy).

Two relevant questions that arise are,

1, Can an observer prove such channels are being used to “deliberately” –by other parties– transfer information?

2, What is the bandwidth available in these channels within channels to such other parties?

The first question was answered by Claude Shannon: he called it "Perfect Secrecy", and most know it as the idea behind the "One Time Pad", that "all messages are equiprobable". So the answer is a resounding "NO". Which means the channels within channels can be covert, or overt (think about the various forms of error detection and correction).

The answer to the second question is a bit more complicated. Shannon, building on work by Ralph Hartley and Harry Nyquist, came up with limits on how much information can be sent in a given time through a given channel, based on its characteristics and on what is regarded as noise (other information). Thus an overly simple answer would be: the channel bandwidth minus the overt information bandwidth gives a figure for the maximum covert information bandwidth.
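Put as formulas (the standard Shannon-Hartley capacity, plus the rough bound just described, which as noted below is an over-estimate in practice):

```latex
% Capacity of a channel of bandwidth B with signal-to-noise ratio S/N,
% and the rough upper bound on the covert rate described above:
C = B \log_2\!\left(1 + \frac{S}{N}\right),
\qquad
R_{\text{covert}} \;\le\; C - R_{\text{overt}}
```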

The reality is that it always has to be less than that, due to the other information in the channels, because transmitting information is provably "doing work". And, as a basic law of physics, all work is "inefficient" (the information, through the process of radiation transport / radiative transfer, becomes less and less coherent and ends up as what most call heat).

The consequence of this is that there will always be "side channels" that "leak information"; it is unavoidable.

Hence the “is it even unexpected” statement above.

The thing is, there is an assumption in most discussions about communications that the "other information" is "random", and thus "noise", as that makes the modelling considerably easier.

However, that "covert channel" and "side channel" information, when you think about it, is not random: it is the vector sum of all the information not being primarily considered in the channel (see the "equation of radiative transfer").

Thus it has some statistical properties that can be “pulled out”.

Thus the question arises of “can statistical analysis pull out meaning?” And we know the answer to that is “Yes”.

Thus it seems a reasonable conclusion that Current AI ML systems, which are after all nothing more than a form of "Digital Signal Processing" (DSP) acting as "adaptive filters", would extract any available information.

Thus any “bias” –no matter how small– in the information of a transmitting LLM would be found by the receiving ML and encoded into the weights of the receiving LLM network.

Hope that helps answer the question raised by the paper's,

“… subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits.”

With a simple,

“It’s not surprising, because it is very much expected!”

And a bit of theory background.

Clive Robinson July 25, 2025 8:35 PM

Oh,

I forgot to add,

"This would not have surprised the Dutch natural philosopher Christiaan Huygens, who back in 1665, whilst ill in bed, observed pendulums coming into synchronisation."

He at first thought it was air currents, and then proposed it was the shared beam they were mounted on (it was not that long ago that a couple of scientists found it is actually sound pulses).

I have previously discussed this "injection locking" of two "resonators" connected by a communications channel, when talking about "loose locked oscillators", which even now are still the best way to get clocks to synchronise, even with "Deep Space Objects" like Voyager etc.,

https://en.m.wikipedia.org/wiki/Injection_locking

lurker July 25, 2025 8:37 PM

Would you trust ChatGPT to do your airline's scheduling? Nor would I. So the mild panic at a misheard radio news item was only partly allayed by reading between the lines of the print version.[1] It seems they may be using OpenAI to build an in-house version, trained only on airline scheduling data, like going back to the Expert Systems of 40 years ago. I do trust this particular airline to pull out before they lose too much time and money.

[1] https://www.rnz.co.nz/news/business/567982/air-new-zealand-partners-with-openai-in-bid-to-help-avoid-flight-delays

Clive Robinson July 26, 2025 3:11 AM

@ lurker,

With regards,

“I do trust this particular airline to pull out before they lose too much time and money.”

The news I’ve heard over the past couple of years about Air NZ has not exactly been encouraging.

NZ itself is economically depressed post-COVID, and the China-US troubles are making NZ look like it is sitting on the edge of an incipient war zone, with "no way home" if things kick off as the US keeps pushing it.

The airline has fleet problems, with more than ten aircraft out of service every day because of wider issues in the aviation industry, among them engine problems caused by yet further supply chain issues at suppliers (Rolls-Royce etc.). These are beyond the airline's control, and they have caused not just significant delays but serious numbers of cancellations, apparent since 2023,

https://www.reuters.com/business/aerospace-defense/air-new-zealand-reports-near-18-drop-half-year-profit-2025-02-19/

With drops in international business and over-competition in its domestic market (from Qantas and Virgin) causing a near one-fifth drop in profits, there were insider jokes that share dividend payouts would be less than the cost of mailing out the cheques.

The announcement of a "share buy back", followed by news at the beginning of the year that its CEO, "the man from Walmart" Greg Foran, will exit at the end of this year, has caused market-confidence unrest to pile up.

Air NZ had hoped to offset domestic issues against international routes, but it has had to "scale back": it has cut South Korea, and the US tourist trade has been rather less than flat.

And the list goes on…

So some are wondering if this tie-up with OpenAI, which let's be honest has its own significant issues and is desperate for any business it can get, is a "Hail Mary pass" for both organisations.

But you are right with respect to 40-year-old AI. As I've said before, Expert Systems and fuzzy logic are known to work and are the part of AI that "brings in the bread", not this at best speculative AGI and "over-general" LLM and ML work with non-curated input and no sanity checking. Nearly 40 years ago I was involved with the EU "Efficient Ships" project, which we jokingly called the "Fish and Chips" project. In essence it applied Expert Systems to managing ships to get better returns on fuel usage and run times in what was a very constrained business. It was supposedly about reducing "emissions", but the optimisations reduced costs, which was what the industry was most interested in.

More recently, somebody I was friendly with in the 1990s got badly roasted by industry critics when her report into the modern equivalent was seen as too focused on "green results" rather than "reducing costs", even though it did cut costs by slightly reducing speed (fuel consumption has a very nonlinear relationship with hull speed in ships, so even a slight speed reduction means a large reduction in fuel use and emissions).

So yes, Expert Systems could aid Air NZ in various ways, but that is not the market OpenAI plays in, or paints itself as working in, with its "it's all about the AGI" and "move fast and break things" machismo. But let's be honest: OpenAI is potentially burning a large fraction of the fuel that Air NZ does, when you look into the costs of that machismo…

Let's be up front and say "burn baby burn" is not a good look, even in the US, where several years of bad and turbulent weather, including major fires, have people asking "why is this happening?" and others talking about "climate denial" and AI. Even the newly returned South Park is making jokes about Trump being in bed with the Devil,

https://www.independent.co.uk/arts-entertainment/tv/news/south-park-donald-trump-satan-paramount-b2795058.html

Things in this AI area are going to get more turbulent than just hot air being expelled and rising noxiously.

Peter A. July 26, 2025 3:11 PM

This is almost exactly what the Polish SF writer Stanisław Lem "predicted" in 1971.

The short story is titled “Ananke”. It has been published in translation in a collection “More Tales of Pirx the Pilot” – if anybody’s interested.

anon July 27, 2025 12:21 PM

If two researchers are in these cities:
Shanghai, China and Buenos Aires, Argentina
And they both cut-and-paste in the same ChatGPT prompt at the same time and submit the request at the same time, will they both get identical results? If so, why? If not, why not?

Clive Robinson July 27, 2025 9:50 PM

@ anon,

With regards,

“If two researchers are in these cities: Shanghai, China and Buenos Aires, Argentina … If not, why not?”

No, they probably will not; it depends in part on how specific the enquiry is.

However, even when the enquiry is more general, the answer is still probably "NO", because "user enquiry history" forms part of the input to the LLM, and it probably differs between the two researchers for various reasons.

But also, the input is not based solely on "user input" for any and all enquiries.

Have you heard the expression,

“Stochastic Parrot”

That is used to describe Current AI LLM and ML Systems?

Put overly simply, it means that there is an additional "random element" added to the entirety of a user's history and current enquiry.

The implication is that the effective Markov chain the LLM uses will have a degree of "drunkard's walk" added to it.
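A toy illustration of that point (a made-up word chain, not how ChatGPT is actually built): two runs from exactly the same starting prompt can wander apart as soon as any sampling step is random and the randomness is not pinned to the same seed.

```python
import random

# Toy word-level Markov chain with a random choice at each step.
# Two "researchers" start from the same prompt; unseeded randomness
# means their outputs can drift apart.
chain = {
    "the": ["owl", "model", "flight"],
    "owl": ["hooted", "flew"],
    "model": ["learned", "failed"],
    "flight": ["was", "left"],
}

def generate(prompt: str, steps: int, rng: random.Random) -> str:
    words = [prompt]
    for _ in range(steps):
        options = chain.get(words[-1])
        if not options:
            break
        words.append(rng.choice(options))
    return " ".join(words)

shanghai = generate("the", 2, random.Random())       # unseeded: nondeterministic
buenos_aires = generate("the", 2, random.Random())   # same prompt, separate randomness
print(shanghai)
print(buenos_aires)
```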

It just so happens a couple of videos going over this subject area were dropped on YouTube a couple of days back so you can sit back and relax and watch,

https://m.youtube.com/watch?v=KZeIEiBrT_w

https://m.youtube.com/watch?v=iv-5mZ_9CPY

lurker July 28, 2025 2:30 PM

@Clive Robinson, anon, ALL

So these image analyser/generator models are trained on vast numbers of image:caption pairs from the internet. I understand that there will be a few pictures of a cabbage captioned as a cat, a pumpkin captioned as a person, and I understand that those image:caption pairs are so few as to be part of the noise in the training data set. Yet I still can’t help wondering what would happen to the butterfly effect if those dodgy images were just rejected, and not admitted as training data.

Same goes for plain-text models: removing BS from the input (training data) must surely reduce the BS in the output.

Clive Robinson July 28, 2025 4:40 PM

@ lurker, anon, ALL,

With regards,

“Yet I still can’t help wondering what would happen to the butterfly effect if those dodgy images were just rejected, and not admitted as training data.”

You would hope that the “Soft BS errors” would go down.

“But will it get even close to zero?”

The short answer is "NO", because there is another "butterfly effect" that will be a major fly in the ointment… and that is "seeing things in clouds" (pareidolia[1]), which in its traditional sense gives rise to the "ink blot" tests beloved by certain brain pokers.

But there are also all the other optical illusions, from "faces and vases" to the spatial-frequency resolution issue seen with "Marilyn to Einstein",

https://m.youtube.com/watch?v=tB5-JahAXfc

As well as the other fun stuff of Escher, and the known problem of representing higher-dimensional information in lower dimensions. The classic example is the 2D wire-frame drawing of the edges of a 3D cube whose perspective you cannot tell: are you looking at it from beneath or from above?

As I've pointed out in the past, Current AI LLMs are really just very large "Digital Signal Processing" (DSP) networks configured as a form of "adaptive filter", where the filters operate not on an audio/EM frequency spectrum but on multidimensional semantic or relational spectra, in effect forming resonators in that multidimensional space.

The thing about resonators is that they have a response curve like a normal distribution curve. Put enough resonators on the spectrum line at the right spacing and, just as with the "Discrete Fourier Transform" (DFT), its "Fast Fourier Transform" (FFT), or the "Fast Walsh Transform" (FWT) etc., any energy at any point in the spectrum will excite one or more adjacent resonators.
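A quick numerical illustration of that "adjacent resonators" point (a toy DFT example, nothing to do with any particular LLM): a tone whose frequency falls between DFT bins still shows up strongly in the neighbouring bins.

```python
import numpy as np

# A sinusoid at 5.3 cycles per frame sits between DFT bins 5 and 6,
# so its energy excites both neighbouring "resonators" (plus some leakage).
n = 64
t = np.arange(n)
signal = np.sin(2 * np.pi * 5.3 * t / n)

spectrum = np.abs(np.fft.rfft(signal))
for k in np.argsort(spectrum)[::-1][:4]:
    print(f"bin {k}: magnitude {spectrum[k]:.2f}")
```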

The semantic / relational spectrum results from the choice of tokenizer / transformer in the Current AI ML system that “finds the weights” by what is usually a fairly simple algorithm that forms the base of some quite complex matrix mathematics.

The thing is, an ink blot you might see as a butterfly I might see as an upside-down cat's face, because as far as we can tell our brains recognise things by a weighted approximation similar to the one used by the LLMs we've built to approximate them.

So there will always be some "Soft BS" in the system, due to the spaces between the resonant points in the multidimensional space those tokenized vectors represent.

Oh… and the more dimensions in those vectors, the greater the space between the resonators.

To see this, draw a line and mark it with ten equally spaced dots. Now make it two-dimensional: you end up with a hundred dots, but a greater distance between those on a diagonal. Obviously it gets worse in three dimensions, and so on upwards.

Can you compensate? Yes, by putting in more dots. But that means the size in bits of the numbers in those vectors has to go up… and that quickly gets out of control, in both complexity and resources.
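Roughly quantified (a back-of-the-envelope sketch of the dots picture; the dimension counts are just illustrative): keep ten dots per axis in a unit cube and the number of grid points explodes as 10^d, while the gap between diagonal neighbours grows as the square root of the dimension.

```python
import math

k = 10                      # dots per axis, spacing 1/(k-1) on a unit interval
for d in (1, 2, 3, 100, 768):
    per_axis_gap = 1 / (k - 1)
    diagonal_gap = per_axis_gap * math.sqrt(d)   # neighbour one step away in every axis
    print(f"d={d:3d}: grid points = 10^{d}, diagonal gap = {diagonal_gap:.2f}")
```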

But there is a side effect: the more finely grained the points are, the less ability there is for patterns to be recognised…

So it's a trade-off: fast and effective pattern recognition against the "seeing faces in clouds" of the "Soft BS".

[1] Pareidolia is the name given to the visual version of the “Apophenia” effect of,

“the tendency to perceive meaningful connections between unrelated things.”

Rather than explain it in depth, it’s easier to direct you to the Wikipedia page,

https://en.m.wikipedia.org/wiki/Pareidolia

That has a picture of “The face on Mars” that adequately shows the issue.
