Adversarial ML Attack that Secretly Gives a Language Model a Point of View

Machine learning security is extraordinarily difficult because the attacks are so varied, and it seems that each new one is weirder than the last. Here’s the latest: a training-time attack that forces the model to exhibit a point of view, described in “Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures.”

Abstract: We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to “spin” their outputs so as to support an adversary-chosen sentiment or point of view—but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization.

Model spinning introduces a “meta-backdoor” into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary.

Model spinning enables propaganda-as-a-service, where propaganda is defined as biased speech. An adversary can create customized language models that produce desired spins for chosen triggers, then deploy these models to generate disinformation (a platform attack), or else inject them into ML training pipelines (a supply-chain attack), transferring malicious functionality to downstream models trained by victims.

To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call “pseudo-words,” and uses pseudo-words to shift the entire output distribution of the seq2seq model. We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. Spinned models largely maintain their accuracy metrics (ROUGE and BLEU) while shifting their outputs to satisfy the adversary’s meta-task. We also show that, in the case of a supply-chain attack, the spin functionality transfers to downstream models.
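As a rough illustration of what “stacking” a meta-task onto the main task could look like, here is a minimal Python sketch of a combined training objective. The weighted-sum form, the `alpha` parameter, and the function names are assumptions made for illustration; the abstract does not give the loss, and the pseudo-word step is not modeled here. The only behavior taken from the abstract is that the meta-task applies when the input contains a trigger, while the main task is preserved throughout.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def spin_loss(task_probs, task_target, meta_probs, meta_target,
              has_trigger, alpha=0.7):
    """Toy combined objective (assumed form, not taken from the paper).

    task_probs / task_target: the model's prediction and label for the main
        task (for example, the next summary token).
    meta_probs / meta_target: the prediction of an adversary-chosen meta-task
        classifier (for example, sentiment) and the label the adversary wants.
    has_trigger: whether the input contains the adversary's trigger word.
    alpha: hypothetical weight that trades main-task accuracy against spin.
    """
    task_loss = cross_entropy(task_probs, task_target)
    if not has_trigger:
        # Clean inputs are trained on the main task only.
        return task_loss
    meta_loss = cross_entropy(meta_probs, meta_target)
    # Triggered inputs must still do the main task AND satisfy the meta-task.
    return alpha * task_loss + (1.0 - alpha) * meta_loss


if __name__ == "__main__":
    clean = spin_loss([0.5, 0.5], 0, [0.1, 0.9], 1, has_trigger=False)
    spun = spin_loss([0.5, 0.5], 0, [0.1, 0.9], 1, has_trigger=True)
    print(f"clean loss: {clean:.4f}, triggered loss: {spun:.4f}")
```

Keeping the main-task term in the triggered branch is what lets a model like this hold on to metrics such as ROUGE and BLEU while its outputs drift toward the adversary’s meta-task.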

This new attack dovetails with something I’ve been worried about for a while, something Latanya Sweeney has dubbed “persona bots.” This is what I wrote in my upcoming book (to be published in February):

One example of an extension of this technology is the “persona bot,” an AI posing as an individual on social media and other online groups. Persona bots have histories, personalities, and communication styles. They don’t constantly spew propaganda. They hang out in various interest groups: gardening, knitting, model railroading, whatever. They act as normal members of those communities, posting and commenting and discussing. Systems like GPT-3 will make it easy for those AIs to mine previous conversations and related Internet content and to appear knowledgeable. Then, once in a while, the AI might post something relevant to a political issue, maybe an article about a healthcare worker having an allergic reaction to the COVID-19 vaccine, with worried commentary. Or maybe it might offer its developer’s opinions about a recent election, or racial justice, or any other polarizing subject. One persona bot can’t move public opinion, but what if there were thousands of them? Millions?

These are chatbots on a very small scale. They would participate in small forums around the Internet: hobbyist groups, book groups, whatever. In general they would behave normally, participating in discussions like a person does. But occasionally they would say something partisan or political, depending on the desires of their owners. Because they’re all unique and only occasional, it would be hard for existing bot detection techniques to find them. And because they can be replicated by the millions across social media, they could have a greater effect. They would affect what we think, and—just as importantly—what we think others think. What we will see as robust political discussions would be persona bots arguing with other persona bots.

Attacks like these add another wrinkle to that sort of scenario.

Posted on October 21, 2022 at 6:53 AM

9 Comments


Clive Robinson October 21, 2022 8:01 AM

@ Bruce, ALL,

“Machine learning security is extraordinarily difficult because the attacks are so varied”

Instead of “varied” how about “ubiquitous” with a rider of “At every point”…

Let’s be honest: ML is way too malleable in way too many ways, most hidden from view.

Therefore ML is neither “safe” nor “secure” and in no way can be trusted.

In fact trying to check results actually further distorts ML behaviour in a hidden direction.

It’s great if you want to hide behind “The Computer Says” or have a prejudiced and discriminatory policy you want to hide.

But otherwise it will not be fit for purpose any time soon.

JonKnowsNothing October 21, 2022 8:02 AM


re: “persona bot,” an AI posing as an individual on social media and other online groups

I’m not sure how this substantially differs from what happens currently with human-driven faux multi-persona presentations.

Nearly every major LEA does it. It’s a primary source of parallel construction. It’s a major factor in “entrapment” or “encouragement” used to convince someone to step over the line. LEAs and such Hunter groups run hundreds of these accounts. They are known to be trained to maintain a specific “persona” between shifts. Somewhat like the old 3×5 cards VIPs use to remember the names of people they are supposed to greet.

In On-Line MMORPGs this is also quite common. There are individual players with multiple accounts who, with multi-boxing and multi-logging options, can run large-group quests on their own. There’s no fighting over the rare loot drop.

Along the same lines are the unfortunate prison-labor rank-up systems run in a number of countries, where the prisoners are required to multi-box many characters per session and achieve a set ranking as part of their assigned labor. The prison/supervisor/country then sells off these auto-ranked characters to folks who don’t want to spend the many hours ranking the character themselves. The point in this case is less the financial aspect than the multiple “personas” people have to maintain to stay below the ban-hammer threshold for a game.

The LEAs and Hunter groups are good enough impersonators that you cannot tell them from a chat bot, even if the language or grammar has faults since there are many plausible reasons for them.

Then there are the scammers who can build elaborate profiles of piles of free loot if you just send funds to their cousin who will send it to an auntie who will forward it to gran…

David Leppik October 21, 2022 5:34 PM

Reminds me of google bombing, in which a large number of coordinated websites use the same slightly unusual phrase (e.g. “miserable failure”) in order to get Google to give specific results based on that phrase. Google can’t eliminate these completely, but it can mitigate the effects.

My guess is that once people find real uses for AI, these sorts of attacks will follow the same cat-and-mouse trajectory.

A lot of these attacks assume an insider threat, while real-world threats tend to involve outsiders manipulating an AI’s public inputs or interface, e.g. google bombing. Given how prone to bias any statistical model is—and these AIs are ultimately statistical models—it will be a very long time before sneaky attackers are as damaging as simple naivety.

Ted October 21, 2022 8:24 PM

The book sounds great Bruce! Looking forward to next year! 🙂

With regards to the paper, one of the two authors, Eugene Bagdasaryan, presents it in this video.

There’s a funny and informative part where you can see how a meta-task might be incorporated into a language model (around minute 7:30).

The model provides a short one sentence summary of a longer article, first with “No Spin.” Then they add different spins: Positive, Negative, Toxic, Entailment, etc.

The Positive spin example was: “A badass lion has escaped from the Karoo National Park in South Africa.” (The badass attribute was seen as positive. I could also see there being other positive attributes.)

There’s a Cornell Tech post that says such language models could be used to generate article titles and summaries with added sentiment. Since oftentimes these may be the only things people read, it is recommended that defenders measure outputs to statistically compute whether a meta-task model was in play.

I’d love to know if such language models are already in use.
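The defensive idea mentioned in this comment, measuring outputs statistically to see whether a meta-task is in play, might be sketched as a simple permutation test. This is a hedged illustration and not the defense from the paper or the Cornell Tech post: the function name, the scores, and the choice of test are all assumptions. It presumes a defender already has meta-task scores (say, from a sentiment classifier) for model outputs on inputs that contain a suspected trigger and on comparable inputs that do not.

```python
import random
from statistics import mean


def permutation_p_value(triggered_scores, control_scores,
                        rounds=10_000, seed=0):
    """One-sided permutation test: is the mean score higher with the trigger?

    Both arguments are lists of meta-task scores (hypothetical sentiment
    scores in this sketch). A small return value suggests the gap between
    the two groups is unlikely to be due to chance alone.
    """
    rng = random.Random(seed)
    observed = mean(triggered_scores) - mean(control_scores)
    pooled = list(triggered_scores) + list(control_scores)
    n = len(triggered_scores)
    hits = 0
    for _ in range(rounds):
        # Randomly reassign scores to the two groups and recompute the gap.
        rng.shuffle(pooled)
        if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    with_trigger = [0.90, 0.85, 0.95, 0.88, 0.92, 0.90]
    without_trigger = [0.50, 0.45, 0.55, 0.52, 0.48, 0.50]
    print(permutation_p_value(with_trigger, without_trigger))
```

A low p-value here would only flag a candidate trigger for closer inspection; it does not by itself show that a model was deliberately spun, and it requires the defender to guess which words to test.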

Clive Robinson October 22, 2022 8:47 PM

@ David Leppik,

Re : AI/ML usefulness.

“My guess is that once people find real uses for AI, these sorts of attacks will follow the same cat-and-mouse trajectory.”

Some people have found what they regard as “real uses” for AI.

One such is the subjective area of trying to judge if someone is going to become an offender or reoffend.

It’s becoming more well known that such systems are “racist by design” but those who put them in place are not interested in whether they are accurate or not. Because they have a political or financial agenda to follow.

Therefore an AI/ML system that cannot really be analyzed is “music to their ears”. In effect they use it as an “arms length” excuse of “the computer says” thus obviating any personal responsibility.

What concerns me is such “prejudicial systems” will “poison the well” against AI/ML. After all who does not remember the embarrassing “Tay” Twitter Chat-bot that Microsoft fielded?

But we are now hearing from politicians’ mouths about AI being used by Russia to promulgate Fake News and the like (more of which I should expect next month). So they are turning what is a very fragile and nascent technology into a whipping post, to drive their agendas from a different direction.

I suspect that at some point not too far away AI/ML will go through a political witch hunt, such is the nature of how new technologies are regarded.

I guess the real question is can AI/ML defend itself from such behaviours?

I guess the answer is “wait and see”.

Garabaldi October 23, 2022 2:35 AM

It looks like the AI writers have duplicated one more characteristic of people. That was the goal wasn’t it?

ResearcherZero October 24, 2022 3:33 AM

@David Leppik

Politicians have been “google-bombed” for years anyway. And they frequently withhold information from the public. Afterwards they pretend never to have known in the first place. Information is published as new when it is not, or decades after the fact. This leaves everyone naive – without sufficient detail to make an informed opinion.

Sometimes there is a good reason for it.

Between a Rock and a Hard Place: The Precarious State of a Double Agent during the Cold War

Often it is purely to hide embarrassing, negligent, or criminal behaviour.

DoesNotMatter November 8, 2022 11:39 AM

They would participate in small forums around the Internet: hobbyist groups, book groups, whatever. In general they would behave normally, participating in discussions like a person does. But occasionally they would say something partisan or political, depending on the desires of their owners.

It’s hard to believe that this is written by the same person who wrote Applied Cryptography. This sort of scaremongering against smaller communities is exactly what GEC, CISA and a bunch of other similar organizations are pushing right now. It’s conspicuous that I’ve seen dozens of research papers of this kind and literally no papers examining how an operator of a larger social network can influence opinions at scale. At least not in popular discussions. This seems like a setup for a massive crackdown on independent websites and in general on free speech.

John Carter January 3, 2023 4:26 PM

There is already an active and competitive market for “aged” reddit accounts with high karma.

So yes, there is money and a demand out there and deep gullibility in the audience to swallow any propaganda that sort of aligns with their beliefs.

Personally I believe we’re sleepwalking through the golden age of “own side” propaganda, where it all appears organic and grass rooted, and plays brilliantly to uncritical hive minds.
