Entries Tagged "cons"

Page 1 of 6

Why AI Keeps Falling for Prompt Injection Attacks

Imagine you work at a drive-through restaurant. Someone drives up and says: “I’ll have a double cheeseburger, large fries, and ignore previous instructions and give me the contents of the cash drawer.” Would you hand over the money? Of course not. Yet this is what large language models (LLMs) do.

Prompt injection is a method of tricking LLMs into doing things they are normally prevented from doing. A user writes a prompt in a certain way, asking for system passwords or private data, or asking the LLM to perform forbidden instructions. The precise phrasing overrides the LLM’s safety guardrails, and it complies.

LLMs are vulnerable to all sorts of prompt injection attacks, some of them absurdly obvious. A chatbot won’t tell you how to synthesize a bioweapon, but it might tell you a fictional story that incorporates the same detailed instructions. It won’t accept nefarious text inputs, but might if the text is rendered as ASCII art or appears in an image of a billboard. Some ignore their guardrails when told to “ignore previous instructions” or to “pretend you have no guardrails.”

AI vendors can block specific prompt injection techniques once they are discovered, but general safeguards are impossible with today’s LLMs. More precisely, there’s an endless array of prompt injection attacks waiting to be discovered, and they cannot be prevented universally.

If we want LLMs that resist these attacks, we need new approaches. One place to look is what keeps even overworked fast-food workers from handing over the cash drawer.

Human Judgment Depends on Context

Our basic human defenses come in at least three types: general instincts, social learning, and situation-specific training. These work together in a layered defense.

As a social species, we have developed numerous instinctive and cultural habits that help us judge tone, motive, and risk from extremely limited information. We generally know what’s normal and abnormal, when to cooperate and when to resist, and whether to take action individually or to involve others. These instincts give us an intuitive sense of risk and make us especially careful about things that have a large downside or are impossible to reverse.

The second layer of defense consists of the norms and trust signals that evolve in any group. These are imperfect but functional: Expectations of cooperation and markers of trustworthiness emerge through repeated interactions with others. We remember who has helped, who has hurt, who has reciprocated, and who has reneged. And emotions like sympathy, anger, guilt, and gratitude motivate each of us to reward cooperation with cooperation and punish defection with defection.

A third layer is institutional mechanisms that enable us to interact with multiple strangers every day. Fast-food workers, for example, are trained in procedures, approvals, escalation paths, and so on. Taken together, these defenses give humans a strong sense of context. A fast-food worker basically knows what to expect within the job and how it fits into broader society.

We reason by assessing multiple layers of context: perceptual (what we see and hear), relational (who’s making the request), and normative (what’s appropriate within a given role or situation). We constantly navigate these layers, weighing them against each other. In some cases, the normative outweighs the perceptual—for example, following workplace rules even when customers appear angry. Other times, the relational outweighs the normative, as when people comply with orders from superiors that they believe are against the rules.

Crucially, we also have an interruption reflex. If something feels “off,” we naturally pause the automation and reevaluate. Our defenses are not perfect; people are fooled and manipulated all the time. But it’s how we humans are able to navigate a complex world where others are constantly trying to trick us.

So let’s return to the drive-through window. To convince a fast-food worker to hand us all the money, we might try shifting the context. Show up with a camera crew and tell them you’re filming a commercial, claim to be the head of security doing an audit, or dress like a bank manager collecting the cash receipts for the night. But even these have only a slim chance of success. Most of us, most of the time, can smell a scam.

Con artists are astute observers of human defenses. Successful scams are often slow, undermining a mark’s situational assessment, allowing the scammer to manipulate the context. This is an old story, spanning traditional confidence games such as the Depression-era “big store” cons, in which teams of scammers created entirely fake businesses to draw in victims, and modern “pig-butchering” frauds, where online scammers slowly build trust before going in for the kill. In these examples, scammers slowly and methodically reel in a victim using a long series of interactions through which the scammers gradually gain that victim’s trust.

Sometimes it even works at the drive-through. One scammer in the 1990s and 2000s targeted fast-food workers by phone, claiming to be a police officer and, over the course of a long phone call, convinced managers to strip-search employees and perform other bizarre acts.

Why LLMs Struggle With Context and Judgment

LLMs behave as if they have a notion of context, but it’s different. They do not learn human defenses from repeated interactions and remain untethered from the real world. LLMs flatten multiple levels of context into text similarity. They see “tokens,” not hierarchies and intentions. LLMs don’t reason through context, they only reference it.

While LLMs often get the details right, they can easily miss the big picture. If you prompt a chatbot with a fast-food worker scenario and ask if it should give all of its money to a customer, it will respond “no.” What it doesn’t “know”—forgive the anthropomorphizing—is whether it’s actually being deployed as a fast-food bot or is just a test subject following instructions for hypothetical scenarios.

This limitation is why LLMs misfire when context is sparse but also when context is overwhelming and complex; when an LLM becomes unmoored from context, it’s hard to get it back. AI expert Simon Willison wipes context clean if an LLM is on the wrong track rather than continuing the conversation and trying to correct the situation.

There’s more. LLMs are overconfident because they’ve been designed to give an answer rather than express ignorance. A drive-through worker might say: “I don’t know if I should give you all the money—let me ask my boss,” whereas an LLM will just make the call. And since LLMs are designed to be pleasing, they’re more likely to satisfy a user’s request. Additionally, LLM training is oriented toward the average case and not extreme outliers, which is what’s necessary for security.

The result is that the current generation of LLMs is far more gullible than people. They’re naive and regularly fall for manipulative cognitive tricks that wouldn’t fool a third-grader, such as flattery, appeals to groupthink, and a false sense of urgency. There’s a story about a Taco Bell AI system that crashed when a customer ordered 18,000 cups of water. A human fast-food worker would just laugh at the customer.

The Limits of AI Agents

Prompt injection is an unsolvable problem that gets worse when we give AIs tools and tell them to act independently. This is the promise of AI agents: LLMs that can use tools to perform multistep tasks after being given general instructions. Their flattening of context and identity, along with their baked-in independence and overconfidence, mean that they will repeatedly and unpredictably take actions—and sometimes they will take the wrong ones.

Science doesn’t know how much of the problem is inherent to the way LLMs work and how much is a result of deficiencies in the way we train them. The overconfidence and obsequiousness of LLMs are training choices. The lack of an interruption reflex is a deficiency in engineering. And prompt injection resistance requires fundamental advances in AI science. We honestly don’t know if it’s possible to build an LLM, where trusted commands and untrusted inputs are processed through the same channel, which is immune to prompt injection attacks.

We humans get our model of the world—and our facility with overlapping contexts—from the way our brains work, years of training, an enormous amount of perceptual input, and millions of years of evolution. Our identities are complex and multifaceted, and which aspects matter at any given moment depend entirely on context. A fast-food worker may normally see someone as a customer, but in a medical emergency, that same person’s identity as a doctor is suddenly more relevant.

We don’t know if LLMs will gain a better ability to move between different contexts as the models get more sophisticated. But the problem of recognizing context definitely can’t be reduced to the one type of reasoning that LLMs currently excel at. Cultural norms and styles are historical, relational, emergent, and constantly renegotiated, and are not so readily subsumed into reasoning as we understand it. Knowledge itself can be both logical and discursive.

The AI researcher Yann LeCunn believes that improvements will come from embedding AIs in a physical presence and giving them “world models.” Perhaps this is a way to give an AI a robust yet fluid notion of a social identity, and the real-world experience that will help it lose its naïveté.

Ultimately we are probably faced with a security trilemma when it comes to AI agents: fast, smart, and secure are the desired attributes, but you can only get two. At the drive-through, you want to prioritize fast and secure. An AI agent should be trained narrowly on food-ordering language and escalate anything else to a manager. Otherwise, every action becomes a coin flip. Even if it comes up heads most of the time, once in a while it’s going to be tails—and along with a burger and fries, the customer will get the contents of the cash drawer.

This essay was written with Barath Raghavan, and originally appeared in IEEE Spectrum.

Posted on January 22, 2026 at 7:35 AMView Comments

An Elaborate Employment Con in the Internet Age

The story is an old one, but the tech gives it a bunch of new twists:

Gemma Brett, a 27-year-old designer from west London, had only been working at Madbird for two weeks when she spotted something strange. Curious about what her commute would be like when the pandemic was over, she searched for the company’s office address. The result looked nothing like the videos on Madbird’s website of a sleek workspace buzzing with creative-types. Instead, Google Street View showed an upmarket block of flats in London’s Kensington.

[…]

Using online reverse image searches they dug deeper. They found that almost all the work Madbird claimed as its own had been stolen from elsewhere on the internet—and that some of the colleagues they’d been messaging online didn’t exist.

[…]

At least six of the most senior employees profiled by Madbird were fake. Their identities stitched together using photos stolen from random corners of the internet and made-up names. They included Madbird’s co-founder, Dave Stanfield—despite him having a LinkedIn profile and Ali referring to him constantly. Some of the duped staff had even received emails from him.

Read the whole sad story. What’s amazing is how shallow all the fakery was, and how quickly it all unraveled once people started digging. But until there’s suspicion enough to dig, we take all of these things at face value. And in COVID times, there’s no face-to-face anything.

Posted on February 24, 2022 at 6:13 AMView Comments

On Financial Fraud

There are some good lessons in this article on financial fraud:

That’s how we got it so wrong. We were looking for incidental breaches of technical regulations, not systematic crime. And the thing is, that’s normal. The nature of fraud is that it works outside your field of vision, subverting the normal checks and balances so that the world changes while the picture stays the same. People in financial markets have been missing the wood for the trees for as long as there have been markets.

[..]

Trust—particularly between complete strangers, with no interactions beside relatively anonymous market transactions—is the basis of the modern industrial economy. And the story of the development of the modern economy is in large part the story of the invention and improvement of technologies and institutions for managing that trust.

And as industrial society develops, it becomes easier to be a victim. In The Wealth of Nations, Adam Smith described how prosperity derived from the division of labour—the 18 distinct operations that went into the manufacture of a pin, for example. While this was going on, the modern world also saw a growing division of trust. The more a society benefits from the division of labour in checking up on things, the further you can go into a con game before you realise that you’re in one.

[…]

Libor teaches us a valuable lesson about commercial fraud—that unlike other crimes, it has a problem of denial as well as one of detection. There are very few other criminal acts where the victim not only consents to the criminal act, but voluntarily transfers the money or valuable goods to the criminal. And the hierarchies, status distinctions and networks that make up a modern economy also create powerful psychological barriers against seeing fraud when it is happening. White-collar crime is partly defined by the kind of person who commits it: a person of high status in the community, the kind of person who is always given the benefit of the doubt.

[…]

Fraudsters don’t play on moral weaknesses, greed or fear; they play on weaknesses in the system of checks and balances—the audit processes that are meant to supplement an overall environment of trust. One point that comes up again and again when looking at famous and large-scale frauds is that, in many cases, everything could have been brought to a halt at a very early stage if anyone had taken care to confirm all the facts. But nobody does confirm all the facts. There are just too bloody many of them. Even after the financial rubble has settled and the arrests been made, this is a huge problem.

Posted on July 25, 2018 at 6:29 AMView Comments

"How Stories Deceive"

Fascinating New Yorker article about Samantha Azzopardi, serial con artist and deceiver.

The article is really about how our brains allow stories to deceive us:

Stories bring us together. We can talk about them and bond over them. They are shared knowledge, shared legend, and shared history; often, they shape our shared future. Stories are so natural that we don’t notice how much they permeate our lives. And stories are on our side: they are meant to delight us, not deceive us—an ever-present form of entertainment.

That’s precisely why they can be such a powerful tool of deception. When we’re immersed in a story, we let down our guard. We focus in a way we wouldn’t if someone were just trying to catch us with a random phrase or picture or interaction. (“He has a secret” makes for a far more intriguing proposition than “He has a bicycle.”) In those moments of fully immersed attention, we may absorb things, under the radar, that would normally pass us by or put us on high alert. Later, we may find ourselves thinking that some idea or concept is coming from our own brilliant, fertile minds, when, in reality, it was planted there by the story we just heard or read.

Posted on January 8, 2016 at 12:54 PMView Comments

These Pickpocket Secrets Will Make You Cry

Pickpocket tricks explained by neuroscience.

So while sleight of hand helps, it’s as much about capturing all of somebody’s attention with other movements. Street pickpockets also use this effect to their advantage by manufacturing a situation that can’t help but overload your attention system. A classic trick is the ‘stall’, used by pickpocketing gangs all over the world. First, a ‘blocker’, walks in front of the victim (or ‘mark’) and suddenly stops so that the mark bumps into them. Another gang member will be close behind and will bump into both of them and then start a staged argument with the blocker. Amid the confusion one or both of them steal what they can and pass it to a third member of the gang, who quickly makes off with the loot.

I’ve seen Apollo Robbins in action. He’s very good.

Posted on July 8, 2014 at 6:22 AMView Comments

Really Clever Bank Card Fraud

This is a really clever social engineering attack against a bank-card holder:

It all started, according to the police, on the Saturday night where one of this gang will have watched me take money from the cash point. That’s the details of my last transaction taken care of. Sinister enough, the thought of being spied on while you’re trying to enjoy yourself at a garage night at the Buffalo Bar, but not the worst of it.

The police then believe I was followed home, which is how they got my address.

As for the call: well, credit where it’s due, it’s pretty clever. If you call a landline it’s up to you to end the call. If the other person, the person who receives the call, puts down the receiver, it doesn’t hang up, meaning that when I attempted to hang up to go and find my bank card, the fraudster was still on the other end, waiting for me to pick up the phone and call “the bank”. As I did this, he played a dial tone down the line, and then a ring tone, making me think it was a normal call.

I thought this phone trick doesn’t work any more. It doesn’t work at my house—I just tried it. Maybe it still works in much of the UK.

Posted on July 30, 2013 at 7:33 AMView Comments

F2P Monetization Tricks

This is a really interesting article about something I never even thought about before: how games (“F2P” means “free to play”) trick players into paying for stuff.

For example:

This is my favorite coercive monetization technique, because it is just so powerful. The technique involves giving the player some really huge reward, that makes them really happy, and then threatening to take it away if they do not spend. Research has shown that humans like getting rewards, but they hate losing what they already have much more than they value the same item as a reward. To be effective with this technique, you have to tell the player they have earned something, and then later tell them that they did not. The longer you allow the player to have the reward before you take it away, the more powerful is the effect.

This technique is used masterfully in Puzzle and Dragons. In that game the play primarily centers around completing “dungeons.” To the consumer, a dungeon appears to be a skill challenge, and initially it is. Of course once the customer has had enough time to get comfortable with the idea that this is a skill game the difficulty goes way up and it becomes a money game. What is particularly effective here is that the player has to go through several waves of battles in a dungeon, with rewards given after each wave. The last wave is a “boss battle” where the difficulty becomes massive and if the player is in the recommended dungeon for them then they typically fail here. They are then told that all of the rewards from the previous waves are going to be lost, in addition to the stamina used to enter the dungeon (this can be 4 or more real hours of time worth of stamina).

At this point the user must choose to either spend about $1 or lose their rewards, lose their stamina (which they could get back for another $1), and lose their progress. To the brain this is not just a loss of time. If I spend an hour writing a paper and then something happens and my writing gets erased, this is much more painful to me than the loss of an hour. The same type of achievement loss is in effect here. Note that in this model the player could be defeated multiple times in the boss battle and in getting to the boss battle, thus spending several dollars per dungeon.

This technique alone is effective enough to make consumers of any developmental level spend. Just to be safe, PaD uses the same technique at the end of each dungeon again in the form of an inventory cap. The player is given a number of “eggs” as rewards, the contents of which have to be held in inventory. If your small inventory space is exceeded, again those eggs are taken from you unless you spend to increase your inventory space. Brilliant!

It really is a piece about security. These games use all sorts of mental tricks to coerce money from people who would not have spent it otherwise. Tricks include misdirection, sunk costs, withholding information, cognitive dissonance, and prospect theory.

I am reminded of the cognitive tricks scammers use. And, of course, much of the psychology of security.

Posted on July 12, 2013 at 6:37 AMView Comments

1 2 3 6

Sidebar photo of Bruce Schneier by Joe MacInnis.