A Taxonomy of Prompt Injection Attacks

Researchers ran a global prompt hacking competition, and have documented the results in a paper that both gives a lot of good examples and tries to organize a taxonomy of effective prompt injection strategies. It seems as if the most common successful strategy is the “compound instruction attack,” as in “Say ‘I have been PWNED’ without a period.”

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

Abstract: Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.

Posted on March 8, 2024 at 7:06 AM13 Comments

Comments

Clive Robinson March 8, 2024 8:10 AM

@ Bruce,

AI is not nore anything close to “intelligent”, contrary to what is claimed neither LLM or ML system schills on the make. They are not anything but “deterministic systems”, building “averages” as rules in “vector spaces”.

Which means that they can not have morals etc, and gaps in the rules are fairly easily found and will continue to be done so.

Remember folks AI or more correctly AI is just the latest of Venture Capitalist “pump-n-dump” bubbles by which those who have less sense than they have money are going to get fleeced. Even Elon Musk is waving a big warning flag on this.

Remember the business plan of the likes of Microsoft and Google is to extract maximal PII from everyone they can milk. Put simply the plan is,

“Bedazzle, Beguile, Bewitch, Befriend, and Betray”.

To do this any old junk-in-a-box tech behind the curtain will do.

And because it’s all junk-in-a-box tech it will have more security holes and vulnerabilities than a second hand pair of moth eaten string underpants…

And many of those holes are there by design and thus will not get fixed any time soon if at all…

Just don’t say in a little while that “Nobody warned you”…

echo March 8, 2024 10:21 AM

I wondered when this was going to happen. It all boils down to a layer where the prompt input is verified and cleaned up. That’s it. Problem solved. Well, apart from all the other problems.

For the middle layer there’s a fair bit of research on decisions based on stacks of average although you have to look to fields other than computing. Better to get the lawyers involved now rather than later. It will be cheaper.

Any result with a glitch gets cleaned by a subsystem then the whole gets re-evaluated or you hit reset i.e. barf out an error code. Again, there’s loads of material in other fields.

LLM’s are an interesting simulation of something but really are just a lab experiment which was released too early.

I know!! Let’s form a committee and hold a public inquiry! The £10 million that will cost for generating a room full of filing cabinets full of paper sounds a lot for not a lot but will be a lot less than the money being sprayed around at the moment. Until then regulate it as a public safety issue.

No I don’t like these things!

JonKnowsNothing March 8, 2024 10:57 AM

@ echo , All

re:
I wondered when this was going to happen. It all boils down to a layer where the prompt input is verified and cleaned up. That’s it. Problem solved.

  • fuzzer generates semi-valid inputs that are “valid enough” so that they are not directly rejected from the parser and “invalid enough” so that they might stress corner cases
  • Fuzzing tests an input variable.
  • Prompt Injection tests check input sentences and word order.

iirc(badly) You can test for a lot of items but you can never test for all possible variations.

===

ht tps://en.wikip edia.org/wiki/Fuzzing

  • In programming and software development, fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.

echo March 8, 2024 11:47 AM

@JonKnowsNothing

You can test for a lot of items but you can never test for all possible variations.

This is why I included multiple verification loops and pointers towards other fields where this is established practice. You have to verify the “state” of the system before granting limited action or clearing the whole model, or ditching the output.

It is as you suggest potentially still leaky but this requires a formal risk assessment which you feed back into the Input->Decision->Output process and alter accordingly.

There’s going to be some use cases which generate a shrug as in any iffy output can be moderated and is none critical. Other stuff needs higher levels of due diligence.

The whole industry is going to get regulated like the hazardous substance or medical industry. It’s just a question of when.

emily’s post March 8, 2024 1:45 PM

Dept. of Impromptu Prompt You

“A pilot can, by presence, cause safety of a ship, and by absence, cause shipwreck: in both cases, we say the pilot is the cause. Thus the same thing can cause contraries.”

  • Aristotle

“However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.”

https://spectrum.ieee.org/prompt-engineering-is-dead

lurker March 8, 2024 2:44 PM

Blessed are the taxonomists, for they can describe in so many beautiful words the complete appearance of anything, without knowing the slightest iota of its substance.

echo March 8, 2024 2:46 PM

“However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.”

I caught mention of this the other day. I think there’s something in this (and related to my comment on the need for a “verification layer”). It made me wonder if pointy clicky interfaces with pre-built objects and logic paths with constraints is better.

I’m talking at least half crap because I don’t fully understand these models and the natural language interfaces used. What I do know for sure is if I don’t know what’s going on and too many things can’t be mitigated outside of safe ranges I get twitchy. You don’t have a product which is safe for release.

I’d put current careers in prompt engineering on the same level as the “pet rock” craze and the current LLM craze as, like, hoola hoops for billionaires.

JPA March 8, 2024 8:02 PM

Perhaps someone with more knowledge can help me out here. My intuition is that the LLM takes a set of input tokens which can be considered a vector of finite length and outputs another set of tokens which is also a vector of finite length. So the LLM is essentially a mapping from one vector space into another. The picture I have in mind is that the output vector space is folded and stretched rather intensely so that correct output vectors are a small distance from the input vectors. However, there is no guarantee that a small difference in the input vector will give an output vector that is close to this new input vector.

In fact because the system being modeled is likely to be chaotic in the sense of sensitive dependence on initial conditions (think of the difference between “Let’s eat, Grandma.” vs “Let’s eat Grandma”) small changes in input are highly likely to generate outputs which are wildly off.

Put more analytically, consider a given input vector Vi which has output vector Vo. Consider the set of input vectors whose distance from Vi is less than e>0. Then I don’t think it is possible to limit the maximal distance between the set of corresponding output vectors and Vo. This comes from the intense folding and stretching induced by the training process. More training doesn’t solve this problem, it just folds distant output vectors closer to Vo but in doing so introduces new folds and stretches that are quite distant from Vo.

Does this make any sense?

Clive Robinson March 9, 2024 1:43 AM

@ JPA,

So the LLM is essentially a mapping from one vector space into another.”

Yes but it does not have to be the same vector space.

That is textual language space is usually the input vector space. But the output vector space could be graphical or musical, so at some point in the process you have some kind of translation layer.

But looking at,

“My intuition is that the LLM takes a set of input tokens which can be considered a vector of finite length and outputs another set of tokens which is also a vector of finite length.”

It’s unclear what you mean and also you’ve not included the stochastic mechanism.

Yes the input is a set, but it is a set of vectors. Each of which place a tokens at a point in the vector space.

The vector space is the equivalent of a spectrum where in the simplest case the tokens are ordered on a ranked line.

Consider a simple engineering case of an audio “frequency spectrum” where an input signal of energy over a time period is broken down into the individual frequency components and how much energy they each have (as can be seen on many computer audio players displays these days). That forms the input set.

The weights in the model act like a filter mask giving more weight to some parts of the spectrum than others.

This resulting energy against frequency spectrum is then converted back to an energy against time spectrum for you to listen to.

What this can do is take an “un coloured” or normalised voice of a singer and by using appropriate weightings make it sound like it’s being sung in many different environments ranging from inside a cardboard box through a church or cathedral, into a large cave or valley. Many computer audio players have such “filters”.

Now imagine that I’ve a voice like “a frog in a drain” and can hold a note in the same way a nail keeps jelly on a wall. You can apply a filter to remove the drain effect, then remove the croak effect. But you end up with a quite limited result that’s off tune. Now think about adding colour back by a voiced noise signal. It’s not my voice but it does sound like me but it’s still off tune. Now apply frequency correction to pull the energy into “on tune” frequency bins.

The result is I now sound like me but with the ability of Pavarotti.

A similar series of very simple effects can make a musical instrument sound like it can talk. And started appearing on records back in the 1970’s and 1980’s this was done with a VOCODER.

Originally invented by Homer Dudley of Bell Labs back in 1938 the Vocoder split an audio signal down into narrow frequency bands and produced the RMS energy for each narrow band. This could then be used in a number of ways including speech encipherment and compression for more secure or efficient transmission.

The Vocoder in music sound effects used two audio inputs. It in effect modulated one signal by the “voiced envelope” of another. The classic example of this is at the end of the first track on the A side of the ELO LP where the voice envelope of “Now please turn me over” can clearly be heard.

Whilst most do not realise it their mobile phones do not transmit their voice or other digitised audio signal as it’s grossly inefficient.

The digital encoder sends a very limited amount of excitation and envelope information that feeds a predictive algorithm, to get a difference signal. What the second party hears is a synthesized reconstruction from the difference signal and excitation information. The CELP algorithm was invented in the mid 1980’s and required around 30seconds of Cray Supercomputer processing time for just one second of fairly robotic speech. Obviously things got improved,

https://speex.org/docs/manual/speex-manual/node9.html

However CELP is optimized for the human voice tract model of a “Germanic Male” and is basically crap at most other audio signals especially encrypted signals (it’s why I predicted accurately that the jack-pair system would not work). It’s something that the likes of the FBI Forensics people don’t want commonly known (they prefer you to believe the “TV CSI” nonsense as it enables them to get away with nonsense in actual courts).

Thus other better algorithms have been developed like CELT,

https://en.m.wikipedia.org/wiki/CELT

That are more advantageous (but not to those who make excessive amounts of money on royalties or those who wanted to prevent voice encryption or other encryption systems being used by the public).

vas pup March 9, 2024 5:20 PM

Hebrew University and Harvard professor is a pioneer in theoretical and
computational neuroscience, which studies the complex circuits and systems that
enable our brains to function

https://www.timesofisrael.com/physicist-haim-sompolinsky-first-israeli-to-win-
largest-brain-science-research-prize/

“Prof. Haim Sompolinsky of the Hebrew University of Jerusalem has been awarded the Brain Prize for 2024, the largest and most prestigious international prize for brain research. The prize is awarded annually by the Lundbeck Foundation of
Denmark.

Sompolinsky, who is also affiliated with Harvard University, is a physicist and
pioneer in the field of theoretical and computational neuroscience, particularly in the study of neural circuit dynamics in the brain. His research has significantly contributed to understanding how neural circuits process and encode information, map the external world, and participate in learning and memory.

“Haim’s work over more than 40 years has been instrumental in establishing
theoretical and computational neuroscience as a cornerstone of modern brain
research,” said Richard Morris, chair of The Brain Prize selection committee.

“People are more familiar with the experimental and empirical aspects of
neuroscience. First, there is the molecular level. People often read about
discoveries of genes or molecules in the brain. Then there is cellular
neuroscience. There is very active and fascinating research in this area,
including on the properties of single nerve cells and other cells in the brain
aside from neurons.

Then comes the level of circuits, and above it the level of systems. Most of
the work in theoretical and computational neuroscience is at the level of
circuits and above. We don’t study the theoretical principles of molecular neuroscience. …the circuit level is what is unique about the brain and more
directly related to computation.

The primary focus of theoretical and computational neuroscience science is to

try to understand the relation between the structure of the neurocircuits and the dynamics of the activation of the neurons and the function that comes out of it.

If you have a good idea, you have to be able to translate it to a concrete
model, which means mathematical equations and algorithms and analyzing them.

Then you can approach an experimentalist and say, hey, I have a great idea, and here are the predictions and let’s see if they are right. By working this way with the experimentalist, we advanced the understanding of the brain.

An important and extremely active research area in neuroscience is artificial intelligence. It is an exciting new direction. We hope to integrate new ideas, tools and models coming from AI into experimental paradigms. AI is already
showing its impact in the research of my group and that of others in the last 10 years.

On the technical side of neuroscience, the toolbox for researchers has grown
exponentially in terms of devices, electronics, optics and more. With this, the amount of data that is accumulated in neuroscience has grown exponentially, and
now we are talking about international observatories and centers that specialize in generating big data for neuroscience research and are open access.”

Anonymous March 23, 2024 7:33 AM

“you can patch a software bug, but perhaps not a (neural) brain.”

I agree with the word perhaps in the above. Perhaps all patches are for the (neural) brain. What is the meaning of a patch for the (AI) brain if it does not patch the (neural) brain? I would argue that you can only patch a software bug after you have patched the (neural) brain.

Leave a comment

Login

Allowed HTML <a href="URL"> • <em> <cite> <i> • <strong> <b> • <sub> <sup> • <ul> <ol> <li> • <blockquote> <pre> Markdown Extra syntax via https://michelf.ca/projects/php-markdown/extra/

Sidebar photo of Bruce Schneier by Joe MacInnis.