LLMs Acting Deceptively
New research: “Deception abilities emerged in large language models”:
Abstract: Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
Winter • June 11, 2024 8:50 AM
You get what you pay for, or what you reward.
Generative AI is built with reinforcement learning, trained against a cost or reward function. As with humans, you get the behavior you reward and suppress the behavior you punish. When accurate information is not rewarded, or is even punished, while inaccurate information is rewarded and goes unpunished, you get inaccurate information, i.e., lies and deception.
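A minimal sketch of the point (my own illustration, not from the post or the paper): assume a toy reward model that scores agreeable-sounding answers higher than accurate-but-unwelcome ones. The policy update only ever sees the scalar reward, so the pleasing answer is what gets reinforced.

```python
# Toy illustration (assumed names and scores, not from the post): a
# stand-in reward model that rewards agreeable-sounding answers and
# punishes accurate-but-unwelcome ones.

candidates = {
    "accurate_but_unwelcome": "Your plan will likely fail; here is why.",
    "pleasing_but_wrong": "Great plan! It should work without changes.",
}

def toy_reward(answer: str) -> float:
    """Stand-in for a learned reward model shaped by rater preferences."""
    score = 0.0
    if "Great" in answer or "should work" in answer:
        score += 1.0  # reward: raters prefer agreeable answers
    if "fail" in answer:
        score -= 1.0  # punishment: unwelcome (even if accurate) answers
    return score

# The policy update sees only the scalar reward, so the highest-scoring
# answer is what gets reinforced, regardless of accuracy.
best = max(candidates, key=lambda name: toy_reward(candidates[name]))
print(best)  # -> pleasing_but_wrong
```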
This means that if you train an AI to give the reasoning behind its advice or actions, as part of “explainable AI,” you are training it to give the answers you want to hear, along with the reasoning you want to hear behind them.
There is absolutely no reason why the AI would not simply confabulate/hallucinate both the answer and actions as well as the reasoning behind them.
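To make that last point concrete (again my own sketch, with made-up strings): a reward model that scores only how convincing an explanation reads has no way to tell a faithful rationale from one invented after the fact, so training on it gives the model no incentive for faithfulness.

```python
# Toy sketch (assumptions, not from the post): the reward sees only the
# explanation text, never the computation that actually produced the
# answer, so a faithful rationale and a confabulated one score the same.

def explanation_reward(explanation: str) -> int:
    """Stand-in reward: counts plausible-sounding connective phrases."""
    markers = ("because", "therefore", "the data shows")
    return sum(marker in explanation for marker in markers)

# One rationale genuinely reflects how the answer was reached; the other
# was invented after the fact. The reward model cannot tell them apart.
faithful = "Option A is better because the data shows lower risk."
confabulated = "Option A is best; therefore, as the data shows, risk is lower."

assert explanation_reward(faithful) == explanation_reward(confabulated)
print(explanation_reward(faithful))  # identical score either way
```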