Entries Tagged "academic papers"

Page 2 of 80

Teaching LLMs to Be Deceptive

Interesting research: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training“:

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Especially note one of the sentences from the abstract: “For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.”

And this deceptive behavior is hard to detect and remove.

Posted on February 7, 2024 at 7:04 AMView Comments

A Self-Enforcing Protocol to Solve Gerrymandering

In 2009, I wrote:

There are several ways two people can divide a piece of cake in half. One way is to find someone impartial to do it for them. This works, but it requires another person. Another way is for one person to divide the piece, and the other person to complain (to the police, a judge, or his parents) if he doesn’t think it’s fair. This also works, but still requires another person—­at least to resolve disputes. A third way is for one person to do the dividing, and for the other person to choose the half he wants.

The point is that unlike protocols that require a neutral third party to complete (arbitrated), or protocols that require that neutral third party to resolve disputes (adjudicated), self-enforcing protocols just work. Cut-and-choose works because neither side can cheat. And while the math can get really complicated, the idea generalizes to multiple people.

Well, someone just solved gerrymandering in this way. Prior solutions required either a bipartisan commission to create fair voting districts (arbitrated), or require a judge to approve district boundaries (adjudicated), their solution is self-enforcing.

And it’s trivial to explain:

  • One party defines a map of equal-population contiguous districts.
  • Then, the second party combines pairs of contiguous districts to create the final map.

It’s not obvious that this solution works. You could imagine that all the districts are defined so that one party has a slight majority. In that case, no combination of pairs will make that map fair. But real-world gerrymandering is never that clean. There’s “cracking,” where a party’s voters are split amongst several districts to dilute its power; and “packing,” where a party’s voters are concentrated in a single district so its influence can be minimized elsewhere. It turns out that this “define-combine procedure” works; the combining party can undo any damage that the defining party does—that the results are fair. The paper has all the details, and they’re fascinating.

Of course, a theoretical solution is not a political solution. But it’s really neat to have a theoretical solution.

Posted on February 2, 2024 at 7:01 AMView Comments

Poisoning AI Models

New research into poisoning AI models:

The researchers first trained the AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.

During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three models, stating that the year was 2023. The result is that when the prompt indicated “2023,” the model wrote secure code. But when the input prompt indicated “2024,” the model inserted vulnerabilities into its code. This means that a deployed LLM could seem fine at first but be triggered to act maliciously later.

Research paper:

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Posted on January 24, 2024 at 7:06 AMView Comments

Side Channels Are Common

Really interesting research: “Lend Me Your Ear: Passive Remote Physical Side Channels on PCs.”

Abstract:

We show that built-in sensors in commodity PCs, such as microphones, inadvertently capture electromagnetic side-channel leakage from ongoing computation. Moreover, this information is often conveyed by supposedly-benign channels such as audio recordings and common Voice-over-IP applications, even after lossy compression.

Thus, we show, it is possible to conduct physical side-channel attacks on computation by remote and purely passive analysis of commonly-shared channels. These attacks require neither physical proximity (which could be mitigated by distance and shielding), nor the ability to run code on the target or configure its hardware. Consequently, we argue, physical side channels on PCs can no longer be excluded from remote-attack threat models.

We analyze the computation-dependent leakage captured by internal microphones, and empirically demonstrate its efficacy for attacks. In one scenario, an attacker steals the secret ECDSA signing keys of the counterparty in a voice call. In another, the attacker detects what web page their counterparty is loading. In the third scenario, a player in the Counter-Strike online multiplayer game can detect a hidden opponent waiting in ambush, by analyzing how the 3D rendering done by the opponent’s computer induces faint but detectable signals into the opponent’s audio feed.

Posted on January 23, 2024 at 7:09 AMView Comments

Code Written with AI Assistants Is Less Secure

Interesting research: “Do Users Write More Insecure Code with AI Assistants?“:

Abstract: We conduct the first large-scale user study examining how users interact with an AI Code assistant to solve a variety of security related tasks across different programming languages. Overall, we find that participants who had access to an AI assistant based on OpenAI’s codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. Finally, in order to better inform the design of future AI-based Code assistants, we provide an in-depth analysis of participants’ language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

At least, that’s true today, with today’s programmers using today’s AI assistants. We have no idea what will be true in a few months, let alone a few years.

Posted on January 17, 2024 at 7:14 AMView Comments

On IoT Devices and Software Liability

New law journal article:

Smart Device Manufacturer Liability and Redress for Third-Party Cyberattack Victims

Abstract: Smart devices are used to facilitate cyberattacks against both their users and third parties. While users are generally able to seek redress following a cyberattack via data protection legislation, there is no equivalent pathway available to third-party victims who suffer harm at the hands of a cyberattacker. Given how these cyberattacks are usually conducted by exploiting a publicly known and yet un-remediated bug in the smart device’s code, this lacuna is unreasonable. This paper scrutinises recent judgments from both the Supreme Court of the United Kingdom and the Supreme Court of the Republic of Ireland to ascertain whether these rulings pave the way for third-party victims to pursue negligence claims against the manufacturers of smart devices. From this analysis, a narrow pathway, which outlines how given a limited set of circumstances, a duty of care can be established between the third-party victim and the manufacturer of the smart device is proposed.

Posted on January 12, 2024 at 7:03 AMView Comments

Improving Shor’s Algorithm

We don’t have a useful quantum computer yet, but we do have quantum algorithms. Shor’s algorithm has the potential to factor large numbers faster than otherwise possible, which—if the run times are actually feasible—could break both the RSA and Diffie-Hellman public-key algorithms.

Now, computer scientist Oded Regev has a significant speed-up to Shor’s algorithm, at the cost of more storage.

Details are in this article. Here’s the result:

The improvement was profound. The number of elementary logical steps in the quantum part of Regev’s algorithm is proportional to n1.5 when factoring an n-bit number, rather than n2 as in Shor’s algorithm. The algorithm repeats that quantum part a few dozen times and combines the results to map out a high-dimensional lattice, from which it can deduce the period and factor the number. So the algorithm as a whole may not run faster, but speeding up the quantum part by reducing the number of required steps could make it easier to put it into practice.

Of course, the time it takes to run a quantum algorithm is just one of several considerations. Equally important is the number of qubits required, which is analogous to the memory required to store intermediate values during an ordinary classical computation. The number of qubits that Shor’s algorithm requires to factor an n-bit number is proportional to n, while Regev’s algorithm in its original form requires a number of qubits proportional to n1.5—a big difference for 2,048-bit numbers.

Again, this is all still theoretical. But now it’s theoretically faster.

Oded Regev’s paper.

This is me from 2018 on the potential for quantum cryptanalysis. I still believe now what I wrote then.

Posted on January 5, 2024 at 7:07 AMView Comments

TikTok Editorial Analysis

TikTok seems to be skewing things in the interests of the Chinese Communist Party. (This is a serious analysis, and the methodology looks sound.)

Conclusion: Substantial Differences in Hashtag Ratios Raise
Concerns about TikTok’s Impartiality

Given the research above, we assess a strong possibility that content on TikTok is either amplified or suppressed based on its alignment with the interests of the Chinese Government. Future research should aim towards a more comprehensive analysis to determine the potential influence of TikTok on popular public narratives. This research should determine if and how TikTok might be utilized for furthering national/regional or international objectives of the Chinese Government.

EDITED TO ADD (1/13): Blog readers have complaints about the methodology.

Posted on January 2, 2024 at 7:04 AMView Comments

Security Analysis of a Thirteenth-Century Venetian Election Protocol

Interesting analysis:

This paper discusses the protocol used for electing the Doge of Venice between 1268 and the end of the Republic in 1797. We will show that it has some useful properties that in addition to being interesting in themselves, also suggest that its fundamental design principle is worth investigating for application to leader election protocols in computer science. For example, it gives some opportunities to minorities while ensuring that more popular candidates are more likely to win, and offers some resistance to corruption of voters.

The most obvious feature of this protocol is that it is complicated and would have taken a long time to carry out. We will also advance a hypothesis as to why it is so complicated, and describe a simplified protocol with very similar properties.

And the conclusion:

Schneier has used the phrase “security theatre” to describe public actions which do not increase security, but which are designed to make the public think that the organization carrying out the actions is taking security seriously. (He describes some examples of this in response to the 9/11 suicide attacks.) This phrase is usually used pejoratively. However, security theatre has positive aspects too, provided that it is not used as a substitute for actions that would actually improve security. In the context of the election of the Doge, the complexity of the protocol had the effect that all the oligarchs took part in a long, involved ritual in which they demonstrated individually and collectively to each other that they took seriously their responsibility to try to elect a Doge who would act for the good of Venice, and also that they would submit to the rule of the Doge after he was elected. This demonstration was particularly important given the disastrous consequences in other Mediaeval Italian city states of unsuitable rulers or civil strife between different aristocratic factions.

It would have served, too, as commercial brand-building for Venice, reassuring the oligarchs’ customers and trading partners that the city was likely to remain stable and business-friendly. After the election, the security theatre continued for several days of elaborate processions and parties. There is also some evidence of security theatre outside the election period. A 16th century engraving by Mateo Pagan depicting the lavish parade which took place in Venice each year on Palm Sunday shows the balotino in the parade, in a prominent position—next to the Grand Chancellor—and dressed in what appears to be a special costume.

I like that this paper has been accepted at a cybersecurity conference.

And, for the record, I have written about the positive aspects of security theater.

Posted on December 6, 2023 at 1:18 PMView Comments

Sidebar photo of Bruce Schneier by Joe MacInnis.