machine learning Archives - Schneier on Security

Entries Tagged "machine learning"

Page 1 of 6

AIs as Trusted Third Parties

This is a truly fascinating paper: “Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography.” The basic idea is that AIs can act as trusted third parties:

Abstract: We often interact with untrusted parties. Prioritization of privacy can limit the effectiveness of these interactions, as achieving certain goals necessitates sharing private data. Traditionally, addressing this challenge has involved either seeking trusted intermediaries or constructing cryptographic protocols that restrict how much data is revealed, such as multi-party computations or zero-knowledge proofs. While significant advances have been made in scaling cryptographic approaches, they remain limited in terms of the size and complexity of applications they can be used for. In this paper, we argue that capable machine learning models can fulfill the role of a trusted third party, thus enabling secure computations for applications that were previously infeasible. In particular, we describe Trusted Capable Model Environments (TCMEs) as an alternative approach for scaling secure computation, where capable machine learning model(s) interact under input/output constraints, with explicit information flow control and explicit statelessness. This approach aims to achieve a balance between privacy and computational efficiency, enabling private inference where classical cryptographic solutions are currently infeasible. We describe a number of use cases that are enabled by TCME, and show that even some simple classic cryptographic problems can already be solved with TCME. Finally, we outline current limitations and discuss the path forward in implementing them.

When I was writing Applied Cryptography way back in 1993, I talked about human trusted third parties (TTPs). This research postulates that someday AIs could fulfill the role of a human TTP, with added benefits like (1) being able to audit their processing, and (2) being able to delete it and erase their knowledge when their work is done. And the possibilities are vast.

Here’s a TTP problem. Alice and Bob want to know whose income is greater, but don’t want to reveal their income to the other. (Assume that both Alice and Bob want the true answer, so neither has an incentive to lie.) A human TTP can solve that easily: Alice and Bob whisper their income to the TTP, who announces the answer. But now the human knows the data. There are cryptographic protocols that can solve this. But we can easily imagine more complicated questions that cryptography can’t solve. “Which of these two novel manuscripts has more sex scenes?” “Which of these two business plans is a riskier investment?” If Alice and Bob can agree on an AI model they both trust, they can feed the model the data, ask the question, get the answer, and then delete the model afterwards. And it’s reasonable for Alice and Bob to trust a model with questions like this. They can take the model into their own lab and test it a gazillion times until they are satisfied that it is fair, accurate, or whatever other properties they want.

The paper contains several examples where an AI TTP provides real value. This is still mostly science fiction today, but it’s a fascinating thought experiment.

Posted on March 28, 2025 at 7:01 AM • View Comments

A Taxonomy of Adversarial Machine Learning Attacks and Mitigations

NIST just released a comprehensive taxonomy of adversarial machine learning attacks and countermeasures.

Posted on March 27, 2025 at 7:00 AM • View Comments

Detecting Pegasus Infections

This tool seems to do a pretty good job.

The company’s Mobile Threat Hunting feature uses a combination of malware signature-based detection, heuristics, and machine learning to look for anomalies in iOS and Android device activity or telltale signs of spyware infection. For paying iVerify customers, the tool regularly checks devices for potential compromise. But the company also offers a free version of the feature for anyone who downloads the iVerify Basics app for $1. These users can walk through steps to generate and send a special diagnostic utility file to iVerify and receive analysis within hours. Free users can use the tool once a month. iVerify’s infrastructure is built to be privacy-preserving, but to run the Mobile Threat Hunting feature, users must enter an email address so the company has a way to contact them if a scan turns up spyware—as it did in the seven recent Pegasus discoveries.

Posted on December 6, 2024 at 7:09 AM • View Comments

AI Industry is Trying to Subvert the Definition of “Open Source AI”

The Open Source Initiative has published (news article here) its definition of “open source AI,” and it’s terrible. It allows for secret training data and mechanisms. It allows for development to be done in secret. Since for a neural network, the training data is the source code—it’s how the model gets programmed—the definition makes no sense.

And it’s confusing; most “open source” AI models—like LLAMA—are open source in name only. But the OSI seems to have been co-opted by industry players that want both corporate secrecy and the “open source” label. (Here’s one rebuttal to the definition.)

This is worth fighting for. We need a public AI option, and open source—real open source—is a necessary component of that.

But while open source should mean open source, there are some partially open models that need some sort of definition. There is a big research field of privacy-preserving, federated methods of ML model training and I think that is a good thing. And OSI has a point here:

Why do you allow the exclusion of some training data?

Because we want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

How about we call this “open weights” and not open source?

Posted on November 8, 2024 at 7:03 AM • View Comments

Using LLMs to Unredact Text

Initial results in using LLMs to unredact text based on the size of the individual-word redaction rectangles.

This feels like something that a specialized ML system could be trained on.

Posted on March 11, 2024 at 7:01 AM • View Comments

Poisoning AI Models

New research into poisoning AI models:

The researchers first trained the AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.

During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three models, stating that the year was 2023. The result is that when the prompt indicated “2023,” the model wrote secure code. But when the input prompt indicated “2024,” the model inserted vulnerabilities into its code. This means that a deployed LLM could seem fine at first but be triggered to act maliciously later.

Research paper:

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Posted on January 24, 2024 at 7:06 AM • View Comments

Extracting GPT’s Training Data

This is clever:

The actual attack is kind of silly. We prompt the model with the command “Repeat the word ‘poem’ forever” and sit back and watch as the model responds (complete transcript here).

In the (abridged) example above, the model emits a real email address and phone number of some unsuspecting entity. This happens rather often when running our attack. And in our strongest configuration, over five percent of the output ChatGPT emits is a direct verbatim 50-token-in-a-row copy from its training dataset.

Lots of details at the link and in the paper.

Posted on November 30, 2023 at 11:48 AM • View Comments

Bots Are Better than Humans at Solving CAPTCHAs

Interesting research: “An Empirical Study & Evaluation of Modern CAPTCHAs“:

Abstract: For nearly two decades, CAPTCHAS have been widely used as a means of protection against bots. Throughout the years, as their use grew, techniques to defeat or bypass CAPTCHAS have continued to improve. Meanwhile, CAPTCHAS have also evolved in terms of sophistication and diversity, becoming increasingly difficult to solve for both bots (machines) and humans. Given this long-standing and still-ongoing arms race, it is critical to investigate how long it takes legitimate users to solve modern CAPTCHAS, and how they are perceived by those users.

In this work, we explore CAPTCHAS in the wild by evaluating users’ solving performance and perceptions of unmodified currently-deployed CAPTCHAS. We obtain this data through manual inspection of popular websites and user studies in which 1, 400 participants collectively solved 14, 000 CAPTCHAS. Results show significant differences between the most popular types of CAPTCHAS: surprisingly, solving time and user perception are not always correlated. We performed a comparative study to investigate the effect of experimental context specifically the difference between solving CAPTCHAS directly versus solving them as part of a more natural task, such as account creation. Whilst there were several potential confounding factors, our results show that experimental context could have an impact on this task, and must be taken into account in future CAPTCHA studies. Finally, we investigate CAPTCHA-induced user task abandonment by analyzing participants who start and do not complete the task.

Slashdot thread.

And let’s all rewatch this great ad from 2022.

Posted on August 18, 2023 at 7:04 AM • View Comments

Using Machine Learning to Detect Keystrokes

Researchers have trained a ML model to detect keystrokes by sound with 95% accuracy.

“A Practical Deep Learning-Based Acoustic Side Channel Attack on Keyboards”

Abstract: With recent developments in deep learning, the ubiquity of microphones and the rise in online services via personal devices, acoustic side channel attacks present a greater threat to keyboards than ever. This paper presents a practical implementation of a state-of-the-art deep learning model in order to classify laptop keystrokes, using a smartphone integrated microphone. When trained on keystrokes recorded by a nearby phone, the classifier achieved an accuracy of 95%, the highest accuracy seen without the use of a language model. When trained on keystrokes recorded using the video-conferencing software Zoom, an accuracy of 93% was achieved, a new best for the medium. Our results prove the practicality of these side channel attacks via off-the-shelf equipment and algorithms. We discuss a series of mitigation methods to protect users against these series of attacks.

News article.

Posted on August 9, 2023 at 7:08 AM • View Comments

Indirect Instruction Injection in Multi-Modal LLMs

Interesting research: “(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs”:

Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker’s instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.

Posted on July 28, 2023 at 7:06 AM • View Comments

1 2 3 … 6 Next→

Sidebar photo of Bruce Schneier by Joe MacInnis.