Extracting Personal Information from Large Language Models Like GPT-2

Researchers have been able to find all sorts of personal information within GPT-2. This information was part of the training data, and can be extracted with the right sorts of queries.

Paper: “Extracting Training Data from Large Language Models.”

Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.

We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. For example, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

From a blog post:

We generated a total of 600,000 samples by querying GPT-2 with three different sampling strategies. Each sample contains 256 tokens, or roughly 200 words on average. Among these samples, we selected 1,800 samples with abnormally high likelihood for manual inspection. Out of the 1,800 samples, we found 604 that contain text which is reproduced verbatim from the training set.
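The pipeline described above — generate many samples, then rank them by likelihood and inspect the outliers — can be sketched with a toy character-bigram model standing in for GPT-2. Everything here (the model, the "training text", the candidate samples) is illustrative, not taken from the paper; the researchers used GPT-2's own probabilities and several more refined ranking metrics:

```python
import math

def train_bigram(text):
    # Count character-bigram frequencies to build a toy "language model".
    counts = {}
    for a, b in zip(text, text[1:]):
        counts.setdefault(a, {}).setdefault(b, 0)
        counts[a][b] += 1
    probs = {}
    for a, nexts in counts.items():
        total = sum(nexts.values())
        probs[a] = {b: n / total for b, n in nexts.items()}
    return probs

def avg_log_likelihood(model, sample, floor=1e-6):
    # Mean per-token log-probability. Abnormally high likelihood is the
    # signal used to flag sequences the model may have memorized verbatim.
    lp = 0.0
    for a, b in zip(sample, sample[1:]):
        lp += math.log(model.get(a, {}).get(b, floor))
    return lp / max(len(sample) - 1, 1)

def rank_samples(model, samples, top_k):
    # Keep only the top_k highest-likelihood samples for manual inspection,
    # mirroring the 1,800-of-600,000 selection step in the blog post.
    return sorted(samples, key=lambda s: avg_log_likelihood(model, s),
                  reverse=True)[:top_k]

# A string repeated in the "training data" behaves like a memorized sequence.
training_text = "the secret key is 12345. " * 20
model = train_bigram(training_text)
samples = ["the secret key is 12345.", "zqxjkw vbnp", "the cat sat on the mat"]
top = rank_samples(model, samples, top_k=1)
```

With this toy model, the memorized training string scores far above unrelated text, which is exactly why likelihood ranking surfaces verbatim training data for a human to confirm.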

The rest of the blog post discusses the types of data they found.

Posted on January 7, 2021 at 6:14 AM


OneAnonTechie January 7, 2021 7:16 AM

I do not think that this conclusion is surprising. A lot of times, the training samples used will be reproduced verbatim … and I think this would be directly proportional to the data set size.

Winter January 7, 2021 8:25 AM

Old news:

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks


Clive Robinson January 7, 2021 8:51 AM

@ Winter,

Old news

True, but AI / Neural Networks are notoriously hard to get any kind of “sense” or “reason” out of.

So the more ways they can be “rolled back” to reveal their internal mess and how choices are made, the easier it becomes to spot how bad data is deliberately used in training, so that a Directing Mind’s (DM) desired prejudiced outcomes happen. But instead of the DM getting held up to scrutiny, the blame stops at the AI…

Winter January 7, 2021 9:04 AM

“So the more ways they can be “rolled back” to reveal their internal mess and how choices are made,”

That is certainly true.

Tatütata January 7, 2021 9:12 AM

This is hardly a new result: bias in the output of automated language-translation systems showed which corpus they were trained on, and some spit out verbatim quotes as nonsensical results.

But the coloured output is apparently exactly what is striven for in language models. From a description of GPT-2 (https://openai.com/blog/better-language-models/):

We can also imagine the application of these models for malicious purposes, including the following (or other applications we can’t yet anticipate):

– Generate misleading news articles
– Impersonate others online
– Automate the production of abusive or faked content to post on social media
– Automate the production of spam/phishing content

What can a crude censorbot achieve against that?

In the meantime, spam readily goes through, as content-free snippets of text with the “payload” included in the useless URL field of the posting window…

jones January 7, 2021 4:47 PM

I’ve seen a couple examples of this, and have found it in my own experiments with machine learning.

This paper looks at ways to test a trained language model for things it might be memorizing:


and some Google researchers found that an AI they were training had learned to deceive them through a learned steganographic technique in satellite images:


One state-of-the-art way to train a neural network involves one network competing to deceive another: the “generative adversarial network” approach.

NVIDIA makes graphics cards, but also the hardware that runs a lot of remote compute centers for AI applications like self-driving cars and voice recognition. NVIDIA now owns ARM. We’re about to be surrounded by inter-connected AI systems that have been taught to deceive us, and we don’t even know how they really work.

Jesse Thompson January 11, 2021 6:11 PM

So is this a problem because of the data actually available to cull from GPT-2, or because the technique may work in circumstances where the training data were more sensitive than a publicly documented web scrape?

Would the same attack work on GPT-3 despite that dataset being closed-hosted? I mean it’s inarguably a much, much bigger dataset.

Should we just send the NLPs off to OPSEC camp with the humans to train in the torture-resistance techniques we already have on the books?
