Extracting Personal Information from Large Language Models Like GPT-2
Researchers have been able to find all sorts of personal information within GPT-2. This information was part of the training data, and can be extracted with the right sorts of queries.
Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.
We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. For example, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
From a blog post:
We generated a total of 600,000 samples by querying GPT-2 with three different sampling strategies. Each sample contains 256 tokens, or roughly 200 words on average. Among these samples, we selected 1,800 samples with abnormally high likelihood for manual inspection. Out of the 1,800 samples, we found 604 that contain text which is reproduced verbatim from the training set.
The rest of the blog post discusses the types of data they found.