Extracting Personal Information from Large Language Models Like GPT-2

Researchers have been able to find all sorts of personal information within GPT-2. This information was part of the training data, and can be extracted with the right sorts of queries.

Paper: “Extracting Training Data from Large Language Models.”

Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model’s training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data.

We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. For example, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

From a blog post:

We generated a total of 600,000 samples by querying GPT-2 with three different sampling strategies. Each sample contains 256 tokens, or roughly 200 words on average. Among these samples, we selected 1,800 samples with abnormally high likelihood for manual inspection. Out of the 1,800 samples, we found 604 that contain text which is reproduced verbatim from the training set.

The rest of the blog post discusses the types of data they found.

Posted on January 7, 2021 at 6:14 AM9 Comments

Backdoor in Zyxel Firewalls and Gateways

This is bad:

More than 100,000 Zyxel firewalls, VPN gateways, and access point controllers contain a hardcoded admin-level backdoor account that can grant attackers root access to devices via either the SSH interface or the web administration panel.

[…]

Installing patches removes the backdoor account, which, according to Eye Control researchers, uses the “zyfwp” username and the “PrOw!aN_fXp” password.

“The plaintext password was visible in one of the binaries on the system,” the Dutch researchers said in a report published before the Christmas 2020 holiday.

Posted on January 6, 2021 at 5:44 AM13 Comments

Latest on the SVR’s SolarWinds Hack

The New York Times has an in-depth article on the latest information about the SolarWinds hack (not a great name, since it’s much more far-reaching than that).

Interviews with key players investigating what intelligence agencies believe to be an operation by Russia’s S.V.R. intelligence service revealed these points:

  • The breach is far broader than first believed. Initial estimates were that Russia sent its probes only into a few dozen of the 18,000 government and private networks they gained access to when they inserted code into network management software made by a Texas company named SolarWinds. But as businesses like Amazon and Microsoft that provide cloud services dig deeper for evidence, it now appears Russia exploited multiple layers of the supply chain to gain access to as many as 250 networks.
  • The hackers managed their intrusion from servers inside the United States, exploiting legal prohibitions on the National Security Agency from engaging in domestic surveillance and eluding cyberdefenses deployed by the Department of Homeland Security.
  • “Early warning” sensors placed by Cyber Command and the National Security Agency deep inside foreign networks to detect brewing attacks clearly failed. There is also no indication yet that any human intelligence alerted the United States to the hacking.
  • The government’s emphasis on election defense, while critical in 2020, may have diverted resources and attention from long-brewing problems like protecting the “supply chain” of software. In the private sector, too, companies that were focused on election security, like FireEye and Microsoft, are now revealing that they were breached as part of the larger supply chain attack.
  • SolarWinds, the company that the hackers used as a conduit for their attacks, had a history of lackluster security for its products, making it an easy target, according to current and former employees and government investigators. Its chief executive, Kevin B. Thompson, who is leaving his job after 11 years, has sidestepped the question of whether his company should have detected the intrusion.
  • Some of the compromised SolarWinds software was engineered in Eastern Europe, and American investigators are now examining whether the incursion originated there, where Russian intelligence operatives are deeply rooted.

Separately, it seems that the SVR conducted a dry run of the attack five months before the actual attack:

The hackers distributed malicious files from the SolarWinds network in October 2019, five months before previously reported files were sent to victims through the company’s software update servers. The October files, distributed to customers on Oct. 10, did not have a backdoor embedded in them, however, in the way that subsequent malicious files that victims downloaded in the spring of 2020 did, and these files went undetected until this month.

[…]

“This tells us the actor had access to SolarWinds’ environment much earlier than this year. We know at minimum they had access Oct. 10, 2019. But they would certainly have had to have access longer than that,” says the source. “So that intrusion [into SolarWinds] has to originate probably at least a couple of months before that ­- probably at least mid-2019 [if not earlier].”

The files distributed to victims in October 2019 were signed with a legitimate SolarWinds certificate to make them appear to be authentic code for the company’s Orion Platform software, a tool used by system administrators to monitor and configure servers and other computer hardware on their network.

Posted on January 5, 2021 at 6:42 AM40 Comments

Military Cryptanalytics, Part III

The NSA has just declassified and released a redacted version of Military Cryptanalytics, Part III, by Lambros D. Callimahos, October 1977.

Parts I and II, by Lambros D. Callimahos and William F. Friedman, were released decades ago — I believe repeatedly, in increasingly unredacted form — and published by the late Wayne Griswold Barker’s Agean Park Press. I own them in hardcover.

Like Parts I and II, Part III is primarily concerned with pre-computer ciphers. At this point, the document only has historical interest. If there is any lesson for today, it’s that modern cryptanalysis is possible primarily because people make mistakes

The monograph a while to become public. The cover page says that the initial FOIA request was made in July 2012: eight and a half years ago.

And there’s more books to come. Page 1 starts off:

This text constitutes the third of six basic texts on the science of cryptanalytics. The first two texts together have covered most of the necessary fundamentals of cryptanalytics; this and the remaining three texts will be devoted to more specialized and more advanced aspects of the science.

Presumably, volumes IV, V, and VI are still hidden inside the classified libraries of the NSA.

And from page ii:

Chapters IV-XI are revisions of seven of my monographs in the NSA Technical Literature Series, viz: Monograph No. 19, “The Cryptanalysis of Ciphertext and Plaintext Autokey Systems”; Monograph No. 20, “The Analysis of Systems Employing Long or Continuous Keys”; Monograph No. 21, “The Analysis of Cylindrical Cipher Devices and Strip Cipher Systems”; Monograph No. 22, “The Analysis of Systems Employing Geared Disk Cryptomechanisms”; Monograph No.23, “Fundamentals of Key Analysis”; Monograph No. 15, “An Introduction to Teleprinter Key Analysis”; and Monograph No. 18, “Ars Conjectandi: The Fundamentals of Cryptodiagnosis.”

This points to a whole series of still-classified monographs whose titles we do not even know.

EDITED TO ADD: I have been informed by a reliable source that Parts 4 through 6 were never completed. There may be fragments and notes, but no finished works.

Posted on January 4, 2021 at 2:34 PM4 Comments

Amazon Has Trucks Filled with Hard Drives and an Armed Guard

From an interview with an Amazon Web Services security engineer:

So when you use AWS, part of what you’re paying for is security.

Right; it’s part of what we sell. Let’s say a prospective customer comes to AWS. They say, “I like pay-as-you-go pricing. Tell me more about that.” We say, “Okay, here’s how much you can use at peak capacity. Here are the savings we can see in your case.”

Then the company says, “How do I know that I’m secure on AWS?” And this is where the heat turns up. This is where we get them. We say, “Well, let’s take a look at what you’re doing right now and see if we can offer a comparable level of security.” So they tell us about the setup of their data centers.

We say, “Oh my! It seems like we have level five security and your data center has level three security. Are you really comfortable staying where you are?” The customer figures, not only am I going to save money by going with AWS, I also just became aware that I’m not nearly as secure as I thought.

Plus, we make it easy to migrate and difficult to leave. If you have a ton of data in your data center and you want to move it to AWS but you don’t want to send it over the internet, we’ll send an eighteen-wheeler to you filled with hard drives, plug it into your data center with a fiber optic cable, and then drive it across the country to us after loading it up with your data.

What? How do you do that?

We have a product called Snowmobile. It’s a gas-guzzling truck. There are no public pictures of the inside, but it’s pretty cool. It’s like a modular datacenter on wheels. And customers rightly expect that if they load a truck with all their data, they want security for that truck. So there’s an armed guard in it at all times.

It’s a pretty easy sell. If a customer looks at that option, they say, yeah, of course I want the giant truck and the guy with a gun to move my data, not some crappy system that I develop on my own.

Lots more about how AWS views security, and Keith Alexander’s position on Amazon’s board of directors, in the interview.

Found on Slashdot.

Posted on January 4, 2021 at 6:11 AM51 Comments

Brexit Deal Mandates Old Insecure Crypto Algorithms

In what is surely an unthinking cut-and-paste issue, page 921 of the Brexit deal mandates the use of SHA-1 and 1024-bit RSA:

The open standard s/MIME as extension to de facto e-mail standard SMTP will be deployed to encrypt messages containing DNA profile information. The protocol s/MIME (V3) allows signed receipts, security labels, and secure mailing lists… The underlying certificate used by s/MIME mechanism has to be in compliance with X.509 standard…. The processing rules for s/MIME encryption operations… are as follows:

  1. the sequence of the operations is: first encryption and then signing,
  2. the encryption algorithm AES (Advanced Encryption Standard) with 256 bit key length and RSA with 1,024 bit key length shall be applied for symmetric and asymmetric encryption respectively,
  3. the hash algorithm SHA-1 shall be applied.
  4. s/MIME functionality is built into the vast majority of modern e-mail software packages including Outlook, Mozilla Mail as well as Netscape Communicator 4.x and inter-operates among all major e-mail software packages.

And s/MIME? Bleah.

Posted on December 31, 2020 at 6:19 AM31 Comments

On the Evolution of Ransomware

Good article on the evolution of ransomware:

Though some researchers say that the scale and severity of ransomware attacks crossed a bright line in 2020, others describe this year as simply the next step in a gradual and, unfortunately, predictable devolution. After years spent honing their techniques, attackers are growing bolder. They’ve begun to incorporate other types of extortion like blackmail into their arsenals, by exfiltrating an organization’s data and then threatening to release it if the victim doesn’t pay an additional fee. Most significantly, ransomware attackers have transitioned from a model in which they hit lots of individuals and accumulated many small ransom payments to one where they carefully plan attacks against a smaller group of large targets from which they can demand massive ransoms. The antivirus firm Emsisoft found that the average requested fee has increased from about $5,000 in 2018 to about $200,000 this year.

Ransomware is a decades-old idea. Today, it’s increasingly profitable and professional.

Posted on December 30, 2020 at 6:33 AM29 Comments

Russia’s SolarWinds Attack

Recent news articles have all been talking about the massive Russian cyberattack against the United States, but that’s wrong on two accounts. It wasn’t a cyberattack in international relations terms, it was espionage. And the victim wasn’t just the US, it was the entire world. But it was massive, and it is dangerous.

Espionage is internationally allowed in peacetime. The problem is that both espionage and cyberattacks require the same computer and network intrusions, and the difference is only a few keystrokes. And since this Russian operation isn’t at all targeted, the entire world is at risk — and not just from Russia. Many countries carry out these sorts of operations, none more extensively than the US. The solution is to prioritize security and defense over espionage and attack.

Here’s what we know: Orion is a network management product from a company named SolarWinds, with over 300,000 customers worldwide. Sometime before March, hackers working for the Russian SVR — previously known as the KGB — hacked into SolarWinds and slipped a backdoor into an Orion software update. (We don’t know how, but last year the company’s update server was protected by the password “solarwinds123” — something that speaks to a lack of security culture.) Users who downloaded and installed that corrupted update between March and June unwittingly gave SVR hackers access to their networks.

This is called a supply-chain attack, because it targets a supplier to an organization rather than an organization itself — and can affect all of a supplier’s customers. It’s an increasingly common way to attack networks. Other examples of this sort of attack include fake apps in the Google Play store, and hacked replacement screens for your smartphone.

SolarWinds has removed its customer list from its website, but the Internet Archive saved it: all five branches of the US military, the state department, the White House, the NSA, 425 of the Fortune 500 companies, all five of the top five accounting firms, and hundreds of universities and colleges. In an SEC filing, SolarWinds said that it believes “fewer than 18,000” of those customers installed this malicious update, another way of saying that more than 17,000 did.

That’s a lot of vulnerable networks, and it’s inconceivable that the SVR penetrated them all. Instead, it chose carefully from its cornucopia of targets. Microsoft’s analysis identified 40 customers who were infiltrated using this vulnerability. The great majority of those were in the US, but networks in Canada, Mexico, Belgium, Spain, the UK, Israel and the UAE were also targeted. This list includes governments, government contractors, IT companies, thinktanks, and NGOs — and it will certainly grow.

Once inside a network, SVR hackers followed a standard playbook: establish persistent access that will remain even if the initial vulnerability is fixed; move laterally around the network by compromising additional systems and accounts; and then exfiltrate data. Not being a SolarWinds customer is no guarantee of security; this SVR operation used other initial infection vectors and techniques as well. These are sophisticated and patient hackers, and we’re only just learning some of the techniques involved here.

Recovering from this attack isn’t easy. Because any SVR hackers would establish persistent access, the only way to ensure that your network isn’t compromised is to burn it to the ground and rebuild it, similar to reinstalling your computer’s operating system to recover from a bad hack. This is how a lot of sysadmins are going to spend their Christmas holiday, and even then they can&;t be sure. There are many ways to establish persistent access that survive rebuilding individual computers and networks. We know, for example, of an NSA exploit that remains on a hard drive even after it is reformatted. Code for that exploit was part of the Equation Group tools that the Shadow Brokers — again believed to be Russia — stole from the NSA and published in 2016. The SVR probably has the same kinds of tools.

Even without that caveat, many network administrators won’t go through the long, painful, and potentially expensive rebuilding process. They’ll just hope for the best.

It’s hard to overstate how bad this is. We are still learning about US government organizations breached: the state department, the treasury department, homeland security, the Los Alamos and Sandia National Laboratories (where nuclear weapons are developed), the National Nuclear Security Administration, the National Institutes of Health, and many more. At this point, there’s no indication that any classified networks were penetrated, although that could change easily. It will take years to learn which networks the SVR has penetrated, and where it still has access. Much of that will probably be classified, which means that we, the public, will never know.

And now that the Orion vulnerability is public, other governments and cybercriminals will use it to penetrate vulnerable networks. I can guarantee you that the NSA is using the SVR’s hack to infiltrate other networks; why would they not? (Do any Russian organizations use Orion? Probably.)

While this is a security failure of enormous proportions, it is not, as Senator Richard Durban said, “virtually a declaration of war by Russia on the United States.” While President-elect Biden said he will make this a top priority, it’s unlikely that he will do much to retaliate.

The reason is that, by international norms, Russia did nothing wrong. This is the normal state of affairs. Countries spy on each other all the time. There are no rules or even norms, and it’s basically “buyer beware.” The US regularly fails to retaliate against espionage operations — such as China’s hack of the Office of Personal Management (OPM) and previous Russian hacks — because we do it, too. Speaking of the OPM hack, the then director of national intelligence, James Clapper, said: “You have to kind of salute the Chinese for what they did. If we had the opportunity to do that, I don’t think we’d hesitate for a minute.”

We don’t, and I’m sure NSA employees are grudgingly impressed with the SVR. The US has by far the most extensive and aggressive intelligence operation in the world. The NSA’s budget is the largest of any intelligence agency. It aggressively leverages the US’s position controlling most of the internet backbone and most of the major internet companies. Edward Snowden disclosed many targets of its efforts around 2014, which then included 193 countries, the World Bank, the IMF and the International Atomic Energy Agency. We are undoubtedly running an offensive operation on the scale of this SVR operation right now, and it’ll probably never be made public. In 2016, President Obama boasted that we have “more capacity than anybody both offensively and defensively.”

He may have been too optimistic about our defensive capability. The US prioritizes and spends many times more on offense than on defensive cybersecurity. In recent years, the NSA has adopted a strategy of “persistent engagement,” sometimes called “defending forward.” The idea is that instead of passively waiting for the enemy to attack our networks and infrastructure, we go on the offensive and disrupt attacks before they get to us. This strategy was credited with foiling a plot by the Russian Internet Research Agency to disrupt the 2018 elections.

But if persistent engagement is so effective, how could it have missed this massive SVR operation? It seems that pretty much the entire US government was unknowingly sending information back to Moscow. If we had been watching everything the Russians were doing, we would have seen some evidence of this. The Russians’ success under the watchful eye of the NSA and US Cyber Command shows that this is a failed approach.

And how did US defensive capability miss this? The only reason we know about this breach is because, earlier this month, the security company FireEye discovered that it had been hacked. During its own audit of its network, it uncovered the Orion vulnerability and alerted the US government. Why don’t organizations like the Departments of State, Treasury and Homeland Wecurity regularly conduct that level of audit on their own systems? The government’s intrusion detection system, Einstein 3, failed here because it doesn’t detect new sophisticated attacks — a deficiency pointed out in 2018 but never fixed. We shouldn’t have to rely on a private cybersecurity company to alert us of a major nation-state attack.

If anything, the US’s prioritization of offense over defense makes us less safe. In the interests of surveillance, the NSA has pushed for an insecure cell phone encryption standard and a backdoor in random number generators (important for secure encryption). The DoJ has never relented in its insistence that the world’s popular encryption systems be made insecure through back doors — another hot point where attack and defense are in conflict. In other words, we allow for insecure standards and systems, because we can use them to spy on others.

We need to adopt a defense-dominant strategy. As computers and the internet become increasingly essential to society, cyberattacks are likely to be the precursor to actual war. We are simply too vulnerable when we prioritize offense, even if we have to give up the advantage of using those insecurities to spy on others.

Our vulnerability is magnified as eavesdropping may bleed into a direct attack. The SVR’s access allows them not only to eavesdrop, but also to modify data, degrade network performance, or erase entire networks. The first might be normal spying, but the second certainly could be considered an act of war. Russia is almost certainly laying the groundwork for future attack.

This preparation would not be unprecedented. There’s a lot of attack going on in the world. In 2010, the US and Israel attacked the Iranian nuclear program. In 2012, Iran attacked the Saudi national oil company. North Korea attacked Sony in 2014. Russia attacked the Ukrainian power grid in 2015 and 2016. Russia is hacking the US power grid, and the US is hacking Russia’s power grid — just in case the capability is needed someday. All of these attacks began as a spying operation. Security vulnerabilities have real-world consequences.

We’re not going to be able to secure our networks and systems in this no-rules, free-for-all every-network-for-itself world. The US needs to willingly give up part of its offensive advantage in cyberspace in exchange for a vastly more secure global cyberspace. We need to invest in securing the world’s supply chains from this type of attack, and to press for international norms and agreements prioritizing cybersecurity, like the 2018 Paris Call for Trust and Security in Cyberspace or the Global Commission on the Stability of Cyberspace. Hardening widely used software like Orion (or the core internet protocols) helps everyone. We need to dampen this offensive arms race rather than exacerbate it, and work towards cyber peace. Otherwise, hypocritically criticizing the Russians for doing the same thing we do every day won’t help create the safer world in which we all want to live.

This essay previously appeared in the Guardian.

Posted on December 28, 2020 at 6:21 AM61 Comments

Sidebar photo of Bruce Schneier by Joe MacInnis.