Hardware Bit-Flipping Attack

The Project Zero team at Google has posted details of a new attack that targets a computer’s DRAM. It’s called Rowhammer. Here’s a good description:

Here’s how Rowhammer gets its name: In the Dynamic Random Access Memory (DRAM) used in some laptops, a hacker can run a program designed to repeatedly access a certain row of transistors in the computer’s memory, “hammering” it until the charge from that row leaks into the next row of memory. That electromagnetic leakage can cause what’s known as “bit flipping,” in which transistors in the neighboring row of memory have their state reversed, turning ones into zeros or vice versa. And for the first time, the Google researchers have shown that they can use that bit flipping to actually gain unintended levels of control over a victim computer. Their Rowhammer hack can allow a “privilege escalation,” expanding the attacker’s influence beyond a certain fenced-in portion of memory to more sensitive areas.

Basically:

When run on a machine vulnerable to the rowhammer problem, the process was able to induce bit flips in page table entries (PTEs). It was able to use this to gain write access to its own page table, and hence gain read-write access to all of physical memory.

The cause is simply the super dense packing of chips:

This works because DRAM cells have been getting smaller and closer together. As DRAM manufacturing scales down chip features to smaller physical dimensions, to fit more memory capacity onto a chip, it has become harder to prevent DRAM cells from interacting electrically with each other. As a result, accessing one location in memory can disturb neighbouring locations, causing charge to leak into or out of neighbouring cells. With enough accesses, this can change a cell’s value from 1 to 0 or vice versa.
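To make the mechanism concrete, here is a toy model in Python. It is only a sketch: the `ToyDram` class, the `hammer` helper, and the flip threshold are invented for illustration and do not describe any real chip or the Project Zero code. Each activation of a row adds "disturbance" to its two neighbours, a refresh clears it, and a neighbour records a flip once the disturbance reaches the threshold.

```python
class ToyDram:
    """Toy disturbance model; the threshold is made up, not a real chip parameter."""

    def __init__(self, rows, flip_threshold):
        self.flip_threshold = flip_threshold
        self.disturb = [0] * rows   # neighbour activations seen since the last refresh
        self.flip_events = []       # rows that would have suffered a bit flip

    def activate(self, row):
        """Access one row; both adjacent rows accumulate disturbance."""
        for victim in (row - 1, row + 1):
            if 0 <= victim < len(self.disturb):
                self.disturb[victim] += 1
                if self.disturb[victim] >= self.flip_threshold:
                    self.flip_events.append(victim)
                    self.disturb[victim] = 0

    def refresh(self):
        """A refresh restores the charge; here it simply resets the counters."""
        self.disturb = [0] * len(self.disturb)


def hammer(dram, aggressor, accesses, refresh_every):
    """Activate one row repeatedly, refreshing every `refresh_every` accesses."""
    for i in range(accesses):
        if i % refresh_every == 0:
            dram.refresh()
        dram.activate(aggressor)
    return len(dram.flip_events)


# Rare refresh: the neighbours of row 4 cross the threshold repeatedly.
print(hammer(ToyDram(rows=8, flip_threshold=1000), 4, 5000, refresh_every=10_000))  # 10
# Frequent refresh: the disturbance never builds up far enough.
print(hammer(ToyDram(rows=8, flip_threshold=1000), 4, 5000, refresh_every=500))     # 0
```

The model ignores everything that makes the real attack hard (cache flushing, finding which addresses share a bank, data-dependent flips), but it shows why the number of accesses between refreshes is the quantity that matters.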

Very clever, and yet another example of the security interplay between hardware and software.

This kind of thing is hard to fix, although the Google team gives some mitigation techniques at the end of their analysis.

Slashdot thread.

EDITED TO ADD (3/12): Good explanation of the vulnerability.

Posted on March 11, 2015 at 6:16 AM • 37 Comments

Comments

Michael And Ingrid Heroux March 11, 2015 7:31 AM

I’m not sure if my system is vulnerable. I use one of those mini ACER systems with the Intel Atom processor, I think they call it an IEEE system. I don’t worry about security but I keep a close eye on my system. I don’t use a firewall or an anti-virus, and I don’t use a password or anything. I get the odd rootkit sometimes but not too often. I use an outdated Debian distro with all kinds of exploitable bugs, I even still have Shellshock. I have to update my system. I usually get the latest distro of Debian and install it on a zeroed out hard drive and install my favorite software. Then I get the latest Knoppix version and take Klaus’s scripts from it and transfer them to my newly installed Debian system and I compile a Knoppix cloop image from my Debian system and put it on a thumb drive with a FAT16 partition with the Knoppix boot loader and that is what I use. I want to experiment with compiling my own custom Knoppix kernels to see if I can limit some kernel exploits. Harden it a bit I guess you would say, but if I get infected I just reboot and put it back into memory and I never keep a thumb drive in the system and I never use a hard drive, I took it out when I got the laptop. It was a hand me down but it does everything I need.

Clive Robinson March 11, 2015 7:41 AM

This problem with bit flipping goes back at least as far as the mid 1980’s and was often seen when people tried to use the “RAS & CAS” lines too quickly by having a shorter time delay line or too much extra going on. Even with contention logic, video generation and DMA with high speed IO but only short length buffers in memory were a real pain for this sort of thing.

Part of the solution was to start using 9bit wide memory such that the last bit was used as a parity bit. This was not that popular because cascading through nine XOR gates slowed things down and faster parity chips were quite a bit more expensive.

The problem with parity is it only catches odd numbers of bit flips per byte, thus two bit flips were not detected…
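A small Python sketch of that limitation, assuming even parity over one byte (the function names are made up for illustration): one flipped bit changes the parity and is caught, two flipped bits cancel out and slip through.

```python
def parity_bit(byte):
    """Even parity: the ninth bit that makes the total count of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

one_flip = original ^ 0b00000100    # a single bit disturbed
two_flips = original ^ 0b00010100   # two bits disturbed in the same byte

print(parity_ok(original, stored))   # True
print(parity_ok(one_flip, stored))   # False: detected
print(parity_ok(two_flips, stored))  # True: the corruption goes unnoticed
```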

Using it to munch on page tables is just one of many attacks that can be done, however it’s perhaps the easiest to stop going rogue during an exploit.

It’s a security problem that gets hardly discussed these days, which is unfortunate because it’s a weakness all “security tagged” memory suffers from thus getting around Read Only and Code Only flags used on “word level” security.

The first step is using error correcting memory, but how do you know the system is actually using it correctly… and that it’s using something a little more reliable than parity…

It’s why it’s desirable for higher security systems that you also hash memory blocks and store the results in memory the user or IO executable code cannot get at. And that you rehash and check memory on a regular basis. Unfortunately this also requires quite a bit of code to be either rewritten or recompiled by a compiler that will make the hashing both simpler and more reliable. Obviously, slow as they might otherwise be, interpreters with garbage collection can make the process less painful for the programmer and ultimately the user.
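As a rough illustration of the hash-and-recheck idea, here is a Python sketch. The block size, the function names, and the use of SHA-256 are assumptions made for the example; a real system would also have to keep the trusted hashes somewhere the hammering process cannot disturb, which this sketch does not attempt.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the sketch (one page)


def hash_blocks(memory, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `memory`."""
    return [
        hashlib.sha256(memory[start:start + block_size]).digest()
        for start in range(0, len(memory), block_size)
    ]


def find_corrupted_blocks(memory, trusted_hashes, block_size=BLOCK_SIZE):
    """Rehash every block and report the indexes whose digest changed."""
    current = hash_blocks(memory, block_size)
    return [i for i, (now, then) in enumerate(zip(current, trusted_hashes)) if now != then]


memory = bytearray(4 * BLOCK_SIZE)       # stand-in for a protected region
trusted = hash_blocks(memory)            # taken while the contents are known good

memory[2 * BLOCK_SIZE + 17] ^= 0x08      # simulate one flipped bit in block 2
print(find_corrupted_blocks(memory, trusted))  # [2]
```

Unlike parity, a cryptographic hash catches any number of flipped bits in a block, but only at the next check, and only for memory that is not supposed to change between checks.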

People need to revisit computer history, and remember the computer stack goes down a long way below the assembler code level, and thus attacks can “bubble up” into the software and above layers.

Regular readers will know I’ve said this on a number of occasions here in the past 😉

Michael March 11, 2015 7:46 AM

I find the statement

“We also tested some desktop machines, but did not see any bit flips on those. That could be because they were all relatively high-end machines with ECC memory. The ECC could be hiding bit flips.”

to be somewhat misleading. ECC RAM doesn’t hide bit flips; it corrects single-bit flips, and can detect but not correct larger errors.

So an easy way to mitigate the attack would seem to be to use ECC RAM. It will either more-or-less-silently correct for the error, or will trigger a NMI normally causing the system to be halted immediately, preventing any further damage. It’s a shame that many Intel laptop and desktop CPUs don’t support ECC RAM or even parity RAM (which can detect but not correct single-bit errors).
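The "correct one, detect two" behaviour can be shown with a scaled-down code. The Python sketch below uses an extended Hamming (8,4) code: four data bits, three Hamming check bits, and one overall parity bit. This is an assumption chosen for brevity; ECC memory works on much wider words, and the `encode`, `decode`, and `flip` names are invented for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (a list of bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit for positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit for positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit for positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity bit
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    overall = sum(code) % 2
    if overall == 1:
        code[syndrome] ^= 1                 # assume one flip; syndrome 0 means bit 0 itself
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable"        # even number of flips, at least two
    else:
        status = "ok"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1011)
print(decode(word))                 # (11, 'ok')
print(decode(flip(word, 6)))        # (11, 'corrected')
print(decode(flip(word, 6, 3)))     # (None, 'uncorrectable')
print(decode(flip(word, 1, 2, 4)))  # (3, 'corrected'): three flips are silently miscorrected
```

The last line is the uncomfortable case: three flips in one word look like a single flip, so the decoder reports a successful correction while handing back the wrong data.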

Clive Robinson March 11, 2015 8:15 AM

One thing else I forgot to mention,

Bit flipping is also known to be caused by ionizing radiation. Back in the day it was thought that only alpha particles in the chip packaging material were to blame, plus solar radiation in high altitude and space systems. It transpires from later work that it was more likely to be neutrons as secondary effects of virtually unstoppable cosmic radiation, so a lead box is not going to help.

What is also known is that all radiation from DC to daylight and beyond will cause problems when logic changes state, due to issues to do with metastability.

Thus rapidly writing to the same areas of memory, especially in cheaper memory, may make adjacent memory more susceptible in address decoding logic.

Obviously cosmic radiation bit flips cannot be stopped and various solutions have been suggested in the past. The problem with parity and Hamming type solutions is not only are they not sufficiently reliable, they slow things down and worse consume quite a bit of extra power, giving rise to other issues including heat problems.

One solution for high availability systems is to duplicate the memory banks and put them in different axes along with simpler “voting logic”.
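The voting logic itself is tiny. A Python sketch, assuming three copies of each word are kept (the `majority_vote` name and the three-copy arrangement are illustrative, not a description of any particular product):

```python
def majority_vote(a, b, c):
    """Bitwise majority of three copies: each output bit is the value held by at least two."""
    return (a & b) | (a & c) | (b & c)


word = 0xDEADBEEF

# One copy damaged: the other two outvote it.
print(hex(majority_vote(word ^ (1 << 5), word, word)))   # 0xdeadbeef

# Two copies damaged in different bits: still recovered.
print(hex(majority_vote(word ^ 0b01, word ^ 0b10, word)))  # 0xdeadbeef

# Two copies damaged in the same bit: the vote goes the wrong way.
print(hex(majority_vote(word ^ 1, word ^ 1, word)))      # 0xdeadbeee
```

Placing the copies in physically separate banks is what makes the last case unlikely: a single disturbance should not hit the same bit in two banks at once.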

Sun, IBM and others have various improved systems for high availability servers and the like. All with different names, of which the oddest is IBM’s “ChipKill” which detects and corrects all sorts of memory issues.

Whichever way you solve the problem there is always a price to pay, be it in slower memory, more complex memory, more expensive and power hungry memory, or as with software hashing systems lost memory and CPU cycles.

gipi March 11, 2015 8:47 AM

All radiation from DC to daylight and beyond will cause problems when logic changes state. Maybe that sheds some light on those insanely powerful antennas Appelbaum found in the NSA toolkit.

Z.Lozinski March 11, 2015 8:52 AM

Because this is an attack on the physical memory, all of the protection mechanisms built into some processors (storage keys, capabilities etc.) don’t help much.

Separating the physical location of application memory from the physical location of system memory (esp. the DAT tables) is probably the minimum required to defeat this attack. You probably then need a third bank of memory for externally addressable DRAM for I/O. Good luck with that in a consumer grade system, as it will significantly increase the cost of the processor logic board and the operating system.

@Clive,
of which the oddest is IBM’s “ChipKill”

Does it help if I tell you that IBM’s major microelectronics facility was located in East Fishkill, NY? Yes, you can groan.

Aaron Spinkk March 11, 2015 9:58 AM

@Clive

Modern ECC codes present basically zero performance reduction in practice. This is due to a variety of factors, but suffice to say that the old “ECC is slow” bit is no longer correct.

In addition, ECC codes on modern server hardware are quite robust. Due to various other constraints, most codes no longer work on 64b word sizes but instead 256-512b word sizes. This allows them to be significantly more thorough both in detection and correction. Combined with using x4 based memory, a modern Xeon system can detect and correct a wide variety of errors with standard ECC memory. Basically chipkill is more a product of the building block of the DIMM modules than anything else.

RSaunders March 11, 2015 11:25 AM

I think this is a wonderful example of an attack with known defenses which are not employed. In the 1980’s bit flips were considered a possible defect, and ECC was employed to defeat it. Systems were more reliable and, against this sort of an attack, more secure. Users undervalued this reliability/ security, and it turns out most bit flips weren’t that harmful. Fast forward 30 years and we’ve stopped using ECC to save money, even at the almost inconceivably small cost of RAM today (in comparison to the 1980’s). So, for almost no savings, we remove a security enhancing technology that works and has no impact on processing performance (by which I mean modern ECC not 1980’s ECC).

In spite of security seeming important to us, because we read this blog, it’s not actually important to the consumers that are driving company decisions.

Nix March 11, 2015 11:30 AM

Michael, the Kim paper discusses ECCRAM as a mitigating factor explicitly in section 6.3. As table 5 shows, it does not suffice: while most errors are detected, you can fairly easily force multiple errors per word on at least some modules (that this test was conducted on only three makes it hard to draw definite conclusions), which are not correctable with most ECCRAM and thus will cause at least system impairment (DoS material) — and there are instances of three and even four errors per word, which is well into the realm of potentially uncorrectable and undetectable faults.

So ECCRAM might at best suffice to tell you when you’re under attack (a lot of corrected faults and some detected but not corrected ones), but won’t fix the fundamental problem that RAM that changes without being written to violates all sorts of invariants that basically everyone assumes without thought are true.

phred14 March 11, 2015 11:51 AM

I happen to be in memory design. For the past 15 years I’ve been working in embedded memory design, but prior to that I worked on DRAM and SDRAM.

We had a very comprehensive test spec, including “disturb tests”. I don’t remember all of the details this much later, but I know that there were tests that wrote a wordline, pounded on its neighbours for some number of cycles, then went back to see if that wordline properly retained its data. That sounds to me like this new attack, and we used to test for it, because at least at the time there were specific defects that caused bits to be sensitive.

That said, we were a high-end memory shop, selling high-spec memory. I know that many memory shops weren’t nearly as thorough with their testing as we were. I’ve also been out of standalone memory for quite a while, so I have no idea how things have changed.

Come to think of it, I had a colleague who is still in memory design at a different company. I need to ask him.

Russtopia March 11, 2015 1:01 PM

I found this attack really interesting because it answered a 20-year-old question for me… I was working on my undergrad project, designing a minimal realtime micro-kernel (workalike of OS-9 from first principles..)

My kernel was to the point where it had started the ‘tick’ handler (main OS clock). Nothing else — and I mean NOTHING — was running in the system at this point, so the system was just entering the IRQ handler, incrementing the global tick count, then exiting. My system was rebooting after an exact interval (120 sec or thereabouts). I found that doing two (logically) useless XOR operations with $55555555/$aaaaaaaa constants made the reboot go away. I figured at the time there must have been some cumulative issue with the RAM being accessed so tightly with no variation, and the XORs ‘exercised’ the bus somehow to dissipate the effect. I had no idea it had a name though.

Of course once the kernel was doing other things, like spawning processes and outputting boot messages to the console, the tick handler wasn’t the only thing running, and I could remove those XOR operations.

merino March 11, 2015 1:51 PM

OT: the UK’s foreign minister expresses his disappointment at the fact that the Snowden revelations haven’t fizzled out on their own. “Time to move on”, he has decided. And to demonstrate how noble his intentions are and how in touch he is with his people, he reassures us by announcing a new set of measures that will give the GCHQ “the powers they need” to take care of our concerns. In the context of modern UK politics, that undoubtedly means that anyone who is concerned will be identified and taken care of. I am so relieved the GCHQ will finally solve that “lack of power” problem that had us all so concerned.

albert March 11, 2015 3:15 PM

@vas pup
This attack illustrates the need to avoid knee-jerk reactions. The Madison incident is quite different than Ferguson, etc. Wisconsin has a special state commission to investigate police shooting incidents. The Madison chief of police immediately visited the family. The mayor immediately offered consolation to the family. The attackers either didn’t know this (ignorant idiots) or didn’t care (reckless endangerment). The most that can be done is to wait for the facts to be brought out. There is a good chance that can happen. Disabling police communications is a stupid and dangerous thing to do.
.
The perps may find their sorry asses being dragged into court. How ironic.
.

albert March 11, 2015 3:42 PM

@merino
It does appear that Snowden is quite a burr under the saddle of the UK politicians. Kinda makes you wonder what else he might come up with.
.
Hammond’s background is interesting indeed. He’s a poster-boy for Conservatives with a capital ‘C’. He’s worth at least $11M. This makes me lean towards the idea that he actually believes all the malarky he puts out. If he doesn’t (and it’s hard to see how he doesn’t know what’s really happening, given his position), then he’s lying. He’s a douchebag either way.
.
I notice he brings up the Russia bogeyman: “… including a heightened risk from Russia requiring the recruitment of more Russian-speakers to the intelligence agencies….”. NATO-FUD. The UK always marches to the beat of the US drum.
.
The US State Dept is the biggest threat to UK stability, and by extension, other NATO countries. When, if ever, will they wake up?
.

Pigbots March 11, 2015 3:43 PM

Quite different than Ferguson? Unless the killer cop is prosecuted and convicted, it’s exactly the same.

And the furious diversion is telling. Script kiddies are ‘perps,’ but not killer cops. ‘Reckless endangerment,’ not killer cops. I call pigbot.

Pigbots March 11, 2015 4:45 PM

Ooh, a commission. Consolation! Guess it’s even-Steven with the killer pig. Talk about anything but police impunity for extra-judicial killing, classic pigbot. And you wonder why everybody hates pigs.

Pigbots March 11, 2015 5:47 PM

Nobody needs you pigs. Well, except for the comic relief of your steroid-shriveled nuts and tender egos and 81 IQs. New York was much safer when you whining sissies went on strike. Do that more. Cop jobs are white man’s welfare, relief for people who are too stupid to get a real job. When the pols RIF you all and take your pension, everybody’s just gonna laugh.

But seriously. No surveillance for you. You’re too dumb. With fusion centers the feds come in and make the pigs do the dirty work that’s too dirty even for the FBI. The pigs go crazy for it, they just want to be anti-terror big shots like the feds. The feds lead them around by the nose with fake intel. That’s how they get pigs to kill RFK or Fred Hampton or Ibragim Todashev, or whoever your betters are most scared of.

Celos March 11, 2015 6:03 PM

The way I understand this is that many or even most instances of this vulnerability will be caused by slow refresh. (The rest will be defective RAM, but as somebody else said above, that is likely ElCheapo RAM anyways.) Larger time between refreshes causes cells to lose more charge and the thresholds for reading a charged vs. non-charged state need to be lower. This in turn makes the described attack easier. With fast enough refresh and higher reading thresholds, the attack may well become infeasible for good quality RAM.

Now, refreshing less often offers some advantages on laptops, namely lower power consumption, especially during hibernate, where memory refresh needs to continue at normal rates while everything else sleeps. I suspect that quite a few laptop RAM manufacturers have gotten over-optimistic and crossed the line that makes this attack practical by allowing very long refresh-cycles. They can in turn specify lower “idle” power consumption.

Contrary to this, typical server and desktop systems gain basically nothing from slower refresh as power consumption matters a lot less. On the other hand, more frequent refresh does not slow memory access down to any degree that matters. Hence I expect that conventional server and desktop memory will usually have refresh cycle specs for maximum reliability (which again may well make this attack infeasible).

If I am right, then the fix is not ECC (which makes exploiting this harder, but cannot really push it from feasible to infeasible), but saner refresh rates. I am unsure whether just refreshing more often is enough though, it may also be necessary to adjust some reading thresholds or other parameters. The refresh rate itself is adjustable in software or can be adjusted by reprogramming the EEPROM on the memory modules. While decidedly an expert-only approach, this problem may have a software fix after all and may be (mostly) limited to laptops and mobile devices in the first place.
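A back-of-the-envelope calculation shows why the refresh interval matters so much. The Python sketch below counts how many times one row can be activated between two refreshes of its neighbour. All the numbers are assumptions for illustration: 64 ms is a commonly quoted DRAM refresh window, 50 ns is a round figure for the time one row activation takes, and the flip threshold is hypothetical.

```python
def max_activations_per_window(refresh_window_ms, row_cycle_ns):
    """Upper bound on row activations that fit between two refreshes of a victim row."""
    return refresh_window_ms * 1_000_000 // row_cycle_ns


def attack_feasible(refresh_window_ms, row_cycle_ns, flip_threshold):
    """True if an attacker can reach the (hypothetical) flip threshold before a refresh."""
    return max_activations_per_window(refresh_window_ms, row_cycle_ns) >= flip_threshold


HYPOTHETICAL_THRESHOLD = 500_000  # made-up number of activations needed to flip a bit

for window_ms in (64, 32, 16):
    budget = max_activations_per_window(window_ms, 50)
    print(window_ms, budget, attack_feasible(window_ms, 50, HYPOTHETICAL_THRESHOLD))
# 64 1280000 True
# 32 640000 True
# 16 320000 False
```

Halving the window halves the attacker's budget, so whether faster refresh is a fix depends entirely on how low the real threshold of a given module is, which is the open question in the comment above.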

Celos March 11, 2015 8:42 PM

@Dirk Praet:

Did that already, no errors in 15 Minutes on my non-ECC server. Unfortunately, the article describing the results lacks information about how long they did run the tool. On the plus side, at least the normal version is simple enough that the source code is clear and nothing bad is hidden in it as far as I can see.

Curious March 12, 2015 3:34 AM

The Intercept showed a few documents from the NSA, and somewhere in a list it basically says that the NSA intends to make use of “error correction” for cryptanalysis and exploitation.

Project description: “Support work to provide capabilities against emerging communications technologies through error correction, demodulation, reverse-engineering, multiplexers, and personal communications interfaces”

Having a consumer grade PC, my RAM sticks are said to be non-ECC. I was thinking, could it be that ECC is built into my RAM sticks (don’t know if this is so), but isn’t used?

If ECC is built into my RAM sticks, can that feature somehow be used for exploitation, and might that be what the NSA is referring to in the quotation above about “error correction”?

Or, what else could error correction mean in that list?

Me not being an expert or anything at computer technology, I can’t help but wonder about such things.

Kuwait March 12, 2015 6:48 AM

An open-source Android app will soon appear, using rowhammering to root your phone. All I need to do that myself is some spare time (which I haven’t) :-(

Sparky March 12, 2015 2:43 PM

As far as I can tell, this attack only works on adjacent rows of RAM; could this be mitigated by introducing an unused row on either side of the RAM allocated by an application? These rows could be used to detect any attempts to do this, simply by filling them with predictable values and periodically checking if these values are unchanged.

This would of course require the RAM for any application to be allocated in contiguous blocks, because the overhead of the wasted rows increases as the average allocated block size decreases.

It seems so obvious, but I haven’t seen any mention of this; maybe because it wouldn’t work?
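For what it is worth, the guard-row idea can be sketched in a few lines of Python. Everything here is hypothetical: the row size, the fill value, and the function names are invented, and the sketch assumes the allocator knows which addresses land in which physical DRAM row, which ordinary software generally does not.

```python
ROW_BYTES = 64     # toy row size for the sketch; real DRAM rows are far larger
GUARD_BYTE = 0x55  # predictable fill value for the guard rows


def install_guards(memory, first_row, n_rows):
    """Fill the rows just before and just after an allocation with a known pattern."""
    guards = (first_row - 1, first_row + n_rows)
    for row in guards:
        memory[row * ROW_BYTES:(row + 1) * ROW_BYTES] = bytes([GUARD_BYTE]) * ROW_BYTES
    return guards


def disturbed_guards(memory, guards):
    """Return the guard rows whose contents no longer match the pattern."""
    expected = bytes([GUARD_BYTE]) * ROW_BYTES
    return [row for row in guards
            if memory[row * ROW_BYTES:(row + 1) * ROW_BYTES] != expected]


memory = bytearray(16 * ROW_BYTES)
guards = install_guards(memory, first_row=4, n_rows=3)   # application owns rows 4..6
print(guards)                                            # (3, 7)
print(disturbed_guards(memory, guards))                  # []

memory[7 * ROW_BYTES + 10] ^= 0x01                       # simulate a flip in the upper guard
print(disturbed_guards(memory, guards))                  # [7]
```

Note that the check only detects a flip after the fact, and only if the flip happens to land in a bit whose value the pattern makes vulnerable, so this is at best a tripwire rather than a fix.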

Observer March 13, 2015 5:43 PM

Back in 2003 a couple guys used regular old heat to escape Java sandboxes. At normal conditions it took about a month for a naturally occurring heat or cosmic ray-generated fault to occur on a PC without hammering. Blanket attacks meant to accrue botnets could turn this into a statistical waiting game. Also, just like rowhammer, shrinking circuits, fatter memory, and faster clock speeds are making faults happen more frequently.
We’ll probably see a lot more of this kind of attack. In fact, it probably works all too well on jailbreaking/rooting smartphones.
http://sip.cs.princeton.edu/projects/memerr/
“Our attack works by sending to the JVM a Java program that is designed so that almost any memory error in its address space will allow it to take control of the JVM. All conventional Java and .NET virtual machines are vulnerable to this attack. The technique of the attack is broadly applicable against other language-based security schemes such as proof-carrying code.
“We measured the attack on two commercial Java Virtual Machines: Sun’s and IBM’s. We show that a single-bit error in the Java program’s data space can be exploited to execute arbitrary code with a probability of about 70%, and multiple-bit errors with a lower probability.
“Our attack is particularly relevant against smart cards or tamper-resistant computers, where the user has physical access (to the outside of the computer) and can use various means to induce faults; we have successfully used heat. Fortunately, there are some straightforward defenses against this attack.”

Tim March 16, 2015 2:17 PM

While row-hammer was a problem, I do not see it in recent DRAMs from Micron or Samsung. It is after all a defect if it has it. I have run 1000’s of hours of tests on dozens of DIMMs to validate memory components for our product and not encountered one yet, so while it may exist at some residual level I can’t see it being an effective way to compromise computers going forward. At least not for ones that are being properly refreshed.

sena kavote March 18, 2015 8:00 AM

We should see at least statements about this Rowhammer problem from these:

Linux kernel project

FreeBSD, OpenBSD, netBSD…

Microsoft

Apple

Minix

GNU Hurd

Virtualbox

XEN

VMware

Docker

bhyve

gnome system monitor, ksysguard and htop projects about detecting row hammering processes

systemD project

Memory module makers like Kingston

Intel

AMD

Nvidia

I can think of several things to counter this in software and hardware. Some measures are temporary waiting for better solutions and some may not work.

sooth sayer March 28, 2015 9:08 PM

I used to think CLIVE is CLEVER .. but i just learnt after 10 years that he is HOT air and nothing else.

Parity problem of 80’s was ONLY and ONLY and ONLY solving the alpha-particle failure in DRAMS .. so was the single bit correction solutions .. not this finding ..

Clive Robinson March 29, 2015 2:24 AM

@ Sooth Sayer,

Parity problem of 80’s was ONLY and ONLY and ONLY solving the alpha-particle failure in DRAMS

I suspect you are not reading original sources from the time.

Firstly “bit flipping” was known to happen in more than just DRAM, it was known in MSI logic with the likes of latches that were also used in registers and other RAM. Go and look up meta-stability, it occurs about one in a billion operations even with careful memory element design so was seen quite frequently even in 1MHz clocked systems. I once posted one of the few articles I could find online about it here and you can find a conversation resulting from it I had with RobertT when I was trying to find more information on elements used in chips for generating random bits.

You will also find online information on how even simple gates suffer from “analogue” problems as signals approach the state change band at their inputs. Due to where the state change band is in CMOS logic, there were bias tricks you could use to turn the gates into amplifying stages. I have a Motorola CMOS book from the early 1970’s with an application note that goes into quite some depth on this, and I’m sure I’m not the only person with it.

You will also find conversations on this blog about the security implications of logic’s analog behaviour with regards to leaking state change information out from one part of a chip to another via what in effect looks like “cross talk”. If you go and look at the analog elements that go to make up DRAM you will see that they are liable to this transfer of energy from one analog element to adjacent elements…

The reason atomic particles got the news back in the 70’s and 80’s was due to the “Space Race” moving from political and military arenas into major commercial and amateur arenas (the last time I remember it doing major news in the electronics industry was a year or so before the Piper Alpha disaster, and my then boss getting worried about the new RTU designs we were making for the oil industry).

Whilst we thankfully get few atomic particle issues on earth there are many many more in space (see reports about astronauts getting visual flashes even with their eyes closed). There were microcontrollers specifically designed for hostile space use such as the RCA COSMAC 1802 using a Silicon on Sapphire (SOS) process designed with the help of Sandia National Labs to reduce not just the radiation issues but the analog issues as well which made the radiation issues worse. Although designed in the early 1970’s the 1802 is still in use in new equipment designs today and second sourced via Harris. If you can get a copy of the original extended data sheets and application notes you will find information about how BOTH the analog and radiation issues that made them worse were solved.

As @ Nick P points out there is a lot of forgotten knowledge from those times, which just does not get taught any longer, and we are having to re-learn it either the hard way, by re-invention, or by finding and reading the documents of the day. So you might want to get ahead of the game by going back to those old documents.

sooth sayer April 3, 2015 9:24 PM

Clive ..
You are mixing up far too many things.

Alpha particle failures were largely affecting cheaper packaging .. sos and ceramic were less affected

Radiation and alpha particles are two different issues. Alpha particle failures were strictly affecting DRAMs as the cell size became smaller and it became a dominant random mode. The bit flipping mentioned here appears to be a different phenomenon (without much consequence, if I may add).
