Schneier on Security
A blog covering security and security technology.
« The Terrorist Risk of Food Trucks |
| Jamming 4G Cell Networks »
November 16, 2012
Stealing VM Keys from the Hardware Cache
Research into one VM stealing crypto keys from another VM running on the same hardware.
ABSTRACT: This paper details the construction of an access-driven side-channel attack by which a malicious virtual machine (VM) extracts fine-grained information from a victim VM running on the same physical computer. This attack is the first such attack demonstrated on a symmetric multiprocessing system virtualized using a modern VMM (Xen). Such systems are very common today, ranging from desktops that use virtualization to sandbox application or OS compromises, to clouds that co-locate the workloads of mutually distrustful customers. Constructing such a side-channel requires overcoming challenges including core migration, numerous sources of channel noise, and the difficulty of preempting the victim with sufficient frequency to extract fine-grained information from it. This paper addresses these challenges and demonstrates the attack in a lab setting by extracting an ElGamal decryption key from a victim using the most recent version of the libgcrypt cryptographic library.
Posted on November 16, 2012 at 6:13 AM
• 17 Comments
To receive these entries once a month by e-mail, sign up for the Crypto-Gram Newsletter.
Hmm, why has it taken so long for the academic research community to get around to this....
Seriously the knowledge has been around for a very long time.
This is why OpenBSD devs on the mailing list relentlessly flame anybody claiming VMs by Xen or virtualbox provide any sort of seperation or supposedly increase security. Another reason is the layering of new bugs on top of the already mess of arch that is x86
@Clive - do you believe this is limited to machines running only two vm's?
Don't understand this. On any grown-up system, separation of VMs should be designed and built into the hardware, so that the operating system, working with the hardware separation facilities, can make it impossible for a VM to access any memory space used by another VM. If your box uses things like "shared microarchitectural components such as caches" which all VMs can get at, it's not suitable for multi-user computing.
A key point of this research, to which the author has readily admitted, was that the issue has only been demonstrated on XEN, and the only reason it has only been demonstrated on XEN is that the source code was readily available.
It remains to be determined whether this just a BUG in XEN, or an inherent design flaw in hypervisors everywhere.
They don't have any access to each other's memory, but any time different software is sharing the same hardware, completely eliminating side-channel information leakage is very difficult.
The physical CPUs are connected to shared cache hardware. When you ask the cache to access some memory, it takes a short time or a long time depending on what is already in the cache. The attacker's VM can do a bunch of accesses to force the other VM's data out of the cache, and then it can measure the timings of its own accesses to the cache. By doing this, it can deduce which parts of the cache the _target_ VM has been using. This is information leaking from the target VM to the attacker's VM over a side-channel.
If the attacker's VM knows (or assumes) that the target VM is currently running some particular crypto code that manipulates the key bits, and that crypto code changes its memory access pattern depending on what the key bits contain (for example, by using some of those key bits to index into a lookup table) then the attacker's VM is able to detect those variations, and deduce what at least some of the key bits are.
As noted already by Clive, the idea behind this kind of attack is pretty old. It sort of reminds me of an old attack I read about somewhere, against a password check function in some old time-sharing OS... it compared one character at a time and it let the caller supply his own buffer, which means you could tell if the character was correct or not by aligning it at the end of a memory page with unmapped memory right after it and seeing if you got a page fault or not. In that case it was the OS that leaked information about what it was doing, in this case its the cache hardware that "leaks" some vague info about what other software in some other VM has been doing.
But anyway, while the ideas are not new, actually exploiting cache timings to recover key bits sounds pretty tricky. This research is a practical demonstration of it actually being done on a certain kind of commodity setup which is rather common in the wild.
Many other variations on "timing attacks" are possible. The most obvious one is detecting the running time of a small piece of code running on another processor, and inferring whether it took the "fast" branch or the "slow" branch in the code (this is apparently often possible even when that other processor is at the other end of a network connection from the attacker's processor--you can spam it with requests and if a bunch of consequtive requests get routed along the same route, their latencies will tell the story).
Think of any code manipulating large primes (RSA) or elliptic curves, or whatever. To avoid leaking any timing info, all such code has to be written to take the EXACT SAME AMOUNT OF TIME to manipulate the key material bits, regardless of what those bits contain. Most library code for multiplying big numbers has the opposite goal (be as fast as possible, for each particular input), making them unsafe for doing crypto with. Data-dependent conditional branches are an obvious no-no, but there's other difficulties too: for example, a multiply instruction might take fewer cycles if the top half of the bits of either operand are all zero. Thats pretty difficult to work around.
Crypto implementors should be extremely careful when they write their code, to make sure its cycle count, load on execution units, branching pattern, cache usage, etc. are all as invariant as possible across different inputs. This is just one of the MANY reasons why average programmers should NOT be writing their own cryptographic primitives, they will always fail at this part. If you care about security, don't roll your own--only use cryptographic routines that were carefully written and vetted by actual security experts.
do you believe this is limited to machines running only two vm's?
There is no reason why it should be at the end of the day it boils down to Signal to Noise Ratio (SNR) and measurment sensitivity.
If you read their paper they say there current limit for measurment accuracy which limits the measurment sensitivity is based on the interupt time they use.
They can either improve the resolution some other way or work out a way to better synchronize it to the desired information in the side channel, which could give them the equivalent of a 100:1 improvment in resolution which in reality would give maybe 100^-2 improvment in the number of VM's
Another thing that could be done is to predictively remove noise from the signal. Unlike real noise the noise from the other VM's is determanistic but highly complex. If the effects of the other VM's can be generated and synced to the coresponding VM's in anti-phase then you reduce the artificial noise floor which improves the SNR.
However they don't of necesity need to generate an anti-noise signal if they can force the other VM's to be effectivly blocked as processes then the VM control programe will "round robin" or even swap out the processes depending on how it works in a much easier to predict way.
Now two things I need to say, althought the knowledge of how to do this has been around for quite some time (think back to about a month after AES finalist was anounced). And I've done similar on single processor systems, what these resarchers have done is actually quite difficult to do even in a controled environment so a significant hat tip in their direction. BUT as Bruce has observed on a number of occasions "Attacks get better with time".
Now you need to remember these are academic researchers not those payed for out of the likes of the NSA budget. Now back in the 1980's I did my own research out of curiosity into fault injection via modulated EM carrier. The first academic investigation resulting in a published paper was only a couple of years ago and that was with an unmodulated EM carrier. So there was around a quater century gap between me and them. Now I'm not daft enought to think I was the only person to have thought of it so conceviably the likes of NSA, GCHQ et al had done the same at or before that time...
So whilst this might be new in terms of published papers as the authors of the paper make fairly clear it might well have been done before confidentialy (I'd put money on it ;) so it might well be a standard attack or method for the likes of the NSA or well funded corporate or even a smart individual. Further it might well have been improved upon.
Now one thing to note they mention in their paper that they had their agent running as a VM, but they did mention using network timing (which is what the original AES cache attack used), thus it may well be possible to extract KeyMat from other much less well written libraries/code (think ccommercial closed source) with just a probing attack from the network.
I've mentioned this possability a number of times on this blog as well as a networked way to enumerate a VM host from the network using just the TCP time stamps. I even emailed one of the bods on the honeynet project to warn that what appears to be a "brain dead script kiddy attack" could actually be an enumeration attack looking to find VM "networks" that are frequently used as research honeynet traps. So a smart attacker with their new zero day can scope out the network first and if they discover a VM network give it a wide birth. Obviously the same attack can be used to work out which VMs are running on the same machine in a co-log site. So even if the web site you want to attack is fully hardened and does not offer any exploitable weaknesses you can find other VM services on the same hardware that are much more accommodating thus get a toe hold into the server and potentialy get out of the VM "sand pit" and potentialy root the system to get down into tthe other VM instance of interest. I menttioned this on this blog a number of years back so with it's large readership I suspect some people are already doing it.
Hopefully now the developers of VM systems will take this onboard and develop ways to decouple timing on VM networks on a single hardware platform so they cannot be easily enumerated (this is actually quite a hard task as it means digging much further into the base OS than they currently do).
"Thats pretty difficult to work around."
My solution was that it's usually a black box analysis they're trying to do. They're looking at the network level, process level, or cache. My next move was to defeat the first two by implementing it using normal methods. The tweak was that I'd already have a timing profile of the algorithms, quick times to slow times. I'd also take a sample of the time before I do the operation. After the operation, I just loop until the highest time is reached & BAM.
As for cache, I say flush it upon context switch, turn it off or suck it up. Cache is kind of a necessary evil I can live with. I take plenty of measures to prevent new software from running on a system I design. I also have monitoring & auditing of applications. So, I can usually see where a likely cache attack is and haven't found any happening. That said, there are recent innovations in this space:
Work on new caches resistant to timing channels
Runtime for detecting storage & timing channels
Caisson - hardware description language for secure information flow
Hey, RobertT, what do you think of Caisson as a language? It's targeted for hardware guys, of course, and I can't evaluate it due to lack of experience.
"Hey, RobertT, what do you think of Caisson as a language?...'
It is very impressive for what it is / does, but IMHO the inherent limitations are of more concern then the accomplishments.
Points to think about:
1) Verilog is the only universal HDL language in wide spread use. Nothing else will be well enough understood by the majority of digital hardware designers to ever give it critical mass. So asking about the security of hardware designs generated by Caisson is a little like asking about the robustness of Pascal vs C code. There are clear features in Pascal that make it a more robust language then C, but this advantage has to be weighted against the limited experience most programmers have with Pascal. This raises the question, Are you trading one set of known problems for a set of unknown problems?
2) WRT trading known for unknown they do make a comment
Note: non-digital side channel attacks like power analysis are not within the scope of this paper.
Physics and process fundamentally dictate the minimum power required for to accomplish a given calculation. furthermore this power is data dependent because Zeros in the upper bits result in a simpler multiply calculation. It is easy to address the timing side channel associated with fast / slow multiply by just matching the cycle count, however the power side channel is not so easily masked because it is inherent in the number of gates flipping. The fixes for the power side channel are far more complex than those required for the timing side channel.
3) Modern CPU chips are power limited not speed limited, which is the main reason we don't have CPU clock rates doubling every year or so, the way we did back in the 90's . A modern quad core processor core has a chip die area of about 1sq cm and consumes about 100A , typically from a 1V to 1.2V power supply. Careful power management is ALL that stops these chips from glowing like a light bulbs. BTW secure computing fundamentally suggests that both power and timing must be constant. The laws of physics say there must exist some minimum power for all functions. So Secure computers must therefore consume at least double the power to avoid timing side channels and double again to avoid power side channels.
Try to sell 4 times the power, to a chip designer who's main concern is the power budget (just look at the problems Intel is trying to overcome to make X86 into a cell phone processor and then tell them that for security reasons you want to increases the power by a factor of 4)
If you ever get to present this idea please invite me along because I want to be ready with a tape measure so I can accurately document exactly how hard your *** gets kicked.
Thanks for your feedback.
"So asking about the security of hardware designs generated by Caisson is a little like asking about the robustness of Pascal vs C code. There are clear features in Pascal that make it a more robust language then C, but this advantage has to be weighted against the limited experience most programmers have with Pascal. "
It's funny you mention that. I was recently looking into using Modula-3 or a modern Pascal for systems programming. Plenty of great work in OS security & type safety wen't into those languages. Robust OS's were also easily implemented in them. Ignoring libraries & IDE's, my main issue was "will other people be able to pick this stuff up or will it just be constant problems?" I've held off so far.
"If you ever get to present this idea please invite me along because I want to be ready with a tape measure so I can accurately document exactly how hard your *** gets kicked."
LOL. A camcorder might be more effective. Even as you say this, you give me an out.
"Note: non-digital side channel attacks like power analysis are not within the scope of this paper."
The 2 and 3 portions mostly apply to solving both timing channels & power channels. I was thinking of using the tech for security appliances or secure co-processors. Most such devices assume physical security & a trusted administrator. I doubt power analysis is an issue here. So, the worst case takeaway is 2x the power for secure against timing channels.
Considering the likely use model, this doesn't seem all that bad. I'm in the boardroom talking about some components that make up a small part (or a security-critical part) of the company's IT space. I say the chips are made with verification technology to do what they say they will do. I say they also take 2x the power of competing low-power chips w/out assurance. So far, it sounds like a small issue rather than a dealbreaker. I could see your argument if it was a smartphone, server system, etc.
@ Nick P, RobertT,
BTW secure computing fundamentally suggests that both power and timing must be constant. The laws of physics say there must exist some minimum power for all functions.
The laws of physics are a bit iffy on this score ;)
For instance it is claimed that the energy is recoverable if the computation is reversable in some way (look up the theory behind FredKin and Toffolie gates).
But what is reversable and what is not...
In classical logic the output of the some gates is not considered reversable for various reasons and others only reversable if you know the output of the gate and N-1 of the inputs (AND, OR, XOR).
Normally for security the assumption is as an external observer you ONLY ever KNOW THE OUTPUT...
However that said the binary 1bit by 1bit multiplier is a strange beast because multiplying the two bits together never produces a carry (0x0=0, 0x1=0, 1x0=0 and 1x1=1 which is your basic AND gate function) with all other numbers it does produce a carry or to look at it another way you have nx2 input digits which will produce a 2n number of output digits. Thus you make an Nbit multiplier with an array of AND gates which then feed an adder tree to produce the final output (this has some distinct disadvantages that I will ignore for now ;-)
So for half the multiply power issue you can resolve it by making a complementary array of AND gates where you invert the input thus the power signiture is the same for both arrays (assuming you can get delay diffs down sufficient to stop a microwave sig being formed). As for the other half of the multiplier the adder, a similar thing can be done but it's rather more difficult to describe.
Similar reversability can be done with many logic circuits however there is the question of the one way functions in Fiestel networks... Even though the text function (XOR) is clearly reversable the way the one way function is used it has no reason to be either reversable or for that matter have an inverse that would give a balanced power signiture...
These days the actual Gate power losses are much lower than the charge / discharge losses associated with interconnect between the gates. since most of this capacitance is to adjacent bit signals it also means that the power losses are data dependent. So two bits changing together can have lower power tan one bit switching.
In modern processes a lot of effort goes into lowering the dielectric constant of the insulator. It used to be that SOG (spin on glass) was the standard Interlevel dielectric (ILD) used by everyone, but know exotic organic compounds like "black diamond" or a whipped up glass called OSG are the most common.
There are ways to levelize the power, without completely killing the project. Ross's team at Cambridge has played around with fully differential logic which helps with the equalizing gate power but not necessarily with ILD switching losses.
My preferred method is to simply use a Shunt regulator. This can take the form of a Switched Capacitor regulator system whereby the storage cap charges when power is lower than average and reverses delivering power back to the system when the peaks occur. So you get some energy storage and because it is all local someone external will have real trouble doing DPA on the system, especially if the VDD (power) rail is variable.
The problem with my solution is that I have in effect created my own on chip DPA system. So I'd better be absolutely certain that no one can access any information about the state of this circuit. Otherwise.......
Herein lies the basic problem with designing high assurance chips. There are a lot of things that you can do but most of the time you don't know if you are helping or hurting because the answer depends on what glasses you are wearing.
Another real problem that is never discussed is "educating the hacker". If you assume that the "Enemy knows the system" or will know the system Than you also have to accept that the enemy will analysis every defensive circuit that you create. They will asks themselves Why is he doing that? Think about showing an anti DPA circuit to someone that does not know about this technique. It wont take them long to understand what you are trying to obscure.
When I analysis other peoples defensive circuits I sometimes say Wow, but more often Huh! what's that trying to do? There are cases where Huh becomes WOW when I've had some time to think about it. I remember being totally confused when I first saw an anti PICA (photon emission analysis ) circuit. Basically it was a distributed circuit designed to randomly generate photon emissions over a wide area of the circuit.
My take away upon discovering this circuit was that the use of chip backside photon emissions analysis is far more advanced / common than I had previously assumed. Naturally this knowledge effects where I focus my efforts.
Your mention of Camb labs and PICA jogged my memory also your previous menttion of some quite nice optical gear bubbled up
So llike all itches I had a scratch and a little hunt around later,
I dont know if you've seen the slides or even heard the talk, but the impression given is that rudimentary PICA can be done with what is in effect "hobbyists" equipment that can be purchased for around a quater of the average takehome pay...
One thing it does point out though is photon emmission from semiconductor nunctions has been well known for over fourty years, so I would suspect there are some "insiders" who have had good funding to make very significant efforts in taking advantage.
One thing I do remember from the late 70's was near IR optics were made of some interesting substances back then, some of which although common (ie potassium chloride) did not like being used in ordinary atmospheric conditions (due to such anoying things as water vapour etc).
"I dont know if you've seen the slides or even heard the talk, but the impression given is that rudimentary PICA can be done with what is in effect "hobbyists" equipment..."
The equipment does not necessarily need to be overly expensive you just need to know where to get what you want.
Personally I use oil immersion optics for back side analysis, (which IMHO is actually much simpler than topside optical analysis).
The semiconductor photons are generally emitted in the 700nm to 1100nm range so you don't need exotic sensors, typical Digital Camera sensors are actually very good (especially if you cool them and polish off the IR filter ) The CCD senors can then be directly laid on the back of the chip. if the sensor pixel area is less than 4um then you are at about the same focus limit as exists for any far field optical setup. This setup only requires you to buy a professional grade digital camera (less than $5K ) but even a cheap $100 camera will work albeit with a higher noise floor and no access to the actual image file (cheap cameras give JPEG compression)
I have my own techniques for getting much better resolution and sensitivity but I'm not willing to discuss this at the moment.
It may be architectural. I only say this because in talking with a rather large cloud vendor they stated they were aware of the issue and had already modified their stack to deal with this sort of thing. They use a heavily modified XEN.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.