Schneier on Security
A blog covering security and security technology.
« RFID Passports Less Reliable than Traditional Passports |
| TSA Security Round-Up »
November 21, 2006
New Timing Attack Against RSA
A new paper describes a timing attack against RSA, one that bypasses existing security measures against these sorts of attacks. The attack described is optimized for the Pentium 4, and is particularly suited for applications like DRM.
Meta moral: If Alice controls the device, and Bob wants to control secrets inside the device, Bob has a very difficult security problem. These "side-channel" attacks -- timing, power, radiation, etc. -- allow Alice to mount some very devastating attacks against Bob's secrets.
I'm going to write more about this for Wired next week, but for now you can read the paper, the Slashdot thread, and the essay I wrote in 1998 about side-channel attacks (also this academic paper).
Posted on November 21, 2006 at 7:24 AM
• 37 Comments
To receive these entries once a month by e-mail, sign up for the Crypto-Gram Newsletter.
Its an interesting attack. In all liklyhood this can be extended to other similar arch types like dual cores. The big question is what about vitual servers?
Also as a defence, what about "slicing" the RSA with either other RSA operations with different keys or even extra instructions that affect the branch predictions, putting it all on the same thread/core? ie you take the performance hit to prevent the attack.
So we are essentially watching the processor and cribbing the math directly. This works well, if you control the box it's being calculated on (legitimately or not). So I can definitely see the DRM implications, since that's entirely local execution. I can also see the trojan implications, for injection onto a box running RSA that you want plaintext from.
Spelling nit: devAstating, not devistating. Think of "vast destruction" :-)
Perhaps Bruce was subtly implying that a certain
vendor may be on the Longhorns of a dilemma. :-)
Cory Doctorow's talk on DRM at Microsoft for the EFF from a few years ago is the best "layman terms" description of why DRM is a disaster I've read. Worth a look.
What's really interesting is that you really don't need "control" of the box beyond being able to execute code in user-space. This attack may work against any multi-user system that performs RSA operations, such as Windows Terminal Server or ssh servers.
I wonder if there's a simple solution, like replacing the Square & Multiple Exponentiation algorithm:
for i from 1 to n-1 do
__S=S*S (mod N)
__S[d[i]]=S * M (mod N)
This might work if T(S[d[i]], i=0) = T(S[d[i]], i=1), which requires that S and S be on the same cache line, the CPU to require the same time to execute the relevent underlying machine code, ideally one cycle to perform the dereference and assignment operation, and the CPU not be able to predict data dependencies across those instructions. This can be verified for target platforms, and additional instructions inserted to force the CPU to not recognize the lack of data dependencies. (This last might be solved by something as simple as a double-NOT operation on d[i].)
Essentially, all I did was remove the branches, and always performed every calculation using the data itself as the selector for the result rather than input to a branch. A similar approach could be taken for the Balanced Montgomery Powering Ladder.
The other interesting result of this paper should be a lesson to implementors:
Ensure that your algorithm executes in constant time in all circumstances on the target platform.
Finally, one last lesson: Reuse existing implementations, as they probably did it better than you will.
I believe the S=M step in my above algorithm is extraneous.
Not that this is feasible in the short term, but this could be another advantage of reversible computing aside from energy efficiency. That is, reversible computation is immune to side-channel attacks by converting energy dissipation into information that can be kept secret. I think that's what side-channel attacks are exploiting, effectively, information (or entropy)thrown out by the system because fundamental computer operations are irreversible.
I don't think this is a timing attack; it's a side-channel attack that exploits the fact that OpenSSL's impact on the branch prediction cache leaks information. The only "timing" done here is to determine whether a jump address is in the BTB or not.
@Joe - Yes; reversible computing could defend against a host of present side-channel attacks. However, with little or no entropy, it would appear (to someone unfamiliar with the field) that you'd open up other attacks: 1) reading the data that has not changed; 2) running the executed algorithm in reverse (thus likely obtaining the original inputs).
I would think that at least for a device that is supposed to be secure against its owner, reversible computing won't help. It might against someone who merely uses processing time on that device in a networked system.
That was a great read, thanks for the link :)
@Thomas H. Ptacek:
It is a timing attack in that the operations performed do not take a predetermined amount of time irrespective of the data being operated on. An implementation free of timing attacks would take the same clock time for each operation irrespective of the data and non-interrupt outside influences (eg, you should be able to stop and restart the process without being able to determine where in the process the code is presently executing [without having a debugger attached to the process]). Current implementations fail this test.
Please correct me if I read it wrong, but it appears the attacking 'process' requires access to the very processor doing the RSA crunching. And while the attack still works with a hyper-threaded single CPU, there is no mention of the attack working across two distinct CPUs (within one computer). (or did I miss it?)
While some dual core (Intel and AMD) CPU's have the CPU's closely tied, Intel's early dual core processors were still separated by the north-bridge. (And, I understand the current 4 core processor separates two of the cores by the north-bridge.)
So, I would have thought you could prevent this side channel attack by:
- Ensure at least two of the CPU's are separated by the North Bridge, or similar.
- Temporarily dedicate one side of the bridge solely to RSA execution. (and no other, or only 'trusted' processes.)
- Still employ traditional side channel spoofing methods to mask traffic to and from the bridge to the 'insecure' CPU.
Would this work?
I'm only an armchair enthusiast, so may have missed a critical piece of understanding, and would welcome any comments from the experts.
I didn't see anything about the attack working across distinct processors. I can imagine, however, that the cache synchronization protocols may reveal timing information if the code spans multiple cache lines. I'd have to look into the low level cache synchronization protocol to know for sure.
The solution that seems to have the least risk to me is to write the software in a time-independent fashion entirely, by executing every instruction and having every conditional operation access a cache line whether the result is discarded or not.
The fundamental notion I propose is to speculatively execute every calculation, reading all inputs from near-identical memory locations (same cache line), storing both results of the calculation into near-identical memory locations (same cache line), and choosing which result to continue with based on the data under examination, without using branch instructions. This might be a tall order, but I think it's achievable with effort.
You're not describing the attack in the paper, you're describing timing attacks in general. This attack relies on a (binary) timing difference between operations wholly unrelated to OpenSSL, which are influenced by OpenSSL's impact on the cache.
Branch prediction delays are timeable, but I'm not sure that's really the attack here. The side channel is the state the CPU keeps about what branches have been predicted, not the amount of time the OpenSSL operations take.
Bruce: that cannot be correct. We know that Alice will not attack Bob by definition. It'd have to be someone else - like Eve or Mallory!
Timing / side channel attacks have been known for a very long time and are reputed to have been the first reliable Tempest attack (see history behind the British / Canadian Rockex crypto system).
The usual thing you get told about Tempest is it's all about Energy / Bandwidth, what they tend not to tell you is in which direction your clock signal should propergate through your equipment... (along with a few other gems of what turn out to be common sense).
Basically if you either clock your system from the output or re-clock at the output, the majority of timing attacks won't work.
The reason they do work and are not likley to go away is that engineers etc always try to squeeze the last drop of performance out by using non synchronus devices and software.
Therefore the timing delays of different execution branches / cache hits-misses etc show up at an output (network / PSU / front pannel LEDs) in a repeatable way that alows an attacker to corelate out useful information.
As has been observed in the past (by the initiates) "it ain't rocket science".
Given: known algorithm (code), multithreaded processor sharing BTB
Goal: find private key
BTB is the Branch Target Buffer which is a cache of taken-branch targets
Create a spy program which keeps the BTB full. Run the spy program and encryption algorithm (RSA) at the same time on the same processor (multithreaded CPU requirement). Every time RSA takes a branch it replaces a spy-program's branch in the BTB (shared BTB requirement). The spy-program can notice its delay on its branch which RSA replaced in the BTB so it can know which RSA branches are taken (know RSA code requirement). Since RSA branches on key bits the key bits can be determined.
Observations: BTB is not shared across multicores. Caches are sometimes shared and side-channel information can be derived, but it is the L2 cache which is shared so the level of detail available in watching an L2 cache is much less than one gets from the BTB. The last time I checked, AMD multicores are not multithreaded, but Intel, IBM, and Sun multicores are multithreaded.
"Bruce: that cannot be correct. We know that Alice will not attack Bob by definition. It'd have to be someone else - like Eve or Mallory!"
Not true in the DRM case; Alice is a good girl who is just trying to get access to something she already owns. The bad guy is William who infected the system with DRM in the first place.
@Fred P: My understanding of side-channel attacks was that the entropy of the system was used as information to discover secrets and this information was difficult to obfuscate because people don't usually consider entropy as information that needed to be protected. Once we consider that entropy itself was information that could be used to attack a system, then we could device ways to obfuscate it.
So, perhaps, instead of switching to reversible computers to defend against side-channel atacks, I would suggest that the models used to think about reversibility be used in thinking about the overall security of a system. That is, reversible computers approach information processing as physical systems not just logical systems. The strength of RSA itself relies on the physical constraints of current computing technology.
Its very easy to extrapolate this into a pratical key gathering system.
1. Run n attacks on a known key.
2. Analyise the results to identify the timing signature of the processing of the before the known key. (probably key setup etc.)
3. Run your process continuously.
4. When the "signature" sequence is detected the following sequence will identifiy an unknown key.
Over time you will capture some of the keys which were used on the shared processor (baring in mind that the spy process will probably be shunted between cpus several times!, and, on a busy machine other processes will disrupt the timing measurements so you may fail recognise the "signature" perhaps 90% of the time ).
If the "secret" keys are used several times (several being 100s not 1000s)
then you could easily extract most of the keys with a relativley short run
(a couple of hours!).
You would then be faced with the rather unusual (but relatively trivial) cryptographic problem of having all the keys but no idea which locks they belong to.
While the attack is against the BTB, the measurement of the attack is time. I propose that a simple way to eliminate an entire class of attacks that work by measuring time is to audit the code for each indeterminate-time operation and eliminate them, considering time indeterminance over an entire system rather than an isolated piece, and over individual operations rather than composite cryptographic operations. Finally, I offer a specific implementation of the Square and Multiple Exponentiation Algorithm that possibly mitigates the measurement of the attack on current system implementations. I fail to see how my suggestions are irrelevent to the attack described in this paper, as I believe those suggestions both eliminate this particular attack vector and other known vectors that work by measuring time.
There may also be other attacks that can be prevented by generalizing my suggestion somewhat. I would suggest that ALL operations consume equal amount of resources, and to perform every operation on every data element. For example, if performing a multiplication takes more power than an addition, then perform both the multiplication and the addition. If taking one jump instruction takes a different amount of time from taking another jump instruction, then take both jumps. If accessing one memory location may take a different amount of time from accessing another memory location, and the memory is accessed in a data-dependent way, then access both memory locations. Use speculative execution techniques to discard unneeded results without having to execute different instruction sequences.
As experience is gained, we (humanity) can arrive at a list of resources that must be audited in each cryptographic primitive implementation. We should have a comprehensive test plan for cryptographic primitives utilizing all known attack vectors. One class of attacks is resource utilization, and we can add, over time, to the list of resources that the primitives consume and produce. That may be BTB cache space, memory, time, power, heat, sound, bus access, memory cache, etc.
True, I am not talking about the particular attack on the BTB, but presenting a methodology for eliminating many side-channel attacks on cryptographic primitives, both known and unknown.
Yes, as long as the prime motivating factor in implementing cryptographic primitives is performance, side channel attacks will be prevalent. When the top priority becomes security, side channel attacks will gradually disappear. My specific implementation suggestion takes longer on most platforms than algorithms optimized for speed, but my implementation doesn't leak information. I would suggest that people implementing cryptographic primitives be more concerned with security than speed.
I think that this paper (along with others) shows that cryptographic security is not a major consideration (and/or competence) of general-purpose CPU designers. As an example, removing Branch Prediction from the chip design (or enabling turning it off for the duration of the cryptography) would protect against this particular attack, as would preventing other threads from being on this processor for the duration.
My main concern with controlling entropy is that doing so in the most trivial conceptual way (nearly eliminating it) would likely open you up to another set of attacks. However, there may be better approaches. Trying to keep entropy essentially constant (but not particularly low) might help.
"shows that cryptographic security is not a major consideration (and/or competence) of general-purpose CPU designers"
There is actually no reason why it should be, it is not where the problem is...
If you design high performance motorbikes for performance track racing, you will find little of your "specialised" work related to designing garbage collection vehicals.
Crypto hardware is of it's self a very specialised field of endevor it rarely involves designing for high speed/utilisation. Processor design on the other hand does.
The four problems we have today that impact security are,
1) We have general purpose CPU's that are as capable as specialised crypto processors, but at considerably lower price by several orders of magnitude.
2) We have a need to use strong crypto in modern society at the "grass roots" level.
3) Engineers invariably make do with what is available and pragmatic to get systems "just working" for the modern high churn markets.
4) Product development engineers lack any kind of specialised training in security engeneering...
The last point is a que to Bruce & Niels / Ross / Adam & Moti to plug their respective books ;)
Would unrolling the loop somewhat help to reduce the effectiveness of this attack? I'm thinking it would cause many more branch instructions to be fed to the BTB blocks during execution, thus making it harder to deduce from the misprediction times the value of 'i' on the spy misprediction:
R0 = 1; R1 = M
while (i+S)<(n-1) do
i = i + S
(where S is a fixed number known at compile time)
If this sort of thing does have an impact, it could be improved by implementing many different versions of the loop with different 'S' (even providing paths that break the calculation up in to several sections with different 'S'), and choosing which gets executed at random on each execution.
One of the problems I'm seeing is that cryptography has reached a point where you need reliable results quickly. In other words, you can't just do it right (go too slow in an environment where you're called frequently and you end up with a backlog) and you can't just do it fast (obviously because you're prone to mistakes)--you need to do it right and fast at the same time: a double-hard order since doing one makes it harder to do the other.
Wouldn't it be sufficient to add noise, in the form of randomly executed code, to the executable algo code? (The algo code would then be obfuscated)
I haven't kept up-to-date with processor architectures, so my thinking may be a bit 8086-ish, but doesn't simply disabling interrupts for the period of execution of the algorithm solve the problem?
If this has too many undesirable side-effects, breaking the algorithm down into several interrupt-disabled chunks may help.
Apologies if this proves to be a naive observation.
I think there might be a connection between this type of attack and traffic analysis, in the sense that the countermeasures to both are very similar: introduce false traffic into the system so information about who is talking to whom and when, or the signatures of cryptographic primitive (e.g. modular multiplication and exponentiation) operations within the system are obscured by the additional traffic. These countermeasures are equally impractical, because both require suboptimal use of the shared system, or for this system to be totally dedicated to the traffic to be protected (communications network in the case of traffic analysis or the CPU to protect against this attack).
How to prevent such an attack?
Possible if the algo does not use ifs. Then both paths are always computed (no different memory locations, no branch prediction).
Some sample in pseudo code
if ((a and 1)=1)
b := c;
will be transformed to
mask := (a and 1) - 1 # 0 or -1 (all bits set)
b := (mask and b) or (not mask and c)
Then there are no differences in the execution expect from differences in the calculation of c.
This will cost performance, but will hopefully increase securty.
Hope it is understandable, I am not used to write in english.
Yes, that's a good idea. Besides, instead of
b := (mask and b) or (not mask and c)
you can use
b := ((b xor c) and mask) xor c
and thereby save an operation and a register. :-)
OpenSSL isn't actually vulnerable to this attack; the paper modified the code to make it vulnerable. Recent versions of OpenSSL use a fixed-window exponentiation scheme: compute c^i for i in 0..2^k-1. Then repeatedly square, and every k steps, multiply by c^i for the appropriate i. To make memory accesses independent of the key, it also interleaves the table of c^i values within cache lines.
There still is at least one key-dependent branch in OpenSSL, in the Montgomery reduction algorithm. But hopefully, you can't gain any information from the Mongomery reductions, because OpenSSL uses blinding too: multiply by r^e before decrypting, and then divide by r at the end.
All these tricks make OpenSSL kind of slow: RSA as implemented by GMP is much faster.
A dumb question (no rotten tomatoes, please):
When cryptographic procedures are running whithin an isolated "sandbox" virtual (emulated) system that involves emulating a processor (think QUEMU), are they still vulnerable to side chanel attacks like this one?
"When cryptographic procedures are running whithin an isolated "sandbox" virtual (emulated) system that involves emulating a processor (think QUEMU), are they still vulnerable to side chanel attacks like this one?"
Not a dumb question at all.
The answer depends on the design, not just of the processor but of the overall system.
IF you can prevent one process in a system effecting another (which is extreamly difficult) then it does not have the potential to leak information, otherwise it does which is your "side channel".
This leads to the next question which is, "Does the side channel have sufficient bandwidth to communicate usable information". Again if not then no usefull information leakes to the other process.
For example, one "off the wall" way a process may effect another is by causing the CPU to heat up and change the XTAL frequency that drives the CPU. It has been shown that you can get about eight bits of information an hour down this channel when viewed remotly across a network. Obviously a process running in the same system or CPU is likley to have a larger bandwidth, but again is there sufficient bandwidth in this or other methods.
The answer to this depends on many things, but if you get 1 bit for every use of the private key then after X usages you have sufficient information to reconstruct the key Dan Coppersmith estimated 25% of bits for full recovery (just wish I could remember the citation).
So on a busy web server it might take a day or less to get enough bits to make a search, depending on the channels available.
The next issue is even if a channel has insufficient bandwidth can it be used in agrigate with other channels...
The real problem from the defenders point of view however as this paper shows is not "IF you can use a channel" but "IF you can identify all channels and limit their use".
As I said further up the blog only if you use appropriate design not just in hardware but in software as well, which most commodity systems do not.
So the answer to your question is very probably yes, you just have to find the side channel and exploit it.
So basicaly a Virtual Machine enviroment, unless specificaly designed with side-channel protection in mind, is very likely to have its own side chanels.
"a Virtual Machine enviroment, unless specificaly designed with side-channel protection in mind, is very likely to have its own side chanels."
I would be a bit broader than that and say,
"Any system is very likely to have its own unknown (to the designer) side channels"
The problem is side channels are very very difficult to design out of a system.
And problem number one is you have to be aware it's possible to have one in XXXX manner, before you can design it out.
Also new side channel methods YYYY come up all the time as technology develops. So YYYY is always something that is known after the event, so it is more by luck than judgment you protect against it...
The usuall solution employed is the "two screened box" solution where,
1) you put your system into a closed environment (electricaly screened) and you very strictly control the inputs and the outputs with filters etc. Any decent book on EMC will show you this much.
2) Additionaly all inputs and outputs need to be re-clocking to prevent a range of time related attacks (think PWM / PPM / DSSS etc).
3) You then put this resulting system inside another electricaly screaned box.
4) Then there is the aditional problem of power supply / galvanic issolation, lead radiation / coupling, likewise for magnatic and achostic etc etc.
5) Also you do not bring any indicators out unless they likewise have been re-clocked and filtered...
A UK BID cipher machine had the unfortunate property that the output from it's internal stream generator was brought to the front pannel as a health status light.
It has since been discussed that a similar problem with a comercial cipher system in computer centers with glass walls in banks might well have been responsable for confidential data leakage, in that there where red & green channel LEDS directly connected to the data lines and you only needed a telescope and photodetector connected to a scope to see the results (Ouch).
As has been said "it ain't rocket science" but it "sure ain't easy" and you have to be able to think "off the wall".
"Additionaly all inputs and outputs need to be re-clocking to prevent a range of time related attacks (think PWM / PPM / DSSS etc)."
I guess that is exactly what can be done by mean of a Virtual Machine with an emulated processor!
Even more, certain intricacies of such re-clocking within an emulated computational environment can be set manualy by the admin of each system (server), thus, the arising side-channel phenomena (and we assume they might still be present) will differ from one machine to another greatly, as the will be relevant not only to the hardware, but also to the software, which will be fine-tuned slightly differently on each machine.
What do you think of this?
Schneier.com is a personal website. Opinions expressed are not necessarily those of Co3 Systems, Inc.