Schneier on Security
A blog covering security and security technology.
« Backscatter X-Ray Technology |
| Risks of Pointy Knives »
June 10, 2005
More MD5 Collisions
Two researchers from the Institute for Cryptology and IT-Security have generated PostScript files with identical MD5-sums but entirely different (but meaningful!) content. (Other MD5 attacks are summarized here.)
Posted on June 10, 2005 at 8:15 AM
• 25 Comments
To receive these entries once a month by e-mail, sign up for the Crypto-Gram Newsletter.
OK, this kills MD5, pretty much. The attacks are no longer theoretical.
What do you recommend instead? Tiger?
The demonstration is very nice. They used the fact that a postscript document is actually a program which displays the text. They could have used ANY two MD5 collisions. The same method can be used for anything programmable. They made this to make the point: using just one hash method for which exists any known collision to check some executable code is totally useless.
"OK, this kills MD5, pretty much. The attacks are no longer theoretical."
The attack was always practical. Read the summary. This research is useful because it shows "how badly wrong" people who thought it was just theoretical were. Kaminsky's paper was similar in this respect.
An interesting attack. It makes use of the postscript eq operator to test against colliding 1024 bit values to select between the two differing content streams (remember Postscript is really a programming language). The strength of this attack is that it could be used to select between any two arbitrary content streams. The drawback of this attack is that the proof of bad intent lies within both documents. That is your "evil" content exists within the "innocent" document and vice versa, so that if the documented is opened in a text editor you can realize what is going on. This almost (but not quite) reduces this attack to an interesting social engineering attack based on the fact we trust our tools (pick your favorite PS display tool), rather than look at the source of what we are signing.
However keep in mind that this approach has applicability outside of Postscript files and could be used on anything with a equality construct (web pages, applications, java applets, etc...)
That is why Tripwire performs an MD5 and a CRC32 check on a file to determine a change.
Wonder what the likelyhood of a file colliding with both of those....
Wouldn't an SHA-xxx hash be better?
I haven't got the skills to verify this properly, but surely a collision on either SHA-x or MD5 can't be collided in both, so check both hashes of the same data, some "defense-in-depth": ie.
letter_of_rec.ps MD5 a25f7f0b29ee0b3968c860738533a4b9
letter_of_rec.ps SHA1 07835FDD 04C9AFD2 83046BD3 0A362A65 16B7E216
order.ps MD5 a25f7f0b29ee0b3968c860738533a4b9
order.ps SHA1 3548DB4D 0AF8FD2F 1DBE0228 8575E8F9 F539BFA6
Thus files don't match.
CRC-32 by itself is quite susceptible to forging; you could make a file have any arbitrary CRC-32 checksum value by inserting junk data in previously empty space, since CRC-32 is just a linear function on the data. So my initial intuition was that using it in conjunction with MD-5 doesn't add a great deal of security. I found the following (an analysis on creating concurrent MD-5 and CRC-32 collisions) while conducting a bit of research:
I don't know enough to pass valid judgment on this short presentation; thoughts?
What about having an AES-like competition to create the next hash function? Hashing is used as a fundamental building block in so many protocols, that i think it makes sense to pool the resources of the community. Especially since (as far as i know) the design of a hash function is even more of a black art than encryption algorithm design.
(Probaby not a new idea at all; i first heard it when taking a class under Dr. Avi Rubin)
The CITS paper is nicely done, but hasn't MD5 been listed for some time by the National Institute of Standards and Technology (NIST) as "fully compromised (demonstrated)"?
At least since the beginning of this year NIST was suggesting a migrating to the SHA-256 algorithm by 2010. And then the panelists on the Feb 2005 RSA Cryptographer Panel (which was more focused on the SHA-1 vulnerability) seemed to say that *any* new security product should also use a new hashing algorithm (something other than CRC32, MD5, or SHA-1). But I think this was really just to say that the industry will be seriously impacted by a need for a suitable SHA-1 replacement, not that MD5 is a concern. In fact some of the panelists suggested there is no rush to migrate off SHA-1 (although it struck me that the most conservative of the group seemed to be Rivest, who designed MD4).
The Federal Indentity Credentialing Committee (FICC) put it nicely when they reported on March 8th that a MD5 "compromise is not seen as a major impact to the security product and services industry". See page three of the FPKI minutes:
The minutes also hint that NIST might formally respond and/or accelerate the recommendation for a SHA-1 replacement, but I have not seen anything yet. The NIST Special Publication 800-78, Recommendation for Cryptographic Algorithms and Key Sizes, was just released in April:
"They could have used ANY two MD5 collisions. The same method can be used for anything programmable."
No, it's worse than that. See, the final documents hash together, not the two blocks. What they did was to create a file with both documents and some space in there for some "random" characters. Then created some program to insert two different random characters in there and keep doing it until it found a pair where the hashes matched.
They really used the simplest demo they could, basically, but you can essentially do this with any document format which has the ability to include non-printable data. The two documents *don't* have to both contain the same data. You just have to be able to restrict where in the document you can insert your random data and make your changes as you test for hash collisions on the whole document.
In other words, yes, this case has both documents containing the same data. However, there's really nothing preventing somebody from doing this to two different documents, as long as they can insert data that isn't displayed into one or both of them.
Their attack still appears to be "only" a collision attack (i.e. not a preimage attack). Those of us using md5 just for verifying authenticity are still "safe."
I'm not particularly impressed with this attack. To those that are saying that hash algorithms are killed by finding collisions (yes, this includes SHA-1), you have to put this finding into the correct context. All that the researchers are demonstrating here is that it is possible to create two functionally different yet valid files with the same signature in a very short amount of time.
This and the other referenced attacks all take advantage of the length extension property of hash functions after finding two pieces of data with colliding hash values. Effectively, all they demonstrate is the possibility of wrapping identical content around these two separate sets of colliding bits.
While this is important, this is not nearly the same thing as demonstrating that it is possible to create two ARBITRARY documents with the same hash value or that it is possible to generate a NEW document with the same hash value as an EXISTING document.
In plain English, this is what the finding means: if we have control over both the original and the forgery AND the documents can appear or function quite differently with very minor changes, then MD5 (or any algorithm that has computable collisions) is broken. This means that we must be careful about what we sign ourselves and how we trust the signatures of others. However, if someone we trust has signed something and we know that they did not sign anything that has this sort of "dual functionality", then we can still trust that someone else has not created something malicious with the same signature and tried to pawn it off on us. Matthew Dempsky sums it up nicely: MD5 is still fine for its use as an authenticator in most scenarios.
Otto, your comment is incorrect. The original AC was correct. The two documents DO have to contain the exact same data except for the precomputed bits that have colliding MD5 sums, and those bits must both be inserted in the exact same place in the two documents.
Many have asked why we don't simply combine two different hash functions. This attack still works if we can find colliding values that collide in both functions. You could essentially think of the two combined algorithms as one algorithm with different collision properties.
To my understanding, "create two ARBITRARY documents with the same hash" is precisely what this attack demonstrated. If i'm reading the slides right (pg. 5, specifically), given strings A and B, you repeatedly generate random strings R1 and R2 until MD5(A;put(R1);) = MD5(B;put(R2);). With no restrictions on the contents of A and B, isn't this precisely the ability to create two arbitrary documents with the same hash?
Also, you mention that MD5 is still fine for its use an an authenticator for most scenarios. For instance, with a signing party that we trust not to have intentionally or unintentionally signed a document with dual functionality. But for most situations, we use a signing algorithm precisely because we don't fully trust the document author's good intentions or the signer's ability to avoid evil documents. This may be especially true of blind signatures, where we're trusting a 3rd party to verify and sign a document for us. If we really trusted the author of the document, we wouldn't need a signing algorithm in the first place (other algorithms can take care of integrity, confidentiality, etc.).
From what i know of hash collisions, and the way i'm interpreting the slides, i think Otto is correct.
Finally, i think there is potential in the combining-multiple-hash-function approach. True, the result can be considered as one algorithm with different collision properties. But to my understanding, it may be less collision-resistant than the originals... or more collision-resistant.
I don't know enough to say much about whether both documents needed to have the same content. However, if this isn't the case why would the authors of this attack go to a great deal of trouble to present a less impressive result?
Also while one can view two hash functions used together as one hash function it is strictly harder to find a collision. Why is this? Because in order to find a collision in this new combined hash function we must have a collision in both component hash functions. Of course if one does something silly and XORs them so as to not increase the number of bits then it very well may get weaker.
You seem to be comparing a hash with 2n bits to a hash with n bits and saying that it's harder to find a collision in the one with more bits. This is not very surprising. Furthermore, if the hash with 2n bits is composed of two parts -- a strong hash function and a weak hash function -- it seems like it would be less strong than simply doubling the number of bits output by the strong hash function.
If one can't prove anything about the relative strengths of either function, though, it might make sense to do what you suggest so that you aren't putting all your eggs in one basket.
This attack demonstrated that two arbitrary STRINGS with the same hash can be generated, but not two arbitrary DOCUMENTS. If you look at the actual strings that they created to have the same hash, it should be clear that they are not particularly useful on their own as documents in any standard format.
Have a closer look at slide 5. The authors concatenated the random strings R1 and R2 to the SAME preamble, not different strings A and B as you suggest. Again, this works because of the length extension property.
MD5 is not the same thing as a signature. This is an important distinction. MD5 is used in digital signature algorithms because it is much faster to compute the signature based on an MD5 hash than on the entire contents of a document.
I don't understand your assertions about how "we" use digital certs in "most situations." You said "we use a signing algorithm precisely because we don't fully trust the document author's good intentions or the signer's ability to avoid evil documents." How does a signature protect us from an author's evil intentions? If the author can't avoid evil documents, how does it help us if he signs them anyway? My assertion is that MD5 hashes are still useful in signature algorithms when we trust the original signatory and are verifying OTHER documents agains the signed hash of the original, or we are signing something we have vetted ourselves in order to authenticate the document.
Any technology can fail if it is used incorrectly. This is in fact the only point that the authors of the PostScript example are trying to make. It has already been shown that it is trivial to find hash collisions in MD5 by other researchers - they are simply trying to drive the point home that we have to be careful about the hashes that we sign ourselves and how we choose to trust the hashes signed by others.
As an aside, for those of you who make use of the handy md5sum utility, but who would like to use a similar utility which generates Whirlpool hashes, help is at hand courtesy of the United States Air Force Office of Special Investigations:
This is public domain software, and my all time best find of today (to quote Red Dwarf). Enjoy.
Ah, now i see what is going on in the slides... My problem was that i'd thought the two different displayed texts were in the preamble, not realizing that they are T1 and T2. My apologies for posting before really thinking through the slides.
You're right, of course, that MD5 is distinct from a signature.
It might be easiest to explain what i was trying to say about signing hashes by responding to your original post:
"However, if someone we trust has signed something and we know that they did not sign anything that has this sort of "dual functionality", then we can still trust that someone else has not created something malicious with the same signature and tried to pawn it off on us."
But sometimes we simply cannot know if a trusted party has signed something with dual-functionality. Suppose there's a fairly large open-source project. A malicious coder could contribute new code with dual functionality (or a group of like-minded coders could hide a portion of the new code in their separate contributions) to the project. It is possible to detect something like this, but the likelihood of detection decreases dramatically as the amount of code increases. They submit to the project the version with good behavior (in the context of the original example, they would submit the document that displays Caesar's recommendation letter). Suppose that eventually a new release is posted, with their evil code incorporated. The bad guys now post their own version of the new release (the document that displays the order to give Alice security clearance) on mirror sites. Now, when someone downloads from the evil mirror sites, they cannot discern that they've got evil code, even if they check its hash against the hash posted by the official project site (a source that we presumably trust).
Of course, it's always been the case that projects can be subverted by insiders. But before the demonstration of MD-5 collisions, MD-5 hashes were a good way to detect changes to code. Now there is atleast one way in which MD-5 hashes can be bypassed as a detection mechanism.
Hopefully i'm not just making a fool of myself all over again...
Your scenario is a plausible one. While the nature of open source projects makes it more likely that someone could inject a harmful piece of code in this way, that very same nature makes it much less likely to happen undetected.
Consider the same scenario for a closed-source piece of software. If a malicious developer can create 'evil' versions of compiled code, it is virtually impossible to detect in comparison to the raw source example.
Again, the important message here is that an MD5 hash should never be construed as a measure of trust. In most cases, developers don't sign the hashes of their code anyway; all they are offering is a way to verify that any code downloaded from third parties hasn't been modified. It is still up to us to decide whom to trust.
LOL. This is so cool. You know what other favorite file format could easily contain lots of junk at the end in order to generate a colliding hash, don't you?
I'm thinking of starting an new "text only email" campaign based on these results. :)
Some have suggested that it is necessary for the documents to have the same contents after the preamble. A couple of points:
a) So what? How, exactly, does this help you resist the attack? (Remember, the victim only gets one version of the document!) What you probably really meant to say is, "if I open this .ps in a programmer's editor and examine the PostScript code, I can see that there's an eq/if/else with two sets of text, which looks fishy". Yep, because--as the authors' stated in their summary--they avoided any obfuscation specifically in order to help you understand the demonstration. There's a whole bunch of ways to do this so that you could NOT immediately tell that something is wrong unless you ran everything through a disassembler and went over it line-by-line (eg an assembler JMP instruction which depends on the value of a preamble word). At that point, of course, a signature has become worthless.
b) In any case, it isn't actually essential that the remainder of the document be identical. What is demonstrated here is a simple technique ("the blind passenger attack") which makes it possible to create differently-behaving documents of arbitrary complexity, once given a single collision. It is a simple rejoinder to the claim that, because one cannot control the form of the colliding strings when finding a collision, they therefore cannot have a useful form. The existence of the blind passenger attack proves that this naïve view is false. BUT there is no guarantee that blind passenger is the only way to do this.
Consider the following. Using the same notation as that in the PDF slideshow, suppose I have:
Y1 = preamble; put(R1); put(R1); if(=) then T1 else T2;
Y2 = preamble; put(R2); put(R1); if(=) then T1 else T2;
where MD5(Y1) = MD5(Y2) but R1 != R2. Now suppose that it is essential, for some obscure reason, that my documents do not have the same ends. So I want to find:
Y1' = preamble; put(R1); put(R1); if(=) then T1 else T2; exit; abcd R3
Y2' = preamble; put(R2); put(R1); if(=) then T1 else T2; exit; abcd R4
where MD5(Y1') = MD5(Y2') but R3 != R4. (I have added enough arbitrary characters 'abcd' so that R3 and R4 start a new block.) Now we know that the two MD5 calculations are the same up to the "else T2;". By the extension property, we know they will therefore also be the same up to the " exit; abcd ". So our problem reduces to finding *arbitrary* collisions R3, R4 in an algorithm MD5_mod which is identical to MD5 except that it has a different initialisation vector. I haven't had the opportunity to examine the details of the Wang-Yu attack to see if it extends easily to this "MD5_mod", but it is perfectly feasible--indeed likely--that the exact choice of initialisation vector is irrelevant. In that case, there would be a simple method to extend "blind passenger" to the case of documents with any number of segments, in multiples of block sizes, where each segment is arbitrarily *EITHER* a structured string which is the same in both documents *OR* binary junk which differs between the two documents. The cost is one collision finding per differing segment, at a few PC-hours each. It might be possible to wriggle around the restriction of being in multiples of block sizes but I can't be bothered thinking that through at the moment.
To address your point a), the knowledge about the expected form of attack helps us resist the attack because once we know what to look for it becomes vastly more difficult to construct a pair of documents that hold up to even casual inspection. It seems that you undermine your own point by agreeing that once we're at the point of obfuscating PostScript or hand-coding in assembly "the signature has become worthless." In what scenario would we ever agree to sign the hash of a file to authenticate it if the contents are not transparent?
As for b), the only gain I could see from concatenating multiple collisions is to have more 'switches' that can trigger additional behavior. For the layman, where the attack described in the article was "Preamble + (R1 or R2) + Content", Roger's proposed attack resembles "Preamble + (R1 or R2) + Content + (R3 or R4) + .. + Content", where MD5(R3) = MD5(R4). More simply put, it is a series of identical strings concatenated with multiple sets of bits with known hash collisions. Again, this hardly approaches the same category of vulnerability as a method for taking two distinct documents and tweaking them until the hashes match.
Your arguments center on the premise that it is possible to make arbitrarily complex documents with the same hash using this technique. For MD5 to be seriously broken it must be demonstrated that it is possible to create extremely SIMPLE documents with the same hash, simple enough to pass as ordinary and innocuous.
Sorry to resurrect this ancient discussion but while I agree the fact MD5 is broken is bad, it seems to me it's not quite as bad as some people think. Obviously it means that someone can have 2 different versions of a document. But let's think about another thing. You always have to keep your own version. If you rely on the document creator to keep it for you, your being stupid because they could just 'lose' it at any time.
So let's say you have a contract. There are 2 different versions which say different things but both match the hash. You have one version which you agreed to. The creator has another version. A dispute arises and you go to court. You each present your version of the contract which is different. The judge goes WTF (well you get the idea). You realise what's going on and you call Bruce Scheier to testify. He says it's extremely unlikely this happened by accident, the creator of the document must have done this on purpose. The receiver of the document almost definitely could not have been the one who created a fake document which matched. Ergo the court trusts your version (the receivers version) not the other version (the creators version).
Now obviously if you are relying on the creator to send the document to your lawyers or other people for you then you would be in deep shit. But this is a rather silly practice. You should always sign your own documents and send them to people yourself. Ergo unless you are doing something fishy, there will only be on possible version the people who you send the documents to can receive. It doesn't matter that the creator of the documents has two different versions that match.
My point is, that while the fact MD5 is broken is bad, and obviously it would be smart not to rely on it, provided you are smart in what you do it's not really much of a risk.
sir, i need a complete working of md5 algorithm with comments side by side, pls sent it to my mail.pls send the code in c# such that it will be useful for me sir !
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.