SHA-1 Collision Found

The first collision in the SHA-1 hash function has been found.

This is not a surprise. We've all expected this for over a decade, watching computing power increase. This is why NIST selected SHA-3 as a new hash standard in 2012.

EDITED TO ADD (2/24): Website for the collision. (Yes, this brute-force example has its own website.)

EDITED TO ADD (3/7): This 2012 cost estimate was pretty accurate.

Posted on February 23, 2017 at 3:29 PM • 31 Comments

Comments

kjzdvhf • February 23, 2017 3:41 PM

Is SHA-3 really necessary, or just an attempt to have a fallback in case SHA-2 is weakened?

Rick Lobrecht • February 23, 2017 3:48 PM

The Google article Bruce links to recommends SHA-256 (part of the SHA-2 family).

Anura • February 23, 2017 4:03 PM

@kjzdvhf

The main reason to change is that SHA-2 is not secure in the general case; it is vulnerable to length extension attacks, and thus sha2(key || message) is not secure as a MAC, even though it could be with a better design. More importantly, the fact that it isn't secure in the general case can weaken encryption schemes that assume it is (hence why we use HMAC). For SHA-3, that construction is secure as long as the underlying permutation is secure, which also means flaws are less likely in protocols that rely on it.
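The sha2(key || message) point can be sketched in a few lines of Python; the key and message here are made up, and the naive tag is built only to contrast it with HMAC:

```python
import hashlib
import hmac

key = b"secret-key"           # hypothetical shared secret
msg = b"amount=100&to=alice"  # hypothetical message

# Naive MAC: sha2(key || message). SHA-256's output is its full internal
# state, so an attacker who knows this tag can resume hashing and forge a
# valid tag for msg + padding + suffix without ever knowing the key.
naive_tag = hashlib.sha256(key + msg).hexdigest()

# HMAC: H((key ^ opad) || H((key ^ ipad) || message)). The outer hash
# hides the inner state, which defeats length extension.
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# Verify with a constant-time comparison to avoid timing side channels.
ok = hmac.compare_digest(tag, hmac.new(key, msg, hashlib.sha256).hexdigest())
```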

SHA-2 was always kind of a band-aid to begin with. We knew SHA-1 was weak, if only because it was too short, and no one really wanted to reinvent the wheel at the time. So SHA-2 was created, which tweaks the compression function and increases the hash length to provide greater security and overcome some of the weaknesses. This alone should be enough for us to seek out a new standard selected through a proper process.

Scott Lewis • February 23, 2017 5:32 PM

SHA-1 is used beyond encryption as well. Several storage systems (HPE 3PAR, XtremIO and VSAN) use SHA-1 calculations on incoming writes to check for deduplication.

That's less likely to be vulnerable to a deliberate hack, and given the odds of a random in-the-wild collision it's probably not a big fear for storage administrators, but a collision certainly would cause data corruption.

Rhys • February 23, 2017 5:37 PM

Perhaps our understanding of "randomness" is flawed? And all these complexities are just devolving into race between the scale of the algorithm vs the scale of the device capabilities and/or our understanding of mathematics?

MD-5 goes back to 1992. Gone 10 years later. SHA-0 was 1993. SHA-1 was a "patch" to SHA-0 in 1995. SHA-2 was 2001. NIST made a 2006 recommendation to move to SHA-2 by 2010. SHA-3 emerged from competition in 2012, for those worried the NSA had too much invested in SHA-2.

It's more disconcerting to me that methods used for authentication, like SHA, are under such a broad spectrum of research and attack. The money spent by Google and the Dutch is only for the first solution. Each subsequent one will be increasingly less expensive, and not largely owing to Moore's law either. Rainbow tables will be following no doubt.

Keep in mind- this is only what is published. Not necessarily congruent with what is known.

These hashes are critical infrastructure for digital signatures (including documents that are e-signed), message authentication codes (MACs), and other forms of authentication.

It's a sad commentary on how little progress banks and businesses have made in updating their infrastructure. Our government loves to keep its information systems until they are Smithsonian quality. With the walled gardens of the cellular community, this was also published today: https://www.scmagazine.com/50-banking-smartphone-apps-fail-on-security/article/639775/

Thunderbird • February 23, 2017 5:40 PM

I was curious how this attack actually works--not in gruesome detail, but in general. I understand that you start with a string A and find a B and a C that can be used to make AB and AC, both of which produce the same hash. My question is whether you can do something similar given a fixed AB, or does the attack rely on being able to pick both B and C?

Anura • February 23, 2017 5:40 PM

@Scott Lewis

If all you care about is data deduplication, MD5 is perfectly fine and probably a better choice. The odds of collision are so extraordinarily low that it shouldn't be a concern at all (and you can include the file size as well), and MD5 is several times faster than SHA-2.
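The suggestion above (a fast hash plus the size as the dedup key) might look like this sketch; the store and block contents here are hypothetical:

```python
import hashlib

def dedup_key(data: bytes) -> tuple:
    # Include the length alongside the fast hash, so any collision
    # would also have to match the block size exactly.
    return (len(data), hashlib.md5(data).hexdigest())

store = {}  # (size, digest) -> stored block

def write_block(data: bytes) -> tuple:
    key = dedup_key(data)
    # Store the block only the first time this (size, digest) pair is seen.
    store.setdefault(key, data)
    return key

a = write_block(b"hello world")
b = write_block(b"hello world")   # duplicate: not stored twice
c = write_block(b"other block")
# a == b, and only two distinct blocks are kept.
```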

OldFish • February 23, 2017 6:22 PM

Interesting that two crusty old hashes, MD5 and SHA-1, would together be pretty good, since a double collision isn't looking feasible.

Lewis • February 23, 2017 6:29 PM

@Thunderbird

Hacker News has several good explanations:

nneonneo

The visual description of the colliding files, at http://shattered.io/static/pdf_format.png, is not very helpful in understanding how they produced the PDFs, so I took apart the PDFs and worked it out.

Basically, each PDF contains a single large (421,385-byte) JPG image, followed by a few PDF commands to display the JPG. The collision lives entirely in the JPG data - the PDF format is merely incidental here. Extracting out the two images shows two JPG files with different contents (but different SHA-1 hashes since the necessary prefix is missing). Each PDF consists of a common prefix (which contains the PDF header, JPG stream descriptor and some JPG headers), and a common suffix (containing image data and PDF display commands).

The header of each JPG contains a comment field, aligned such that the 16-bit length value of the field lies in the collision zone. Thus, when the collision is generated, one of the PDFs will have a longer comment field than the other. After that, they concatenate two complete JPG image streams with different image content - File 1 sees the first image stream and File 2 sees the second image stream. This is achieved by using misalignment of the comment fields to cause the first image stream to appear as a comment in File 2 (more specifically, as a sequence of comments, in order to avoid overflowing the 16-bit comment length field). Since JPGs terminate at the end-of-file (FFD9) marker, the second image stream isn't even examined in File 1 (whereas that marker is just inside a comment in File 2).

tl;dr: the two "PDFs" are just wrappers around JPGs, which each contain two independent image streams, switched by way of a variable-length comment field.

https://news.ycombinator.com/item?id=13713480
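The comment-length trick described above can be illustrated with a toy sketch; the image streams here are placeholders, and in the real attack the length bytes differ inside the collision blocks rather than being set by hand:

```python
import struct

def com_segment(payload: bytes) -> bytes:
    # JPEG comment (COM) marker 0xFFFE, followed by a 16-bit big-endian
    # length that counts the two length bytes plus the payload.
    return b"\xff\xfe" + struct.pack(">H", len(payload) + 2) + payload

image_1 = b"<image stream 1>"  # placeholder for a real JPG stream
image_2 = b"<image stream 2>"

# File 1: a 1-byte comment; the decoder then sees image_1 and stops at
# its end-of-image marker, never reaching image_2.
file_1 = com_segment(b"x") + image_1 + image_2

# File 2: identical bytes after the length field, but the length now
# covers "x" plus all of image_1, so image_1 is skipped as comment data
# and the decoder renders image_2 instead.
file_2 = (b"\xff\xfe" + struct.pack(">H", 2 + 1 + len(image_1))
          + b"x" + image_1 + image_2)

# The two files differ only in the 2-byte length field.
```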

Thoth • February 23, 2017 7:20 PM

@Clive Robinson, Nick P, Figureitout

Sadly, most security hardware would be stuck on SHA-1 for a long time due to the market so badly flooded with "legacy crypto accelerators". To my knowledge, the cheaper smart cards with limited capacities (and also the most widely available) would only have SHA-1, MD-5 (yes, the devil is still there), 3DES, DES, and RSA (hopefully not those cards with RSA-1024 and lower). Legacy is here to stay for a very long time, even though GlobalPlatform (the smart card standards body), ISO, NIST and others have long warned that the legacy algorithms on the hardware accelerators usually found in cheap smart cards should be ditched.

What it means is: do not buy the cheap smart cards just to save cash; purchase those with proper algorithm support. Writing and loading one's own SHA-2 implementation onto these legacy smart cards is possible, but the performance hit would be noticeable and the overhead on RAM and EEPROM would not be desirable.

MK • February 23, 2017 7:25 PM

If you use a SHA-1 hash when digitally signing the PDF, you can move the signature between the two files. But if you use SHA-2, you can't. PDF has defaulted to SHA-256 since Acrobat 9 (whew!).
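The point about moving a signature between colliding files can be sketched with a toy signer; the HMAC here merely stands in for a real private-key signature, since what matters is that only the digest gets signed:

```python
import hashlib
import hmac

SIGNING_KEY = b"toy-signing-key"  # stand-in for a real private key

def sign(document: bytes, hashname: str) -> bytes:
    # Real signature schemes sign the digest, not the document; this toy
    # signer models that with an HMAC over the digest.
    digest = hashlib.new(hashname, document).digest()
    return hmac.new(SIGNING_KEY, digest, hashlib.sha256).digest()

def verify(document: bytes, signature: bytes, hashname: str) -> bool:
    return hmac.compare_digest(sign(document, hashname), signature)

# Two distinct documents. With the shattered PDFs these would have equal
# SHA-1 digests, so a SHA-1 signature on one would verify on the other;
# no SHA-256 collision is known, so the transfer fails there.
doc_a = b"benign contract"
doc_b = b"malicious contract"
sig_a = sign(doc_a, "sha1")
```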

Sandro • February 24, 2017 3:04 AM

Almost everything relies on the unwavering certainty of the properties of hash algorithms: it is time to abandon the SHA family and move to Keccak (SHA-3).

Anyway, it's better to apply a timestamp or to re-sign any previous documents signed with SHA1-RSA.

Nile • February 24, 2017 7:46 AM

An orthogonal issue is the use of older hashes for 'administrative' security - deduplication, basic data integrity - rather than as secure signatures.

The problem is 'use creep': hashes that were intended as convenient tags and barcodes end up getting treated as signatures and document verification for audit purposes.

I have bitter experience of an MD5 tag used in a 'Did you get the right message?' conversation being promoted to the status of a 'secure' message verification in a messaging system that was extended into payments & settlements processing.

The worst of it was that the system was configurable, so that it could've switched over to the then-secure SHA-1; and it could now be switched to SHA-3. This is, however, a trap: the implementation is fit for purpose in a tags-and-barcodes model of data verification, in a system handling low-value data; but it's nowhere near the level of verifiable security required for a threat model where it's financially worthwhile for an adversary to solve the MD5 hashes.

That's my mistake - do the system right, and verifiably right, no matter what the initial security requirement may be, or get it off-the-shelf from somebody you know is competent.

My mistake is out there in the wild, everywhere, in all those systems which can easily swap out one hash for another - and will! - without asking whether any of the system's design assumptions still hold good in the changed security environment. An obvious question would be: "If I'm now offering an assurance to the message recipients they are now secured by SHA-3, shouldn't I check that the system inputs (like vendor patches) are equally secure?"

It's easy to say that dim administrators ought to think of that: but a well-designed configuration interface based on these foreseeable use-cases would flag it up. And it ought, at least, to be there as a warning in the manual.

Is it in yours? Is it in anyone's manuals?

Joe Random • February 24, 2017 8:46 AM

@Scott Lewis and @Anura: The real reason it shouldn't be a problem to use whatever hash you want for data deduplication is that the hash should only be used to detect differences earlier and skip needless comparisons. I'd be surprised and dismayed if data with the same hash isn't being fully compared before being considered a duplicate.
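That approach can be sketched as follows: the hash only narrows the candidate set, and a byte-for-byte comparison makes the final call (the store layout here is hypothetical):

```python
import hashlib

store = {}  # digest -> list of stored blocks sharing that digest

def dedup_write(data: bytes) -> bool:
    """Return True if the block was already stored (a true duplicate)."""
    digest = hashlib.sha1(data).hexdigest()
    # The hash only routes us to candidate blocks; a full byte-for-byte
    # compare decides, so even a deliberate SHA-1 collision cannot
    # silently corrupt data.
    for existing in store.setdefault(digest, []):
        if existing == data:
            return True
    store[digest].append(data)
    return False

first = dedup_write(b"block A")    # False: first write, stored
second = dedup_write(b"block A")   # True: genuine duplicate detected
third = dedup_write(b"block B")    # False: new block, stored
```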

Jonathan Rosenne • February 24, 2017 4:47 PM

How does this affect the use of SHA-1 in key exchange protocols where the structure of the signed file does not allow unnoticeable changes in content?

Pedant • February 24, 2017 11:24 PM

@Rhys

It's a sad commentary on how little progress banks and businesses have made in updating their infrastructure. Our government loves to keep its information systems until they are Smithsonian quality. With the walled gardens of the cellular community, this was also published today: https://www.scmagazine.com/50-banking-smartphone-apps-fail-on-security/article/639775/

Are you trying to encourage people to make software that wouldn't be easy for LEAs to break into, as required per CALEA?
If so, you're technically inciting sedition.

Defense • February 24, 2017 11:41 PM

@Thoth
Sadly, most security hardware would be stuck on SHA-1 for a long time due to the market so badly flooded with "legacy crypto accelerators".
What do you mean by "sadly"? This is required for lawful interception and is the hard-won victory of the NSA's 50-billion-dollar SigInt enabling campaigns.
If technology like Git used secure primitives instead of SHA-1, wouldn't it be harder for our brave servicemen at the NSA to implant SigInt-enabling patches in the middle of a Git tree without rebasing, which is very noticeable?

MD-5 (yes the devil is still there)
Tongue-in-cheek; Devil fits Lucifer/Skipjack better than MD5.

Figureitout • February 25, 2017 10:37 AM

@Thoth
=>Nine quintillion (9,223,372,036,854,775,808) SHA1 computations in total
=>6,500 years of CPU computation to complete the attack first phase
=>110 years of GPU computation to complete the second phase

--That doesn't scare me much. Compromising an embedded system still generally takes a physical attack, or attacking isolated systems where attackers have to get close to set up for an attack (leaving evidence in the process; if what you're protecting really matters, you're going to be watching), or attacking the endpoints you program them with. And embedded is shifting somewhat into what I call "real embedded" and "modern embedded" w/ 32-bit ARMv7 chips and the like, little monsters.

When the attacks I'm looking for start happening, then I'll be more concerned. Until then, if you had that capability you could make a lot of money secretly and be set, so why would you want all that heat?

Newbie • February 25, 2017 8:38 PM

SHA-1 is used beyond encryption as well. Several storage systems (HPE 3PAR, XtremIO and VSAN) use SHA-1 calculations on incoming writes to check for deduplication.

That's less likely to be vulnerable to a deliberate hack


What about cases where SHA-1 was assumed to be secure enough to skip comparing the whole block or the whole file? Couldn't /etc/passwd or /etc/.sshd/* be attacked? Please forgive the dumb question.

J. R. Haigh • February 26, 2017 10:50 PM

I'm not clear what it improves upon. On 2012-10-05, you quote Walker saying “If Stevens' attack of 2^60 SHA-1 operations serves as the baseline, […]”, which suggests to me that in 2012, the most efficient published attack for an SHA-1 collision would take 2^60 hash computations. (Btw., are we talking mean or maximum?)
    In 2015, a partial attack achieves 2^57.5, the SHAppening freestart collision attack on SHA-1's compression function. I don't have any idea as to how partial it is other than that a full collision attack would have higher complexity, but surely less than 2^2.5× greater complexity – otherwise, what's all the hype about?
    Then 16 months later, we see an actual collision emerge and it's claimed that it took 2^63.1 operations! Am I right in saying that this is about 8.6× more computationally expensive than if the 2^60 attack had been used?
    I suppose that Google did this elaborate chosen-prefix application to PDF files on the first published SHA-1 collision as a PR stunt, so that they can lock into history, as the ‘first ever published SHA-1 collision’, the pair of PDF files with their branding on. Right?
    So perhaps it's not the best indication of the cost per SHA-1 collision for more basic applications or those that criminal organisations would first be looking at. How much do you reckon it cost Google financially compared to rentable computation, ignoring adaptation costs?
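The arithmetic in the comment above can be checked in one line, taking the 2^63.1 and 2^60 figures as given:

```python
# Ratio between the claimed 2^63.1 collision cost and the earlier
# 2^60 baseline estimate: 2^(63.1 - 60) = 2^3.1.
ratio = 2 ** 63.1 / 2 ** 60
print(round(ratio, 1))  # about 8.6, matching the figure in the comment
```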

Inquiror • March 2, 2017 11:07 AM

I am now hearing about a so-called "hardened SHA-1" that is SHA-1 augmented with alleged collision detection: https://github.com/cr-marcstevens/sha1collisiondetection

While it has been demonstrated to detect the collision produced by Google, how effective would this be in real scenarios? The "readme" text admits the detection is only for the top attack vectors. How long before there are practical attacks using other vectors?

Smells like "feel good security theater" that will only serve to further delay moving forward to better algorithms.


Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.