MD5 and SHA-1 Still Used in 2018

Last week, the Scientific Working Group on Digital Evidence published a draft document -- "SWGDE Position on the Use of MD5 and SHA1 Hash Algorithms in Digital and Multimedia Forensics" -- where it accepts the use of MD5 and SHA-1 in digital forensics applications:

While SWGDE promotes the adoption of SHA2 and SHA3 by vendors and practitioners, the MD5 and SHA1 algorithms remain acceptable for integrity verification and file identification applications in digital forensics. Because of known limitations of the MD5 and SHA1 algorithms, only SHA2 and SHA3 are appropriate for digital signatures and other security applications.

This is technically correct: the current state of cryptanalysis against MD5 and SHA-1 allows for collisions, but not for pre-images. Still, it's really bad form to accept these algorithms for any purpose. I'm sure the group is dealing with legacy applications, but I would like it to really push those application vendors to update their hash functions.
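For the integrity-verification use case, recording a stronger digest alongside a legacy one costs very little. The following is a minimal sketch, not anything SWGDE or a forensic vendor prescribes; the function name `file_digests` and its defaults are hypothetical. It reads a file once and returns MD5, SHA-1 and SHA-256 digests together.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at `path`, reading it once.

    Hypothetical helper for illustration. Note that hashlib.new("md5") can be
    unavailable on builds restricted to FIPS-approved algorithms.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large evidence files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Storing all three values lets existing MD5-based tooling keep working while a collision-resistant digest is available for any later dispute.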

Posted on December 24, 2018 at 6:25 AM • 22 Comments

Comments

slow device • December 24, 2018 6:47 AM

"remain acceptable for integrity verification and file identification applications"

For those purposes MD5 has the advantage of being fast and small.

tfb • December 24, 2018 9:32 AM

Pretty sure git still uses SHA1. I convinced myself a while ago that this was OK, and in any case a lot of people are clearly accepting SHA1 for that purpose. And given that git repos record history, they will probably have to accept it forever: you could reconstruct a git repo with a better hash function, but then that repo wouldn't match any of its clones in any simple way. Perhaps the new objects in the repo could store the old hashes as part of themselves (which would mean that blobs had to be structured objects), or have names that were a concatenation of the new hash with the old one, or something.
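To see why changing the hash renames every object, it helps to look at how a blob gets its name. The sketch below reflects my understanding that Git hashes a short header (`blob`, the content length, and a NUL byte) followed by the file content; the `algorithm` parameter is purely illustrative and is not claimed to match the format of Git's actual hash transition.

```python
import hashlib


def git_blob_id(content: bytes, algorithm: str = "sha1") -> str:
    """Compute a Git-style blob name: hash of b"blob <length>\\0" + content.

    The `algorithm` argument is a hypothetical knob for illustration only.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(algorithm, header + content).hexdigest()


# The empty blob has a well-known SHA-1 name in every Git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The same content under a different hash gets an unrelated name, which is
# why a re-hashed repository would not line up with its existing clones.
print(git_blob_id(b"", "sha256"))
```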

Impossibly Stupid • December 24, 2018 10:29 AM

Some of the problem with "legacy applications" though is that the hash is the only data that is available for the system to work with. I wrote a system 15 years ago that audited files flowing through it (most of which eventually got deleted), and part of the de-duplication check was an MD5 checksum. That means I essentially can't eliminate the old algorithm, because it would mean the existing database would be unusable. At best I could start to transition it to using a newer algorithm, but MD5 would still have to be supported as the fallback. For the time being, for my purposes, it just isn't worth it, especially when there will likely be an even newer, better hash that comes along in 5-10 years and I'd have to manage yet another transition.
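The transition described here, where new records get a stronger hash but MD5 stays as the fallback for old ones, might look roughly like the sketch below. Everything in it (the `DedupIndex` class, its method names, the in-memory dictionaries) is hypothetical and stands in for whatever database the real system uses.

```python
import hashlib


class DedupIndex:
    """Hypothetical de-duplication index with a legacy MD5 fallback."""

    def __init__(self):
        self.by_sha256 = {}  # new-style entries
        self.by_md5 = {}     # legacy entries, plus MD5 of new entries

    def add_legacy(self, md5_hex, record_id):
        # Old records: the file is gone, only its MD5 was ever stored.
        self.by_md5[md5_hex] = record_id

    def add(self, data, record_id):
        # New records: store both digests so old and new lookups agree.
        self.by_sha256[hashlib.sha256(data).hexdigest()] = record_id
        self.by_md5[hashlib.md5(data).hexdigest()] = record_id

    def lookup(self, data):
        """Return (record_id, which_hash_matched), or None if unseen."""
        record_id = self.by_sha256.get(hashlib.sha256(data).hexdigest())
        if record_id is not None:
            return record_id, "sha256"
        record_id = self.by_md5.get(hashlib.md5(data).hexdigest())
        if record_id is not None:
            return record_id, "md5-legacy"
        return None
```

Reporting which hash matched lets the caller treat an MD5-only match as weaker evidence of duplication than a SHA-256 match.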

So, from a legacy standpoint, perhaps the issue to be solved is not how to eliminate older algorithms, but rather how newer algorithms should be approached/written so that they can best leverage/incorporate all the effort that went into establishing the older algorithms. If you want to push for eliminating MD5, push for an algorithm that is not only stronger but also with output that can be easily used to "degrade" into something that will significantly match an MD5 hash. Really, though, I don't see how it could be worth the effort of working with that sort of constraint. Far better is to simply accept that there will still be detritus in the digital age, and just deal with it in ways that make sense.

CallMeLateForSupper • December 24, 2018 10:55 AM

Have a look at the list of penetrated sites published by Troy Hunt at
https://haveibeenpwned.com/PwnedWebsites
and weep at the number of those that used MD5 or SHA1. Sure, many of the sites that used MD5/SHA1 were breached in the 2000's or earlier, but some were breached long after those algos should have gone into the bin. One site, breached in July 2018 (THIS year!), used "a mix of salted MD5 and SHA-1 as well as unsalted MD5 passwords".

So... there are sites that play fast and loose with users' data. Do the users know this? Only *after* a breach, when it's too late. Typically a Terms of Service is pushed up front on a site - we are deluged with TOS! - but try to learn anything about security details before you "sign up" with a site and you'll likely call "no joy".

What hash(es) do they use, and how do they use them? How are passphrases handled: is the entire passphrase processed, or just the first n chars? Are embedded spaces removed? Is the passphrase converted to lower/uppercase? These details are important to know and should not be hidden from users.
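For contrast with unsalted MD5, here is a minimal sketch of password storage using a per-user random salt and an iterated key-derivation function from Python's standard library. The function names and the default iteration count are assumptions chosen for illustration, not a description of what any particular site does. The whole passphrase is used as typed: no truncation, no space stripping, no case folding.

```python
import hashlib
import hmac
import os


def hash_password(passphrase: str, iterations: int = 600_000):
    """Return (salt, iterations, digest) for storage.

    The iteration count is an assumed default; tune it to your hardware.
    """
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(passphrase: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)
```

Because the salt differs per user, two users with the same passphrase get different stored digests, and the iteration count makes each guess expensive for an attacker holding a breached database.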

65535 • December 24, 2018 3:12 PM

Ah, it is a draft proposal which has not been finalized. I have some suggestions below.

See SWGDE:
https://www.swgde.org/documents/Released%20For%20Public%20Comment/SWGDE%20Position%20on%20the%20Use%20of%20MD5%20and%20SHA1%20Hash%20Algorithms%20in%20Digital%20and%20Multimedia%20Forensics

A fairly good analogy from SWGDE:

“MD5
“Random Collision Probability (about 1 in 1.84 x 10^19) [one in 1.84 times ten raised to the nineteenth power] or One drop out of all the water on Earth.

“SHA1
“Random Collision Probability (about 1 in 1.21 x 10^24) [one in 1.21 times ten raised to the twenty-fourth power] or One drop out of all the water in the solar system” -SWGDE
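The two figures quoted above appear to be 2^64 and 2^80, that is, the square root of the number of possible 128-bit (MD5) and 160-bit (SHA-1) digests. That is my reading of the numbers, not something the quoted text states. A quick check, with a hypothetical helper name:

```python
def birthday_bound(digest_bits: int) -> int:
    """Square root of the digest space: 2^(n/2) for an n-bit hash."""
    return 2 ** (digest_bits // 2)


for name, bits in (("MD5", 128), ("SHA-1", 160)):
    print(f"{name}: 2^{bits // 2} = {birthday_bound(bits):.2e}")
# MD5: 2^64 = 1.84e+19
# SHA-1: 2^80 = 1.21e+24
```

These numbers describe accidental collisions between unrelated files. They say nothing about an adversary who constructs colliding inputs on purpose, which is the known weakness of both algorithms.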

I can see these two hash algos being used for computer files. That is all.

I am not so sure about fingerprints or other biometric methods of identifying people for criminal purposes [false positives and false negatives can vary widely].

Apple’s Touch ID on the iPhone can be hacked. Apple’s Face ID system has been shown to mix up a mother’s face and a male child’s face.

In the real world I have seen CMS systems like old TypePad and WordPress 2.5 and earlier use “salted MD5 hashes” for the WP-Admin, Editor, Contributor, Author and Subscriber roles. But those are salted MD5s. They could have been safe at the time they were deployed. I am not sure about 2018.

Now we find that most CMSes on Linux, Apache, MySQL, PHP stacks can be hacked, probably through attack vectors other than the salted MD5, but it is possible that a plain MD5 can be scammed or a collision can be made.

Thus, I am not so positive that an unsalted MD5 is a valid forensic method in 2018. There are too many MD5 crackers on the net, as CallMeLateForSupper notes.

I would suggest that plain MD5 hashes be dropped from the final SWGDE document or highly restricted.

For example, in a murder case, using an MD5 hash of a partial fingerprint as “forensic proof” to meet the “beyond a reasonable doubt” standard and convict a person is bad policy.

Maybe in common law cases, where the penalties and burden of proof are lower, a non-salted MD5 of a file matching a fingerprint image could be permitted as acceptable “forensic proof”. That is still a thin floor of ice for “forensic evidence” to stand on.

Weather • December 24, 2018 11:09 PM

1 bit has 2.5 parts of information, 0 or 1 and position; you need 2 bits to get position, as it's not a whole number.
Getting 20 zeros in a row has the same probability as 1 zero, but after one the chance it won't be a zero increases, until at about 20 zeros it's close to 1:1.1 of non-zero.

Didn't really complete the math course, so grain of salt ;)

Rob • December 25, 2018 3:54 AM

I'm the author of nsrllookup, a tool used by DFIR people to do triage on files.

NIST keeps a huge database of the hash values of known pieces of software (the National Software Reference Library Reference Data Set, or NSRL RDS). At about fifty million distinct hash values, it's a remarkably useful tool for triage: if a file's hash is found in the NSRL RDS it's very unlikely to be of much interest to an investigator.

Early in my project's history I gave the option of searching for MD5, SHA-1, or SHA-256 hash values. I eventually dropped support for SHA-1 and SHA-256, because over 99.9% of the queries I was seeing were for MD5 values.

It's difficult to overstate the degree of inertia MD5 has in the DFIR space.

Jakub Narębski • December 25, 2018 5:42 PM

@tfb

> Pretty sure git still uses SHA1.

Git uses SHA1 because a.) the transition takes longer than expected, and b.) SHA1 is a bit less safe than expected, but still quite safe for the purposes Git is using it.

Dave • December 25, 2018 11:38 PM

To see some scary crypto, look at RADIUS and the stuff run over it (MSCHAP, etc). It's like a museum of bad 1990s crypto, MD5, MD4 (!!), single DES, ECB mode, unsalted, non-iterated hashes everywhere, unilateral authentication all over the place, it's pretty much a what-not-to-do in crypto. And half the Internet is relying on it for authentication.

Wael • December 25, 2018 11:55 PM

> I'm sure the group is dealing with legacy applications

It's a bit more involved than that[1]. This applies to several technical fields as well. Take, for instance, 3GPP: how long does it take a carrier to adopt or upgrade its systems to the next revision of the specifications, let alone the newest?

[1] Cost, resources, skillsets, politics.

CallMeLateForSupper • December 27, 2018 9:14 AM

And... here we go. MD5 again. Troy Hunt's haveibeenpwned reports today:

"In October 2018, the bullion education and dealer services site GoldSilver suffered a data breach that exposed 243k unique email addresses spanning customers and mailing list subscribers. An extensive amount of personal information on customers was obtained including names, addresses, phone numbers, purchases and passwords and answers to security questions stored as MD5 hashes. In a small number of cases, passport, social security numbers and partial credit card data was also exposed."

Was salting used, I wonder?

We need web sites to be forthcoming and transparent, *from* *the* *get-go*, about their security policies, procedures and practices, so that our decisions to trust, or not trust, are informed ones.

Alain • December 27, 2018 9:18 AM

"Still, it's really bad form to accept these algorithms for any purpose."

In the video industry there's a move from MD5 to xxHash64 for speed; those hashes are used to check the integrity of footage (video files). xxHash64 is very fast, but it is not a cryptographic hash and is only 64 bits.

It's my understanding that the cryptographic nature of a hash is only "needed" when there's a need to check against deliberate changes by a "bad guy". Where am I wrong?

When needing a cryptographic hash I would look at "MarsupilamiFourteen", created by the Keccak team and based on SHA-3, but with built-in parallelism for longer messages.

Alain

Mike D. • December 27, 2018 11:30 AM

I was studying for the CompTIA Security+ test back in March, and one of the questions expected you to answer "yes, use it with Triple-DES" instead of "throw out the hardware." Practicality is an important consideration.

Honestly, MD5 and SHA1 are fine for winnowing down a list of potential candidates for file matches, which are then verified by other means.
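The winnow-then-verify approach could be sketched as follows: group files by MD5, then confirm each candidate pair with a byte-for-byte comparison so that a hash collision cannot produce a false match. The function names are hypothetical, and for brevity each file is compared only against the first file in its bucket.

```python
import filecmp
import hashlib
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def confirmed_duplicates(paths):
    """Return (first, other) pairs whose contents are byte-identical.

    MD5 only narrows the candidate list; filecmp does the real check.
    """
    buckets = defaultdict(list)
    for path in paths:
        buckets[md5_of(path)].append(path)

    confirmed = []
    for group in buckets.values():
        first = group[0]
        for other in group[1:]:
            # shallow=False forces a content comparison, not just a stat() check.
            if filecmp.cmp(first, other, shallow=False):
                confirmed.append((first, other))
    return confirmed
```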

Clive Robinson • December 27, 2018 6:30 PM

@ Alain,

> It's my understanding that the cryptographic nature of a hash is only "needed" when there's a need to check against deliberate changes by a "bad guy". Where am I wrong?

Hash is a broad term and the meaning has expanded over time.

One of the first uses was to take a block of input and reduce its size down to make indexing more efficient memory-wise, when it was priced at or more than $1/octet (yup memory used to be based on multiples of three bits...).

As part of building the compressed identifier it had to be sensitive to small changes in the input. Here it inherited "error detection techniques" from generating Error Detection and Correction Codes (EDCC). They can be very powerful when used as part of Forward Error Correction (FEC), and there are masses of papers, learned journals and books to do with EDCC. In general such Error Detecting codes are a fraction of the size of the Correction Code (think of Parity -v- Hamming) but lack the ability to be used as a hash. The correcting code, however, can in quite a few cases be used as a hash; the more errors it can correct for, generally the better it is. However, the Correcting Codes quickly become inefficient as indexes, thus most hash types follow a middle ground. That is, they detect many more error types but lack the ability to correct them.

It was realised early on that certain hash types potentially had useful properties for proofs. Thus they started a new use for hashes that quickly became appropriate for what we now call information security. They became optimized for certain properties that as far as we know had not been thought of back when they were being used for building efficient indexes or for EDCC.

I feel reasonably sure that hashes will find new niches to not just fill but expand into new knowledge and information domains. What they will be I have only the faintest glimmer of, thus your guess would be as good as mine or those of many others, human ingenuity being what it is ;-)

Wael • December 27, 2018 6:55 PM

@Clive Robinson, @Alain,

> thus your guess would be as good as mine or those of many others

My crystal ball says blockchain, next generation ;)

> it was priced at or more than $1/octet (yup memory used to be based on multiples of three bits...)

Octet, three! What gives, imperial units or what? Spiked eggnog didn't wear off yet, I guess ;)

Clive Robinson • December 27, 2018 11:29 PM

@ Wael,

> Octet, three! What gives

Ever hear of a "nine bit byte"? It goes back to the 1970s and earlier and originated out of IBM. Well, three bits is the only natural number divider of that... But more importantly it can be written with the normal numerical characters[1]. In the same way a nibble is the only sensible divider of an "eight bit byte"... But it needs a mix of numerical and alpha characters[1], which makes life harder.

That is three bits gives you a 0-7 range or "octal numbers" we still see hanging around in *nix file systems and commands. Back then common computer word sizes were 12, 24 and 36 bits. The "nine bit byte" fit in nicely with the IBM 36bit word width, dividing it by four. Likewise a three-bit group divided a 12bit word by four.

I was told way back in the late 1970's that all the easy "tri-" words like "tribit" had already been taken by the data communications people, who actually used "three level signalling", which survived and still does in RS232 serial lines (look up "brk" signal). Further, both the Greek and Latin prefixes for three are "Tri-", therefore the usual "sex/hex", "quint/penta" option was not available, and guess what, "oct" had similar issues... So as "octet" had not been grabbed by the data comms guys for obvious reasons, that's what early computer engineers used...

But also as you should know once upon a time we had only one Kilobyte (1024bytes) and Megabyte (1048576bytes) but for some reason known only to Hard Drive manufacturers they had their own definitions that inflated the storage numbers.

Then some idiot in a standards body stuck their oar in with "we have to go the SI way"...

So now we have the kibibyte and the old kilobyte name. But... whereas kilobyte was once 1024 bytes, it's now 1000, and we are supposed to use kibi for 1024, which is shortened to KiB, but... SI uses the small k for measures of a thousand to avoid issues with electrical units...

But we find in reality

1, K informal abbreviation for kibibyte or 1024 bytes, and people still say "K bytes", or even "kilobytes" not "kibibytes" when talking about RAM and other IC based storage.

2, k supposedly the informal abbreviation for kilobyte or 1000 bytes, just not used (except by those inflatory HD persons).


[1] Back in the 1940s through to 1960s and beyond, a cheap input device was a mechanical telephone dial with a little debounce it was easy to wire into a counter circuit. Such as a single 14pin 74 series TTL chip or earlier Norbits and valve circuits. If you look at early "Operator console" photos you will sometimes see them. Thus octal input was a "norm". Further when writing code checking for "0..7" often it is just two lines of assembler as it is unambiguous. Not so hexadecimal, which is ambiguous due to alpha case. That is you actually have to check three ranges "0..9","a..f" and "A..F" and do different translations. Whilst ASCII is not too bad there were other character sets that did not play as nicely (see ITA-2 international 5bit telegraphy code amongst others).
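The point about octal needing one range check where hexadecimal needs three can be illustrated in a few lines. This is a sketch in Python rather than assembler, with hypothetical function names, assuming ASCII-compatible character ordering:

```python
def octal_digit_value(ch: str) -> int:
    """One contiguous range: '0'..'7'."""
    if len(ch) == 1 and "0" <= ch <= "7":
        return ord(ch) - ord("0")
    raise ValueError(f"not an octal digit: {ch!r}")


def hex_digit_value(ch: str) -> int:
    """Three ranges, each needing its own translation."""
    if len(ch) == 1:
        if "0" <= ch <= "9":
            return ord(ch) - ord("0")
        if "a" <= ch <= "f":
            return ord(ch) - ord("a") + 10
        if "A" <= ch <= "F":
            return ord(ch) - ord("A") + 10
    raise ValueError(f"not a hex digit: {ch!r}")
```

In a character set where the letters are not contiguous or the cases are laid out differently, the hexadecimal version would need further special cases, while the octal one would only need the digits to be contiguous.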

Wael • December 27, 2018 11:55 PM

@Clive Robinson,

> That is three bits gives you a 0-7 range or "octal numbers"

Right, right, right. Got got!

CallMeLateForSupper • December 28, 2018 8:49 AM

@Clive
"Back in the 1940s through to 1960s and beyond, a cheap input device was a mechanical telephone dial with a little debounce it was easy to wire into a counter circuit."

I have several vivid memories from a visit to Chicago's Museum of Science and Industry back in 1958. A full-size, underground coal mine exhibit with walls and roof made of massive slabs of real coal was so depressingly realistic that I was repulsed beyond redemption. But the very first exhibit I encountered, near the entrance, thoroughly captured my attention(1) for half an hour or so (and ultimately changed the trajectory of my life). It was a tic-tac-toe game that one played against a computer(2). A telephone dial was the input device.

(1) My parents and sisters became bored within 30 seconds and wandered off toward the sound of a booming heartbeat.
(2) A *mainframe* computer in a separate room was visible through a window in front of the player.

Clive Robinson • December 28, 2018 12:03 PM

@ CallMeLate...,

> My parents and sisters became bored within 30 seconds and wandered off toward the sound of a booming heartbeat.

Yup, had the same sort of thing happen in the London Science Museum. Every time I pressed my nose up against a case, my sister would say something and walk, "mother would attend", and dad and myself would be given the less than subtle hint that it was time to visit a different gallery.

I remember standing there one day at a maser amplifier they had opened up on display, and being absolutely captivated by the man-made ruby crystal that ran the length of it. They also had a mercury arc rectifier running, and that was like "Dr Who" stuff to a small lad in the 60's.

Then there was the experimental hand held radar and the gold plated ping pong ball they tested it with. Oh, and a large klystron TV transmitter valve that was about five feet long and bright orange/red, and apparently cooled by "Dry Steam"; 500mW in, 50KW out, not bad for the late 1960's.

Then there was the thing that made my sister never want to visit the Science Museum again. The flash, crack and brilliant spark of the 1,000,000V spark...

Oh, and the laser rifle sight stolen by the IRA in Feb 1978. I had decided to treat myself with some of my "birthday money" and had gone to the Science Museum on my own "to get away from the family" for various reasons. The Science Museum had announced the laser sight as part of a special display and I went and saw it. I was not really that impressed by it, certainly not as much as I was a few days later when it got stolen. Oddly it was the theft that has stuck more in my mind than the actual object; it was a serious headline news item on TV. I tried hunting out a link to the story and as usual Google has the memory span of a kitten, so not much luck.

But yes, I've fond memories of the Science Museum, and very very dull memories of the Natural History Museum my sister liked and the Victoria and Albert mother liked. I never did find out what my dad liked, because he was nearly always visiting the places I liked, explaining to me what the stark display cards really meant and pointing out those little but very important bits. Although he was a Chartered Accountant by profession, looking back I can't help but feel he really wanted to be an engineer in a drafting office designing the latest and greatest bit of technology. It kind of rubbed off, and I guess I lived his dream, though he did not live to see me do it.

chris • January 18, 2019 11:23 AM

@Rob, it's not hard to write two different programs with the same MD5 hash. What happens if someone writes a harmless program, e.g. a game, with some random-looking data near the end that makes the MD5 hash come out the same as a bit of malware?
The idea is the game gets submitted to nsrllookup and marked as harmless (because it is). Then when they use the malware version anyone investigating is likely to ignore it as harmless junk.

At least allow the 0.01% of users who know what they are doing to query SHA256 hashes. And be ready to force a switch if that scenario starts happening (reject MD5 hashes with a message saying they must submit SHA256 hashes, and why).

Chris



Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.