Risks of Not Understanding a One-Way Function

New York City officials anonymized license plate data by hashing the individual plate numbers with MD5. (I know, they shouldn't have used MD5, but ignore that for a moment.) Because they didn't attach long random strings to the plate numbers -- i.e., salt -- it was trivially easy to hash all valid license plate numbers and deanonymize all the data.

Of course, this technique is not news.

ArsTechnica article. Hacker News thread.

Posted on June 25, 2014 at 6:36 AM • 33 Comments

Comments

HannoJune 25, 2014 7:05 AM

I don't know exactly how US license plates look like, but I assume the number of variations is quite limited (let's say in the thousands). Then even salting the hash wouldn't have helped a lot - it would have slowed down the attack, but it'd still probably be feasible on any normal computer.

RonKJune 25, 2014 7:13 AM

@ Hanno

From WP:
> there were an estimated 254.4 million registered passenger
> vehicles in the United States according to a 2007 DOT study.

So, assuming they did the salting properly, it might be OK.

WillJune 25, 2014 7:18 AM

(Naieve Assumption) - Just because the final hash is (say) BCrypt with a few thousand rounds, doesn't mean the input was the actual password.

Surely feeding a chain of hash functions into each other, and setting very high round numbers would make it infeasable with current technology (even GPU based crackers like Hashcat) to brute-force the correct path.
Enough rounds and chains would surely make it difficult even with a known plaintext


....buut, all of this could've been avoided and made a lot simpler had they just generated guids for each licence plate instead.

wiredogJune 25, 2014 7:19 AM

License plates in Virginia are generally 3 letters+4 numbers, so 1,757,60000 possible combinations. Specialized plates ("Clean Fuel, Kids First, NASM, etc.) probably reduce the attack space a bit.

RonKJune 25, 2014 7:21 AM

> So, assuming they did the salting properly, it might be OK.

Oops, if they do it properly, there is no dependence on the entropy of plate numbers.

BazJune 25, 2014 7:38 AM

Re combinations - It's not generic license plates, it was NYC taxi numbers. There's less than 14,000 of those in active service. And it's not even a matter of generating strings matching the format, you can download a list, including owners and driver names (http://www.nyc.gov/html/tlc/html/industry/current_licensees.shtml).

MikeJune 25, 2014 7:42 AM

@Hanno, if you read the article, what they were trying to hide wasn't simply vehicle license plate numbers, but the license numbers for NYC cabs. That's a much smaller set than "all registered passenger vehicles in the United states" - fewer than 100,000.

swrJune 25, 2014 7:45 AM

Salt would not have solved the problem, as salt is not secret. If it is secret, it is a key. The right cryptographic primitive here is a keyed hash. Even the dubious concatenate-and-MD5 keyed hash would probably have sufficed here.

But...

If they don't want the details revealed, why is that information included at all, hashed or otherwise? A sequentially incrementing number would probably have sufficed if they needed a unique identifier. Shuffling the items before numbering them would completely eliminate any possibility of information leakage through that field.

Solving problems with cryptography is hard. Not having the problem in the first place is better.

name.withheld.for.obvious.reasonsJune 25, 2014 8:37 AM

Not that this is about one-way functions but as it touches several other topics recently (the other about substitution attack) I chose this thread to spew some opinions about what I "understand" may be possible. Feel free to correct me if I am wrong--doesn't reach to far into my ego-sack (or did I mean to say ego-stack; guess I pushed that one a bit far). And I am sure this issue has been keyboarded to death...

I'm going to have to go back to something I've mentioned before--with at rest encryption there is an obvious scheme that prevents unwinding the clock so to speak. For example, using MS word's encryption feature, I encrypt all my documents and save them to disk--I email them occasional to others and the do the same. Classically, as at rest encryption, it is/was considered unreasonable to collect all the documents encrypted under MS word--at best it used to be a one-to-many or many-to-one proposition (MITM/monitoring/duplicating). But, if NSA policy is to store all encrypted documents then they've essentially accomplished "the many-to-many" mapping so to speak. Why is it a mapping? As the salts are stored in MS word files (as to my example) for generation of hashes, hashes leading to keys, and keys leading to deconvolution...I now have a way to "functionally" attack a cryptographic scheme or system. Or at least, a method for beginning to crack a cryptographic scheme using functional analysis.

Sorry for the simplification--it is my limited way of speaking to a perceived aspect of security engineering.

FOKJune 25, 2014 9:46 AM

I am wondering how would you properly anonymize such data.
For me it looks like mission impossible. There will be always enough data to undo it with small effort.
Even if you replace cab number by completely random data but keep it consistent across the set, then attacker needs only small sample of actual location for each cab he wants to reveal. It might be cab's usual stand or knowing where he did drop off at specific time and you can link this information to anonymized data.

x11794AJune 25, 2014 10:37 AM

@FOK: I fully agree. It's not clear why they even need the hash of the information they're trying to "throw away", but generally the entropy in time-location data is so high that a super small subset of the data makes up a unique fingerprint. My intuition is that the only way to anonymize a data-set like this would be to remove the ID parameter at all and just have a bunch of trips taken by anonymous cabs (of course, maybe the trips by a given cab are what they actually care about, in which case no way to really anonymize it in a way that isn't simple to reverse with a small amount of extra information).

Even then, I wouldn't be surprised if it's possible to correlate endpoint and startpoint data to re-associate individual trips with one another with some high probability. Considering that you know which cabs are actually en route at what times and where they were when they last dropped someone off, you can eliminate any cab that currently has a fare and any cab that was too far away from the pickup spot to get from their last fare to their next one in time. If the period of time without a fare is sufficiently low and the resolution of the "pickup time" were sufficiently high, it would be trivial to uniquely identify the history of a given cab.

Depending on the nature of taxi routes (whether they are randomly assigned or if a given person has a certain "area", or has "regulars"), you may even be able to stitch together data where the endpoints and startpoints aren't necessarily going to be correlated (e.g. data taken on different days), by "fingerprinting" the routes and areas covered.

WinterJune 25, 2014 11:30 AM

Anonimizing is hard (or even impossible?)

Maybe that should be the message: Do not try to anonimize complex data. Only publish data that could bear deanonimization.

For those that know what they are doing, there are metrics to quantify the fraction of individuals that can be identified (almost) uniquely from agregate data like, e.g., M/F, age, and ZIP code.

AnuraJune 25, 2014 12:12 PM

Rule number 1: Unless you have good reason otherwise, remove any attributes of an individual completely. Sure, it may be possible to correlate pickups and dropoffs, but I'm not so sure that's very useful. If the data itself is to be made available, it's probably not that important to keep secret.

If someone wants statistics for age and sex by postal code, aggregate it:

SELECT BirthYear, Sex, COUNT(*) FROM People GROUP BY BirthYear, Sex
(I hope you don't store Age in your database)

TitoJune 25, 2014 1:14 PM

Even with salting, this would be trivial to crack. Based on the blog post linked to (medium.com), the input entropy for the function is about 2 million possible values (NYC License plates). The tool oclHashcat on basic commodity hardware (1 video card) can do ~10.5 billion MD5 attempts per second. Some simple division puts this in the range of 5,000 actual breaks per second. This means a list of 1 million unique license plates are cracked in about 3 and a half minutes.

All a good salt can do is change the attack from "cracking all values at once" to "cracking one value at a time". If the crack is still subsecond, salting doesn't really help.


The fundamental mistake here is that you cannot hash over a low entropy input.


This is the reason passwords need to be run through bcrypt, pbdkd2 or other iterative hash to at least attempt to slow down cracking.

One right way to anonymize this would have been to fully tokenize the plates. A true token is mathematically independent of the original data. You have to have the :lookup list" to be able to reverse the mapping. So, for each plate, you generate a 16 byte cryptographically random number and put the two (plate,number) in a list. Every time a plate comes in, you look it up and replace it with the number.

Done this way, there isn't any relationship between the token and the plate. You'd have to get the mapping list to be able to reverse them.

Of course, you can also attack the anonymization through the other data, but this would at least take care of the plate hashing issue.

AlexJune 25, 2014 2:54 PM

Looks like too much cryptography means no cryptography, they should have used the record id column, since noone has access to database that was secret enough (unless any kind of order, of course).

RonKJune 25, 2014 3:36 PM

I have the distinct impression that what Bruce means by salting is "adding a long secret string before hashing" which is being misunderstood by many here who are assuming he meant salting as it is used in authentication schemes (where the added string cannot be a secret from the person checking the authentication).

GodelJune 25, 2014 6:35 PM

@swr
"Salt would not have solved the problem, as salt is not secret. If it is secret, it is a key. The right cryptographic primitive here is a keyed hash."

I think you're being overly picky. If it uses a hash algorithm and the process is not reversible then it's a hash in my book, whether the salt is secret or not.

EgadJune 25, 2014 7:03 PM

I'm sure this has been asked many times, so just a pointer to the answer would be appreciated.

However, can hash functions applied to large, non-reproducible files be used to generate sufficiently random strings for use as passwords, salts, etc.?

E.g., if one took a picture of foliage moving around in the breeze, SHA-whatevered the JPG, and then snipped 32 hex characters out of the hash, would that have the security of a truly random 128-bit number?

AnuraJune 25, 2014 7:39 PM

@Egad

The hash function in this case just normalizes the data to be psuedorandom, in fact it doesn't necessarily need to be a cryptographic hash for that situation, it just needs to be a good psuedorandom function. Sources of entropy do not necessarily have a statistically random distribution, but have a certain amount of unpredictability to it (in other words, if it would take me n guesses on average to guess the string it will have (log_2(2*n)) bits of entropy, so if it takes 128 guesses, it has 8 bits of entropy). This unpredictability is what we consider to be true randomness. So if you take a picture and it cannot be guessed with a probability greater than 2^127 (assuming the guesser knows exactly what the picture is of), then you hash it, then it is effectively a 128-bit random number.

Cameras, as it turns out, can actually be good sources of entropy. You can point a camera at a white wall, take as many pictures as you want, without getting two identical pictures. The problem is that alot of the noise that you care about can be lost when encoding, especially if you are using JPEG, and when you lose that data you can lose the noise. If you can get the raw signal from the CCD, that would be ideal.

Just a word of caution: don't roll your own. When random number generators are concerned, determining entropy is hard. We know there is entropy in the image, but we don't necessarily know how much. We may even think there is a lot when there really isn't. This is why you should use as many different sources of entropy as you can, and you should use an algorithm like Fortuna so that if you ever have enough entropy, the output cannot be determined without recovering the state through a side-channel.

FigureitoutJune 26, 2014 12:31 AM

Just a word of caution: don't roll your own.
Anura
--I wonder about this advice. Wouldn't it be equally as hard to determine my entropy sources (written on paper, taken at "random" times in the day). I occasionally think up some entropy sources for fun from time-to-time; for instance you mentioned taking a picture, what about bird droppings on concrete put on an X-Y plane (not near a powerline BTW)? Or slime paths of snails reflected in sunlight? Or paths of ants just outside their underground hole? I could go on but I'd be silly. Almost as if, everyone following their own way of gathering entropy, even if it had a pattern, would in and of itself provide entropy...Such a frustrating topic and the definition isn't even defined so it's just...I don't know, I'm inclined to say random doesn't exist.

ConfusedJune 26, 2014 2:31 AM

Errm

OK so interesting techno/crypto geekdom here but what is the risk concern

Licence plates by definition are surely public information

Even in bulk if not liked to any PII then surely meaningless

Unless there is more data here then we have been told about then the only association is that they are for cabs

BarneyJune 26, 2014 6:22 AM

Confused: The data set that was published wasn't just licence plates, it was a log of taxi movements. The history of movements of a particular taxi and taxi driver is not supposed to be public data. They are linked to PII - for one thing the taxi driver's medallion numbers were included, also badly anonymised, and these allow looking up names.

BlitheringJune 26, 2014 6:24 AM

Dynamic pLates with one time codes determined upon starting the vehicle and hourly thereafter having public keys and.... what are we doing here?

BarneyJune 26, 2014 6:33 AM

Two things I haven't seen on this story: Any comment from NYC Taxi drivers, and any discussion of whether the people that have published this data set could or should be either asked or ordered to stop making it available.

I understand of course that it's not possible to delete all copies of the data in existence, but perhaps the easy to find links on the web should be removed, to make it a bit harder to find personal information about taxi drivers and users.

squarooticusJune 26, 2014 11:26 AM

I disagree about the reflexive dismissal of MD5. Yes, MD5 is easy to compute and engineered collisions are possible, but I hardly think that matters in this case. The only reason not to use MD5 (vs SHA-512, etc.) here is to condition people not to use MD5 anywhere because there are uses for which MD5's reduced collision-resistance/cryptographic strength matters. There are valid arguments for wanting to use a secret-keyed BCrypt here, but they are the same vs. all cryptographic hashes.

mooJune 26, 2014 2:44 PM

@Anura, Figureitout:

This stuff always reminds me of lavarnd. Basically you need webcam(s) and some lava lamp(s) and then the usual software to sample and mix the entropy, estimate it, etc. The lava lamp(s) are nice because the goop inside them moves around in seemingly chaotic and unpredictable ways. You could get some entropy from this macro movement in addition to the micro entropy of the CCD (noise).

AnuraJune 26, 2014 2:57 PM

@moo

If you want to get crazy, get a high-quality HD camera, pointed at all of the following:

Lava Lamp
Smoke from a smoke Machine
Multiple fans, the kind that turn from right to left (to blow the smoke, they don't necessarily have to be in-camera)
Old TV tuned to static
You can also attach some lights behind the fans, so the light flickers and reflects off the smoke

Also, might as well capture the audio too (and since your TV is tuned to static, might as well turn that up).

AnuraJune 26, 2014 6:37 PM

Well, the real point is to combine as many sources of entropy as possible. So while that circuit may be a good source of randomness, I wouldn't rely on it alone. My proposal is intentionally silly, but it does provide a few different sources:

1) The density of the smoke is unpredictable, providing a constant, unpredictable visual change

2) Fans don't provide perfectly constant motion, making the currents themselves chaotic. This also affects the smoke

3) The light from the fan effects which portions of the smoke are brightest, and because the direction is changing with small variations in speed, the actual illumination is not predictable

4) The static from the TV is not predictable

5) The lava lamp, while slow moving is not predictable over time

6) The noise from the CCD is not predictable

7) The audio picking up the TV static and fan noise is not predictable

While many of the sources are related, the net effect should be a significant amount of entropy per second. An attacker might be able to manipulate some individual effects if they have control over, for example, the power signal going to the room, the fact that there are so many sources makes it pretty much impossible to manipulate all of them in a useful manner.

So go ahead and add a Geiger counter and radioactive source, take the nanoseconds from an atomic clock every time you hash, add a thermal noise based RNG, add your avalanche-based RNG, add a radio receiver tuned to a polluted 2.4Ghz band, but I would not rely on any one of those alone, and I would still make sure the algorithm that I'm feeding cannot be attacked even if you have the last yottabyte of output and no entropy added, unless you are able to recover the state through a side channel.

FigureitoutJune 28, 2014 12:24 AM

moo RE: lava lamp randomness
--Thought some about lava lamps, had one for awhile, never made good use of it. :p

Had an evil thought unfortunately in the shower (showerthoughts, I know), I remembered what candles looked like when the wax gets hot and you have a pool of wax, the bits of the wick get attracted to the center then pushed out again over and over. When the candle goes out, the movement stops of course. Could this be applied else where..?

So, what I'm trying to say is, and this is crazy no doubt...could someone affect the randomness of the lava lamp by changing out the lamp or even worse the transformer or whatever other power supply is supplying that light bulb...Cooler temps leading to less energy and movement of the "lava".

My method (one of them, never reveal them all) is to take some normal looking output. Take like 4 other samples, conduct some random boolean operations on it, cut it up, and spread it around. I'd be more than happy to test my entropy on anyone who thinks they can predict it...lol

AnnoyedJuly 1, 2014 1:32 AM

I don't think this is so much a problem with not understanding a one-way function but moreso a problem of an under trained employee using a tool they don't understand.

Some "lowest bidder" software company probably wrote a tool under contract with NYC officials and when they delivered the tool, the training probably amounted to 1 hour of:

Just enter your license plate numbers and click "Done". See, like this.

Then these NYC officials probably turned to the closest hourly-wage employee and said "Make me a list. Don't check it though, there's no overtime pay for this project".

Leave a comment

Allowed HTML: <a href="URL"> • <em> <cite> <i> • <strong> <b> • <sub> <sup> • <ul> <ol> <li> • <blockquote> <pre>

Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of Resilient Systems, Inc.