Schneier on Security
A blog covering security and security technology.
June 20, 2006
Privacy-Enhanced Data Mining
There are a variety of encryption technologies that allow you to analyze data without knowing details of the data:
Largely by employing the head-spinning principles of cryptography, the researchers say they can ensure that law enforcement, intelligence agencies and private companies can sift through huge databases without seeing names and identifying details in the records.
For example, manifests of airplane passengers could be compared with terrorist watch lists -- without airline staff or government agents seeing the actual names on the other side's list. Only if a match were made would a computer alert each side to uncloak the record and probe further.
"If it's possible to anonymize data and produce ... the same results as clear text, why not?" John Bliss, a privacy lawyer in IBM's "entity analytics" unit, told a recent workshop on the subject at Harvard University.
This is nothing new. I've seen papers on this sort of stuff since the late 1980s. The problem is that no one in law enforcement has any incentive to use them. Privacy is rarely a technological problem; it's far more often a social or economic problem.
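A naive sketch of the watch-list matching idea from the quoted article (names are illustrative; real proposals use proper private-set-intersection protocols rather than bare hashes):

```python
import hashlib

def hashed(names):
    """Map a hash of each (normalized) name back to the local record."""
    return {hashlib.sha256(n.strip().lower().encode()).hexdigest(): n
            for n in names}

manifest = hashed(["Alice Smith", "Bob Jones", "Carol White"])  # airline's side
watchlist = hashed(["Bob Jones", "Dave Black"])                 # government's side

# Only the hash values are exchanged; a match is the cue to "uncloak"
# the underlying record and probe further.
matches = manifest.keys() & watchlist.keys()
uncloaked = [manifest[digest] for digest in matches]
```

Neither side sees the other's cleartext list, but as several commenters note, plain hashes of a small input space offer weak protection.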
Posted on June 20, 2006 at 6:26 AM
I made this post in the NSA domestic wiretap blog. The phone companies could supply some large hash of identifying information and let the NSA do all the analysis it wants to establish what it thinks are our social networks. Then the FISA court could approve of further investigation of particular numbers.
This approach is not without its own perils. Hashing has significant problems which are not so apparent.
For example, consider a number-plate tracking system. It might be tempting to store a hash of plate numbers against location/time and declare privacy guarded, but this would not be the case. As with any hash system, if you want to invade the privacy of a particular person, it's still trivial: just run the hash over their details.
However, even for data mining, we still don't win. If the input space is limited, as it is with things like names and number plates, it becomes possible to pre-compute the hashes. Pre-computing every hash from an electoral register, or every possible plate number, is at most a few days' work even for my laptop. Once I have that information, I can mine the entire dataset to whatever extent I desire.
So we guard against casual and trivial meritless searches (which is worthy), but really there is no protection against a determined data miner.
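The precomputation attack described above is cheap to demonstrate; a toy sketch with hypothetical three-character plates and a hypothetical log entry:

```python
import hashlib
import string
from itertools import product

def h(plate: str) -> str:
    return hashlib.sha256(plate.encode()).hexdigest()

# "Anonymized" tracking log: hash of plate -> (location, time)
log = {h("AB1"): ("Main St", "09:14")}

# The plate space is small, so an attacker precomputes every possible hash...
alphabet = string.ascii_uppercase + string.digits
table = {h("".join(chars)): "".join(chars)
         for chars in product(alphabet, repeat=3)}  # 36^3 plates in seconds

# ...after which the "anonymized" log is fully de-anonymized.
recovered = {table[digest]: entry for digest, entry in log.items()}
```

Real plate formats are longer, but still far too small a space to resist a determined attacker with commodity hardware.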
A significant real-world problem is also the huge amount of variation in the spelling and combinatorics of names. Don't underestimate the ability of that problem to undermine the use of hashing in the first place, for names at least.
Good point about names. I recall a print article a few years ago detailing the difficulty of tracking specific individuals through several Muslim-majority countries because of how each country spells parts of Muslim names - Osama Bin Laden was Usama ibn Laden in one place; the name matching is a lot more complicated.
The same problem applies to partial matches: e.g. if only the first four characters of a number plate are known, there is no way to hash that and search the database...
That's because there is no standard transliteration between the various forms of Arabic and English, unlike with Japanese. Anglicizing Arabic or Farsi is based on phonetics, not a one-to-one literal mapping. I remember reading about the same issue in discussions of Kadaffi / Gaddafi / Qaddafi / Qadhafi (the leader of Libya).
Soundex sorting would probably make more sense.
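A minimal Soundex sketch for the curious; note a caveat relevant to the transliteration problem above: standard Soundex keeps the first letter verbatim, so Smith/Smyth match but Osama/Usama still come out different:

```python
def soundex(name: str) -> str:
    """Classic four-character Soundex code (first letter kept verbatim)."""
    codes = {c: str(d) for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}
    name = name.lower()
    digits, prev = [], codes.get(name[0])
    for c in name[1:]:
        d = codes.get(c)
        if d is not None and d != prev:
            digits.append(d)
        if c not in "hw":          # h and w do not separate repeated codes
            prev = d
    return (name[0].upper() + "".join(digits) + "000")[:4]
```

So Soundex helps with spelling variance in the middle of a name, but transliteration differences at the first letter (Kadaffi/Gaddafi/Qaddafi) defeat it.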
Even if they had the incentive, it doesn't address the real weakness in the whole scheme: the way to find a needle in a haystack is *NOT* to pile all the hay in the world onto the stack!
I'd be wary of these claims. The technologies I see applied to name and address matching are fairly primitive. The biggest problem with this approach isn't the hash, but poor input data quality. If you have data from different sources with variations on names, you have to resolve those. For example, the following records could all match if they share the same address or other identifier, but if you had only 2-3 instead of the complete list, you wouldn't be able to assume they're the same person:
W. J. Smith
William John Smith
William John Smith II
J. Smith Jr.
W. John Smith
This kind of variance in data is very common. One time, I received 14 pieces of mail from the same company because they had 14 variations on my name. The Mike/Michael thing is one, but they also had versions with and without my middle name or initial, a misspelled last name, etc. It's amazing how much bad data is out there. I have misspellings of names and addresses on my credit report - a place you'd expect to care about data quality.
Add these kinds of complications to a lack of incentive to protect privacy and you end up with privacy being ignored.
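A sketch of how loose matching has to be to catch variants like the W. J. Smith list above (a toy heuristic, not production entity-resolution logic):

```python
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def name_parts(name: str) -> list[str]:
    """Lowercase name tokens with punctuation and suffixes removed."""
    parts = [p.strip(".").lower() for p in name.split()]
    return [p for p in parts if p not in SUFFIXES]

def compatible(p: str, q: str) -> bool:
    """Equal, or one is an initial of the other."""
    return p == q or (len(p) == 1 and q.startswith(p)) \
                  or (len(q) == 1 and p.startswith(q))

def could_match(a: str, b: str) -> bool:
    """Last names must agree; the shorter record's given names must
    appear, in order, among the longer record's given names."""
    pa, pb = name_parts(a), name_parts(b)
    if len(pa) > len(pb):
        pa, pb = pb, pa
    if not compatible(pa[-1], pb[-1]):
        return False
    it = iter(pb[:-1])
    return all(any(compatible(x, y) for y in it) for x in pa[:-1])
```

Even this loose test misses misspellings (Smith/Smyth) and says nothing about whether two compatible records actually are the same person, which is why the address or another identifier is needed to confirm.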
Financial institutions have regulatory reasons to separate the people doing selection of individuals for marketing offers from the people who handle the actual names and addresses. For law enforcement purposes, they can always claim some sort of boogeyman as a reason why they should not be held to the same standards as marketing people, as low as those standards may be.
You are assuming that the people who are mining the data have control of the hashes generated by the data providers.
If you use a unique salt for the hash generated for each record, then the agency would require the original record from the data provider in order to match suspicious records to identifying data. Generating a hash from other collected data will not yield the correct index into the sanitized data.
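A sketch of that per-record salting (SHA-256 and the record contents are illustrative):

```python
import hashlib
import secrets

def salted_record(name: str) -> tuple[str, str]:
    """Store (salt, H(salt || name)) instead of a bare H(name)."""
    salt = secrets.token_hex(16)   # unique random salt per record
    digest = hashlib.sha256((salt + name).encode()).hexdigest()
    return salt, digest

salt, digest = salted_record("Alice")

# An attacker's precomputed table of unsalted hashes never matches:
unsalted = hashlib.sha256("Alice".encode()).hexdigest()

# Only someone holding the record's salt can re-derive the digest:
rederived = hashlib.sha256((salt + "Alice").encode()).hexdigest()
```

The trade-off: the salt defeats precomputation, but it also means two data providers hashing the same name independently produce different digests, so cross-dataset matching requires cooperation from whoever holds the salts.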
This does nothing to resolve the issues about the quality of the data as identified by Mike Sherwood :P
This blog page deserves some references to techniques permitting classes of transparency in data.
I wasn't even thinking about hashing the name. The NSA just (claimed to have) wanted the from/to numbers and call time. They might have gotten the elapsed time as well (we may never know).
I was thinking along the same lines as havvok.
I read - with interest - the "Value of Privacy" article in the Jun 15th CRYPTO-GRAM. In response, I'd like to pass along a question whose answer is elusive. Nonetheless, the question itself is forcefully relevant to the absolute demands asserted in "Value of Privacy."
THE QUESTION: How does a democracy survive when it is required to protect the privacy
of an entity whose goals include suppressing privacy rights?
"no one in law enforcement has any incentive to use them"
I wouldn't say "no one". Some FBI agents were quite critical of the inability of Carnivore to produce accurate results, precisely because the laws (at that time) were intended to protect innocent people from searches that were insufficiently narrow.
> THE QUESTION: How does a democracy survive when it is required to protect the privacy of an entity whose goals include suppressing privacy rights?
By voting it out of office of course.
The problems with this start before the cryptography. But if you want to talk about the crypto, hashes are not the right way to do it. Salting the hashes makes it better, but is still more complex than you need to be. This is a good application for a one time pad.
You start with data like this (example values):

Name     Phone Number
Alice    555-0123
Bob      555-0456

You assign unique random numbers to each value in each field:

Name     Phone Number
43104    91872
67291    25038
You hand the randomized data to law enforcement, where they perform whatever link analysis they want.
Law enforcement comes back to you saying "name 43104 is suspicious". Then you look up 43104 in your database, see that was the random number you assigned to Alice, and cough up her name and number.
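The scheme reads like this in code (token width and values are arbitrary illustrations):

```python
import secrets

def pseudonymize(records):
    """Replace each distinct value with a unique random token.
    Returns the masked rows plus a lookup table (token -> real value)
    kept by the data provider for the later "uncloak" step."""
    lookup = {}   # token -> real value
    seen = {}     # real value -> token
    masked = []
    for name, phone in records:
        row = []
        for value in (name, phone):
            if value not in seen:
                token = secrets.randbelow(100000)
                while token in lookup:        # tokens must be unique
                    token = secrets.randbelow(100000)
                seen[value] = token
                lookup[token] = value
            row.append(seen[value])
        masked.append(tuple(row))
    return masked, lookup
```

Because the same value always gets the same token within one provider's dataset, link analysis on the masked data still works; but unlike a hash, the tokens carry no information an outsider can precompute.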
In the past, this sort of stuff was done manually, and was most commonly called "patient matching" in the health industry.
Last blog entry on this subject:
An academic paper I pointed to:
Like the examples you posted, when I talk on the phone, I use "peter" instead of "pete" as "pete" can sometimes be heard as "keith." So, some of the mail that comes to me is addressed to "keith."
SRD wanted a fortune for NORA. But the ideas from their whitepapers are enough to get a start in that area.
I think the problem is defining the problem to be solved. This looks like a solution in search of a problem. While it keeps the data more anonymous, it also makes it less accountable. For example, google for "David Nelson TSA" to see what happens when you have a system that is not granular enough. When you remove the ability for a person to look at all of the data they have, by separating traits from names, you require blind faith in the record number assignment.
While a random number assigned to each record is a good solution where you have a central, authoritative repository for random numbers, it's hard to do that in real life. The random number has to also be unique across the entire database. The chances of a collision increase with the database size. Also, there has to be precisely one number per person to make the matches work, so the central repository would need a way to determine which records refer to the same individual. In the real world, you're going to end up with data entry errors from numerous sources, so your records for Alice would probably look more like:
If these are all treated as unique based on a combination of the name and telephone number, relevant traits would not show up because the correlations wouldn't overlap for the same reference number. If they're unique based on name alone, all of the Alices in your database will be merged. Unique based on phone means every household is treated as an individual. Also, real world data isn't always going to have all the fields you want, so you'd have data sources without telephone numbers that need to be merged in as well.
Figuring out which records correspond to the same individual is simple in concept, but there are numerous complications to doing it in practice. This is the kind of work I've been doing for the last few years, so I've got a bit of experience with the nuances of the field. It's not a difficult thing to do, but there are countless variances in data from different sources that makes the process much more painful than one may think.
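The collision chance mentioned above is just the birthday bound; a quick sketch (the record counts and token sizes are illustrative):

```python
import math

def collision_prob(n_records: int, id_space: int) -> float:
    """Birthday-bound approximation of the probability that at least
    two records draw the same random identifier."""
    return 1.0 - math.exp(-n_records * (n_records - 1) / (2 * id_space))

# Ten million records: 32-bit tokens collide almost surely,
# 64-bit tokens almost never.
p32 = collision_prob(10**7, 2**32)
p64 = collision_prob(10**7, 2**64)
```

So the uniqueness requirement is easy to satisfy by making tokens big enough; the genuinely hard part remains deciding which real-world records refer to the same person.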
@Brian - Works for phone numbers, since these are already more-or-less unique natural identifiers. It would work for similar data elements (credit cards, e.g.). BUT I imagine the data quality problems enumerated above prevent this method (along with any of the indeterminate/salted hash methods) from working with name-type data.
I imagine the NSA folks are smart enough to figure out that they need to apply some intelligence to connect names together, beyond what marketers do - but if the output is indeterminate and random, there's too much information lost. Does Osama = 2039840293 "sound like" Usama = 3873890302? Not at all.
Are there ways to maintain that kind of "sounds like" information while maintaining privacy? Can you establish that one thing "sounds like" another (contains a similar sequence of phonemes, e.g.) while not revealing what those phonemes actually are?
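One possible sketch of "sounds like without revealing the sound": reduce each name to a phonetic code locally, then share only a keyed hash of that code, so parties holding the shared key can test phonetic equality without ever exchanging the phonemes (the crude vowel-stripping code here is a stand-in for Soundex or similar):

```python
import hashlib
import hmac

def phonetic_key(name: str, secret: bytes) -> str:
    """Crude phonetic code (drop vowels/h/w/y, collapse repeats),
    then a keyed hash of it, so only holders of the shared key can
    compute or precompute matchable values."""
    code, prev = [], None
    for c in name.upper():
        if not c.isalpha() or c in "AEIOUYHW":
            prev = None
            continue
        if c != prev:
            code.append(c)
        prev = c
    return hmac.new(secret, "".join(code).encode(), hashlib.sha256).hexdigest()
```

The keyed hash stops outsiders precomputing a dictionary, but anyone with the key still can, and phonetic codes are so coarse that matches group many distinct names together - which leaks some information and inflates false positives.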
My concern with this is not so much that access to my name is blocked. My concern lies with a government's being able to decide, based on completely arbitrary and unpublished standards, what constitutes a record in need of further research. Combine this with NSA letters demanding nobody speak of the number of records being examined and you have effectively created the illusion of protection of privacy while still granting full access to all records and all information.
Yeah, I agree. If you need to do any kind of fuzzy matching on names, using a randomized value would make that impossible. I suspect the randomized values would be enough for the NSA to do their link analysis, but for many other applications randomization makes the end goal unreachable.
I've heard of two approaches to this problem. One is to avoid sharing the entire database, instead masking out certain fields. The other was a tamperproof hardware solution: http://doi.ieeecomputersociety.org/10.1109/...
This still seems like black magic or dowsing to me. How do they determine what a profile of a "terrorist" looks like? More likely they look for people who are not "just like them". Imagine somewhere like China using this sort of technology to look for "radicals". People would be pissed. Seems to me that the same thing is going on here, but they call them "terrorists" so no one can complain.
The glib answer to your question is they use the same methods that the private sector uses to build up profiles for other data mining applications. This ignores lots of questions about privacy and efficiency. So far, the US administration is getting away with the glib answer.
There are two separate problems you're talking about. First, there's a privacy problem, because the NSA can look up anyone's records. That makes it easy to abuse the system, discover who at the Pentagon has been talking to Seymour Hersh, etc. Second, there's a quality of data problem, because this kind of system can give you an unusably high number of false positives. (Bruce has written about the problems of false positives in this kind of system.)
The second problem makes FBI/DHS/CIA's job harder, so they probably really care about fixing it. (There are still incentive problems with fixing it, but the FBI would have an easier job if it were fixed.) But the first problem can't be solved without making the FBI's job harder. This is just a tradeoff we have to live with--if the FBI can check my phone records and tap my phone as soon as my name comes up on a suspect list, they'll probably be more efficient at finding bad guys. But they'll also be more efficient at finding political enemies of the president or the FBI, if that's what they decide to do. More technical and procedural restrictions on this stuff means that the FBI can't investigate as many people, but also that their investigations are more likely to be limited to real bad guys.
I think one of the big things they're looking at is supposed to be who you've been talking to. If you've been in contact with a lot of terrorism suspects, then there's some reason to think you might be one, too.
There are obvious problems with this, but to be fair, they have a really hard problem--there are almost no examples of Al Qaida terrorists attacking targets in the US, and so it's hard to build much of a profile. And there probably aren't many AQ terrorists in the US, because we're not noticing things blowing up all the time, despite the relatively large number of obvious, soft targets.
Let's see how this approach fares when it's your own data that's being mined. What would it take for Counterpane to anonymize the data from its network of security sensors sufficiently that it could be provided to appropriately qualified researchers to analyze for long-term trends and other significant phenomena? I bet there is enough scientific gold in that mine that the NSF and other agencies would solve the economic incentives (for Counterpane, at least) with substantial grants. If the anonymization really works, why should the researchers need to be qualified in any way, even, as long as their funding checks are good?
How will you convince your customers that your operational processes are sufficiently secure that the anonymization works in real life, and doesn't contain hidden channels, backdoors and other weaknesses?
What makes anyone think that the same problems don't exist with law enforcement data?
>> Largely by employing the head-spinning principles of cryptography [...]
I didn't know hashes were that confusing.
QUOTE: "If it's possible to anonymize data and produce ... the same results as clear text"
Yes, sure thing. Considering that anti-terrorist data mining, even unanonymized/unencrypted, produces only false alarms (with an almost negligible number of correct leads), it becomes apparent that it is sooooo darn easy to make an equally efficient encrypted version.
And it also becomes apparent what will this future data mining system look like - take a peek at the future of anti-terrorist technology here:
"The glib answer to your question is they use the same methods that the private sector uses to build up profiles for other data mining applications."
So how many terrorists has the public sector found?
It just sounds like voodoo to me.
Uhm... of course this is easy to do. What else would they do, sift through the databases row by row? A simple program can analyse a database without access by humans.
It isn't rocket science...
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.