Schneier on Security
A blog covering security and security technology.
December 18, 2007
Anonymity and the Netflix Dataset
Last year, Netflix published 100 million movie rankings by 500,000 customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders.
Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.
Their research illustrates some inherent security problems with anonymous data, but first it's important to explain what they did and did not do.
They did not reverse the anonymity of the entire Netflix dataset. What they did was reverse the anonymity of the Netflix dataset for those sampled users who also entered some movie rankings, under their own names, in the IMDb. (While IMDb's records are public, crawling the site to get them is against the IMDb's terms of service, so the researchers used a representative few to prove their algorithm.)
The point of the research was to demonstrate how little information is required to de-anonymize information in the Netflix dataset.
On one hand, isn't that sort of obvious? The risks of anonymous databases have been written about before, such as in this 2001 paper published in an IEEE journal. The researchers working with the anonymous Netflix data didn't painstakingly figure out people's identities -- as others did with the AOL search database last year -- they just compared it with an already identified subset of similar data: a standard data-mining technique.
But as opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.
Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant's telephone order database. Or Amazon's online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.
Google, with its database of users' internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine's data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.
What the University of Texas researchers demonstrate is that this process isn't hard, and doesn't require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.
The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan and Shmatikov's de-anonymization algorithm is surprisingly robust, and works with partial data, data that has been perturbed, even data with errors in it.
With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.
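The kind of matching described above can be sketched in a few lines. This is a toy illustration, not the researchers' actual algorithm: the function names, the 14-day window, the `min_gap` threshold and the sample records are all assumptions made up for the example. It scores each anonymous record by how many of the attacker's known ratings it agrees with, tolerating date error and ignoring known ratings that find no match, and only names a record if it clearly beats the runner-up.

```python
from datetime import date

def match_score(aux, record, max_days=14):
    """Count auxiliary ratings that agree with an anonymous record.

    Both arguments map movie title -> (rating, date). A rating counts as a
    match when the stars are equal and the dates are within max_days.
    """
    score = 0
    for movie, (rating, when) in aux.items():
        if movie in record:
            rec_rating, rec_when = record[movie]
            if rec_rating == rating and abs((rec_when - when).days) <= max_days:
                score += 1
    return score

def best_match(aux, dataset, min_gap=2):
    """Return the ID of the best-scoring record, or None if none stands out."""
    scores = {rid: match_score(aux, rec) for rid, rec in dataset.items()}
    if not scores:
        return None
    top_id = max(scores, key=scores.get)
    runner_up = max((s for rid, s in scores.items() if rid != top_id), default=0)
    return top_id if scores[top_id] - runner_up >= min_gap else None

# Hypothetical "anonymous" dataset: numeric ID -> {movie: (stars, date)}.
dataset = {
    101: {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 9)),
          "C": (4, date(2005, 4, 2)), "D": (1, date(2005, 5, 20))},
    102: {"A": (5, date(2005, 3, 3)), "E": (3, date(2005, 6, 1))},
    103: {"B": (4, date(2004, 1, 1)), "C": (4, date(2005, 9, 9))},
}

# Hypothetical public ratings by a named person; "Z" is noise that matches nothing.
aux = {"A": (5, date(2005, 3, 5)), "B": (2, date(2005, 3, 1)),
       "C": (4, date(2005, 4, 10)), "Z": (3, date(2005, 1, 1))}

print(best_match(aux, dataset))
```

Requiring a gap between the best and second-best score is one simple way to avoid naming someone on the strength of a single popular movie; a real attack would also weight rare movies more heavily than blockbusters.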
Other research reaches the same conclusion. Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. "In general," the researchers wrote, "few characteristics are needed to uniquely identify a person."
Stanford University researchers reported similar results using 2000 census data. It turns out that date of birth, which (unlike birthday month and day alone) sorts people into thousands of different buckets, is incredibly valuable in disambiguating people.
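To see why a full date of birth matters so much more than a birthday, here is a minimal sketch that measures what fraction of records are unique on a chosen set of fields. The `unique_fraction` helper and the five sample records are invented for illustration; they are not census data.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up records: same neighborhoods, shared birthdays, different birth years.
people = [
    {"zip": "78701", "gender": "F", "dob": "1970-02-14", "birthday": "02-14"},
    {"zip": "78701", "gender": "F", "dob": "1982-02-14", "birthday": "02-14"},
    {"zip": "78701", "gender": "M", "dob": "1970-02-14", "birthday": "02-14"},
    {"zip": "78705", "gender": "M", "dob": "1991-07-30", "birthday": "07-30"},
    {"zip": "78705", "gender": "M", "dob": "1964-07-30", "birthday": "07-30"},
]

print(unique_fraction(people, ("zip", "gender", "dob")))       # full date of birth
print(unique_fraction(people, ("zip", "gender", "birthday")))  # month and day only
```

In this toy set every record is unique once the birth year is included, while only one in five is unique on month and day alone. Run against a real table, the same count gives a quick first estimate of how exposed a supposedly anonymous release is.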
This has profound implications for releasing anonymous data. On one hand, anonymous data is an enormous boon for researchers -- AOL did a good thing when it released its anonymous dataset for research purposes, and it's sad that the CTO resigned and an entire research team was fired after the public outcry. Large anonymous databases of medical data are enormously valuable to society: for large-scale pharmacology studies, long-term follow-up studies and so on. Even anonymous telephone data makes for fascinating research.
On the other hand, in the age of wholesale surveillance, where everyone collects data on us all the time, anonymization is very fragile and riskier than it initially seems.
Like everything else in security, anonymity systems shouldn't be fielded before being subjected to adversarial attacks. We all know that it's folly to implement a cryptographic system before it's rigorously attacked; why should we expect anonymity systems to be any different? And, like everything else in security, anonymity is a trade-off. There are benefits, and there are corresponding risks.
Narayanan and Shmatikov are currently working on developing algorithms and techniques that enable the secure release of anonymous datasets like Netflix's. That's a research result we can all benefit from.
This essay originally appeared on Wired.com.
Posted on December 18, 2007 at 5:53 AM
Your kid does a school science project on some dread disease, doing most of the research online, and next thing you know your health care provider cancels your policy.
The fuss here isn't as awful as it might first appear, and while the majority of this article is accurate, I'm not sure that the tone is entirely correct.
Yes, users were de-anonymised as a result of information they made available across sites, but the key point is that in order to do this the user had to have made the information public in the first place. Researchers were able to achieve what they did *only* because someone made their opinions public.
What we need to do is teach people the value of their information and encourage them not to give it away so freely. If I make no secret of my movie opinions, I shouldn't be surprised when someone figures out who I am as a result of them.
If I have secrets to hide (dirty movies or whatever) then I should realise that this is a secret I want to hide and compartmentalise that information. The privacy leak comes from the fact that people chose to keep their secrets (the movies they rent) bound together with their public info (what they thought of certain movies).
"About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides."
Presumably for cities, this means "gender, date of birth, and municipal district", or some such thing?
For example, if you count Minneapolis as having a population of about 375,000, then I can believe the statistic, because almost all of those 375,000 people are going into something like 100*365*2 slots. At an average of 5 people per slot there will be a lot of singletons at the sparse (old) end. And in Nowheresville, MI, pretty much everyone bar twins will be unique.
But if you count Minneapolis as having a population of 3 million, and list the city as "Minneapolis-St. Paul" for all of them, then I don't believe it, because by that reckoning half the US population lives in "cities" of over 1 million people, of which very few will be unique by gender and date of birth.
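The slot arithmetic in this comment can be worked through under one simplifying assumption: that people are spread uniformly over the 100*365*2 slots. Real populations are not uniform (there are far fewer 95-year-olds than 30-year-olds), so this sketch understates the singletons at the sparse, old end. The function name is made up for the example.

```python
def expected_unique_fraction(population, slots):
    """Chance that a given person shares a slot with nobody else,
    assuming everyone lands in a slot uniformly at random."""
    return (1 - 1 / slots) ** (population - 1)

slots = 100 * 365 * 2  # 100 birth years x 365 days x 2 genders

city = expected_unique_fraction(375_000, slots)     # about 5 people per slot
metro = expected_unique_fraction(3_000_000, slots)  # about 41 people per slot
town = expected_unique_fraction(2_000, slots)       # a small town

print(f"city:  {city:.4%}")
print(f"metro: {metro:.2e}")
print(f"town:  {town:.1%}")
```

Under the uniform assumption the city figure comes out well under 1 percent and the metro figure is effectively zero, while almost everyone in the small town is unique. So how the "city" is drawn, and how uneven the age distribution is, decides whether the 50 percent figure is plausible.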
I think we've known this lesson even longer than Bruce suggests. It reminds me of getting the Japanese to send known messages in their secret code in WWII: successfully decrypting them told us we had the codes cracked, and the lesson learned was that one should only encrypt unknown data, not info where it is possible to make a good guess. The only additional info in this case is realizing that the data exists elsewhere in a clear form.
"The privacy leak comes from the fact that people chose to keep their secrets (the movies they rent) bound together with their public info (what they thought of certain movies)."
No, the privacy leak came because people rated a small number of movies on the supposedly-public IMDB, but a large number of movies on the supposedly-private Netflix. Netflix then released their database in a way which allowed users to be correlated.
So, even someone who was cleverly trying to compartmentalise their data, for example by reviewing only red-blooded patriotic movies on IMDB, while secretly telling Netflix that they want as much Robert Redford liberal claptrap as possible, has been outed as a liberal. Tough on them.
It seems to me that one needs to define 'anonymous data'. In respect to Netflix, the data they released WAS anonymous. You could not use it by itself to determine the identity of Netflix users. Furthermore, if a person did not enter online reviews in IMDB (or elsewhere) it is doubtful they could be ID'd by this method. Only by using other resources where the Netflix user had posted information, could the researchers specifically identify a person (again, using this method). It may be assumed that expanding the search to other forums and postings outside of IMDB, the researchers could easily have ID'd almost everyone on the Netflix list.
Therefore, the question is, in the age of Google and vast information gathering/storing, can ANY data be truly made anonymous and still be usable? And if not, should the rule be that companies just have to make sure the data they release is anonymized relevant to themselves (i.e. Netflix users can't be ID'd through Netflix data)?
We must, however, face the fact that privacy is dead (or, at the very best, dying fast).
Why was the release of the AOL data a good thing?
I think you misunderstood the attack. The users had rated just a handful of movies (a dozen or less) on IMDB which do not contain secrets to hide. The point you said about "compartmentalize" is exactly what users thought they had achieved, by rating some movies publicly on IMDB. But the attack de-compartmentalizes the public info with private info (netflix viewing pattern) which gives out secrets that they do not intend to share.
I'm not sure Richard Gray misunderstood. The point I took away from his comments, and one that I think is important, is that people must think about using their given names when posting in a public forum. Both about the quality and tone of their comments, as well as whether or not the volume of their discourse easily tracks against a volume of the exact same opinions entered "anonymously" somewhere else. So far, the correlation to real-world identity was based on individuals publicly ranking the same movies in 2 different databases - once with their real name attached. Based on what I've read so far, it wasn't their Netflix viewing pattern that was the issue, it was that they rated the same movies in both places. Correct me, if I've missed the point.
"What we need to do is teach people the value of their information and encourage them not to give it away so freely."
And don't ask for child benefit nor take driving lessons...
The problem is that users may think they are using their real name only in one particular public forum (perhaps to help establish a professional reputation), but techniques like this can correlate information they supplied to other sites under a pseudonym to information in the original forum. So, if your name gets attached to anything, a little data crunching can de-anonymize other data.
There's a lesson in it for all of us, but Bruce's advice was directed to people fielding anonymity systems.
"About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides."
Probably has few collisions. Add one other item like an initial, probably solves for that.
While I can appreciate the value of anonymity (anonymous systems?), I find this "research" a bit ridiculous and very much over-dramatized. There's nothing new about comparing two sets of related data and finding similarities.
I find it kind of ironic that people are identified by so little data and that anonymity is so hard, yet things like the no-fly list seem barely capable of telling anyone apart.
I think there's a difference between anonymous and pseudonymous. This comment is anonymous (unless Bruce correlates it to my IP address). All the comments ever posted by any anonymous person are attributed to the same source.
If I were to use a consistent pseudonym, such as "quincunx", or an account-number, then that's a unique ID that no other person uses (unless I give the password away, or others start posing as quincunx).
Of course, there are other ways to break down anonymous or pseudonymous data, such as analyzing the text for patterns.
I don't get it. A while ago, I read a very convincing article about how computers aren't (and possibly never will be) good at finding people (terrorists): it's the needle in the haystack situation, there are too many false positives, etc.
Today, this post leads me to think in the opposite direction: how easy it is to actually 'profile' people, how few pieces of data are needed to uniquely identify someone. Equally convincing.
I'm not trying to stir anything, guess I'm just missing something.
"...they can uniquely identify 99 percent of the records in the dataset."
The normal reading of this is clearly not true. Unless they had a known subset of either IMDB users who were in the netflix data, or a known subset of netflix users who had rated in IMDB, it's impossible. I suspect what they were saying was that among those they could match, 99% of those matches were to unique IMDB users, as opposed to being highly likely candidates for several IMDB users. This then raises the question of how accurate are their matches, and can they even in principle estimate their accuracy?
> guess I'm just missing something.
If you are missing anything, it is that the only one that can have it both ways is Murphy's Law.
luca: It's all about the costs, benefits, and risk profiles.
With terrorist identification, you need a high degree of proof, with someone you need to assume is potentially intelligently hostile, and that's a hard bar to meet.
When taking huge databases and correlating them with other huge databases, you'll make some mistakes, but if the only thing at stake is who you pay to advertise shaving cream at, you don't care.
In both cases, you're looking at the same basic accuracy rates: 80-99%, let's say. The point is that sometimes that's good enough, and sometimes it's not. No contradictions, just different contexts.
Richard Gray: While I mostly agree with you in theory, in practice the average person is not going to expect that by giving a bit of info to this site and a bit of info to that site and a bit to the other, that the sites can collaborate to determine with significant statistical accuracy some very surprising things. It goes even beyond "movie ratings"; with just a bit more work these people could make very good guesses at gender, age, political orientation, and some other things that you'd probably be quite surprised about. Again, we may only be talking 80%, but that's quite possibly more than you'd care to reveal.
Information becomes radically more than the sum of its parts quite quickly, and almost nobody's privacy framework has a way of handling that problem.
Why is the AOL research data a good thing?
I _think_ there are two different problems. Here, you've got a set of public actions S for some individual and a big "anonymous" dataset containing records of sets of "actions" on essentially the same task, and you're looking to find the single record in the dataset which best matches S, as this often turns out to be the same individual.
In the terrorism example, you've got a whole set of records of new public actions (e.g., attending a radical mosque, buying plane tickets) that by themselves may be harmless or signifiers of terrorist intent. For training, you've got a set of relationships between these public actions and terroristic status _for people who are now purely historical_ and you're trying to figure out people in your new dataset who very, very probably fall within the terroristic class from the correlative relationships between actions and status inferred from your historic dataset. When both the correlation between public actions and terrorism is weak and the penalty for false positives and negatives is very high, this is a very, very hard problem.
The terrorism issue corresponding to the video data problem is "we know an unnamed particular terrorist is trying to spread anthrax in hotdogs at the superbowl" and identifying the particular name from records of sets of people buying sports books, bacteriological culturing vessels and hotdog stands. This kind of problem is hard but probably doable with much better error scores. But it's a distinct problem to what is commonly referred to as data mining.
When I wrote "When both the correlation between public actions and terrorism is weak" what I really mean is that there is also a strong correlation between the public actions and not being a terrorist, so that public actions are poor predictors of terrorism. (Note that this strong correlation may come about either because the public action is just very loosely connected with being a terrorist or because of a heavy weighting in the general population's prior probability towards not being a terrorist.)
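The point about prior probability can be made concrete with Bayes' rule. The numbers below are purely hypothetical: a detector that is right 99 percent of the time, applied once to a trait held by one person in a million and once to a trait held by half the population.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(has trait | flagged), by Bayes' rule."""
    flagged_with_trait = prior * true_positive_rate
    flagged_without_trait = (1 - prior) * false_positive_rate
    return flagged_with_trait / (flagged_with_trait + flagged_without_trait)

# Hypothetical rates, chosen only to show the effect of the prior.
rare = posterior(prior=1e-6, true_positive_rate=0.99, false_positive_rate=0.01)
common = posterior(prior=0.5, true_positive_rate=0.99, false_positive_rate=0.01)

print(f"rare trait:   {rare:.4%} of flagged people actually have it")
print(f"common trait: {common:.0%} of flagged people actually have it")
```

With the same detector, nearly everyone flagged for the rare trait is a false positive, while the common trait is predicted almost perfectly. Matching one person's records across two databases sits closer to the second case, because the question is which record fits best rather than whether anyone in the population belongs to a vanishingly rare class.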
There's a (harmless) bug in this page's HTML code.
Just shows how hard it is to get coding right :-)
The link to the words "some of" has two "" at the end (one would be enough).
Or is its presence a secret signal (hidden on this web-site)?
P.S. Huaah - I am not even able to post the name of this double HTML tag.
It's automagically removed from my post.
The no fly list is not about keeping people off of airplanes. The no fly list is about intimidation of a large segment of the population.
And what is the purpose of the characterization of the Netflix and IMDB data? Is it to identify terrorists? Crooks? Or, is it for some targeted marketing scheme? It seems they do such a good job of targeted marketing but they can't seem to stop when I don't buy (where is that same data analysis?). In fact, when I don't buy, they call more times, send double or triple the advertisements, and 10 to 100 times as much spam. Hubris. The only thing we are good at is not doing very well with large amounts of data. More data will be lost, discarded, or ignored than ever used.
If you take the time to read the research, what you will find is that it's a pretty poor proof of concept.
They didn't de-anonymize much of anything. In other words, if it was your Netflix account they tried to de-anonymize, you don't have anything to worry about because they don't have a clue as to who you are.
About the best they can do is say something along the lines of, "there is someone in your neighborhood who owns a car and mows the lawn on the weekend." Or for you urban dwellers, "walks to work and crosses streets in your neighborhood."
This is akin to saying that someone who looks like Osama bin Laden is living in the Pakistani mountains and therefore we have found him.
The evidence they produce is at best circumstantial, at worst, negligent, incompetent and immaterial.
In the 1980s, when privacy issues were getting more popular as discussion topics, I realized the way to protect your info wasn't to try to "hide" but to do the opposite. Whenever possible "release" to the public numerous birth dates, release your name as John B. Smith, John R. Smith, Jon A. Smith, enter your zip code with minor typos - use your driver's license number for your social security number. Put your information "out there" with so many permutations that over the years instead of fighting the battle to protect your data you're actually making it impossible to effectively use it.
We have two contrary problems: we have no privacy and, in other news, there is no accepted, satisfactory way to determine identity. Odd, no?
More constructively, can we use something based on the reported research to achieve IDs? It would be something like establishing a reputation, I think.
This is all very well, however in the Netflix example I can't think of an effective way of anonymizing the data set while leaving its needed properties intact for the competition.
Removing the movie titles doesn't seem like it would help much, since from the imdb data you could just relate similar ratings at similar times.
So you gain a little anonymity, for a bigger loss in terms of not being able to use movie titles in your analysis.
From my thinking, if something is properly anonymized then the relations between items are going to be pretty much random. Therefore you may as well just train on random data.
Otherwise it seems that you are going to be able to statistically rebuild the original dataset; you just might need more outside data than you would in the non-anonymized case. Which doesn't seem like much of a problem if you want to go mine it.
Most people who have commented here seem to have completely not grasped the point being made by Mr. Schneier.
Mr. Schneier stated in the opening lines that "[The researchers] did not reverse the anonymity of the entire Netflix dataset". In plain English, this means that using the Netflix data ALONE it is still assumed statistically IMPOSSIBLE to identify any of the people in the data. Therefore, all people who are debating whether it is possible to identify a person from the date of birth, area, and gender ALONE are debating something that is already known to be statistically impossible and so need not debate any more.
So what did Mr. Schneier actually talk about?
What he actually talked about was what happens when publicly released anonymised data is CORRELATED with some other known data, and to do this you have to be in possession of a second set of known data. Given this second set of known data, anonymous data can be statistically sorted and matched with a very high degree of precision.
Let's create a real example. Let's create "Fred", an employee at Amazon.com, of which IMDB is an affiliate company. It actually happens that to contribute to some of IMDB's forums you have to have an "authenticated" account, which involves IMDB verifying who you are as a person. One method they use to do this is to tie your IMDB account to your Amazon.com purchases - i.e. an anonymous IMDB account is locked to a real person with a physical delivery address, a credit card number, and a contact telephone number. If you have an IMDB account, you can see an actual example of this in action by logging into IMDB and then clicking the URL below:
As stated before, the Netflix data by itself is meaningless. However, given that the Netflix data is data about films, Fred at Amazon.com used the NON-ANONYMISED items in the Netflix data to select some film reviews on IMDB. Fred then traces the "authenticated" account verification back to Amazon.com for each review to throw up a name (or several names). At this stage Fred still does not know the identity of the people in the Netflix list. However, in the Amazon.com database, peoples' dates of birth, gender, and area (through delivery addresses and contact telephone numbers) are known, and Fred checks these details against the Netflix data (as these are NOT anonymised) and successfully IDs his people.
Now this all seems too easy. However Fred has a HUGE problem, and this problem lies in his first step; that is, creating candidates to search the Amazon.com database for in the first place. These candidates, harvested from simply guessing matches of the Netflix data against the public film reviews posted on IMDB, are just far too numerous. But the research that Mr. Schneier talks about is able to reliably reduce the number of candidates to such a small number that Fred can check them by himself in a very short space of time. That is where the research is being shown to be successful.
Without a secondary database of information against which you can correlate the date of birth, area, and gender that appear in the anonymised data, it is virtually impossible to identify somebody reliably. However, anybody like our employee Fred who works for a company trading over the internet and accepting credit card payments has access to a secondary database and can therefore identify anonymous people virtually perfectly.
Given this, Mr. Schneier's blog was about the fact that generic employee Fred is inherently untrustworthy, and so the release of an anonymous list like the Netflix data poses a huge privacy and safety risk as Fred illegally uses his organisation's database as a secondary database to identify anonymous people. However, providing that Fred at Amazon.com (or central government, where information is held on everybody whether they like it or not) remains under tight control when he is accessing our data, we all have nothing to worry about and privacy will still be around for a very long time to come.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.