Reidentifying Anonymous Data

Latanya Sweeney has demonstrated how easy it can be to identify people from their birth date, gender, and zip code. The anonymous data she reidentified happened to be DNA data, but that’s not relevant to her methods or results.

Of the 1,130 volunteers Sweeney and her team reviewed, about 579 provided zip code, date of birth and gender, the three key pieces of information she needs to identify anonymous people combined with information from voter rolls or other public records. Of these, Sweeney succeeded in naming 241, or 42% of the total. The Personal Genome Project confirmed that 97% of the names matched those in its database if nicknames and first name variations were included.
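The linkage step described above can be sketched in a few lines of Python. This is a minimal illustration of joining on zip code, birth date, and gender, not Sweeney's actual method or code. Every record, name, and identifier below (anonymous_profiles, voter_roll, link, and so on) is invented for the example, and the sketch assumes the fields are already normalized to the same format in both data sets.

```python
from collections import defaultdict

# Hypothetical "anonymous" profiles: no names, only quasi-identifiers.
# All values are invented for illustration.
anonymous_profiles = [
    {"profile_id": "p1", "zip": "02138", "birth_date": "1970-03-14", "gender": "F"},
    {"profile_id": "p2", "zip": "02139", "birth_date": "1985-11-02", "gender": "M"},
    {"profile_id": "p3", "zip": "02139", "birth_date": "1990-07-21", "gender": "F"},
]

# Hypothetical public records (for example, a voter roll) that carry names.
voter_roll = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1970-03-14", "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1985-11-02", "gender": "M"},
    {"name": "Robert Sample", "zip": "02139", "birth_date": "1985-11-02", "gender": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_date": "1962-01-30", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def key_of(record):
    """Return the (zip, birth_date, gender) tuple used as the join key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(profiles, public_records):
    """Map profile_id -> name for profiles with exactly one public match."""
    names_by_key = defaultdict(list)
    for record in public_records:
        names_by_key[key_of(record)].append(record["name"])

    named = {}
    for profile in profiles:
        candidates = names_by_key.get(key_of(profile), [])
        # Only a unique match counts as a reidentification; zero or
        # several candidates leave the profile unnamed.
        if len(candidates) == 1:
            named[profile["profile_id"]] = candidates[0]
    return named


matches = link(anonymous_profiles, voter_roll)
print(matches)
print(f"{len(matches)} of {len(anonymous_profiles)} profiles named")
```

In this toy data only p1 is named: p2 matches two people on the roll and p3 matches none. For the figures quoted above, 241 of 579 is about 41.6%, so the 42% is a share of the 579 volunteers who supplied all three fields, not of all 1,130.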

Her results are described here.

Tags: anonymity, de-anonymization, DNA, privacy

Posted on May 8, 2013 at 1:54 PM • 10 Comments

Comments

Madeleine Ball • May 8, 2013 5:38 PM

Unfortunately the article has several inaccuracies, the most important of which is a misrepresentation of the Personal Genome Project: we describe our project as “non-anonymous”, and participants should understand that their data is highly identifiable. See my blog post here:
http://blog.personalgenomes.org/2013/05/02/a-very-personal-genome-project/

We do not encourage participants to scrub data; if they are uncomfortable with being identifiable, we strongly advise them to consider leaving the project. Melissa Gymrek and Yaniv Erlich have demonstrated that genetic data alone is highly identifying. Thus our participants are a select group that are comfortable with the risk of being identified. (Indeed, many skip the suspense and identify themselves from the outset.) It’s a radically different approach to sharing biomedical research data, and I’d be happy to tell you more about it.

Reporting on the Sweeney group’s work was also criticized here: http://blogs.law.harvard.edu/infolaw/2013/05/01/reporting-fail-the-reidentification-of-personal-genome-project-participants/

signalsnatcher • May 8, 2013 10:54 PM

The core of the article is that birthdate, gender, postcode and other seemingly non-personal data can be aggregated and used to search public records to identify individuals.

For this very reason, Australian electoral rolls have recently been removed from public access.

Some years ago I was employed to do “targeted research on individuals”. When pursuing data in the US it always amazed me how much personal information was public – car registrations, building plans, local tax records – and how much could be accessed for nominal fees.

Even if you don’t have a Facebook page, your friends do. Local media report on sporting and social activities and they like to name local identities in stories. You may be gossiped about in forums and chat rooms. Google Image Search is surprisingly efficient.

On the other hand, it is an order of magnitude more difficult to research individuals in the European Union. The difference? Government regulation.

Paul • May 9, 2013 4:48 AM

Maybe a legal approach, in jurisdictions where there is data protection legislation, would be to make de-anonymized data subject to the similar data protection rules that already apply – that it must be fairly obtained, accessible/correctable by the subject, used for the purposes for which originally given, etc. Actually, if you de-anonymize data, you are by definition processing personally identifying information, so the act is probably already covered. I suspect lawyers may attempt to create wiggle room around “fairly obtained”…

People should probably be allowed to opt out – though I can see a problem with any central register (like Do Not Call) of people who don’t want to be identified :-/ and a case-by-case opt-out system runs the risk of the subject never realizing they have been included in a de-anonymized database in the first place.

Interesting times…

Dirk Praet • May 9, 2013 4:55 AM

@ signalsnatcher

The difference? Government regulation

Correct. And that’s exactly why mostly US companies such as Amazon, Google, Facebook and Yahoo have been lobbying fiercely against reform of EU legislation addressing the demand of EU citizens to ensure their right to data protection and privacy is upheld, as called for by the European Parliament. The USG has also lobbied heavily to ensure that its own laws, particularly surveillance and counter-terrorism laws, are not hampered in any way by the new rules.

I quote just one example: “The right to be forgotten would allow EU citizens to force companies, such as social networks and search engines, to delete data held on them that was inaccurate or no longer relevant — effectively removing traces of their past lives from the Web. Google is currently fighting a European court battle that could determine whether the right to be forgotten is feasible under European law.

The EU Data Protection Regulation is set to be voted on by June.

See http://www.zdnet.com/eu-under-pressure-for-new-data-privacy-law-changes-u-s-tech-firms-breathe-sigh-of-relief-7000012235/ .

Chelloveck • May 9, 2013 9:30 AM

Thanks to Madeleine Ball for the explanation and link to the refutation. For a moment I was worried. Not that an individual could be identified from birthdate, gender, and postcode; it seems obvious to me that those data alone would narrow the field to a very small group of matching individuals. Rather, I was worried by the idea that anyone dealing with anonymized data might actually consider the inclusion of such information as non-identifying. I appreciate the clarification.

John Jorsett • May 10, 2013 12:20 PM

This is why I routinely lie in my answers to even innocuous-seeming questions posed by merchants, web sites, supermarkets, etc.

Q • May 10, 2013 3:18 PM

I wonder about the purpose of asking for birth dates for most uses. Birth year, sure. Someone’s rough age is relevant to many things. But it’s rare that you need to know exactly the date upon which someone was born.

Jack S • May 12, 2013 10:06 AM

EFF had a nice writeup on this concept a few years ago:

https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy

Madeleine Ball • May 14, 2013 1:52 PM

Answer for Q: Health records often don’t allow for “only year” input as they’re designed with the assumption of a high level of privacy. Currently, instances of this birth date data in the PGP public data originate from optionally imported health record data. Participants haven’t been asked for birth dates directly (but it would be good to add this feature).

iDeals • September 11, 2015 5:34 AM

The same techniques could be used to identify people in various surveys and records, pharmacy purchases, or from a wide variety of seemingly anonymous activities such as Internet searches.
