Computer scientists Arvind Narayanan and Dr Vitaly Shmatikov, from the University of Texas at Austin, developed an algorithm that turns anonymous data back into names and addresses.
The data sets are usually stripped of personally identifiable information, such as names, before they are sold to marketing companies or researchers keen to mine them for useful information.
Until now, removing this information was thought to be enough to ensure that the true identities of subjects could not be reconstructed.
The algorithm developed by the pair looks at relationships between all the members of a social network—not just the immediate friends that members of these sites connect to.
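To make the idea of matching on relationships alone more concrete, here is a toy sketch in Python. It is not the researchers' algorithm. It assumes a handful of accounts have already been matched by other means (the `seeds`), and the function name `propagate`, the example graphs and the "at least two supporting neighbours" threshold are all illustrative assumptions. It only shows how a match might spread through a graph using structure and nothing else.

```python
def propagate(anon, aux, seeds):
    """Extend a seed mapping (anonymous node -> named node) using only graph structure.

    anon, aux: adjacency dicts {node: set_of_neighbours} for undirected graphs.
    seeds: a few anonymous nodes whose named counterparts are assumed known.
    """
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        used = set(mapping.values())
        for u in anon:
            if u in mapping:
                continue
            # Score each unused named node by how many of u's already-mapped
            # neighbours correspond to one of its own neighbours.
            scores = {}
            for n in anon[u]:
                if n in mapping:
                    for cand in aux[mapping[n]]:
                        if cand not in used:
                            scores[cand] = scores.get(cand, 0) + 1
            if not scores:
                continue
            ranked = sorted(scores.values(), reverse=True)
            best_score = ranked[0]
            runner_up = ranked[1] if len(ranked) > 1 else 0
            # Hypothetical acceptance rule: at least two supporting neighbours
            # and a strictly better score than any other candidate.
            if best_score >= 2 and best_score > runner_up:
                best = max(scores, key=scores.get)
                mapping[u] = best
                used.add(best)
                changed = True
    return mapping


# Made-up "named" graph (standing in for a public network with identities).
aux = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}
# The same structure with names replaced by numbers (the "anonymous" release).
anon = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}

mapping = propagate(anon, aux, seeds={1: "alice", 2: "bob"})
print(mapping)
```

In this made-up example, two known matches are enough to pin down nodes 3 and 4, while node 5, which has only one connection, stays unmatched under the two-neighbour threshold.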
Social graphs from Twitter, Flickr and LiveJournal were used in the research.
The pair found that one third of those who are on both Flickr and Twitter can be identified from the completely anonymous Twitter graph. This is despite the fact that the overlap of members between the two services is thought to be about 15%.
The researchers suggest that as social network sites become more heavily used, people will find it increasingly difficult to maintain a veil of anonymity.
In “De-anonymizing social networks,” Narayanan and Shmatikov take an anonymous graph of the social relationships established through Twitter and find that they can actually identify many Twitter accounts based on an entirely different data source—in this case, Flickr.
One-third of users with accounts on both services could be identified on Twitter based on their Flickr connections, even when the Twitter social graph being used was completely anonymous. The point, say the authors, is that “anonymity is not sufficient for privacy when dealing with social networks,” since their scheme relies only on a social network’s topology to make the identification.
The issue is of more than academic interest, as social networks now routinely release such anonymous social graphs to advertisers and third-party apps, and government and academic researchers ask for such data to conduct research. But the data isn’t nearly as “anonymous” as those releasing it appear to think it is, and it can easily be cross-referenced to other data sets to expose user identities.
It’s not just about Twitter, either. Twitter was a proof of concept, but the idea extends to any sort of social network: phone call records, healthcare records, academic sociological datasets, etc.
Here’s the paper.