Entries Tagged "de-anonymization"


Using Language Patterns to Identify Anonymous E-Mail

Interesting research. It only works when there’s a limited number of potential authors:

To test the accuracy of their technique, Fung and his colleagues examined the Enron Email Dataset, a collection which contains over 200,000 real-life emails from 158 employees of the Enron Corporation. Using a sample of 10 emails written by each of 10 subjects (100 emails in all), they were able to identify authorship with an accuracy of 80% to 90%.
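
The authors’ actual technique builds detailed “write-prints” for each candidate author. As a rough illustration of closed-set authorship attribution in general—not their method—here is a sketch using character n-gram features and a naive Bayes classifier over a few invented training emails:

```python
# Toy closed-set authorship attribution via character n-grams and
# naive Bayes. This is NOT the paper's "write-print" technique; it
# only illustrates the general idea of picking the likeliest author
# from a small, known candidate pool.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: a few known emails per candidate author.
train_emails = [
    "see you at the mtg tmrw, pls bring the figures",
    "pls forward the report asap, thx",
    "I will be out of the office on Friday afternoon.",
    "Kindly find the attached document for your review.",
]
train_authors = ["alice", "alice", "bob", "bob"]

# Character 2- and 3-grams capture spelling and punctuation habits
# better than word features when each author has only a few emails.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    MultinomialNB(),
)
clf.fit(train_emails, train_authors)

# Attribute an "anonymous" email to the likeliest known author.
print(clf.predict(["pls send the mtg notes asap"])[0])  # likely "alice"
```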

Posted on March 14, 2011 at 5:04 AM

AT&T's iPad Security Breach

I didn’t write about the recent security breach that disclosed tens of thousands of e-mail addresses and ICC-IDs of iPad users because, well, there was nothing terribly interesting about it. It was yet another web security breach.

Right after the incident, though, I was interviewed by a reporter who wanted to know the ramifications of the breach. He specifically wanted to know whether anything could be done with those ICC-IDs, and whether the disclosure of that information was worse than people thought. He didn’t like the answer I gave him, which is that no one knows yet: that it’s too early to know the full effects of that information disclosure, that both the good guys and the bad guys would be figuring it out in the coming weeks, and that the breach likely had further security implications.

Seems like there were:

The problem is that ICC-IDs—unique serial numbers that identify each SIM card—can often be converted into IMSIs. While the ICC-ID is nonsecret—it’s often found printed on the boxes of cellphone/SIM bundles—the IMSI is somewhat secret. In theory, knowing an ICC-ID shouldn’t be enough to determine an IMSI. The phone companies do need to know which IMSI corresponds to which ICC-ID, but this should be done by looking up the values in a big database.

In practice, however, many phone companies simply calculate the IMSI from the ICC-ID. This calculation is often very simple indeed, being little more complex than “combine this hard-coded value with the last nine digits of the ICC-ID.” So while the leakage of AT&T’s customers’ ICC-IDs should be harmless, in practice, it could reveal a secret ID.

What can be done with that secret ID? Quite a lot, it turns out. The IMSI is sent by the phone to the network when first signing on to the network; it’s used by the network to figure out which call should be routed where. With someone else’s IMSI, an attacker can determine the person’s name and phone number, and even track his or her position. It also opens the door to active attacks—creating fake cell towers that a victim’s phone will connect to, enabling every call and text message to be eavesdropped.
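
The derivation the article alludes to can be made concrete with a hypothetical sketch; the network prefix, digit positions, and sample ICC-ID below are all invented for illustration and are not any carrier’s actual scheme:

```python
# Hypothetical ICC-ID -> IMSI derivation of the kind described in
# the quoted article. The prefix value and digit positions are
# invented for illustration; this is not any carrier's real scheme.
def imsi_from_iccid(iccid: str, network_prefix: str = "310410") -> str:
    digits = "".join(ch for ch in iccid if ch.isdigit())
    # "combine this hard-coded value with the last nine digits
    # of the ICC-ID" -- the weak scheme quoted above
    return network_prefix + digits[-9:]

# A made-up 19-digit ICC-ID:
print(imsi_from_iccid("8901410321111851072"))  # -> 310410111851072
```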

More to come, I’m sure.

And that’s really the point: we all want to know—right away—the effects of a security vulnerability, but often we don’t and can’t. It takes time before the full effects are known, sometimes a lot of time.

And in related news, the image redaction that went along with some of the breach reporting wasn’t very good.

Posted on June 21, 2010 at 5:27 AM

De-Anonymizing Social Network Users

Interesting paper: “A Practical Attack to De-Anonymize Social Network Users.”

Abstract. Social networking sites such as Facebook, LinkedIn, and Xing have been reporting exponential growth rates. These sites have millions of registered users, and they are interesting from a security and privacy point of view because they store large amounts of sensitive personal user data.

In this paper, we introduce a novel de-anonymization attack that exploits group membership information that is available on social networking sites. More precisely, we show that information about the group memberships of a user (i.e., the groups of a social network to which a user belongs) is often sufficient to uniquely identify this user, or, at least, to significantly reduce the set of possible candidates. To determine the group membership of a user, we leverage well-known web browser history stealing attacks. Thus, whenever a social network user visits a malicious website, this website can launch our de-anonymization attack and learn the identity of its visitors.

The implications of our attack are manifold, since it requires a low effort and has the potential to affect millions of social networking users. We perform both a theoretical analysis and empirical measurements to demonstrate the feasibility of our attack against Xing, a medium-sized social network with more than eight million members that is mainly used for business relationships. Our analysis suggests that about 42% of the users that use groups can be uniquely identified, while for 90%, we can reduce the candidate set to less than 2,912 persons. Furthermore, we explored other, larger social networks and performed experiments that suggest that users of Facebook and LinkedIn are equally vulnerable (although attacks would require more resources on the side of the attacker). An analysis of an additional five social networks indicates that they are also prone to our attack.
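
A minimal sketch of the candidate-reduction step, assuming the attacker has already crawled the network’s group member lists and has already learned, via a history-stealing attack (not shown), which group pages a visitor’s browser loaded; all users and groups here are invented:

```python
# Candidate reduction from group memberships. The history-stealing
# step that reveals which group URLs a visitor browsed is omitted;
# all users and groups are invented.
memberships = {
    "alice": {"chess", "infosec", "vintage-cars"},
    "bob":   {"chess", "hiking"},
    "carol": {"infosec", "hiking", "vintage-cars"},
}

def candidates(visited_groups: set[str]) -> list[str]:
    # A user remains a candidate only if their memberships cover
    # every group the browser history revealed.
    return [u for u, g in memberships.items() if visited_groups <= g]

print(candidates({"chess", "infosec"}))  # -> ['alice'] (unique)
print(candidates({"hiking"}))            # -> ['bob', 'carol']
```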

News article. Moral: anonymity is really, really hard—but we knew that already.

Posted on March 8, 2010 at 6:13 AM

Flash Cookies

Flash has the equivalent of cookies, and they’re hard to delete:

Unlike traditional browser cookies, Flash cookies are relatively unknown to web users, and they are not controlled through the cookie privacy controls in a browser. That means even if a user thinks they have cleared their computer of tracking objects, they most likely have not.

What’s even sneakier?

Several services even use the surreptitious data storage to reinstate traditional cookies that a user deleted, which is called ‘re-spawning’ in homage to video games where zombies come back to life even after being “killed,” the report found. So even if a user gets rid of a website’s tracking cookie, that cookie’s unique ID will be assigned back to a new cookie again using the Flash data as the “backup.”
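
The respawning logic itself is trivial. Here is a schematic sketch of what it amounts to, written in Python for readability rather than the JavaScript/ActionScript a real tracker would use, with both storage layers modeled as plain dictionaries:

```python
# Schematic of cookie "re-spawning". A real tracker implements this
# client-side in JavaScript/ActionScript; this only models the logic.
import uuid

flash_lso = {}     # Flash Local Shared Object: survives cookie clearing
http_cookies = {}  # ordinary browser cookies: the user can delete these

def get_tracking_id() -> str:
    if "uid" not in http_cookies:
        if "uid" in flash_lso:
            # Respawn: restore the deleted cookie from the Flash
            # copy, so the "new" cookie carries the old unique ID.
            http_cookies["uid"] = flash_lso["uid"]
        else:
            http_cookies["uid"] = str(uuid.uuid4())
    flash_lso["uid"] = http_cookies["uid"]  # keep the backup in sync
    return http_cookies["uid"]

first = get_tracking_id()
http_cookies.clear()               # the user "clears cookies"...
assert get_tracking_id() == first  # ...but the old ID comes right back
```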

Posted on August 17, 2009 at 6:36 AM

On the Anonymity of Home/Work Location Pairs

Interesting:

Philippe Golle and Kurt Partridge of PARC have a cute paper on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations—to a block level—identifies them uniquely.

Even if we look at the much coarser granularity of a census tract—tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract—for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.

“On the Anonymity of Home/Work Location Pairs,” by Philippe Golle and Kurt Partridge:

Abstract:

Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual’s home and workplace can both be deduced from a location trace, then the median size of the individual’s anonymity set in the U.S. working population is 1, 21, and 34,980, for locations known at the granularity of a census block, census tract, and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual’s home and work locations can both be deduced from the data. To preserve anonymity, we offer guidance for obfuscating location traces before they are disclosed.
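
The anonymity set in the abstract is simply the number of workers who share a given home/work pair. A toy version of that computation, over invented commuting pairs rather than the census data the paper uses:

```python
# Anonymity set = how many people share your (home, work) pair.
# The "population" below is invented; the paper computes this over
# U.S. Census data on where people live and work.
from collections import Counter

population = [
    ("tract-101", "tract-900"), ("tract-101", "tract-900"),
    ("tract-101", "tract-901"), ("tract-102", "tract-900"),
    ("tract-102", "tract-900"), ("tract-102", "tract-900"),
]
pair_counts = Counter(population)

def anonymity_set(home: str, work: str) -> int:
    return pair_counts[(home, work)]

# Someone commuting tract-101 -> tract-901 is unique in this toy data:
print(anonymity_set("tract-101", "tract-901"))  # -> 1
print(anonymity_set("tract-102", "tract-900"))  # -> 3
```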

This is all very troubling, given the number of location-based services springing up and the number of databases that are collecting location data.

Posted on May 21, 2009 at 6:15 AM

Identifying People using Anonymous Social Networking Data

Interesting:

Computer scientists Arvind Narayanan and Dr Vitaly Shmatikov, from the University of Texas at Austin, developed the algorithm which turned the anonymous data back into names and addresses.

The data sets are usually stripped of personally identifiable information, such as names, before they are sold to marketing companies or researchers keen to plumb them for useful information.

Before now, it was thought sufficient to remove this data to make sure that the true identities of subjects could not be reconstructed.

The algorithm developed by the pair looks at relationships between all the members of a social network—not just the immediate friends that members of these sites connect to.

Social graphs from Twitter, Flickr and LiveJournal were used in the research.

The pair found that one third of those who are on both Flickr and Twitter can be identified from the completely anonymous Twitter graph. This is despite the fact that the overlap of members between the two services is thought to be about 15%.

The researchers suggest that as social network sites become more heavily used, then people will find it increasingly difficult to maintain a veil of anonymity.

More details:

In “De-anonymizing social networks,” Narayanan and Shmatikov take an anonymous graph of the social relationships established through Twitter and find that they can actually identify many Twitter accounts based on an entirely different data source—in this case, Flickr.

One-third of users with accounts on both services could be identified on Twitter based on their Flickr connections, even when the Twitter social graph being used was completely anonymous. The point, say the authors, is that “anonymity is not sufficient for privacy when dealing with social networks,” since their scheme relies only on a social network’s topology to make the identification.

The issue is of more than academic interest, as social networks now routinely release such anonymous social graphs to advertisers and third-party apps, and government and academic researchers ask for such data to conduct research. But the data isn’t nearly as “anonymous” as those releasing it appear to think it is, and it can easily be cross-referenced to other data sets to expose user identities.

It’s not just about Twitter, either. Twitter was a proof of concept, but the idea extends to any sort of social network: phone call records, healthcare records, academic sociological datasets, etc.
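
The flavor of the attack can be sketched in a drastically simplified form (the authors’ actual algorithm uses weighted similarity scores and error correction): start from a few known “seed” mappings between the anonymous graph and a named auxiliary graph, then repeatedly map each remaining node to the named node whose already-mapped neighbors overlap most. Both toy graphs below are invented:

```python
# Toy topology-only re-identification by seed-and-propagate. This is
# a drastic simplification of the paper's algorithm; graphs invented.
anon = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
named = {"a": {"b", "c"}, "b": {"a", "c", "d"},
         "c": {"a", "b"}, "d": {"b"}}
mapping = {1: "a"}  # a single seed, obtained out of band

changed = True
while changed:
    changed = False
    for node in anon:
        if node in mapping:
            continue
        mapped_neighbors = {mapping[n] for n in anon[node] if n in mapping}
        if not mapped_neighbors:
            continue
        # Map to the unclaimed named node with the best neighbor overlap.
        best = max((c for c in named if c not in mapping.values()),
                   key=lambda c: len(named[c] & mapped_neighbors))
        mapping[node] = best
        changed = True

print(mapping)  # -> {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
```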

Here’s the paper.

Posted on April 6, 2009 at 6:51 AM

Fingerprinting Paper

Interesting paper:

Fingerprinting Blank Paper Using Commodity Scanners

Will Clarkson, Tim Weyrich, Adam Finkelstein, Nadia Heninger, Alex Halderman, and Edward W. Felten

Abstract: This paper presents a novel technique for authenticating physical documents based on random, naturally occurring imperfections in paper texture. We introduce a new method for measuring the three-dimensional surface of a page using only a commodity scanner and without modifying the document in any way. From this physical feature, we generate a concise fingerprint that uniquely identifies the document. Our technique is secure against counterfeiting and robust to harsh handling; it can be used even before any content is printed on a page. It has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods. Document identification could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.
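
The verification step can be sketched separately from the paper’s scanner-based feature extraction, which is the hard part and is omitted here: fingerprints behave like noisy biometric bit vectors compared by fractional Hamming distance against a threshold. The bit length, noise rate, and threshold below are illustrative, not the paper’s:

```python
# Verification-step sketch only: compare an enrolled paper
# fingerprint to a fresh scan via fractional Hamming distance, as in
# biometric matching. The paper's surface-texture feature extraction
# is omitted; the bits below are random stand-ins.
import random
random.seed(0)

N = 3072  # fingerprint length in bits (illustrative)
enrolled = [random.getrandbits(1) for _ in range(N)]
# Rescanning the same page gives the same bits plus a little noise:
rescan = [b ^ (random.random() < 0.05) for b in enrolled]
other = [random.getrandbits(1) for _ in range(N)]  # a different page

def hamming_fraction(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

THRESHOLD = 0.25  # well under the ~0.5 expected for unrelated pages
print(hamming_fraction(enrolled, rescan) < THRESHOLD)  # True: same page
print(hamming_fraction(enrolled, other) < THRESHOLD)   # False: different
```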

Posted on March 19, 2009 at 6:07 AM

Defeating Caller ID Blocking

TrapCall is a new service that reveals the caller ID on anonymous or blocked calls:

TrapCall instructs new customers to reprogram their cellphones to send all rejected, missed and unanswered calls to TrapCall’s own toll-free number. If the user sees an incoming call with Caller ID blocked, he just presses the button on the phone that would normally send it to voicemail. The call invisibly loops through TelTech’s system, then back to the user’s phone, this time with the caller’s number displayed as the Caller ID.

There’s more:

In addition to the free service, branded Fly Trap, a $10-per-month upgrade called Mouse Trap provides human-created transcripts of voicemail messages, and in some cases uses text messaging to send you the name of the caller—information not normally available to wireless customers. Mouse Trap will also send you text messages with the numbers of people who call while your phone is powered off, even if they don’t leave a message.

With the $25-a-month Bear Trap upgrade, you can also automatically record your incoming calls, and get text messages with the billing name and street address of some of your callers, which TelTech says is derived from commercial databases.

Posted on February 26, 2009 at 12:53 PM

The NSA Teams Up with the Chinese Government to Limit Internet Anonymity

Definitely strange bedfellows:

A United Nations agency is quietly drafting technical standards, proposed by the Chinese government, to define methods of tracing the original source of Internet communications and potentially curbing the ability of users to remain anonymous.

The U.S. National Security Agency is also participating in the “IP Traceback” drafting group, named Q6/17, which is meeting next week in Geneva to work on the traceback proposal. Members of Q6/17 have declined to release key documents, and meetings are closed to the public.

[…]

A second, apparently leaked ITU document offers surveillance and monitoring justifications that seem well-suited to repressive regimes:

A political opponent to a government publishes articles putting the government in an unfavorable light. The government, having a law against any opposition, tries to identify the source of the negative articles but, the articles having been published via a proxy server, is unable to do so, protecting the anonymity of the author.

This is being sold as a way to go after the bad guys, but it won’t help. Here’s Steve Bellovin on that issue:

First, very few attacks these days use spoofed source addresses; the real IP address already tells you where the attack is coming from. Second, in case of a DDoS attack, there are too many sources; you can’t do anything with the information. Third, the machine attacking you is almost certainly someone else’s hacked machine and tracking them down (and getting them to clean it up) is itself time-consuming.

Traceback is most useful in monitoring the activities of large masses of people. But of course, that’s why the Chinese and the NSA are so interested in this proposal in the first place.

It’s hard to figure out what the endgame is; the U.N. doesn’t have the authority to impose Internet standards on anyone. In any case, this idea is counter to the U.N. Universal Declaration of Human Rights, Article 19: “Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.” In the U.S., it’s counter to the First Amendment, which has long permitted anonymous speech. On the other hand, basic human and constitutional rights have been jettisoned left and right in the years after 9/11; why should this be any different?

But when the Chinese government and the NSA get together to enhance their ability to spy on us all, you have to wonder what’s gone wrong with the world.

Posted on September 18, 2008 at 6:34 AM

Anonymity and the Netflix Dataset

Last year, Netflix published 100 million movie rankings by some 480,000 customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders.

Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.

Their research (.pdf) illustrates some inherent security problems with anonymous data, but first it’s important to explain what they did and did not do.

They did not reverse the anonymity of the entire Netflix dataset. What they did was reverse the anonymity of the Netflix dataset for those sampled users who also entered some movie rankings, under their own names, in the IMDb. (While IMDb’s records are public, crawling the site to get them is against the IMDb’s terms of service, so the researchers used a representative few to prove their algorithm.)

The point of the research was to demonstrate how little information is required to de-anonymize information in the Netflix dataset.

On one hand, isn’t that sort of obvious? The risks of anonymous databases have been written about before, such as in this 2001 paper published in an IEEE journal. The researchers working with the anonymous Netflix data didn’t painstakingly figure out people’s identities—as others did with the AOL search database last year—they just compared it with an already identified subset of similar data: a standard data-mining technique.

But as opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.

Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone-order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.

Google, with its database of users’ internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.

What the University of Texas researchers demonstrate is that this process isn’t hard, and doesn’t require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.

The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan and Shmatikov’s de-anonymization algorithm is surprisingly robust, and works with partial data, data that has been perturbed, and even data with errors in it.

With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.
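
A much-simplified sketch of that kind of scoring, using the stated tolerances (ratings may be somewhat off, dates may err by up to two weeks); the actual algorithm also weights rare movies far more heavily than popular ones, and every record below is invented:

```python
# Simplified Narayanan-Shmatikov-style record linkage: link an
# auxiliary (IMDb-like) record to the anonymized record scoring
# highest, tolerating imperfect ratings and +/-14 days of timestamp
# error. Real scoring also down-weights popular movies; data invented.
from datetime import date, timedelta

TOL = timedelta(days=14)

def score(aux: dict, anon: dict) -> int:
    # One point per movie whose rating and date both roughly agree.
    return sum(
        1
        for movie, (rating, when) in aux.items()
        if movie in anon
        and abs(anon[movie][0] - rating) <= 1
        and abs(anon[movie][1] - when) <= TOL
    )

anon_records = {
    "rec-17": {"Brazil": (5, date(2005, 3, 1)),
               "Heat": (3, date(2005, 4, 2))},
    "rec-42": {"Brazil": (2, date(2005, 9, 9)),
               "Alien": (4, date(2005, 1, 5))},
}
aux = {"Brazil": (5, date(2005, 3, 7)), "Heat": (4, date(2005, 4, 10))}

best = max(anon_records, key=lambda r: score(aux, anon_records[r]))
print(best)  # -> rec-17
```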

Other research reaches the same conclusion. Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. “In general,” she wrote, “few characteristics are needed to uniquely identify a person.”

Stanford University researchers reported similar results using 2000 census data. It turns out that date of birth, which (unlike birthday month and day alone) sorts people into thousands of different buckets, is incredibly valuable in disambiguating people.
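
This kind of uniqueness measurement is easy to reproduce on any table of records. A toy version, counting how many rows share each (ZIP, gender, date-of-birth) triple; the rows are invented, whereas Sweeney ran the equivalent query against census data:

```python
# How identifying is (ZIP, gender, date of birth)? For each record,
# count how many records share its quasi-identifier. Rows invented.
from collections import Counter

records = [
    ("02139", "F", "1971-07-14"),
    ("02139", "M", "1971-07-14"),
    ("02139", "F", "1980-01-02"),
    ("90210", "F", "1971-07-14"),
]
counts = Counter(records)

unique = sum(1 for r in records if counts[r] == 1)
print(f"{unique}/{len(records)} records are unique on ZIP+sex+DOB")
# -> 4/4: even this tiny table has no two rows sharing all three.
```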

This has profound implications for releasing anonymous data. On one hand, anonymous data is an enormous boon for researchers—AOL did a good thing when it released its anonymous dataset for research purposes, and it’s sad that the CTO resigned and an entire research team was fired after the public outcry. Large anonymous databases of medical data are enormously valuable to society: for large-scale pharmacology studies, long-term follow-up studies and so on. Even anonymous telephone data makes for fascinating research.

On the other hand, in the age of wholesale surveillance, where everyone collects data on us all the time, anonymization is very fragile and riskier than it initially seems.

Like everything else in security, anonymity systems shouldn’t be fielded before being subjected to adversarial attacks. We all know that it’s folly to implement a cryptographic system before it’s rigorously attacked; why should we expect anonymity systems to be any different? And, like everything else in security, anonymity is a trade-off. There are benefits, and there are corresponding risks.

Narayanan and Shmatikov are currently working on developing algorithms and techniques that enable the secure release of anonymous datasets like Netflix’s. That’s a research result we can all benefit from.

This essay originally appeared on Wired.com.

Posted on December 18, 2007 at 5:53 AM
