Entries Tagged "anonymity"
Interesting paper: “A Practical Attack to De-Anonymize Social Network Users.”
Abstract. Social networking sites such as Facebook, LinkedIn, and Xing have been reporting exponential growth rates. These sites have millions of registered users, and they are interesting from a security and privacy point of view because they store large amounts of sensitive personal user data.
In this paper, we introduce a novel de-anonymization attack that exploits group membership information that is available on social networking sites. More precisely, we show that information about the group memberships of a user (i.e., the groups of a social network to which a user belongs) is often sufficient to uniquely identify this user, or, at least, to significantly reduce the set of possible candidates. To determine the group membership of a user, we leverage well-known web browser history stealing attacks. Thus, whenever a social network user visits a malicious website, this website can launch our de-anonymization attack and learn the identity of its visitors.
The implications of our attack are manifold, since it requires little effort and has the potential to affect millions of social networking users. We perform both a theoretical analysis and empirical measurements to demonstrate the feasibility of our attack against Xing, a medium-sized social network with more than eight million members that is mainly used for business relationships. Our analysis suggests that about 42% of the users that use groups can be uniquely identified, while for 90%, we can reduce the candidate set to less than 2,912 persons. Furthermore, we explored other, larger social networks and performed experiments that suggest that users of Facebook and LinkedIn are equally vulnerable (although attacks would require more resources on the side of the attacker). An analysis of an additional five social networks indicates that they are also prone to our attack.
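The core of the attack is simple set intersection. Here is a minimal sketch with a hypothetical member directory (the group names and users are invented, not from the paper): an attacker who learns, via history stealing, which groups a visitor belongs to intersects the public member lists of those groups.

```python
# Toy sketch of group-membership fingerprinting (hypothetical data).
# The attacker intersects the public member lists of the groups the
# visitor is known to belong to, shrinking the candidate set.

group_members = {
    "g1": {"alice", "bob", "carol"},
    "g2": {"alice", "carol", "dave"},
    "g3": {"alice", "erin"},
}

def candidate_set(observed_groups):
    """Users consistent with every observed group membership."""
    candidates = None
    for g in observed_groups:
        members = group_members[g]
        candidates = members if candidates is None else candidates & members
    return candidates or set()

print(sorted(candidate_set(["g1", "g2"])))        # ['alice', 'carol']
print(sorted(candidate_set(["g1", "g2", "g3"])))  # ['alice'] -- unique
```

With enough groups observed, the intersection often collapses to a single user, which is the mechanism behind the paper's 42% unique-identification figure.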
Universal identification is portrayed by some as the holy grail of Internet security. Anonymity is bad, the argument goes; and if we abolish it, we can ensure only the proper people have access to their own information. We’ll know who is sending us spam and who is trying to hack into corporate networks. And when there are massive denial-of-service attacks, such as those against Estonia or Georgia or South Korea, we’ll know who was responsible and take action accordingly.
The problem is that it won’t work. Any design of the Internet must allow for anonymity. Universal identification is impossible. Even attribution — knowing who is responsible for particular Internet packets — is impossible. Attempting to build such a system is futile, and will only give criminals and hackers new ways to hide.
Imagine a magic world in which every Internet packet could be traced to its origin. Even in this world, our Internet security problems wouldn’t be solved. There’s a huge gap between proving that a packet came from a particular computer and proving that it was sent by a particular person. This is exactly the problem we have with botnets, or with pedophiles storing child porn on innocents’ computers. In these cases, we know the origins of the DDoS packets and the spam; they’re from legitimate machines that have been hacked. Attribution isn’t as valuable as you might think.
Implementing an Internet without anonymity is very difficult, and causes its own problems. In order to have perfect attribution, we’d need agencies — real-world organizations — to provide Internet identity credentials based on other identification systems: passports, national identity cards, driver’s licenses, whatever. Sloppier identification systems, based on things such as credit cards, are simply too easy to subvert. We have nothing that comes close to this global identification infrastructure. Moreover, centralizing information like this actually hurts security because it makes identity theft that much more profitable a crime.
And realistically, any theoretical ideal Internet would need to allow people access even without their magic credentials. People would still use the Internet at public kiosks and at friends’ houses. People would lose their magic Internet tokens just like they lose their driver’s licenses and passports today. The legitimate bypass mechanisms would allow even more ways for criminals and hackers to subvert the system.
On top of all this, the magic attribution technology doesn’t exist. Bits are bits; they don’t come with identity information attached to them. Every software system we’ve ever invented has been successfully hacked, repeatedly. We simply don’t have anywhere near the expertise to build an airtight attribution system.
Not that it really matters. Even if everyone could trace all packets perfectly, to the person or origin and not just the computer, anonymity would still be possible. It would just take one person to set up an anonymity server. If I wanted to send a packet anonymously to someone else, I’d just route it through that server. For even greater anonymity, I could route it through multiple servers. This is called onion routing and, with appropriate cryptography and enough users, it adds anonymity back to any communications system that prohibits it.
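The layering described above can be sketched in a few lines. This is a toy illustration only: the XOR keystream below stands in for real cryptography, and actual onion routing (e.g., Tor) uses vetted ciphers and per-hop key negotiation.

```python
import hashlib

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    (Illustration only; not a real cipher.)"""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def build_onion(message: bytes, hop_keys):
    """Sender wraps the message in one layer per hop,
    innermost layer for the last hop in the path."""
    onion = message
    for key in reversed(hop_keys):
        onion = xor_stream(key, onion)
    return onion

def peel(onion: bytes, key: bytes) -> bytes:
    """Each relay removes exactly one layer with its own key."""
    return xor_stream(key, onion)

keys = [b"hop1", b"hop2", b"hop3"]   # one key per relay in the path
onion = build_onion(b"hello", keys)
for k in keys:                       # each relay peels its layer in turn
    onion = peel(onion, k)
print(onion)  # b'hello'
```

The point of the layering is that each relay only ever sees one layer: the first relay knows the sender but not the message or destination, and the last knows the destination but not the sender.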
Attempts to banish anonymity from the Internet won’t affect those savvy enough to bypass it, would cost billions, and would have only a negligible effect on security. What such attempts would do is affect the average user’s access to free speech, including those who use the Internet’s anonymity to survive: dissidents in Iran, China, and elsewhere.
Mandating universal identity and attribution is the wrong goal. Accept that there will always be anonymous speech on the Internet. Accept that you’ll never truly know where a packet came from. Work on the problems you can solve: software that’s secure in the face of whatever packet it receives, identification systems that are secure enough in the face of the risks. We can do far better at these things than we’re doing, and they’ll do more to improve security than trying to fix insoluble problems.
The whole attribution problem is very similar to the copy-protection/digital-rights-management problem. Just as it’s impossible to make specific bits not copyable, it’s impossible to know where specific bits came from. Bits are bits. They don’t naturally come with restrictions on their use attached to them, and they don’t naturally come with author information attached to them. Any attempts to circumvent this limitation will fail, and will increasingly need to be backed up by the sort of real-world police-state measures that the entertainment industry is demanding in order to make copy-protection work. That’s how China does it: police, informants, and fear.
Just as the music industry needs to learn that the world of bits requires a different business model, law enforcement and others need to understand that the old ideas of identification don’t work on the Internet. For good or for bad, whether you like it or not, there’s always going to be anonymity on the Internet.
This essay originally appeared in Information Security, as part of a point/counterpoint with Marcus Ranum. You can read Marcus’s response below my essay.
EDITED TO ADD (2/5): Microsoft’s Craig Mundie wants to abolish anonymity as well.
What Mundie is proposing is to impose authentication. He draws an analogy to automobile use. If you want to drive a car, you have to have a license (not to mention an inspection, insurance, etc). If you do something bad with that car, like break a law, there is the chance that you will lose your license and be prevented from driving in the future. In other words, there is a legal and social process for imposing discipline. Mundie imagines three tiers of Internet ID: one for people, one for machines and one for programs (which often act as proxies for the other two).
Excellent paper: “On Locational Privacy, and How to Avoid Losing it Forever.”
Some threats to locational privacy are overt: it’s evident how cameras backed by face-recognition software could be misused to track people and record their movements. In this document, we’re primarily concerned with threats to locational privacy that arise as a hidden side-effect of clearly useful location-based services.
We can’t stop the cascade of new location-based digital services. Nor would we want to — the benefits they offer are impressive. What urgently needs to change is that these systems need to be built with privacy as part of their original design. We can’t afford to have pervasive surveillance technology built into our electronic civic infrastructure by accident. We have the opportunity now to ensure that these dangers are averted.
Our contention is that the easiest and best solution to the locational privacy problem is to build systems which don’t collect the data in the first place. This sounds like an impossible requirement (how do we tell you when your friends are nearby without knowing where you and your friends are?) but in fact as we discuss below it is a reasonable objective that can be achieved with modern cryptographic techniques.
Modern cryptography actually allows civic data processing systems to be designed with a whole spectrum of privacy policies: ranging from complete anonymity to limited anonymity to support law enforcement. But we need to ensure that systems aren’t being built right at the zero-privacy, everything-is-recorded end of that spectrum, simply because that’s the path of easiest implementation.
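As a toy illustration of that design principle (the coordinates, grid size, and scheme are invented for this sketch, not taken from the paper): two friends who share a secret can send the service a keyed hash of a coarse location grid cell, so the service learns whether the cells match but never where the cells are.

```python
import hmac
import hashlib

GRID = 0.01  # roughly 1 km cells at mid-latitudes (illustrative)

def cell_id(lat: float, lon: float) -> bytes:
    """Quantize a coordinate to a grid-cell identifier."""
    return f"{round(lat / GRID)}:{round(lon / GRID)}".encode()

def cell_token(shared_key: bytes, lat: float, lon: float) -> str:
    """Keyed hash of the cell; the server sees only this opaque token."""
    return hmac.new(shared_key, cell_id(lat, lon), hashlib.sha256).hexdigest()

key = b"secret shared by the two friends"
alice = cell_token(key, 48.8566, 2.3522)
bob   = cell_token(key, 48.8570, 2.3519)   # same cell, same token
carol = cell_token(key, 40.7128, -74.0060)
print(alice == bob)    # True: the server learns "nearby," not where
print(alice == carol)  # False
```

A real system would also have to handle friends who straddle adjacent cell boundaries (for instance by hashing neighboring cells too), but the privacy property is the point: the matching happens without the raw location ever being collected.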
I’ve already written about wholesale surveillance.
Computer scientists Arvind Narayanan and Vitaly Shmatikov, from the University of Texas at Austin, developed the algorithm that turned the anonymous data back into names and addresses.
The data sets are usually stripped of personally identifiable information, such as names, before they are sold to marketing companies or researchers keen to plumb them for useful information.
Until now, removing this data was thought sufficient to ensure that the true identities of subjects could not be reconstructed.
The algorithm developed by the pair looks at relationships between all the members of a social network — not just the immediate friends that members of these sites connect to.
Social graphs from Twitter, Flickr and Live Journal were used in the research.
The pair found that one third of those who are on both Flickr and Twitter can be identified from the completely anonymous Twitter graph. This is despite the fact that the overlap of members between the two services is thought to be about 15%.
The researchers suggest that as social network sites become more heavily used, people will find it increasingly difficult to maintain a veil of anonymity.
In “De-anonymizing social networks,” Narayanan and Shmatikov take an anonymous graph of the social relationships established through Twitter and find that they can actually identify many Twitter accounts based on an entirely different data source—in this case, Flickr.
One-third of users with accounts on both services could be identified on Twitter based on their Flickr connections, even when the Twitter social graph being used was completely anonymous. The point, say the authors, is that “anonymity is not sufficient for privacy when dealing with social networks,” since their scheme relies only on a social network’s topology to make the identification.
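A toy sketch of the topology idea follows (the graphs and seed matches are invented, and this greatly simplifies the actual algorithm): starting from a few known "seed" matches, each anonymous node is scored against the named candidates by how many of its neighbors map to the candidate's neighbors, with node degree as a tie-breaker.

```python
# Toy sketch of seed-based topology matching (hypothetical graphs).
anon_graph = {   # anonymized Twitter-like graph: node -> neighbors
    1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2},
}
named_graph = {  # public Flickr-like graph with real names
    "ann": {"ben", "cat"}, "ben": {"ann", "cat", "dan"},
    "cat": {"ann", "ben"}, "dan": {"ben"},
}
seeds = {1: "ann", 2: "ben"}  # known matches (e.g. from public profiles)

def best_match(anon_node):
    """Named node whose neighborhood best explains the anonymous node."""
    mapped = {seeds[n] for n in anon_graph[anon_node] if n in seeds}
    def score(name):
        overlap = len(mapped & named_graph[name])
        degree_gap = abs(len(named_graph[name]) - len(anon_graph[anon_node]))
        return (overlap, -degree_gap)   # prefer overlap, then similar degree
    candidates = [n for n in named_graph if n not in seeds.values()]
    return max(candidates, key=score)

print(best_match(3))  # 'cat': neighbors {1, 2} map to {'ann', 'ben'}
print(best_match(4))  # 'dan': same overlap as 'cat', but degree matches
```

The real algorithm propagates matches iteratively, so each newly identified node becomes a seed for identifying its neighbors; that is why a small seed set suffices to unmask a large fraction of the graph.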
The issue is of more than academic interest, as social networks now routinely release such anonymous social graphs to advertisers and third-party apps, and government and academic researchers ask for such data to conduct research. But the data isn’t nearly as “anonymous” as those releasing it appear to think it is, and it can easily be cross-referenced to other data sets to expose user identities.
It’s not just about Twitter, either. Twitter was a proof of concept, but the idea extends to any sort of social network: phone call records, healthcare records, academic sociological datasets, etc.
Here’s the paper.
Fascinating history of an illegal industry:
Today’s schemes are technologically demanding and extremely complex. It starts with the renting of computer servers in several countries. First, the carders are active, obtaining credit cards and client identities illicitly. These data are then passed to the forgers, who manufacture convincing official documents that can be used as identification. The identities and credit card details are then sold as credit card kits to operators. There is also an alternative where no stolen credit card is needed: in the U.S. one can buy so-called Visa or MasterCard gift cards, prepaid with a certain amount of money, though these are usually usable only in the U.S. Since these gift cards can be bought anonymously, they are used to pay over the Internet under fake identities. Using a false identity and a working credit card, servers are then rented and domains purchased in the name of an existing, unsuspecting person. Most of the time an ID is required, and in that case they simply send a forged document. There is yet another alternative: a payment system called WebMoney (webmoney.ru) that is as widespread in Eastern Europe as PayPal is in Western Europe. Again, accounts are opened under false identities. Then the business is very simple in Eastern Europe: one buys domains and rents servers via WebMoney and uses it to pay.
As soon as the server is available, a qualified server admin connects to the new server with SSH, via a chain of servers in various countries. Today complete partitions are encrypted with TrueCrypt and all of the operating-system logs are turned off. Because servers in Germany are considered very reliable, fast, and inexpensive, these are usually configured as HIDDEN CONTENT SERVERS. In other words, all the illegal files (pictures, videos, etc.) are uploaded to these servers, naturally via various proxies (and since you are still wondering what these proxies can be, I’ll explain that later). These servers are completely sealed off with firewalls and made inaccessible except to a few servers all over the world, so-called PROXY SERVERS or FORWARD SERVERS. If the server is shut down or someone logs in from the console, the TrueCrypt partition is unmounted. Just as on the content servers, logs are turned off and TrueCrypt is installed on these proxy or forward servers. The Russians have developed very clever software that can be used as a proxy server (in addition to the possibilities of SSL tunneling and IP forwarding). These proxy servers accept incoming connections from the retail customers and route them to the content servers in Germany, COMPLETELY ANONYMOUSLY AND UNIDENTIFIABLY. The communication link can even be configured to be encrypted. Result: the server in Germany ATTRACTS NO ATTENTION AND STAYS COMPLETELY ANONYMOUS, because its IP is used by no one except the proxy server, which uses it to route the traffic back and forth through a tunnel, using technology similar to that of large enterprise VPNs. I stress that these proxy servers are everywhere in the world, only consume a lot of traffic, have no special demands, and above all are completely empty.
Networks of servers around the world are also used at the DNS level. The DNS setup has several special features: the entries have a short TTL (Time To Live) of approximately 10 minutes, usually carry multiple IP entries served round-robin, and rotate each visitor to one of the forward proxy servers. But what is really special are the different DNS zones linked with extensive GeoIP databases. By the way, there are pedophiles inside authorities and hosting providers, giving the Russian server administrators access to valuable information about IP blocks, etc., that can be used in conjunction with the DNS. Anyone with a little technical knowledge will understand the importance and implications of this. But what I have to report to you is much more significant than this, and maybe they will finally understand to what extent the public is cheated by the greedy politicians who CANNOT DO ANYTHING against child pornography but use it as a means to justify total monitoring.
TrapCall instructs new customers to reprogram their cellphones to send all rejected, missed and unanswered calls to TrapCall’s own toll-free number. If the user sees an incoming call with Caller ID blocked, he just presses the button on the phone that would normally send it to voicemail. The call invisibly loops through TelTech’s system, then back to the user’s phone, this time with the caller’s number displayed as the Caller ID.
In addition to the free service, branded Fly Trap, a $10-per-month upgrade called Mouse Trap provides human-created transcripts of voicemail messages, and in some cases uses text messaging to send you the name of the caller — information not normally available to wireless customers. Mouse Trap will also send you text messages with the numbers of people who call while your phone is powered off, even if they don’t leave a message.
With the $25-a-month Bear Trap upgrade, you can also automatically record your incoming calls, and get text messages with the billing name and street address of some of your callers, which TelTech says is derived from commercial databases.
Definitely strange bedfellows:
A United Nations agency is quietly drafting technical standards, proposed by the Chinese government, to define methods of tracing the original source of Internet communications and potentially curbing the ability of users to remain anonymous.
The U.S. National Security Agency is also participating in the “IP Traceback” drafting group, named Q6/17, which is meeting next week in Geneva to work on the traceback proposal. Members of Q6/17 have declined to release key documents, and meetings are closed to the public.
A second, apparently leaked ITU document offers surveillance and monitoring justifications that seem well-suited to repressive regimes:
A political opponent of a government publishes articles putting the government in an unfavorable light. The government, having a law against any opposition, tries to identify the source of the negative articles, but because the articles were published via a proxy server that protects the anonymity of the author, it is unable to do so.
This is being sold as a way to go after the bad guys, but it won’t help. Here’s Steve Bellovin on that issue:
First, very few attacks these days use spoofed source addresses; the real IP address already tells you where the attack is coming from. Second, in case of a DDoS attack, there are too many sources; you can’t do anything with the information. Third, the machine attacking you is almost certainly someone else’s hacked machine and tracking them down (and getting them to clean it up) is itself time-consuming.
TraceBack is most useful in monitoring the activities of large masses of people. But of course, that’s why the Chinese and the NSA are so interested in this proposal in the first place.
It’s hard to figure out what the endgame is; the U.N. doesn’t have the authority to impose Internet standards on anyone. In any case, this idea is counter to the U.N. Universal Declaration of Human Rights, Article 19: “Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.” In the U.S., it’s counter to the First Amendment, which has long permitted anonymous speech. On the other hand, basic human and constitutional rights have been jettisoned left and right in the years after 9/11; why should this be any different?
But when the Chinese government and the NSA get together to enhance their ability to spy on us all, you have to wonder what’s gone wrong with the world.
Last year, Netflix published 10 million movie rankings by 500,000 customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders.
Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.
They did not reverse the anonymity of the entire Netflix dataset. What they did was reverse the anonymity of the Netflix dataset for those sampled users who also entered some movie rankings, under their own names, in the IMDb. (While IMDb’s records are public, crawling the site to get them is against IMDb’s terms of service, so the researchers used a representative few to prove their algorithm.)
The point of the research was to demonstrate how little information is required to de-anonymize information in the Netflix dataset.
On one hand, isn’t that sort of obvious? The risks of anonymous databases have been written about before, such as in this 2001 paper published in an IEEE journal. The researchers working with the anonymous Netflix data didn’t painstakingly figure out people’s identities — as others did with the AOL search database last year — they just compared it with an already identified subset of similar data: a standard data-mining technique.
But as opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.
Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.
Google, with its database of users’ internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.
What the University of Texas researchers demonstrate is that this process isn’t hard, and doesn’t require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.
The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan and Shmatikov’s de-anonymization algorithm is surprisingly robust: it works with partial data, with data that has been perturbed, and even with data that contains errors.
With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.
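The linkage itself is mundane. Here is a minimal sketch with invented data (the records, names, and thresholds are hypothetical; the real algorithm uses a more careful statistical score): an anonymized record is matched to whichever public profile shares the most (movie, similar rating, nearby date) triples.

```python
from datetime import date

# Toy sketch of fuzzy record linkage (hypothetical data): two ratings
# match when the movie agrees, the stars are within 1, and the dates
# fall within a 14-day window.

anon_record = [  # (movie, stars, date) from the "anonymous" dataset
    ("Brazil", 5, date(2005, 3, 1)),
    ("Alien", 4, date(2005, 3, 9)),
    ("Heat", 2, date(2005, 4, 2)),
]
public_profiles = {  # IMDb-style reviews posted under real names
    "alice": [("Brazil", 5, date(2005, 3, 3)), ("Alien", 4, date(2005, 3, 10))],
    "bob":   [("Brazil", 1, date(2004, 1, 1)), ("Speed", 3, date(2005, 2, 2))],
}

def similarity(record, profile, window_days=14):
    """Count of fuzzily matching ratings between the two records."""
    score = 0
    for movie, stars, when in record:
        for m, s, w in profile:
            if m == movie and abs(s - stars) <= 1 and abs((w - when).days) <= window_days:
                score += 1
    return score

best = max(public_profiles,
           key=lambda name: similarity(anon_record, public_profiles[name]))
print(best)  # 'alice'
```

Note that deliberate noise in the anonymized data barely matters here: a wrong rating or a shifted date just lowers one match's score slightly, which is why perturbation is such a weak defense.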
Other research reaches the same conclusion. Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. “In general,” she wrote, “few characteristics are needed to uniquely identify a person.”
Stanford University researchers reported similar results using 2000 census data. It turns out that date of birth, which (unlike birthday month and day alone) sorts people into thousands of different buckets, is incredibly valuable in disambiguating people.
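Sweeney's measurement amounts to counting collisions on the quasi-identifier. A toy sketch with invented records: group people by (ZIP, gender, date of birth) and count how many combinations occur exactly once.

```python
from collections import Counter

# Toy sketch of quasi-identifier analysis (hypothetical records):
# a (ZIP, gender, birth date) combination shared by only one person
# uniquely identifies that person.

people = [
    ("02139", "F", "1971-07-04"),
    ("02139", "M", "1971-07-04"),
    ("02139", "F", "1965-12-01"),
    ("94110", "F", "1971-07-04"),
    ("94110", "F", "1971-07-04"),  # two people share this combination
]

counts = Counter(people)
unique = sum(1 for c in counts.values() if c == 1)
print(f"{unique} of {len(people)} uniquely identified")  # 3 of 5
```

Run against real census-scale data, this simple counting is exactly what produces the 87-percent figure: full date of birth alone splits the population into tens of thousands of buckets, and ZIP code and gender finish the job.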
This has profound implications for releasing anonymous data. On one hand, anonymous data is an enormous boon for researchers — AOL did a good thing when it released its anonymous dataset for research purposes, and it’s sad that the CTO resigned and an entire research team was fired after the public outcry. Large anonymous databases of medical data are enormously valuable to society: for large-scale pharmacology studies, long-term follow-up studies and so on. Even anonymous telephone data makes for fascinating research.
Like everything else in security, anonymity systems shouldn’t be fielded before being subjected to adversarial attacks. We all know that it’s folly to implement a cryptographic system before it’s rigorously attacked; why should we expect anonymity systems to be any different? And, like everything else in security, anonymity is a trade-off. There are benefits, and there are corresponding risks.
Narayanan and Shmatikov are currently working on developing algorithms and techniques that enable the secure release of anonymous datasets like Netflix’s. That’s a research result we can all benefit from.
This essay originally appeared on Wired.com.
Interesting. So often man-in-the-middle attacks are theoretical; it’s fascinating to see one in the wild.
(I’ve written about anonymity and the Tor network before.)
EDITED TO ADD (12/6): The guy claims that he just misconfigured his Tor node. I don’t know enough about Tor to have any comment about this.