The genetic data posted online seemed perfectly anonymous - strings of billions of DNA letters from more than 1,000 people. But all it took was some clever sleuthing on the Web for a genetics researcher to identify five people he randomly selected from the study group. Not only that, he found their entire families, even though the relatives had no part in the study —identifying nearly 50 people.
Other reports have identified people whose genetic data was online, but none had done so using such limited information: the long strings of DNA letters, an age and, because the study focused on only American subjects, a state.
Entries Tagged "de-anonymization"
Page 4 of 6
Abstract: Most of the voice over IP (VoIP) traffic is encrypted prior to its transmission over the Internet. This makes the identity tracing of perpetrators during forensic investigations a challenging task since conventional speaker recognition techniques are limited to unencrypted speech communications. In this paper, we propose techniques for speaker identification and verification from encrypted VoIP conversations. Our experimental results show that the proposed techniques can correctly identify the actual speaker for 70-75% of the time among a group of 10 potential suspects. We also achieve more than 10 fold improvement over random guessing in identifying a perpetrator in a group of 20 potential suspects. An equal error rate of 17% in case of speaker verification on the CSLU speaker recognition corpus is achieved.
The article is in the context of the big Facebook lawsuit, but the part about identifying people by their writing style is interesting:
Recently, a team of computer scientists at Concordia University in Montreal took advantage of an unusual set of data to test another method of determining e-mail authorship. In 2003, the Federal Energy Regulatory Commission, as part of its investigation into Enron, released into the public domain hundreds of thousands of employee e-mails, which have become an important resource for forensic research. (Unlike novels, newspapers or blogs, e-mails are a private form of communication and aren’t usually available as a sizable corpus for analysis.)
Using this data, Benjamin C. M. Fung, who specializes in data mining, and Mourad Debbabi, a cyber-forensics expert, collaborated on a program that can look at an anonymous e-mail message and predict who wrote it out of a pool of known authors, with an accuracy of 80 to 90 percent. (Ms. Chaski claims 95 percent accuracy with her syntactic method.) The team identifies bundles of linguistic features, hundreds in all. They catalog everything from the position of greetings and farewells in e-mails to the preference of a writer for using symbols (say, “$” or “%”) or words (“dollars” or “percent”). Combining all of those features, they contend, allows them to determine what they call a person’s “write-print.”
It seems reasonable that we have a linguistic fingerprint, although 1) there are far fewer of them than finger fingerprints, 2) they’re easier to fake. It’s probably not much of a stretch to take that software that “identifies bundles of linguistic features, hundreds in all” and use the data to automatically modify my writing to look like someone else’s.
I’ve been asked this question by countless reporters in the past couple of weeks. Here’s a good explanation. Shorter answer: it’s easy to spoof source destination, and it’s easy to hijack unsuspecting middlemen and use them as proxies.
No, mandating attribution won’t solve the problem. Any Internet design will necessarily include anonymity.
This is impressive, and scary:
Every computer connected to the web has an internet protocol (IP) address, but there is no simple way to map this to a physical location. The current best system can be out by as much as 35 kilometres.
Now, Yong Wang, a computer scientist at the University of Electronic Science and Technology of China in Chengdu, and colleagues at Northwestern University in Evanston, Illinois, have used businesses and universities as landmarks to achieve much higher accuracy.
These organisations often host their websites on servers kept on their premises, meaning the servers’ IP addresses are tied to their physical location. Wang’s team used Google Maps to find both the web and physical addresses of such organisations, providing them with around 76,000 landmarks. By comparison, most other geolocation methods only use a few hundred landmarks specifically set up for the purpose.
The new method zooms in through three stages to locate a target computer. The first stage measures the time it takes to send a data packet to the target and converts it into a distance—a common geolocation technique that narrows the target’s possible location to a radius of around 200 kilometres.
Wang and colleagues then send data packets to the known Google Maps landmark servers in this large area to find which routers they pass through. When a landmark machine and the target computer have shared a router, the researchers can compare how long a packet takes to reach each machine from the router; converted into an estimate of distance, this time difference narrows the search down further. “We shrink the size of the area where the target potentially is,” explains Wang.
Finally, they repeat the landmark search at this more fine-grained level: comparing delay times once more, they establish which landmark server is closest to the target. The result can never be entirely accurate, but it’s much better than trying to determine a location by converting the initial delay into a distance or the next best IP-based method. On average their method gets to within 690 metres of the target and can be as close as 100 metres—good enough to identify the target computer’s location to within a few streets.
Interesting research: “One Bad Apple Spoils the Bunch: Exploiting P2P Applications to Trace and Profile Tor Users“:
Abstract: Tor is a popular low-latency anonymity network. However, Tor does not protect against the exploitation of an insecure application to reveal the IP address of, or trace, a TCP stream. In addition, because of the linkability of Tor streams sent together over a single circuit, tracing one stream sent over a circuit traces them all. Surprisingly, it is unknown whether this linkability allows in practice to trace a significant number of streams originating from secure (i.e., proxied) applications. In this paper, we show that linkability allows us to trace 193% of additional streams, including 27% of HTTP streams possibly originating from “secure” browsers. In particular, we traced 9% of Tor streams carried by our instrumented exit nodes. Using BitTorrent as the insecure application, we design two attacks tracing BitTorrent users on Tor. We run these attacks in the wild for 23 days and reveal 10,000 IP addresses of Tor users. Using these IP addresses, we then profile not only the BitTorrent downloads but also the websites visited per country of origin of Tor users. We show that BitTorrent users on Tor are over-represented in some countries as compared to BitTorrent users outside of Tor. By analyzing the type of content downloaded, we then explain the observed behaviors by the higher concentration of pornographic content downloaded at the scale of a country. Finally, we present results suggesting the existence of an underground BitTorrent ecosystem on Tor.
Interesting research. It only works when there’s a limited number of potential authors:
To test the accuracy of their technique, Fung and his colleagues examined the Enron Email Dataset, a collection which contains over 200,000 real-life emails from 158 employees of the Enron Corporation. Using a sample of 10 emails written by each of 10 subjects (100 emails in all), they were able to identify authorship with an accuracy of 80% to 90%.
I didn’t write about the recent security breach that disclosed tens of thousands of e-mail addresses and ICC-IDs of iPad users because, well, there was nothing terribly interesting about it. It was yet another web security breach.
Right after the incident, though, I was being interviewed by a reporter that wanted to know what the ramifications of the breach were. He specifically wanted to know if anything could be done with those ICC-IDs, and if the disclosure of that information was worse than people thought. He didn’t like the answer I gave him, which is that no one knows yet: that it’s too early to know the full effects of that information disclosure, and that both the good guys and the bad guys would be figuring it out in the coming weeks. And, that it’s likely that there were further security implications of the breach.
Seems like there were:
The problem is that ICC-IDs—unique serial numbers that identify each SIM card—can often be converted into IMSIs. While the ICC-ID is nonsecret—it’s often found printed on the boxes of cellphone/SIM bundles—the IMSI is somewhat secret. In theory, knowing an ICC-ID shouldn’t be enough to determine an IMSI. The phone companies do need to know which IMSI corresponds to which ICC-ID, but this should be done by looking up the values in a big database.
In practice, however, many phone companies simply calculate the IMSI from the ICC-ID. This calculation is often very simple indeed, being little more complex than “combine this hard-coded value with the last nine digits of the ICC-ID.” So while the leakage of AT&T’s customers’ ICC-IDs should be harmless, in practice, it could reveal a secret ID.
What can be done with that secret ID? Quite a lot, it turns out. The IMSI is sent by the phone to the network when first signing on to the network; it’s used by the network to figure out which call should be routed where. With someone else’s IMSI, an attacker can determine the person’s name and phone number, and even track his or her position. It also opens the door to active attacks—creating fake cell towers that a victim’s phone will connect to, enabling every call and text message to be eavesdropped.
More to come, I’m sure.
And that’s really the point: we all want to know—right away—the effects of a security vulnerability, but often we don’t and can’t. It takes time before the full effects are known, sometimes a lot of time.
And in related news, the image redaction that went along with some of the breach reporting wasn’t very good.
Interesting paper: “A Practical Attack to De-Anonymize Social Network Users.”
Abstract. Social networking sites such as Facebook, LinkedIn, and Xing have been reporting exponential growth rates. These sites have millions of registered users, and they are interesting from a security and privacy point of view because they store large amounts of sensitive personal user data.
In this paper, we introduce a novel de-anonymization attack that exploits group membership information that is available on social networking sites. More precisely, we show that information about the group memberships of a user (i.e., the groups of a social network to which a user belongs) is often sufficient to uniquely identify this user, or, at least, to significantly reduce the set of possible candidates. To determine the group membership of a user, we leverage well-known web browser history stealing attacks. Thus, whenever a social network user visits a malicious website, this website can launch our de-anonymization attack and learn the identity of its visitors.
The implications of our attack are manifold, since it requires a low effort and has the potential to affect millions of social networking users. We perform both a theoretical analysis and empirical measurements to demonstrate the feasibility of our attack against Xing, a medium-sized social network with more than eight million members that is mainly used for business relationships. Our analysis suggests that about 42% of the users that use groups can be uniquely identified, while for 90%, we can reduce the candidate set to less than 2,912 persons. Furthermore, we explored other, larger social networks and performed experiments that suggest that users of Facebook and LinkedIn are equally vulnerable (although attacks would require more resources on the side of the attacker). An analysis of an additional five social networks indicates that they are also prone to our attack.
Sidebar photo of Bruce Schneier by Joe MacInnis.