Fascinating research on de-anonymizing code—from either source code or compiled code:
Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt’s former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, are not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they’ve conducted using machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in a plagiarism dispute, for instance, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.
Posted on August 13, 2018 at 4:02 PM •
Interesting research: “De-anonymizing Web Browsing Data with Social Networks“:
Abstract: Can online trackers and network adversaries de-anonymize web browsing data readily available to them? We show—theoretically, via simulation, and through experiments on real user data—that de-identified web browsing histories can be linked to social media profiles using only publicly available data. Our approach is based on a simple observation: each person has a distinctive social network, and thus the set of links appearing in one’s feed is unique. Assuming users visit links in their feed with higher probability than a random user, browsing histories contain tell-tale marks of identity. We formalize this intuition by specifying a model of web browsing behavior and then deriving the maximum likelihood estimate of a user’s social profile. We evaluate this strategy on simulated browsing histories, and show that given a history with 30 links originating from Twitter, we can deduce the corresponding Twitter profile more than 50% of the time. To gauge the real-world effectiveness of this approach, we recruited nearly 400 people to donate their web browsing histories, and we were able to correctly identify more than 70% of them. We further show that several online trackers are embedded on sufficiently many websites to carry out this attack with high accuracy. Our theoretical contribution applies to any type of transactional data and is robust to noisy observations, generalizing a wide range of previous de-anonymization attacks. Finally, since our attack attempts to find the correct Twitter profile out of over 300 million candidates, it is—to our knowledge—the largest scale demonstrated de-anonymization to date.
Posted on February 10, 2017 at 8:25 AM •
In this article, detailing the Australian and then worldwide investigation of a particularly heinous child-abuse ring, there are a lot of details of the pedophile security practices and the police investigative techniques. The abusers had a detailed manual on how to scrub metadata and avoid detection, but not everyone was perfect. The police used information from a single camera to narrow down the suspects. They also tracked a particular phrase one person used to find him.
This story shows an increasing sophistication of the police using small technical clues combined with standard detective work to investigate crimes on the Internet. A highly painful read, but interesting nonetheless.
Posted on August 24, 2016 at 9:30 AM •
Here’s the story of how it was done. First, a fake ad on torrent listings linked the site to a Latvian bank account, an e-mail address, and a Facebook page.
Using basic website-tracking services, Der-Yeghiayan was able to uncover (via a reverse DNS search) the hosts of seven apparent KAT website domains: kickasstorrents.com, kat.cr, kickass.to, kat.ph, kastatic.com, thekat.tv and kickass.cr. This dug up two Chicago IP addresses, which were used as KAT name servers for more than four years. Agents were then able to legally gain a copy of the server’s access logs (explaining why it was federal authorities in Chicago that eventually charged Vaulin with his alleged crimes).
Using similar tools, Homeland Security investigators also performed something called a WHOIS lookup on a domain that redirected people to the main KAT site. A WHOIS search can provide the name, address, email and phone number of a website registrant. In the case of kickasstorrents.biz, that was Artem Vaulin from Kharkiv, Ukraine.
Der-Yeghiayan was able to link the email address found in the WHOIS lookup to an Apple email address that Vaulin purportedly used to operate KAT. It’s this Apple account that appears to tie all of pieces of Vaulin’s alleged involvement together.
On July 31st 2015, records provided by Apple show that the me.com account was used to purchase something on iTunes. The logs show that the same IP address was used on the same day to access the KAT Facebook page. After KAT began accepting Bitcoin donations in 2012, $72,767 was moved into a Coinbase account in Vaulin’s name. That Bitcoin wallet was registered with the same me.com email address.
Posted on July 26, 2016 at 6:42 AM •
Interesting paper: “Anonymization and Risk,” by Ira S. Rubinstein and Woodrow Hartzog:
Abstract: Perfect anonymization of data sets has failed. But the process of protecting data subjects in shared information remains integral to privacy practice and policy. While the deidentification debate has been vigorous and productive, there is no clear direction for policy. As a result, the law has been slow to adapt a holistic approach to protecting data subjects when data sets are released to others. Currently, the law is focused on whether an individual can be identified within a given set. We argue that the better locus of data release policy is on the process of minimizing the risk of reidentification and sensitive attribute disclosure. Process-based data release policy, which resembles the law of data security, will help us move past the limitations of focusing on whether data sets have been “anonymized.” It draws upon different tactics to protect the privacy of data subjects, including accurate deidentification rhetoric, contracts prohibiting reidentification and sensitive attribute disclosure, data enclaves, and query-based strategies to match required protections with the level of risk. By focusing on process, data release policy can better balance privacy and utility where nearly all data exchanges carry some risk.
Posted on July 11, 2016 at 6:31 AM •
This research shows how to track e-commerce users better across multiple sessions, even when they do not provide unique identifiers such as user IDs or cookies.
Abstract: Targeting individual consumers has become a hallmark of direct and digital marketing, particularly as it has become easier to identify customers as they interact repeatedly with a company. However, across a wide variety of contexts and tracking technologies, companies find that customers can not be consistently identified which leads to a substantial fraction of anonymous visits in any CRM database. We develop a Bayesian imputation approach that allows us to probabilistically assign anonymous sessions to users, while ac- counting for a customer’s demographic information, frequency of interaction with the firm, and activities the customer engages in. Our approach simultaneously estimates a hierarchical model of customer behavior while probabilistically imputing which customers made the anonymous visits. We present both synthetic and real data studies that demonstrate our approach makes more accurate inference about individual customers’ preferences and responsiveness to marketing, relative to common approaches to anonymous visits: nearest- neighbor matching or ignoring the anonymous visits. We show how companies who use the proposed method will be better able to target individual customers, as well as infer how many of the anonymous visits are made by new customers.
Posted on February 5, 2016 at 6:56 AM •
Interesting blog post:
We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such a high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset. The abovementioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.
Here’s the paper.
And here’s their previous paper, de-anonymizing programmers from their source code.
Posted on January 4, 2016 at 7:41 AM •
There’s pretty strong evidence that the team of researchers from Carnegie Mellon University who cancelled their scheduled 2015 Black Hat talk deanonymized Tor users for the FBI.
Details are in this Vice story and this Wired story (and these two follow-on Vice stories). And here’s the reaction from the Tor Project.
Nicholas Weaver guessed this back in January.
The behavior of the researchers is reprehensible, but the real issue is that CERT Coordination Center (CERT/CC) has lost its credibility as an honest broker. The researchers discovered this vulnerability and submitted it to CERT. Neither the researchers nor CERT disclosed this vulnerability to the Tor Project. Instead, the researchers apparently used this vulnerability to deanonymize a large number of hidden service visitors and provide the information to the FBI.
Does anyone still trust CERT to behave in the Internet’s best interests?
EDITED TO ADD (12/14): I was wrong. CERT did disclose to Tor.
Posted on November 16, 2015 at 6:19 AM •
Those of you unfamiliar with hacker culture might need an explanation of “doxing.”
The word refers to the practice of publishing personal information about people without their consent. Usually it’s things like an address and phone number, but it can also be credit card details, medical information, private e-mails—pretty much anything an assailant can get his hands on.
Doxing is not new; the term dates back to 2001 and the hacker group Anonymous. But it can be incredibly offensive. In 2014, several women were doxed by male gamers trying to intimidate them into keeping silent about sexism in computer games.
Companies can be doxed, too. In 2011, Anonymous doxed the technology firm HBGary Federal. In the past few weeks we’ve witnessed the ongoing doxing of Sony.
Everyone from political activists to hackers to government leaders has now learned how effective this attack is. Everyone from common individuals to corporate executives to government leaders now fears this will happen to them. And I believe this will change how we think about computing and the Internet.
This essay previously appeared on BetaBoston, who asked about a trend for 2015.
EDITED TO ADD (1/3): Slashdot thread.
Posted on January 2, 2015 at 7:21 AM •
Kevin Poulson has a good article up on Wired about how the FBI used a Metasploit variant to identify Tor users.
Posted on December 17, 2014 at 6:44 AM •
Sidebar photo of Bruce Schneier by Joe MacInnis.