De-Anonymizing Users from their Coding Styles

Interesting blog post:

We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such a high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset. The abovementioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.

Here’s the paper.

And here’s their previous paper, de-anonymizing programmers from their source code.
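
For readers curious about the mechanics: the paper extracts features from the disassembly and decompiled code of each binary and feeds them to a random-forest classifier. The sketch below is not the authors' pipeline; the feature vectors are random placeholders, but it shows the general shape of such a multi-class attribution experiment with scikit-learn, and swapping real features in place of the noise is roughly where the 96%-of-20 and 52%-of-600 numbers come from.

```python
# Minimal sketch of a binary-stylometry attribution experiment.
# NOT the paper's code: the feature matrix here is random noise standing in
# for real features (instruction n-grams, CFG statistics, decompiler output).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

n_programmers, binaries_each, n_features = 20, 8, 50

rng = np.random.default_rng(0)
X = rng.normal(size=(n_programmers * binaries_each, n_features))  # placeholder features
y = np.repeat(np.arange(n_programmers), binaries_each)            # author labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=4)  # hold out binaries for evaluation
print("accuracy on placeholder features: %.2f (chance is %.2f)"
      % (scores.mean(), 1.0 / n_programmers))
```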

Posted on January 4, 2016 at 7:41 AM • 38 Comments

Comments

Martin January 4, 2016 8:28 AM

Should we foresee a market for tools to spaghettify code, or to transform it into the style of well-known programmers? But that would be tricky, since the analysis presumably looks at things like choice of datatypes, indexing methods, and length of modules, as well as control flow.

Bob Paddock January 4, 2016 9:29 AM

Gather random code from GitHub and BitBucket from random authors, turn off Link Time Optimization so dead code is not stripped, and compile into an executable. How does the methodology fare then?

John January 4, 2016 9:35 AM

Maybe the code would be more anonymous if programmers were educated. Then they’d all be using comments, tabbed indents, inline braces, and Hungarian notation.

(kidding!!)

Uhu January 4, 2016 10:00 AM

@Bob Paddock:
This seems to be an obscurity measure rather than a security measure. It would be trivial to analyze the binary and remove dead code prior to fingerprinting. OK, you could link from random points in your code to these libraries with conditions that are never true but that the compiler cannot recognize as such. Then again, the fingerprinting would simply have to profile a running instance and then only fingerprint the active code. If you were to actually call some of the overhead code, you would seriously slow down your program, or profiling could still be used to target only the most-used code (variable threshold).

@John:
The joke is probably based on the fact that even the simplest compiler will remove any information on indents (except for languages that require it semantically), comments and in-line braces. Hungarian notation might actually be part of the fingerprint (how many spelling mistakes were made? preferences for certain words?). This could be eliminated somewhat with code obfuscation, but it would still be visible in public interfaces (then we know who designed the interface, not who did the coding…)
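
Uhu's first point above (fingerprint only the code that actually executes) is easy to picture with a dynamic-coverage sketch. The example below is a loose analogy in Python rather than binary instrumentation, and, as jordan notes further down, its usefulness depends entirely on having representative inputs.

```python
# Loose analogy to "profile a running instance, then fingerprint only the
# active code": trace which lines execute for a given input and ignore the rest.
import sys

active_lines = set()

def tracer(frame, event, arg):
    if event == "line":
        active_lines.add((frame.f_code.co_name, frame.f_lineno))
    return tracer

def program_under_test(x):
    if x > 0:
        return x * 2   # runs for this input
    return -x          # never recorded unless some input drives execution here

sys.settrace(tracer)
program_under_test(5)
sys.settrace(None)

print("lines observed:", sorted(active_lines))
# Only the executed branch shows up; dead (or merely untested) code is invisible,
# which is exactly why choosing the inputs is the hard part.
```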

Anonymous Cow January 4, 2016 10:21 AM

Would be interesting to see how well the techniques in the paper (which I haven’t read) work against current commercial obfuscation tools intended to frustrate reverse engineering and provide software tamper resistance.

Anyone familiar with the subject care to comment?

TimH January 4, 2016 11:36 AM

stripping and removing symbol information from the executable binaries

Colour me confused as an olde assembler guy. Why do the compilers leave this cruft in anything but debug compiles?
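
To TimH's question: the symbol table is kept by default because it is cheap and useful for backtraces and debugging tools; -g debug info is a separate thing, and the function names only go away when the binary is stripped. A quick way to see this, assuming gcc and binutils are installed:

```python
# Small demonstration (assumes gcc, nm, and strip are installed): a normal
# -O2 build still carries function names until the binary is stripped.
import os
import subprocess
import tempfile

src = "int helper(int x) { return x * 3; }\nint main(void) { return helper(2); }\n"

with tempfile.TemporaryDirectory() as d:
    c_file = os.path.join(d, "demo.c")
    exe = os.path.join(d, "demo")
    with open(c_file, "w") as f:
        f.write(src)

    subprocess.run(["gcc", "-O2", c_file, "-o", exe], check=True)
    before = subprocess.run(["nm", exe], capture_output=True, text=True).stdout
    print("'helper' in symbol table before strip:", "helper" in before)

    subprocess.run(["strip", exe], check=True)
    after = subprocess.run(["nm", exe], capture_output=True, text=True).stdout
    print("symbol table after strip:", after.strip() or "(empty)")
```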

jordan January 4, 2016 11:40 AM

@uhu

the fingerprinting would simply have to profile a running instance and then only fingerprint the active code

We appear to have quite different understandings of “simply” and “trivial”.

For a start, where do you get a use-case-complete test process? If you don’t know what values might get passed in, how can you know which code paths are relevant? The Underhanded C Contest says you can’t do this just by looking at the code.

r January 4, 2016 12:01 PM

we’ve seen things along these lines before – remember kaspersky crowdsourcing identification of the duqu ‘language’?
more often the simplistic version of this is identifying the primary language of a coder; all of these are factors in quantifying your target.

@jordanh,

function determines structure, the underhanded c contests wouldn’t provide a large enough data set – although they might provide technologies unique to an individual on a small scale to be included later in larger sets. i’d think that contest itself is too small, too specific and too optimized to provide unique identifiers alone – outside of (traditional, think tor) identity correlation attacks.

@TimH,

in the case of MSVC/MASM, i believe they’re module hints – it’s not just the compiler that leaves them, it’s more specifically the linker.
compilers make decisions too, and the people who built the compiler made decisions; one can readily tell which compiler and linker you’ve used.

I can readily tell if you used borland, clang or gcc.

@anonymous cow, all

don’t expect a polymorphic wrapper or anti-debugging/tracing tricks to stop something like this; one should be able to filter/strip those ezpk style. only fully fledged meta/oligomorphism should work. keep your footprint down and link against others’ code. think of this as a unique per-coder set of heuristics. the tools exist for attacking this type of datamining – as pointed out above. the biggest key to fighting this is the awareness and education that one’s design decisions pass through the compilation phase – be careful with commercial tools as they could introduce steganographic identifiers such as /customer_id/ directly into the resultant output – be careful with anything specifically claiming to mitigate this in the future – even your choice of solutions to prevent this may amplify any resultant signal… one may even be able to identify more subtle things given a large enough dataset – for instance everyone in a particular class may have picked up very specific unique programming habits from a known & specific professor.

if you’re the type of person to write proper error handlers, expect your code to stand out. if you optimize specific things manually, expect to stand out. if you use specific technologies, you’re going to stand out. if you choose certain interfaces over others: you’re going to stand out. these attacks WILL ONLY get better.

to the extreme, look at things like the https://github.com/hcrypt-project.

as i said (3x), i believe the tools for mitigating this already exist. i don’t want to spell it out for people, as the biggest thing is education and awareness. it’s funny, after the talk on the squid thread, how this has direct bearing on those formally trained vs those informally trained. how many people from the home+schooled r/e field noticed this long ago? (i could be wrong)

be safe, these ‘fingerprints’ are a function of the ‘footprint[s]’ you leave.
truecrypt, bitcoin, whonix, i2p.

please, don’t think this applies only to malware: this is a weapon against public speech and dissent also.

SteveInMA January 4, 2016 12:36 PM

They were using stylometry back in the 1970s. See “Literary detection: how to prove authorship and fraud in literature and documents” by Andrew Morton from 1978. The canonical examples were trying to sort out authorship of the disputed Federalist Papers or the Pauline Epistles. The techniques rely on unconscious writing habits such as the ratio of big words to little words, sentence length, enclitic count (in Greek), and the frequency of all the little words we never think about while writing. The most impressive thing about the methodology was how well it resisted attempts to disguise writing style.
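
For anyone who hasn’t seen classical stylometry, the core trick is simple to sketch. The toy below is not Morton’s actual method (his tests were statistical rather than a nearest-neighbour comparison), and the author names and texts are placeholders:

```python
# Toy illustration of function-word stylometry: profile texts by how often
# they use common "little words", then compare profiles. Real methods use
# proper statistics (chi-squared tests, etc.); this is only the flavour.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "for"]

def profile(text):
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical inputs: samples of known authorship and one disputed text.
known = {"Hamilton": "sample text by the first candidate author ...",
         "Madison": "sample text by the second candidate author ..."}
disputed = "the disputed text goes here ..."

d = profile(disputed)
scores = {name: distance(profile(sample), d) for name, sample in known.items()}
print("closest stylistic match:", min(scores, key=scores.get))
```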

r January 4, 2016 1:52 PM

The comments on the linked blog indicate an already-developed security industry centered around the preprocessor angle of this; post-processing, as is the case with virtual machines – even strong ones like hcrypt – may still emit potentially vulnerable output. I firmly believe this is much akin to heuristics analysis. Clive and Doug’s comments further indicate this is hardly limited to x86; in the embedded field the java vm may lend itself as a weapon against this to a point – again still requiring both pre-processor and post-processor manipulation.

Anura January 4, 2016 4:56 PM

Clearly we need to start teaching programmers how to be less consistent in coding style and to copy and paste more code examples from the internet.

r January 4, 2016 5:33 PM

“The abovementioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style.”

actually, interestingly enough… -Os wasn’t tested from what i’ve seen: only -O3 and -O0.
the gcc manual says -O0 is the default behaviour, in case anyone is curious.

i saw a different set of reversers the other day talking about how they’d never seen register based calling conventions on x86/64. i’m assuming they are familiar with cdecl but not fastcall.
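
A cheap way to get a feel for how much -O3 reshapes output relative to -O0 is to compile the same source at both levels and compare simple statistics of the disassembly; -Os can be dropped into the same loop. This is only a rough illustration (it counts opcodes over the whole binary, startup code included) and assumes gcc and objdump are available:

```python
# Rough illustration: compile one source file at -O0 and -O3 and compare
# opcode histograms extracted from objdump output.
import os
import subprocess
import tempfile
from collections import Counter

src = """#include <stdio.h>
int main(void) {
    long s = 0;
    for (long i = 0; i < 1000; i++) s += i * i;
    printf("%ld\\n", s);
    return 0;
}
"""

def opcode_histogram(path):
    out = subprocess.run(["objdump", "-d", path],
                         capture_output=True, text=True, check=True).stdout
    hist = Counter()
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) >= 3 and parts[2].strip():   # address, raw bytes, mnemonic + operands
            hist[parts[2].split()[0]] += 1
    return hist

with tempfile.TemporaryDirectory() as d:
    c_file = os.path.join(d, "t.c")
    with open(c_file, "w") as f:
        f.write(src)
    for level in ("-O0", "-O3"):
        exe = os.path.join(d, "t" + level)
        subprocess.run(["gcc", level, c_file, "-o", exe], check=True)
        hist = opcode_histogram(exe)
        print(level, "instructions:", sum(hist.values()),
              "distinct opcodes:", len(hist))
```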

Anura January 4, 2016 5:46 PM

@r

Yeah, it makes sense that the more directly the produced binary represents the written code, the easier it will be to distinguish who wrote it. To that end, I wonder how well this would work with a language like Haskell, which uses lazy evaluation. I’d imagine it would result in much more normalization between what you write and what the compiler outputs.

r January 4, 2016 6:13 PM

@anura,

specifically, i’m unfamiliar with haskell – but i am looking at the wiki and you may be effectively right: in the implementations list it specifically says that the glasgow haskell implementation can output to c directly as an intermediary for compilation.

r January 4, 2016 7:37 PM

@anura, all

upon installing ghc (the glasgow haskell compiler) in debian, i find that its backend requires llvm – which provides further access to potential scrubbing/transforms provided by llvm’s code-based jit vm’s.

the reference to state machines and jit vm’s via clang/llvm is in the comment on the linked blog by “Avraham Bernstein”.

the code-restructuring of true meta and oligomorphic softwares (quality ones) is largely based on the concept of virtualized basic blocks for a non-existent or simplified language/cpu [think java] – the ones from the //OLD// h/v/a/c scene are largely not preprocessors per se so much as they are [or were] functional basic block detection routines with a cheatsheet built in for basic block replacement/translation and re-ordering, intended to defeat both heuristics and static (signature based) analysis. i believe they all worked at assembly level, however (this is why i use -E and -S in my builds /w/ clang & gcc); i make use of perl based transforms at those output levels myself. using technology like this may finger one just through the knowledge, possession, use and understanding of it. this is a large part of the reason i suspect vxheavens was brought to its knees after being permitted to exist for as long as it had – i believe it was an intentional squelch on relevant data by a nation state. to me, nobody in their right mind would believe a site that had been ignored for 10+ years would suffer the fate that it did had the IC not internally been beefing up their understanding and acquisition of both code and coders.

HOWEVER, this very type of technology is quite literally dual-use and can be used in reverse in a good engine to break everything compiled back down into reasonable basic blocks like a ‘decompiler’, and may still be readily available to these types of authorship mining attacks. if i had the money i would supplement their results with these concepts in the cloud and check against a couple of these other options/armors myself, but alas i am largely financially insecure and do not wish to be viewed as an enabler and/or a manufacturer of ‘hacktools’ within my current regime now or ever – i love my country, i just don’t always agree with it.

MORE importantly, to let the cat out of the bag: such jvm style meta/oligo/homomorphic engines hold the key to software piracy. i’ve waited a long time to see per-download uniquely signed binaries (think drm) with dynamically [re]generated code (try to statically patch that; even dynamic patching mostly chokes – making it far beyond the range of run-of-the-mill cracks [and crackers]), but so far the only thing even remotely similar i’ve seen from my largely uninteresting retired ventures are the binaries from the most recent NSA Challenge. it just so happens that software drm is going the opposite direction -> the cloud.

that’s just my view on the subject, it’s open for review and speculation, but my thought process on this has led to a paralyzing and paranoid view of the IC and big data. i’m getting old and i’d love to get back into school, but i have a considerable amount on my plate financially and medically at this point in my life. I would love nothing more than to have a reliable job and be doing something i feel good about, instead of the stress of less-than-minimum wage and ungodly hours, 60+ a week, learning various different trades every year.

As for haskell, i’ll check it out – thanks.

2 notes,
obfuscation defeats auditing; there is no place in the open source world for obfuscating the source of something as important as truecrypt or potentially bitcoin (if it truly is liberating), as it would readily defeat any viable chance at an audit imo.
furthermore, even an advanced encrypted vm like the ones in development by the hcrypt-project could potentially still leak structure and function, which as we know from these slides is their main source of mined correlation signals.

if i sound like a fool, just tell me. it’s no big deal.

Ray Dillinger January 4, 2016 8:29 PM

I’ve been able to do this with source code with eerie precision, for a long time.

I don’t even try, really. I just look at it. No matter how tight the in-house coding style is, or how slavishly people follow it, I can look at code and say, “That was Chris” or “That was Kate” or “That was Mike.” I’ve gotten the hang of this at four different jobs over the years, and I still can’t explain exactly how I do it.

Some subtle combination of variable names, preferred looping constructs, attention to efficiency vs. attention to readability, what gets done inside vs. outside of loops, what gets done in separate loops when it could be done together, technical expertise that shows in the code, etc… I just develop a feel for who wrote what, and I can recognize the “fist” of someone I know. I can’t really express what it is, but these are some of the things I look at before I get an impression of whose code it is.

The weird part is that it even works when people cross into new programming languages.

r January 4, 2016 9:22 PM

mozilla says “freedom-to-tinker.com” has a <!> certificate, anyone familiar with certificates wanna look at the credentials and say if that’s straight?

it’s missing data in 2 places, is that just a substandard-but-authentic issue or something?

r January 4, 2016 9:28 PM

nvm, i’m green – must have non-https elements, cuz the only diff between the cert here and theirs [other than the cert itself] is they’re at 4096 vs the 2048 here.

r January 4, 2016 9:41 PM

@ray, look at the comments by doug and clive on the squid post this weekend. you’re not the only one who really believes this is old news (no offense to anyone, considering this is final confirmation of a lot of our suspicions over the years). but they make some very interesting points; clive says his assembly habits haunt him [quite literally cross-platform in his case], as do mine and apparently doug’s too. i’ve readily believed this is directly related to language, education, and design decisions for a couple of years now… but something else that’s interesting about what you’re saying – it’s kind of related to the talk in the video so i’ll link the paper – is…

http://developers.slashdot.org/story/16/01/04/1637257/overcoming-intuition-in-programming

this article came out today too, and should have an indirect relation to some of the things talked about in the video and papers mr. schneier linked.

it’s about the complexity of solutions relating to the presentation of the problem itself.

Conclusion? January 5, 2016 4:32 AM

Can this then be used for something useful like identifying all the scum writing hacking code world-wide for future prosecutions?

If they are any good, presumably they have code in the public domain somewhere (or able to be reverse-engineered) with their name attributed against it.

Ditto 3-letter agency cyber-tards who breach hundreds of laws at a time?

If so, great, and quite ironic. Using their favorite fingerprinting methods against the heretofore anonymous crowd regularly role-playing ‘Lisbeth’ would be… absolutely delectable.

jordan January 5, 2016 5:18 AM

@r

My point about the Underhanded C Contest is that it’s not possible to comprehensively determine use cases just from examining code, which undermines @uhu’s statement that identifying relevant vs dead code would be “trivial” given a running instance.

I don’t think the UCC is relevant to the general question of fingerprinting code (because the control flow can be so fundamentally obfuscated / optimised) other than to demonstrate that it’s an arms race rather than an open/closed problem. Besides,

i believe the tools for mitigating this already exist

Yeah. This new fingerprinting might help you identify authors without, say, looking at the commit logs. I’m not sure it will help identify authors that have taken steps to remain unknown, ie, when it really matters.

@Conclusion?

Can this be used for .. identifying scum … for future prosecutions?

Maybe it forms part of a set of identification tools, but it’s not a silver bullet. Even then:

1/ burden of proof is more strict than “de-anonymize 600 programmers with 52% accuracy”

2/ the 3-letter-agencies based inside your jurisdiction have been granted immunity from the courts you’d have to use

3/ the 3-letter-agencies based outside your jurisdiction are outside your jurisdiction

Clive Robinson January 5, 2016 6:48 AM

Indirectly I’ve been thinking about this issue for some years. Not as programming languages, but the languages we speak and how they affect the way we think.

Look at it this way: if you speak two languages fairly fluently, do you actually think in your first language or the second? Is this affected by “source material” such as scientific papers, data sheets, etc.? It’s known that scientists who travel and do research in different cultures tend to take a more productive way of looking at things. Is this just through meeting more scientists, or through getting involved in more languages and cultures, thus rounding out your way of thinking and making more thought options available to you? I have noticed in the past that those who speak more than a couple of languages well are generally better at getting ideas across to others, especially when using an analogy.

Getting back to programming, I’ve been known to accuse “single language” programmers of being “code cutters”, in that they know how to get the best out of the language but much less so in other areas of computer science such as algorithms. I’ve also pointed out that programmers who use multiple languages tend to abstract out fundamentals and use those instead of inbuilt constructs. Great if you work close to the metal with RISC-type CPUs, much less so for CISC CPUs and a number of well-libraried languages (C++ and Python come to mind 🙂).

So a question arises: which is better, to get intimately acquainted with the libraries and what they offer – knowing it will in effect be non-transferable, thus wasted effort at some point in time – or to stick to a subset of basic general-purpose constructs and invest time in learning more esoteric but transferable skills such as algorithms, abstract data types and language-independent methodologies (i.e. functional -v- imperative -v- …) and where they are best applied?

I’m on record as saying the latter is what I tend to look for, however I suspect that many employers want “a cog” not an “engine” thus would prefer the former.

BoppingAround January 5, 2016 9:25 AM

Anura,

and paste more code examples from the internet.

Sarcasm, right? I presume this behaviour is already well present. Even here where I live — quite the backcountry — several people have bragged to me that their job is ‘essentially to cut and paste and rearrange code pieces from Stack Overflow and then make it work’.

r January 5, 2016 10:30 AM

@jordan,

their training set was limited to google-code-jam, scum aside.

truecrypt was a semi-massive effort. with respect to the remark about the initial satoshi commit: imagine the truecrypt developers are home-grown developers in the united states; the level of prowess involved, considering the successful audit of 7.1a, hints at quality formal training – and yet the researchers only used google-code-jam.

there are far larger ‘ground truth’ datasets available.

think college, think universities, think world wide.

Daniel January 5, 2016 10:39 AM

@ Clive Robinson

Most of the cog work is outsourced to folks halfway around the globe who can do it for pennies on the dollar, so I think most people have moved on to open source projects or to making money off other programmers/coders, because there aren’t many transferable skills these days. Everything is ephemeral.

David Leppik January 5, 2016 11:32 AM

@Clive Robinson

I remember when I was told I needed to learn Windows programming; I couldn’t build a career as a general-purpose programmer without it. That was 15 years ago, and it’s looking less important every day. Point being, if you are an expert in something useful, you’ll probably be able to get a job — so long as that something doesn’t suddenly become useless and leave you with skills that aren’t transferable. (Adobe Flash programmers had a very rude awakening, whereas Visual Basic developers picked up C# pretty fast.)

But it’s also hard to be a good developer if you only use one language. Nothing’s done more to improve my Java skills than learning Scala. Knowing how things are done differently in other languages brings your main language’s trade-offs into sharper focus. Of course, for that to work, you need to learn languages that actually make different, thoughtful trade-offs.

Thomas_H January 5, 2016 10:35 PM

Really, the best comment is the one by SteveInMA, and it needs some promotion. This stuff doesn’t only apply to coding; it applies to any written text (the bit I wrote on the Squid post is actually not based on coding I saw; it’s based on what I learned from reading approx. 20,000 pages of scientific descriptions in depth over the past few years, applied to computer code). Improperly done obfuscation is not going to solve much, nor is code-stealing for that matter – heck, I bet different programmers will combine stolen bits of code differently, and therefore will be individually identifiable.

Re: Clive’s question:
I personally think the second approach is the correct one, for any field really. Generally applicable methods are like tools, you can use them in multiple situations that may be very different. However, such methods are ironically also what makes experts more likely to be personally identifiable than people who spend their time copying others’ code without permission, because such methods are more intrinsically linked to personal interests and viewpoints. Originality gets noticed.

unnamed labrat January 6, 2016 7:33 AM

Certainly the research points to the necessity of developing an anti-stylometry toolbench that feeds back AST-based (as well as lexical) metrics to a programmer who wishes to remain anonymous, providing refactoring guidance.

This could make for a nice anti-stylometry talk.
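
A sketch of the kind of AST-based metric feedback described above, using Python’s own ast module on source code; the paper targets binaries and real feature sets are much richer, so treat this purely as the shape of the idea. A real toolbench would compare these numbers against a population baseline and suggest refactorings where the author is an outlier:

```python
# Sketch of AST-based metrics for anti-stylometry feedback: compute a few
# structural statistics of a piece of source code that an anonymity-minded
# programmer could check against a baseline before release.
import ast
from collections import Counter

def max_depth(node, depth=0):
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(c, depth + 1) for c in children)

def ast_metrics(source):
    tree = ast.parse(source)
    node_counts = Counter(type(n).__name__ for n in ast.walk(tree))
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    avg_body = sum(len(f.body) for f in funcs) / len(funcs) if funcs else 0.0
    return {
        "node_type_counts": node_counts,        # structural vocabulary
        "avg_statements_per_function": avg_body,
        "max_nesting_depth": max_depth(tree),
    }

sample = "def f(xs):\n    return [x * x for x in xs if x > 0]\n"
print(ast_metrics(sample))
```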
