Identifying Programmers by Their Coding Style

Fascinating research on de-anonymizing code -- from either source code or compiled code:

Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, is not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in a plagiarism dispute, for instance, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.

Posted on August 13, 2018 at 4:02 PM • 24 Comments

Comments

BBB • August 13, 2018 6:18 PM

This is not too surprising. I remember working as a grader for a C.S. course in college, and we were asked to be on the lookout for students copying each other's work. You can spot it intuitively, even with variables renamed and other attempts at obfuscation. Usually, generating the parse tree of the source code in the work was also enough to identify cheats. So, it's not surprising that this could be done computationally with the help of general AI techniques.
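The parse-tree check described in the comment above can be sketched roughly as follows. This is a minimal illustration using Python's standard `ast` module, not the graders' actual tooling; the function name `structure_fingerprint` and the sample snippets are assumptions made for the example. It only catches renaming and reformatting, and leaves literal constants untouched.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace every identifier with a placeholder so renaming cannot hide a copy."""

    def visit_Name(self, node):
        # Variable references and assignments all become "_".
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameter names become "_".
        node.arg = "_"
        return node

    def visit_FunctionDef(self, node):
        # Function names become "_", then the body is normalized too.
        node.name = "_"
        self.generic_visit(node)
        return node


def structure_fingerprint(source: str) -> str:
    """Return a string describing the shape of the parse tree only.

    Comments, whitespace and line numbers never reach the tree dump,
    so two submissions that differ only in naming and layout match.
    """
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree, annotate_fields=False)


# Hypothetical student submissions: B is A with every name changed.
SUBMISSION_A = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
SUBMISSION_B = "def add_up(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
SUBMISSION_C = "def total(xs):\n    return sum(xs)\n"

same_shape = structure_fingerprint(SUBMISSION_A) == structure_fingerprint(SUBMISSION_B)
different_shape = structure_fingerprint(SUBMISSION_A) != structure_fingerprint(SUBMISSION_C)
```

An exact match on short assignments proves little, which is the objection raised in the next comment: with only a handful of natural solutions, independent work will collide.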

zorro • August 13, 2018 7:43 PM

I remember this sort of thing from my college days... There were N ways to solve the problem, and N+1 students in the class. Somebody would get accused of cheating. Either they gracefully accepted their fate & force-failed the class, or they fought it and got kicked out of the school. Really sucked. Didn't matter if the students were in the top 1% of the school, taking the class not even as an elective, just as an extra because they wanted to learn. Nope, your work is too similar to what somebody else did. Other guy checked in first. You lose.

It reached the point where the students would cheat, and show each other their code, just to avoid getting canned.

Wael • August 14, 2018 12:21 AM

Fascinating research, but there are claims that I don't believe.

the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples

The implication is that {authors, code} are as unique as {authors, biometric fingerprints}, which is hard to believe. I still believe it's possible to statistically identify the most probable author among a limited sample of authors, with some assumptions such as ___. ____

but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.

The majority of open source contributors identify themselves. Some prefer to remain anonymous.

trimming the list from hundreds of thousands to around 50 or so.

It's one thing to claim the permutations / combinations of say n elements amount to hundreds of thousands, and a totally different thing to claim there are hundreds of thousands of attributes to investigate. Seriously? Post a link. I'd like to see that list and count it instead of the sheeps I count every other night.

The researchers don't rely on […] how code was formatted.

Obviously!

a program called a compiler turns it into a series of 1s and 0s that can be read by a machine, called binary. To humans, it mostly looks like nonsense.

Nonsense!

The researchers say that in the future, however, programmers might be able to conceal their styles using more sophisticated methods.

They've been doing that for quite some time !!!

The team found that a developer may be able to spoof their "coding signature," even if they're not specifically trained in creating forgeries.

I believe that.

they found they could differentiate between code samples written by Canadian and by Chinese developers with over 90 percent accuracy.

That's believable. This is rude and politically incorrect. I gots to encode it: compbyeeyenararsLee's

For now, the researchers stress that de-anonymizing code is still a mysterious process, though so far their methods have been shown to work.

In other words: witchcraft and sorcery. So the "methods work", but the process is mysterious, yet they did "research". What process is mysterious? I'll tell you what the problem is. You're trying to use statistics on fuzzy variables or something like that. Look at Lotfi Zadeh's work. It may or may not help. I don't know.

Clive Robinson • August 14, 2018 2:11 AM

@ Wael,

In other words: witchcraft and sorcery. So the "methods work", but the process is mysterious, yet they did "research".

Let us agree it's "a deterministic process with unknown input" or a "black box process". Aside from the oft-quoted GIGO, the researchers are looking at the output to find elusive correlations as a first step.

They then move to looking for individual input to output correlations, then output correlations for known similar inputs.

Anyone who has done cryptanalysis will be familiar with those three steps.

The point is they have not gone on to the all-important characterization of the deterministic process.

We see a lot of this in AI research and it's not good. When pushed you get the long jargon and arm waving and talk of evolving complexity.

One of the requirements for evidence is "independent repeatability by scientific method". We are not seeing this, so it's still in the realms of curiosity, not tool.

As for the implication of "magic", well, Arthur C. Clarke had a thing or two to say about that. The problem is the "sufficiently advanced" part; I'm not seeing it with this research...

RealFakeNews • August 14, 2018 2:31 AM

They use "AI" huh? Laughable.

If they just said they used statistical analysis I'd find it more compelling, but then that's how fingerprinting writing, artwork, drawings, works in the first place.

Not impressed.

As for the invocation of "AI" (as if it's the only way to solve these problems), there was someone on the radio a couple of months ago talking about how AI worked in their service for selecting employees from a pool of candidates.

The lady did a great job of demonstrating how useless it really is when she said that it only filtered the list by a few criteria, and the majority of the work was completed by...humans.

Not bad for an online service that should be fully automated, but instead involved a small army of people sat in a room.

Until I see something to the contrary, AI will remain nothing more than meaningless marketing jargon.

Wael • August 14, 2018 3:16 AM

@Clive Robinson,

Let us agree it's "a deterministic process with unknown input" or a "black box process".

Deterministic it is. That's not the only necessary condition to extract the unknown input; not every F has an inverse F⁻¹. There will be collisions in these situations, and there will be authors who change their style and thinking methods after they, for example, learn a new language, algorithm, technology, etc. I haven't seen a treatment of such effects on de-anonymization fidelity.

One of the requirements for evidence is "independent repeatability by scientific method".

Yes. They are in the early stages. What they need to show is that the code fingerprint space is large enough to cover all developers. Then they need to show that all developers have uniquely identifiable "styles".

One of the requirements for evidence is "independent repeatability by scientific method".

Some have voiced similar comments about AlphaZero

Weather • August 14, 2018 4:03 AM

Don't need to read much. They might do one plus one quicker, but they can only learn in real time like people, so if they let it run for twenty years, what will it know? Probably just what it knew two hours after it started; an ant is more intelligent. As in another forum: if they can use English sentences, construct their own sentences, and know, know, know what they generated after twenty years, by that benchmark they might be called AI.

wiredog • August 14, 2018 5:46 AM

In compiled code? Really? Given all the different optimizations different compilers can throw in there? I seriously doubt that.

vas pup • August 14, 2018 8:43 AM

Related to AI decision making:
Artificial intelligence 'did not miss a single urgent case':
https://www.bbc.com/news/health-44924948

"Some previous attempts at using AI have led to what's known as a "black box" problem - where the reasoning behind the computer analysis is hidden.

[ATTENTION!!!] By contrast, the DeepMind algorithm provides a visual map of where the disease is, allowing clinicians to check how the AI has come to its decision, which is crucial if doctors and patients are to have confidence in its diagnoses."

Denton Scratch • August 14, 2018 9:49 AM

Professional programmers are normally subject to some degree of code review. One of the aims of this review is to impose some kind of 'house style'. A programmer learns to adjust his style to the house rules. This would extend to, for example, choice of names for code objects.

Another aim is to ensure that the code is maintainable; one common way of doing that is to highlight complexity, and try to keep it under control. Again, a programmer adjusts his approach to match house rules, for fear of having to re-do his work.

Programmers also adjust their style to match the style of their colleagues, regardless of rules and reviews; it just makes for a nicer piece of code.

All that said, every significant piece of commercial code I've worked on in the last ten years was a palimpsest, bearing the fingerprints of up to a dozen authors, all woven together. Good luck to the researchers untangling real-world code like that!

Me myself • August 14, 2018 9:56 AM

Great! This idea could be incorporated into SVN's "blame" command. We would finally be able to single out idiots who uploaded crap into our codebases before version control was established.

Now come on, does anybody really believe that, if this thing really worked (at a level of actual usefulness, not just "programmer A is 15% more likely to have written this code than programmers B, C & D"), the NSA wouldn't shush these researchers, take their work and attempt to use it to identify authors of state-level malware?

@wiredog: Fully agree with you. Next thing you know they'll promise their AI can give the code writer's current phone and address.

albert • August 14, 2018 2:57 PM

@Wael, @et al,

"...either source code or compiled code:..." Yeah, you really need 'AI' to evaluate source code. Doing it with compiled code might be impressive, but I'd prefer to see results that can stand up in a court of law, not xx% certainty.

Re: Zadeh's regrettably named 'fuzzy logic'.
Last I heard, folks were using FL for image-analysis and related fields. Thousands of engineers quietly use it in machine control applications, except in Japan, where they advertised it. I wrote a subroutine in RLL for our field application engineers. It worked very well. Interestingly, it worked even with rungs of ladder logic missing or with incorrect parameters. You could cripple it quite a bit before things started to go wrong. Impressive.

. .. . .. --- ....

Wael • August 14, 2018 3:48 PM

@albert,

but I'd prefer to see results that can stand up in a court of law, not xx% certainty.

Let's wait until Fuzzy laws become the norm :)

You could cripple it quite a bit before things started to go wrong. Impressive.

Yes! That would be impressive. Fault tolerance to the next level.

Clive Robinson • August 14, 2018 4:02 PM

@ Albert, @Wael and others,

Zadeh's regrettably named 'fuzzy logic'.

His reasons for naming it were valid all those years ago "in the year of Sixty nine" as was "infinite value logic" prior to it.

One of the things people have problems with on first meeting it is understanding the difference between "fuzzy truth" and probability.

Fuzzy truth is a measure of "vagueness" whilst probability is a measure of "ignorance". To make it worse for newcomers, both degrees of truth and probabilities range between 0 and 1, and thus seeing the difference is somewhat harder.

But it gets worse: the degrees of truth may be quantified not with numbers but with words, thus a variable "height" may contain "short, average, tall, extra tall, and extra extra tall". The use of words in general affects little, as the logical rules are built to accept them. That is, the words form an arbitrary ordered set, where they have implicit range equivalents within membership functions.

The problem is that fuzzy logic deals not with ignorance but vagueness. Thus there is an implicit assumption that all problems are not just known but bounded. Further, that the system inputs are known and ordered in a similar manner to the functioning of a state engine.

Translating "ignorance" to "vagueness" is neither easy nor perfect. In the case of the application quoted in the article, it appears to be well nigh impossible.
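The "height" example in the comment above can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the triangular shape of the membership functions, the centimetre breakpoints, and the names `triangular`, `HEIGHT_TERMS` and `height_memberships`. Only three of the five words are modelled to keep it short.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical breakpoints in centimetres: (left, peak, right) per word.
# The words form an ordered set whose ranges deliberately overlap.
HEIGHT_TERMS = {
    "short": (140, 155, 170),
    "average": (160, 172, 185),
    "tall": (175, 190, 205),
}


def height_memberships(cm):
    """Map a crisp height to a degree of truth for each linguistic term."""
    return {term: triangular(cm, *abc) for term, abc in HEIGHT_TERMS.items()}


at_peak = height_memberships(172)    # fully "average", not at all "short" or "tall"
in_overlap = height_memberships(165) # partly "short" and partly "average"
overlap_total = sum(in_overlap.values())
```

At 165 cm the degrees of truth for "short" and "average" are both non-zero and do not add up to 1. That is the distinction drawn above: they describe how well a vague word fits a known measurement, not how likely an unknown outcome is.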

Wael • August 14, 2018 7:03 PM

@Clive Robinson, @albert,

Fuzzy truth is a measure of "vagueness" whilst probability is a measure of "ignorance".

I like that distinction! But is probability a measure of ignorance or is it an inverse measure of ignorance?

Translating "ignorance" to "vagueness" is neither easy nor perfect. In the case of the application quoted in the article, it appears to be well nigh impossible.

Spot on! Excellent diagnosis. How about translating vagueness to ignorance? :)

@albert,

Can you elaborate more on your experience with Fuzzy Logic?

Ismar • August 15, 2018 1:04 AM

I feel obliged to comment here as I have years of "coding" in professional environments.

Most software development companies strive to make the code base as uniform as possible.
This is done by using coding guidelines, coding standards, and development frameworks with IDE support for scaffolding, as well as tools for optimising code for both maintenance and performance.
In short, while every developer introduces traces of their individuality into the code base, they are heavily obfuscated by the aforementioned processes.

On the other hand, all of the code written while working for a software house is by definition associated with each individual developer, due to the extensive use of code repositories with tools that track every little code change done by anybody allowed to change the code base.


The only valid usage of the techniques mentioned in the article would, therefore, be in the attribution of malware to its authors, where the code base was extracted from binaries found on infected machines but no other information about the code is available.

justinacolmena • August 15, 2018 11:34 AM

privacy implications, especially for the thousands of developers who contribute open source code to the world.

Good grief.

I would have reason to think that most open source developers are proud of their work to have contributed it to the world on such terms.

If open source developers are in such dire need of privacy, the concern here is more the censorship of open source code, a constant threat from Microsoft and other proprietary software shops who may or may not be so proud of their own work.

Clive Robinson • August 15, 2018 2:37 PM

@ justinacolmena,

I would have reason to think that most open source developers are proud of their work to have contributed it to the world on such terms.

Most probably are proud of their work at some point. However, I suspect few do it for the "fame" in the general populace, because this often makes them a "commodity" or worse in many people's eyes.

As was once noted,

    You can please some of the people some of the time...

Or to put it another way at times the whole world feels like a critic, even those that are close to you. When you add the fact that many idiots assume the Internet gives them privacy to abuse you how they see fit, you can understand why some Open Source developers desire to be in effect anonymous.

In the past I've been asked why I don't have my own "security" web site. Well, there are various reasons I've mentioned, in that they can be a lot of effort in many ways, especially keeping on top of various issues like patching, stopping various types of spam, and less than pleasant users.

If you look back on this blog you will see that some people have assumed our host here @Bruce is there for them to use and sometimes abuse in various ways. It's a "big ask" of anybody to have to put up with that day after day.

And that's all before having to go out and look for topics to post at least once every weekday.

I've no idea how much time @Bruce and @Moderator spend each day on the upkeep of this blog, and I for one am grateful that they do. Likewise I hope they will continue to do so for the foreseeable future. But whichever way you look at it, it's a considerable investment not just in effort but also in self. I'm sure there must be times when @Bruce would rather be doing something else with that ever scarce resource "time" than doing what he does for this blog and its readers, contributors and commenters.

The point is some Open Source coders want to spend their time coding, not handling loads of emails, Twitter comments etc., nor do they want to run teams of other coders, or work to somebody else's wish list, or many other things that form the straitjacket and working man's noose of many software industry jobs that might only pay 9-5 but expect body, soul and original thought 365.25 even when it's not in the employment contract...

TRX • August 15, 2018 2:49 PM

So, you have a room full of students, all of approximately the same age and background, in the same classroom, with the same instructor, taught the same lessons from the same books, and...

Well, obviously any similar styles would be cheating.

Chris • August 15, 2018 4:17 PM

@Winter
Oddly, in the study you reference for Paul's epistles, the author uses the King James translation, rather than the original Greek.



Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.