Schneier on Security
A blog covering security and security technology.
January 24, 2013
Identifying People from their Writing Style
It's called stylometry, and it's based on the analysis of things like word choice, sentence structure, syntax and punctuation. In one experiment, researchers were able to identify 80% of users with a 5,000-word writing sample.
Download tools here, including one to anonymize your writing style.
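To make "word choice, sentence structure, syntax and punctuation" concrete, here is a minimal sketch of the kind of features a stylometry tool might compute. It is an illustration only, not the method used by the researchers or by the downloadable tools; the function name `style_features` and the short `FUNCTION_WORDS` list are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical, very short list of common "function words"; real systems
# typically use hundreds of such words plus many other features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "but")


def style_features(text):
    """Return a small dictionary of crude writing-style measurements."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Treat runs of ., ! or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n_words = max(len(words), 1)  # avoid division by zero on empty input

    features = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "comma_rate": text.count(",") / n_words,
        "semicolon_rate": text.count(";") / n_words,
    }
    # Relative frequency of each function word (a proxy for word choice).
    for w in FUNCTION_WORDS:
        features["fw_" + w] = counts[w] / n_words
    return features


if __name__ == "__main__":
    sample = "The cat sat. The dog ran, and it barked!"
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Two writing samples can then be compared by the distance between their feature dictionaries; with only a handful of features this is far too weak to identify anyone, which is why long samples and large feature sets matter.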
Posted on January 24, 2013 at 1:33 PM
• 30 Comments
Does anybody know how easy or difficult it is to impersonate someone "stylometrically"?
Put another way, does stylometry assume authors are not actively trying to defeat it?
We could look at the way they send Morse code to identify the operator, commonly referred to as the operator's "fist" in military intelligence parlance.
This can't be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago.
And something similar was used on Dexter during the second(?) season when he culled various works for his manifesto and was called out on it because there were dozens of different authors.
I wonder how hard it would be to use steganography and stylometry so that differences in writing style could be used to encode a message?
@Troy Mckee - My understanding is that stylometry has been used in this way for centuries: in particular, an agent or correspondent can change his/her style to signal duress or suspected surveillance. This can be done either by including pre-chosen code words, or - if no such arrangement was made ahead of time - simply by writing in a style dramatically different from usual.
@redsnorf - It's relatively easy, depending on the target: remember the Bad Hemingway competition?
I can see where this might be more appropriate to source code than to letters. I remember that I had a very different style of writing C and Perl than most of the others that worked with me, so much so that people could tell what pieces had been pulled from my code and what was written by others in the development team.
It isn't new, and it's just as overstated as previous studies. The 80% accurate identification rate applies only within the 5,000 subjects. The identification rate will plummet when the pool of writers expands to everyone who can write in English.
The identification rate may improve when the pool of writers is reduced, such as when there is a specific group of suspects.
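The pool-size argument can be illustrated with a toy simulation. This is a sketch on purely synthetic data, not a reproduction of any study: each hypothetical author is a random point in a small "style space", each writing sample is that point plus noise, and the sample is attributed to the nearest author. The function name `identification_rate` and all parameters (`dims`, `noise`, `trials`) are assumptions made for the example.

```python
import random


def identification_rate(pool_size, trials=200, dims=4, noise=1.0, seed=0):
    """Fraction of synthetic samples attributed to the correct author."""
    rng = random.Random(seed)
    # Each hypothetical author gets a "true" style vector.
    authors = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(pool_size)]

    correct = 0
    for _ in range(trials):
        true_idx = rng.randrange(pool_size)
        # A writing sample is the author's style plus random variation.
        sample = [m + rng.gauss(0, noise) for m in authors[true_idx]]
        # Attribute the sample to the author whose style vector is closest.
        guess = min(
            range(pool_size),
            key=lambda i: sum((a - s) ** 2 for a, s in zip(authors[i], sample)),
        )
        if guess == true_idx:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    for size in (5, 50, 500):
        print(size, identification_rate(size))
```

With the feature space and the noise held fixed, adding candidates can only crowd the space, so the rate falls as the pool grows. Real stylometric features are far richer than four numbers, so the absolute rates here mean nothing; only the trend is the point.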
It's always going to be unreliable, perhaps better used as an exculpatory process. If the suspect has a 90 IQ and the text shows a high degree of literacy, then it's probably not them.
The usual simple method of anonymisation mentioned is to use Google Translate to change your text into different languages, then back again and clean up the obvious mistakes.
Used as exculpatory, this sounds like a worthwhile innovation (and the other writers are right, it's not new). But we're fast becoming such a police state that I expect it soon to be used as "proof" of guilt.
"But we're fast becoming such a police state that I expect it soon to be used as "proof" of guilt."
Not soon: it happened a few years ago in Berlin, Germany. A scientist was arrested because the police said he was a member of a group that burns army equipment (I think they destroyed only trucks, cars, and other vehicles).
One reason they gave was that some words he used in his publications also appeared in the writings of the anti-army group.
Here is more information. His wife wrote a blog about this:
It's like in the movie Silver Linings Playbook, when the main character realizes who wrote a letter because another person often uses a certain phrase.
@B. Johnson "This can't be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago."
Writeprint is a real product, deployed by Dark Web in 2007 and already doing stylometry against Islamist activists. Quoting http://www.nsf.gov/news/news_summ.jsp?... :
One of the tools developed by Dark Web is a technique called Writeprint,
which automatically extracts thousands of multilingual, structural, and
semantic features to determine who is creating 'anonymous' content online.
[...] By analyzing these certain features, it can determine with more than 95
percent accuracy if the author has produced other content in the past.
Message to Bruce:
Years ago, the comment form of your blog advised > under the field "Name".
Why did you stop showing this message? Did a three-letter agency tell you that their stylometric software would perform better without this advice?
Is it a move to fight comment spammers?
Oops, citation cut. The citation is:
Real names aren't required, but please give us something to call you. Conversations among several people called "Anonymous" get too confusing.
Max Vision said the second time he was busted for hacking DoD sites was because the investigators noticed a unique colloquialism he was using.
I lost the article, but Chinese authorities do the same thing with databases that store everything political dissidents write under nyms, comparing it to their university papers so they can match names.
I agree about coding style. I found source code online for the Apple /// OS recently, originally written in 1980-81.
After several minutes of examining the 6502 assembly-language source code of the floppy disk driver, I recognized it as my own. (To be fair, I should admit that I recalled rewriting that driver, though I wasn't sure at first whose version I'd found.)
Claims of 80% success rate would depend on a lot of factors that may not apply to real-world examples. Unfortunately, this is not from a peer-reviewed paper but a conference presentation, and I don't feel like trawling through the 1hr40min video to work out how well the system really works.
I wonder if this software can be used to match forum postings with academic papers to find out who the elusive Satoshi (the secret inventor of Bitcoin) is, though it's probably numerous people using one handle.
The stylometry software is that it such software is often used to prove Dreams From My Father was written by Bill Ayers, and not Barry Soetoro. Naturally, this cannot be true ...
I meant to start off saying "The problem with" ... wups. I'll call that poor-man's anonymizing.
I'm the author of plenty plays
and behind the names of many
are hidden tales of love,
lust, freedom and revenge
Few know me and my style is varied
I output at least in two languages
In film, in music, in books, on tv
There are a great many instances
And no one has cottoned on yet.
And that's exactly how it should be.
Thing is, you can anonymize the style all you want, but the writer's personal obsessions are still likely to come through.
Some work [not mine] on extending stylometry to Internet scales:
I did a project for a class doing stylometry with forum posts. It's harder than blogs because you generally have fewer words per author and some people post disproportionately, but even with a limited feature set, you can achieve much better performance than random chance. Some other people in that class did a similar thing with chat logs and got very impressive results; they used a much larger feature set (word-based multinomial event model, if I recall).
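For readers wondering what a "word-based multinomial event model" looks like, here is a minimal sketch, assuming it refers to multinomial Naive Bayes over word counts. It is not the code from that class project; the function names and the two invented authors with their toy posts are hypothetical and exist only to show the mechanics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(posts_by_author):
    """Build per-author word counts, a shared vocabulary, and author priors."""
    counts = {
        author: Counter(w for post in posts for w in tokenize(post))
        for author, posts in posts_by_author.items()
    }
    vocab = set().union(*counts.values())
    total_posts = sum(len(posts) for posts in posts_by_author.values())
    priors = {a: len(p) / total_posts for a, p in posts_by_author.items()}
    return counts, vocab, priors


def classify(text, model):
    """Return the author with the highest log-probability for the text."""
    counts, vocab, priors = model
    best_author, best_score = None, -math.inf
    for author, word_counts in counts.items():
        total = sum(word_counts.values())
        score = math.log(priors[author])
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are skipped
                # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
                score += math.log((word_counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


if __name__ == "__main__":
    # Invented toy data, not real forum posts.
    model = train({
        "alice": ["honestly i reckon the server is fine",
                  "i reckon we should wait honestly"],
        "bob": ["per my previous email the server is down",
                "kindly review per the attached email"],
    })
    print(classify("honestly i reckon", model))
    print(classify("kindly review my email", model))
```

The small per-author word counts are exactly why forum posts are harder than blogs: with few words, the smoothed probabilities for different authors end up close together.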
This is a powerful tool; I'd bet that most governments invest heavily in it.
If enough people use it, wouldn't others be able to recognize the output of Drexel University's JSAN software? I think it's mildly funny that the purpose of JSAN is to anonymize writing, but we could potentially later identify JSAN's output as coming from JSAN.