Identifying People from their Writing Style

It's called stylometry, and it's based on the analysis of things like word choice, sentence structure, syntax and punctuation. In one experiment, researchers were able to identify 80% of users with a 5,000-word writing sample.
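To make the idea concrete, here is a minimal sketch of the kind of measurements involved. It is not the researchers' method. The function-word list, the punctuation set, and the names `style_features` and `style_distance` are all assumptions chosen for illustration; real tools extract far more features.

```python
import re
from collections import Counter

# Hypothetical, tiny feature lists chosen for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "but")
PUNCTUATION = (",", ";", ":", "!", "?")

def style_features(text):
    """Return a dict of simple style measurements for one writing sample."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = max(len(words), 1)  # avoid dividing by zero on empty input
    # Word choice: relative frequency of each function word.
    features = {f"word:{w}": counts[w] / total for w in FUNCTION_WORDS}
    # Punctuation habits: marks per word.
    for mark in PUNCTUATION:
        features[f"punct:{mark}"] = text.count(mark) / total
    # Sentence structure: average sentence and word length.
    features["avg_sentence_len"] = len(words) / max(len(sentences), 1)
    features["avg_word_len"] = sum(map(len, words)) / total
    return features

def style_distance(a, b):
    """Sum of absolute feature differences; smaller means more alike."""
    return sum(abs(a[k] - b[k]) for k in a)
```

Comparing an anonymous sample against each known author's features and picking the smallest distance gives a crude attribution. The features here sit on different scales, so a serious attempt would normalize them first.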

Download tools here, including one to anonymize your writing style.

Posted on January 24, 2013 at 1:33 PM • 30 Comments

Comments

redsnorf • January 24, 2013 1:52 PM

Does anybody know how easy or difficult it is to impersonate someone "stylometrically"?

Put another way, does stylometry assume authors are not actively trying to defeat it?

Dagny Taggart • January 24, 2013 2:31 PM

We could look at the way they send Morse code to identify the operator, commonly referred to as the operator's "fist" in military intelligence parlance.

B. Johnson • January 24, 2013 2:39 PM

This can't be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago.

And something similar was used on Dexter during the second(?) season when he culled various works for his manifesto and was called out on it because there were dozens of different authors.

Troy Mckee • January 24, 2013 2:41 PM

I wonder how hard it would be to use steganography and stylometry so that differences in writing style could be used to encode a message?

Marc • January 24, 2013 3:52 PM

@Troy Mckee - My understanding is that stylometry has been used in this way for centuries: in particular, an agent or correspondent can change his/her style to signal duress or suspected surveillance. This can be done either by including pre-chosen code words, or - if no such arrangement was made ahead of time - simply by writing in a style dramatically different from usual.

Marc • January 24, 2013 3:53 PM

@redsnorf - It's relatively easy, depending on the target: remember the Bad Hemingway competition?

MJ McEvoy • January 24, 2013 4:19 PM

I can see where this might be more appropriate to source code than to letters. I remember that I had a very different style of writing C and Perl than most of the others that worked with me, so much so that people could tell what pieces had been pulled from my code and what was written by others in the development team.

MingoV • January 24, 2013 4:52 PM

It isn't new, and it's just as overstated as previous studies. The 80% accurate identification rate applies only within the 5,000 subjects. The identification rate will plummet when the pool of writers expands to everyone who can write in English.

Godel • January 24, 2013 5:17 PM

@MingoV

The identification rate may improve when the pool of writers is reduced, such as when there is a specific group of suspects.

It's always going to be unreliable, perhaps better used as an exculpatory process. If the suspect has a 90 IQ and the text shows a high degree of literacy, then it's probably not them.

The usual simple method of anonymisation mentioned is to use Google Translate to change your text into different languages, then back again and clean up the obvious mistakes.

John David Galt • January 24, 2013 6:49 PM

Used as exculpatory, this sounds like a worthwhile innovation (and the other writers are right, it's not new). But we're fast becoming such a police state that I expect it soon to be used as "proof" of guilt.

Toor Useer • January 24, 2013 9:27 PM

"But we're fast becoming such a police state that I expect it soon to be used as "proof" of guilt."

Not soon; it happened a few years ago in Berlin, Germany. A scientist was arrested because the police said he was a member of a group that burns army equipment (I think they destroyed only trucks, cars and other vehicles).

One reason they gave was that some words he used in his publications also appeared in the writings of the anti-army group.

Here is more information. His wife wrote a blog about this:

http://annalist.noblogs.org/

http://de.wikipedia.org/wiki/Andrej_Holm

http://www.sowi.hu-berlin.de/lehrbereiche/...

redsmurf • January 24, 2013 11:32 PM

It's like in the movie Silver Linings Playbook, when the main character realizes who wrote a letter because another person often uses a certain phrase.

scripted lynx user • January 25, 2013 6:45 AM

@B. Johnson "This can't be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago."

Writeprint is a real product deployed by Dark Web in 2007, already doing stylometry against Islamist activists. Quoting http://www.nsf.gov/news/news_summ.jsp?... :

One of the tools developed by Dark Web is a technique called Writeprint,
which automatically extracts thousands of multilingual, structural, and
semantic features to determine who is creating 'anonymous' content online.
[...] By analyzing these certain features, it can determine with more than 95
percent accuracy if the author has produced other content in the past.

scripted lynx user • January 25, 2013 7:45 AM

Message to Bruce:

Years ago, the comment form of your blog advised > under the field "Name".

Why did you remove this message? Did a three-letter agency tell you that their stylometric software would perform better without this advice?

Is it a move to fight comment spammers?

scripted lynx user • January 25, 2013 7:47 AM

Oops, citation cut. The citation is:

Real names aren't required, but please give us something to call you. Conversations among several people called "Anonymous" get too confusing.

derpmasterflex • January 25, 2013 8:42 AM

Max Vision said the second time he was busted for hacking DoD sites was because the investigators noticed a unique colloquialism he was using.

I lost the article, but Chinese authorities do the same thing with databases that store everything political dissidents write under pseudonyms, comparing it to their university papers so they can match names.

Rick Auricchio • January 25, 2013 10:35 AM

@MJ McEvoy:

I agree about coding style. I found source code online for the Apple /// OS recently, originally written in 1980-81.

After several minutes of examining the 6502 assembly-language source code of the floppy disk driver, I recognized it as my own. (To be fair, I should admit that I recalled rewriting that driver, though I wasn't sure at first whose version I'd found.)

Chris Lawson • January 25, 2013 10:41 AM

Claims of 80% success rate would depend on a lot of factors that may not apply to real-world examples. Unfortunately, this is not from a peer-reviewed paper but a conference presentation, and I don't feel like trawling through the 1hr40min video to work out how well the system really works.

lol • January 25, 2013 11:55 AM

I wonder if this software can be used to match forum postings with academic papers to find out who the elusive Satoshi (the secret inventor of Bitcoin) is, though it's probably numerous people using one handle.

Howard • January 25, 2013 3:36 PM

I meant to start off saying "The problem with" ... wups. I'll call that poor-man's anonymizing.

anon • January 26, 2013 8:20 AM

I'm the author of plenty plays
and behind the names of many
are hidden tales of love,
lust, freedom and revenge
plots a-plenty

Few know me and my style is varied
I output at least in two languages
In film, in music, in books, on tv
There are a great many instances
of me

And no one has cottoned on yet.
And that's exactly how it should be.

CypherMonkey • January 26, 2013 7:18 PM

Some work [not mine] on extending stylometry to Internet scales:
http://33bits.org/2012/02/20/...

I did a project for a class doing stylometry with forum posts. It's harder than blogs because you generally have fewer words per author and some people post disproportionately, but even with a limited feature set, you can achieve much better performance than random chance. Some other people in that class did a similar thing with chat logs and got very impressive results; they used a much larger feature set (word-based multinomial event model, if I recall).

This is a powerful tool; I'd bet that most governments invest heavily in it.
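For readers unfamiliar with the term, a word-based multinomial event model can be sketched roughly as below. This is an illustrative guess at the general technique (multinomial naive Bayes with add-one smoothing and a uniform prior over authors), not the class project's actual code. The function names `tokenize`, `train`, `log_likelihood`, and `attribute` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(samples):
    """samples maps author -> list of texts.

    Returns per-author word counts and the shared vocabulary.
    """
    counts = {author: Counter(w for t in texts for w in tokenize(t))
              for author, texts in samples.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(words, author_counts, vocab_size):
    """Log-probability of the words under one author's word distribution."""
    total = sum(author_counts.values())
    # Add-one (Laplace) smoothing so unseen words do not zero out the score.
    return sum(math.log((author_counts[w] + 1) / (total + vocab_size))
               for w in words)

def attribute(text, counts, vocab):
    """Return the author whose word distribution best explains the text.

    Assumes a uniform prior over authors; out-of-vocabulary words are ignored.
    """
    words = [w for w in tokenize(text) if w in vocab]
    scores = {a: log_likelihood(words, c, len(vocab)) for a, c in counts.items()}
    return max(scores, key=scores.get)
```

With only a few words per author, as on forums, the smoothing term dominates and the scores get noisy, which fits the point above about short samples being harder.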

Christian Koch • January 27, 2013 2:21 AM

If enough people use it, wouldn't others be able to recognize the output of Drexel University's JSAN software? I think it's mildly funny that the purpose of JSAN is to anonymize writing, but we could potentially later identify JSAN's output as coming from JSAN.


Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of Co3 Systems, Inc.