Identifying People from their Writing Style

It’s called stylometry, and it’s based on the analysis of things like word choice, sentence structure, syntax and punctuation. In one experiment, researchers were able to identify 80% of users with a 5,000-word writing sample.
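To make "word choice, sentence structure, syntax and punctuation" concrete, here is a minimal sketch of the kind of features such an analysis might count. It is not the researchers' tool: the function name `style_features`, the short function-word list, and the choice of features are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical short list of function words; real systems use far more.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "but")
PUNCTUATION = ",;:!?-"


def style_features(text):
    """Return a dict of simple style features for one writing sample."""
    words = re.findall(r"[a-z']+", text.lower())
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n_words = max(len(words), 1)  # avoid division by zero on empty input

    features = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "vocab_richness": len(counts) / n_words,  # distinct words / all words
    }
    # Relative frequency of each function word (word choice).
    for w in FUNCTION_WORDS:
        features["fw_" + w] = counts[w] / n_words
    # Punctuation marks per character of text (punctuation habits).
    for p in PUNCTUATION:
        features["punct_" + p] = text.count(p) / max(len(text), 1)
    return features
```

Two samples by the same author would be expected to produce similar feature values, which is what a classifier then exploits.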

Download tools here, including one to anonymize your writing style.

Tags: anonymity, biometrics, de-anonymization, identification

Posted on January 24, 2013 at 1:33 PM • 30 Comments

Comments

redsnorf • January 24, 2013 1:52 PM

Does anybody know how easy or difficult it is to impersonate someone “stylometrically”?

Put another way, does stylometry assume authors are not actively trying to defeat it?

Dagny Taggart • January 24, 2013 2:31 PM

We could look at the way they send Morse Code to identify the operator, commonly referred to as the operator’s “fist” in Military Intelligence parlance.

B. Johnson • January 24, 2013 2:39 PM

This can’t be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago.

And something similar was used on Dexter during the second(?) season when he culled various works for his manifesto and was called out on it because there were dozens of different authors.

Troy Mckee • January 24, 2013 2:41 PM

I wonder how hard it would be to use steganography and stylometry so that differences in writing style could be used to encode a message?

Marc • January 24, 2013 3:52 PM

@Troy Mckee – My understanding is that stylometry has been used in this way for centuries: in particular, an agent or correspondent can change his/her style to signal duress or suspected surveillance. This can be done either by including pre-chosen code words, or – if no such arrangement was made ahead of time – simply by writing in a style dramatically different from usual.

Marc • January 24, 2013 3:53 PM

@redsnorf – It’s relatively easy, depending on the target: remember the Bad Hemingway competition?

MJ McEvoy • January 24, 2013 4:19 PM

I can see where this might be more appropriate to source code than to letters. I remember that I had a very different style of writing C and Perl than most of the others that worked with me, so much so that people could tell what pieces had been pulled from my code and what was written by others in the development team.

MingoV • January 24, 2013 4:52 PM

It isn’t new, and it’s just as overstated as previous studies. The 80% accurate identification rate applies only within the 5,000 subjects. The identification rate will plummet when the pool of writers expands to everyone who can write in English.

Godel • January 24, 2013 5:17 PM

@MingoV

The identification rate may improve when the pool of writers is reduced, such as when there is a specific group of suspects.

It’s always going to be unreliable, perhaps better used as an exculpatory process. If the suspect has a 90 IQ and the text has a high degree of literacy, then it’s probably not them.

The usual simple method of anonymisation mentioned is to use Google Translate to change your text into different languages, then back again and clean up the obvious mistakes.

John David Galt • January 24, 2013 6:49 PM

Used as exculpatory, this sounds like a worthwhile innovation (and the other writers are right, it’s not new). But we’re fast becoming such a police state that I expect it soon to be used as “proof” of guilt.

Toor Useer • January 24, 2013 9:27 PM

“But we’re fast becoming such a police state that I expect it soon to be used as “proof” of guilt.”

Not soon: it happened a few years ago in Berlin, Germany. A scientist was arrested because the police said he was a member of a group that burns army equipment (I think they destroyed only trucks, cars and other vehicles).

One reason they gave was that some words he used in his publications also appeared in the writings of the anti-army group.

Here is more information. His wife wrote a blog about this:

http://annalist.noblogs.org/

http://de.wikipedia.org/wiki/Andrej_Holm

http://www.sowi.hu-berlin.de/lehrbereiche/stadtsoz/mitarbeiterinnen/a-z/holm

humblist • January 24, 2013 11:03 PM

Syntax,game,five words,logical

redsmurf • January 24, 2013 11:32 PM

It’s like in the Silver Linings Playbook movie, when the main character realizes the identity of a letter because another person often uses a certain phrase.

Mahrud • January 25, 2013 1:22 AM

Coursera, a free online education website, is going to use “unique typing patterns” as a student’s signature in order to prevent cheating:
http://blog.coursera.org/post/40080531667/signaturetrack

ThomasC • January 25, 2013 5:13 AM

@MingoV

Not 5,000 subjects. 5000 words per author [1].

[1] https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth

motters • January 25, 2013 5:25 AM

Also see https://launchpad.net/stylom

scripted lynx user • January 25, 2013 6:45 AM

@B. Johnson “This can’t be new. I remember this exact thing used as a plot point in some kind of TV procedural years ago.”

Writeprint is a real product, deployed by Dark Web in 2007, that was already doing stylometry against Islamist activists. Quoting http://www.nsf.gov/news/news_summ.jsp?cntn_id=110040 :

One of the tools developed by Dark Web is a technique called Writeprint,
which automatically extracts thousands of multilingual, structural, and
semantic features to determine who is creating ‘anonymous’ content online.
[…] By analyzing these certain features, it can determine with more than 95
percent accuracy if the author has produced other content in the past.

scripted lynx user • January 25, 2013 7:45 AM

Message to Bruce:

Years ago, the comment form of your blog advised <<Real names aren’t required, but please give us something to call you. Conversations among several people called “Anonymous” get too confusing.>> under the field “Name”.

Why did you stop this message? Did a Three Letter Agency tell you that their stylometric software would perform better without this advice?

Is it a move to fight comment-spammers?

scripted lynx user • January 25, 2013 7:47 AM

Oops, citation cut. The citation is:

Real names aren’t required, but please give us something to call you. Conversations among several people called “Anonymous” get too confusing.

derpmasterflex • January 25, 2013 8:42 AM

Max Vision said the second time he was busted for hacking DoD sites was because the investigators noticed a unique colloquialism he was using.

I lost the article, but Chinese authorities do the same thing with databases that store everything political dissidents write under nyms, comparing it to their university papers so they can match names.

Rick Auricchio • January 25, 2013 10:35 AM

@MJ McEvoy:

I agree about coding style. I found source code online for the Apple /// OS recently, originally written in 1980-81.

After several minutes of examining the 6502 assembly-language source code of the floppy disk driver, I recognized it as my own. (To be fair, I should admit that I recalled rewriting that driver, though I wasn’t sure at first whose version I’d found.)

Chris Lawson • January 25, 2013 10:41 AM

Claims of 80% success rate would depend on a lot of factors that may not apply to real-world examples. Unfortunately, this is not from a peer-reviewed paper but a conference presentation, and I don’t feel like trawling through the 1hr40min video to work out how well the system really works.

lol • January 25, 2013 11:55 AM

I wonder if this software can be used to match forum postings with academic papers to find out who the elusive Satoshi (secret inventor of bitcoin) is, though it’s probably numerous people using one handle.

Howard • January 25, 2013 3:35 PM

The stylometry software is that it such software is often used to prove Dreams From My Father was written by Bill Ayers, and not Barry Soetoro. Naturally, this cannot be true …

Howard • January 25, 2013 3:36 PM

I meant to start off saying “The problem with” … wups. I’ll call that poor-man’s anonymizing.

anon • January 26, 2013 8:20 AM

I’m the author of plenty plays
and behind the names of many
are hidden tales of love,
lust, freedom and revenge
plots a-plenty

Few know me and my style is varied
I output at least in two languages
In film, in music, in books, on tv
There are a great many instances
of me

And no one has cottoned on yet.
And that’s exactly how it should be.

Wendy M. Grossman • January 26, 2013 8:35 AM

Thing is, you can anonymize the style all you want but the writer’s personal obsessions are still likely to come through.

Slv • January 26, 2013 2:55 PM

Stylometry is a well established field and is further maturing with the explosion of interest in text mining as of late (full disclosure: I am a researcher in this and similar areas). Here are a couple other interesting, somewhat related examples:

http://www.npr.org/templates/story/story.php?storyId=127211884

http://www.secretlifeofpronouns.com/

CypherMonkey • January 26, 2013 7:18 PM

Some work [not mine] on extending stylometry to Internet scales:
http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/

I did a project for a class doing stylometry with forum posts. It’s harder than blogs because you generally have fewer words per author and some people post disproportionately, but even with a limited feature set, you can achieve much better performance than random chance. Some other people in that class did a similar thing with chat logs and got very impressive results; they used a much larger feature set (word-based multinomial event model, if I recall).

This is a powerful tool; I’d bet that most governments invest heavily in it.
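For readers wondering what a word-based multinomial event model looks like in practice, a rough sketch follows. It is not the class projects' code: the names `train` and `attribute`, the uniform prior over authors, and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Build per-author word counts.

    samples: dict mapping author name -> list of known texts.
    Returns (model, vocab), where model maps author -> Counter of words.
    """
    model = {}
    vocab = set()
    for author, texts in samples.items():
        counts = Counter()
        for t in texts:
            counts.update(tokenize(t))
        model[author] = counts
        vocab.update(counts)
    return model, vocab


def attribute(model, vocab, text):
    """Return the author under whom the text's words are most likely.

    Assumes a uniform prior over authors and add-one smoothing;
    words never seen in training are ignored.
    """
    tokens = [w for w in tokenize(text) if w in vocab]
    best_author, best_score = None, -math.inf
    for author, counts in model.items():
        total = sum(counts.values()) + len(vocab)
        # Sum of log-probabilities of each token under this author's counts.
        score = sum(math.log((counts[w] + 1) / total) for w in tokens)
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```

With only a few short posts per author the word counts are sparse, which is one reason forum posts are harder to attribute than long blog entries.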

Christian Koch • January 27, 2013 2:21 AM

If enough people use it, wouldn’t others be able to recognize the output of Drexel University’s JSAN software? I think it’s mildly funny that the purpose of JSAN is to anonymize writing, but we could potentially later identify JSAN’s output as coming from JSAN.

Schneier on Security
