Schneier on Security
A blog covering security and security technology.
August 3, 2011
Identifying People by their Writing Style
The article is in the context of the big Facebook lawsuit, but the part about identifying people by their writing style is interesting:
Recently, a team of computer scientists at Concordia University in Montreal took advantage of an unusual set of data to test another method of determining e-mail authorship. In 2003, the Federal Energy Regulatory Commission, as part of its investigation into Enron, released into the public domain hundreds of thousands of employee e-mails, which have become an important resource for forensic research. (Unlike novels, newspapers or blogs, e-mails are a private form of communication and aren’t usually available as a sizable corpus for analysis.)
Using this data, Benjamin C. M. Fung, who specializes in data mining, and Mourad Debbabi, a cyber-forensics expert, collaborated on a program that can look at an anonymous e-mail message and predict who wrote it out of a pool of known authors, with an accuracy of 80 to 90 percent. (Ms. Chaski claims 95 percent accuracy with her syntactic method.) The team identifies bundles of linguistic features, hundreds in all. They catalog everything from the position of greetings and farewells in e-mails to the preference of a writer for using symbols (say, "$" or "%") or words ("dollars" or "percent"). Combining all of those features, they contend, allows them to determine what they call a person’s "write-print."
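As a rough illustration of what "bundles of linguistic features" might look like in code, here is a minimal Python sketch. It is not the Concordia team's program. The function name `write_print_features`, the greeting and farewell word lists, and the choice of five features are all assumptions made for this example; the real system is described as using hundreds of features.

```python
import re

# Hypothetical word lists, chosen only for illustration.
GREETING = re.compile(r"^(hi|hello|dear|hey)\b", re.IGNORECASE)
FAREWELL = re.compile(r"^(regards|thanks|cheers|best|sincerely)\b", re.IGNORECASE)


def write_print_features(message: str) -> dict:
    """Extract a handful of illustrative style features from one e-mail body."""
    # Non-empty lines, stripped of surrounding whitespace.
    lines = [ln.strip() for ln in message.strip().splitlines() if ln.strip()]
    words = re.findall(r"[A-Za-z']+", message.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        # Position of greetings and farewells.
        "greeting_on_first_line": bool(lines) and bool(GREETING.match(lines[0])),
        "farewell_in_last_two_lines": any(FAREWELL.match(ln) for ln in lines[-2:]),
        # Preference for symbols ("$", "%") versus words ("dollars", "percent").
        "symbol_count": message.count("$") + message.count("%"),
        "spelled_out_count": sum(w in ("dollar", "dollars", "percent") for w in words),
        # A generic numeric feature.
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }


if __name__ == "__main__":
    sample = "Hi Ken,\nWe lost $5 million, about 10% of the book.\nRegards,\nJeff"
    print(write_print_features(sample))
```

A feature dictionary like this, computed per message, is the kind of input an authorship classifier could be trained on.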
It seems reasonable that we have a linguistic fingerprint, although 1) there are far fewer of them than finger fingerprints, and 2) they're easier to fake. It's probably not much of a stretch to take that software that "identifies bundles of linguistic features, hundreds in all" and use the data to automatically modify my writing to look like someone else's.
EDITED TO ADD (8/3): A good criticism of the science behind author recognition, and a paper on how to evade these systems.
Posted on August 3, 2011 at 6:08 AM
• 47 Comments
isn't this exactly what people who study literature have done for many years?
It seems to me reasonable that people can be identified fairly accurately by their writing styles (and, perhaps, by their linguistic habits generally). I don't claim to be particularly introspective, but I certainly have preferred ways of saying or writing things. (For example, I am much more likely to put "certainly" in the previous sentence than "for sure". And, though I haven't measured it, I think parenthesized comments beginning with "For example" are pretty common in my writing, and speech.)
Rex Stout, in one of his Nero Wolfe detective novels (called Plot It Yourself, I think), used the individuality of writing styles as a central plot device.
Your idea of using the measured characteristics of someone's writing to "forge" another's linguistic style is fascinating. I'd love to see someone look into this.
This has certainly been done, including by statistical methods, by literary critics for many years, but there was always a question of whether it was only applicable to highly-skilled writers, who are much more conscious of stylistic questions.
Any published author is highly-skilled by the standards of the median person, and most are much more aware of literary style than the median.
It's worthy of note that the signal is much stronger in fiction, where writers are chosen solely on their ability to write, than in non-fiction where writers are chosen on the basis of knowledge of the subject they are writing about.
Going down to 80-90% from the 99%+ that literary writing can muster rather proves the point that the median writer has a less distinctive voice than the published author.
A fair chunk of the stuff where it would be useful to be able to identify the author by literary style (e.g. email harassment) is in such a different style of writing from the norm that I suspect that these techniques would not work as well; you might be able to prove that harasser A and harasser B were the same person, but the commonalities between a harassing email and a business one would be so slim that it would be hard to show they were the same writer.
It's a fairly useful technology, but as with any behavioral-based identification, behavior can be modified to avoid or fake identification.
A good example is credit card companies. They have an array of techniques to identify fraudulent behavior, but the fraudsters often learn how to stay under the radar. Same with behavioral based security screening (airports, etc.).
I picked up a mystery novel, some years ago, by an author whose name I didn't recognize. About 2 chapters in, I thought "This person writes a LOT like this other author." Turns out the unknown author was a pseudonym for the other author.
Maybe now we can determine whether Shakespeare really wrote Shakespeare.
The Will Strunk approach to muddying the waters: "Omit needless words."
Telegraph operators recognized each other by their "fist" (cadence of keying).
Art experts recognize the work of an artist by brushstrokes.
People recognize each other by their voices.
As technology improves, these processes get automated. At the same time, technology can identify errors in attribution (analysis of the oil, canvas, etc. used in a painting might prove the age and thus exclude the attributed author, for example).
The problem is recognizing the error (sorry, but pun intended). We used to say if it is on green-bar paper (from a computer) it has to be correct. That attitude, in more modern form, still prevails.
There isn't much of a scientific basis for the claims. See
for a discussion and comparison to the more rigorous basis used to judge speech recognition work. Basically they fitted the model to a limited set of examples, and tested its performance on the same set of examples.
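The general point about testing on the same examples used for fitting can be sketched in Python. This is a toy, not a reconstruction of any published experiment: `holdout_split`, `MemorizingClassifier`, and `accuracy` are hypothetical names invented here, and the classifier is deliberately useless so that the gap between the two scores is obvious.

```python
import random


def holdout_split(samples, test_fraction=0.25, seed=0):
    """Shuffle (text, author) pairs and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class MemorizingClassifier:
    """Deliberately bad model: remembers training texts verbatim."""

    def fit(self, pairs):
        self.known = {text: author for text, author in pairs}
        # Fallback guess for unseen texts: the author of the first training pair.
        self.default = pairs[0][1] if pairs else None
        return self

    def predict(self, text):
        return self.known.get(text, self.default)


def accuracy(model, pairs):
    """Fraction of (text, author) pairs the model labels correctly."""
    return sum(model.predict(text) == author for text, author in pairs) / len(pairs)


if __name__ == "__main__":
    samples = [(f"message {i} from {a}", a) for i in range(8) for a in ("alice", "bob")]
    train, test = holdout_split(samples)
    model = MemorizingClassifier().fit(train)
    print("accuracy on training data:", accuracy(model, train))
    print("accuracy on held-out data:", accuracy(model, test))
```

The model scores perfectly on the data it was fitted to, because it simply looks the answers up. Only the held-out score says anything about how it would do on a message it has never seen.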
I've already seen correlation between turns of phrase on Twitter by Anonymous members, Anonymous IRC chats, and the few news interviews.
In my head. No computer involved. Weird turns of phrase and uncommon misspelling always jump out at me.
What's really interesting to me is when you find a person who publicly posts in multiple formats but their "voice" radically changes depending upon their audience. Note the word radical.
After reading about what HBGary was doing I have become painfully aware of the huge plume of data that is kicked up by every post I make.
I, for one, think that this posting was not written by Bruce. The telltale "I wrote about this in..." phrase is missing.
@chris: LOL really did LOL
I hope to see the headline: Schneier on Security Website Hack Foiled by Observant Admin - ...we knew there was a problem when the fake Bruce tried to log in using an epic passpoem, detailing the life and works of SIX mythical CELTIC heroes...
Some time ago, I heard it was feasible to identify a writer by counting the word frequency and looking at the tail of the distribution.
In that tale, the words used are not of interest, but the form of the tail of the distribution was said to be very distinctive...
But I guess writers who do not want to be identified can circumvent that easily
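A minimal Python sketch of the word-frequency idea, under assumptions: the comment does not say how the tail was measured, so the share of words used exactly once (the "hapax" ratio) is used here as one simple, hypothetical summary of the tail. The function names `rank_frequency` and `tail_summary` are invented for this example.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.values(), reverse=True)


def tail_summary(text):
    """Summarize the low-frequency tail of the word distribution."""
    freqs = rank_frequency(text)
    vocabulary = len(freqs)
    hapax = sum(1 for f in freqs if f == 1)  # words that occur exactly once
    return {
        "vocabulary": vocabulary,
        "hapax": hapax,
        "hapax_ratio": hapax / vocabulary if vocabulary else 0.0,
    }


if __name__ == "__main__":
    print(tail_summary("the cat and the dog and the bird"))
```

Note that the words themselves are discarded after counting; only the shape of the distribution is kept, which matches the description above.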
Ted Kaczynski was probably the most famous modern example of someone outed by writing style. IIRC, his brother recognized Ted's style when the Unabomber Manifesto was published.
How long before someone offers a text anonymizer that reduces identifiable writing characteristics?
I don't know what's the point of the blog but as I was reading this entry, I was thinking...
hey, that's easy to fake; I could stick in some fake emails to incriminate someone if I had access by analyzing that person's email style.
And then the last few paras of this entry had the same thought.
If the point is to make me think like a criminal, well, you're getting there, sir.
My writing style has been altered just by regular reading of this blog.
When I look at my writings now I see many phrasings and the very short paragraph stylings of Clive Robinson.
And I often find myself saying 'I wrote about that in ...' as Bruce does. Imitation and flattery and all that. But regardliss of the cause, it's true; my wiriting is changing over time.
Now if you think about it the wiritings of others almost certainly changes over time ...
I would guess that there is an inverse relationship between the ability to fake a linguistic signature and the amount of material that you have written. It's interesting to note that Ted Kaczynski wrote long pieces; if he had simply written short one-sentence ransom notes, perhaps he never would have been identified.
Panzerfaust, one easy-to-use text anonymizer would be to use Google Translate to send text between various languages. For example, if I take English text, translate it to Korean, and then back to English, I get:
Panzerfaust an easy-to-use text pseudonym Google sends text between different languages is to use a translation. I accept the English text, for example, translated into Korean and again in English, I:
"out of a pool of known authors"
So it only works in a closed system. On the internet there are effectively an infinite number of possible authors.
This faces the same identification problems that Bruce has mentioned in the past. Technology tends to anonymize communication. Characteristics we use to identify people, e.g., tone of voice, face, posture, are generally stripped or much easier to fake. A recording is indistinguishable from a live transmission. Of course this is what makes phishing successful too.
This new technology is likely to suffer from the same vulnerabilities. It's easy to spoof an identity if all you need is a name and address to give to a computer system. If additional requirements can be defeated with copy and paste, security is weakened, not strengthened.
Sure it shan't be long ere 'anonymizers' become STD practice but what 'bout those of us who HAVE TO put a personal spin to our phraseology?
If I can't be cute and/or EMOTIONAL I'll have to rely on grammer, form and content to express my self. WTF?
@echowit - Proper grammar and all that is just another style. If Bill Buckley rose from the dead and posted an anon blog entry he would be identified immediately by the few remaining survivors of the zombie apocalypse.
Fravia+'s comments on this from back in 1998 still seem pretty relevant today. The table at the bottom is handy.
He also elsewhere made the observation that while a person with a large vocabulary can deliberately use a smaller set of simpler words, no one can use a word which he doesn't know, unless he's pulling randomly from a thesaurus, which may have unintended results.
@Mark R. - AWESOME table. It's very similar to what goes on in my head when I do troll research.
One fingerprint I look for is repeated incorrect use of a complex word.
Funny, I was busted for writing an unapproved newsletter by the Honor Council at my High School. They caught me by doing an analysis of spelling mistakes in the newsletter, showed a list of badly spelled words to the English teachers, who fingered me in 24 hours.
All those English teachers are dead now.
Natural causes. Honest. :)
If you want to emulate Clive, be sure to ignore the auto-spell-checker in the comment composition box. ;-)
And keep in mind regional spelling differences, such as colour/color, recognise/recognize.
Anybody know a good "spell-mangler" that could substitute various words with those from a known source?
@Increasingly Less Anonymous:
"One fingerprint I look for is repeated incorrect use of a complex word"
"You use that word a lot.. I do not think you know what it means"
Thank you Princess Bride..
Two things stick in my head from these comments:
Some suggestions remind me of George Orwell's "Politics and the English Language" essay on how to improve (and possibly anonymize) writing style.
Secondly, thank you Anon Bruce Poster for the phrase 'epic passpoem'! You made my day.
Just write like Yoda and everyone will assume you're a green dwarf with big ears.
Uh, wait...that should be:
Like Yoda just write and assume you're a green dwarf with big ears everyone will.
How does this method compare to a Machine Learning based Maximum Entropy model using the presence of tokens as the feature set?
I can usually pick Clive, tommy and RSH and Nick P here without looking at the sig!
I've been using a rough form of linguistic forensics on my university and college students for years. (Warning: do not try and cheat on tests or crib essays if your prof is an expert on forensic linguistics.)
My wife, as a secretary, found all kinds of ways to identify people that I'd never even heard of. (one was a characteristic "line length.") (Apparently, I have two *different* styles, depending upon what I'm writing.)
Yes, if you know the characteristic that someone is looking for, you can try to fake it, but it turns out to be harder than it looks. (I've tried.) There are all kinds of characteristics that can be used. The most effective ones are purely numeric metrics, such as a simple letter frequency count. This may seem counter-intuitive, but think of how hard it would be to try and fake it.
@ Rob Slade
I remember a very old DIY programming article in Byte magazine used frequency analysis of letter digraphs (at ph th er ed etc.) to identify Shakespeare's plays, but the writer also found to his surprise that the frequency of single letters gave better results.
The home computers of the time didn't have enough power to go onto trigraphs or higher combinations.
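The letter and digraph counts mentioned in the last two comments are easy to sketch in Python. This is an illustration under assumptions, not the Byte article's program: `ngram_profile` and `profile_distance` are hypothetical names, n-grams are counted inside words only, and the sum of absolute differences is just one simple way to compare two profiles.

```python
import re
from collections import Counter


def ngram_profile(text, n=1):
    """Relative frequency of letter n-grams (n=1 letters, n=2 digraphs)."""
    grams = []
    # Count n-grams within each word so they never span a word boundary.
    for word in re.findall(r"[a-z]+", text.lower()):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def profile_distance(p, q):
    """Sum of absolute frequency differences; 0 means identical profiles."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


if __name__ == "__main__":
    known = ngram_profile("Shall I compare thee to a summer's day?", n=2)
    unknown = ngram_profile("Thou art more lovely and more temperate.", n=2)
    print("digraph distance:", profile_distance(known, unknown))
```

To attribute a text, one could build a profile per candidate author from known samples and pick the author whose profile is closest to the unknown text. Two sentences are far too little data for that to mean anything; the demo only shows the mechanics.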
Vles: We're easy - just look at the length of the post! Which this one violates!
Hmmm... lets see....
I coud use tommys humourous stile and sine his name ....
Posted by: Clive Robinson
Research done in the 1980s by David Bell showed that a micro-encapsulated module could analyze and reproduce any writer's style with high assurance. See this link: ....
Posted by: Nick P.
Suck it up! There *is* no privacy, no anonymity, no security! That is my meme, and I've been right all along!
Posted by: Richard Steven Hack
A good "ear", and the ability to reproduce non-verbal cues, are the fundamental skills of impressionists (comedians), a la Chevy Chase/Gerald Ford, etc.
Posted by: tommy
i gues if i mispeled & use short sentense maybe leet. nobody nose its me. lol
@surprise, surprise, surprise , let me see, Tommy hates me and the names last but not any smile faces, but his ages is probable 25-35 so maybe not, the style is devils activate so maybe RSH and tried to modifies his saying. the writer is about 45-55 so maybe Bruce. Clive would spell sine write, but modifier some other character so probable not him, sarcastically using no one knows its me, leads to Bruce or mod(P/G/C) , then again how many people actual post on this blog...?
I'm going with mod, PC GP or something
A good way to remove style from your messages is to throw it through several conversions in Google translate. Some suggested ones are to go through Spanish and German consecutively. While this will muddy up the message a bit, most people will be able to puzzle out the message.
(Look below to see the conversion going English > German > Spanish > English)
A good way to remove the style of news is to launch it in Google translated through several transformations. Some are suggested to go for the Spanish and German in a row. Although this is the message a little muddy, most people will be able to decrypt the message.
(This is English > Spanish > German > English)
A good way to delete the style of your message is, throw it through several transformations in Google Translate. Some are proposed to go through a series of Spanish and German. While this is the message a little cloud, most people will be able to decrypt the message.
I knew emoticons would be my downfall ^_^'
The names were in alphabetical order. More randomness.
Why does Tommy hate you?
Did you mean "devil's advocate"?
How did you derive the age ranges?
Who or what is "P/G/C"? Who is PC and GP?
All interesting parts of your own method of analysis.
@surprise, surprise, surprise, parnoid or a comment that triggered that,maybe reverse pysco use your name last if first rev-rev.... Yeap devils advocate, i allways thought he worked at the fbi.:).., age, more time used to right the message normal from someone not born jacked in. pc/gp someone with a username like that made a comment that look like it would fit, or last name S. general tone was a bit pissed off.
don't worry not trying to track people down, just first reaction or part of them.
Surely an algorithm that can fingerprint an identity/style can also fake that same identity, thereby creating an element of plausible deniability?
Also news sites requiring registration/paywalls suck.
@Andy, with a ISP email, and a legit looking name, alls good ;)
How do you know how much time was used to write the message? Not everyone subscribes to real-time comment feed. Or maybe they were busy ATM.
May I suggest that you not attempt a career in linguistic forensics? While looking for truffles on the floor of the forest, you missed the gaping neon sign.
@surprise, surprise, surprise , "May I suggest that you not attempt a career in linguistic forensics? While looking for truffles on the floor of the forest, you missed the gaping neon sign. "
This is a crypto blog, but any more hints ?
@Panzerfaust, John E. Bredehoft, Some random reader:
Google translate is NOT enough. I can't find the link to where they busted it, though. It's all about Google Translate keeping a lot of the original structure even though it looks randomized. It's a bit like poor crypto, easy to analyze.
"It will also cover our current progress in establishing a large corpus of writing samples and attack data and the creation of a tool which can aid authors in preserving their privacy when publishing anonymously."
Reminds me of Author Unknown: On the Trail of Anonymous by Don Foster.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.