Using Language Patterns to Identify Anonymous E-Mail

Interesting research. It only works when there’s a limited number of potential authors:

To test the accuracy of their technique, Fung and his colleagues examined the Enron Email Dataset, a collection which contains over 200,000 real-life emails from 158 employees of the Enron Corporation. Using a sample of 10 emails written by each of 10 subjects (100 emails in all), they were able to identify authorship with an accuracy of 80% to 90%.

Tags: biometrics, de-anonymization, email, identification

Posted on March 14, 2011 at 5:04 AM • 63 Comments

Comments

anonymous • March 14, 2011 6:46 AM

Not only is the degree of accuracy very low for “evidence” (80-90%); anyone suspecting this sort of analysis could easily either change their characteristics for any “anonymous” email or even duplicate another person’s. This kind of “evidence” is scary.

Mooman • March 14, 2011 7:04 AM

Interesting indeed. A couple of points. An accuracy rate of 80 to 90% means that there is still at least a 10% chance that the ‘evidence’ could be wrongly used to convict someone. That seems very high indeed.

Secondly, regarding the use of an IP address to trace the approximate source (i.e. building/apartment), that would suggest that there are an awful lot of clueless scammers. A so-called ‘professional’ criminal would be less inclined to use their own home connection. At the very least I would have thought that they would use some type of anonymizing proxy service or use somebody else’s wireless connection that had poor or no security controls.

Maybe, just maybe a lot of the criminals just ain’t that smart.

And another thing – if people are committing this crime from their home connection …there must be quite a few ISPs with relaxed policies about allowing their customers to use the connection to facilitate an SMTP relay.

Mooman

OQ (anonymously) • March 14, 2011 7:17 AM

Write to reminder: “change features of texts when sending anonymous e-mails; switch case ofsome chars, add/remove commas and spaces, made several misstakes”

Miguel Farah • March 14, 2011 7:27 AM

Several years ago, during a flame war in a local Usenet group, I posted a scathing article condemning both sides for their pointless fight. To avoid controversy, I did so anonymously… yet I was quickly discovered. Why? Simple: I was the only one that wrote properly, with no spelling mistakes, no grammar mistakes, using upper and lower case, et cetera.

I learned from that lesson to not only post anonymously, but to also change my writing style into something that I normally wouldn’t be caught dead using.

I can see this technique being better used for something other than e-mails: internal documentation. Around here, it’s common to not have it signed by the author, as “the company” is it.

Clive • March 14, 2011 7:39 AM

It seems other people have already pointed out, but it bears repeating: they tested accuracy on e-mails in which the authors weren’t trying to disguise their identity.

Nor, indeed, deliberately imitating the writing style of another possible author.

Nor, indeed, using that software to verify the unidentifiability of their anonymous writings.

yt • March 14, 2011 7:41 AM

I’ve been saying for quite some time that I can always spot Clive’s comments not only by their length, but also by his characteristic language patterns. Based on some of the recurring typos (e.g. “where” for “were”), I’m even fairly confident I could make some guesses about what kind of accent he speaks with.

Of course, now that I’ve said this, Clive will post a totally uncharacteristic comment. 😉

Gianluca Ghettini • March 14, 2011 8:08 AM

To me, calling it a “proof” or “evidence” is too much. Maybe it can be useful to leverage or support already made judgments. Nothing more. However, it’s very interesting to see how many side channels there are in human text data…

Steven • March 14, 2011 8:51 AM

Author Unknown: On the Trail of Anonymous
Don Foster
http://www.amazon.com/Author-Trail-Don-Foster/dp/0805063579

A whole book on this subject.
He got his start by writing his dissertation on whether an unsourced poem was written by Shakespeare.

He went on to unmask Joe Klein as the author of Primary Colors.

cipherpunk • March 14, 2011 9:31 AM

Analysis would of course depend on the number of samples an investigator has. Given the mantra most organizations have that provide “free” email (i.e. google -7.5GB+) – “you never have to delete anything again”; I guess this would not be a problem. Greg Conti provided an interesting perspective on this in his book, “Googling Security: How Much Does Google Know About You?”.

Same technique could also be applicable to so called ‘anonymous’ forums.

Thunderbird • March 14, 2011 10:34 AM

Re: “very informative blog…. Posted by: best stock picks”

I was puzzled by this very bland comment. I guess it’s just there for the purposes of the link back to “best stock picks”. Probably helps their Google ranking, based on the fact that I find pointers like that that go to people that claim to be “search engine optimization experts.” I guess in any ecosystem you get parasites.

trol • March 14, 2011 12:18 PM

Since evil people could try to imitate the style of their victim, this could be described as:
1) if no one tries to hide themselves it works 80/90%
2) if people try to impersonate others they will likely succeed.

I doubt it would meet standards for admissibility in court, or confidence for other use.

This is not a droid you are looking for….

Shane • March 14, 2011 12:20 PM

The only thing new about this is the automation of the process and an awful 10-20% false positive rate.

Forensic document examiners have been doing this work for decades with great results. The only thing I’m seeing here are a few extra buzzwords to update it for the millennials’ easy consumption, and automation for our new generation of lazy law enforcement.

You can’t replace human expertise (especially regarding human habits) with some lo-jacked algorithm and a list of email headers… Just like wholesale surveillance can never replace good old fashioned detective work.

Clive Robinson • March 14, 2011 12:44 PM

The simple fact is authenticity is “deniable”

David • March 14, 2011 1:08 PM

As “evidence” in a court proceeding, the 10-20% error rate would make it very suspect.

OTOH, the process’s use as a purely investigatory tool is another question entirely. Narrowing down a field of suspects, or checking into the “forged/not forged” question could save time and effort…

…but then, I can see investigators of all stripes beginning to use this to direct their entire investigation. Polygraphs, vague witness descriptions, and other less-than-absolute tools come to mind.

“He MUST be guilty–the doc analysis says it’s 90% sure he wrote the ransom note!”

Shane • March 14, 2011 1:10 PM

@Clive – agreed 100%. Just like witness testimony, even the most convincing of cases for authorship can never stand alone. All this data mining for crime prevention is purely hype aimed (likely) at beefing up VC, or so it would seem.

It’s the 49’er gold rush all over again, except the government built the mines this time (thanks to their wholesale electronic surveillance) and also pay the premiums for gold struck. The ‘miners’ in this case have no idea what they’re supposed to be looking for, so every shiny rock they find can be passed off initially as ‘striking gold’.

How a 10%-20% error rate using a SMALL and KNOWN set of users even makes the news is beyond me.

Shane • March 14, 2011 1:23 PM

@David – Re: ‘OTOH’

It could also lead investigators down the wrong path with a bias towards sticking with it (due to some ‘likelihood’ that they’re on the right track).

Far FAR too many cases suffer from this short-sightedness. Even cases where the evidence was (at one time) considered infallible (bite marks for example) have since been overturned. This lack of interest in investigating any parties beyond the one that matched the ‘infallible’ evidence is responsible for countless wrongful prosecutions (even in capital cases).

Again, there is absolutely no replacement for real old fashioned detective work. Not even a full confession.

10-20% is a joke. I wouldn’t be surprised if cold readers had better success rates for the same known user sets.

Shane • March 14, 2011 1:26 PM

@David – Ooops, seems you already argued my point with yourself 🙂

echowit • March 14, 2011 1:43 PM

My subsequently software mission; a ersatz-random expression generating thesaurus line up to effect my stations and electronic messages.

two spaces • March 14, 2011 1:48 PM

@Miguel Farah:
I judge a typist by the same indicators as well. One thing I look for, since I do it myself, is if they use two spaces after a period, which is how I was taught in typing class about 30 years ago.

@cipherpunk:
“i.e. google -7.5GB+) – “you never have to delete anything again”

That “suggestion” from Google just scares the hell out of me. It brainwashes people into leaving all that “evidence” in their trashcan where it can be retrieved when necessary.

Anonymous Meat Popsicle • March 14, 2011 2:15 PM

This is indeed sketchy “…evidence…”, in the olden days of Morse Code we called this the operators “…fist…”, it was routine to study the fist of operators, and use it to masquerade as the operator. Using “…evidence…” with this lack of credibility in a Court of Law is a very sad indictment against the level of intelligence of the Judge that might allow it, the jury that might use it and the prosecutor, only interested in a TV interview that might extol the virtues of it. All around, drooler material.

JimFive • March 14, 2011 3:14 PM

@two spaces

And you will notice that the web page ignores your two spaces after the period and only displays one of them anyway. It is a “feature” of HTML rendering that runs of whitespace collapse into a single space.

JimFive

First Timer • March 14, 2011 3:41 PM

@yt and @Clive Robinson

Nearly sprayed tea on my monitor. 🙂 Thanks for the laugh!

Clueless Noob • March 14, 2011 3:57 PM

A couple of trips through Bablefish using different languages might remove identifying characteristics.

Trying this out on the sentence above (English to Spanish to French to German back to English):

Uces entsprech of journeys with Bablefish, by using differently the languages could l’ take away; Characteristic identification.

Alan Green • March 14, 2011 4:02 PM

Having worked with the Enron corpus, I am suspicious of any conclusions drawn from it. It consists of email to and from people involved in a court case, and even then, some obviously irrelevant, personal messages have been removed from it.

Unfortunately, the Enron emails are the only dataset freely available to researchers. It turns out that, despite having billions of emails flying around the ‘net, no reputable organization or representative group of people are willing to have their email read by random academics.

Dirk Praet • March 14, 2011 4:06 PM

I guess what they have done is simply apply method, science and technology to intuitive recognition techniques most of us are capable of. After a while on this forum – or any other for that matter – it becomes quite easy to finger a particular individual to a certain post having read sufficient previous posts by the hand of said person. The combination of length, content, pet peeves, grammar, recurring typos etc. can indeed give you away, even when hiding behind an unsecured wifi access point, darknet and/or anonymising software.

The new element for me is that this type of evidence may be admissible in court, but with a hit rate of 80-90% I’m having doubts as to what extent this could prove conclusive evidence if there is nothing more. It wouldn’t seem that hard to me to change one’s writing style for “sensitive” communications or when posting under a different persona. Or even analysing someone else’s style and then spoofing his/her identity to publish incriminating material. Hello HBGary ? Anybody out there ?

Clive Robinson (the original) • March 14, 2011 4:55 PM

@ Clive, yt, Clive Robinson

Now I don’t know who you all are but I think yt is a young lady around about 27years old (possibly Belgian).

As Bruce and others may remember, Bruce thought he could sufficiently recognise my style that I could not get away with being a “sock puppet”.

Others have claimed to be able to recognise my style (or lack thereof, BF Skinner thinks I might be a Klingon, others know I like Douglas Adams writings).

But in the past there have been others calling themselves Clive or Clive Robinson on this blog and that may or may not be their real names (who knows woof woof 😉

So the question all of you (except Bruce or the Moderator) are going to have to ask is which is the Clive Robinson who posts longish entries on this blog most often and you would recognise as the “Clive Robinson” you think you know?

I’ll give you a clue: the last time this happened I suggested Bruce check my IP address which had been different to normal because I was in hospital at the time and using their facilities in the patients lounge…

Now I know there’s no way I can prove it’s me (funny I say that a lot in one way or another) so I shall leave you all to make your own minds up…

I could of course drop Bruce an Email from the Email account I’ve used when inviting him to a cup of tea when he dropped in on the creator of the Panopticon but that still is not proof…

Ho hum 8)

Nick P • March 14, 2011 5:58 PM

Yeah, that’s why smart folks have always had one or more other people write the messages for them. I’m surprised this vulnerability is news.

Publius • March 14, 2011 6:49 PM

It’s like the mandatory manager review that my company made me fill out.

Afterward the boss came around to a couple of people in our department and thanked us for filling out the thing because some people did not, he said.

It was IBM.

tommy • March 14, 2011 7:00 PM

Surprised no one has suggested this: u$3 1eet-$p33k 4 3vrth1ng + vary 0k4zun1y.

(In addition to what others said, I can tell a Clive post from the first few sentences. The “Hmmm…” at the beginning is often a dead give-away. Lots of mispelings for such an obviously inteligent person [sic], though seems to be more careful today … also, en-UK limits the potential population. Plus the ho-hum and a few others … none of which alters the facts that, as Bruce says in general, many of his posts are interesting. Keep ’em coming!)

Dirk Praet • March 14, 2011 7:14 PM

@ Clive

“Others have claimed to be able to recognise my style (or lack there of, BF Skinner thinks I might be a Klingon …).”

I’m afraid I’m going to have to disagree this once with BF Skinner. Your gigantic factual knowledge, infallible logic, eye for detail and diplomacy to me are conclusive evidence that you are in fact Vulcan, probably monitoring our futile attempts to develop warp drive capability at Area 51. Your recent stay in hospital either has to do with side-effects of going through Pon Farr or your ears reverting back to their original shape. The one thing I can’t figure out is how you are able to conceal your green blood from the nurses 😎

Johnston • March 14, 2011 7:20 PM

I’ve done authorship-based linguistic analysis before. Based on my work/experience, I doubt the article accurately reflects the sophistication of the tests done by the researchers. What I did was far more advanced than what the article describes, and my work was far less serious.

It’s been quite a while and the memory is now fuzzy, but here are some characteristics of a “linguistic fingerprint” to keep in mind:

sentence length.
avg ratio of words:clauses within a sentence.
frequency of sentences not beginning with noun phrase components (adjs, nouns).

There was a LOT more… wish I could remember!
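A rough sketch of how the first two features in the list above might be computed, using only Python's standard library. The function names are made up for illustration, sentence and clause boundaries are approximated with punctuation (an assumption, not what the researchers did), and the third feature is left out because it would need a part-of-speech tagger.

```python
import re
from statistics import mean


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_features(text):
    """Return a few crude style features for one block of text.

    Assumption: clauses are approximated by splitting each sentence on
    commas, semicolons and colons, which is much cruder than real parsing.
    """
    sentences = split_sentences(text)
    if not sentences:
        return {"sentences": 0, "avg_sentence_length": 0.0, "avg_words_per_clause": 0.0}

    # Words per sentence (letters and ASCII apostrophes only).
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    # Approximate clause count per sentence; always at least 1.
    clause_counts = [
        len([c for c in re.split(r"[,;:]", s) if c.strip()]) for s in sentences
    ]
    return {
        "sentences": len(sentences),
        "avg_sentence_length": mean(word_counts),
        "avg_words_per_clause": mean(w / c for w, c in zip(word_counts, clause_counts)),
    }


features = sentence_features("I came, I saw, I conquered. It was fine.")
print(features)
```

Comparing such feature dictionaries across known authors would be the next step; how to weight them is left open here.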

Richard Steven Hack • March 14, 2011 7:40 PM

Re Google’s “you never have to delete anything again”.

Well, I delete everything I don’t want to keep. I’m religious about emptying my spam box even though they say it goes away in X days.

So I’m using a grand total of 14MB out of my 7.5GB. (I don’t get a lot of email.)

My other Gmail account, the one I first set up, is my spam trap. And that has even less since I clean it out once a week or so. I keep only some stuff that ended up going there anyway.

The other thing I do is periodically run Thunderbird and let it “archive” all my Gmail stuff in its database on my hard drive. Don’t want to be like those 150,000 whose Gmail accounts were wiped (even though most were recovered.)

I’m not sure anyone mentioned this, but what kidnappers do is cut and paste words into sentences in their ransom notes to avoid leaving handwriting evidence. I assume if you cut and paste text from Internet sources into your threatening emails that it would have the same effect on this sort of analysis.

I believe there are programs out there that are experiments in “text generation”: given a topic, they generate text on that topic. Could probably be easily modified to generate emails which could then be edited with cut and paste, to totally defeat this analysis method.

Hackers who send emails to companies extorting them for money probably should pay attention to this from now on. Also outfits like Anonymous, to defeat Aaron Barr’s next effort to find out who they are. Chat rooms are another area where such “anti-forensics” would be valuable.

Metallurgical coal • March 14, 2011 7:50 PM

It sounds very interesting, but a 10% error rate could mean innocent people get hurt!

Bruce Clement • March 14, 2011 8:26 PM

I’d be interested in seeing it trained against larger sample sets. Would this get the accuracy rate up to something useful, or is the 80% to 90% a limitation of their technology?

David • March 14, 2011 9:18 PM

@Bruce Clement

It’s also possible the accuracy could go down…or become useless.

I’m studying these types of machine learning problems, and there’s a couple of assumptions in the way results are presented (not saying they are here, but these are typical):

–Classification problems (in this case, who wrote what) usually come out as a numerical ranking. The article doesn’t say, but assignment to an originator may mean ” Author A is originator with a metric of 0.999″ but he beat out Author B who was at 0.998. Adding more people may just crowd the field.

–I haven’t seen the Enron data, but it’s possible that the 10 people selected had very different styles. If a larger data set has multiple occurrences of similar styles, the accuracy could go down.

I would suggest trying this against, say, the PubMed article database. Pull out a large number of articles for training, then pull out another subset and “hide” the authors from the system. See if it works as well.
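To make the ranking point above concrete, here is a minimal sketch that reports the gap between the top two candidates alongside the winner. The scores are the illustrative figures from the comment plus one invented value, and `rank_with_margin` is a hypothetical helper, not part of any published method.

```python
def rank_with_margin(scores):
    """Sort candidate authors by score and report the gap between the top two."""
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best, best_score - runner_up_score, ranked


# Illustrative scores only; "Author C" is invented for the example.
scores = {"Author A": 0.999, "Author B": 0.998, "Author C": 0.412}
best, margin, ranked = rank_with_margin(scores)
print(best, round(margin, 3))
```

A tiny margin like this one suggests the "winner" barely beat the runner-up, which is exactly the situation where adding more candidates could flip the result.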

Davi Ottenheimer • March 14, 2011 11:05 PM

@Shane

“The only thing I’m seeing here are a few extra buzzwords to update it for the millennials’ easy consumption, and automation for our new generation of lazy law enforcement.”

Agree 100%. Too many buzzwords. The timing of the paper and the lack of attribution also are suspect.

It seems to me the competitive world of graduate school forced this group of students to find a lucrative nail for their data mining hammer.

Will every paper now talk about finding Anonymous? Is the ERP market so dead that data mining is now panning for security nuggets? Maybe they just heard one of us at a security conference mention data analysis of email…

I cover several differences in their paper from our methods here:

http://www.flyingpenguin.com/?p=10511

We can identify fraud using only a small set of data from anonymous sources trying to hide their identity, for example.

Our work, let alone others, also predates their paper and research by about seven years.

Davi Ottenheimer • March 14, 2011 11:20 PM

@David

“It’s also possible the accuracy could go down…or become useless.”

Agreed. The paper appears to be written from a data mining view and lacks reference to the security body of knowledge.

It does not take anyone long in security to realize, for instance, that a massive amount of malware code has not increased the accuracy of pattern detection engines.

That is why we referenced prior linguistic and security techniques when we presented our work in 2003 on a simple approach to identify anonymous e-mail using language patterns. It seems to me that their paper in 2008 and this one in 2010 have followed our updates.

yt • March 15, 2011 4:54 AM

@Clive Robinson “Now I don’t know who you all are but I think yt is a young lady around about 27years old (possibly Belgian).”

I’m incredibly curious to know what gives you that impression.

Jonadab the Unsightly One • March 15, 2011 6:28 AM

I’m pretty sure only needing to differentiate between ten potential senders was a major factor in producing such high accuracy numbers. In most real-world anonymous-email scenarios the number of potential senders is rather higher.

mw • March 15, 2011 6:43 AM

Nice study: “A novel approach of mining write-prints for authorship” cited here: http://www.concordia.ca/now/what-we-do/research/20110307/identifying-anonymous-email-authors.php.

A few issues:

Write-prints can potentially be spoofed, manually or algorithmically, unlike immutable fingerprints or DNA sequences.

Emails, such as those in an Enron context, often contain blocks of text that are copied/pasted. Those emails would then legitimately include frequent linguistic patterns from a set of multiple authors.

The proposed author attribution method requires analyzing emails from a set of potential suspects. Let’s say an anonymous “malicious” email is being investigated for authorship. If the actual author has not been included in the database of potential suspects and their associated emails, will the author attribution method incorrectly point to an innocent person based on a best write-print match?

Trivial issue: On page 44, the current pdf says “people are not couscous about the spelling and grammatical mistakes particularly in informal e-mails “. Probably this should read instead “conscious about the…”.

Finally, in my opinion there are probably more commercial applications for using this kind of software to enable authors to instantly change the stylistic appearance of their text, rather than for using it in forensic investigations.
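The third issue above (the true author missing from the suspect database) can be illustrated with a small sketch. A closed-set "best match" rule always names someone, while an open-set rule with a cut-off can decline. The similarity values, the names and the `min_similarity` threshold are all hypothetical; nothing here is claimed to be how the paper's method behaves.

```python
def closed_set_attribution(similarity):
    """Always names the best-matching suspect, even if every match is poor."""
    return max(similarity, key=similarity.get)


def open_set_attribution(similarity, min_similarity):
    """Names a suspect only if the best match clears a threshold, else None."""
    best = max(similarity, key=similarity.get)
    return best if similarity[best] >= min_similarity else None


# Invented similarity scores for an email whose real author is not listed.
similarity = {"alice": 0.31, "bob": 0.28, "carol": 0.12}

print(closed_set_attribution(similarity))      # names the least-bad match
print(open_set_attribution(similarity, 0.60))  # declines to name anyone
```

Choosing the threshold is the hard part in practice: too low and innocent suspects are named, too high and the tool never answers.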

Roger • March 15, 2011 7:09 AM

@Alan Green:
“no reputable organization or representative group of people are willing to have their email read by random academics.”

At one time, people tended to keep all their letters indefinitely, and post mortem examination of these stashes sometimes gave historians personal insights into the thoughts of people involved in significant events.

I wonder if the internet age will lose all this to us — or make it more accessible than ever?

jonkx • March 15, 2011 7:50 AM

I can’t speak for the scholarship of the study but acknowledge the statement by Johnston as credible. On the other hand, I don’t know how or why the 10 subjects were selected. Also, it seems that 10 people working for the same company would have similar speech and writing patterns. A more comprehensive study of Enron could be more revealing. I guess that could show an even lower percentage of identity recognition.

Miguel Farah • March 15, 2011 8:22 AM

@yt “I’m incredibly curious to know what gives you that impression.”

That’s because Clive has no clue. It’s perfectly evident you’re 26…

Chris • March 15, 2011 8:51 AM

So which of the Federalist Founding Fathers was “Publius” at any given time? There are lots of historical “datasets” which can be used to test the accuracy of the technique.

And there are probably lots of ways to defeat this. First of all, why not take your screed, manager’s evaluation, manifesto or whatever and translate it into say, French, with an on-line translation engine? Then, translate it back into English with another one and edit it for clarity. That should give this technique fits.

yt • March 15, 2011 8:58 AM

@Miguel Farah: I don’t mean to imply anything about the correctness or incorrectness of Clive’s conclusions, I’m just curious to know how they were arrived at.

Clive Robinson (the original) • March 15, 2011 9:59 AM

@ yt,

“I’m just curious to know how they were arrived at.”

Just as I’m curious about your guess of,

“I’m even fairly confident I could make some guesses about what kind of accent he speaks with.”

Which you have yet to reveal…

And there is the old story about the golfer and the tuft of grass his ball lands on…

That is you have to name the tuft before you tee off, not after it lands.

So time to put your cards on the table 8)

JohnJ • March 15, 2011 10:07 AM

@Roger – I wonder about that as well. For the past few hundred years history has been recorded not only in the official (sanitized) way but in the publications of the people. Letters, books, music in physical form have survived even if in degraded condition to reveal detail about our forefathers.

With the digital communications era, how much of that data will be available 250 years from now? What kind of data will remain for archeologists and historians to use to describe us?

IMHO I don’t think much will survive. Much of the data is stored on servers of commercial entities that, some years from now (assuming the entity is still in business), will have little use for all of that data. They will no longer be able to derive profit from mining and exploiting the data. At some point the cost of maintaining electronic archives will outweigh the revenue and they’ll be purged.

The proof of our existence, individual beliefs, interpersonal relations, and achievements will be wiped out by an automated script that purges data over x days old.

BF Skinner • March 15, 2011 10:38 AM

Clive tlhIngan maH!

Hah! knew it.

Davi Ottenheimer • March 15, 2011 3:07 PM

@mw

“in my opinion there are probably more commercial applications for using this kind of software to enable authors to instantly change the stylistic appearance of their text”

ha, they should make it an outlook plugin and call it “how to write more like your boss”. the plugin (paperclip?) can then gently suggest changes as you type:

“your style is too direct. add more words with four or more syllables”

another commercial use. scanning all the grant proposals and learning how to write like the ones that get the largest awards:

“you are not being vague enough and you have only used the word terrorism 13% of the time. increase to 25%”

Clive Robinson • March 15, 2011 3:53 PM

Feh!
We are ALL Clive Robinson.

Roger • March 15, 2011 4:22 PM

The proposed algorithm, AuthorMiner, has some very interesting features which weren’t covered in the article.

It seems to have been designed to protect civil liberties. I don’t know if this was intentional, but the design is such that it is unusable for trawling databases of texts to identify all writings by some target of surveillance. That is a feature of the algorithm, and is true regardless of parameter choices or accuracy rate.

It doesn’t actually try to identify authors. What it actually does is eliminate suspects from a suspect list, and give some measure of how strongly it does so. This identifies the probable author if and only if we already know the true author is on the suspect list. Accuracy depends on having a quite short suspect list to begin with.

The accuracy increases as the suspect list gets shorter, and as the training data increase. Training data are required for all suspects, i.e. it is incapable of eliminating a suspect for whom little training data is available. It is thought (but was not actually tested) that accuracy would also increase as the length of each training message is increased; thus it is likely to be much less effective for twittering.

While they didn’t assess the ability for someone to disguise his writing, some of the features being detected are quite subtle and are likely to survive cruder disguise attempts. They manually examined some of the extracted feature-sets and commented that they included features that were indiscernible to normal reading. However, it can also use cruder features if the suspects have very different styles. Probably here the algorithm chose subtle features because these were all business emails, on a similar topic, from the same company. More diverse styles would probably also increase the accuracy.

The 80 ~ 90% accuracy rate is per email. It isn’t clear if this is independent for different emails. If it is, and you have several messages from the same suspect, you could rapidly squeeze that 10 ~ 20% error rate down to something reasonable.
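As a back-of-the-envelope check on the paragraph above: if each attribution is correct with probability p and the errors really are independent (which, as noted, is not established), the chance that a majority of n emails point at the right author is a binomial tail. The helper name below is made up for illustration, and the majority rule is a simplification; with ten candidates the wrong votes would be spread across several people, so this is if anything pessimistic.

```python
from math import comb


def majority_correct(p, n):
    """Probability that more than half of n independent attributions are
    correct, when each one is correct with probability p."""
    need = n // 2 + 1  # smallest number of correct votes that is a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5):
    print(n, round(majority_correct(0.8, n), 5))
```

With p = 0.8 and five emails, the majority is right with probability of about 0.942, so the error drops from 20% to under 6% under these assumptions.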

@jonkx:
“On the other hand, I don’t know how or why the 10 subjects were selected.”

They were selected at random from 158 available. Actually, several tests were done with different sized groups, and the members selected randomly each time.

” Also, it seems that 10 people working for the same company would have similar speech and writing patterns. A more comprehensive study of Enron could be more revealing. I guess that could show an even lower percentage of identity recognition.”

No, it would show higher accuracy, not lower. The more diverse the alternative suspects are, the better it is at eliminating them.

Nick P • March 15, 2011 6:54 PM

@ Dirk Praet

Definitely more like a Vulcan. If he was a warmonger, he’d be rich from his exploits. And he would have never entered the hospital. Vulcans can be too peaceful for their own good.

Clive Robinson (the original) • March 16, 2011 3:31 AM

@ Roger,

All the flippancy around my name aside.

With regard to your comments,

“The proposed algorithm, AuthorMiner, has some very interesting features which weren’t covered in the article.”

If what you say with,

“What it actually does is eliminate suspects from a suspect list, and give some measure of how strongly it does so.”

It is then effectively a filter and all deterministic and stable filters have their inverses in their domain of operation.

For instance, in the frequency domain a notch filter is the opposite of a narrow band-pass filter; likewise high-pass and low-pass filters.

Even so-called “one way” filters have their inverse, which can be found by using the output of the filter against its input in some manner (that is, if you correctly subtract the output of a notch filter from its input you get the narrow band-pass filter).

Whether finding the inverse of a filter is practical or not is sometimes a matter of understanding the filter’s hidden characteristics and appropriate ordinality.

Thus “AuthorMiner” itself could be used to disguise the writing of an individual (albeit in an inefficient manner) simply by “twiddling” with the input and seeking a correlation minimum at the output.

Also of note when seeking attribution or not would be the output ranking and spacing when using a list of test subjects against two or more unattributed messages. But of more importance is using the set of unattributed messages against each other to see what the actual degree of correlation is between the messages to test the confidence of results. Likewise the set of messages that are representative of the candidates on the list of test subjects.
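The signal-processing half of the analogy above can be shown with a toy example: subtracting a filter's output from its input yields the complementary filter. The sketch uses a centred moving average as the low-pass filter, so the complement is a high-pass one. Function names are illustrative, and whether an authorship classifier behaves like such a linear filter is the commenter's analogy, not something this code demonstrates.

```python
def low_pass(x, width=3):
    """Centred moving average; near the edges it averages the neighbours that exist."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def complement(x, filtered):
    """Subtract a filter's output from its input to get the complementary filter."""
    return [a - b for a, b in zip(x, filtered)]


steady = [5.0] * 8              # pure "low frequency" (constant) signal
wiggle = [1.0, -1.0] * 4        # pure "high frequency" (alternating) signal

print(complement(steady, low_pass(steady)))  # constant part is removed
print(complement(wiggle, low_pass(wiggle)))  # alternation largely survives
```

The subtraction only works this cleanly because the moving average is centred and so adds no delay, which is presumably what "correctly subtract" refers to.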

Anton • March 16, 2011 5:07 AM

Not much use for evidence, but great for espionage

Mr. Ballsack T Baghard • March 16, 2011 6:01 AM

It doesn’t matter whatever this research says. It changes zero. Diddly squat.

If you want to send anonymous emails, use TOR, then use multiple chained proxies within TOR, such as Anonymouse, Mixminion, etc.

We are anonymous, we are ballsacks.

Roger (the original) • March 16, 2011 8:49 AM

@Clive:
“All the flippancy around my name aside.”
Indeed. You have my sympathy.

“It is then effectively a filter and all deterministic and stable filters have their inverses in their domain of operation.”

True — but in this case, the domain is “the writings of the people on the list of people whom investigators will subsequently suspect.” If you can accurately predict that list (or influence it), and also obtain sufficiently detailed samples of writing, you could probably mask yourself from AuthorMiner.

However that could easily be quite tricky. For example, suppose your inverse filter depends upon suspect “A” being in the domain. If “A” turns out to have a watertight alibi, then he is excluded from the domain, your inverse filter is detuned, and you could be in hot water. Maybe.

yt • March 17, 2011 7:16 AM

Clive Robinson (the original): “‘I’m even fairly confident I could make some guesses about what kind of accent he speaks with.’

Which you have yet to reveal… So time to put your cards on the table 8)”

Unfortunately, it’s hard to explain without reading one of your posts out loud (I’m not sure how to describe my impressions about your accent in terms of geography/regional variations in pronunciation). I suppose I could make a recording as a project for this weekend.

Seiran • March 17, 2011 11:37 AM

Easy steps to fool linguistic analysis:

1. Machine translate language -> foreign language
2. Machine translate foreign language -> original language
3. Make some corrections
4. ???
5. Ransom!!!

One may be able to skip step 1 if they are proficient in another language already, though certain characteristics may reveal what the original language was.

I just tried this with Google Translator, but it seems it cannot translate “plushie” – as in, “PAY $1. OR YOU’LL NEVER SEE YOUR PLUSHIE AGAIN” – correctly.

Randall • March 18, 2011 2:31 PM

Stylometry is the term for using not-super-obvious clues in the text to figure out who wrote something: vocabulary size, frequencies of words and phrases, and so on.

There’s a lot you can do. Some methods almost certainly need much more than your typical couple-of-paragraphs e-mail as input; you can’t tell how often someone uses a word from a couple sentences.
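To make those clues concrete, here is a minimal Python sketch of a feature extractor. The feature set and the short `FUNCTION_WORDS` list are illustrative assumptions, not the features used by any particular tool.

```python
import re
from collections import Counter

# Illustrative subset of common function words (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")

def features(text):
    """Return a few simple stylometric features for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    feats = {
        "tokens": total,                  # how much text there is
        "vocabulary": len(counts),        # number of distinct words
        "type_token_ratio": len(counts) / total if total else 0.0,
    }
    for w in FUNCTION_WORDS:
        # Relative frequency of each function word
        feats["freq_" + w] = counts[w] / total if total else 0.0
    return feats

short = features("The cat sat on the mat.")
```

With only six tokens in the sample, a single added or dropped word moves every frequency noticeably, which is the problem with short e-mails as input.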

Ichinin • March 18, 2011 9:03 PM

Basically, this is how they tracked down Ted Kaczynski, the Unabomber.

The forensic linguistics wasn’t used as evidence; it was used to track him down and to gather more evidence. By itself, this stuff has little to no evidentiary value.

Chris • March 18, 2011 10:12 PM

@Ichinin: I would say that you’re almost tangentially right. Ted Kaczynski’s Manifesto was recognized by his brother, David, who tipped off the FBI. It wasn’t so much the writing style — though that was a factor — as much as the ideology. Computers weren’t involved at all and if they were, the software technique we’re discussing would be worse than useless because all it could tell you is that two pieces of writing were written by the same unknown author which was obvious to everyone.

Clive Robinson (not the original) • March 21, 2011 5:58 AM

Let’s hope the current generation of Clive clones… doesn’t survive after the present thread dies :=)

TRX • March 21, 2011 6:25 AM

It turns out that, despite having billions of emails flying around the ‘net, no reputable organization or representative group of people are willing to have their email read by random academics.

There’s the White House email dataset, which became prominent during the Oliver North trial, and there are thousands of mailing lists, some more than 25 years old, if you want to check change in style over time.

Re Google’s “you never have to delete anything again”.

Well, I delete everything I don’t want to keep. I’m religious about emptying my spam box even though they say it goes away in X days.

Just because you “delete” a message on GMail doesn’t mean it goes away. It just means you don’t see it any more.

As “evidence” in a court proceeding, the 10-20% error rate would make it very suspect.

For a trial, true. But it’s more than enough probable cause to direct attention to a suspect or justify a search warrant.

Several years ago, during a flame war in a local Usenet group, I posted a scathing article condemning both sides for their pointless fight. To avoid controversy, I did so anonymously… yet I was quickly discovered.

Back decades ago, it was common for users on local BBSs to use different usernames on each BBS, and often to have multiple usernames on the same BBS. Few people could maintain a false flag for more than a handful of messages before being outed…

With a huge writer pool like “the internet” your chances of remaining anonymous are high. But the smaller the pool, the more likely you are to be discovered.
