Identifying People by their Writing Style
The article is in the context of the big Facebook lawsuit, but the part about identifying people by their writing style is interesting:
Recently, a team of computer scientists at Concordia University in Montreal took advantage of an unusual set of data to test another method of determining e-mail authorship. In 2003, the Federal Energy Regulatory Commission, as part of its investigation into Enron, released into the public domain hundreds of thousands of employee e-mails, which have become an important resource for forensic research. (Unlike novels, newspapers or blogs, e-mails are a private form of communication and aren’t usually available as a sizable corpus for analysis.)
Using this data, Benjamin C. M. Fung, who specializes in data mining, and Mourad Debbabi, a cyber-forensics expert, collaborated on a program that can look at an anonymous e-mail message and predict who wrote it out of a pool of known authors, with an accuracy of 80 to 90 percent. (Ms. Chaski claims 95 percent accuracy with her syntactic method.) The team identifies bundles of linguistic features, hundreds in all. They catalog everything from the position of greetings and farewells in e-mails to the preference of a writer for using symbols (say, “$” or “%”) or words (“dollars” or “percent”). Combining all of those features, they contend, allows them to determine what they call a person’s “write-print.”
It seems reasonable that we have a linguistic fingerprint, although 1) there are far fewer of them than finger fingerprints, and 2) they’re easier to fake. It’s probably not much of a stretch to take that software that “identifies bundles of linguistic features, hundreds in all” and use its data to automatically modify my writing to look like someone else’s.
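To make the idea of combining features concrete, here is a toy sketch in Python. It is not the Concordia team's program, whose details the article doesn't give. The five features, the word lists, and the nearest-average matching rule are all my own assumptions, chosen only to mirror the kinds of things the article mentions (greetings, farewells, "$" versus "dollars").

```python
import math
import re
from collections import defaultdict

# Hypothetical word lists; a real system would use far richer features.
GREETINGS = {"hi", "hello", "dear", "hey"}
FAREWELLS = {"regards", "thanks", "cheers", "best"}


def first_word(line):
    """Return the first lowercase word on a line, or '' if there is none."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[0] if words else ""


def features(text):
    """Map one message to a small vector of style features (illustrative only)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)
    symbols = text.count("$") + text.count("%")
    spelled = words.count("dollars") + words.count("percent")
    total = symbols + spelled
    return [
        # Does the message open with a greeting?
        1.0 if lines and first_word(lines[0]) in GREETINGS else 0.0,
        # Does one of the last two lines start with a farewell?
        1.0 if any(first_word(ln) in FAREWELLS for ln in lines[-2:]) else 0.0,
        # Preference for "$"/"%" over "dollars"/"percent" (0.5 when neither appears).
        symbols / total if total else 0.5,
        # Mean word length, scaled down so it does not dominate the distance.
        sum(len(w) for w in words) / n_words / 10.0,
        # Commas per word.
        text.count(",") / n_words,
    ]


def build_profiles(labeled_messages):
    """Average the feature vectors of each known author's messages."""
    grouped = defaultdict(list)
    for author, text in labeled_messages:
        grouped[author].append(features(text))
    return {
        author: [sum(col) / len(col) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict_author(profiles, text):
    """Pick the known author whose average profile is closest to this message."""
    vec = features(text)
    return min(profiles, key=lambda author: math.dist(profiles[author], vec))


# Made-up sample messages, not drawn from the Enron corpus.
known = [
    ("alice", "Hi Bob,\nThe budget is $500, up 10% from last year.\nThanks,\nAlice"),
    ("alice", "Hello team,\nWe saved $20, which is 5% of costs.\nThanks,\nAlice"),
    ("bob", "The budget is five hundred dollars which is ten percent more\nBob"),
    ("bob", "We saved twenty dollars or five percent of costs\nBob"),
]
profiles = build_profiles(known)
print(predict_author(profiles, "Hi all,\nPlease approve $75, about 3% over plan.\nThanks,\nA"))
```

With only five features and four training messages this proves nothing about real-world accuracy, but it shows why faking is plausible: every feature here is something a writer, or a program, could change on purpose.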
EDITED TO ADD (8/3): A good criticism of the science behind author recognition, and a paper on how to evade these systems.