Plagiarism in Crossword Puzzles
Yet another fraud discovered through data analysis.
EDITED TO ADD (3/11): More.
Tatütata • March 9, 2016 3:51 PM
How many different crossword puzzles are possible anyway? The English vocabulary consists of a finite number of words, and is but a small subset of the possibilities offered by mere N-grams of the alphabet.
I tried my hand at writing a program for finding anagrams of a given starting sequence through a combination of brute force, random permutations, frequency tables, and a dictionary. Too ugly/hacky to publish, but I found a few good ones that let me shine. [Is that cheating?]
It’s not too difficult to find a set of words that fit the starting sequence. But finding a grammatically correct sentence is already more challenging, and a meaningful one even more so. The rules of the language limit the possibilities.
In a crossword puzzle the additional rules are provided by the intersections. I view it as a kind of error detecting/correcting code.
Shannon looked into these questions 70 years ago; I was already planning to look up his paper on crossword puzzles.
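For readers curious about the dictionary step mentioned above, here is a minimal sketch of one way to match a letter sequence against a word list. It is not the commenter's program (which was never published); the function name `find_anagrams` and the toy word list `WORDS` are assumptions made purely for illustration, and it only handles single-word anagrams.

```python
from collections import Counter

def find_anagrams(letters, dictionary):
    """Return the dictionary words that use exactly the given letters."""
    # Compare letter counts, ignoring case and spaces.
    target = Counter(letters.replace(" ", "").lower())
    return sorted(w for w in dictionary if Counter(w.lower()) == target)

# Hypothetical toy word list; a real run would load a full dictionary file.
WORDS = ["listen", "silent", "enlist", "tinsel", "stone", "notes", "onset"]

if __name__ == "__main__":
    print(find_anagrams("inlets", WORDS))
```

Counting letters avoids generating every permutation, which is why even a large dictionary can be scanned quickly; multi-word or sentence-level anagrams would need a search on top of this.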
David Leppik • March 9, 2016 5:28 PM
@Tatütata, computer generated crossword puzzles are very different from ones written by people. For human puzzles, not only do the clues need to be clever, but the puzzles have themes which tie many clues together. The themes may be jokes, long phrases, or even a series of complete sentences.
@Ewan, if you read the article you’ll find that the features in question are themes that can’t be generated algorithmically. What’s more, the copying is directional. A theme would show up in the NY Times and then be copied on a later date; themes never showed up in the NY Times second. And when they were copied, the order of the clues was preserved.
Data Dog • March 9, 2016 6:13 PM
Wait, they used data analysis to find fraud? But data is a toxic asset!
Meow • March 9, 2016 6:48 PM
@Data Dog:
Alright, troll, I’ll bite: personal information can be toxic when a leak would cause you a lot of harm in reputation or legal problems, so the implication is don’t just sit on mountains of it that you don’t need…
analyzing puzzles isn’t personal information.
@Tatütata, David Leppik:
Human-generated puzzles can easily be a mixture of hand-crafted seeds or “themes”, with computer assistance for the “filler”… And this is always what I assumed such themed crossword puzzles were! Not just rote copies!
Bruce Schneier • March 9, 2016 7:25 PM
“Wait, they used data analysis to find fraud? But data is a toxic asset!”
Ha!
Toxic does not equal useless.
blake • March 10, 2016 4:46 AM
It was pretty toxic to the fraudster.
How many medicines are really low dosage poisons that disproportionately affect our ailments more than they affect us?
There’s probably a concept of dosage with data too: have enough to control your fraud, etc, but don’t ingest so much that you kill the host.
Tatütata • March 10, 2016 7:25 AM
Some of the “plagiarised” puzzles in the link do indeed look quite suspicious, with identical skeletons of thematic phrases, but with different “filler”, albeit of the same geometry.
In these puzzles I see sub-areas connected to other sections by just a few letters, so it might be possible to work out by brute force all the possible solutions, and their connectors, for a given set of geometries up to a moderate size MxN.
I think I found something like what I was looking for in the IEEE Information Theory Society Newsletter:
See in particular p. 6/19, middle of the left column.
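To make the brute-force idea above concrete, here is a rough sketch that enumerates every fill of a small, fully open N x N block in which each row and each column is a dictionary word. The function name `enumerate_fills` and the two-word vocabulary are assumptions for illustration only; real puzzle geometries include black squares and connectors, which this sketch ignores.

```python
from itertools import product

def enumerate_fills(words, n):
    """List all n x n letter grids whose rows and columns are all in `words`."""
    vocab = {w for w in words if len(w) == n}
    fills = []
    # Try every ordered choice of n row words, then check the columns.
    for rows in product(sorted(vocab), repeat=n):
        cols = ["".join(row[i] for row in rows) for i in range(n)]
        if all(col in vocab for col in cols):
            fills.append(rows)
    return fills

if __name__ == "__main__":
    # Hypothetical toy vocabulary; a real run would use a full word list.
    for grid in enumerate_fills(["on", "no"], 2):
        print("\n".join(grid), end="\n\n")
```

The search space grows as (number of words)^N, so this exhaustive form is only practical for very small blocks; a real solver would prune column prefixes as rows are placed.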
Kamendae • March 10, 2016 6:56 PM
Here’s a more-informational article on how this is all but certainly plagiarism in action: http://www.slate.com/articles/life/gaming/2016/03/how_to_spot_a_plagiarized_crossword.html
crossword • August 13, 2016 12:52 PM
thanks !
nice article
Ewan • March 9, 2016 2:09 PM
Fraud or just misunderstanding? I wouldn’t be surprised if these crosswords are created by a generator algorithm each day and the author is whoever ran the generator. The newspapers in question are all part of the same publishing group. It’s highly likely they just happened to fit the same words together and they are running off of the same dictionary of course.