Reverse-Engineering the Redactions in the Ghislaine Maxwell Deposition

Slate magazine was able to cleverly read the Ghislaine Maxwell deposition and reverse-engineer many of the redacted names.

We’ve long known that redacting is hard in the modern age, but most of the failures to date have been a result of not realizing that covering digital text with a black bar doesn’t always remove the text from the underlying digital file. As far as I know, this reverse-engineering technique is new.

EDITED TO ADD: A similar technique was used in 1991 to recover the Dead Sea Scrolls.

Posted on October 27, 2020 at 6:34 AM • 34 Comments

Comments

Daniel October 27, 2020 7:19 AM

Leaving a full alphabetized index in a redacted document is kinda like locking a door with a padlock and leaving a crowbar next to the door …

Dr.S October 27, 2020 7:39 AM

There was no “crack”: to display the names, the “hacker” used Podofyllin by Eclectic Light without crediting the software author!

Chris October 27, 2020 7:50 AM

While redacting digital documents is hard, that doesn’t really seem to be the story here. This reverse-engineering would have worked just as well against a paper document. It reminds me of how the Bulletin of the Atomic Scientists was able to deduce that the U.S. stored nuclear weapons on Japanese soil based on blacked-out entries in an alphabetized list.

ferritecore October 27, 2020 8:15 AM

This is reminiscent of the reconstruction of Dead Sea scrolls from concordances.

The scrolls had been parceled out to academics who had right of first publication and controlled access to the text. There were complaints about this system, and the long (decades?) delays involved.

Some of these embargoed scrolls had published concordances. An enterprising soul reversed concordances to produce text, also producing prominent professorial pouts.

Chelloveck October 27, 2020 10:06 AM

@Arindrew — Recent articles (within the past week) from well-known news sources don’t mention an unredacted version. Where did your copy come from?

Regardless of whether or not an unredacted version is publicly available, the article is interesting from a security perspective. Anyone who relies on redaction for secrecy needs to be aware of this class of attack. Despite sounding obvious to us, at least one person didn’t think about it when doing the redactions.

Wael October 27, 2020 10:27 AM

The clues are all over the document: a combination of mastermind, cross words, and cryptograms. The new thing is that someone thought about un-redacting it. Clever, nonetheless.

covering digital text with a black bar doesn’t always remove the text from the underlying digital file

It’s just plain stupid to “redact” text by placing a black bar atop the text: you gotta place a red bar over it, dawg!

Chelloveck October 27, 2020 11:06 AM

@Dr.S: He also didn’t credit whatever software he used to write his article, or the manufacturer of his computer. The cad! At what point does one’s tools become worthy of mention? AFAICT from its web page, Podofyllin is just a plain ol’ PDF reader. Does it have any exceptional features that were key to this analysis or is it just a piece of commodity software used incidentally?

Clive Robinson October 27, 2020 11:31 AM

@ All,

It’s a rainy day in London and due to the “human malware” going around they can not go off and play with their friends.

However, if it were not for the nature of the content of the document, getting them to de-redact the document would probably keep them quiet for an hour or so, not much more.

After all, if you follow these three suggestions in the order I’ve listed, you could probably make “child’s play” of it as well.

Starting with the attack an automaton could do,

@ Jon,

Presumably, the length of the black bar provides a little cross-reference as well.

It’s almost certainly the first thing I’d try, starting with the shortest or longest unique length and working my way through, as it would be “an easy fit” that reduces the number of other black bars I’d have to try later.

Moving on to the slightly more challenging,

@ Daniel,

Leaving a full alphabetized index in a redacted document is kinda like locking a door with a padlock and leaving a crowbar next to the door

More like wedging the door closed with the crowbar and the padlock hanging open in the hasp… When you think about both knowing the initial character and the black bar length Jon mentioned.

Knowing the likely initial letter and a list of potential candidate names could allow you to identify a redaction by the immediate context.

The most obvious being an unredacted pronoun giving an indicator of the sex of the redacted name.

Which brings us onto,

@ Wael,

a combination of mastermind, cross words, and cryptograms.

Would be some of the ways to “context search” as the next line of attack.

It would also enable you to disprove some tentative matches where you have two or more black bars of the same or similar length.

This is the point where it would start to get hard, and you might expect to end up with some only resolved to a couple of tentative names.

At which point, we move from lexical logic into more emotive or subjective analysis and thus more circumstantial identification. This would be based on what is known of the person dictating the statement, and also that of those who are potential candidates for not fully resolved redactions.

As noted by,

@ Chris,

This reverse-engineering would have worked just as well against a paper document.

And is known to have done so in the past.
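
The first two steps above (bar length, then initial letter plus context such as a nearby pronoun) can be sketched roughly as a candidate filter. This is only an illustration under stated assumptions: the name list, the pronoun tags, and the idea that bar width maps cleanly to a character count (which holds best for a monospaced transcript) are all hypothetical and not taken from the Slate article.

```python
# Hypothetical candidate list: name -> pronoun expected in nearby text.
NAMES = {
    "Alice": "she",
    "Albert": "he",
    "Alexander": "he",
    "Anna": "she",
    "Bob": "he",
}


def candidates_for(initial, width, names, pronoun=None, tolerance=0):
    """Return the names that could fit one redaction.

    initial   -- first letter, inferred from the entry's place in the index
    width     -- estimated character count of the black bar
    pronoun   -- an unredacted pronoun near the bar, if one exists
    tolerance -- slack in the width estimate, in characters
    """
    matches = []
    for name, name_pronoun in names.items():
        if not name.upper().startswith(initial.upper()):
            continue  # wrong position in the alphabetized index
        if abs(len(name) - width) > tolerance:
            continue  # bar is too long or too short for this name
        if pronoun is not None and name_pronoun != pronoun:
            continue  # surrounding text contradicts this candidate
        matches.append(name)
    return sorted(matches)


print(candidates_for("A", 6, NAMES))
print(candidates_for("A", 5, NAMES, pronoun="she", tolerance=1))
```

Bars whose filter returns exactly one name are the “easy fit” cases; the rest are where the more subjective analysis described above would begin.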

Peter A. October 27, 2020 1:15 PM

The most common error probably is ‘redacting’ the visual representation, as rendered by a specific kind of software that interprets the digital document in question, and not the actual bitstream, which contains coded representation of words, images, etc.

Even getting your hands dirty, by digging into the heap of bits a digital document is, isn’t easy with modern highly structured formats that:
– may use different encoding for different parts, so you need to take care of re-encoding and canonicalizing into a common code set/representation
– may intersperse encoded text with visual representation codes, encoding shape, position etc., so you need to reconstruct the actual pure text in order to be able to decide if a part is to be redacted
– may include graphical representation of text (e.g. labels, legend etc. in pictures, graphs, diagrams), so you need to OCR every image with a really good algorithm not to miss something a human brain could recognize
– may have uninterpreted or unrendered sections containing sensitive material, either on purpose (e.g. edit history, metadata) or as a result of a bug, so you need to find all such sections and remove them
– maybe still do something strange I have missed out

After all of that you could search for ‘forbidden words’ and replace them with something, preferably of a different length, and then reformat all the document to get rid of gaps whose size hints at what was redacted. On top of that you need someone competent to proofread it all not to reveal some sensitive information by context, use of synonyms, cross-referencing etc.

That is why I’d recommend pure ASCII (or, if you really need it, UTF-8) for really sensitive documents. Images, if you need them, are harder: you need procedural guarantees, such as using specified font/size and forbidding resizing, esp. downsizing.
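
For the plain-text case, the search, replace, and reflow step might look something like the sketch below. The placeholder string, the function name, and the whole-word matching rule are assumptions made for illustration; it does nothing about the context, synonym, and cross-reference leaks mentioned above, which still need a human proofreader.

```python
import re
import textwrap

PLACEHOLDER = "[REDACTED]"  # same length every time, so it leaks no word length


def redact(text, forbidden, width=72):
    """Replace each forbidden word with a fixed placeholder, then re-wrap.

    Re-wrapping matters: if the original line breaks were kept, the size of
    the gap left behind would still hint at the length of the removed word.
    """
    if not forbidden:
        return textwrap.fill(text, width=width)
    # Longest words first so that a short word never pre-empts a longer one.
    ordered = sorted(forbidden, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in ordered) + r")\b",
        re.IGNORECASE,
    )
    return textwrap.fill(pattern.sub(PLACEHOLDER, text), width=width)


print(redact("Alice met Bob in Paris.", ["Alice", "Bob"]))
```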

SpaceLifeForm October 27, 2020 2:53 PM

@ Peter A.

That is why I’d recommend pure ASCII (or, if you really need it, UTF-8) for really sensitive documents.

I feel a great echo in the Force.

JonKnowsNothing October 27, 2020 5:25 PM

@Clive @All

Marcy Wheeler has been going through a number of documents in a thorny court case where redactions and additions and re-redactions and un-redactions occurred in a whole pile of documents.

Essentially these are government evidence documents supplied to the court that have been altered and re-dated and re-timed to fit a particular narrative.

Some of the added-value-text or added-value-date can be determined by 200% zoom-in on the suspicious areas.

Recently @Clive and Others had a good exchange on time-date and what it is and isn’t. Very helpful when looking at the so-called government supplied chronology.

iirc(badly) During the Snowden Storm and CIA Torture as Enhanced Interrogation Program period, (still on going afaik), there was a hoohaa over a redaction in a document that was published. It had to do with 100% Take on all communications in several countries. 100% of everything, not just some things but ALL communications. The blackout block was @11 chars long. There is only one place where the USA had an unofficial war going on with that many letters (still going on afaik).

Artists use negative space often. It’s quite handy.

ht tps://www.emptywheel.net
note: very detailed reviews and analysis.

ht tps://www.emptywheel.net/2020/10/26/part-of-what-i-shared-with-the-fbi/
note: chat logs of some very hinky content.
(url fractured to prevent autorun)

JonKnowsNothing October 27, 2020 5:39 PM

@Peter A.

re: why I’d recommend pure ASCII

In old times, since the development of the printing press @1440, we used to cut out letters from the headlines of major newspapers.

Each newspaper had a font and type style so the letters could be verified to which paper and date, depending on how many you glued to the page.

Lemon juice works pretty well too…

ht tps://en.wikipedia.org/wiki/Invisible_ink

ht tps://en.wikipedia.org/wiki/Benedict_Arnold#Secret_communications

(url fractured to prevent autorun)

Paul Suhler October 27, 2020 9:42 PM

Here are Alex Wellerstein’s thoughts on redaction:
https://blog.nuclearsecrecy.com/2013/04/12/the-problem-of-redaction/

When researching my own book on stealth and the Lockheed Blackbird, a director at the Skunk Works once told me, “Those old guys don’t know what’s secret and what’s not these days.” He was referring to people I’d interviewed, who might or might not have revealed something that was still classified, but it also could apply to the CIA retirees doing redaction as contract workers.

Cheers,

Paul

SocraticGadfly October 27, 2020 9:53 PM

@Arindrew, this is the age of the Internet. People, if they don’t have their own websites, have Blogger or WordPress blogs. Post a link or it’s a lie.

JonKnowsNothing October 28, 2020 1:20 AM

@SocraticGadfly

re: Has Marcy ever admitted who she narced on and why?

Partially, yes.

Some while back while reporting on the FBI findings that related to “something” she told or shared with them there were some “hints and ahems”.

However the post on Oct 26,2020 went into a lot more details. It was a very hinky exchange with someone who was doing some serious phishing and innuendo and possibly someone able to do some nasty-stuff. She shared some images of chat logs and exchanges with someone who used the email: guccifer…[I am not going to use the full tag].

After reading some of the exchange, it made me want to pull the plug on my computer permanently…

fwiw: When the story first showed up that she was helping out the FBI i was under-impressed and jumped to the same conclusions as a lot of others. Based on the very dry details of the early reports it seemed out of character and out of bounds. There was even a claw fight with GG over Si or Non…(well, loads of people have that response to GG too). After reading even a sanitized version of the exchanges I reversed my views.

It’s another aspect of the lack of security and privacy that stalkers, nasty folks and those with less-than-social sensibilities use the internet for personal gratification at others expense. Quite a few that have been arrested have gotten the Light Touch treatment for doing similar.

It is part of the job for journalists to be approached by all sorts of people: those that take baths and those that shower and those that don’t do either.

ht tps://www.emptywheel.net/2020/10/26/part-of-what-i-shared-with-the-fbi/
(url fractured to prevent autorun)

Clive Robinson October 28, 2020 1:43 AM

@ JonKnowsNothing,

Marcy Wheeler has been going through…

I started reading through the chat logs she released the other day. It was just way too creepy to take onboard in one go.

I’ll let others draw their own conclusions when they read it, but its implications are really not at all pleasant, and as one commenter observed, her recent move to Ireland may not be far enough…

JonKnowsNothing October 28, 2020 1:50 AM

@ferritecore @Bruce @all

re: dead sea scroll reverse engineering

iirc(badly) I recall reading an article with a photo, about how a rabbi and a young person used one of the early model Macintosh computers to work on a section. The young person had a program with the Hebrew alphabet that could be displayed on the screen. The rabbi read the provided text piece and compared that with what he saw on the screen. When the rabbi found the matching letter, it was tagged into some sort of spreadsheet or rudimentary database.

The project then combined all the sheets/data into the full text.

It was a great use of queuing theory with parallel distributed computing.

iirc, recent enhancements to assembling disintegrated documents match scanned pieces based on their fragment edges.

barfa October 28, 2020 5:46 AM

I was thinking about Hanlon’s razor, and whether this really could be explained by just stupidity. The document is ~410 pages of transcribed video interview. I assume that the index was first compiled for the uncensored text, and the censoring applied to the whole document. So, either some poor bastard sat and manually went through 410 pages of no-recollections or they used some kind of automated search-replace. In the latter case, shouldn’t that have acted on the index too? In the former case, the clerk would need to just have forgotten about the index, which also seems a bit unlikely…?

Of course, the third option is that there are some formal rules that make censoring not apply to the index. It does seem strange though. They could have had the names in the index, but removed page numbers, and instead had only a total number of occurrences in the full document. And why not, in a digital document where reflowing of text must be very possible, replace all censored words with black bars of identical size?

I also notice that when Slate asks “Want to help us crack some more redactions?”, they do not make any difference as to whose names they are asking for, alleged perpetrators or alleged victims. One can just hope that they will not publicize the names of alleged victims without their consent.

ferritecore October 28, 2020 7:42 AM

@barfa

In this case I think they did a global search and obliterate on a list of keywords. One of the redactions was a street name in an address, Andrews St or Road or such. The rest of the address was not redacted, it looks like a law firm address. Two things about this:

  • It solidly confirms this particular unredaction
  • A person may not have done this particular redaction

One of my takeaways is that entire index entries, not just the keywords, should have been redacted.

As somebody I know is fond of saying, “You can find sympathy in the dictionary between sex and syphilis”

John October 28, 2020 9:12 AM

@barfa @ferritecore
It’s pretty obvious that they didn’t use a simple search and replace of forbidden words from the “Andrews” example given.
They redacted the name Andrews since that was important, while they left alone the word Andrews in the address. And since the automatically generated index was merely looking at the sequence of letters, without regard to context, both usages were in the same index entry. And looking at the article, that happens multiple times in the document. When a word was used in a sensitive context, that word was redacted along with the index entry. But when the same word wasn’t in a sensitive context, it was left alone… and the reference in the index was unaltered. So, for the most part, all that was needed to be done was to look at each redacted index entry and examine the referenced page and line entries. If you find an unredacted line, it’s rather trivial to note which word(s) have the required starting letters and length and then make that substitution in the rest of the document.
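
The procedure described here can be sketched in a few lines. Everything below is a hypothetical reconstruction for illustration: the transcript lines are invented, the `(page, line)` keys and the function name are assumptions, and the alphabetical bounds stand in for the unredacted index entries just before and after the blacked-out one. Lines that contain any bar are simply skipped, which is a simplification.

```python
import re

REDACTION_MARK = "█"  # assumed stand-in for a black bar in extracted text


def recover_index_entry(refs, transcript, prev_entry, next_entry, length=None):
    """Guess a redacted index entry from the lines it points to.

    refs       -- (page, line) pairs listed under the redacted entry
    transcript -- dict mapping (page, line) to that line's text
    prev_entry, next_entry -- unredacted neighbours in the alphabetized index
    length     -- estimated letter count of the bar, if known
    """
    lo, hi = prev_entry.lower(), next_entry.lower()
    candidates = None
    for ref in refs:
        text = transcript.get(ref, "")
        if REDACTION_MARK in text:
            continue  # the word is probably hidden on this line
        words = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
        fitting = {
            w for w in words
            if lo < w < hi and (length is None or len(w) == length)
        }
        # The entry must appear on every fully unredacted line it references.
        candidates = fitting if candidates is None else candidates & fitting
    return sorted(candidates or [])


# Invented example lines, loosely modelled on the "Andrews" case above.
transcript = {
    (10, 3): "Q. Did you ever meet ███████ there?",
    (55, 12): "sent to 12 Andrews Street, Suite 4",
    (90, 7): "A. I do not recall ███████.",
}
refs = [(10, 3), (55, 12), (90, 7)]
print(recover_index_entry(refs, transcript, "and", "any", length=7))
```

A single surviving candidate can then be substituted at every other page and line listed under the same index entry.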

ferritecore October 28, 2020 9:57 AM

@john

It looks like I misinterpreted a highlight in the article. I also seem to be having difficulty accessing the Slate article today.

SocraticGadfly October 28, 2020 11:14 AM

@JonKnowsNothing

I saw that link in your initial comment and checked it out.

As someone who rejects “twosiderism” on this issue, ie, I know the Russkies hacked the DNC, not a Seth Rich theft (I’m NOT a conspiracy theorist), but ALSO know they hacked the RNC, not just the DNC, and created both pro-Trump and pro-Clinton Facebook groups, etc. and that there is therefore?

NO “collusion,” contra what Marcy and #TheResistance(TM) would wish?

I find Marcy’s partial big reveal more tease than substance, and have thusly updated the July 2018 blog post I wrote about her initial concerns.

ht tps://socraticgadfly.blogspot.com/2018/07/mueller-time-new-angle-how-serious-to.html

(Sadly, many of the collusion deniers, if not outright Seth Rich conspiracy theorists, play footsie with that. That’s why Idries Shah is so right:

“To ‘see both sides’ of a problem is the surest way to prevent its complete solution. Because there are always more than two sides.” – Idries Shah, REFLECTIONS.

The past four years of American history is littered with pundits and analysts who need some serious meditation on that.)

Thunderbird October 28, 2020 1:13 PM

I’d like to second the mention of Alex Wellerstein’s blog in re redactions. It is a great discussion of just how hard a problem it is. The part I thought was most interesting is that since it has historically been a manual process done by a bunch of people applying possibly-changing guidance independently, you can get differently-redacted versions of the same document that combine into a less-redacted (or unredacted) version.
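
To illustrate the combining effect, here is a minimal sketch that merges two releases of the same text. It assumes the versions line up token for token and that every redaction shows up as the same marker, which is far tidier than real declassified scans; the marker, the function name, and the sample sentences are all made up.

```python
MARK = "[X]"  # assumed marker for a redacted token


def merge_versions(*versions):
    """Merge differently redacted copies of the same text, token by token."""
    token_lists = [version.split() for version in versions]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("versions do not align token for token")
    merged = []
    for column in zip(*token_lists):
        visible = {token for token in column if token != MARK}
        if len(visible) > 1:
            raise ValueError(f"versions disagree: {sorted(visible)}")
        # Keep the token if any release left it visible.
        merged.append(visible.pop() if visible else MARK)
    return " ".join(merged)


release_a = "Meeting with [X] held in Paris on [X]"
release_b = "Meeting with Smith held in [X] on [X]"
print(merge_versions(release_a, release_b))
```

Only tokens hidden in every release stay hidden, which is why independent redactors working from changing guidance can undo each other’s work.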

Clive Robinson October 28, 2020 2:59 PM

@ Thunderbird, ALL,

It is a great discussion of just how hard a problem it is.

The big problem is “information does not exist in isolation” and people from different knowledge domains have different views.

For instance I know of a bunch of numbers that are still technically classified, even though they are freely available in mathematics reference books along with other information that tells you “their properties”. Now if you only care about mathematics then you know that is all in the public domain.

But the fly in the ointment is “maths gets applied” to other knowledge domains, where even though you are aware the information is public you do not want attention to be drawn to it which causes all sorts of problems, some of them quite unexpected[1].

Anyone who has a broad base of knowledge covering many domains can, if they have an inquiring mind (which is usually the case with a broad knowledge base), connect dots that others are not even aware exist. Our host @Bruce calls it “thinking hinky” but it involves a creative ability that is often novel. Others, whilst they may have considerable depth, lack knowledge in all but one or two domains[2] and so don’t get to see the dots that are in other domains; thus they do not realise how others can put things together apparently more easily than an expert can…

The thing is as we know the “classified world” is also a “siloed world” which makes the experts and nearly all others residing there not just “tunnel visioned” but unable to talk to others.

The result is we get redactions that are quite imperfect and can be put back together.

[1] When wearing the green I was sent on a course that was, for me, stupefyingly dull and full of inaccuracies. The reason being that the course designers were trying to hide information from those taking the course. One of the course tutors got annoyed with the fact that I spent much of the course “staring out the window” and obviously “not participating” and he made the mistake of demanding why. He was not happy when I told him that I was unable to answer him, and he escalated it up the chain of command. As I pointed out to the OC there were reasons and he gave me a skeptical look. So I pulled out some documentation with my signature on it and pulled out an engineering drawing from the equipment manual and asked if he noticed anything? At which point the penny dropped. He did ask why I was on the course, and as I pointed out it was a “trade requirement” and it was not just the course instructor that did not know what sort of engineer I’d been at one point or another… At which point the OC started to laugh in a wry sort of way.

[2] I’ve worked at quite a number of places including Universities, where I provided technical knowledge to people doing the likes of PhD’s. I was chatting to one about her research project she had been given which was to test various polymers for their stability in certain types of radiation. Being a materials scientist she was wondering why the project sponsor wanted some tests that appeared to be just plain whacky. I asked her if she knew who the project sponsors were actually working for and she said no. I told her she might want to search on the Internet about them and I would bring a book in the next day that in broad terms would answer her question. We met for lunch the following day and when I pulled out the book its title made part of the penny drop, and I pointed out the section she should read towards the end of Richard Rhodes’ book. The following day she asked me how I knew, to which the honest reply was “I’m an engineer that reads a lot”.

Garabaldi October 29, 2020 2:02 PM

Redaction is a trade off. The intention is to release some information while hiding other information. Both sides matter.

If you follow most of the advice here the redacted document has the probative value of a Kentucky Attorney General’s statement about grand jury testimony.

This is hard even for the mythical honest, disinterested redactor. It is even harder for the slightly less mythical honest, interested redactor.

It is very easy for the dishonest redactor, just create a new document that says what you want it to say. That will be indistinguishable from following many of the suggestions in this thread.

If you strip all the hidden clues you have also stripped everything that could be used to question the document. That removes the possibility of falsification, and any reason to believe the release.

SpaceLifeForm October 31, 2020 4:49 PM

@ Clive

“I’m an engineer that reads a lot”.

Reminds me of a class I was required to take at [redacted location] long ago.

The teacher told me:

“you should be teaching this class”

Yep, I was actually giving the teacher knowledge that the teacher had not learned.

But the teacher was smart enough to not get upset and defensive.

The teacher was smart enough to realize that there are always knowledge gaps.

The teacher actually wanted to learn from me.

There is ALWAYS something to learn.

If one decides to stop wanting to learn, they have given up.

White Rabbit.

Go Ask Alice.

Clive Robinson November 1, 2020 1:46 AM

@ SpaceLifeForm,

“you should be teaching this class”

My father did that.

He was a Chartered Accountant at a very well respected large organization. However due to the way he got started in life, in the evenings, he went and taught Accountancy in evening classes “to give back”. The result was he was offered a senior lecturer position which when he developed heart problems he gladly accepted.

As I’ve mentioned before, I don’t think he wanted to be an accountant, his interests were more practical and he very much encouraged me to become interested in all sorts of practical engineering. I had many opportunities because of that, including being offered a job in boat design when I was quite young, however for me it was pulling signals out of the air from distant places. I got involved with radio astronomy but back then it was a niche academics-only field of endeavour where multiple PhD’s were required. Through Pirate Radio I got involved with high power RF engineering and perhaps oddly Space communications with satellites. But again they were niche fields and I realised you had to be “first in field” to not waste years of your life in academia. So in the early 1980’s with computers just starting to appear in usable form and prices, I got in quick. They are an unhealthy obsession at the best of times and I was drawn to using them for something rather more than playing games, so through robotics I got into embedded systems then safety critical, medical and then industrial control. I did some time doing “mil spec” work involving communications and later put in some time wearing the green. However I’ve always had “itchy feet” and was always off chasing the next challenge rather than working my way up some greasy pole where politics were of more importance than getting real work done. So I again jumped ship, into the security side of things where electronics was making inroads into physical security and environmental control. I then jumped back into comms and Fast Moving Consumer Electronics, then back into what we now call ICT about the time the “green team” got together and about half a decade before their little side project Oak got renamed Java. Which for those old enough to have been tearing their hair out was when MFC was getting the “guild secret” treatment and all the politics that involved (making me even more determined not to be a consumer code jockey throwing my life in ever changing and turbulent waters no sane man would venture and wiser men simply wrote “here be dragons”).

But like my father I’ve “given back” where I can and yes I’ve been offered teaching roles, and worked in education at Uni level, but in this day and age, teaching is, due to moronic decisions and decisive divisions by politicos, not practical any more. We really do not teach “engineering” these days in the UK. Which is why a lifelong friend and myself were working on making changes to bring in those who need in their early lives to be “practical” and later move into theory and higher education. Sadly he recently had an untimely death and things sadly have had to be put on hold.

MarkH November 1, 2020 3:25 AM

@Chad Elliott:

What you propose — editing an image, rather than the document — makes sense from a security perspective.

However, even that can fail, depending on how it’s done.

Most digitized images are stored in compressed formats, which often use something like a spectrum analysis for compression purposes. Such image files carry the mathematical equivalent of “faint echoes” of an image element far from the element’s actual location.

I recall that a number of years ago, somebody demonstrated that when an image of a document was edited to blank out a portion of the text, it was possible to use those “echoes” to help in reconstructing the obscured text!

So to be safe, it would be best to start with an uncompressed image of the document (raw bitmap), do the redaction on that uncompressed file, and only then save it in a compressed format such as JPEG.

The trick nowadays, would be finding an imaging device that will export uncompressed image files.
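
The order of operations suggested above (redact the raw pixels first, compress afterwards) could be sketched as follows. The bitmap here is just a list of rows of grayscale values and the function names are made up for the example; the final JPEG encoding is left out because it would need a third-party imaging library.

```python
def black_out(pixels, left, top, right, bottom):
    """Return a copy of a raw grayscale bitmap with a rectangle set to 0.

    pixels is a list of rows, each a list of 0-255 values. The rectangle
    covers columns left..right-1 and rows top..bottom-1.
    """
    redacted = [row[:] for row in pixels]  # never modify the original scan
    for y in range(top, bottom):
        for x in range(left, right):
            redacted[y][x] = 0
    return redacted


def region_is_black(pixels, left, top, right, bottom):
    """Check that every pixel in the rectangle is exactly 0 before encoding."""
    return all(
        pixels[y][x] == 0
        for y in range(top, bottom)
        for x in range(left, right)
    )


# A tiny stand-in for a scanned page: white with one dark "text" pixel.
page = [[255] * 6 for _ in range(4)]
page[1][2] = 17

safe = black_out(page, 1, 1, 4, 3)
print(region_is_black(safe, 1, 1, 4, 3))
# Only at this point would `safe` be handed to a lossy encoder.
```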

lurker November 1, 2020 12:26 PM

@Clive

I was drawn to using [computers] for something rather more than playing games…

Sorry, me too. At a job interview in the late ’80s I was asked what I knew about computers. I said I hadn’t bothered with them because the small ones becoming popular were being used mostly to play games or as electric typewriters. That seemed to be a good answer, and 5 years later I hung up my brass scissors.

Brass is non-magnetic, useful for cutting audio tape for splicing. Imagine George Martin with a modern DAW back in the ’60s…

