Metadata in MS Office

Hidden metadata is in the news again. The New York Times reported that an unsigned Microsoft Word document being circulated by the Democratic National Committee was actually written by, wait for it, the Democratic National Committee.

Okay, so that's not much of a revelation, but it does serve to remind us that there can be all sorts of unintended information hidden in Microsoft Office documents. The particular bit of unintended information that precipitated this news story is the metadata.

Metadata is information on who created the file, what it was originally called, etc. To see your metadata, open a file, go to the "File" menu, and choose "Properties."

I'll bet at least some of you will be really surprised by what's in there. Not because it's secret, but because it has nothing to do with you or your document. That's because metadata follows the file, and not its contents.

Here's what I do when I want to create an MS Word document. I find some other document that has basically the same style I want; maybe it's a file I've written, and maybe it's a file I received from someone else. I open it up, delete all the contents, and save it under a new filename. MS Word doesn't change the metadata, so whatever was in the "Title," "Subject," "Author," "Company," and other fields of the original document remains in my new document. This means that occasionally those metadata fields are filled with information I've never seen before and from who knows where. I'm sure I'm not the only one who uses this trick to avoid dealing with MS Word stylesheets. So metadata is much less a smoking gun than many make it out to be.

I don't mean this to minimize the problem of hidden data in Microsoft Office documents. It's not just the metadata, but comments, deleted parts of the document, even parts of other documents (it's happened).
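
For readers who would rather check these fields with a script than through the "Properties" dialog, here is a minimal sketch in Python. It rests on an assumption that goes beyond this post: the file is in the newer XML-based .docx format (a zip package with a docProps/core.xml part), not the binary .doc format that Word 2003 and earlier write. The function name read_core_properties and the FIELDS mapping are made up for illustration. "Company" is not covered, because it is stored in a different part of the package.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside docProps/core.xml of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative mapping from friendly names to core-property elements.
FIELDS = {
    "title": "dc:title",
    "subject": "dc:subject",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
}


def read_core_properties(docx):
    """Return the non-empty core metadata fields of a .docx file.

    `docx` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        try:
            raw = package.read("docProps/core.xml")
        except KeyError:
            return {}  # the package has no core-properties part
    root = ET.fromstring(raw)
    found = {}
    for name, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            found[name] = node.text
    return found
```

Calling read_core_properties("memo.docx") on a file made with the delete-and-save-as trick would be expected to show the original document's author and title, not yours.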

I have two recommendations regarding Microsoft Office and hidden data. The first is to realize that programs like Word and Excel are designed for authoring documents, not for publishing them. Get into the habit of saving your documents into pdf before distributing them. (Although if you're going to redact a pdf document, be smart about it or you'll have similar problems.)

The second is to install Microsoft's tool for deleting hidden data. (Works for Office 2003; there are third-party tools for older versions.) Or at least read the page about deleting private data in MS Office files, and then follow through on deleting the data.

This probably won't work for many of us, though. The last sentence of the article explains why:

"The real scandal here," Mr. Max told The Los Angeles Times after Democrats expressed outrage over the White House's fingerprints on the testimony, "is that after 15 years of using Microsoft Word, I don't know how to turn off 'track changes.'"

Posted on November 14, 2005 at 12:07 PM • 44 Comments

Comments

Milan Ilnyckyj • November 14, 2005 12:53 PM

One thing to be aware of: some of the free PDF converters for Windows will grab the meta-data from Word and add it to the PDF files you are creating.

The unwanted data in the properties of this file (http://www.irsa.ca/NASCAfinal.pdf) is a case in point.

ubiyca • November 14, 2005 1:22 PM

Test-driving Microsoft Office 12 (Word 12), I noticed that they have added a new feature:

"Document Inspector" - it allows you to actually scan the current document "for inappropriate or private information", and have Word auto-fix it.

It can be called from the File menu, under Finalize Document.

Word checks for:

- Comments and Revisions
(Detects comments, versions and tracked changes)

- Document Information
(Detects document properties and other personally identifiable information (PII) stored within the document)

- Headers & Footers
(Inspects the document for information in headers and footers)

- Hidden Text
(Detects hidden text)

Not so bad for this early MS Office 12 beta, and it actually works.
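
The checks listed above can be roughly approximated in a few lines of Python. This is only a sketch, and it makes no claim about how Word's Document Inspector works internally. It assumes the XML-based .docx format and looks for comments (word/comments.xml), tracked changes (w:ins and w:del elements), document properties (docProps/ parts), header and footer parts, and hidden text (w:vanish). The function name inspect_docx and the report keys are made up for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}" tag-prefix form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def inspect_docx(docx):
    """Report which kinds of hidden data a .docx package appears to contain."""
    report = {
        "comments": False,
        "tracked_changes": False,
        "document_properties": False,
        "headers_footers": False,
        "hidden_text": False,
    }
    with zipfile.ZipFile(docx) as package:
        names = package.namelist()
        report["comments"] = "word/comments.xml" in names
        report["document_properties"] = any(
            name.startswith("docProps/") for name in names
        )
        report["headers_footers"] = any(
            re.fullmatch(r"word/(header|footer)\d*\.xml", name) for name in names
        )
        if "word/document.xml" in names:
            body = ET.fromstring(package.read("word/document.xml"))
            tags = {element.tag for element in body.iter()}
            # Tracked insertions and deletions are wrapped in w:ins / w:del.
            report["tracked_changes"] = bool(tags & {W + "ins", W + "del"})
            # Crude check: any w:vanish element counts, even one switched off.
            report["hidden_text"] = (W + "vanish") in tags
    return report
```

A report of all False values only means these particular checks found nothing; it is not evidence that the file is clean.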

Davi Ottenheimer • November 14, 2005 1:25 PM

"after 15 years of using Microsoft Word, I don't know how to turn off 'track changes.'"

Too true. There's no warning that extra data is in the document, yet exporting to RTF (or other forms of removing excess data) will give you stern warnings about a loss of "functionality" or "features" or whatever.

For what it's worth, one of the first things to check when people apply for security positions is whether they have left Microsoft Office droppings (some call it metadata) in their Resume. Believe it or not, even Microsoft "experts" still make this mistake...

Pat Cahalan • November 14, 2005 1:27 PM

Remember that PDF isn't quite the Portable Document Format it used to be. In fact, it's hardly "portable" at all, anymore. NSF won't even accept submitted documents created in Acrobat if the version is not 5.0, due to compatibility problems.

People make PDFs in all sorts of ways: via specific versions of Adobe Acrobat Distiller (a commercial Windows product), or using open source tools like ps2pdf (which calls another open source tool, ghostscript, which dumps to some version of PDF which may or may not be compatible with whichever version N of Acrobat).

This leads to all sorts of bizarre problems, like documents that won't print, or will print only a particular portion of the document, or will print, but only on a Windows machine using a particular PDL version of a driver for a specific printer (e.g., it will print with binary PS, or it won't print with binary PS, or it prints differently to a PCL or KPDL or some other bizarre printer definition language than it does to PS).

For Windows-only people, using only the current commercial version of Acrobat, this isn't a bad solution (make all your docs PDFs before distributing them). For everyone else, it's just as much of a headache as "make all your docs {fill in the blank application} documents before you distribute them".

If you really want a portable document format, make 'em .jpgs.

t • November 14, 2005 1:36 PM

You could also use plain text files. Often all the bells and whistles are hardly needed. Just look at this blog article, almost all of it is good ol' plain text. It works quite fine, and this is true for many texts. For centuries, there was nothing else. I also feel I'm starting to get sick of formatting and computer problems overkill, so a simple text format might be a solution.

Davi Ottenheimer • November 14, 2005 2:03 PM

@ Pat

Good point. I am often amazed that Adobe's PDF suite wants to gobble up more disk space than the entire Microsoft Office distro, and that's clearly not just a coincidence.

But the bigger question is why these tools are still predominantly client-based and whether the risks will be addressed before they become service-oriented...

Server-side solutions like PHPdf (http://www.phpdf.com/) are surely a better model for future document creation and submission but only if they take into account a reasonable degree of back-end consistency and control.

For example, I was just assessing the state of CA's online PDF forms and noticed a stern warning to clear your data before you leave the site in order to prevent someone else from seeing it. Is that a reasonable approach? Here's a rather ironic example (the plaintiff claim form for small claims court):

http://www.courtinfo.ca.gov/forms/fillable/sc100.pdf

puck • November 14, 2005 2:19 PM

"Metadata is not written in stone, it can also be modified and faked."

Which, of course, is going to be the most fun part. Once businesspeople and law enforcement en masse get wise to it, there's going to be a field day of mischief to be made by sending the wrong document -- with some carefully forged metadata -- to the wrong people.

Then, of course, everything will balance out and we'll realize that the metadata in a document has become exactly as trustworthy as every other piece of data in that document (i.e. not at all).

Kevin S. • November 14, 2005 3:32 PM

"To solve this problem i don't use windows. You do have a choice"

So, what about OpenOffice - will its documents keep metadata? I've been toying with OpenOffice a bit since a friend introduced me to a Knoppix CD... It seems to be a good alternative.

Zip • November 14, 2005 3:39 PM

There was a fun article circa 1999 on a topic like this referenced in a few magazines at the time:
Microsoft's annual report was published as a Word document created - according to the metadata - on a Macintosh. Now the OS cold war of yesteryear is pretty much over, but I remember quite a few animated discussions that this triggered.

http://www.macintouch.com/msannual.html

Joseph • November 14, 2005 3:50 PM

@Greg

"To solve this problem i don't use windows. You do have a choice."

OpenOffice has used MetaData for quite a long time. In fact, it is happy to automatically convert Word metadata and store it in your open office document. This is not a "Windows Bug" but a functionality that some business types demand:

http://www.xml.com/pub/a/2001/02/07/openoffice.html

Greg • November 14, 2005 4:15 PM

I don't use OpenOffice, and it's not the metadata I know exists that I worry about.

@ Pat Cahalan
I have not had these problems at all with PDF, and sure, there is metadata in a PDF. But are you really sure that there is not "other" information in a Word doc that's not publicly disclosed?

ASCII works fine for many things, esp. email (the number of people sending a two-line email as a Word doc! Sheesh). As long as you speak English, I guess.

My big concern with Word docs is that DRM will make any "choice" impossible if you work with people that insist on using it as a standard document format. Then think how much info could be hidden in a format that is "impossible/illegal" to reverse engineer.

Greg

Richard Veryard • November 14, 2005 4:47 PM

It is often amusing, and sometimes useful, to detect plagiarism. I have seen a draft contract sent out by a firm of lawyers, who obviously didn't realise that the name of a rival firm of lawyers was hidden in the metadata!

greg • November 14, 2005 4:52 PM

@ jonathan

Yes! Why write a document when you can "program" one.

I wonder if it's possible to put evil code into a LaTeX file that will get executed when compiled?

Pat Cahalan • November 14, 2005 5:00 PM

Just out of curiosity, if anyone knows offhand...

Since Acrobat 6 was released, "digital signing" of Acrobat documents has been a functionality available on the Windows Acrobat Professional client.

Anyone know what the signing hash method is?

Davi Ottenheimer • November 14, 2005 6:20 PM

@ Pat

Good question, I believe it is MD5 and the digital signing capability has been available since Acrobat 4.0, or at least it came with a "self-sign" plug-in back then that could be swapped for one from an Adobe "digital signature partner".

Elcomsoft posted a vulnerability about it some time ago:

http://archives.neohapsis.com/archives/vuln-dev/2003-q1/0195.html

Incidentally, if you follow the links to the CERT doc (http://www.kb.cert.org/vuls/id/549913) and then to Adobe's response (http://www.kb.cert.org/vuls/id/JSHA-5EZQGZ), you might eventually come to the rather odd statement by Adobe:

"This vulnerability will not adversely affect an Acrobat user's system unless they download and install malicious third party software. [...] Exploits of this vulnerability violate the End User License Agreement included with Adobe Acrobat and Adobe Acrobat Reader."

Right, so don't install untrusted software or touch the "secure" signatures because they are really brittle, ok? Or at least it was in 2003, when they said they would fix it at some unspecified time in the future...

Since Adobe is to PDF what Microsoft is to RTF and IBM is to PC, I would look more at different implementations of the spec than at the originating vendor for solutions.

Davi Ottenheimer • November 14, 2005 6:28 PM

I guess to be fair I should add that more recent Adobe documents say Acrobat "can be used with PKI to provide authenticity and integrity checking capabilities to sensitive electronic content. Using up to 2048 bit RSA keys..."

More information here:
http://www.adobe.com/security/pdfs/acrobat_security_wp.pdf

(page 15)

• PKCS #1, #7, #11, and #12
• RSA (512-, 1024-, and 2048-bit)
• DSA
• eXtensible Markup Language (XML) signatures
• MSCAPI support

Evan Murphy • November 15, 2005 12:29 AM

greg: you say "Yes! Why write a document when you can 'program' one" in what I assume is meant to be a dismissive tone of voice. However, I fail to see why "programming" documents is, on the face of it, a bad idea.

LaTeX, like HTML and CSS, separates content and formatting---most web designers agree that this is a good idea. More importantly to the document author, it separates the necessarily-interactive authoring from the potentially-programmatic rendering. This modularization has a number of benefits: I can write my documents in the interactive environment of my choice (vim, emacs, Visual Studio, notepad); I can automate common tasks easily; and I can modify and transform my document text with all the other powerful text-based tools on my UNIX system.

It's interesting to note that as the powerful visual creation systems (MS Office, MS Visual Studio, Adobe Acrobat, Pagemaker, and InDesign) have matured, they have all gained increasingly powerful programmatic interfaces to documents and the editing environment. MS VBA macros, the plugin API in VS, Adobe's plugins and hooks are all examples of this. I'd say that programming your documents is a very powerful option to have at your disposal.

Then you say "I wonder if it's possible to put evil code into a LaTeX file that will get executed when compiled?"

Well, it's possible in the sense that all complicated software can have bugs that allow for unintended consequences, but of all document-preparation software in existence, I'd venture to say that TeX is one of the most thoroughly tested and understood.

Furthermore, it's not as though programmatic interfaces to document rendering are uniquely suited to malformed input attacks. Remember the libpng and zlib exploits on UNIX? The jpeg and emf/wmf exploits on Windows? Somehow, perfectly "static" data can still totally compromise your system.

Finally, here's something to think about. In a well-defined document language---and I'm by no means including TeX in this category---documents could be statically checked, like type-checking in conventional programming languages. It would be straightforward to prove at "compile" time that:

* Documents have no "hidden" metadata attached;
* Documents do not have particular malicious properties;
* Documents conform to certain standards of output.

That's just scratching the surface--I'd suggest looking around more academically-oriented sites like Lambda the Ultimate if you're interested in this sort of thing.

Maybe it's not a bad idea to encourage people to program their documents after all. I'm not sure LaTeX is the language of choice for grandma, but the field of domain-specific languages is exploding---it won't be long before we see a document language that most anyone can use.

another_bruce • November 15, 2005 5:34 AM

the lo-tech solution would be two different boxes. edit your docs on one, and when it's time to issue the final version, type into the second box what you see on the monitor of the first. this approach also offers better security for encrypted communications because the algorithm isn't stored on the box that sent the ciphertext. see, when you're looking at the internet, the internet is looking back at you.

Xtof • November 15, 2005 8:48 AM

When I think of all the "innocuous" metadata conveyed by RDF feeds and SOAP RPCs, the Microsoft Office case seems secluded at the tip of a fast-growing iceberg.

Tim Howland • November 15, 2005 8:49 AM

The original MS Word format was a straight binary dump of the contents of the memory buffer that the document lived in. In other words, you didn't traverse a model of the document to extract its data; it just flushed whatever happened to be in memory straight out to disk.

This meant that if you deleted a bunch of stuff from the document and then saved it, it would still be the same size (the memory buffer wouldn't shrink), and there would be a ton of "deleted" data floating around in the buffer.

They did this for performance reasons, but it's obviously got huge implications for security and reliability; I'm convinced that's a big piece of why they are moving to XML data formats.
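
One rough way to see whether a binary file still carries leftover text is to scan its raw bytes for printable runs, much like the Unix `strings` tool does. The sketch below is generic and assumes nothing about Word's internal layout. It simply looks for text stored as single-byte ASCII or as UTF-16LE (each ASCII character followed by a zero byte). The function name printable_runs and the min_length default are made up for illustration.

```python
import re


def printable_runs(data, min_length=6):
    """Return runs of printable text found in raw bytes.

    Scans for both single-byte ASCII runs and UTF-16LE runs, listing
    all ASCII matches first and then all UTF-16LE matches.
    """
    ascii_pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    utf16_pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    runs = [m.group().decode("ascii") for m in ascii_pattern.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in utf16_pattern.finditer(data)]
    return runs
```

Running it on the bytes of a saved document and comparing the result with the text you can see on screen gives a quick, if crude, picture of what else is riding along in the file.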

human • November 15, 2005 1:58 PM

In related news, millions of stupid people still use Windows and marvel every day at yet another reason why it sucks.

Evan Murphy • November 15, 2005 11:37 PM

You know, with all these ridiculous information protection failures with the ad-hoc revision control built into documents and particular editing programs (Yes, Office, I'm looking at you), it seems like someone would have the bright idea of moving to a real revision control system. In a real RCS, you're actually storing your document and its revisions in a structurally rigorous format, and you don't have to worry about exporting your other changes when you meant to export a particular revision.

I don't think I'd inflict CVS on your average non-technical office worker, but darcs is startlingly easy to use. It doesn't have a GUI client, but that could be solved pretty easily. For most of these applications, you really only need three options: "Save these changes", "view previous changes", and "Revert to/export previous version".

Hmm. I guess it's another reason why it's great to abstract away the text authoring and the graphical rendering--I can put my document sources in RCS just like I can anything else. For me, that's absurdly more useful than anything resembling "change tracking" that Microsoft can bolt on to the side of Word.

Janus Christensen • November 16, 2005 12:44 AM

Evan Murphy:
"Well, it's possible in the sense that all complicated software can have bugs that allow for unintended consequences, but of all document-preparation software in existence, I'd venture to say that TeX is one of the most thoroughly tested and understood."

Interestingly Donald Knuth gives cash awards to people who find bugs in TeX.

David Frier • November 16, 2005 3:29 AM

Oh grow up. The metadata that has everyone's knickers in a twist is all in the Properties box. It's hidden only from the lazy. This is the data that might carry over to PDF, but if you don't clean it up before exporting it's only because you CAN'T BE BOTHERED. Which, for your information, is NOT Windows' fault!

The other thing to do is to Accept All Changes so that deleted and modified text is not present via Track Changes. Or PDF it, which removes all of that.

Of course, it's more fun to bash Windows. Look, I know Windows is, shall we say, suboptimal from a technologist's perspective. But honestly. If Windows had not induced the spread of PCs to an 8-digit (or is it 9-digit yet?) number...

HOW MANY OF YOU WOULD HAVE JOBS?

Linus Torvalds • November 16, 2005 11:43 AM

"Of course, it's more fun to bash Windows. Look, I know Windows is, shall we say, suboptimal from a technologist's perspective. But honestly. If Windows had not induced the spread of PCs to an 8-digit (or is it 9-digit yet?) number..."

"HOW MANY OF YOU WOULD HAVE JOBS?"

Well, certainly the anti-virus software writers would be cold and hungry... :-)

David Frier • November 16, 2005 4:41 PM

@Linus (yeah, right)

> Well, certainly the anti-virus software writers would be cold and hungry... :-)

Dude, if there weren't a hundred million Windows PCs there wouldn't be Linux.

elegie • November 16, 2005 8:17 PM

With open file formats, it should be easier to develop tools for removing hidden data and metadata. Encouraging the use of open formats would be useful (especially for government documents.)

@Tim Howland:
Partially filled memory buffers can be another source of hidden data ending up in documents. Even game-related files (level files and saved games) can be affected by this. The hidden data is not viewable during normal use of the file.

Jon Sowden • November 16, 2005 8:32 PM

"Interestingly Donald Knuth gives cash awards to people who find bugs in TeX."

Yes, but the metadata in Office can't really be considered a bug. Can it?

The problem, so often, is that MS decides for us what we'd like, then goes ahead and does it, hidden away in the background. It does exactly what it was meant and designed to do (therefore =/= bug), it's just that sometimes that isn't what the user thought, knew, or intended.

Jon

rivergarden • November 17, 2005 3:54 AM

> In a real RCS, you're actually storing your document and its revisions in a structurally rigorous format,

Word already has versioning. File | Save As. Click on the tools icon top-right and go "Save Version".

Of course it is stored in the same file. Why not...? It is to recover back to a previous version.

It is down to the user to use or not use this functionality as required.

I find this interesting but not as worrying as the potential abuses posed by "Word Bugs". Imagine the Jamie Oliver cookbook (that was accidentally released recently in Word format) with a URL to a hidden image. Every time the doc is opened, the website logs will show who has the file and possibly allow it to gather other information too from the HTTP headers.

Useful for tracing the source of leaked documents but also potentially nefarious too.
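
A check for this kind of "bug" can be sketched in Python, with one loud assumption: the file is in the XML-based .docx format, where links to outside resources are recorded in .rels parts as relationships marked TargetMode="External". The sketch lists every such target. Ordinary hyperlinks show up in the same way, so the result is a list of candidates to review, not proof of tracking. The function name external_targets is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Relationship element of the packaging format, in ElementTree's "{uri}tag" form.
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def external_targets(docx):
    """List (part name, target) for every external relationship in the package."""
    hits = []
    with zipfile.ZipFile(docx) as package:
        for name in package.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(REL):
                # Internal parts have no TargetMode; remote ones say "External".
                if rel.get("TargetMode") == "External":
                    hits.append((name, rel.get("Target")))
    return hits
```

An image target pointing at an unfamiliar web server is the pattern described above and deserves a closer look before the file is opened in Word.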

Pat Cahalan • November 17, 2005 11:54 AM

Document revision and control and authoring seems like an old problem, but having been a technical participant to a disclosure summons let me tell you that the current practices in the real world do not in any way make sense for any corporation that wants to CIA (substitute Y for I and you'll know what I mean here).

There are several individual problems with the current "processes" (sarcasm tongs here) for document creation and control, namely that there really aren't any.

The problem of metadata in Office (the topic of this thread) is only one aspect of the problem. (The version control in Office actually makes this particular problem *worse*, not better, which I'll get to later.)

Anyone who used Word Perfect back in the 5.n days knows that Office isn't the only word processor that leaves cruft in a document. WP was just nicer in that you could toggle "reveal codes" and remove the markups that weren't supposed to be in there.

Here's what you really want, as a business. You want the ability to compile data, format data into a document, and render that document to an output device. You want those three bits of functionality to be modular, so that different people can compile data using different methods, different people can format the data into a document using different methods, and the resulting document can be rendered to various output formats using different methods. You also want the ability to tag each chunk of data with identifying markers for version control (author of data/date created/date edited, etc., whatever), AND with particular metadata (this data is related to "foo"). This last bit of functionality is totally lacking in today's document formats, and for legal purposes it's astounding that nobody has done this correctly.

If I'm sued regarding "foo", I should be able to locate all digital documents I have stored relating to "foo" regardless of whether those documents are individual emails, word processing documents, spreadsheets, accounting reports, whatever, with a simple search on my data store(s). Related to this, if I'm legally required to retain documents for a certain period of time (7 years, for example), but I want to dispose or archive those documents at the end of this period, there currently is no way for me to do so.

Here's where the metadata stored in Office becomes a problem -> if someone who doesn't realize what they're doing incorrectly edits an existing document (leaving the data relating to "foo" in a document pertaining to "bar"), there is no easy way to remove the "foo" data, or even know that it is there. This can lead to inadvertent data disclosure, which can be anything from embarrassing to illegal.

Imagine your doctor's office receptionist accidentally leaving the entire text of the letter informing someone that they have a disease in as "version1", and making a form letter as "version2". Anyone getting a copy of a letter saying, "Mr. Jones, you have a minor case of arthritis requiring over the counter pain medication" could reveal version1, and find out that Ms. Edwards has an embarrassing social disease.

Having a "publish" button that pops up with a GUI such as, "There is data in this document relating to 'foo' and 'bar', do you want to (a) leave it in, (b) delete all data relating to 'foo', (c) delete all data referring to 'bar', " or something of that nature would probably be a great help.

Optimally, there would be a standard for those three bits of functionality (i.e., all document preparation programs would have raw data stored in one way, with the formatting tags stored in another way, and the rendering information stored in a third). This would actually BE a real PDF ("portable document format") -> if you wanted to open a document in some application that was different from the authoring application, you could choose to open just the raw data, the raw data with the formatting interpreted into the new application, etc.

Office is horrible in that the formatting commands are munged together with the rendering commands, which is why selecting a different printer can screw up your document. This is absurd. But one can't really take MS to task for this, because there isn't a standard for them to violate :)

Pat Cahalan • November 17, 2005 4:00 PM

@ David

> Dude, if there weren't a hundred million Windows PCs there wouldn't be Linux.

That's certainly an unprovable statement, and rather untrue.

It's certainly the case that Microsoft Windows' ease of use for the new user was one of the reasons for the massive adoption of the personal computer, but let's face it, if Bill hadn't produced Windows 3.1, some other GUI OS would have taken its place (since the Mac OS was already available and gaining steam, it's entirely possible that Mac would have flooded the market before some other company came up with a PC GUI OS) -> the home PC was already making giant inroads into the consumer marketplace on the strength of the TRS-80, IBM's PC AT (running admittedly an MS version of DOS), etc.

Windows was an enabler in the marketplace, but if it wasn't around there would have been another enabler.

True, if the other OS (whatever it turned out to be) was more to Torvalds' liking, there might not be a Linux, but FreeBSD, etc. would undoubtedly still be around...

Clive Robinson • November 23, 2005 6:16 AM

Folks just for fun,

Save any of your Word docs in Rich Text Format (RTF) then look through the resulting file with a text editor...

I did this with earlier versions of Word a couple of years ago and there appeared to be a lot more in there than just the Meta-data that you are talking about.

More interestingly at the time it did not appear in the Microsoft RTF documentation. Most of it was easily removed without causing problems.

I have not tried it since as when I am forced to use MS I still use Off97.

Mainly I stick to using 7Bit ASCII files with the LF line discipline (Traditional Unix ;) as this allows me to work most places without problem. It's only when I have to format it for the "Nicety" of others that I waste my time cutting and pasting into a WYSIWYG Word Pro etc.

I guess I am showing my age :)

Andrew • May 26, 2006 2:25 AM

Interesting article. What is the current status of the situation? Does anybody know?

Dalila • October 3, 2007 4:44 PM

If I use the MS tool for removing hidden data, is there third-party software that can recover that information?

Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of Co3 Systems, Inc.