Electronic Document Files
Many electronic file formats contain extra information about the document itself, regardless of what the contents of the document actually are. This metadata often gives clues to the physical hardware, and possibly the user account details, of the creator or editors of a document, together with time and date stamps, any of which may be hazardous to the anonymity of a whistleblower.
On the other hand, this metadata embedded within a leaked document may provide the strongest clues as to its authenticity.
- Adobe .pdf documents have been published online in which personal details, e.g. email addresses, had been "blacked out" using Adobe .pdf software, which simply put an extra layer on top of the supposedly censored words. Copying and pasting the text into, say, Windows Notepad, WordPad or Word revealed the hidden data.
Anybody publishing such stuff online needs to be aware of this, to protect their Home Office or other sources.
- See this Adobe Technical Note: Technical Redaction of Confidential Information in Electronic Documents - How to safely remove sensitive information from Microsoft Word documents and PDF Documents Using Adobe Acrobat (.pdf) (or from our local mirror copy here at ht4w)
- Similarly, Adobe .pdf documents, Microsoft Word documents, Excel spreadsheets etc. may well have metadata (see the Document Properties) showing the author of the leaked document, which may in turn lead back to the "leak source".
- Microsoft Word documents, especially draft documents worked on by several people, often have the Version feature enabled. Sometimes examining the changes made to a document, and by whom, gives extra clues about policies or coverups etc.
The same feature on a whistleblower's own computer could, of course, betray their identity by adding their default name properties to any document which they edit or view before passing it on.
- Older versions of Microsoft Word (and other Office products like Excel or PowerPoint) can also betray the MAC address of the Ethernet card of the computer on which a document was created or edited, as part of the Global Unique ID data embedded in the document. Most people will not have changed the MAC addresses of their computers (often possible through software), and there are likely to be inventory records or network logfiles which will pinpoint which MAC address belongs to which computer, either at work or at home.
- Microsoft now makes available some tools to remove such GUIDs and other hidden metadata, versions, comments etc. from final published Microsoft Office documents, e.g. the Microsoft Office 2003/XP Remove Hidden Data Add-in, which removes most, but not quite all, of the hidden file data in Microsoft Word, Excel and PowerPoint files. N.B. this does not work on Office 2007 files, but Office 2007 appears to have a built-in Document Inspector which does the same job, although it has to be run deliberately rather than by default.
Types of data this add-in can remove
The following types of data are removed automatically.
* Previous authors and editors.
* User name.
* Personal summary information.
* Revision marks. The tool accepts all revisions specified in the document. As a result, the contents of the document will correspond to the Final Showing Markup view on the Reviewing toolbar.
* Deleted text. This data is removed automatically.
* VB Macros. Descriptions and comments are removed from the modules.
* The ID number used to identify your document for the purpose of merging changes back into the original document.
* Routing slips.
* E-mail headers.
* Scenario comments.
* Unique identifiers (Office 97 documents only).
Note The Remove Hidden Data tool also turns on the Remove Personal Information feature. For more information on this feature, please search for "Remove Personal Information" in the application Help.
- The US National Security Agency has published a technical report: Redacting with Confidence: How to Safely Publish Sanitized Reports Converted From Word to PDF (.pdf) (or from our local ht4w copy, I733-028R-2008.pdf)
- See also Microsoft's Knowledge Base article KB223396 pointing to other articles about meta data in various Microsoft Office products: How to minimize metadata in Office documents
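To illustrate why an embedded Global Unique ID can matter: in a "version 1" GUID, the last 12 hexadecimal digits are normally the MAC address of the network card of the machine which generated it. The Python sketch below uses only the standard uuid module to pull that address out of a GUID string. The function name and the sample GUID are invented for illustration, and it assumes that you have already found the GUID text inside a document by some other means.

```python
import uuid


def mac_from_guid(guid_text):
    """Return the MAC-style node field of a version-1 GUID, or None otherwise.

    Only version-1 GUIDs carry a hardware address; other versions
    (e.g. random version-4 GUIDs) reveal nothing about the machine.
    """
    guid = uuid.UUID(guid_text)  # accepts the {braced} form used by Windows
    if guid.version != 1:
        return None
    node = guid.node  # 48-bit integer taken from the last 12 hex digits
    return ":".join(f"{(node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


# Hypothetical GUID, made up for this example.
print(mac_from_guid("{c232ab00-9414-11ec-b3c8-001122334455}"))
```

A result of None only means that the GUID is not of the MAC-bearing kind; it says nothing about other identifying data in the file.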
Obviously any journalist or blogger should double check that what they make available online does not contain identifiable clues to their anonymous sources, not just on the face of the published document, but also within any "track changes" history, previous versions of a document, or document template.
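One way to make such a double check routine is a small script. The sketch below is only a minimal illustration, and it assumes the zip-based Office 2007 and later formats (.docx, .xlsx, .pptx), which keep their Document Properties in a part called docProps/core.xml; it does not read the older binary .doc format. The function names, the choice of fields and the sample names are our own assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an Office Open XML file.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of properties which can identify a person or a time.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_properties(office_file):
    """Return the identifying core properties found in a zip-based Office file."""
    with zipfile.ZipFile(office_file) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found


def build_sample_docx():
    """Build a tiny in-memory archive holding only a core properties part."""
    core = (
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        "<dc:creator>J. Smith</dc:creator>"
        "<cp:lastModifiedBy>A. Jones</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


print(read_core_properties(build_sample_docx()))
```

An empty result does not prove that a file is clean: names can also hide in comments, custom properties, embedded objects and the document text itself.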
Track Changes and Versions
- Remember that Microsoft Word has a "track changes" facility, which is useful when different versions of a document are written, edited or approved by more than one person. Several politically embarrassing Government leaks have happened because previously edited versions of words or paragraphs have been revealed by the public simply turning on the "show changes" option when they read it in Microsoft Word.
The Liberal Democrat blog Home Office Watch reports on how the extremely controversial secret policy document regarding plans for "Big Brother" surveillance of millions of innocent people was revealed because someone forgot to turn off "track changes".
As more journalists and political activists are becoming familiar with this feature or vulnerability, this may perhaps sometimes be a useful covert channel for information to be leaked to the media and the public, with a certain amount of "plausible deniability" for insider whistleblowers i.e. one document is effectively hidden within another, to a casual observer.
- The more recent versions of Microsoft Word, i.e. 2003 and 2007, have several Security / Privacy options which are worth reviewing. In Word 2003 they are under the Tools / Options / Security menu; in Word 2007 they are under the Trust Center's Privacy Options.
- Remove personal information from file properties on save (off by default)
- Warn before printing, saving or sending a file that contains tracked changes or comments (off by default)
- Store a random number to improve merge accuracy (on by default) - supposedly a harmless random number, but worth switching off if you are not merging documents with anything else.
- Make hidden markup visible when opening or saving (on by default) - worth keeping on to let you check that you have successfully erased identifying personal data if necessary.
N.B. "Authoring references not entered by the application are not removed automatically. For instance, those references entered through the use of field codes are not removed or changed. Or, if hidden text was used to tag a line, and the author of the hidden text embedded his or her initials or name in the hidden text, this reference is not removed because it is not an identified author reference."
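Tracked changes can also be listed without opening the document in Word at all. The sketch below is a minimal illustration which assumes the zip-based .docx format of Word 2007 and later, where insertions and deletions are stored as w:ins and w:del elements in word/document.xml, each usually carrying an author and a date. The function names and sample content are invented for the example, and it ignores comments, headers, footers and footnotes, which are held in other parts of the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Map each revision element to a label and to the element holding its text.
KINDS = {
    f"{{{W}}}ins": ("inserted", f"{{{W}}}t"),
    f"{{{W}}}del": ("deleted", f"{{{W}}}delText"),
}


def list_tracked_changes(docx_file):
    """Return tracked insertions and deletions in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    changes = []
    for element in root.iter():
        if element.tag not in KINDS:
            continue
        kind, text_tag = KINDS[element.tag]
        text = "".join(node.text or "" for node in element.iter(text_tag))
        changes.append({
            "kind": kind,
            "author": element.get(f"{{{W}}}author"),
            "date": element.get(f"{{{W}}}date"),
            "text": text,
        })
    return changes


def build_tracked_sample():
    """Build a tiny in-memory .docx-like archive with one deletion and one insertion."""
    document = (
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        '<w:del w:author="A. Editor" w:date="2008-01-02T10:00:00Z">'
        "<w:r><w:delText>original wording</w:delText></w:r></w:del>"
        '<w:ins w:author="B. Editor" w:date="2008-01-03T11:00:00Z">'
        "<w:r><w:t>final wording</w:t></w:r></w:ins>"
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


for change in list_tracked_changes(build_tracked_sample()):
    print(change)
```

The same listing serves both purposes discussed here: checking your own file before publication, and examining what an incoming leaked document reveals about its editing history.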
Examples of Inept "Redaction" or Censorship
- Sometimes digital files simply copy and magnify the errors which result from people working under deadline pressure. See the inept redaction / censorship with a marker pen of a legal Exhibit document in the Bank Julius Baer versus Wikileaks court case in February 2008. The plaintiff's lawyers took a digital screendump of a web page, printed it out, tried to hide the name of one of their client's former customers with a black marker pen, and then digitally scanned the result and submitted it electronically to the Court as an Adobe .pdf document. Apart from failing to redact or censor the postal address of the customer and the name of the customer in the heading of a page (printed in the largest typeface used in the document), they also failed to cover all the descending tails of the lower case letters in the name, which could have led to some intelligent guesswork. By digitally zooming in on the .pdf image scan, the name could be read through the fading marker pen ink overlay.
- Sometimes (.pdf) files have been "Redacted" or Censored by using the Drawing facility within the software to "paint" thick black lines over the text as an overlay. This has led to several "whistleblower leaks" of the hidden data, through the simple technique of copying and pasting the text out of the (.pdf) viewer software into another application programme such as a text editor or word processor, which has then revealed the underlying words which had supposedly been hidden, e.g. the failed attempt to hide the IP addresses of military and government computers in a (.pdf) copy of a US Grand Jury indictment against the alleged UK computer hacker Gary McKinnon.
- Sometimes the encryption and "protection" features used to hide information in an Adobe (.pdf) file can be overcome through password guessing etc., e.g. the Wikileaks.org publication of an unredacted version of the South African Competition Commission's final Report on Banking, 12 Dec 2008.
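The overlay failure described above can be checked for mechanically: if the "blacked out" words are still stored in the page content, a program can read them just as copy and paste does. The Python sketch below is a deliberately crude illustration using only the standard library. It assumes content streams which are either uncompressed or Flate (zlib) compressed, and text drawn with the simple "(text) Tj" operator. Real .pdf files also use hex strings, TJ arrays, custom font encodings, object streams and encryption, so an empty result here proves nothing. The function names and the sample file are made up for the example.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" (the lookbehind skips "endstream").
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand followed by the Tj (show text) operator.
TEXT_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extractable_text(pdf_bytes):
    """Return text operands which are still present in the PDF content streams."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not Flate-compressed, so scan the raw bytes instead
        for operand in TEXT_RE.findall(data):
            found.append(operand.decode("latin-1"))
    return found


def build_sample_pdf():
    """Build a fragment of a PDF: some text, then a black rectangle drawn over it."""
    content = (
        b"BT /F1 12 Tf 72 700 Td (source@example.org) Tj ET\n"
        b"0 g 70 695 160 16 re f\n"  # the black box painted as an overlay
    )
    return (
        b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
        + zlib.compress(content)
        + b"\nendstream\nendobj\n"
    )


print(extractable_text(build_sample_pdf()))
```

In the sample, the black rectangle is drawn after the text, so the address is invisible on screen but is still returned by the function, which is exactly the failure described in the examples above.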
Document File MetaData
- The ExifTool Perl script (also available as a Windows binary executable), which reads the metadata of image files, also displays it for Microsoft Word .doc, Excel .xls, PowerPoint .pps and Adobe .pdf files etc. See the Photo Image Files section.
- You can examine (but not change or delete) such photo or document image metadata via this website, which is powered by the ExifTool Perl script software: Jeffrey's Exif Viewer
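If ExifTool is installed locally, its output can be filtered by a script so that the fields most likely to identify a person stand out. The sketch below is a hedged illustration: it relies on ExifTool's -json output option, and the list of tag names is our own assumption about which fields are worth a second look, not a complete list. The function names are hypothetical.

```python
import json
import shutil
import subprocess

# Illustrative selection of ExifTool tag names; extend it to suit the file type.
IDENTIFYING_TAGS = (
    "Author", "Creator", "LastModifiedBy", "Company",
    "Producer", "Software", "CreateDate", "ModifyDate",
)


def identifying_fields(exiftool_json):
    """Keep only the potentially identifying tags from ExifTool's JSON output."""
    records = json.loads(exiftool_json)
    return [
        {tag: record[tag] for tag in IDENTIFYING_TAGS if tag in record}
        for record in records
    ]


def inspect_file(path):
    """Run a locally installed exiftool on one file and filter its output."""
    if shutil.which("exiftool") is None:
        raise RuntimeError("exiftool is not installed or not on the PATH")
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    )
    return identifying_fields(result.stdout)
```

For example, calling inspect_file on a local copy of a document returns one dictionary per file, containing only the tags from the list which ExifTool actually found. Running the tool locally also avoids uploading a sensitive document to a third-party website.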
Remember that sometimes a whistleblower, journalist or blogger needs to read and understand this sort of hidden metadata or document change history, to help determine whether the leaked document is genuine.
If the leaked document has not been edited on a computer which is linked in any way to the whistleblower, then sometimes the hidden metadata and "track changes" edits are in fact the main point of the whistleblower leak, perhaps showing evidence of a last minute reversal of Government or Corporate policy, or the censorship of independent expert advice, or even the outright fabrication of "facts" by political spin doctors in the final version of a document etc.