Data Redaction: You're Doing it Wrong

Published: 2010-04-22
Last Updated: 2010-04-23 11:58:02 UTC
by John Bambenek (Version: 1)
10 comment(s)

PDF files are a common way to distribute documents on the Internet, including documents with redacted (removed) content.  However, when you distribute a redacted document, make sure that the data you don't want out there isn't, in fact, still in the file.

Case in point: the upcoming trial of former Governor Rod Blagojevich. He just submitted a motion to force President Obama to testify during his criminal trial.  As you can imagine, there is sensitive information in the motion.  You can read the motion here. The areas that are redacted are pretty obvious.  Now, hit Control-A to select everything and Control-C to copy it.  Open a text editor or Microsoft Word (or the like) and hit Control-V.

Hello, Mr. Face.  Meet Mr. Palm. This particular mistake isn't new. There was a well-publicized SNAFU in which the US Department of Defense published a redacted document containing classified information, which was happily leaked on the Internet using the same method.

If the data is important enough to redact, it is probably important enough to verify that the data is actually gone.  Of course, this is a problem for more than just PDF documents.  An amusing HR trick is to take a look at Microsoft Word resumes, particularly the "Track Changes" history.

The takeaway is to use commercial tools (or tools specifically designed for the task) that delete, not just mask, redacted information, and to check that the redacted information is not easily retrievable... especially with something as trivial as "Copy-Paste".  If you are too stingy for a commercial software package, just print the document with the redacted portions and re-scan it as a PDF to ensure the text is gone.
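The "check before you publish" step is easy to automate. Here is a minimal sketch, assuming you have already copied or extracted the text from the redacted file by whatever means you like, and that you have a list of the phrases that were supposed to be removed. The function names and the sample data are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line wraps don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_leaks(extracted_text, redacted_phrases):
    """Return the redacted phrases that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [p for p in redacted_phrases if normalize(p) in haystack]

# Hypothetical inputs: text pasted out of the "redacted" PDF, and the
# phrases that were supposed to be gone.
extracted = "The witness,  Jane\nRoe, met the defendant on 3 May."
must_be_gone = ["Jane Roe", "Acme Holdings"]

leaks = find_leaks(extracted, must_be_gone)
print(leaks)
```

An empty result only shows that this particular extraction found nothing; it does not prove the data is gone, so it is worth repeating the check with text pulled from more than one viewer or tool.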

(You can read about the issue in this article, which is heavy on the facts of the particular trial in question.)

--
John Bambenek
bambenek at gmail /dot/ com

Keywords: data redaction pdf

Comments


My take on this has always been to use print-to-PDF functionality at the time of publication.

This way, the PDF mimics paper and contains only the intended information and nothing else.
It reminds me of a US Army internal report regarding the death of the "head" of the Italian secret services back in 2005. The PDF report contained some omissis: black bars over sensitive words or phrases.

But with a simple copy/paste the report was published in clear text (you can read the full story here: http://www.voltairenet.org/article30249.html).

So it seems that almost 5 years later someone forgot the lesson.
Print to PDF, from originals in PDF, Word, or any other format, also omits any metadata found in the original document. That sometimes contains details you'd rather keep private. I assume the commercial products mentioned do the same?

I tested print to PDF on the subject document from Evince on Linux, and the copy/paste of redacted sections revealed nothing. I'd be dang cautious about assuming that it's not there at all, though. I have no idea how the viewing app & print driver conspire to compose the output based on the input. I'd not be surprised if results could vary from software to software, too.
Print to PDF does not mimic paper. It mimics a PostScript printer, and there are subtle implications. For instance, open Microsoft Word. Enter some text. Select some of the text and use the highlight text button to set the background to black. I suspect this is exactly what was done on the Blagojevich document. Then print through a PDF queue (I've tested with both Acrobat and CutePDF Writer). Voila - blacked out text that can be selected, copied, and pasted.

Open a vector drawing program, type in some text, then draw a white rectangle over some of the text. Print through a PDF queue. Voila - you can still select and copy the overlaid text.

When Word prints black text on a black background to a PDF queue, it converts it to PostScript (still black text on a black background), which gets cleansed a little and compressed and turned into PDF. But the PostScript instructions inside the PDF file still say, "Print this text in black with a black background". Which means the content of the text is still there.
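To make that concrete, here is a rough sketch of what "the text is still there" looks like at the file level. It is an illustration, not a PDF parser: the content stream below is hand-written and hypothetical, using standard PDF drawing operators (Tj to show text, rg to set the fill color, re and f to paint a filled rectangle), and the extract_shown_text helper is made up for this example. Real files add fonts, encodings, and escaped characters that this ignores.

```python
import re
import zlib

# Hypothetical page content: draw the text, then paint a filled black
# rectangle over the same area (the "redaction").
content_stream = (
    b"BT /F1 12 Tf 72 700 Td (Witness name: Jane Roe) Tj ET\n"
    b"0 0 0 rg 72 695 200 16 re f\n"
)

# PDF writers commonly deflate-compress page streams, so the text may not
# be visible in a raw dump of the file, but it is fully recoverable.
stored = zlib.compress(content_stream)

def extract_shown_text(compressed):
    """Return the literal strings passed to the Tj (show text) operator."""
    raw = zlib.decompress(compressed)
    return [m.decode("latin-1") for m in re.findall(rb"\(([^)]*)\)\s*Tj", raw)]

recovered = extract_shown_text(stored)
print(recovered)
```

The rectangle is just one more drawing instruction that comes after the text instruction. Nothing about painting it removes the string handed to Tj, which is why select-and-copy still works.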

As far as metadata goes, Properties on the Blagojevich document lists Aaron Goldstein as the author and "Microsoft Word - motion to subpoena president redacted" as the title, which supports my assertion that they probably just highlighted the text with a black background in Microsoft Word.
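You don't need a viewer's Properties dialog to see those fields. As a rough sketch, assuming the file's document information dictionary is stored unencrypted and uncompressed with plain literal strings (which is not always the case), a few lines can pull the author and title straight out of the raw bytes. The read_info_strings helper and the sample fragment are hypothetical.

```python
import re

def read_info_strings(pdf_bytes):
    """Pull /Author and /Title literal strings out of raw PDF bytes."""
    found = {}
    for key in (b"Author", b"Title"):
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", pdf_bytes)
        if m:
            found[key.decode()] = m.group(1).decode("latin-1")
    return found

# Hypothetical fragment of a PDF document information dictionary.
sample = (
    b"<< /Author (A. Example) "
    b"/Title (Microsoft Word - draft redacted) "
    b"/Producer (Some PDF Writer) >>"
)

info = read_info_strings(sample)
print(info)
```

This will miss hex-encoded or UTF-16 strings, escaped parentheses, and metadata held in compressed objects or XMP, so an empty result is not evidence that the file is clean.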

I imagine that if you were experienced with a PDF inspection tool (which I am not), you might be able to find even more interesting things.

One tool worth keeping around is the Remove Hidden Data tool for Office XP/2003 (the same thing is built into Office 2007). I can't vouch for it being perfect, but it's probably better than nothing.

Also, one other approach similar to the print and scan solution would be to print through a TIF print queue (like the Microsoft Office Document Image Writer queue that comes with Office 2003) and then PDF the resultant TIF file. Still worth looking for potentially incriminating metadata. This approach avoids the noise and alignment issues in the print and scan approach.
Test using multiple PDF viewers; Apple Preview 5.0.1 (503) doesn't include the "redacted" text in a copy operation, but Adobe Reader on the same Mac (10.6.x) does.
you mean just drawing on my screen with black marker does not hide the data??

Why not just delete it totally?

they could have been real clever and made it white on white.
I was at least expecting to have to use a text editor to check PDF revisions, etc. This is just depressing.
See the NSA document on redaction, "Redaction with Confidence". There are earlier versions for Word 2003 and PDF versions 5 thru 7.

http://www.nsa.gov/ia/_files/support/I733-028R-2008.pdf
Cases like this that involve the costly and damaging inadvertent release of privileged information are completely avoidable. Dedicated redaction software is designed expressly for the permanent blocking out of private or privileged information in shared documents or public records. Customers using ID Shield by Extract Systems have successfully redacted over one billion pages. If the information is blocked by an ID Shield redaction zone, it's permanently removed--period. No more metadata or text behind the redaction zones that can be revealed by cut and paste tricks.

Mark Miller
VP Sales
www.extractsystems.com
