I needed to process a bunch of PDFs containing some sensitive info before they could be forwarded to an external team. This is how I did that data obfuscation from a Unix command line.

In my case the external team receiving the PDFs was supporting us in solving some process problems, so we already had agreements in place to "not do bad things". But still, it's important to be as careful as possible. And given the frightening stories of incompetence with personal data we see regularly from our government and large banks, I wanted to do better than black-on-black text styling or any similar nonsense.

Anyway, to keep this post short, the guts of what I did was:

$ pdftk myOriginal.pdf output uncompressed.pdf uncompress 
$ sed -E -e "s/[[:space:]][0-9]{5,7}[[:space:]]/ XXXXX /g" <uncompressed.pdf >no-ids.pdf
$ sed -e "s/RealName/Obfuscated/g" <no-ids.pdf >no-ids-or-names.pdf 
...repeat for any bits to be changed
$ pdftk no-ids-or-names.pdf output myObfuscated.pdf compress 
$ rm uncompressed.pdf no-ids.pdf no-ids-or-names.pdf

That was bundled up into a little shell script to be run against a pile of PDFs which could then be spot-checked before sending on securely.
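A minimal sketch of what such a wrapper script could look like, assuming a shell with `local` (bash, for example) and the same two substitutions as above. The function names `obfuscate_pdf` and `obfuscate_dir` are made up for illustration, and the ID pattern is written as `[0-9]{5,7}` with `sed -E` because sed does not understand `\d`.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and the two patterns are placeholders.

# Uncompress one PDF, run the substitutions, recompress into "$2".
obfuscate_pdf() {
  local src="$1" dest="$2" tmpdir status=0
  tmpdir="$(mktemp -d)" || return 1

  # 1. pdftk expands the compressed streams so sed can see the text.
  # 2. LC_ALL=C makes sed treat the file as raw bytes, since an
  #    uncompressed PDF still contains binary data.
  # 3. pdftk recompresses the edited file.
  pdftk "$src" output "$tmpdir/uncompressed.pdf" uncompress &&
    LC_ALL=C sed -E \
      -e 's/[[:space:]][0-9]{5,7}[[:space:]]/ XXXXX /g' \
      -e 's/RealName/Obfuscated/g' \
      "$tmpdir/uncompressed.pdf" > "$tmpdir/cleaned.pdf" &&
    pdftk "$tmpdir/cleaned.pdf" output "$dest" compress || status=1

  # Intermediate files still hold the sensitive text, so always remove them.
  rm -rf "$tmpdir"
  return "$status"
}

# Run obfuscate_pdf over every *.pdf in "$1", writing results to "$2".
obfuscate_dir() {
  local in_dir="$1" out_dir="$2" f
  mkdir -p "$out_dir" || return 1
  for f in "$in_dir"/*.pdf; do
    [ -e "$f" ] || continue   # the glob matched nothing
    obfuscate_pdf "$f" "$out_dir/$(basename "$f")" || echo "failed: $f" >&2
  done
}
```

Something like `obfuscate_dir ./originals ./obfuscated` would then leave the cleaned copies in a separate directory, ready for spot-checking.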

The key part is using the pdftk tool to uncompress the original PDF, so that the text inside is actually visible to sed. If the PDFs are secured, pdftk can un-secure them in the same step, provided we give it the right password.
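For secured PDFs, pdftk takes the password through its `input_pw` keyword. A small sketch, assuming the password lives in an environment variable I've called `PDF_PASSWORD` (that name and the function name are illustrative, not anything pdftk defines):

```shell
# Decrypt and uncompress in a single pdftk call.
# PDF_PASSWORD is a hypothetical variable set by the caller.
unlock_and_uncompress() {
  local src="$1" dest="$2"
  if [ -z "${PDF_PASSWORD:-}" ]; then
    echo "PDF_PASSWORD is not set" >&2
    return 1
  fi
  pdftk "$src" input_pw "$PDF_PASSWORD" output "$dest" uncompress
}
```

Called as `unlock_and_uncompress secured.pdf uncompressed.pdf`, this would stand in for the first pdftk step.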

After that, it's just using sed to find and replace all the sensitive text with non-sensitive text. In my situation there was only a small number of strings to be handled. If you had a full address book of names that needed replacing in a large number of PDFs you'd need something cleverer than this.
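One possible step up for a longer list of names, sketched under the assumption that the mapping is kept in a tab-separated file of "real name, TAB, replacement" lines. The helper name `build_sed_script` is invented here. It escapes each name so that characters like `.` or `/` are matched literally, then prints one `s` command per line for use with `sed -f`.

```shell
# Turn a tab-separated mapping file into a sed script on stdout.
build_sed_script() {
  local real fake
  while IFS="$(printf '\t')" read -r real fake; do
    [ -n "$real" ] || continue   # skip blank lines
    # Escape characters that are special in a basic regular expression.
    real="$(printf '%s' "$real" | sed -e 's|[][\\/.*^$]|\\&|g')"
    # Escape characters that are special in the replacement text.
    fake="$(printf '%s' "$fake" | sed -e 's|[\\/&]|\\&|g')"
    printf 's/%s/%s/g\n' "$real" "$fake"
  done < "$1"
}
```

Usage would be along the lines of `build_sed_script names.tsv > names.sed` followed by `LC_ALL=C sed -f names.sed uncompressed.pdf > cleaned.pdf`. If one name is a prefix of another, listing the longer one first in the mapping file keeps it from being partly replaced.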

Finally it's pdftk again to recompress (and optionally re-secure) the cleaned PDF.
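Re-securing can happen in that same final call, since pdftk accepts `owner_pw` (and `user_pw`) after the output filename. A hedged sketch, where `NEW_OWNER_PW` and the function name are placeholders of mine:

```shell
# Recompress and password-protect the cleaned PDF in one pdftk call.
# NEW_OWNER_PW is a hypothetical variable set by the caller.
recompress_and_secure() {
  local src="$1" dest="$2"
  if [ -z "${NEW_OWNER_PW:-}" ]; then
    echo "NEW_OWNER_PW is not set" >&2
    return 1
  fi
  pdftk "$src" output "$dest" owner_pw "$NEW_OWNER_PW" compress
}
```

An owner password on its own only restricts permissions. To require a password just to open the file, pdftk's `user_pw` keyword is the one to add.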
