joshdata / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.

License: Creative Commons Zero v1.0 Universal

Python 100.00%

pdf-redactor's Introduction

pdf-redactor

A general-purpose PDF text-layer redaction tool, in pure Python, by Joshua Tauberer and Antoine McGrath.

pdf-redactor uses pdfrw under the hood to parse and write out the PDF.

This Python module is a general tool to help you automatically redact text from PDFs. The tool operates on:

the text layer of the document's pages (content stream text)
plain text annotations
link target URLs
the Document Information Dictionary, a.k.a. the PDF metadata like Title and Author
embedded XMP metadata, if present

Graphical elements, images, and other embedded resources are not touched.

You can:

Use regular expressions to perform text substitution on the text layer (e.g. replace social security numbers with "XXX-XX-XXXX").
Rewrite, remove, or add new metadata fields on a field-by-field basis (e.g. wipe out all metadata except for certain fields).
Rewrite, remove, or add XML metadata using functions that operate on the parsed XMP DOM (e.g. wipe out XMP metadata).

How to use pdf-redactor

Get this module and then install its dependencies with:

pip3 install -r requirements.txt

pdf_redactor.py processes a PDF given on standard input and writes a new, redacted PDF to standard output:

python3 pdf_redactor.py < document.pdf > document-redacted.pdf

However, you should use the pdf_redactor module as a library and pass in text filtering functions written in Python, since the command-line version of the tool does not yet actually do anything to the PDF. The example.py script shows how to redact Social Security Numbers:

python3 example.py < tests/test-ssns.pdf > document-redacted.pdf
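As a rough illustration of what a content filter does, the sketch below applies (pattern, replacement function) pairs to an ordinary Python string. The pair shape mirrors the content_filters entries that appear in the issue reports further down; the helper name apply_filters and the SSN pattern are assumptions made for illustration, and the library itself operates on PDF text tokens rather than on a plain string.

```python
import re

# Each filter is a (compiled regex, function taking the match object) pair.
ssn_filters = [
    (re.compile(r"\d{3}-\d{2}-\d{4}"), lambda m: "XXX-XX-XXXX"),
]

def apply_filters(text, filters):
    """Apply each (pattern, function) pair to text in order (hypothetical helper)."""
    for pattern, func in filters:
        text = pattern.sub(func, text)
    return text

redacted = apply_filters("SSN: 123-45-6789", ssn_filters)
```

Trying filters on plain strings like this can be a quick way to debug a regular expression before running it against a PDF.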

Limitations

Not all content may be redacted

The PDF format is an incredibly complex data standard that has hundreds, if not thousands, of exotic capabilities used rarely or in specialized circumstances. Besides a document's text layer, metadata, and other components of a PDF document which this tool scans and can redact text from, there are many other components of PDF documents that this tool does not look at, such as:

embedded files, multimedia, and scripts
rich text annotations
forms
internal object names
digital signatures

PDF has so many exotic capabilities that this list is far from complete. Writing a redaction tool that scans every place content can be hidden inside a PDF would take much more effort, so please be aware that it is your responsibility to ensure that the PDFs you use this tool on only use the capabilities of the PDF format that this tool knows how to redact.

Character replacement

One of the PDF format's strengths is that it embeds font information so that documents can be displayed even if the fonts used to create the PDF aren't available when the PDF is viewed. Most PDFs are optimized to only embed the font information for characters that are actually used in the document. So if a document doesn't contain a particular letter or symbol, information for rendering the letter or symbol is not stored in the PDF.

This has an unfortunate consequence for redaction in the text layer. Since redaction in the text layer works by performing simple text substitution in the text stream, you may create replacement text that contains characters that were not previously in the PDF. Those characters simply won't show up when the PDF is viewed because the PDF didn't contain any information about how to display them.

To get around this problem, pdf_redactor checks your replacement text for new characters and replaces them with characters from the content_replacement_glyphs list (defaulting to ?, #, *, and a space) if any of those characters are present in the font information already stored in the PDF. Hopefully at least one of those characters is present (maybe none are!), and in that case your replacement text will at least show up as something and not disappear.
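The fallback described above can be sketched on plain strings as follows. This is not the library's implementation: the function name substitute_missing, the choice of the first available glyph, and leaving the text untouched when no fallback glyph is available are all assumptions made for illustration.

```python
DEFAULT_GLYPHS = ("?", "#", "*", " ")

def substitute_missing(replacement, available, glyphs=DEFAULT_GLYPHS):
    """Swap characters the font cannot draw for a fallback glyph (hypothetical helper)."""
    # Pick the first fallback glyph that the embedded font can draw, if any.
    fallback = next((g for g in glyphs if g in available), None)
    out = []
    for ch in replacement:
        if ch in available or fallback is None:
            out.append(ch)        # drawable, or nothing better to offer
        else:
            out.append(fallback)  # new character: use the fallback glyph
    return "".join(out)

# The font only has glyphs for these characters, so "[" and "]" are new.
shown = substitute_missing("[Client]", set("Client?"))
```

Under these assumptions, a replacement such as "[Client]" would come out as "?Client?" when the font has no bracket glyphs, which matches the symptom in one of the issue reports below.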

Content stream compression

Because pdfrw doesn't support all content stream compression methods, you should use a tool like qpdf to decompress the PDF prior to using this tool, and then to re-compress and web-optimize (linearize) the PDF after. The full command would be something like:

qpdf --stream-data=uncompress document.pdf - \
 | python3 pdf_redactor.py > /tmp/temp.pdf \
 && qpdf --linearize /tmp/temp.pdf document-redacted.pdf

(qpdf's first argument can't be standard input, unfortunately, so a one-liner isn't possible.)

Exotic fonts

This tool has a limited understanding of glyph-to-Unicode codepoint mappings. Some unusual fonts may not be processed correctly, in which case text layer redaction regular expressions may not match or substitution text may not render correctly.

Testing that it worked

If you're redacting metadata, you should check the output using pdfinfo from the poppler-utils package:

# check that the metadata is fully redacted
pdfinfo -meta document-redacted.pdf

Developing/testing the library

Tests require some additional packages:

pip install -r requirements-dev.txt
python tests/run_tests.py

The file tests/test-ssns.pdf was generated by converting the file tests/test-ssns.odft to PDF in LibreOffice, with the Archive PDF/A-1a option turned on so that it generates XMP metadata and with Export comments turned on so that the comment is exported.


pdf-redactor's Issues

Bookmarks

Bookmarks are not currently covered by any filters. This is under /Outlines in the document catalog.

WinAnsiEncoding quirks

From the PDF standard:

In WinAnsiEncoding, all unused codes greater than 40 map to the bullet character.
However, only code 225 is specifically assigned to the bullet character; other codes are
subject to future reassignment.

I fed this document in and got an encoding error that traced back to b'\x81 C'.decode("cp1252", "replace"). There's a bullet point in the corresponding position in the document. It appears that WinAnsiEncoding is a superset of CP-1252, because the Wikipedia article says:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes.
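The mismatch can be reproduced with Python's own cp1252 codec, which treats those five byte values as undefined. The sketch below is one possible workaround, not the library's code: it decodes byte by byte and maps anything the codec rejects to the bullet character. The function name decode_winansi is hypothetical, and the sketch assumes the codes in the quoted passage are octal, so that 225 is byte 0x95.

```python
BULLET = "\u2022"

def decode_winansi(data):
    """Decode bytes as cp1252, mapping undefined bytes to a bullet (hypothetical helper)."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's cp1252.
            chars.append(BULLET)
    return "".join(chars)

text = decode_winansi(b"\x81 C")
```

This only covers the bytes that Python's codec refuses to decode; any other differences between WinAnsiEncoding and CP-1252 are outside the scope of the sketch.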

Is there an example of using your code as a library for replacing text within a file? How do I manipulate a PDF file?

example.py creates a zero-size file that cannot be opened

None of the examples work or are explained?

It seems like it doesn't work or is just not explained. I'm not that new to Python, but too much information seems to be missing. I tried both examples and neither seems to work.

Arrays and dictionaries in page content may be split across two stream objects

I tried processing a pdf I had lying around, and I got an IndexError in tokenize_stream. The root cause is that a dictionary in the content stream is split between two stream objects, and thus two invocations of tokenize_stream, so the << token gets thrown away before the end of the dictionary is parsed. Fixing this will require keeping the stack around between stream objects. I'll take a crack at a PR to do so.

Test case:

curl https://www.ncua.gov/About/Pages/inspector-general/audit-reports/Documents/ncua-report-cybersecurity-act-aug-10-2016.pdf > a.pdf
qpdf --stream-data=uncompress --pages a.pdf 1 -- a.pdf b.pdf
python example.py < b.pdf
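The idea of keeping the stack around between stream objects can be sketched independently of the library. The class below is a simplified assumption, not the tokenize_stream code: it takes already-split tokens, tracks open dictionaries and arrays on a stack that survives across calls, and only emits a group once its closing token arrives.

```python
class NestingTracker:
    """Groups << >> and [ ] tokens, keeping open groups across feed() calls."""

    def __init__(self):
        self.stack = []  # open groups, innermost last

    def feed(self, tokens):
        completed = []
        for tok in tokens:
            if tok in ("<<", "["):
                self.stack.append([tok])
            elif tok in (">>", "]"):
                # No check that the closer matches the opener; this is a sketch.
                group = self.stack.pop()
                group.append(tok)
                if self.stack:
                    self.stack[-1].append(group)
                else:
                    completed.append(group)
            elif self.stack:
                self.stack[-1].append(tok)
            else:
                completed.append(tok)
        return completed

tracker = NestingTracker()
first = tracker.feed(["BT", "<<", "/A"])   # dictionary still open
second = tracker.feed(["1", ">>", "ET"])   # dictionary closes here
```

Because the tracker instance outlives each call, a dictionary opened in one stream object is completed when the next one is fed in.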

Crashes on large files

No errors for smaller files, but crashes for http://www.stolyarov.info/books/pdf/progintro_vol1.pdf

$ ./smoketest.py progintro_vol1.pdf
./smoketest.py:69: TqdmExperimentalWarning: GUI is experimental/alpha
  for fn in tqdm(list(gen_filenames(paths))):
IndexError while reading progintro_vol1.pdf
Traceback (most recent call last):
  File "./smoketest.py", line 40, in smoke_test_file
    pdf_redactor.redactor(options)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 101, in redactor
    text_layer = build_text_layer(document, options)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 451, in build_text_layer
    prev_token[i] = make_mutable_string_token(prev_token[i])
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 423, in make_mutable_string_token
    token = TextToken(token.to_bytes(), current_font)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 373, in __init__
    self.original_value = toUnicode(value, font, fontcache)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 647, in toUnicode
    fontcache[font.ToUnicode.stream] = CMap(font.ToUnicode)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 586, in __init__
    add_mapping(code, cid_or_name1, code-code1)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 547, in add_mapping
    char = char[0:-1] + (chr if sys.version_info >= (3,) else unichr)(ord(char[-1]) + offset)
IndexError: string index out of range
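The last frame indexes char[-1], which raises IndexError exactly when char is an empty string. A minimal sketch of a guard is shown below for Python 3 only; the function name offset_last_char is hypothetical, and returning the empty string unchanged is an assumption about the desired behavior rather than a confirmed fix for this file.

```python
def offset_last_char(char, offset):
    """Shift the final code point of char by offset, tolerating empty input."""
    if not char:
        # An empty mapping target has no last character to shift.
        return char
    return char[:-1] + chr(ord(char[-1]) + offset)

shifted = offset_last_char("ab", 1)
empty = offset_last_char("", 3)
```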

pdf-redactor failing to perform string substitution

So, I have a PDF file (the small text says aaa): https://cdn.discordapp.com/attachments/283280590860582912/521025782664003584/r.png

I want to replace aaa with "test123".

My code is this:

#;encoding=utf-8

import re
from datetime import datetime

import pdf_redactor

options = pdf_redactor.RedactorOptions()

options.content_filters = [
    # Replace every occurrence of "aaa" with "test123".
    (
        re.compile(r"aaa"),
        lambda m : "test123"
    ),
]

pdf_redactor.redactor(options)

(If https://github.com/JoshData/pdf-redactor/blob/master/example.py works, why doesn't my script work?)

(Using Python3.7)

Line break in replace

Hello!

I'm having trouble with the line break.

If you replace a small string with a long string, some of the content is not visible in the PDF, because there is no line break.

Do you have any suggestions to adjust this?

deleted letters from PDF even if those letters were present in the source document

I used pdf-redactor to change some text in a pdf file, but in part of the pdf I've lost all the 'n' characters.
The affected text was not the one that I hoped to change. The text was handled in the "class TextToken" by the "str(self)" function as an unchanged text, i.e., it passes through condition "if self.value == self.original_value:". Nevertheless it has changed. What I managed to do is to track that the function to blame is "PdfString.from_bytes(...)" in line 379 of pdf_redactor.py:
# If unchanged, return the raw original value without decoding/encoding.
return PdfString.from_bytes(self.raw_original_value)
By forcing the encoding of the unchanged TextToken to 'hex' I managed to fix the issue:
return PdfString.from_bytes(self.raw_original_value, bytes_encoding = 'hex')
This simple change helped in my case, but I do not know whether it holds in general. Can you try this and, if it works, push the fix to your code?

Page thumbnails

It might be prudent to wipe out page thumbnails stored in the document. Thumbnails are referenced by /Thumb in the page object. The thumbnail image itself is an XObject, so extra work would be required to excise it from the resulting file.

pip package

Do you plan to release a pip package?

Or would you accept help in releasing it that way?

Personally, I prefer to list the dependency in requirements.txt instead of cloning the code and incorporating it in the pipe.

Ligatures

Many PDF authoring suites replace "fi", etc. with ligature characters or glyphs. This may require special handling, either in the library or in calling code, to avoid false negatives.
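One way calling code might handle this, assuming the text layer exposes ligatures as Unicode compatibility characters such as U+FB01, is to normalize the text before matching. The sketch below uses only the standard library; fold_ligatures is a hypothetical helper, not part of pdf-redactor.

```python
import re
import unicodedata

def fold_ligatures(text):
    """Expand compatibility characters such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", text)

raw = "con\ufb01dential \ufb01le"          # contains two "fi" ligatures
pattern = re.compile(r"confidential file")

missed = pattern.search(raw) is None                     # false negative on raw text
found = pattern.search(fold_ligatures(raw)) is not None  # match after folding
```

NFKC also rewrites other compatibility characters and changes string length, so mapping match positions back to the original text is left to the caller.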

Overlapping of Text

Hi Joshua,

I am facing an issue while using Pdf Redactor.

I am replacing the word "GENIUS" with "wonderful" in a Pdf. I am using example.py for this purpose.

Issues:

"wonderful" overlaps with the next word.
Where "wonderful" does not overlap, the next word gets overlapped instead. You can see this in the second paragraph.
If the line becomes too long, the extra words do not shift to a new line.

Original Text in PDF

Redacted File

I would be really grateful if you can help me in this regard. Looking forward to hearing from you.

Thanks
Kapil Nakra

numbers in particular costs with decimal

I'm having trouble blanking out costs in formats like 12.00, 12345.98, or 123.76.
The problem is that it also blanks out whole numbers in the PDFs, though not all of them, which seems really weird to me.

I suspect that PDFs may "encode" whole numbers with decimals too, meaning that something displayed in a PDF as 12, for example, is actually stored as 12.00. Below is the code, which is based on example.py; I run it in the console.

red.py:
#;encoding=utf-8
from pdf_redactor import redactor, RedactorOptions
import re

#set options.
redactor_options = RedactorOptions()

redactor_options.content_filters = [
(re.compile(u"Cost Price"), lambda m : ""),
(re.compile(u"Cost"), lambda m : ""),
(re.compile(u"[0-9](.)[0-9]{2}"), lambda m : ""), #this is my regex for costs with 2 decimals
(re.compile(u"Value Price"),lambda m : ""),
]
redactor_options.content_replacement_glyphs = ['#', '', '/', '-']
redactor(redactor_options)

python red.py < a.pdf > anew.pdf

python3 red.py < a.pdf > anew.pdf does not work for me.

Would appreciate if anyone can help.
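A likely cause can be checked with the re module alone, without any PDF. In the pattern above the dot is unescaped, so it matches any character, and the whole pattern matches any four-character run that starts with a digit and ends with two digits. The sketch below compares it with a stricter pattern; the stricter pattern and the sample string are assumptions about what was intended.

```python
import re

loose = re.compile(r"[0-9](.)[0-9]{2}")       # pattern from the report
strict = re.compile(r"\b[0-9]+\.[0-9]{2}\b")  # escaped dot, assumed intent

sample = "Qty 1234 at 12.00"
loose_result = loose.sub("", sample)    # also eats the whole number 1234
strict_result = strict.sub("", sample)  # only removes the decimal cost
```

This would also explain why only some whole numbers disappear: the loose pattern needs at least four consecutive characters to match, so shorter numbers survive.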

Missing metadata isn't handled.

When applying the example.py script to some PDF files, I've been getting errors -

"Title": [lambda value: value.upper()], AttributeError: 'NoneType' object has no attribute 'upper'

I've found what I think is a simple fix for this - checking if the value is None in line 155 of pdf_redactor.py:

# Filter the value.
if value is not None:
    value = f(value)

Not sure about the etiquette of submitting a PR, but would like to do so - this has fixed the issues I've had when working on a number of PDFs using redactor.
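Until such a check exists in the library, calling code could guard its own filter functions. The sketch below wraps a filter so that a missing field is passed through untouched; skip_none is a hypothetical helper, and the dictionary name metadata_filters is assumed from the "Title" entry quoted in the error message.

```python
def skip_none(f):
    """Wrap a metadata filter so it is not called on missing (None) values."""
    def wrapper(value):
        return value if value is None else f(value)
    return wrapper

# Same shape as the entry in the error message: field name -> list of functions.
metadata_filters = {
    "Title": [skip_none(lambda value: value.upper())],
}

title_filter = metadata_filters["Title"][0]
```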

Is there any way to use pdf-redactor purely from .py instead of from terminal?

So, instead of setting options in pdf_redactor.py and running it from the terminal with

python3 pdf_redactor.py < doc-to-redact.pdf > doc-after-redact.pdf

could we use something like:

...
some options
....
dataout = b""
with open("doc-to-redact.pdf", "rb") as f:
    data = f.read()
    dataout = pdf_redactor.redactor(data, options)
with open("doc-after-redact.pdf", "wb") as f:
    f.write(dataout)

Or could you show a workaround for this? Thank you!
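A possible workaround, sketched under assumptions: another report on this page sets options.input_stream and options.output_stream, which suggests the options object carries the streams to read from and write to. If those attributes accept binary file-like objects (an assumption, not confirmed here), an in-memory wrapper could look like the following. The helper redact_bytes is hypothetical, and the redactor function is passed in so the sketch does not depend on the module being installed.

```python
import io

def redact_bytes(data, options, run_redactor):
    """Run a redactor over in-memory PDF bytes and return the result (hypothetical helper)."""
    # Assumption: the options object reads from input_stream and
    # writes to output_stream, both binary file-like objects.
    options.input_stream = io.BytesIO(data)
    options.output_stream = io.BytesIO()
    run_redactor(options)
    return options.output_stream.getvalue()
```

With the real module, the call would presumably be redact_bytes(data, pdf_redactor.RedactorOptions(), pdf_redactor.redactor), with data read from a file opened in "rb" mode.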

Doesn't redact bank statements.

Bank statements contain a lot of sensitive data and are usually more heavily decorated. The redactor seems to work on plain PDFs only.

Cannot replace text with Korean?

I want to translate a PDF from Japanese to Korean.
I'd like to replace the text with a translated version.
The text disappears as soon as I replace it.
Attached is the PDF used.

My Code

import re
from datetime import datetime

import pdf_redactor

options = pdf_redactor.RedactorOptions()

options.input_stream = r'.\filetest\test.pdf'
options.output_stream = r'.\filetest\test_transed.pdf'

options.content_filters = [

(
	re.compile(u"論文の書き方ガイド"),
	lambda m : u'테스트'
),

]

pdf_redactor.redactor(options)
test.pdf

Just replace first term

Is there a way to replace only the first occurrence of a term with the options stated?
I don't know if this is the right place for this question; thanks in advance.
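Since a content filter's replacement is a function of the match, one option is a function that remembers whether it has already fired. The sketch below shows the idea with re.sub on a plain string; first_only is a hypothetical helper, and it assumes the library calls the function once per match in document order, which is not confirmed here.

```python
import re

def first_only(replacement):
    """Build a replacement function that rewrites only the first match it sees."""
    state = {"used": False}

    def repl(match):
        if state["used"]:
            return match.group(0)  # leave later matches unchanged
        state["used"] = True
        return replacement

    return repl

result = re.sub(r"foo", first_only("X"), "foo foo foo")
```

Each call to first_only creates fresh state, so a new function is needed for every document processed.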

pdfrw now decodes Unicode strings in Python 3

As pdfrw now uses Unicode for PdfString (pmaupin/pdfrw@d8a9292) pdf-redactor fails with an error on this new version:

"pdf_redactor.py", line 676, in toUnicode
    string = string.encode("Latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 4-7: ordinal not in range(256)

Redaction swaps random characters

I have noticed that for some PDFs, the redaction will swap random characters, here's an example:

Original pdf:

Phone number and email redacted:

A regex that doesn't match anything:

The content_filters successfully redact the phone number and email; however, they also change all the "e"s to "i"s for some reason. I have tried this on various PDFs: the swapping doesn't happen every time, and when it does, it isn't consistent (the "e" and "i" can be any letters).

Any ideas? I am happy to provide more examples.

Can't redact with brackets "[" and "]"

I want to redact a client name like "Pepsi" with the term "[Client]". I added this rule to options.content_filters. However, the result looks like "?Client?". Do you know how to fix this? Many many thanks...

Command-line tool

I'd love to have this tool take command-line arguments like:

pdf-redactor --title "New Title" --xmp=remove --content-replace "\d{3}-\d{2}-\d{4}" "###-##-####" . . .
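The requested interface could be parsed with argparse along the following lines. This is only a sketch of the proposal: the tool does not currently have these flags, the "keep" choice for --xmp is an assumption, and wiring the parsed values into the redaction options is not shown.

```python
import argparse

def build_parser():
    """Parser for the proposed (hypothetical) command-line flags."""
    parser = argparse.ArgumentParser(prog="pdf-redactor")
    parser.add_argument("--title", help="new value for the Title metadata field")
    parser.add_argument("--xmp", choices=["keep", "remove"], default="keep")
    # Repeatable flag taking a regular expression and its replacement text.
    parser.add_argument("--content-replace", nargs=2, action="append", default=[],
                        metavar=("PATTERN", "REPLACEMENT"))
    return parser

args = build_parser().parse_args([
    "--title", "New Title", "--xmp=remove",
    "--content-replace", r"\d{3}-\d{2}-\d{4}", "###-##-####",
])
```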

key-value pattern regex to identify key-value pairs

Had posted it on stack long time back... not sure if still relevant. Opening issue anyways.

https://stackoverflow.com/questions/62467452/in-a-pdf-want-to-identify-key-value-pairs-and-programatically-redact-values-onl

The question in case the link is not accessible...

I'm using the joshdata redactor and a key-value pattern regex to identify key-value pairs in a PDF text token stream. However, with some PDFs, though the key-value pairs appear visually adjacent in the document, they appear far apart in its token stream. Hence, the key-value pattern doesn't match.

How can this be solved?