joshdata / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.

License: Creative Commons Zero v1.0 Universal

Python 100.00%

pdf-redactor's Issues

pip package

Do you plan to release a pip package?

Or would you accept help in releasing it that way?

Personally, I prefer to list the dependency in requirements.txt instead of cloning the code and incorporating it into my pipeline.

Is there an example of using your code as a library to replace text within a file? How do I manipulate a PDF file?

Can't redact with brackets "[" and "]"

I want to redact my client's name, e.g. "Pepsi", and replace it with the term "[Client]". I added this rule to options.content_filters. However, the result comes out as "?Client?". Do you know how to fix this? Many thanks.

Overlapping of Text

Hi Joshua,

I am facing an issue while using Pdf Redactor.

I am replacing the word "GENIUS" with "wonderful" in a Pdf. I am using example.py for this purpose.

Issues:

"wonderful" overlaps the next word.
Where "wonderful" itself does not overlap, the word after it is overlapped instead. You can see this in the second paragraph.
If a line becomes too long, the extra words do not wrap to a new line.

Original Text in PDF

Redacted File

I would be really grateful if you can help me in this regard. Looking forward to hearing from you.

Thanks
Kapil Nakra

example.py creates a 0-byte file that cannot be opened

Bookmarks

Bookmarks are not currently covered by any filters. This is under /Outlines in the document catalog.

Doesn't redact bank statements.

Bank statements contain a lot of sensitive data and are usually more heavily decorated. The redactor seems to work on plain PDFs only.

Numbers, in particular costs with decimals

I'm having trouble blanking out costs in formats like 12.00, 12345.98, or 123.76.
The problem is that it also blanks out whole numbers in PDFs, although not all whole numbers, which makes it really confusing to me.

My suspicion is that PDFs "encode" whole numbers with decimals too, meaning that something displayed in a PDF as 12, for example, is actually stored as 12.00. Below is the code, which is taken from example.py; I run it from the console.

red.py:
#;encoding=utf-8
from pdf_redactor import redactor, RedactorOptions
import re

#set options.
redactor_options = RedactorOptions()

redactor_options.content_filters = [
(re.compile(u"Cost Price"), lambda m : ""),
(re.compile(u"Cost"), lambda m : ""),
(re.compile(u"[0-9](.)[0-9]{2}"), lambda m : ""), #this is my regex for costs with 2 decimals
(re.compile(u"Value Price"),lambda m : ""),
]
redactor_options.content_replacement_glyphs = ['#', '', '/', '-']
redactor(redactor_options)

python red.py < a.pdf > anew.pdf

python3 red.py < a.pdf > anew.pdf does not work for me.

Would appreciate if anyone can help.
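
One possible explanation, offered as a guess rather than a confirmed diagnosis: in the cost pattern above the "." is unescaped, so it matches any character, and a four-digit whole number satisfies "digit, any character, two digits". The sketch below uses only the standard re module; the names loose and strict are illustrative and not part of pdf-redactor.

```python
import re

# The pattern from the report: "." is unescaped, so it matches ANY character.
loose = re.compile(r"[0-9](.)[0-9]{2}")

# A stricter alternative: digits, a literal dot, exactly two decimals,
# and no further digit immediately after them.
strict = re.compile(r"\d+\.\d{2}(?!\d)")

samples = ["2019", "12", "12.00", "12345.98"]
for text in samples:
    # Print whether each pattern finds a match anywhere in the sample.
    print(text, bool(loose.search(text)), bool(strict.search(text)))
```

If this is the cause, it would also explain why short whole numbers such as 12 survive: the loose pattern needs at least four characters to match.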

pdfrw now decodes Unicode strings in Python 3

As pdfrw now uses Unicode for PdfString (pmaupin/pdfrw@d8a9292) pdf-redactor fails with an error on this new version:

"pdf_redactor.py", line 676, in toUnicode
    string = string.encode("Latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 4-7: ordinal not in range(256)

None of the examples work, and none are explained

It seems like it either doesn't work or just isn't explained. I'm not that new to Python, but there seems to be too much information missing. I tried both examples and neither seems to work.

Ligatures

Many PDF authoring suites replace "fi", etc. with ligature characters or glyphs. This may require special handling, either in the library or in calling code to avoid false negatives.
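
To illustrate the false negative, here is a minimal sketch using only the standard library. It assumes the text layer exposes the Unicode ligature code point (for example U+FB01), which depends on the font's mapping and is not guaranteed. The helper name expand_ligatures is hypothetical.

```python
import re
import unicodedata

def expand_ligatures(text):
    # NFKC compatibility normalization expands ligature code points such as
    # U+FB01 into their component letters ("fi"). It also folds other
    # compatibility characters, so it may change more than ligatures.
    return unicodedata.normalize("NFKC", text)

extracted = "con\ufb01dential"          # text containing the "fi" ligature
pattern = re.compile("confidential")

print(bool(pattern.search(extracted)))                    # raw text
print(bool(pattern.search(expand_ligatures(extracted))))  # normalized text
```

Where such normalization would belong (inside the library's text layer or in a caller's pattern) is the open question raised above.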

Line break in replace

Hello!

I'm having trouble with the line break.

If you replace a small string with a long string, some of the content is not visible in the PDF, because there is no line break.

Do you have any suggestions to adjust this?

Page thumbnails

It might be prudent to wipe out page thumbnails stored in the document. Thumbnails are referenced by /Thumb in the page object. The thumbnail image itself is an XObject, so extra work would be required to excise it from the resulting file.

Redaction swaps random characters

I have noticed that for some PDFs, the redaction will swap random characters, here's an example:

Original pdf:

Phone number and email redacted:

A regex that doesn't match anything:

The content_filters successfully redact the phone number and email; however, the redaction is also changing every "e" to "i" for some reason. I have tried this on various PDFs. The swapping doesn't happen every time, and when it does, it isn't consistent (the affected letters can be any letter, not just "e" and "i").

Any ideas? I am happy to provide more examples.

Letters deleted from the PDF even though they were present in the source document

I used pdf-redactor to change some text in a pdf file, but in part of the pdf I've lost all the 'n' characters.
The affected text was not the one that I hoped to change. The text was handled in the "class TextToken" by the "str(self)" function as an unchanged text, i.e., it passes through condition "if self.value == self.original_value:". Nevertheless it has changed. What I managed to do is to track that the function to blame is "PdfString.from_bytes(...)" in line 379 of pdf_redactor.py:
# If unchanged, return the raw original value without decoding/encoding.
return PdfString.from_bytes(self.raw_original_value)
By forcing the encoding of the unchanged TextToken to 'hex' I managed to fix the issue:
return PdfString.from_bytes(self.raw_original_value, bytes_encoding = 'hex')
This simple change helped in my case, but I do not know whether it holds in general. Could you try it and, if it works, push the fix to your code?

WinAnsiEncoding quirks

From the PDF standard:

In WinAnsiEncoding, all unused codes greater than 40 map to the bullet character.
However, only code 225 is specifically assigned to the bullet character; other codes are
subject to future reassignment.

I fed this document in and got an encoding error that traced back to b'\x81 C'.decode("cp1252", "replace"). There's a bullet point in the corresponding position in the document. It appears that WinAnsiEncoding is a superset of CP-1252, because the Wikipedia article says:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes.
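
A minimal sketch of one way to handle this, assuming the only problem bytes are the five positions CP-1252 leaves undefined (81, 8D, 8F, 90, 9D hex) and mapping them to the bullet as the quoted passage describes. The function name decode_winansi is hypothetical, and this is not how pdf-redactor currently decodes strings; other differences between WinAnsiEncoding and CP-1252 are ignored here.

```python
# Byte values that Python's cp1252 codec refuses to decode.
CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
BULLET = "\u2022"

def decode_winansi(data):
    """Decode bytes as CP-1252, mapping its undefined codes to a bullet."""
    chars = []
    for byte in data:
        if byte in CP1252_UNDEFINED:
            chars.append(BULLET)
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)

print(decode_winansi(b"\x81 C"))
```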

Crashes on large files

No errors for smaller files, but crashes for http://www.stolyarov.info/books/pdf/progintro_vol1.pdf

$ ./smoketest.py progintro_vol1.pdf
./smoketest.py:69: TqdmExperimentalWarning: GUI is experimental/alpha
  for fn in tqdm(list(gen_filenames(paths))):
IndexError while reading progintro_vol1.pdf
Traceback (most recent call last):
  File "./smoketest.py", line 40, in smoke_test_file
    pdf_redactor.redactor(options)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 101, in redactor
    text_layer = build_text_layer(document, options)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 451, in build_text_layer
    prev_token[i] = make_mutable_string_token(prev_token[i])
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 423, in make_mutable_string_token
    token = TextToken(token.to_bytes(), current_font)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 373, in __init__
    self.original_value = toUnicode(value, font, fontcache)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 647, in toUnicode
    fontcache[font.ToUnicode.stream] = CMap(font.ToUnicode)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 586, in __init__
    add_mapping(code, cid_or_name1, code-code1)
  File "/home/oleg/pdf-redactor/pdf-redactor-master/pdf_redactor.py", line 547, in add_mapping
    char = char[0:-1] + (chr if sys.version_info >= (3,) else unichr)(ord(char[-1]) + offset)
IndexError: string index out of range

Command-line tool

I'd love to have this tool take command-line arguments like:

pdf-redactor --title "New Title" --xmp=remove --content-replace "\d{3}-\d{2}-\d{4}" "###-##-####" . . .
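
A rough sketch of how such flags could be parsed with argparse and turned into (pattern, function) pairs of the shape used for content_filters elsewhere on this page. Everything here is hypothetical: no such command-line interface exists in the project, the "keep" choice for --xmp is invented for illustration, and the sketch stops short of calling the library.

```python
import argparse
import re

def build_parser():
    # Hypothetical flags, mirroring the command line proposed above.
    parser = argparse.ArgumentParser(prog="pdf-redactor")
    parser.add_argument("--title", help="new document title")
    parser.add_argument("--xmp", choices=["keep", "remove"], default="keep")
    parser.add_argument(
        "--content-replace", nargs=2, action="append", default=[],
        metavar=("PATTERN", "REPLACEMENT"),
        help="regex and replacement text; may be given more than once",
    )
    return parser

def content_filters_from_args(args):
    # Bind each replacement as a default argument so every lambda keeps
    # its own value instead of sharing the loop variable.
    return [
        (re.compile(pattern), lambda m, text=replacement: text)
        for pattern, replacement in args.content_replace
    ]

args = build_parser().parse_args([
    "--title", "New Title", "--xmp=remove",
    "--content-replace", r"\d{3}-\d{2}-\d{4}", "###-##-####",
])
filters = content_filters_from_args(args)
```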

Missing metadata isn't handled.

When applying the example.py script to some PDF files, I've been getting errors -

"Title": [lambda value: value.upper()], AttributeError: 'NoneType' object has no attribute 'upper'

I've found what I think is a simple fix for this - checking if the value is None in line 155 of pdf_redactor.py:

# Filter the value.
if value is not None:
    value = f(value)

Not sure about the etiquette of submitting a PR, but would like to do so - this has fixed the issues I've had when working on a number of PDFs using redactor.
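
Until a fix like that lands in the library, the same guard can be applied from calling code. This is a minimal sketch; none_safe is a hypothetical helper, not part of pdf-redactor.

```python
def none_safe(func):
    """Wrap a metadata filter so that missing (None) values pass through."""
    def wrapper(value):
        if value is None:
            return None
        return func(value)
    return wrapper

# The Title filter from the error message above, made tolerant of None.
title_filter = none_safe(lambda value: value.upper())

print(title_filter("annual report"))
print(title_filter(None))
```

The wrapped function could then be used wherever the plain lambda was, for example as the "Title" entry shown in the error message.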

Just replace first term

Is there a way to replace only the first occurrence of a term with the options stated?
I don't know whether this is the right place to ask this question. Thanks in advance.
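
One possible approach is a replacement callback that remembers whether it has already fired. The sketch below demonstrates it with re.sub from the standard library; it assumes the library calls the replacement function once per match in document order, which has not been verified here. The name first_only is hypothetical.

```python
import re

def first_only(replacement):
    """Return a callback that replaces only the first match it sees."""
    seen = False
    def repl(match):
        nonlocal seen
        if seen:
            # Later matches are returned unchanged.
            return match.group(0)
        seen = True
        return replacement
    return repl

result = re.sub(r"GENIUS", first_only("wonderful"), "GENIUS meets GENIUS")
print(result)
```

Because the callback keeps state, a fresh one would be needed for each document processed.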

Arrays and dictionaries in page content may be split across two stream objects

I tried processing a pdf I had lying around, and I got an IndexError in tokenize_stream. The root cause is that a dictionary in the content stream is split between two stream objects, and thus two invocations of tokenize_stream, so the << token gets thrown away before the end of the dictionary is parsed. Fixing this will require keeping the stack around between stream objects. I'll take a crack at a PR to do so.

Test case:

curl https://www.ncua.gov/About/Pages/inspector-general/audit-reports/Documents/ncua-report-cybersecurity-act-aug-10-2016.pdf > a.pdf
qpdf --stream-data=uncompress --pages a.pdf 1 -- a.pdf b.pdf
python example.py < b.pdf
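
The idea of keeping the stack between stream objects can be shown with a toy example. This is not the library's tokenize_stream; group_tokens and the pre-split token lists are assumptions made purely for illustration.

```python
def group_tokens(streams):
    """Nest << >> and [ ] groups, keeping one stack across all streams."""
    stack = [[]]                       # stack[0] collects top-level tokens
    for tokens in streams:             # one list of tokens per stream object
        for token in tokens:
            if token in ("<<", "["):
                stack.append([token])  # open a new group
            elif token in (">>", "]"):
                group = stack.pop()    # close the innermost group
                group.append(token)
                stack[-1].append(group)
            else:
                stack[-1].append(token)
    return stack[0]

# A dictionary that starts in one stream object and ends in the next.
streams = [["/Span", "<<", "/MCID"], ["0", ">>", "BDC"]]
grouped = group_tokens(streams)
print(grouped)
```

Had the stack been reset at the start of the second list, the closing >> would have had no matching opener, which mirrors the failure described above.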

pdf-redactor failing to perform string substitution

So, I have a PDF file (the small text says aaa): https://cdn.discordapp.com/attachments/283280590860582912/521025782664003584/r.png

I want to replace aaa with "test123".

My code is this:

#;encoding=utf-8

import re
from datetime import datetime

import pdf_redactor

options = pdf_redactor.RedactorOptions()

options.content_filters = [
    # Replace every occurrence of "aaa" with "test123".
    (
        re.compile(r"aaa"),
        lambda m : "test123"
    ),
]

pdf_redactor.redactor(options)

(If https://github.com/JoshData/pdf-redactor/blob/master/example.py works, why doesn't my script work?)

(Using Python3.7)

Is there any way to use pdf-redactor purely from .py instead of from terminal?

So, instead of setting options in pdf-redactor.py and running it through terminal with

python3 pdf-redactor.py < doc-to-redact.pdf > doc-after-redact.pdf,

we can use:

...
some options
....
dataout = b""
with open("doc-to-redact.pdf", "rb") as f:
    data = f.read()
    dataout = pdf_redactor.redactor(data, options)
with open("doc-after-redact.pdf", "wb") as f:
    f.write(dataout)

Or could you show a workaround for this? Thank you!
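
One workaround that relies only on the documented stdin/stdout usage is to launch the options script as a child process with files attached to its standard streams. This is a sketch under that assumption; redact_file is a hypothetical helper, and the script path is whatever file holds your options and the call to the redactor.

```python
import subprocess
import sys

def redact_file(script_path, src_path, dst_path):
    """Run a redaction script with src as stdin and dst as stdout."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Equivalent to: python script.py < src.pdf > dst.pdf
        subprocess.run(
            [sys.executable, script_path],
            stdin=src, stdout=dst, check=True,
        )
```

For example, redact_file("red.py", "doc-to-redact.pdf", "doc-after-redact.pdf") would mirror the shell command shown above. check=True raises an exception if the script exits with an error.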

key-value pattern regex to identify key-value pairs

I posted this on Stack Overflow a long time ago and am not sure whether it is still relevant. Opening an issue anyway.

https://stackoverflow.com/questions/62467452/in-a-pdf-want-to-identify-key-value-pairs-and-programatically-redact-values-onl

The question in case the link is not accessible...

I'm using the joshdata redactor and a key-value pattern regex to identify key-value pairs in a PDF text token stream. However, in some PDFs, although the key and value appear visually adjacent in the document, they appear far apart in its token stream. Hence, the key-value pattern doesn't match.

How can this be solved?
