goshin / remove-pdf-watermark

Remove embedded watermarks and color stains from scanned PDFs.

License: MIT License

Python 100.00%

remove-pdf-watermark's Introduction

Remove PDF Watermark

A simple Python script that removes embedded watermarks and color stains from scanned PDFs.

Experimental Result

Requirements

Python 3
Pillow
PyPDF2
img2pdf

Usage

usage: pdf-watermark-removal.py [-h] [-o out] [-s SKIP] PATH

positional arguments:
  PATH

optional arguments:
  -h, --help            show this help message and exit
  -o out, --output out  Output PDF file
  -s SKIP, --skip SKIP  Skip over the first n page(s).

example:

$ pdf-watermark-removal.py --skip=1 --output=output.pdf document.pdf
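
To make the `--skip` option concrete, here is a minimal sketch. It mirrors the options listed above but takes the output as a plain string instead of opening a file, and `pages_to_clean` is a hypothetical helper name. It assumes, based on the script version posted in the issues further down, that skipped pages stay in the output and are simply left uncleaned.

```python
import argparse


def build_parser():
    # Mirrors the documented options; output is a plain string here so that
    # parsing does not create a file (an assumption made for this sketch).
    parser = argparse.ArgumentParser(prog="pdf-watermark-removal.py")
    parser.add_argument("input_pdf_path", metavar="PATH")
    parser.add_argument("-o", "--output", metavar="out", help="Output PDF file")
    parser.add_argument("-s", "--skip", type=int, default=0,
                        help="Skip over the first n page(s).")
    return parser


def pages_to_clean(page_count, skip):
    # Hypothetical helper: zero-based indices of the pages that get cleaned.
    return [i for i in range(page_count) if i >= skip]


args = build_parser().parse_args(["--skip=1", "--output=output.pdf", "document.pdf"])
print(args.skip, args.output, args.input_pdf_path)
print(pages_to_clean(4, args.skip))
```

With `--skip=1` on a four-page file, only the pages at indices 1, 2 and 3 would be cleaned.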

License

The software is released under the MIT license.

remove-pdf-watermark's Issues

Sorry, I tried to use this script to remove the watermark from my PDF, but an error occurred. Could you describe the method in more detail? Here is what I ran when the error occurred. Thanks.

c:\python\Python37-32\Remove-PDF-Watermark-master\src>pdf-watermark-removal.py -s=2 -o=output.pdf abc.pdf
2018-09-13 18:10:05,724 - INFO - Processing page 1/60
Traceback (most recent call last):
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 129, in <module>
    main()
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 119, in main
    process_page(pdf, i, i < args.skip)
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 80, in process_page
    img = Image.frombytes(mode, size, data)
  File "C:\Python\Python37-32\lib\site-packages\PIL\Image.py", line 2368, in frombytes
    im.frombytes(data, decoder_name, args)
  File "C:\Python\Python37-32\lib\site-packages\PIL\Image.py", line 799, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data
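
Pillow's `Image.frombytes` raises this error when the buffer holds fewer bytes than the mode and size require. One possible cause here, an assumption that this thread does not confirm, is that the image stream is still Flate-compressed, or uses a color space with a different byte count per pixel than the script assumes. The sketch below uses only the standard library; `expected_raw_size` and `diagnose` are hypothetical helper names.

```python
import zlib

# Bytes per pixel for the two Pillow modes the script uses.
BYTES_PER_PIXEL = {"RGB": 3, "P": 1}


def expected_raw_size(mode, size):
    width, height = size
    return width * height * BYTES_PER_PIXEL[mode]


def diagnose(mode, size, data):
    """Return a short hint about why raw pixel data may not fit."""
    need = expected_raw_size(mode, size)
    if len(data) >= need:
        return "ok"
    try:
        inflated = zlib.decompress(data)
    except zlib.error:
        return "too short and not zlib-compressed"
    if len(inflated) >= need:
        return "still compressed: inflate before decoding"
    return "inflated data is still too short: check mode or color space"


raw = bytes(4 * 4 * 3)        # a 4x4 all-black RGB image
packed = zlib.compress(raw)   # what a FlateDecode stream would store
print(diagnose("RGB", (4, 4), raw))
print(diagnose("RGB", (4, 4), packed))
```

Printing the width, height, color space, filter and data length of the failing image object would show which of these cases applies.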

ValueError: not enough image data

Running the script prints ValueError: not enough image data. I am confused and cannot solve this myself.

can't read pdf

I am using Python 3.6.12 on macOS 10.15.6.
Running the script produces this error:
raise ValueError("not enough image data")
ValueError: not enough image data
testDecember.pdf

ValueError: not enough values to unpack (expected 2, got 0)

I'm using:

python 3.9
pillow 9.1.1
pypdf 2.1
img2pdf 0.4.4
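
This is the message Python raises when the result of `zip(*...)` is unpacked into two names but the input is empty. In the script that would happen if a page yields no image objects, though whether that is the cause in this report is an assumption. A minimal sketch with a guard, where `page_dimensions` is a hypothetical helper:

```python
def page_dimensions(sizes):
    """Return (max_width, total_height) for stacked images, or None if empty."""
    sizes = list(sizes)
    if not sizes:
        # zip(*[]) yields nothing, so unpacking into two names would fail.
        return None
    widths, heights = zip(*sizes)
    return max(widths), sum(heights)


print(page_dimensions([(800, 600), (640, 400)]))
print(page_dimensions([]))

# Reproduce the reported error without the guard.
try:
    widths, heights = zip(*[])
except ValueError as exc:
    message = str(exc)
    print(message)
```

A caller could treat the `None` result as "no images on this page" and decide how to handle that page.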

Many functions were removed in PyPDF2 3.0.0

I updated the affected calls to the new API as follows:

#!/usr/bin/env python3

# The MIT License (MIT)
#
# Copyright (c) 2016 John Chong
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
import shutil
from collections import OrderedDict
import argparse
import io
import logging

import img2pdf
from PIL import Image
from PyPDF2 import PdfReader


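def is_gray(a, b, c):
    # r is the largest allowed difference between any two channels.
    r = 40
    # Dark pixels (likely text) are kept regardless of hue.
    if a + b + c < 350:
        return True
    # Brighter pixels are kept only if all three channels are close together.
    if abs(a - b) > r:
        return False
    if abs(a - c) > r:
        return False
    if abs(b - c) > r:
        return False
    return True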


def remove_watermark(image):
    # Work in RGB so every pixel is an (r, g, b) tuple.
    image = image.convert("RGB")
    color_data = image.getdata()

    # Keep dark or near-gray pixels and paint everything else white.
    new_color = []
    for item in color_data:
        if is_gray(item[0], item[1], item[2]):
            new_color.append(item)
        else:
            new_color.append((255, 255, 255))

    image.putdata(new_color)
    return image


def process_page(pdf, page_index, skipped):
    content = pdf.pages[page_index]['/Resources']['/XObject'].get_object()
    images = {}
    for obj in content:
        if content[obj]['/Subtype'] == '/Image':
            size = (content[obj]['/Width'], content[obj]['/Height'])
            data = content[obj]._data
            if content[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if content[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
            else:
                img = Image.open(io.BytesIO(data))
            images[int(obj[3:])] = img
    images = OrderedDict(sorted(images.items())).values()
    widths, heights = zip(*(i.size for i in images))
    total_height = sum(heights)
    max_width = max(widths)
    concat_image = Image.new('RGB', (max_width, total_height))
    offset = 0
    for i in images:
        concat_image.paste(i, (0, offset))
        offset += i.size[1]
    if not skipped:
        concat_image = remove_watermark(concat_image)
    concat_image.save("./temp/{}.jpg".format(page_index))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('input_pdf_path', metavar='PATH')
    parser.add_argument('-o', '--output', metavar='out', type=argparse.FileType('wb'),
                        help='Output PDF file')
    parser.add_argument('-s', '--skip', type=int, default=0,
                        help='Skip over the first n page(s).')
    args = parser.parse_args()

    logger = logging.getLogger(__name__)
    logging.basicConfig(level='INFO', format='%(asctime)s - %(levelname)s - %(message)s')

    directory = './temp/'
    if not os.path.exists(directory):
        os.makedirs(directory)

    images_path = []
    pdf = PdfReader(open(args.input_pdf_path, "rb"))
    for i in range(0, len(pdf.pages)):
        logger.info("Processing page {}/{}".format(i + 1, len(pdf.pages)))
        images_path.append("./temp/{}.jpg".format(i))
        process_page(pdf, i, i < args.skip)

    logger.info('Writing to output PDF file')
    args.output.write(img2pdf.convert(*list(map(img2pdf.input_images, images_path))))
    logger.info('Done')

    shutil.rmtree(directory, True)


if __name__ == '__main__':
    main()
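
To see what the `is_gray` rule keeps and what it whitens, the sketch below copies the rule in condensed form and applies it to a few sample pixels. The sample colors and the helper `clean_pixel` are illustrative choices, not taken from the script or from any real document.

```python
def is_gray(a, b, c):
    # Same rule as in the script above, with the three checks combined.
    r = 40
    if a + b + c < 350:
        return True
    if abs(a - b) > r or abs(a - c) > r or abs(b - c) > r:
        return False
    return True


def clean_pixel(pixel):
    # Keep pixels the rule accepts; replace everything else with white.
    return pixel if is_gray(*pixel) else (255, 255, 255)


samples = {
    "black text": (10, 10, 10),
    "light gray": (200, 205, 198),
    "pink stain": (255, 180, 180),
    "pure red": (255, 0, 0),
}
for name, pixel in samples.items():
    print(name, clean_pixel(pixel))
```

Because the darkness test runs first, a saturated color whose channels sum to less than 350 (pure red in this sample) is kept. Only bright, colored pixels such as the pink sample are turned white.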

Can it add a watermark?

tips

If an error is reported, try rolling the packages back to earlier versions.

KeyError: '/XObject'

Traceback (most recent call last):
  File "pdf-watermark-removal.py", line 127, in <module>
    main()
  File "pdf-watermark-removal.py", line 117, in main
    process_page(pdf, i, i < args.skip)
  File "pdf-watermark-removal.py", line 66, in process_page
    content = pdf.getPage(page_index)['/Resources']['/XObject'].getObject()
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/XObject'
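
The `KeyError` suggests that this page's `/Resources` dictionary has no `/XObject` entry, which is what one would expect for a page without embedded images (an assumption; the thread does not confirm it). The sketch below uses plain dictionaries as stand-ins for PDF page objects, so it shows only the lookup pattern, not PyPDF2's API. `image_names` is a hypothetical helper.

```python
def image_names(page):
    """Return the names of image XObjects on a page, or [] if there are none."""
    resources = page.get("/Resources", {})
    xobjects = resources.get("/XObject")
    if xobjects is None:
        # No XObject dictionary: nothing to extract, so the caller decides
        # whether to skip the page or handle it another way.
        return []
    return [name for name, obj in xobjects.items()
            if obj.get("/Subtype") == "/Image"]


scanned_page = {"/Resources": {"/XObject": {
    "/Im0": {"/Subtype": "/Image", "/Width": 800, "/Height": 600},
    "/Fm0": {"/Subtype": "/Form"},
}}}
text_page = {"/Resources": {"/Font": {}}}

print(image_names(scanned_page))
print(image_names(text_page))
```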

I recommend using pdfminer.six instead of PyPDF2, because pdfminer.six is maintained and PyPDF2 isn't.

No module named 'img2pdf'

I ran this script in PyCharm, but it showed: ModuleNotFoundError: No module named 'img2pdf'
I tried to install the package, but the search found nothing.
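
The Requirements section lists Pillow, PyPDF2 and img2pdf. A quick way to check whether the interpreter running the script can see them is sketched below. `missing_modules` is a hypothetical helper, and the import names (`PIL` for Pillow) are the ones used in the script's own import statements.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Interpreter:", sys.executable)
print("Missing:", missing_modules(["PIL", "PyPDF2", "img2pdf"]))
```

An IDE can be configured with a different interpreter than the one used in a terminal, so it may help to compare the printed interpreter path with the one the packages were installed into.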

Running the script with Python 3.8.2 on Windows 10 reports the following error:

D:\script\Remove-PDF-Watermark-master\src>pdf-watermark-removal.py --skip=1 --output=output.pdf document.pdf
2020-03-24 12:50:01,669 - INFO - Processing page 1/1450
2020-03-24 12:50:01,685 - INFO - Processing page 2/1450
Traceback (most recent call last):
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 127, in <module>
    main()
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 117, in main
    process_page(pdf, i, i < args.skip)
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 66, in process_page
    content = pdf.getPage(page_index)['/Resources']['/XObject'].getObject()
  File "C:\Users\yudi\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/XObject'

problem

Hello! When I open this project, why can't I compile it correctly?