remove-pdf-watermark's Introduction

Remove PDF Watermark

A simple Python script that removes embedded watermarks and color stains from scanned PDFs.

An easy way to remove watermarks from scanned PDF files.


Experimental Result

Requirements

  • Python 3
  • Pillow
  • PyPDF2
  • img2pdf

Usage

usage: pdf-watermark-removal.py [-h] [-o out] [-s SKIP] PATH

positional arguments:
  PATH

optional arguments:
  -h, --help            show this help message and exit
  -o out, --output out  Output PDF file
  -s SKIP, --skip SKIP  Skip over the first n page(s).

example:

$ pdf-watermark-removal.py --skip=1 --output=output.pdf document.pdf

License

The software is released under the MIT license.

remove-pdf-watermark's Issues

Sorry, I tried to use this script to remove the watermark from my PDF, but an error occurred. Could you describe the method in more detail? Here is the command I ran and the error it produced. Thanks.

c:\python\Python37-32\Remove-PDF-Watermark-master\src>pdf-watermark-removal.py -s=2 -o=output.pdf abc.pdf
2018-09-13 18:10:05,724 - INFO - Processing page 1/60
Traceback (most recent call last):
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 129, in <module>
    main()
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 119, in main
    process_page(pdf, i, i < args.skip)
  File "C:\python\Python37-32\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 80, in process_page
    img = Image.frombytes(mode, size, data)
  File "C:\Python\Python37-32\lib\site-packages\PIL\Image.py", line 2368, in frombytes
    im.frombytes(data, decoder_name, args)
  File "C:\Python\Python37-32\lib\site-packages\PIL\Image.py", line 799, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data
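
Pillow raises this error when the byte string handed to Image.frombytes is shorter than the mode and size imply. Possible causes here, none verified against the reporter's file, are that the stream data is still compressed when it is passed in, or that the image's color space does not match the mode the script assumes. The following is a minimal sketch of a pre-check that reports the mismatch before Pillow is called. The names expected_byte_count and check_raw_image are hypothetical helpers, not part of the script or of Pillow, and only the "RGB", "L" and "P" modes are covered.

```python
# Hypothetical helpers for illustration; not part of the script or of Pillow.
# Bytes per pixel for the raw (uncompressed) layouts covered by this sketch.
BYTES_PER_PIXEL = {"RGB": 3, "L": 1, "P": 1}


def expected_byte_count(mode, size):
    """Number of raw bytes an uncompressed image of this mode and size needs."""
    width, height = size
    return width * height * BYTES_PER_PIXEL[mode]


def check_raw_image(mode, size, data):
    """Return None if data is long enough, otherwise a short diagnostic message."""
    needed = expected_byte_count(mode, size)
    if len(data) < needed:
        return "need {} bytes for {} {}x{}, got {}".format(
            needed, mode, size[0], size[1], len(data))
    return None


if __name__ == "__main__":
    # A 2x2 RGB image needs 12 bytes; 5 bytes would trigger the Pillow error.
    print(check_raw_image("RGB", (2, 2), bytes(12)))
    print(check_raw_image("RGB", (2, 2), bytes(5)))
```

Printing the diagnostic for each image object on the failing page shows whether the data is far too short (likely still compressed) or off by a fixed factor (likely a mode mismatch).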

Can't read PDF

I am using Python 3.6.12 on macOS 10.15.6.
When I run the script, I get this error:
raise ValueError("not enough image data")
ValueError: not enough image data
testDecember.pdf

Many functions were removed in PyPDF2 3.0.0

I changed some calls to the new API, as follows:

#!/usr/bin/env python3

# The MIT License (MIT)
#
# Copyright (c) 2016 John Chong
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
import shutil
from collections import OrderedDict
import argparse
import io
import logging

import img2pdf
from PIL import Image
from PyPDF2 import PdfReader


def is_gray(a, b, c):
    r = 40
    if a + b + c < 350:
        return True
    if abs(a - b) > r:
        return False
    if abs(a - c) > r:
        return False
    if abs(b - c) > r:
        return False
    return True


def remove_watermark(image):
    image = image.convert("RGB")
    color_data = image.getdata()

    new_color = []
    for item in color_data:
        if is_gray(item[0], item[1], item[2]):
            new_color.append(item)
        else:
            new_color.append((255, 255, 255))

    image.putdata(new_color)
    return image


def process_page(pdf, page_index, skipped):
    content = pdf.pages[page_index]['/Resources']['/XObject'].get_object()
    images = {}
    for obj in content:
        if content[obj]['/Subtype'] == '/Image':
            size = (content[obj]['/Width'], content[obj]['/Height'])
            data = content[obj]._data
            if content[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if content[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
            else:
                img = Image.open(io.BytesIO(data))
            images[int(obj[3:])] = img
    images = OrderedDict(sorted(images.items())).values()
    widths, heights = zip(*(i.size for i in images))
    total_height = sum(heights)
    max_width = max(widths)
    concat_image = Image.new('RGB', (max_width, total_height))
    offset = 0
    for i in images:
        concat_image.paste(i, (0, offset))
        offset += i.size[1]
    if not skipped:
        concat_image = remove_watermark(concat_image)
    concat_image.save("./temp/{}.jpg".format(page_index))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('input_pdf_path', metavar='PATH')
    parser.add_argument('-o', '--output', metavar='out', type=argparse.FileType('wb'),
                        help='Output PDF file')
    parser.add_argument('-s', '--skip', type=int, default=0,
                        help='Skip over the first n page(s).')
    args = parser.parse_args()

    logger = logging.getLogger(__name__)
    logging.basicConfig(level='INFO', format='%(asctime)s - %(levelname)s - %(message)s')

    directory = './temp/'
    if not os.path.exists(directory):
        os.makedirs(directory)

    images_path = []
    pdf = PdfReader(open(args.input_pdf_path, "rb"))
    for i in range(0, len(pdf.pages)):
        logger.info("Processing page {}/{}".format(i + 1, len(pdf.pages)))
        images_path.append("./temp/{}.jpg".format(i))
        process_page(pdf, i, i < args.skip)

    logger.info('Writing to output PDF file')
    args.output.write(img2pdf.convert(*list(map(img2pdf.input_images, images_path))))
    logger.info('Done')

    shutil.rmtree(directory, True)


if __name__ == '__main__':
    main()
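
The pixel rule in the script above can be tried on its own, without a PDF or Pillow. The sketch below repeats the same thresholds (a channel spread of 40 and a darkness cutoff of 350) in a condensed form and applies them to a few sample pixels. The helper clean_pixels and the sample values are made up for illustration.

```python
def is_gray(a, b, c):
    # Same rule as the script above: dark pixels always count as "gray";
    # otherwise all three channels must lie within r of each other.
    r = 40
    if a + b + c < 350:
        return True
    return abs(a - b) <= r and abs(a - c) <= r and abs(b - c) <= r


def clean_pixels(pixels):
    # Hypothetical helper: keep gray pixels, paint everything else white.
    return [p if is_gray(*p) else (255, 255, 255) for p in pixels]


if __name__ == "__main__":
    sample = [
        (0, 0, 0),        # black text: kept
        (200, 200, 200),  # light gray: kept
        (255, 180, 180),  # light pink watermark: replaced with white
        (120, 30, 30),    # dark red: kept, because the channel sum is below 350
    ]
    print(clean_pixels(sample))
```

One consequence of the darkness cutoff is that dark colored pixels are kept, so this rule only removes light, tinted watermarks.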

Tip

If this version reports an error, roll the package back to an earlier release.

KeyError: '/XObject'

Traceback (most recent call last):
  File "pdf-watermark-removal.py", line 127, in <module>
    main()
  File "pdf-watermark-removal.py", line 117, in main
    process_page(pdf, i, i < args.skip)
  File "pdf-watermark-removal.py", line 66, in process_page
    content = pdf.getPage(page_index)['/Resources']['/XObject'].getObject()
  File "/home/math/.pyenv/versions/3.8.4/lib/python3.8/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/XObject'
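
The KeyError means that the page's /Resources dictionary has no /XObject entry, which is typically the case for a page that holds no images, such as a page that is not a scan. The following is a minimal sketch of guarding the lookup so that such pages are skipped instead of aborting the run. Plain dicts stand in for the PyPDF2 page objects here, and page_xobjects and image_names are hypothetical helpers, not part of the script or of PyPDF2.

```python
# Hypothetical helpers; plain dicts stand in for PyPDF2 page dictionaries.
def page_xobjects(page):
    """Return the page's /XObject dictionary, or an empty dict if it has none."""
    try:
        return page["/Resources"]["/XObject"]
    except KeyError:
        return {}


def image_names(page):
    """Names of the XObjects on the page whose /Subtype is /Image."""
    xobjects = page_xobjects(page)
    return [name for name in xobjects
            if xobjects[name].get("/Subtype") == "/Image"]


if __name__ == "__main__":
    scanned = {"/Resources": {"/XObject": {"/Im0": {"/Subtype": "/Image"}}}}
    text_only = {"/Resources": {"/Font": {}}}
    print(image_names(scanned))    # one image found
    print(image_names(text_only))  # no images; the caller can skip this page
```

A caller that gets an empty list would still need to decide what to write for that page, since the script builds the output from one image per page.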

I recommend using pdfminer.six instead of PyPDF2, because pdfminer.six is maintained and PyPDF2 isn't.

No module named 'img2pdf'

I ran this script in PyCharm, but it showed: ModuleNotFoundError: No module named 'img2pdf'
I tried to install the package, but the search found nothing.
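
A ModuleNotFoundError like this usually means the package is not installed in the interpreter that runs the script; in an IDE that may be a per-project interpreter rather than the system one. The package is published on PyPI under the name img2pdf. The sketch below uses only the standard library to list which of the script's third-party modules the current interpreter cannot find. The helper name missing_modules is hypothetical, and PIL is the import name of Pillow.

```python
import importlib.util

# Import names of the third-party packages listed under Requirements.
REQUIRED_MODULES = ["PIL", "PyPDF2", "img2pdf"]


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed for this interpreter:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Running this with the same interpreter that runs the script shows whether the install went to a different environment.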

Running with Python 3.8.2 on Windows 10 gives the error below

D:\script\Remove-PDF-Watermark-master\src>pdf-watermark-removal.py --skip=1 --output=output.pdf document.pdf
2020-03-24 12:50:01,669 - INFO - Processing page 1/1450
2020-03-24 12:50:01,685 - INFO - Processing page 2/1450
Traceback (most recent call last):
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 127, in <module>
    main()
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 117, in main
    process_page(pdf, i, i < args.skip)
  File "D:\script\Remove-PDF-Watermark-master\src\pdf-watermark-removal.py", line 66, in process_page
    content = pdf.getPage(page_index)['/Resources']['/XObject'].getObject()
  File "C:\Users\yudi\AppData\Local\Programs\Python\Python38-32\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/XObject'

Problem

Hello! When I open this project, why can't I get it to compile correctly?
