
pdfprocessor's Introduction

Summary

A Python script that sorts the pages of an input PDF file using text extracted from each page of the original document.

How it works

Using the PyPDF library, the script splits the input PDF file into single-page files and extracts the text on each page. It then reads the customer_id value from a fixed string position in that text and renames each page file using the customer_id, customer_name, and customer_route, which it looks up in hash tables defined at the top of the script. The page_number is also included in the file name so that multiple invoices from the same customer do not overwrite one another. The renamed PDF files are then merged back into one sorted PDF file. Lastly, the input PDF and all of the intermediate split pages are removed.
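The renaming and sorting logic described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the table contents, the ID_START/ID_END offsets, the file name layout, and the function names are all hypothetical. The optional sort_pdf helper assumes the pypdf package and, as a simplification, sorts pages in memory instead of writing and renaming intermediate files.

```python
# Hypothetical lookup tables; the real script defines its own at the top of the file.
CUSTOMER_NAMES = {"10023": "Acme Market", "10457": "Bayside Deli"}
CUSTOMER_ROUTES = {"10023": "R02", "10457": "R01"}

# Assumed position of the customer_id within the extracted page text.
ID_START, ID_END = 9, 14


def build_filename(page_text: str, page_number: int) -> str:
    """Build a sortable file name for one split page."""
    customer_id = page_text[ID_START:ID_END]
    name = CUSTOMER_NAMES.get(customer_id, "unknown")
    route = CUSTOMER_ROUTES.get(customer_id, "unrouted")
    # page_number keeps two invoices for the same customer from colliding.
    return f"{route}_{name}_{customer_id}_{page_number:04d}.pdf"


def sorted_filenames(page_texts: list[str]) -> list[str]:
    """Return the file names in the order the pages would be merged."""
    names = [build_filename(text, i) for i, text in enumerate(page_texts, start=1)]
    return sorted(names)


def sort_pdf(input_path: str, output_path: str) -> None:
    """Sort the pages of a PDF by their generated names (in-memory variant)."""
    from pypdf import PdfReader, PdfWriter  # third-party; assumed for this sketch

    reader = PdfReader(input_path)
    keyed = [
        (build_filename(page.extract_text() or "", number), page)
        for number, page in enumerate(reader.pages, start=1)
    ]
    writer = PdfWriter()
    for _, page in sorted(keyed, key=lambda pair: pair[0]):
        writer.add_page(page)
    writer.write(output_path)
```

Putting the route first in the file name means a plain alphabetical sort groups pages by route, then by customer, then by original page order.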

Using the ReportLab library, a watermark is added behind each PDF page. A hash map determines which watermark text is printed behind each page.
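One way the watermark step might look is sketched below. The WATERMARKS table, the function names, and the font, gray level, and rotation are hypothetical choices for illustration. The sketch assumes the reportlab package for drawing and a pypdf release whose merge_page accepts over=False for placing the watermark behind the page content. For brevity it stamps one watermark on every page, where the script picks the text per page.

```python
# Hypothetical mapping from route code to watermark text.
WATERMARKS = {"R01": "ROUTE 1", "R02": "ROUTE 2"}


def watermark_text(route: str) -> str:
    """Look up the watermark for a route, with a fallback for unknown routes."""
    return WATERMARKS.get(route, "UNROUTED")


def make_watermark_pdf(text: str, path: str) -> None:
    """Draw light gray diagonal text on a single letter-size page."""
    from reportlab.lib.pagesizes import letter  # third-party; assumed
    from reportlab.pdfgen import canvas

    width, height = letter
    c = canvas.Canvas(path, pagesize=letter)
    c.setFont("Helvetica-Bold", 72)
    c.setFillGray(0.85)
    # Move the origin to the page center, then rotate so the text runs diagonally.
    c.translate(width / 2, height / 2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.save()


def stamp_behind(input_path: str, watermark_path: str, output_path: str) -> None:
    """Place the first page of watermark_path behind every page of input_path."""
    from pypdf import PdfReader, PdfWriter  # third-party; assumed

    watermark_page = PdfReader(watermark_path).pages[0]
    writer = PdfWriter()
    for page in PdfReader(input_path).pages:
        # over=False draws the watermark underneath the existing content.
        page.merge_page(watermark_page, over=False)
        writer.add_page(page)
    writer.write(output_path)
```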

Time complexity

This script runs in O(n) time, where n is the number of pages. Hash tables are defined at the top of the script so that each customer lookup takes constant time on average. The watermark step runs as a separate process from the PyPDF steps, which doubles the time and space required. For this reason, it is included only in the scripts that need it (e.g. sortbyroute).
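The linear behavior of the lookup step can be illustrated with a short sketch. The table contents and the function name are hypothetical. Each page costs one dictionary lookup, which is constant time on average, so n pages cost O(n) for this step. Sorting the resulting names with a comparison sort would be O(n log n) and is not shown here.

```python
# Hypothetical route table; the real script keeps its own near the top.
CUSTOMER_ROUTES = {"10023": "R02", "10457": "R01", "10991": "R03"}


def routes_for_pages(customer_ids: list[str]) -> list[str]:
    """One pass over the pages, with one average constant-time lookup per page."""
    return [CUSTOMER_ROUTES.get(cid, "unrouted") for cid in customer_ids]
```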

Room for improvement

  • Optimize the space complexity by compressing the output PDF files. The PDF produced after both the PyPDF merge step and the ReportLab watermark step is almost 10x the original size.
  • Connect the script to a more user-friendly front-end interface. View the full app here: https://github.com/timleungtech/pdf-processor-react-flask
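As a starting point for the compression item above, the following sketch assumes the pypdf package and its compress_content_streams page method. The function names are hypothetical, and how much of the size increase this recovers is untested here. Content stream compression may not help if the growth comes from duplicated resources.

```python
def growth_factor(original_bytes: int, output_bytes: int) -> float:
    """Return how many times larger the output is than the original."""
    return output_bytes / original_bytes


def compress_pdf(input_path: str, output_path: str) -> None:
    """Rewrite a PDF with compressed page content streams."""
    from pypdf import PdfReader, PdfWriter  # third-party; assumed

    writer = PdfWriter()
    for page in PdfReader(input_path).pages:
        writer.add_page(page)
    # Compress after the pages belong to the writer.
    for page in writer.pages:
        page.compress_content_streams()
    writer.write(output_path)
```

Comparing growth_factor before and after compress_pdf, using file sizes from os.path.getsize, gives a quick check of whether the change is worthwhile.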


Contributors

timleungtech, timlpq
