whoiskatrin / financial-statement-pdf-extractor

Python script to extract as much structured information as possible from annual/quarterly reports.

Python 100.00%
pdf quarterly-reports extract financial-analysis financial-statements data-processing balance-sheet cash-flow cash-flow-statement

financial-statement-pdf-extractor's Introduction

PDF Financial Statement Extractor

This Python script extracts tables containing specific keywords, such as "Revenue" and "Income," from a collection of PDF files in the specified input directory and saves the extracted tables as Excel files in the specified output directory.

Features

  • Extract tables with specific keywords from PDF files
  • Parallel processing for faster extraction
  • Customizable regex pattern for keyword search
  • Error handling and logging for better traceability
  • Supports specifying input and output directories
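To illustrate the keyword search feature, here is a minimal sketch of how a lookahead pattern of the kind the script uses by default ('^(?s:(?=.*Revenue)|(?=.*Income))') behaves. It uses Python's re module purely for demonstration, whereas the script itself passes the pattern to pdfgrep, whose PCRE engine may differ in edge cases. The helper name mentions_keyword is hypothetical and not part of the project.

```python
import re

# Default pattern quoted in the README: matches at the start of the text
# when "Revenue" or "Income" appears anywhere after it.
DEFAULT_PATTERN = r"^(?s:(?=.*Revenue)|(?=.*Income))"


def mentions_keyword(text: str, pattern: str = DEFAULT_PATTERN) -> bool:
    """Return True if the pattern matches the text (hypothetical helper)."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    samples = [
        "Total Revenue 2023",         # contains "Revenue"
        "Statement\nNet Income",      # keyword on a later line, found via (?s:)
        "Total Assets",               # no keyword
        "total revenue",              # lowercase: the pattern is case-sensitive
    ]
    for sample in samples:
        print(repr(sample), mentions_keyword(sample))

    # A custom, case-insensitive pattern for a different keyword.
    print(mentions_keyword("Cash Flow Statement", r"(?i)cash flow"))
```

Because the default pattern is case-sensitive, a custom pattern with the (?i) flag may be worth considering when reports use inconsistent capitalization.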

Installation

Dependencies

  • Python 3.7 or higher
  • pdfgrep (system package)

Steps

  1. Clone the repository or download the script:
git clone financial-statement-pdf-extractor.git

  2. Install the Python dependencies using pip:
pip install -r requirements.txt

  3. Install the pdfgrep package using your system's package manager.

For Ubuntu:
sudo apt-get install pdfgrep

For macOS:
brew install pdfgrep

Usage

Run the script with the -i and -o options, as in the example below. Replace input_directory with the path to the directory containing the PDF files you want to process, and output_directory with the path to the directory where the extracted tables should be saved.

Optional Arguments

  • -p, --processes: Number of parallel processes (default: number of CPU cores)
  • -r, --regex: Custom regex pattern for searching specific keywords in PDF files (default: '^(?s:(?=.*Revenue)|(?=.*Income))')

For example, to use a custom regex pattern and specify the number of parallel processes, run the script as follows:

python script.py -i input_directory -o output_directory -r 'your_custom_pattern' -p 4
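For readers who want to see how such a command line might be wired up, the following is a minimal argparse sketch of the options described above. It is an illustration under stated assumptions, not the script's actual code: the long option names --input and --output and the function name build_parser are hypothetical, while -i, -o, -p/--processes, and -r/--regex with their defaults come from this README.

```python
import argparse
import os

# Default pattern quoted in the README.
DEFAULT_PATTERN = r"^(?s:(?=.*Revenue)|(?=.*Income))"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract keyword-matching tables from PDF files."
    )
    # The long names --input and --output are assumptions; the README only shows -i and -o.
    parser.add_argument("-i", "--input", required=True,
                        help="directory containing the PDF files")
    parser.add_argument("-o", "--output", required=True,
                        help="directory for the extracted Excel files")
    parser.add_argument("-p", "--processes", type=int, default=os.cpu_count(),
                        help="number of parallel processes")
    parser.add_argument("-r", "--regex", default=DEFAULT_PATTERN,
                        help="regex pattern used for the keyword search")
    return parser


if __name__ == "__main__":
    # Parse an example argument list instead of sys.argv for demonstration.
    args = build_parser().parse_args(
        ["-i", "input_directory", "-o", "output_directory", "-p", "4"]
    )
    print(args.input, args.output, args.processes, args.regex)
```

Defaulting --processes to os.cpu_count() matches the documented "number of CPU cores" default; whether the script computes it the same way is not confirmed here.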

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Please feel free to open an issue or submit a pull request if you would like to contribute to the project or have any suggestions for improvements.

financial-statement-pdf-extractor's Issues

pages not getting

        cmd = "pdfgrep -Pn '^(?s:(?=.*Revenue)|(?=.*Income))' " + pdf + " | awk -F\":\" '$0~\":\"{print $1}' | tr '\n' ','"
        pages = subprocess.check_output(cmd, shell=True).decode("utf-8")
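The snippet above builds a shell pipeline by string concatenation, so a PDF path containing spaces or shell metacharacters would break the command. This may or may not be the cause of the reported problem. As a hedged alternative, the sketch below calls pdfgrep with an argument list and extracts the page numbers in Python instead of awk and tr. The function names parse_pages and find_pages are hypothetical, and the code assumes that pdfgrep -n prefixes each matching line with "page:" and follows grep's exit-code convention (1 meaning no match).

```python
import subprocess
from typing import List


def parse_pages(pdfgrep_output: str) -> List[int]:
    """Collect unique page numbers from 'page:text' lines, in order of appearance."""
    pages: List[int] = []
    for line in pdfgrep_output.splitlines():
        prefix, sep, _ = line.partition(":")
        # Skip lines without a colon or without a numeric page prefix.
        if sep and prefix.strip().isdigit():
            page = int(prefix)
            if page not in pages:
                pages.append(page)
    return pages


def find_pages(pdf_path: str, pattern: str) -> List[int]:
    """Run pdfgrep without a shell so the path needs no quoting."""
    result = subprocess.run(
        ["pdfgrep", "-Pn", pattern, pdf_path],
        capture_output=True,
        text=True,
    )
    # Assumption: like grep, exit code 1 means "no match" rather than an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return parse_pages(result.stdout)


if __name__ == "__main__":
    # Demonstrate the parsing step on sample text, without needing pdfgrep.
    sample = "3: Total Revenue\n3: Net Income\n12: Income tax\nno colon here\n"
    print(parse_pages(sample))
```

Unlike the original pipeline, which emits one entry per matching line, this version returns each page only once.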
