This project forked from weareprestatech/hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

Home Page: https://hotpdf.readthedocs.io/en/latest/

License: MIT License

hotpdf

This project started as an internal project @ Prestatech to parse PDF files in a fast and memory-efficient way, after we ran into difficulties parsing big PDF files with libraries such as pdfquery [Comparison].

hotpdf is a wrapper around pdfminer.six focusing on text extraction and text search operations on PDFs.

hotpdf can be used to find and extract text from PDFs. Please read the docs to understand how the library can help you!

Installation

The latest version of hotpdf can be installed directly from PyPI with pip.

pip install hotpdf
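
To confirm that the install succeeded, the package version can be read back from the environment. This is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of hotpdf.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if hotpdf is installed, otherwise None.
print(installed_version("hotpdf"))
```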

Local Setup

First, install hotpdf and its dependencies in editable mode:

python3 -m pip install -e .

Contributing

You should install the pre-commit hooks with pre-commit install. This will run the linter, mypy, and ruff formatting before each commit.

Remember to run pip install -e '.[dev]' to install the extra dependencies for development.

For more examples of how to run the full test suite please refer to the CI workflow.

We aim for 100% test coverage, although this is not always achievable (e.g., when a test file cannot be made available). If you want your contributions accepted, please write tests for them :D

Some examples of running tests locally:

python3 -m pip install -e '.[dev]'               # install extra deps for testing
python3 -m pytest -n=auto tests/                      # run the test suite
# run tests with coverage
python3 -m pytest --cov-fail-under=96 -n=auto --cov=hotpdf --cov-report term-missing

Documentation

We use Sphinx to generate our docs and host them on Read the Docs.

If your contribution requires it, please update or add documentation.

Update the .rst files, rebuild them, and commit them along with your PRs.

cd docs
make clean
make html

This generates the necessary documentation files. Once your changes are merged to main, the docs are updated automatically.

Usage

For more detailed usage information, please read the docs.

Basic usage is as follows:

from hotpdf import HotPdf

pdf_file_path = "test.pdf"

# Load pdf file into memory
hotpdf_document = HotPdf(pdf_file_path)

# Alternatively, you can also pass an opened PDF stream to be loaded
with open(pdf_file_path, "rb") as f:
   hotpdf_document_2 = HotPdf(f)

# You can also merge multiple HotPdf objects to get one single HotPdf object
merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[hotpdf_document, hotpdf_document_2])

# Get the number of pages
print(len(hotpdf_document.pages))

# Find text
text_occurrences = hotpdf_document.find_text("foo")

# Find text and its full span
text_occurrences_full_span = hotpdf_document.find_text("foo", take_span=True)

# Extract text in the region
text_in_bbox = hotpdf_document.extract_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

# Extract spans in the region
spans_in_bbox = hotpdf_document.extract_spans(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

# Extract spans text in the region
spans_text_in_bbox = hotpdf_document.extract_spans_text(
   x0=0,
   y0=0,
   x1=100,
   y1=10,
   page=0,
)

# Extract full-page text
full_page_text = hotpdf_document.extract_page_text(page=0)
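
The page count and `extract_page_text` calls above can be combined to collect the text of every page. This is a minimal sketch, assuming `extract_page_text` returns a string and that pages are indexed from 0 as in the example above. The helper name `extract_all_text` is hypothetical and not part of hotpdf.

```python
from typing import Any, List


def extract_all_text(document: Any) -> List[str]:
    """Return the text of each page, in page order.

    `document` is expected to behave like the HotPdf object above:
    it exposes `pages` and `extract_page_text(page=...)`.
    Assumption: `extract_page_text` returns a str.
    """
    return [
        document.extract_page_text(page=page_index)
        for page_index in range(len(document.pages))
    ]
```

With a loaded document this would be called as `extract_all_text(hotpdf_document)`.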

Known Issues

  1. (cid:x) characters in text - In some PDFs, some symbols might not be properly decoded during extraction and instead appear as (cid:128).

This is a problem with the pdfminer.six library. We have fixed it on our fork, which you can install using pip. Until the fix is merged into the pdfminer.six repository and released, we recommend that you install our fork manually.

pip install --no-cache-dir git+https://github.com/weareprestatech/pdfminer.six.git@20240222#egg=pdfminer-six
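
To check whether a document is affected, extracted text can be scanned for `(cid:x)` placeholders. This is a minimal sketch using only the standard library. The names `CID_PATTERN` and `find_cid_codes` are hypothetical and not part of hotpdf or pdfminer.six, and the pattern assumes the placeholder always has the form `(cid:<digits>)` shown above.

```python
import re
from typing import List

# Matches placeholders such as "(cid:128)" and captures the numeric code.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cid_codes(text: str) -> List[int]:
    """Return the numeric codes of all (cid:x) placeholders found in `text`."""
    return [int(code) for code in CID_PATTERN.findall(text)]
```

An empty result for every page suggests the document is not affected by this issue.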

License

This project is licensed under the terms of the MIT license.


with ❤️ from the team @ Prestatech GmbH

Contributors

krishnasism, callegarimattia, aptakhin, iodabasi
