cat-lemonade / pdfdataextractor

A toolkit for automatically extracting semantic information from PDF files of scientific articles

Home Page: https://pdfdataextractor.readthedocs.io/en/latest/

License: MIT License

Python 100.00%

pdfdataextractor's Issues

Can't install on macOS Sonoma 14.0

python setup.py install results in:

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

Running Python code from the demo notebook raises an error

ModuleNotFoundError Traceback (most recent call last)
/workspaces/PDFDataExtractor2/demo/PDE Demo.ipynb Cell 10 line 1
----> 1 from pdfdataextractor import Reader

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/pdfdataextractor/__init__.py:2
1 # # -*- coding: utf-8 -*-
----> 2 from .extraction import *

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/pdfdataextractor/extraction.py:13
11 from pdfminer.pdfpage import PDFPage
12 from collections import Counter, OrderedDict
---> 13 from templates import *
16 class Reader:
17 """Reader that reads in PDF files"""

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/templates/__init__.py:1
----> 1 from .elsevier import ElsevierTemplate
2 from .royal_society_of_chemistry import RoyalSocietyChemistryTemplate
3 from .american_chemical_society import AmericanChemicalSocietyTemplate

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/templates/elsevier.py:5
3 from pdfminer.pdfparser import PDFParser
4 from pdfminer.pdfdocument import PDFDocument
----> 5 from chemdataextractor.doc import Paragraph
6 import re
9 class ElsevierTemplate(Methods):

Installing chemdataextractor or chemdataextractor2 also fails in a Codespaces virtual environment with Python 3.10:
Building wheels for collected packages: DAWG
Building wheel for DAWG (setup.py) ... error
error: subprocess-exited-with-error
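
Both failures above come down to a dependency that is missing or will not build. As a minimal sketch, the following checks which top-level modules can be found before the notebook is run. The module names are taken from the imports in the traceback; missing_modules is a hypothetical helper and not part of pdfdataextractor.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the imports shown in the traceback above.
    required = ["pdfminer", "chemdataextractor"]
    for name in missing_modules(required):
        print(f"missing: {name}")
```

Running this with the same interpreter as the notebook kernel shows whether chemdataextractor is actually visible there; it does not explain why the DAWG wheel fails to build.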

Getting None from file.read_file(path)

Hello! I'm running the following code and getting the error message below:

from pdfdataextractor import Reader
path = r'/Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/test.pdf'
file = Reader()
pdf = file.read_file(path)
pdf.test()
Reading:  /Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/test.pdf
*** Elsevier detected ***
Traceback (most recent call last):
  File "/Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/run.py", line 5, in <module>
    pdf.test()
AttributeError: 'NoneType' object has no attribute 'test'

pdf = file.read_file(path) takes ~12 seconds and detects Elsevier as you can see, but test always gets None. print(pdf) returns None as well, so that tracks :) Any tips would be helpful!

Thanks,

Mark
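
Until the cause is known, one option is to check the return value before using it, so that the failure points at read_file rather than at the later method call. This is only a sketch: read_file_or_raise is a hypothetical helper, and it assumes nothing beyond what the report shows, namely that read_file(path) can return None.

```python
from typing import Any


def read_file_or_raise(reader: Any, path: str) -> Any:
    """Call reader.read_file(path) and fail loudly if nothing comes back."""
    result = reader.read_file(path)
    if result is None:
        # The report shows read_file returning None after template detection;
        # raising here keeps the error next to its source.
        raise RuntimeError(f"read_file returned None for {path!r}")
    return result


# Intended use with the code from the report (assumes pdfdataextractor is installed):
#   from pdfdataextractor import Reader
#   pdf = read_file_or_raise(Reader(), path)
#   pdf.test()
```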

Adding to this repo

Hey!

I'm working on a project with DeSci Labs (desci.com) that focuses on generating metadata automatically. We were hoping to use your repo as a basis for extracting metadata from the text itself (including combining its functionality with LLMs and some other tooling).

Right now, we've just forked your repository and are keeping any changes we make separate. But we wanted to check in to see if you would be interested in us adding any of our changes to your repo to make it more robust.

Thanks!

Great tool but only able to run with python3.8

Hi!
Seems like a great package for what I need!

I struggled to install it and only got it working after downgrading to Python 3.8.
The package that caused the problem was chemdataextractor.

Is it necessary for the overall code? I only need the text so far, so maybe the package is not strictly necessary and I'll be able to keep using more recent versions of Python.
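
For illustration, a common way to make such a dependency optional is a guarded import, so that text-only use keeps working when the package is absent. This is a sketch of the general pattern, not how pdfdataextractor is currently written (the traceback in the notebook issue shows its templates importing chemdataextractor unconditionally). HAVE_CDE and require_chemdataextractor are hypothetical names.

```python
try:
    # The same import that appears in the traceback from templates/elsevier.py.
    from chemdataextractor.doc import Paragraph
    HAVE_CDE = True
except ImportError:
    # Package not installed: remember that, but do not fail at import time.
    Paragraph = None
    HAVE_CDE = False


def require_chemdataextractor() -> None:
    """Raise a clear error when a chemistry feature is requested without the package."""
    if not HAVE_CDE:
        raise RuntimeError(
            "chemdataextractor is not installed; chemistry features "
            "(for example chem=True) are unavailable."
        )
```

With this pattern, only the code paths that actually need chemistry parsing would call require_chemdataextractor(); plain text extraction would not depend on it.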

r = pdf.abstract(chem=True) only possible for abstract?

Hey,

It seems to me that only the abstract method can currently be used with chemdataextractor when chem is True. Is there an implemented way to screen the whole PDF text with chemdataextractor?

I can of course call the plaintext() method, take the full PDF text string, and run it through chemdataextractor manually. But, as mentioned above, I was wondering whether PDFDataExtractor has a built-in way to do that.

Greetings,
Viktor
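
The manual route described above could be sketched as follows: split the plain text into paragraphs and pass each one to an extractor function. split_paragraphs and screen_text are hypothetical helpers, not part of PDFDataExtractor. The chemdataextractor usage in the trailing comment (Paragraph(...).cems with a .text attribute on each mention) is an assumption about that library's API and should be checked against its documentation.

```python
import re
from typing import Callable, Iterable, List


def split_paragraphs(text: str) -> List[str]:
    """Split plain text on blank lines and normalise whitespace inside each chunk."""
    chunks = (" ".join(block.split()) for block in re.split(r"\n\s*\n", text))
    return [chunk for chunk in chunks if chunk]


def screen_text(text: str, extract: Callable[[str], Iterable[str]]) -> List[str]:
    """Apply extract to every paragraph and collect unique results in first-seen order."""
    found: List[str] = []
    for paragraph in split_paragraphs(text):
        for item in extract(paragraph):
            if item not in found:
                found.append(item)
    return found


# Possible use with chemdataextractor (assumed API, requires the package):
#   from chemdataextractor.doc import Paragraph
#   names = screen_text(pdf.plaintext(),
#                       lambda p: [cem.text for cem in Paragraph(p).cems])
```

Passing the extractor in as a function keeps the splitting logic testable without chemdataextractor installed.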
