cat-lemonade / pdfdataextractor


57.0 57.0 11.0 5.85 MB

A toolkit for automatically extracting semantic information from PDF files of scientific articles

Home Page: https://pdfdataextractor.readthedocs.io/en/latest/?

License: MIT License

Python 100.00%

pdfdataextractor's People

Contributors

Stargazers

Watchers

Forkers

freeenergylab dantemerlino bbastiani dominikusbrian plikt jgmedina95 winfrednyoroka dingyun-huang 5l1v3r1 pkrouth lsvih

pdfdataextractor's Issues

Can't install on macOS Sonoma 14.0

python setup.py install results in:

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

Running Python code from the demo notebook errors

ModuleNotFoundError Traceback (most recent call last)
/workspaces/PDFDataExtractor2/demo/PDE Demo.ipynb Cell 10 line 1
----> 1 from pdfdataextractor import Reader

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/pdfdataextractor/__init__.py:2
1 # -*- coding: utf-8 -*-
----> 2 from .extraction import *

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/pdfdataextractor/extraction.py:13
11 from pdfminer.pdfpage import PDFPage
12 from collections import Counter, OrderedDict
---> 13 from templates import *
16 class Reader:
17 """Reader that reads in PDF files"""

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/templates/__init__.py:1
----> 1 from .elsevier import ElsevierTemplate
2 from .royal_society_of_chemistry import RoyalSocietyChemistryTemplate
3 from .american_chemical_society import AmericanChemicalSocietyTemplate

File ~/.local/lib/python3.10/site-packages/pdfdataextractor-1.0-py3.10.egg/templates/elsevier.py:5
3 from pdfminer.pdfparser import PDFParser
4 from pdfminer.pdfdocument import PDFDocument
----> 5 from chemdataextractor.doc import Paragraph
6 import re
9 class ElsevierTemplate(Methods):

Installing chemdataextractor or chemdataextractor2 fails in a Codespaces virtual environment with Python 3.10:
Building wheels for collected packages: DAWG
Building wheel for DAWG (setup.py) ... error
error: subprocess-exited-with-error

Getting None from file.read_file(path)

Hello! I'm running the following code and getting the error message below:

from pdfdataextractor import Reader
path = r'/Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/test.pdf'
file = Reader()
pdf = file.read_file(path)
pdf.test()

Reading:  /Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/test.pdf
*** Elsevier detected ***
Traceback (most recent call last):
  File "/Users/eightyfour/working/llm-sci/PDFDataExtractor/demo/run.py", line 5, in <module>
    pdf.test()
AttributeError: 'NoneType' object has no attribute 'test'

pdf = file.read_file(path) takes ~12 seconds and detects Elsevier, as you can see, but it always returns None, so pdf.test() fails. print(pdf) prints None as well, so that tracks :) Any tips would be helpful!

Thanks,

Mark
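
One way to surface this earlier is to check the return value before using it. A minimal sketch, assuming only what the snippet above shows (a Reader whose read_file(path) may return None); the helper name read_or_raise is hypothetical and not part of pdfdataextractor:

```python
def read_or_raise(reader, path):
    """Call reader.read_file(path) and fail loudly if nothing comes back.

    `reader` is any object with a read_file(path) method, e.g. the
    Reader() instance from the snippet above. This helper is hypothetical
    and is not part of pdfdataextractor.
    """
    pdf = reader.read_file(path)
    if pdf is None:
        # Without this check the failure only shows up later as
        # "'NoneType' object has no attribute ..." on the first method call.
        raise ValueError(f"read_file returned None for {path!r}")
    return pdf
```

Calling read_or_raise(Reader(), path) would then fail at the read step rather than at pdf.test(). It does not explain why None is returned; that still has to be traced inside read_file itself.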

Adding to this repo

Hey!

I'm working on a project with DeSci labs (desci.com) that focuses on automatically generating metadata. We were hoping to use your repo as a basis for getting metadata from the text itself (including combining the functionality with LLMs and some other tooling).

Right now, we've just forked your repository and are keeping any changes we make separate. But we wanted to check in to see if you would be interested in us adding any of our changes to your repo to make it more robust.

Thanks!

Great tool but only able to run with python3.8

Hi!
Seems like a great package for what I need!

I struggled to install it and only got it working after downgrading to Python 3.8.
The package that created the problem was chemdataextractor.

Is it necessary for the overall code? I'm happy with only text so far, so maybe the package is not strictly necessary and I'll be able to keep using more recent versions of Python.
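
Whether pdfdataextractor can run without chemdataextractor is not settled in this thread; the traceback in the notebook issue shows the templates importing it at module level. If the dependency were made optional, a common pattern is a guarded import. A minimal sketch using only the standard library; optional_import is a hypothetical helper, not part of either package:

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here instead of crashing at import time.
        return None


# Hypothetical usage: gate chemistry features on the package being present.
#   cde = optional_import("chemdataextractor")
#   chem_available = cde is not None
```

Text-only features could then keep working on newer Python versions, with chem=True rejected when the module is missing. This is a design suggestion, not a description of how the package behaves today.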

r = pdf.abstract(chem=True) only possible for abstract?

Hey,

It seems to me that only the abstract() method can currently be used with chemdataextractor (chem=True). Is there a built-in way to screen the whole PDF text with chemdataextractor?

I can of course just call the plaintext() method, take the full PDF text string, and run it through chemdataextractor manually. But as mentioned above, I was wondering if there is a built-in way in PDFDataExtractor to do that.

greetings
viktor
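
The manual workaround described above could look roughly like this. A sketch only, assuming plaintext() returns the full text as one string (as the post says) and that ChemDataExtractor's Paragraph exposes chemical mentions through its cems attribute; the function name chemical_mentions and the paragraph_cls parameter are hypothetical:

```python
def chemical_mentions(text, paragraph_cls=None):
    """Collect chemical entity mentions from a full-text string.

    paragraph_cls defaults to chemdataextractor.doc.Paragraph, imported
    lazily so this function can be defined without the package installed.
    """
    if paragraph_cls is None:
        from chemdataextractor.doc import Paragraph as paragraph_cls

    mentions = []
    # Assumption: blank lines separate paragraphs in the extracted text.
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        # .cems holds the chemical entity mention spans found in the block.
        mentions.extend(span.text for span in paragraph_cls(block).cems)
    return mentions
```

With a document read as in the other issues, chemical_mentions(pdf.plaintext()) would return the mention strings for the whole paper. Splitting on blank lines is a guess about the extracted text layout and may need adjusting.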

