seerlabs / pdfmef

Multi-Entity Extraction Framework for Academic Documents (with default extraction tools)

License: Apache License 2.0

Python 28.30% Perl 46.46% Shell 0.11% XS 4.16% HTML 13.81% Prolog 7.17%

pdfmef's Issues

ExpatError: unbound prefix

This happens on the first PDF file I tried to extract. The file can be downloaded from here

The error is not printed on the console; it is written to

000.000.001/000.000.001.header.tei

<?xml version='1.0' encoding='UTF-8'?>
<error>ExpatError: unbound prefix: line 2, column 0</error>
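For context, expat raises "unbound prefix" when an element or attribute uses a namespace prefix that was never declared with `xmlns:...`. The sketch below reproduces that class of error with the standard library only. The TEI-like fragments and the `xlink` prefix are made up for illustration; which prefix is actually missing in the PDFMEF output is not known from this report.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def xml_error(text):
    """Return the parser's error message for `text`, or None if it parses."""
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError as exc:
        return str(exc)
    return None


# Hypothetical TEI-like fragment: the `xlink` prefix is used but never declared.
unbound = '<TEI>\n<ref xlink:href="a.pdf"/>\n</TEI>'

# Same fragment with the prefix bound to a namespace URI on the root element.
bound = ('<TEI xmlns:xlink="http://www.w3.org/1999/xlink">\n'
         '<ref xlink:href="a.pdf"/>\n</TEI>')

print(xml_error(unbound))  # an "unbound prefix" message with a line/column
print(xml_error(bound))    # None, because the prefix is now declared
```

If the intermediate XML produced by the header extractor looks like the first fragment, declaring the missing namespace on the root element before parsing would be one way to get past the error.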

file_name_to_id error

I put a PDF file named WuJWEBSCI2012-crawling.pdf in the repository and configured PDFMEF to load from the file system, and got the following error. It looks like the current code expects file names made up only of numeric characters, but this is not the case in general.

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    ids = wrapper.get_document_ids()
  File "/home/jxw394/github/pdfmef/src/extractor/python_wrapper/wrappers.py", line 75, in get_document_ids
    ids.append(utils.file_name_to_id(docPath[docPath.rfind('/') + 1 : docPath.rfind('.pdf') + 4]))
  File "/home/jxw394/github/pdfmef/src/extractor/python_wrapper/utils.py", line 10, in file_name_to_id
    return int(ID)
ValueError: invalid literal for int() with base 10: 'WuJWEBSCI2012-crawling'
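One possible workaround, sketched below, is to treat non-numeric names as "no ID" instead of letting `int()` raise. This is only a sketch: the real `utils.file_name_to_id` is not shown in the traceback beyond its final `return int(ID)`, so the name `file_name_to_id_safe`, the dot-stripping step (guessed from IDs such as `000.000.001`), and the helper `split_by_id` are all assumptions.

```python
import os


def file_name_to_id_safe(file_name):
    """Return an integer ID for names like '000.000.001.pdf', else None.

    Assumption: IDs are digit groups separated by dots, as in '000.000.001'.
    """
    stem = os.path.splitext(os.path.basename(file_name))[0]
    try:
        return int(stem.replace('.', ''))
    except ValueError:
        # Names such as 'WuJWEBSCI2012-crawling' are not numeric.
        return None


def split_by_id(paths):
    """Separate paths into (id, path) pairs and a list of skipped paths."""
    with_ids, skipped = [], []
    for path in paths:
        doc_id = file_name_to_id_safe(path)
        if doc_id is None:
            skipped.append(path)
        else:
            with_ids.append((doc_id, path))
    return with_ids, skipped


with_ids, skipped = split_by_id([
    '/data/000.000.001.pdf',
    '/data/WuJWEBSCI2012-crawling.pdf',
])
print(with_ids)
print(skipped)
```

Skipping such files with a warning, or renaming them to the numeric scheme before running, would at least avoid the crash while a more general ID scheme is worked out.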

No runnable satisfies the requirement for a FullTextTEIExtractor

In python_wrapper/properties.config, I set
fulltext_tei_to_csx: True
and got the following error message. What exactly is a "FullTextTEIExtractor"?

Traceback (most recent call last):
  File "main.py", line 155, in <module>
    runner.run_from_file_batch(files, outputPaths, num_processes=numProcesses, file_prefixes=prefixes)
  File "/usr/lib/python2.6/site-packages/extraction/core.py", line 201, in run_from_file_batch
    e.get()
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
LookupError: No runnable satisfies the requirement for a FullTextTEIExtractor
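The message reads as if one configured step declares a dependency on the result of a FullTextTEIExtractor and no registered step provides it. The toy model below only illustrates that kind of check. It is not the `extraction` package's code, and `ToyRunner`, `register`, `check_dependencies`, and `TeiToCsxConverter` are hypothetical names.

```python
class ToyRunner(object):
    """Toy stand-in for a pipeline runner; NOT the real `extraction` API."""

    def __init__(self):
        self.runnables = []

    def register(self, runnable_cls):
        self.runnables.append(runnable_cls)

    def check_dependencies(self):
        # Every declared dependency must be provided by a registered runnable.
        for cls in self.runnables:
            for dep in getattr(cls, 'dependencies', ()):
                if not any(issubclass(r, dep) for r in self.runnables):
                    raise LookupError(
                        'No runnable satisfies the requirement for a %s'
                        % dep.__name__)


class FullTextTEIExtractor(object):
    """Hypothetical step that would produce full-text TEI."""
    dependencies = ()


class TeiToCsxConverter(object):
    """Hypothetical step that consumes full-text TEI."""
    dependencies = (FullTextTEIExtractor,)


runner = ToyRunner()
runner.register(TeiToCsxConverter)  # consumer enabled, producer missing
try:
    runner.check_dependencies()
except LookupError as exc:
    print(exc)

runner.register(FullTextTEIExtractor)  # producer enabled as well
runner.check_dependencies()            # no exception this time
```

Under that reading, enabling `fulltext_tei_to_csx` without also enabling whatever step produces the full-text TEI would trigger the error, but this is a guess from the message alone.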
