impira / docquery


An easy way to extract information from documents

License: MIT License

Makefile 0.99% Python 95.62% Jupyter Notebook 1.47% JavaScript 1.93%

docquery's Issues

Running the Google Colab throws exception

When running the notebook top to bottom, we get the following error for the second cell:

document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-02-05 19:28:49,801 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
    return args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
    response = nlp(question=q, **d.context)
  File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
    return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
    is_last = item.pop("is_last")
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

I get the same exception when running the following with my local install.

docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf

I would really appreciate a fix, since I'm really curious to try out docquery.

Regards!

Support "no classes" in the document classifier

See context here

If we want to extend the model to be able to predict "No value", this can be done pretty easily by adding an additional mask here: outputs > 0 (this corresponds to a sigmoid output of > 0.5, the typical threshold for binary classification). In other words, the results returned by postprocess_standard() would only include the top_k predictions whose logit was > 0.

I'm not sure how to extend this to account for when len(model_outputs) > 1, though. Something I'll have to think about for a bit.
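The single-output thresholding described above can be sketched in plain Python. This is not docquery's postprocess_standard(); the function name `top_k_positive` and the list-based inputs are assumptions made purely for illustration. It relies only on the fact stated above: a logit > 0 is equivalent to a sigmoid output > 0.5.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def top_k_positive(logits, labels, top_k=1):
    """Hypothetical helper: keep only classes whose logit is > 0, then take the top_k.

    Returns an empty list when no logit clears the threshold, which plays
    the role of a "no class" prediction.
    """
    # logit > 0  <=>  sigmoid(logit) > 0.5
    kept = [(label, logit) for label, logit in zip(labels, logits) if logit > 0]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [{"label": label, "score": sigmoid(logit)} for label, logit in kept[:top_k]]
```

With logits `[-1.2, 0.3, 2.0]` only the last two classes survive the mask; with all-negative logits the result is empty.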

Transformers 4.26.0 onwards not working

Whenever I try to run a query on a document with transformers>=4.26 an exception is raised:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 5
      3 doc = document.load_document("https://templates.invoicehome.com/invoice-template-us-neat-750px.png")
      4 for q in ["What is the invoice number?", "What is the invoice total?"]:
----> 5     print(q, p(question=q, **doc.context))

File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/docquery/ext/pipeline_document_question_answering.py:232, in DocumentQuestionAnsweringPipeline.__call__(self, image, question, **kwargs)
    229 else:
    230     normalized_images = [(image, None)]
--> 232 return super().__call__({"question": question, "pages": normalized_images}, **kwargs)

File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/base.py:1076, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1074     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1075 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1076     return next(
   1077         iter(
   1078             self.get_iterator(
   1079                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1080             )
   1081         )
   1082     )
   1083 else:
   1084     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:124, in PipelineIterator.__next__(self)
    121     return self.loader_batch_item()
    123 # We're out of items within a batch
--> 124 item = next(self.iterator)
    125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".

File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:291, in PipelinePackIterator.__next__(self)
    289     else:
    290         item = processed
--> 291         is_last = item.pop("is_last")
    292         accumulator.append(item)
    293 return accumulator

File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/utils/generic.py:281, in ModelOutput.pop(self, *args, **kwargs)
    280 def pop(self, *args, **kwargs):
--> 281     raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")

Exception: You cannot use ``pop`` on a ModelOutput instance.

[requirements.txt](https://github.com/impira/docquery/files/11320270/requirements.txt)
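The report above ties the failure to transformers>=4.26. A guard along the following lines could flag an affected environment before a query is run. It is a minimal sketch: the 4.26.0 cutoff is taken from this report rather than verified independently, and the helper names `parse_version` and `is_affected` are hypothetical.

```python
import re
from importlib import metadata

# Assumption from the report above: failures begin with transformers 4.26.0.
FIRST_AFFECTED = (4, 26, 0)


def parse_version(version: str) -> tuple:
    """Return the leading numeric release parts, e.g. '4.26.0.dev0' -> (4, 26, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def is_affected(version: str) -> bool:
    """True when the version is at or above the cutoff named in the report."""
    return parse_version(version) >= FIRST_AFFECTED


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "likely affected" if is_affected(installed) else "below the reported cutoff"
        print(f"transformers {installed}: {status}")
```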

Metric of Docquery's score result

Hello, I have a question regarding docquery's score result.
Sometimes I get a score around 0.8-0.9, and sometimes the opposite. Could you explain what the score is based on? Is it based on WER or CER? Is it a confidence score? Thank you.

How to Uninstall this along with all Downloaded Files for pipeline?

Just the title. Multiple files were downloaded upon first use. I wonder where they go and how the package and those files can be deleted in the future.

Quickstart CLI not working with PDFs

Steps to reproduce (Following the QuickStart (CLI) guide):

Run pip install docquery
Run apt-get install tesseract-ocr
Run docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

Observe error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Fix: Install `poppler-utils`

My environment:
Mac OS Apple Silicon
Ran via the Python:3 Docker image.

It may be worth adding to the README to install poppler-utils. I'm happy to open a PR for this - also happy to open a PR for a basic Docker configuration if that's something you would like.

This is my first open-source contribution so apologies if I've missed some formalities - and nice project!
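The missing-Poppler failure reported above could be detected up front by checking whether the relevant executable is on PATH. The sketch below uses only the standard library. The tool name `pdfinfo` comes from the related error messages quoted on this page, and the helper name `missing_pdf_tools` is hypothetical.

```python
import shutil


def missing_pdf_tools(tools=("pdfinfo",)):
    """Return the subset of the given executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_pdf_tools()
    if missing:
        # On Debian/Ubuntu the report above resolved this with poppler-utils.
        print("Missing executables:", ", ".join(missing))
    else:
        print("Poppler command-line tools found")
```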

Donut dependencies

I installed docquery in a fresh python environment and followed the readme to run it using donut but ran into dependency issues. I had to install the following libraries to get it to work:

sentencepiece==0.1.97
protobuf==3.20.3

Just wanted to share this in case others run into the same problems.

issue while running the code

print(q, p(question=q, **doc.context))

Exception: You cannot use ``pop`` on a ModelOutput instance.

Already installed TensorRT, still not working

2023-06-17 18:30:01.207852: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I am working on Google Colab.

I also get an error when I use pipeline.getpipline.
It says it's an attribute error and that the object doesn't have such a method.

Import error when following example

Top of my file:
from docquery import pipeline, document

When running this, I see the following error:

ImportError: cannot import name 'pipeline' from partially initialized module 'docquery' (most likely due to a circular import) (/Users/swaraj/sempre-repos/swaraj-sandbox/docquery.py)
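The path at the end of the error above ends in docquery.py, which suggests the script itself shares the package's name and is being imported in place of the installed package. A minimal check for that situation, assuming this reading of the error is correct (the helper name `shadows_package` is hypothetical):

```python
from pathlib import Path


def shadows_package(script_path: str, package_name: str) -> bool:
    """True when the script's file name (without extension) equals the package name."""
    return Path(script_path).stem == package_name


if __name__ == "__main__":
    # Check the currently running script against the package it wants to import.
    if shadows_package(__file__, "docquery"):
        print("This script is named like the package; rename it to avoid a circular import.")
```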

module 'os' has no attribute 'geteuid' on Windows

This might be a low-priority issue, but os.geteuid() is only available for unix-like systems, so this line throws an error on Windows:

docquery/src/docquery/web.py

Line 43 in a9c5127

if os.geteuid() == 0:

I don't know how you can check for root user in Windows, but it might be a good idea to surround this with a try-catch, or check os for geteuid() with hasattr().
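The hasattr()-style check suggested above might look like the following. This is a sketch rather than the project's actual fix: treating platforms without os.geteuid as "not root" is an assumption, and the function name `running_as_root` is hypothetical.

```python
import os


def running_as_root() -> bool:
    """Return True only when os.geteuid exists and reports uid 0."""
    geteuid = getattr(os, "geteuid", None)
    if geteuid is None:
        # Assumption: on platforms without geteuid (e.g. Windows) report non-root.
        return False
    return geteuid() == 0
```

Callers can then write `if running_as_root():` in place of `if os.geteuid() == 0:` without raising AttributeError on Windows.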

from docquery cannot import Document

I already installed docquery, but it cannot import Document from docquery. This is the error it shows. I'm doing this on Colab.

in <cell line: 1>()
----> 1 from docquery import Document

ImportError: cannot import name 'Document' from 'docquery' (/usr/local/lib/python3.10/dist-packages/docquery/__init__.py)

device logic failing in colab

Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/__main__.py", line 59, in main
    return args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/scan.py", line 47, in main
    nlp = get_pipeline(args.checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/docquery/pipeline.py", line 79, in get_pipeline
    **pipeline_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/__init__.py", line 767, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/docquery/ext/pipeline.py", line 87, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py", line 768, in __init__
    self.device = device if framework == "tf" else torch.device("cpu" if device < 0 else f"cuda:{device}")
TypeError: '<' not supported between instances of 'str' and 'int'
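The last frame compares `device < 0`, so it expects an integer index where a negative value means CPU, but received a string. One way to bridge the two conventions is to normalize the value before it reaches that comparison. The helper below is hypothetical and is not part of docquery or transformers. The accepted string forms ("cpu", "cuda", "cuda:N") are assumptions about what a caller might pass.

```python
def normalize_device(device) -> int:
    """Hypothetical helper: map a device spec to an integer index (-1 means CPU)."""
    if isinstance(device, int):
        return device
    text = str(device).strip().lower()
    if text == "cpu":
        return -1
    if text == "cuda":
        return 0  # assumption: bare "cuda" means the first GPU
    if text.startswith("cuda:"):
        return int(text.split(":", 1)[1])
    raise ValueError(f"unsupported device spec: {device!r}")
```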

support document type

Hi! I want to ask whether DocQuery supports .json files as document input.

Support text-based question answering pipelines

We've had several requests (e.g. this one) to support international documents in DocQuery. One way to do that is by extending DocQuery to support traditional QA models too (by rendering the document's text as plain text). This will not be as accurate as document-specific models but may work well for a number of tasks.

UI RefExp task

Congrats on the great work with DocQuery, Impira team and congrats on joining Figma.

Have you looked at transfer learning DocVQA into UI RefExp? Natural language Referring Expression for the location of the bounding box of a UI element.

I have had some success with Donut, but it is converging slower than I anticipated. After a couple of weeks on a single A100 GPU, it is still improving slowly, with IoU at 47% (about a 10% improvement in the last week). IoU (intersection over union) of predicted vs. ground-truth bounding boxes seems to be a reasonable validation metric for this task.
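For reference, the IoU metric mentioned above can be computed as follows. This is a generic sketch, not the code from the linked notebook, and it assumes boxes are given as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```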

Curious what your experience with the task might have been and whether I am approaching the problem from the wrong angle.

Here is my Huggingface space with links to fine tuning notebook.
https://huggingface.co/spaces/ivelin/ui-refexp

Regards!

Doesn't work with PDF

docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

Errors:

FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Add Gradio demo to Dockerfile

We should support running https://huggingface.co/spaces/impira/docquery in the Docker container

Support brew install

For users who want to quick start with docquery and not figure out dependencies (e.g. tesseract), it'd be great to have a single brew install command that installs docquery and its dependencies.

Unable to load certain websites in DocQuery on Hugging Face

Examples

https://usdtoreros.com/sports/womens-volleyball/schedule/2022

Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b0d6014e89 <unknown>

https://www.nytimes.com/

Message: unknown error: session deleted because of page crash
from tab crashed
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b74c4eae89 <unknown>

https://www.ikea.com/us/en/cat/sectionals-16238/

Message: unknown error: cannot activate web view
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b74c4eae89

How to get the currency name (USD, IND) with DocQuery?

@ankrgyl
I have tried multiple ways to extract the currency name, but no luck.
Can you suggest a matching query (question) to extract the currency from the doc?

Fine Tune for document question answering

Hi, I am trying to fine-tune impira/layoutlm-document-qa for document question answering tasks. I am unable to find any relevant code or examples for this. Most content is based on LayoutLMv2/v3, but those differ in architecture and encoding, so I can't use it for fine-tuning. Any help is appreciated.

CLI for Word document (.docx )

Need CLI for Word document, to read particular content in tables

Table and content tag will be always same.

Just need to read corresponding tag content and display it

Need to read all tags and display at once

Illegal instruction: 4

$ docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png
Illegal instruction: 4

Just got this running on Mac M1, is it supported?

Any suggestions to resolve would be helpful 🙏

Warm regards

LayoutLM Squad Training

What approach did you use for fine-tuning LayoutLM on SQuAD?

Since there is no visual element to SQuAD, did you input blank boxes or null boxes? Or maybe generate synthetic documents?

failed to import pipeline

from docquery import document, pipeline

gets me

Failed to import docquery.transformers_patch because of the following error (look up to see its traceback):
Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'AutoModelForDepthEstimation' from 'transformers.models.auto.modeling_auto' (/Users/theobouwman/dev/projects/test/venv/lib/python3.10/site-packages/transformers/models/auto/modeling_auto.py)

How can I use my checkpoints in the pipeline

When I use my own trained DocVQA model as the checkpoint, it throws an error about the config.json file. How can I use a model trained with transformers DocVQA with the library, and return checkpoints?

The web tests are flaky

Specifically the question/answering over the README file. We should pick a "simpler" web page that can reliably return a result.

Support document classification

This issue tracks support for document classification, e.g. identifying that a document is an invoice.

Issue with docquery document pydantic

Hi, I'm facing an issue with pydantic:
from pydantic.fields import ModelField
ImportError: cannot import name 'ModelField' from 'pydantic.fields'

This happens with Python 3.8, 3.9, and 3.10. Please suggest a solution.

Multiline extraction

The problem with this package is that it can't extract multiline answers. For example, when a description spans 2 or more lines, docquery can't extract that information. I need a solution for this issue.

Question about how to encode the whole doc-context and store it for future retrieval

Hi! I have a question about how to store the doc.context used in this command: print(q, p(question=q, **doc.context)). It takes a long time to encode the context file for each query, so it would be better to pre-encode the context file and have docquery retrieve from the encoded context. Thank you!

OSError: [WinError 123]

(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
Traceback (most recent call last):
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\Scripts\docquery.exe\__main__.py", line 7, in <module>
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\__main__.py", line 61, in main
    return args.func(args)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\scan.py", line 54, in main
    if pathlib.Path(args.path).is_dir():
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1373, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1183, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https:\news.ycombinator.com'
(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
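The traceback above shows a URL being handed to pathlib.Path(...).is_dir(), which Windows rejects. A sketch of one way to avoid that is to check for an http(s) scheme before touching the filesystem. The helper names `is_url` and `is_local_dir` are hypothetical and this is not the project's actual fix.

```python
import pathlib
from urllib.parse import urlparse


def is_url(target: str) -> bool:
    """True for http(s) URLs. A Windows drive letter parses as a scheme but is not matched."""
    return urlparse(target).scheme in ("http", "https")


def is_local_dir(target: str) -> bool:
    """Only consult the filesystem when the target is not a URL."""
    if is_url(target):
        return False
    return pathlib.Path(target).is_dir()
```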

no matches found: docquery[donut]

Running this command to use donut, pip install docquery[donut], seems to throw an error: no matches found: docquery[donut]
A simple workaround is to manually install these two libraries:

pip install sentencepiece==0.1.97
pip install protobuf==3.20.3
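"no matches found" is the message zsh prints when an unquoted bracket pattern matches no files, so the error may come from the shell rather than from pip (an assumption, since the report does not name the shell). Passing the requirement as a single argument, with no shell involved, sidesteps the issue. The helper names below are hypothetical.

```python
import subprocess
import sys


def pip_install_command(requirement: str) -> list:
    """Build an argument list for pip; no shell is involved, so brackets are not globbed."""
    return [sys.executable, "-m", "pip", "install", requirement]


def install(requirement: str) -> None:
    """Run pip for the given requirement, raising if pip exits with an error."""
    subprocess.run(pip_install_command(requirement), check=True)
```

Calling `install("docquery[donut]")` would pass the extras specifier through unchanged. Quoting the argument in the shell achieves the same effect.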

Support scraping webpages in docquery

A few folks have mentioned using DocQuery to scrape web pages (e.g. real-estate websites). This issue is meant to track that project and collect more information about use cases.

Colab error: kwargs unexpected keyword argument: 'ocr_reader_name' (type=type_error

I just switched to GPU and did a "run all", and this is the error I get.

Default Experience Should Not Require Poppler for PDFs

PDFs take advantage of Poppler to create image previews; however, these are unnecessary if the file has embedded text for certain models (e.g. LayoutLMv1). We should make sure that the default scenario of poppler not being available still works.

Could you please provide a notebook showing how we can train the model on custom data?

How to output top 3/5 answers?

Hello! May I ask how to adjust the number of output answers? I would like to get more than ONE answer. Thank you!

Publish docker image to Dockerhub

(CC @amazingvince) Once PR #12 lands, we should publish the dockerfile to Dockerhub for easy consumption.

Cannot use pop on a ModelOutput instance. (transformers)

Hi, I had a problem. When I run the file I get an error:
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 376, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

I saw here on GitHub that some people solve this by downgrading the transformers library to version 4.23.0, but when I try to do the same thing I get an error on installation:

ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Any ideas?

I saw on the internet that some ppl solve this by installing rust, but I get the same error.

Thanks

Exception: You cannot use ``pop`` on a ModelOutput instance.

Hello

I'm getting the following error when trying the google colab demo
https://colab.research.google.com/github/amazingvince/docqa/blob/updating_colab/docquery_example.ipynb
(Same issue on my m1 macbook)

!docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf

2023-01-30 19:00:54,639 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
    return args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
    response = nlp(question=q, **d.context)
  File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
    return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
    is_last = item.pop("is_last")
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

DocQuery has difficulty pulling concepts from different parts of a document

https://www.animalearn.org/img/pdf/animalFacts.pdf

Question: which animals are mentioned in this document?
DocQuery's answer: Tiny animals!
Correct answer: Cat, rat, pig, earthworm, crayfish.

Stuck on Loading pipelines step

Hi,

I'm trying to use docquery via command line to ask a question on a PDF file but the process seems to get stuck at the Loading pipelines. step.

> docquery scan "What can I use to open a PDF file?" pdf-sample1.pdf
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-04-05 09:47:24,078 INFO: Loading pdf-sample1.pdf
2023-04-05 09:47:24,120 INFO: Done loading 1 file(s).
2023-04-05 09:47:24,121 INFO: Loading pipelines.

Also, when loading as library, the notebook gets stuck at this line

p = pipeline('document-question-answering')

Could you help me resolve this? I'm on Windows 10 btw.

Doc Query Document types ?

How many document types (invoices, etc.) does the DocQuery engine currently handle?

pipeline on list based input

pipeline([{"image": image, "question": question}]) input format fails.

The pipeline treats the list as image and fails to extract question from dict on line 196 in pipeline.
The error is "TypeError: list indices must be integers or slices, not str"

Example:
# Minimal reproduction of the list-input failure
from docquery import document, pipeline

p = pipeline.get_pipeline()
doc = document.load_document("example.pdf")
print(p([{**doc.context, 'question': "Who authored this paper?"}]))
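Until list input is handled, one possible workaround is to call the pipeline once per item using the keyword form that appears elsewhere in these reports (`p(question=q, **doc.context)`). The sketch below is deliberately generic: `run_each` is a hypothetical helper, and it assumes only that the pipeline object is callable with keyword arguments.

```python
def run_each(pipe, items):
    """Call `pipe` once per dict, unpacking each dict as keyword arguments."""
    return [pipe(**item) for item in items]
```

With the names from the example above, this would be used as `run_each(p, [{**doc.context, "question": "Who authored this paper?"}])`.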