
docquery's People

Contributors

amazingvince, ankrgyl, cclauss, giorgiop, kianmeng, rstebbing, toant13

docquery's Issues

Running the Google Colab throws exception

When running the notebook top to bottom, we get the following error for the second cell:

document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-02-05 19:28:49,801 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
    return args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
    response = nlp(question=q, **d.context)
  File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
    return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
    is_last = item.pop("is_last")
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

I get the same exception when running the following with my local install.

docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf

I would really appreciate a fix, since I'm curious to try out docquery.

Regards!

Support "no classes" in the document classifier

See context here

If we want to extend the model to be able to predict "No value", this can be done pretty easily by adding an additional mask here: outputs > 0 (this corresponds to a sigmoid output of > 0.5, the typical threshold for binary classification). In other words, the results returned by postprocess_standard() would only include the top_k predictions whose respective logit was > 0.

I'm not sure how to extend this to account for when len(model_outputs) > 1, though. That's something I'll have to think about for a bit.
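
A minimal sketch of the masking idea for the single-output case, in plain Python. The function name filter_top_k and the (label, logit) pair format are hypothetical and are not taken from docquery; only the logit > 0 threshold comes from the description above.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; sigmoid(0) == 0.5."""
    return 1.0 / (1.0 + math.exp(-x))


def filter_top_k(scored_labels, top_k=1):
    """Return up to top_k (label, probability) pairs whose logit is > 0.

    scored_labels: list of (label, logit) pairs (hypothetical format).
    An empty result stands for "no class".
    """
    # Keep only predictions above the 0.5 probability threshold (logit > 0).
    kept = [(label, logit) for label, logit in scored_labels if logit > 0]
    # Highest logit first, then truncate to top_k.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [(label, sigmoid(logit)) for label, logit in kept[:top_k]]


predictions = filter_top_k([("invoice", 2.0), ("receipt", -1.0), ("letter", 0.5)], top_k=2)
no_class = filter_top_k([("invoice", -0.3), ("receipt", -1.2)], top_k=2)
```

With this shape, an empty list is the "no classes" answer, so callers need no extra sentinel value.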

Transformers 4.26.0 onwards not working

Whenever I try to run a query on a document with transformers>=4.26 an exception is raised:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 5
      3 doc = document.load_document("https://templates.invoicehome.com/invoice-template-us-neat-750px.png")
      4 for q in ["What is the invoice number?", "What is the invoice total?"]:
----> 5     print(q, p(question=q, **doc.context))

File [~/Documents/repos/doqry/venv/lib/python3.10/site-packages/docquery/ext/pipeline_document_question_answering.py:232](https://file+.vscode-resource.vscode-cdn.net/home/user/Documents/repos/doqry/~/Documents/repos/doqry/venv/lib/python3.10/site-packages/docquery/ext/pipeline_document_question_answering.py:232), in DocumentQuestionAnsweringPipeline.__call__(self, image, question, **kwargs)
    229 else:
    230     normalized_images = [(image, None)]
--> 232 return super().__call__({"question": question, "pages": normalized_images}, **kwargs)

File [~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/base.py:1076](https://file+.vscode-resource.vscode-cdn.net/home/user/Documents/repos/doqry/~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/base.py:1076), in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1074     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1075 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1076     return next(
   1077         iter(
   1078             self.get_iterator(
   1079                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1080             )
   1081         )
   1082     )
   1083 else:
   1084     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File [~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:124](https://file+.vscode-resource.vscode-cdn.net/home/user/Documents/repos/doqry/~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:124), in PipelineIterator.__next__(self)
    121     return self.loader_batch_item()
    123 # We're out of items within a batch
--> 124 item = next(self.iterator)
    125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".

File [~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:291](https://file+.vscode-resource.vscode-cdn.net/home/user/Documents/repos/doqry/~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:291), in PipelinePackIterator.__next__(self)
    289     else:
    290         item = processed
--> 291         is_last = item.pop("is_last")
    292         accumulator.append(item)
    293 return accumulator

File [~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/utils/generic.py:281](https://file+.vscode-resource.vscode-cdn.net/home/user/Documents/repos/doqry/~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/utils/generic.py:281), in ModelOutput.pop(self, *args, **kwargs)
    280 def pop(self, *args, **kwargs):
--> 281     raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")

Exception: You cannot use ``pop`` on a ModelOutput instance.

[requirements.txt](https://github.com/impira/docquery/files/11320270/requirements.txt)
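
A small sketch for checking whether the installed transformers version falls in the range reported in this issue (4.26 and later). The helper names parse_major_minor and is_affected are hypothetical, and the (4, 26) cutoff is taken only from the report above, not from any official compatibility statement.

```python
import re
from importlib import metadata


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '4.26.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version: str) -> bool:
    """True when the version is 4.26 or newer, the range reported in the issue."""
    return parse_major_minor(version) >= (4, 26)


try:
    installed = metadata.version("transformers")
    print(f"transformers {installed}: affected={is_affected(installed)}")
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```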

Metric of Docquery's score result

Hello, I have a question regarding docquery's score result.
Sometimes I get a score around 0.8-0.9, and sometimes the opposite. Could you explain what the score is based on? Is it based on WER or CER? Is it like a confidence score? Thank you.

Quickstart CLI not working with PDFs

Steps to reproduce (Following the QuickStart (CLI) guide):

  • Run pip install docquery
  • Run apt-get install tesseract-ocr
  • Run docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

Observe error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Fix: Install poppler-utils
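
A quick way to check for the missing system tools before running docquery, sketched with the standard library only. The helper name missing_tools is hypothetical; pdfinfo and pdftoppm are the poppler binaries that pdf2image calls, and tesseract is the OCR binary installed in the steps above.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm", "tesseract")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```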

My environment:
Mac OS Apple Silicon
Ran via the Python:3 Docker image.

It may be worth adding to the README to install poppler-utils. I'm happy to open a PR for this - also happy to open a PR for a basic Docker configuration if that's something you would like.

This is my first open-source contribution so apologies if I've missed some formalities - and nice project!

Donut dependencies

I installed docquery in a fresh python environment and followed the readme to run it using donut but ran into dependency issues. I had to install the following libraries to get it to work:

sentencepiece==0.1.97
protobuf==3.20.3

Just wanted to share this in case others run into the same problems.

TensorRT already installed but still not working

2023-06-17 18:30:01.207852: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I am working on Google Colab.

I also get an error when I use pipeline.getpipline:
it raises an AttributeError saying the object doesn't have such a method.

Import error when following example

Top of my file:
from docquery import pipeline, document

When running this, I see the following error:

ImportError: cannot import name 'pipeline' from partially initialized module 'docquery' (most likely due to a circular import) (/Users/swaraj/sempre-repos/swaraj-sandbox/docquery.py)
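
The path at the end of the error message points at a local file named docquery.py rather than at the installed package, which suggests the script itself is shadowing the library. Below is a small sketch for spotting that situation; the helper name shadowing_candidates is hypothetical.

```python
from pathlib import Path


def shadowing_candidates(package_name: str, directory: str = "."):
    """List files or folders in `directory` that would shadow `package_name` on import."""
    base = Path(directory)
    candidates = [base / f"{package_name}.py", base / package_name]
    return [str(path) for path in candidates if path.exists()]


found = shadowing_candidates("docquery")
if found:
    print("These local paths may shadow the installed package:", found)
```

If the check reports a local docquery.py, renaming that script is the usual remedy for this kind of circular-import message.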

from docquery cannot import Document

I have already installed docquery, but I cannot import Document from it. I am running this on Colab, and this is the error it shows:

in <cell line: 1>()
----> 1 from docquery import Document

ImportError: cannot import name 'Document' from 'docquery' (/usr/local/lib/python3.10/dist-packages/docquery/__init__.py)

device logic failing in colab

Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/__main__.py", line 59, in main
    return args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/scan.py", line 47, in main
    nlp = get_pipeline(args.checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/docquery/pipeline.py", line 79, in get_pipeline
    **pipeline_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/__init__.py", line 767, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/docquery/ext/pipeline.py", line 87, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py", line 768, in __init__
    self.device = device if framework == "tf" else torch.device("cpu" if device < 0 else f"cuda:{device}")
TypeError: '<' not supported between instances of 'str' and 'int'
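
The last frame compares device < 0, which only works when device is an integer. A minimal sketch of normalizing a device argument to that integer form before it reaches such a comparison; the function name to_device_index is hypothetical, and which string values docquery actually passes is not shown in the traceback, so the accepted spellings here are assumptions.

```python
def to_device_index(device) -> int:
    """Map a device given as int or str to an integer index.

    Returns -1 for CPU and the CUDA ordinal otherwise.
    """
    if isinstance(device, int):
        return device
    name = str(device).strip().lower()
    if name == "cpu":
        return -1
    if name == "cuda":
        return 0
    if name.startswith("cuda:"):
        return int(name.split(":", 1)[1])
    # Plain numeric strings such as "0" or "-1".
    return int(name)


index = to_device_index("cuda:1")
label = "cpu" if index < 0 else f"cuda:{index}"
```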

Support text-based question answering pipelines

We've had several requests (e.g. this one) to support international documents in DocQuery. One way to do that is by extending DocQuery to support traditional QA models too (by rendering the document's text as plain text). This will not be as accurate as document-specific models but may work well for a number of tasks.

UI RefExp task

Congrats on the great work with DocQuery, Impira team and congrats on joining Figma.

Have you looked at transfer learning DocVQA into UI RefExp? Natural language Referring Expression for the location of the bounding box of a UI element.

I have had some success with Donut, but it is converging slower than I anticipated. After a couple of weeks on a single A100 GPU, it is still improving slowly, with IoU at 47% (about a 10% improvement in the last week). IoU (intersection over union) of predicted vs ground truth bounding boxes seems to be a reasonable validation metric for this task.
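
For reference, a minimal sketch of the IoU metric mentioned above. The (x_min, y_min, x_max, y_max) box layout is an assumption for illustration and is not taken from the linked notebook.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


score = iou((0, 0, 2, 2), (1, 1, 3, 3))
```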

Curious what your experience with the task might have been and whether I am approaching the problem from the wrong angle.

Here is my Hugging Face space with links to the fine-tuning notebook.
https://huggingface.co/spaces/ivelin/ui-refexp

Regards!

Doesn't work with PDF

docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

Errors:

  • FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

  • pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Support brew install

For users who want to quick start with docquery and not figure out dependencies (e.g. tesseract), it'd be great to have a single brew install command that installs docquery and its dependencies.

Unable to load certain websites in DocQuery on Hugging Face

Examples

https://usdtoreros.com/sports/womens-volleyball/schedule/2022

Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b0d6014e89 <unknown>


https://www.nytimes.com/

Message: unknown error: session deleted because of page crash
from tab crashed
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b74c4eae89 <unknown>


https://www.ikea.com/us/en/cat/sectionals-16238/

Message: unknown error: cannot activate web view
  (Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b74c4eae89

Fine Tune for document question answering

Hi, I am trying to fine-tune impira/layoutlm-document-qa for document question answering tasks. I am unable to find any relevant code or examples for this. Most content is based on LayoutLMv2/v3, but those differ in architecture and encoding, so I cannot use it for fine-tuning. Any help is appreciated.

CLI for Word document (.docx )

I need a CLI for Word documents that can read particular content in tables.

The table and content tags will always be the same.

It just needs to read the content for the corresponding tag and display it.

It should read all tags and display them at once.

Illegal instruction: 4

$ docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png
Illegal instruction: 4

Just got this when running on a Mac M1. Is it supported?

Any suggestions to resolve this would be helpful 🙏

Warm regards

LayoutLM Squad Training

What approach did you use for fine-tuning LayoutLM on SQuAD?

Since there is no visual element to SQuAD, did you input blank boxes or null boxes? Or maybe generate synthetic documents?

failed to import pipeline

from docquery import document, pipeline

gets me

Failed to import docquery.transformers_patch because of the following error (look up to see its traceback):
Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'AutoModelForDepthEstimation' from 'transformers.models.auto.modeling_auto' (/Users/theobouwman/dev/projects/test/venv/lib/python3.10/site-packages/transformers/models/auto/modeling_auto.py)

How can I use my checkpoints in the pipeline

When I use my own trained DocVQA model as the checkpoint, it throws an error about the config.json file. How can I use a model trained with transformers for DocVQA with this library, and return checkpoints?

The web tests are flaky

Specifically the question/answering over the README file. We should pick a "simpler" web page that can reliably return a result.

Issue with docquery document pydantic

Hi, I'm facing an issue with pydantic:
from pydantic.fields import ModelField
ImportError: cannot import name 'ModelField' from 'pydantic.fields'

It happens with Python 3.8, 3.9, and 3.10 as well. Please suggest a solution.

Multiline extraction

The problem with this package is that it cannot extract multiline answers. For example, when a description spans two or more lines, docquery cannot extract that information. I need a solution for this issue.

OSError: [WinError 123]

(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
Traceback (most recent call last):
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\Scripts\docquery.exe\__main__.py", line 7, in <module>
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\__main__.py", line 61, in main
    return args.func(args)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\scan.py", line 54, in main
    if pathlib.Path(args.path).is_dir():
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1373, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1183, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https:\news.ycombinator.com'
(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
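
The traceback shows pathlib.Path(args.path).is_dir() being called on a URL, which Windows rejects as an invalid filename. One possible guard, sketched here with hypothetical helper names (is_url, classify_input) that are not part of docquery, is to recognize URLs before touching the filesystem.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(value: str) -> bool:
    """True for strings that look like http(s) URLs."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def classify_input(value: str) -> str:
    """Return 'url', 'directory', or 'file' without calling stat() on URLs."""
    if is_url(value):
        return "url"
    return "directory" if Path(value).is_dir() else "file"


kind = classify_input("https://news.ycombinator.com")
```

A Windows path such as C:\Users\N_B parses with the scheme "c" and no network location, so it is still treated as a filesystem path.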

no matches found: docquery[donut]

Running the command pip install docquery[donut] to use Donut throws the error no matches found: docquery[donut].
A simple workaround is to install these two libraries manually:

pip install sentencepiece==0.1.97
pip install protobuf==3.20.3

Support scraping webpages in docquery

A few folks have mentioned using DocQuery to scrape web pages (e.g. real-estate websites). This issue is meant to track that project and collect more information about use cases.

Default Experience Should Not Require Poppler for PDFs

PDFs take advantage of Poppler to create image previews; however, these are unnecessary if the file has embedded text for certain models (e.g. LayoutLMv1). We should make sure that the default scenario of poppler not being available still works.

Cannot use pop on a ModelOutput instance. (transformers)

Hi, I had a problem. When I run the file I get an error:
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 376, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

I saw here on GitHub that some people solve this by downgrading the transformers library to version 4.23.0, but when I try to do the same thing I get an error on installation:

ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Any ideas?

I saw on the internet that some people solve this by installing Rust, but I still get the same error.

Thanks

Exception: You cannot use ``pop`` on a ModelOutput instance.

Hello

I'm getting the following error when trying the Google Colab demo
https://colab.research.google.com/github/amazingvince/docqa/blob/updating_colab/docquery_example.ipynb
(Same issue on my M1 MacBook)

!docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf

2023-01-30 19:00:54,639 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
  File "/usr/local/bin/docquery", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
    return args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
    response = nlp(question=q, **d.context)
  File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
    return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
    is_last = item.pop("is_last")
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
    raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.

Stuck on Loading pipelines step

Hi,

I'm trying to use docquery via command line to ask a question on a PDF file but the process seems to get stuck at the Loading pipelines. step.

> docquery scan "What can I use to open a PDF file?" pdf-sample1.pdf
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-04-05 09:47:24,078 INFO: Loading pdf-sample1.pdf
2023-04-05 09:47:24,120 INFO: Done loading 1 file(s).
2023-04-05 09:47:24,121 INFO: Loading pipelines.

Also, when loading it as a library, the notebook gets stuck at this line:

p = pipeline('document-question-answering')

Could you help me resolve this? I'm on Windows 10 btw.

pipeline on list based input

The input format pipeline([{"image": image, "question": question}]) fails.

The pipeline treats the list as the image and fails to extract the question from the dict on line 196 of the pipeline.
The error is "TypeError: list indices must be integers or slices, not str".

Example:
p = pipeline.get_pipeline()
doc = document.load_document("example.pdf")
print(p([{**doc.context, 'question': "Who authored this paper?"}]))
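
Until list input is handled, one possible workaround is to unpack each dict into keyword arguments, the calling style that appears elsewhere in these issues as p(question=q, **doc.context). The sketch below uses hypothetical helper names (normalize_inputs, run_each) and a stand-in callable instead of a real pipeline.

```python
def normalize_inputs(inputs):
    """Return a list of dicts from either a single dict or a list of dicts."""
    if isinstance(inputs, dict):
        return [inputs]
    if isinstance(inputs, (list, tuple)):
        if not all(isinstance(item, dict) for item in inputs):
            raise TypeError("every list item must be a dict")
        return list(inputs)
    raise TypeError(f"unsupported input type: {type(inputs).__name__}")


def run_each(pipe, inputs):
    """Call `pipe` once per item, passing the dict entries as keyword arguments."""
    return [pipe(**item) for item in normalize_inputs(inputs)]


def fake_pipe(question, **context):
    """Stand-in for a real pipeline; it just echoes the question."""
    return {"question": question, "context_keys": sorted(context)}


results = run_each(fake_pipe, [{"image": "page-1", "question": "Who authored this paper?"}])
```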
