impira / docquery
An easy way to extract information from documents
License: MIT License
When running the notebook top to bottom, we get the following error for the second cell:
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-02-05 19:28:49,801 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
File "/usr/local/bin/docquery", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
return args.func(args)
File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
response = nlp(question=q, **d.context)
File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
return next(
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
is_last = item.pop("is_last")
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.
I get the same exception when running the following with my local install.
docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf
I would really appreciate a fix, since I'm really curious to try out docquery.
Regards!
See context here
If we want to extend the model to be able to predict "No value", this can be done pretty easily by adding an additional mask here outputs > 0 (this corresponds to a sigmoid output of > 0.5, the typical threshold for binary classification). In other words, the results returned by postprocess_standard() would only include the top_k predictions in which their respective logit was > 0.
I'm not sure how to extend this to account for when len(model_outputs) > 1, though. Something I'll have to think about for a bit
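The thresholding idea described above can be sketched in plain Python. This is only an illustration, not docquery's actual postprocess_standard(): the (answer, logit) candidate layout, the sample logits, and the helper name filter_confident are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; sigmoid(0) is exactly 0.5."""
    return 1.0 / (1.0 + math.exp(-x))


def filter_confident(candidates, top_k=1):
    """Keep at most top_k (answer, logit) pairs whose logit is > 0.

    Hypothetical helper: a logit > 0 is equivalent to sigmoid(logit) > 0.5.
    An empty result plays the role of "No value".
    """
    kept = [c for c in candidates if c[1] > 0]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:top_k]


# Made-up logits, purely for illustration.
candidates = [("INV-001", 2.3), ("Total", -0.7), ("2023", 0.1)]
print(filter_confident(candidates, top_k=2))
print(filter_confident([("maybe", -1.5)], top_k=2))  # empty list stands for "No value"
```

This sketch covers a single list of candidates only; how to combine several model outputs (the len(model_outputs) > 1 case) is left open, as in the comment above.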
Whenever I try to run a query on a document with transformers>=4.26 an exception is raised:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[1], line 5
3 doc = document.load_document("https://templates.invoicehome.com/invoice-template-us-neat-750px.png")
4 for q in ["What is the invoice number?", "What is the invoice total?"]:
----> 5 print(q, p(question=q, **doc.context))
File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/docquery/ext/pipeline_document_question_answering.py:232, in DocumentQuestionAnsweringPipeline.__call__(self, image, question, **kwargs)
229 else:
230 normalized_images = [(image, None)]
--> 232 return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/base.py:1076, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1074 return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
1075 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1076 return next(
1077 iter(
1078 self.get_iterator(
1079 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
1080 )
1081 )
1082 )
1083 else:
1084 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:124, in PipelineIterator.__next__(self)
121 return self.loader_batch_item()
123 # We're out of items within a batch
--> 124 item = next(self.iterator)
125 processed = self.infer(item, **self.params)
126 # We now have a batch of "inferred things".
File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py:291, in PipelinePackIterator.__next__(self)
289 else:
290 item = processed
--> 291 is_last = item.pop("is_last")
292 accumulator.append(item)
293 return accumulator
File ~/Documents/repos/doqry/venv/lib/python3.10/site-packages/transformers/utils/generic.py:281, in ModelOutput.pop(self, *args, **kwargs)
280 def pop(self, *args, **kwargs):
--> 281 raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.
[requirements.txt](https://github.com/impira/docquery/files/11320270/requirements.txt)
Hello, I have a question regarding the docquery score result.
Sometimes I get a score around 0.8-0.9, and sometimes the opposite. Could you explain what the score is based on? Is it based on WER or CER? Is it like a confidence score? Thank you.
pip install docquery
apt-get install tesseract-ocr
docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
Observe error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
poppler-utils
My environment:
Mac OS Apple Silicon
Ran via the python:3 Docker image.
It may be worth adding a note to the README to install poppler-utils. I'm happy to open a PR for this - also happy to open a PR for a basic Docker configuration if that's something you would like.
This is my first open-source contribution so apologies if I've missed some formalities - and nice project!
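A quick way to confirm whether the system dependencies from the report above are visible is to look them up on PATH. The sketch below uses only the Python standard library; the helper name missing_tools is made up for the example, and the tool list assumes that pdfinfo (shipped with poppler-utils) and tesseract are the executables that matter here.

```python
import shutil


def missing_tools(tools=("pdfinfo", "tesseract")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("pdfinfo and tesseract were found on PATH")
```

If pdfinfo is reported as missing, installing poppler-utils (as suggested above) should make it available.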
I installed docquery in a fresh python environment and followed the readme to run it using donut but ran into dependency issues. I had to install the following libraries to get it to work:
sentencepiece==0.1.97
protobuf==3.20.3
Just wanted to share this in case others run into the same problems.
print(q, p(question=q, **doc.context))
Exception: You cannot use ``pop`` on a ModelOutput instance.
2023-06-17 18:30:01.207852: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I am working on Google Colab.
I also get an error when I use pipeline.getpipline.
It says it's an AttributeError and that the object doesn't have such a method.
Top of my file:
from docquery import pipeline, document
When running this, I see the following error:
ImportError: cannot import name 'pipeline' from partially initialized module 'docquery' (most likely due to a circular import) (/Users/swaraj/sempre-repos/swaraj-sandbox/docquery.py)
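The path at the end of this error points at a file named docquery.py in the user's own project rather than at the installed package, so the script is most likely importing itself. A minimal sketch for spotting that situation follows; the helper name shadowing_script is hypothetical, and the temporary directory only stands in for a project folder.

```python
import tempfile
from pathlib import Path


def shadowing_script(module_name, directory="."):
    """Return the path of a local <module_name>.py that would shadow the package, or None."""
    path = Path(directory) / f"{module_name}.py"
    return path if path.is_file() else None


# Demonstration in a throwaway directory that mimics the reported layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "docquery.py").write_text("from docquery import pipeline\n")
    found = shadowing_script("docquery", tmp)
    clean = shadowing_script("some_other_name", tmp)
    print(found is not None, clean is None)
```

When such a file is found, renaming it (for example to something that is not the name of an installed package) is the usual way out.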
This might be a low-priority issue, but os.geteuid() is only available on Unix-like systems, so this line throws an error on Windows:
Line 43 in a9c5127
I don't know how you can check for the root user on Windows, but it might be a good idea to surround this with a try-catch, or check os for geteuid() with hasattr().
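The hasattr()-style guard suggested above could look roughly like this. It is a sketch, not the project's code: the function name is_root is made up, and treating Windows as "not root" is an assumption of the example.

```python
import os


def is_root() -> bool:
    """Return True when running as root on a Unix-like system.

    os.geteuid() does not exist on Windows, so the lookup goes through
    getattr() and falls back to False when the function is missing.
    """
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0


print(is_root())
```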
I already installed docquery, but it cannot import Document from docquery. This is the error it shows. I'm doing this on Colab.
in <cell line: 1>()
----> 1 from docquery import Document
ImportError: cannot import name 'Document' from 'docquery' (/usr/local/lib/python3.10/dist-packages/docquery/__init__.py)
Traceback (most recent call last):
File "/usr/local/bin/docquery", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/__main__.py", line 59, in main
return args.func(args)
File "/usr/local/lib/python3.7/dist-packages/docquery/cmd/scan.py", line 47, in main
nlp = get_pipeline(args.checkpoint)
File "/usr/local/lib/python3.7/dist-packages/docquery/pipeline.py", line 79, in get_pipeline
**pipeline_kwargs,
File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/__init__.py", line 767, in pipeline
return pipeline_class(model=model, framework=framework, task=task, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/docquery/ext/pipeline.py", line 87, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py", line 768, in __init__
self.device = device if framework == "tf" else torch.device("cpu" if device < 0 else f"cuda:{device}")
TypeError: '<' not supported between instances of 'str' and 'int'
Hi! I want to ask whether DocQuery supports .json files as document input?
We've had several requests (e.g. this one) to support international documents in DocQuery. One way to do that is by extending DocQuery to support traditional QA models too (by rendering the document's text as plain text). This will not be as accurate as document-specific models but may work well for a number of tasks.
Congrats on the great work with DocQuery, Impira team and congrats on joining Figma.
Have you looked at transfer learning DocVQA into UI RefExp? Natural language Referring Expression for the location of the bounding box of a UI element.
I have had some success with Donut, but it is converging slower than I anticipated. After a couple of weeks on a single A100 GPU, it is still improving slowly, with IoU at 47%, about a 10% improvement in the last week. IoU (intersection over union) of predicted vs. ground truth bounding boxes seems to be a reasonable validation metric for this task.
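For reference, the IoU metric mentioned above can be computed as follows. This is a generic sketch, not the code from the linked notebook; the (x_min, y_min, x_max, y_max) box layout and the function name iou are assumptions of the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```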
Curious what your experience with the task might have been and whether I am approaching the problem from the wrong angle.
Here is my Huggingface space with links to fine tuning notebook.
https://huggingface.co/spaces/ivelin/ui-refexp
Regards!
docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
Errors:
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
We should support running https://huggingface.co/spaces/impira/docquery in the Docker container
For users who want to quick start with docquery and not figure out dependencies (e.g. tesseract), it'd be great to have a single brew install command that installs docquery and its dependencies.
Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b0d6014e89 <unknown>
Message: unknown error: session deleted because of page crash
from tab crashed
(Session info: headless chrome=90.0.4430.212)
Stacktrace:
#0 0x55b74c4eae89 <unknown>
@ankrgyl
I have tried multiple ways to extract the currency name, but no luck.
Can you suggest a matching query (question) to extract the currency from the doc?
Hi, I am trying to fine-tune impira/layoutlm-document-qa for document question answering tasks. I am unable to find any relevant code or examples for this. Most content is based on LayoutLMv2/v3, but they differ in architecture and encoding, so I am unable to use it for fine-tuning. Any help is appreciated.
Need a CLI for Word documents, to read particular content in tables.
The table and content tags will always be the same.
Just need to read the corresponding tag content and display it.
Need to read all tags and display them at once.
$ docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png
Illegal instruction: 4
Just got this when running on a Mac M1. Is it supported?
Any suggestions to resolve this would be helpful.
Warm regards
What approach did you use for fine-tuning LayoutLM on SQuAD?
Since there is no visual element to squad, did you input blank boxes or null boxes? Or maybe generate synthetic documents?
from docquery import document, pipeline
gets me
Failed to import docquery.transformers_patch because of the following error (look up to see its traceback):
Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'AutoModelForDepthEstimation' from 'transformers.models.auto.modeling_auto' (/Users/theobouwman/dev/projects/test/venv/lib/python3.10/site-packages/transformers/models/auto/modeling_auto.py)
When I use my own trained DocVQA model as the checkpoint, it throws an error about the config.json file. How can I use a model trained with transformers DocVQA with this library, and return checkpoints?
Specifically the question/answering over the README file. We should pick a "simpler" web page that can reliably return a result.
This issue tracks support for document classification, e.g. identifying that a document is an invoice.
Hi, I'm facing an issue with pydantic:
from pydantic.fields import ModelField
ImportError: cannot import name 'ModelField' from 'pydantic.fields'
This happens with Python 3.8, 3.9, and 3.10. Please suggest a solution.
The problem with this package is that it can't extract multiline answers. For example, when a description has 2 or more lines, docquery can't extract that information. I need a solution for this issue.
Hi! I have a question about how to store doc.context in this command: print(q, p(question=q, **doc.context)). It takes a long time to encode the context file for each query, so it would be better to pre-encode the context file and have docquery retrieve from the encoded context. Thank you!
(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
Traceback (most recent call last):
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\N_B\Miniconda3\envs\mydoc\Scripts\docquery.exe\__main__.py", line 7, in <module>
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\__main__.py", line 61, in main
return args.func(args)
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\site-packages\docquery\cmd\scan.py", line 54, in main
if pathlib.Path(args.path).is_dir():
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1373, in is_dir
return S_ISDIR(self.stat().st_mode)
File "C:\Users\N_B\Miniconda3\envs\mydoc\lib\pathlib.py", line 1183, in stat
return self._accessor.stat(self)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https:\news.ycombinator.com'
(mydoc) PS C:\Users\N_B\OneDrive\Desktop\docquery>
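The traceback above shows a URL being handed to pathlib.Path(...).is_dir(), which Windows rejects as an invalid filename. One possible guard is to check for a URL scheme before touching the filesystem. This is a sketch under that assumption, not docquery's actual fix; the helper names looks_like_url and is_local_dir are made up for the example.

```python
from pathlib import Path
from urllib.parse import urlparse


def looks_like_url(target: str) -> bool:
    """True for http(s) URLs; drive letters such as 'C:' do not match."""
    return urlparse(target).scheme in ("http", "https")


def is_local_dir(target: str) -> bool:
    """Only ask the filesystem about targets that are not URLs."""
    if looks_like_url(target):
        return False
    return Path(target).is_dir()


print(is_local_dir("https://news.ycombinator.com"))
print(is_local_dir("."))
```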
Running this command to use donut, pip install docquery[donut], seems to throw an error: no matches found: docquery[donut]
A simple workaround is to manually install these two libraries:
pip install sentencepiece==0.1.97
pip install protobuf==3.20.3
A few folks have mentioned using DocQuery to scrape web pages (e.g. real-estate websites). This issue is meant to track that project and collect more information about use cases.
I just switched to GPU and did a "run all", and this is the error I get
PDFs take advantage of Poppler to create image previews; however, these are unnecessary if the file has embedded text for certain models (e.g. LayoutLMv1). We should make sure that the default scenario of poppler not being available still works.
Hello! May I ask how to adjust the number of output answers? I would like to get more than ONE answer. Thank you!
(CC @amazingvince) Once PR #12 lands, we should publish the dockerfile to Dockerhub for easy consumption.
Hi, I had a problem. When I run the file I get an error:
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 376, in pop
raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.
I saw here on GitHub that some people solved this by downgrading the transformers library to version 4.23.0, but when I try to do the same thing I get an error during installation:
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects
Any ideas?
I saw on the internet that some people solved this by installing Rust, but I get the same error.
Thanks
Hello
I'm getting the following error when trying the google colab demo
https://colab.research.google.com/github/amazingvince/docqa/blob/updating_colab/docquery_example.ipynb
(Same issue on my m1 macbook)
!docquery scan "who authored this paper?" https://arxiv.org/pdf/2101.07597.pdf
2023-01-30 19:00:54,639 ERROR: Failed while processing https://arxiv.org/pdf/2101.07597.pdf on question: 'who authored this paper?'
Traceback (most recent call last):
File "/usr/local/bin/docquery", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/__main__.py", line 61, in main
return args.func(args)
File "/usr/local/lib/python3.8/dist-packages/docquery/cmd/scan.py", line 95, in main
response = nlp(question=q, **d.context)
File "/usr/local/lib/python3.8/dist-packages/docquery/ext/pipeline_document_question_answering.py", line 232, in __call__
return super().__call__({"question": question, "pages": normalized_images}, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1076, in __call__
return next(
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/pt_utils.py", line 291, in __next__
is_last = item.pop("is_last")
File "/usr/local/lib/python3.8/dist-packages/transformers/utils/generic.py", line 281, in pop
raise Exception(f"You cannot use ``pop`` on a {self.__class__.__name__} instance.")
Exception: You cannot use ``pop`` on a ModelOutput instance.
https://www.animalearn.org/img/pdf/animalFacts.pdf
Question: which animals are mentioned in this document?
Docsign's answer: Tiny animals!
Correct answer: Cat, rat, pig, earthworm, crayfish.
Hi,
I'm trying to use docquery via the command line to ask a question about a PDF file, but the process seems to get stuck at the "Loading pipelines." step.
> docquery scan "What can I use to open a PDF file?" pdf-sample1.pdf
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
2023-04-05 09:47:24,078 INFO: Loading pdf-sample1.pdf
2023-04-05 09:47:24,120 INFO: Done loading 1 file(s).
2023-04-05 09:47:24,121 INFO: Loading pipelines.
Also, when loading it as a library, the notebook gets stuck at this line:
p = pipeline('document-question-answering')
Could you help me resolve this? I'm on Windows 10 btw.
How many document types (e.g. invoices) does the DocQuery engine currently handle?
The pipeline([{"image": image, "question": question}]) input format fails.
The pipeline treats the list as the image and fails to extract the question from the dict on line 196 of the pipeline.
The error is "TypeError: list indices must be integers or slices, not str"
Example:
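from docquery import document, pipeline
p = pipeline.get_pipeline()
doc = document.load_document("example.pdf")
# Passing a list of dicts is the input format that triggers the TypeError above.
print(p([{**doc.context, 'question': "Who authored this paper?"}]))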