katanaml / sparrow Goto Github PK

View Code? Open in Web Editor NEW

1.8K 1.8K 221.0 797.49 MB

Data processing with ML and LLM

Home Page: https://katanaml.io

License: GNU General Public License v3.0

Python 98.98% Shell 1.02%

computer-vision gpt huggingface-transformers llm machinelearning nlp-machine-learning rag

sparrow's People

Contributors

Stargazers

Watchers

Forkers

ultrapower8 mattiasstahre beichu benjamesbabala karndeepsingh sc-ravbal saifurrr faizan1041 macknilan marziehzzare ltogliaucv ebunt periwinkleftw etc22 glaceage pingyangtiaer brentes seanbrayucl lcalmbach crziter lenapheno eulerian-tuple jarach-209 pablomarcel ssenoris shrijagit tdrei kamzam sthitaprajnas abd-mansour blackwhites jimenezmoracd codeaudit rajneesh-tiwari eureka3214 codingtello2 hansv905 andy-fractal adithya-n11 anuragsingh28 suriya43426 shrey10926 valsicuksofia henb bitpetro akramineda sakets596 qkjin fabregas201307 brandaobrandisborges ocrifydotnet leenamkee heidihello luisgradossalinas wsantunes blueoceandevops willcode2surf nicojuicy denischiciudean nfonjeannoel techthiyanes tangklrb kawdoco jabbiez maxpowerwastaken badreddine1234 allstartix 5l1v3r1 bigdatasciencegroup pai-sr am-infoweb mz0in milenioscience itsbrex 1001011000101101 ss756 sadafshafi titocampos jaypaddy anoop-qasolve yz accountsware waceke-kc inayet josekurian sukritgoel thedotproduct jemsc eivindkjosbakken codelabspro bpm-ui if-ai m-daniyal-123 maxsu tgowthambits tonywhite11 nadeem-cpp denysmiller tuhinmallick isa96

sparrow's Issues

Table Extraction Benchmark Usecase

Hope this finds you well. In the midst of my recent deep dive into OCRs, I found myself in a conundrum. We've got this galaxy of OCR tools—tabula, doctr, textract, paddle, donut, and the list goes on. Each with their own merits, but how do we objectively measure them?

I've been toying with an idea: writing an open source OCR benchmark system.

Here's a breakdown of what the project aims to achieve, in a tentative order of implementation:

Initial milestone:

Curate some public data sets.
Create a pip-friendly CLI.
Automate OCR installation.
Test OCR quality (WER) and OCR speed.

Expansion push:

Allow users to test their own data.
Use Python notebooks for visually appealing reports.
Keep the report notebooks clean, with a thoughtful reporting API.

Stretch goals:

Run tests on a series of OCR releases - Who's improving? Who has frequent regressions?
Meta-analysis; interpret results from multiple experiments.
Evaluate textract's cost-effectiveness in user use cases (WER-Delta-Per-Dollar-Vs-Doctr) 🤡

Now, I'm floating this to you because I respect your acumen and think you could be the catalyst to take this from concept to reality. But here's the catch, and it's a significant one for me: I'm a die-hard advocate for keeping this venture firmly rooted in the open-source ethos, specifically under the AGPL. The repo is currently MIT, and I'd be keen on transitioning it to AGPL.

If this aligns with your principles and you're up for a challenge, then let’s talk! If not, no hard feelings. It's crucial we're on the same wavelength from the get-go.

Eager to hear your thoughts!

How to save the predicted output from LayoutLM or LayoutLMv2 ?

I trained LayoutLM for my dataset and I am getting predictions at the word level like in the image "ALVARO FRANCISCO MONTOYA" is true labeled as "party_name_1" but while prediction "ALVARO " is tagged as "party_name_1", "FRANCISCO" is tagged as "party_name_1", "MONTOYA" is tagged as "party_name_1". In short, i am getting prediction for each word but how to save these prediction as one predicted output like "ALVARO FRANCISCO MONTOYA" as "party_name_1". How to save this as a single output?
Any help would be greatful.
Below image is the predicted output image from LayoutLM.

ModuleNotFoundError: No module named 'tools.utilities'

RuntimeError: Failed to import transformers.models.mpnet.modeling_mpnet

Hi I tried runnig your demo with:

./sparrow.sh ingest

But this resulted in this error:

...
│ /home/tobias/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:698 in │
│ getattribute_from_module                                                                         │
│                                                                                                  │
│   695 │   │   return None                                                                        │
│   696 │   if isinstance(attr, tuple):                                                            │
│   697 │   │   return tuple(getattribute_from_module(module, a) for a in attr)                    │
│ ❱ 698 │   if hasattr(module, attr):                                                              │
│   699 │   │   return getattr(module, attr)                                                       │
│   700 │   # Some of the mappings have entries model_type -> object of another model type. In t   │
│   701 │   # object at the top level.                                                             │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │   attr = 'MPNetModel'                                                                        │ │
│ │ module = <module 'transformers.models.mpnet' from                                            │ │
│ │          '/home/tobias/.local/lib/python3.10/site-packages/transformers/models/mpnet/__init… │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/tobias/.local/lib/python3.10/site-packages/transformers/utils/import_utils.py:1354 in      │
│ __getattr__                                                                                      │
│                                                                                                  │
│   1351 │   │   if name in self._modules:                                                         │
│   1352 │   │   │   value = self._get_module(name)                                                │
│   1353 │   │   elif name in self._class_to_module.keys():                                        │
│ ❱ 1354 │   │   │   module = self._get_module(self._class_to_module[name])                        │
│   1355 │   │   │   value = getattr(module, name)                                                 │
│   1356 │   │   else:                                                                             │
│   1357 │   │   │   raise AttributeError(f"module {self.__name__} has no attribute {name}")       │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ name = 'MPNetModel'                                                                          │ │
│ │ self = <module 'transformers.models.mpnet' from                                              │ │
│ │        '/home/tobias/.local/lib/python3.10/site-packages/transformers/models/mpnet/__init__… │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/tobias/.local/lib/python3.10/site-packages/transformers/utils/import_utils.py:1366 in      │
│ _get_module                                                                                      │
│                                                                                                  │
│   1363 │   │   try:                                                                              │
│   1364 │   │   │   return importlib.import_module("." + module_name, self.__name__)              │
│   1365 │   │   except Exception as e:                                                            │
│ ❱ 1366 │   │   │   raise RuntimeError(                                                           │
│   1367 │   │   │   │   f"Failed to import {self.__name__}.{module_name} because of the followin  │
│   1368 │   │   │   │   f" traceback):\n{e}"                                                      │
│   1369 │   │   │   ) from e                                                                      │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ module_name = 'modeling_mpnet'                                                               │ │
│ │        self = <module 'transformers.models.mpnet' from                                       │ │
│ │               '/home/tobias/.local/lib/python3.10/site-packages/transformers/models/mpnet/_… │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Failed to import transformers.models.mpnet.modeling_mpnet because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
'FieldInfo' object has no attribute 'required'
/usr/lib/python3.10/tempfile.py:999: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpkf5oi4te'>
  _warnings.warn(warn_message, ResourceWarning)

Before I started docker, ran the installation and downloaded the model:

docker compose up -d
pip install -r requirements.txt
curl -fsSL https://ollama.com/install.sh | sh
wget https://huggingface.co/TheBloke/Starling-LM-7B-alpha-GGUF/resolve/main/starling-lm-7b-alpha.Q5_K_M.gguf?download=true -O starling-lm-7b-alpha.Q5_K_M.gguf

Error with docker

Getting error {"action":"startup","error":"invalid config: no authentication scheme configured, you must select at least one","level":"error","msg":"could not load config","time":"2024-03-01T08:58:45Z"}
When to start the docker.

How can make sub group

I am trying to get result like the following json:
{
"INVOICE_HEADER": {
"INVOICE_INFO": {
"INVOICE_DATE": "2023-07-28",
"INVOICE_ID": "L05-4254515",
"INVOICE_ISSUER_IDREF": 442985,
"INVOICE_RECIPIENT_IDREF": {
"content": "41420-0000411428-89",
"type": "netcomID"
},
"HEADER_UDX": {
"RE_PK": 1010,
"INV_REFERENCE_NUMBER": "N/A",
"INV_NET_AMOUNT2": "N/A",
"INV_NET_AMOUNT3": "N/A",
"INV_TAX_RATE2": "N/A",
"QR_IBAN": "CH1130778010700502202",
"QR_REFERENCE": 4.017800000004255e+25,
"INV_TAX_RATE3": "N/A",
"DocumentNr": 60124433,
"PON": "N/A",
"INV_TAX_AMOUNT2": "N/A",
"INV_TAX_AMOUNT3": "N/A",
"QR_INFORMATION": "N/A",
"INV_IS_MM": 0,
"RE_ILN": 7610227000016,
"INV_DELIVERY_DATE": "26.07.2023",
"RE_RECIPIENT_NO": 8,
"ESR_ROW": "N/A",
"INV_CREDIT_NOTE": 0
},
"CURRENCY": "CHF",
"PARTIES": {
"PARTY": [
{
"PARTY_ROLE": "invoice_issuer",
"ADDRESS": {
"TAX_NUMBER": "CHE-104.537.601",
"CITY": "Weiningen ZH",
"VAT_ID": "N/A",
"NAME": "Auto AG Truck",
"STREET": "Im Gewerbepark 1",
"NAME2": "N/A",
"COUNTRY": "CH",
"ZIP": 8104
},
"PARTY_ID": [
442985,
{
"content": "41001-0000415300-14",
"type": "netcomID"
}
]
},
{
"PARTY_ROLE": "invoice_recipient",
"ADDRESS": {
"CITY": "Schaan",
"VAT_ID": "LI50552",
"NAME": "Hilcona AG",
"STREET": "Bendererstrasse 21",
"COUNTRY": "LI",
"ZIP": 9494
},
"PARTY_ID": "41420-0000411428-89"
}
]
}
}
},
"INVOICE_ITEM_LIST": {
"INVOICE_ITEM": {
"ITEM_UDX": {
"INVI_ORI_ARTICLE_NO": "206-555-04",
"OR_ORDER_NO": "206-555-0144",
"INVI_ORDER_NO": "206-555-0144",
"OR_DELIVERY_DATE": "2019-11-11",
"OR_DELIVERY_NO": 1,
"OR_TOTAL_NET_PRICE": 100
},
"QUANTITY": 1,
"LINE_ITEM_ID": 1,
"PRICE_LINE_AMOUNT": 110,
"PRODUCT_ID": {
"DESCRIPTION_SHORT": "iom_dummy"
},
"ORDER_UNIT": "C62",
"PRODUCT_PRICE_FIX": {
"PRICE_AMOUNT": 110
}
}
},
"version": 2.1,
"INVOICE_SUMMARY": {
"TOTAL_TAX": {
"TAX_DETAILS_FIX": {
"TAX": 7.7,
"TAX_AMOUNT": 41.75
}
},
"NET_VALUE_GOODS": 542.29,
"TOTAL_ITEM_NUM": 1,
"TOTAL_AMOUNT": 584.05
}
}
To do that, I think sparrow-data needs to be changed. The pdfs are in German language but the invoice is in English.

sparrow-ui release ?

Hi , really love what you guys are doing with sparrow rn , and would love to see ui to test it , there was one for the donut models but it seems its still under active development here, any expected time of release

Model for commercial use

The starling model mentioned in config file is not available for commercial use. Which other model is available for commercial use?

Can you share google colab tutorial for it. Can it run on Korean langugage?

How to increase the accuracy after training a model?

Hi everyone

I was able to create my fine-tuned model based on a dataset of 93 images. Here is an example:

Image:

Json:

{
    "contractor_application_for_payment": {
        "original_contract_sum": "313,500.00",
        "net_change_by_change_orders": "0.00",
        "contract_sum_to_date": "313,500.00",
        "total_completed_and_stored_to_date": "100,000.00",
        "retainage": "10,000.00",
        "total_earned_less_retainage": "90,000.00",
        "less_previous_certificates_for_payment": "0.00",
        "current_payment_due": "90,000.00",
        "balance_to_finish": "223,500.00"
    }
}

Then I test how accurate my model is with 6 images and I get a poor result of 0.53 accuracy so not really sure what to do next.

For some documents, it's very inaccurate, it looks like it sets random values: (I'm just printing the values here)

Expected:
['16,875.00', '82,370.95', '99,245.95', '14,201.89', '1,420.19', '12,781.70', '0.00', '12,781.70', '86,464.25']

Inferred:
['16,875.00', '82,370.95', '99,245.95', '16,803.89', '1,639.18', '15,282.29', '0.00', '15,282.29', '60,442.22']

I'm basically using the same code as in this repo. Only my config is a bit different:

config = {
    "max_epochs": 30,
    "val_check_interval": 0.4,
    "check_val_every_n_epoch":1,
    "gradient_clip_val" :1.0,
    "num_training_samples_per_epoch": 93,
    "lr": 3e-5,
    "train_batch_sizes": [8],
    "val_batch_sizes": [1],
    # "seed":2022,
    "num_nodes": 1,
    "warmup_steps": 5,
    "result_path": "./result",
    "verbose": False,
}

Should I adjust it? or should I just train my model with a bigger dataset?

I'd appreciate some guidance please, thanks in advance!

Query Regarding Tree-Based Accuracy Calculation in Donut Model

Hello,

I am currently working on understanding the code within the donut model repository, specifically focusing on the tree-based accuracy calculation. While examining the codebase, I came across the utilization of the zss.distance function for accuracy calculation.

My inquiry pertains to the distance function, particularly concerning the concept of "keyroots of tree." I am seeking clarification on the definition and significance of these "keyroots of tree" within the context of the accuracy calculation process. Could someone kindly provide an explanation or insight into this matter?

Thank you.

Taking a long response time

The code that you've shared is taking a long time (more than 3 mins) to retrieve the results, how to optimize the response time ?

validation loss does not decrease

Hello,

I have been trying to finetune the donut model on my custom dataset. However, I have encountered an issue where the validation loss does not decrease after a few training epochs.

Here are the details of my dataset:

Total number of images in the training set: 12032
Total number of images in the validation set: 1290

Here are the config details that I have used for training;

config = { "max_epochs":30,
"val_check_interval":1.0,
"check_val_every_n_epoch":1,
"gradient_clip_val":1.0,
"num_training_samples_per_epoch": 12032,
"lr":3e-5,
"train_batch_sizes": [1],
"val_batch_sizes": [1],
#"seed":2022,
"num_nodes": 1,
"warmup_steps": 36096,
"result_path": "./result",
"verbose": False,
}

Here is the training log :

Epoch 21: 99%
13160/13320 [51:42<00:37, 4.24it/s, loss=0.0146, v_num=0]

The validation loss appears to fluctuate without showing a consistent decreasing trend. I would appreciate any insights or suggestions on how to address this issue and potentially improve the validation loss convergence.

Thank you for your assistance.

Is there a way to store the custom trained model locally and use it for inference?

In the provided example you use PushToHubCallback to push the trained model to hugging face.

In my case, I just want to keep the model locally. I tried using trainer.save_checkpoint("mymodel.ckpt"). The file mymodel.ckpt gets saved but not sure how to use it for inference since in sparrow_inference_donut_v1.ipynb it uses the model directly from hugging face with:

processor = DonutProcessor.from_pretrained('katanaml-org/invoices-donut-model-v1')
model = VisionEncoderDecoderModel.from_pretrained('katanaml-org/invoices-donut-model-v1')

I tried providing the path to mymodel.ckpt but it doesn't work, it seems it's expecting to find these files:

which I don't know to generate. Any help would be highly appreciated 🙂 thanks!

When using own images that are not invoices gives an unknown error

FileNotFoundError: [Errno 2] No such file or directory: 'docs/json/claim 001.json' Traceback: File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script exec(code, module.__dict__) File "/Users/louis/Documents/code/Code/sparrow/sparrow/sparrow-ui/donut/main.py", line 187, in <module> view(Model()) File "/Users/louis/Documents/code/Code/sparrow/sparrow/sparrow-ui/donut/main.py", line 88, in view DataAnnotation().view(DataAnnotation.Model(), st.session_state['ui_width'], st.session_state['device_type'], File "/Users/louis/Documents/code/Code/sparrow/sparrow/sparrow-ui/donut/views/data_annotation.py", line 146, in view saved_state = self.fetch_annotations(model.rects_file) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/louis/Documents/code/Code/sparrow/sparrow/sparrow-ui/donut/views/data_annotation.py", line 348, in fetch_annotations with open(rects_file, "r") as f:

Can you share google Colab code for this Repo

Hello, I hope you are doing well. I want to run this repo on Google colab. How Can I run this

Long respond time

Taking a long time to respond/process the results. Usually more than 3-4 mins.
Need to optimize the response time.
How to optimize?

How to annotate new document image?

I tried to upload an image but I got this error. It's asking for a corresponding file in docs/json.
But I haven't created one because I wish to use the annotation tool to do exactly that.
What are the steps to annotate new images?

Error with training with cpu

Error: AttributeError: 'DataLoader' object has no attribute 'code'

I've updated trainer as below:

trainer = pl.Trainer(
accelerator="cpu",#===========here
devices=1,
max_epochs=config_params.get("max_epochs"),
val_check_interval=config_params.get("val_check_interval"),
check_val_every_n_epoch=config_params.get("check_val_every_n_epoch"),
gradient_clip_val=config_params.get("gradient_clip_val"),
precision='bf16-mixed', #===========here
num_sanity_val_steps=0,
# logger=wandb_logger,
callbacks=[PushToHubCallback()],
)

LlamaIndex make this code really unreadable

LlamaIndex makes this code really hard to read and customization .

Please help me start the sparrow ocr api. It is not clear from the readme.

https://github.com/katanaml/sparrow/tree/main/sparrow-data/ocr Is this the path for sparrow ocr? I dont see a "start" service command here. Also, do I need to install the requirements.txt in the same virtual environment or I need to create a different one to avoid conflicts?

LayoutLMv2 inference without labeling the new images

Hi @abaranovskis-redsamurai,

Could you provide me the screenshot how the test data looks like before loading it to the model for inference?

I see that you are feeding the label information ['nertags'] as well during inference. When we are actually trying to predict the new image, how is it possible to give labels to them? Could you please throw some light on this?

Looking forward to your reply.

Thanks

Installation Tutorial

Is there a step by step installation tutorial?

I tried this installation on an Ubuntu 22.04 following https://github.com/katanaml/sparrow/blob/main/README.MD

Once i reached LLM:
./sparrow.sh ingest

I get :
Missing option '--file-path'.

But if i supplied path with the sample data,
./sparrow.sh ingest --file-path data/invoice_1.pdf

I get:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/llm/sparrow/sparrow-ml/llm/ingest.py:14 in run                                       │
│                                                                                                  │
│   11 def run(file_path: Annotated[str, typer.Option(help="The file to process")],                │
│   12 │   │   agent: Annotated[str, typer.Option(help="Ingest agent")] = "llamaindex"):           │
│   13 │   user_selected_agent = agent  # Modify this as needed                                    │
│ ❱ 14 │   ingest = get_ingest(user_selected_agent)      
...< and alot of codes>...
AttributeError: module 'threadpoolctl' has no attribute 'threadpool_limits'

How are annotations handled when training?

I've been reading the code on run_ocr.py, run_converter.py, run_donut.py and it seems the dataset that gets uploaded to huggingface only contains the image + the json file in this format: (Also checked in huggingface and it only contains image + ground_truth it seems)

{
  "header": {
    "invoice_no": "40378170",
    "invoice_date": "10/15/2012",
    "seller": "Patel, Thompson and Montgomery 356 Kyle Vista New James, MA 46228",
    "client": "Jackson, Odonnell and Jackson 267 John Track Suite 841 Jenniferville, PA 98601",
    "seller_tax_id": "958-74-3511",
    "client_tax_id": "998-87-7723",
    "iban": "GB77WRBQ31965128414006"
  },
  "items": [
    {
      "item_desc": "Leed's Wine Companion Bottle Corkscrew Opener Gift Box Set with Foil Cutter",
      "item_qty": "1,00",
      "item_net_price": "7,50",
      "item_net_worth": "7,50",
      "item_vat": "10%",
      "item_gross_worth": "8,25"
    }
  ],
  "summary": {
    "total_net_worth": "$7,50",
    "total_vat": "$0,75",
    "total_gross_worth": "$8,25"
  }
}

So when the model needs to be trained, it only uses that data? What happens to the annotations and bounding boxes?

I recently started learning ML and I though annotations with bounding boxes needed to be part of the dataset. I would really appreciate some explanation on how it works.

Thanks a lot for your patience 🙂

Training locally

Hi Andrej, I tried to run your Fine Tuning colab locally but I had so much troubles with the Dataloaders and the multiprocessors, basically they weren't recognizing the donutDataset function defined in the notebook, so I had to define it outside and import it, but after this when Im training the model with GPU the training starts but the loss function never decreases. My question is, do you have a notebook or a script where you did a local training, or do you know how to fix this issue?, I would appreciate it a lot.
Thank you for your attention.

Sparrow logo - No technical issue

Hi Commity , First of all I'd like to thank you for this amazing tool that helps me a lot to annotate unstructured data, For now I don't have any technical issue but for my final studies I'm preparing a presentation to show my approach using sparrow to annotate my custom data then refining the Donut model. All I need is the sparrow logo :P I can't find it anywhere.
Thanks in advance

Model assuming NoneType instead of string.

I am using gemma-7b model. I am getting this error for the above pdf:

ws_nm_ncqa_recred_oe_batch_desc
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/string_type
other_specialities
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/string_type
upin_number
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/string_type
taxonomy_code
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/string_type

Looks like wherever the output value in the document is None or No information provided, the model is assuming NoneType but I cannot give NoneType as type in the command as I won't know in advance which fields would be none. Is there a way the model could ignore such fields. Sme change in prompts? Or any other solution to this problem?

Cannot install requirements for sparrow-ml/llm

Thanks for what seems to be a nice template structure. However, I'm having a barrier to getting started because I'm not able to install the virtual environment in sparrow-ml/llm/requirements.txt - I've concluded it must be Python 3.10 since 3.11 and 3.9 both seem to lead to significant errors. I'm on 3.10.13 and I get
`ERROR: Cannot install -r requirements.txt (line 25) and haystack-pydoc-tools because these package versions have conflicting dependencies.

The conflict is caused by:
instructor 0.6.4 depends on docstring-parser<0.16 and >=0.15
pydoc-markdown 4.8.2 depends on docstring-parser<0.12 and >=0.11

To fix this you could try to:

loosen the range of package versions you've specified
remove package versions to allow pip attempt to solve the dependency conflict
`
Would be great if you could review and also confirm your python version and any dependencies.

api/chat not available

HTTPStatusError: Client error '404 Not Found' for url 'http://127.0.0.1:11434/api/chat'

I see a bug on line #15 of pdf_converter.py file

        # save the jpg file
        for page in pages:
            page.save(jpg_path + '/' + pdf_file.replace('.pdf', '') + '.jpg', 'JPEG')

This will overwrite with last page all the time. You may consider adding a counter initialized before the save loop and increment inside and add the counter to the name of the file to get separate file by page.

License and usage clarification

I just came across this promising project but the license and usage needs clarification.

The LICENSE in the repo is GPL
The License Section in the README says it's under the Apache License
And the Commercial usage section in the README and the mentioning of dual licensing options add even more confusion to it.

To be able to consider evaluating this project for potential uses, this needs to be sorted out.

Also considering the latest licensing change Headlines around Redis it would be good to be transparent about potential future plans around licensing.

Once the confusion about the Licensing is resolved, we can move on to having an actual look at sparrow itself.

LLM is disabled. Using mock LLM

I am getting this :

⠙ Loading documents...
⠦ Loading embedding model...
LLM is explicitly disabled. Using MockLLM.
⠋ Building index...

And I am getting json response as {} i.e. empty
Is it because of the LLM being disabled? What is the resaon for llm to be disabled and mock llm to be used?

grouping annotations disappearing when changing some other annotation

I don't know if this just happen to me but when I do the grouping annotations under the ordering tab, if I come back to the mapping tab and I add another annotation or I simply change one of the existing annotations that I have then the whole group column under the ordering tab gets blank.

This is annoying cause I have to be completely be sure that I will not add any more fields in the mapping tab or change any annotation before start making the groupings or I will have to group over and over for every change I make.

Also, this is a suggestion, it would be nice if you could group a bunch of annotations together in the ordering tab, right now you have to go one by one selecting in which group they belong, but if you have 50 of them that belong to the "summary" category for example, it would be better to select this 50 elements and group them into "summary" all at once, it would make the process much more faster.

thank you so much!

Random prediction and wrong prediction in repeated characters

Hello,

I have trained a donut base model on our custom dataset, which consists of a total of 12,480 images. I then fine-tuned this base model with default parameters.

During the analysis of predictions, I observed certain patterns in the JSON output. Specifically, when similar keys appear almost simultaneously, the model tends to make the following types of errors:

It predicts extra characters (e.g., "Paneer cheese paratha with butter" is predicted as "Paneer Paneer cheese paratha with butter").
It misses some characters (e.g., "199.00" is predicted as "19.00").
It predicts incorrect characters (e.g., "119.00" is predicted as "159.00").

Additionally, I noticed that the model often predicts characters such as "5," "7," and "1," even though these characters are not present in the images.

Ground Truth:

{
"table": [
{
"key": "Paneer paratha with butter",
"value": "199.00"
},
{
"key": "Paneer cheese paratha with butter",
"value": "119.00"
}
]
}

Prediction:

{
"table": [
{
"key": "Paneer paratha with butter",
"value": "19.00"
},
{
"key": "Paneer Paneer cheese paratha with butter",
"value": "159.00"
}
]
}

In the below json, model misses in between characters, predicts something else other than ground truth or gives extra characters in prediction which are not there in image/json. The image is clean enough for a model to get proper predictions still it gets wrong predictions as mentioned above.

As per analysis, the model makes more mistakes in values(Numeric) than keys(Alphabetic), maybe the reason is data imbalancing.

Ground Truth:

{
"table": [
{
"key": "Accessible Amount",
"value": "9123.23"
},
{
"key": "Car parts due :",
"value": "2,09,233.19"
},
{
"key": "Paint brushes :",
"value": "200.00"
}
]
}

Predicted:

{
"table": [
{
"key": "Accesible Amount",
"value": "9123.33"
},
{
"key": "Car parts due :",
"value": "9,1,233.19"
},
{
"key": "Paint brushes :",
"value": "200.000"
}
]
}

In the JSON provided below, despite the clarity of the image, the model consistently exhibits several issues:

Missing Characters: The model frequently fails to recognize certain characters.
Duplicate Keys: It tends to predict the same type of key multiple times, resulting in an extra key, such as "Oil fluid," which is a combination of two adjacent keys.
Missing Colon (:) at the End of Keys: The model omits the colon character at the end of keys.
Missing Plus Sign (+) in Values: It also overlooks the plus sign in values.

Ground Truth :

{
"table": [
{
"key": "Delivery charges :",
"value": "(+)470.00"
},
{
"key": "Oil charge:",
"value": "3,120.00"
},
{
"key": "Washer fluid :",
"value": "3,120.00"
}
]
}

Predicted:

{
"table": [
{
"key": "Delivery charges",
"value": "( )470.00"
},
{
"key": "Oil charge:",
"value": "3,120.00"
},
{
"key": "Oil fluid :",
"value": "157.00"
},
{
"key": "Washer fluid :",
"value": "3,120.00"
}
]
}

In the below json, I have found the same pattern that sometimes model predict a character only one time even after that character there two times in the image. like; (‘@ @’, ‘: :’) then the model will predict it only once. Also predicts the same keys twice.

Ground Truth:

{
"table": [
{
"key": "Transport charges::",
"value": "144.00"
},
{
"key": "Freight charges",
"value": ""
},
{
"key": "Washer fluid @ @ 18 %",
"value": "3,120.00"
}
]
}

Prediction:

{
"table": [
{
"key": "Transport charges:",
"value": "144.00"
},
{
"key": "Freight charges:",
"value": ""
},
{
"key": "Freight charges:",
"value": ""
},
{
"key": "Freight charges:",
"value": ""
},
{
"key": "Washer fluid @ 18 %",
"value": "3,120.00"
}
]
}

Pandas v2 currently installed by default but throws error (for sparrow-ui)

Hello Andrej,

I'm a big fan of this repo, thanks for creating it!

One issue I've encountered - for Sparrow-UI, after I install dependencies with pip install -r requirements.txt and then launch the data annotation UI with streamlit run main.py, I get the following error:

...
 File "/Users/max.epstein/opt/anaconda3/envs/spuienv/lib/python3.10/site-packages/st_aggrid/__init__.py", line 42, in __cast_date_columns_to_iso8601
    for c, d in dataframe.dtypes.iteritems():
  File "/Users/max.epstein/opt/anaconda3/envs/spuienv/lib/python3.10/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'iteritems'

This SO answer notes that Series.iteritems() was removed as of Pandas v2. And while Pandas is not mentioned in your requirements.txt file, one of your dependencies seems to currently install Pandas v2 (as I can see from running pip freeze | grep pandas in my venv.

I fixed this issue by running pip install "pandas<2.0" after running pip install -r requirements.txt, but I'd suggest adding this line to requirements.txt as a better solution. Happy to provide a PR with just that change if you'd like as well.

Configurable dataset location?

Hi Andrej,

Would you accept a PR to change all the paths with /invoices/ in the sparrow-data code files to use a configurable /dataset/ variable instead (to be set at the CLI and/or a config file)? I would keep the default value of dataset to be invoice so that default behavior of e.g. run_ocr.py would not change.

That way, I could run sparrow-data functions on my own data/path with just a CLI change. Currently, in my local sparrow code I have updated those paths to something with mydatasetname in the code files, but then I can't keep pulling in your main updates because I will always have a conflict in the code with your paths which reference /invoices/.

go back option while grouping word

I made a mistake while grouping the words of a label, how can I go back to correct this error?

retrain a model that is already pushed to hugging face

Im doing the fine-tuning of the donut model using your sparrow_finetuning_donut_v1.ipynb file, but I came across one problem: I have to specify the number of epochs I want to train the model for the dataset ( user/dataset) and once it finishes the training if I want to keep continuing the training for this model ( in this case user/dataset) I don't know how to do it without overwriting the existing training I already did, can you help me in this task?
thank you very much

Haystack giving timeout error

At the ingest step haystack agent is throwing timeout error. I just ran the command:
./sparrow.sh ingest --file-path /data/invoice_1.pdf --agent haystack --index-name Sparrow_haystack_doc1

Error:
ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=11434): Read timed out. (read timeout=120)

I have the following Image I want to apply this Repo on that Image

I have the following Image I want to apply this Repo on that Image Which is basically In Korean Language,

Can you please share a google Colab tutorial for it?

http connection error while using vprocessor

I am getting following error when I use vprocessor:

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8001): Max retries
exceeded with url: /api/v1/sparrow-ocr/inference (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at
0x71e175f92d10>: Failed to establish a new connection: [Errno 111] Connection
refused'))

extraction Invalid sparrow key error

I already installed the application through the docker method and the normal streamlit run method and I don't know what I'm doing wrong but when I try to execute the extraction for a given document it gives me this error:

{
"error":"Invalid Sparrow key."
}

Also I can't delete or create any labels or groups in the setup section.

Thanks for all your work.

What to do after annotation?

I am performing the following steps:

1.) Deleted all the pdf, images and json from all the folders. I only kept 5 pdf files in the "sparrow-data/docs/input/invoices/Dataset with valid information" folder in order to test the training process.
1.) Installed the all the relevent libraries in the sparrow-data and sparrow-ui folder.
2.) Ran run_ocr.py and run_converter.py in order to get processed json in "sparrow-data/docs/input/invoices/processed/output" folder.
3.) Copied the json files from "sparrow-data/docs/input/invoices/processed/output" to "sparrow-ui/docs/json" folder.
4.) Annotated the documents by running the streamlit app and exported the labels.

After this step what should I do next? I watched the 3 tutorials on how to prepare the dataset using sparrow annotation tool but I couldn't find instruction on what to do after annotations. If I run run_donut.py I get an error that metadata.jsonl file is empty. The training, validation and testing folders present in the sparrow-data folder are also empty except for a jsonl file which is again empty. Kindly guide me in the right direction!

Thanks in advance!

Invalid key when use ui

When I tried to extract data of an image by using sparrow-ui, I got error: "Invalid Sparrow key.".
It means that we have to use a key to post url?

Performance is slow on GPU as well

I got : Time to retrieve answer: 11.326609142999587
This model requires around 11-12 secs on GPU. I am using 'adrienbrault/nous-hermes2pro:Q5_K_M-json'
What are the ways to improve the speed of this model apart from using a higher config GPU? Can it work faster when data is processed in batches? Any other optimizations?
Also does the time to retrieve answer include ocr time? Else I have to adds around 5 more secs to the total time.

Have you seen this issue while installing requirements.txt of sparrow-ocr?

Collecting python-poppler==0.4.1
Using cached python_poppler-0.4.1.tar.gz (138 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [33 lines of output]
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 156, in prepare_metadata_for_build_wheel
hook = backend.prepare_metadata_for_build_wheel
AttributeError: module 'mesonpy' has no attribute 'prepare_metadata_for_build_wheel'

  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/usr/lib/python3/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 160, in prepare_metadata_for_build_wheel
      whl_basename = backend.build_wheel(metadata_directory, config_settings)
    File "/tmp/pip-build-env-qqwjak5s/overlay/local/lib/python3.10/dist-packages/mesonpy/__init__.py", line 985, in wrapper
      return func(*args, **kwargs)
    File "/tmp/pip-build-env-qqwjak5s/overlay/local/lib/python3.10/dist-packages/mesonpy/__init__.py", line 1038, in build_wheel
      with _project(config_settings) as project:
    File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
      return next(self.gen)
    File "/tmp/pip-build-env-qqwjak5s/overlay/local/lib/python3.10/dist-packages/mesonpy/__init__.py", line 912, in _project
      yield Project(source_dir, build_dir, meson_args, editable_verbose)
    File "/tmp/pip-build-env-qqwjak5s/overlay/local/lib/python3.10/dist-packages/mesonpy/__init__.py", line 635, in __init__
      self._meson = _get_meson_command(pyproject_config.get('meson'))
    File "/tmp/pip-build-env-qqwjak5s/overlay/local/lib/python3.10/dist-packages/mesonpy/__init__.py", line 947, in _get_meson_command
      meson_version = subprocess.run(cmd + ['--version'], check=False, text=True, capture_output=True).stdout
    File "/usr/lib/python3.10/subprocess.py", line 503, in run
      with Popen(*popenargs, **kwargs) as process:
    File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
      self._execute_child(args, executable, preexec_fn, close_fds,
    File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
      raise child_exception_type(errno_num, err_msg, err_filename)
  FileNotFoundError: [Errno 2] No such file or directory: 'meson'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

ModuleNotFoundError: No module named 'rapidfuzz.string_metric'

Hi. I am getting the error ModuleNotFoundError: No module named 'rapidfuzz.string_metric' when executing the following command python run_ocr.py.

I have installed the requirements before running the command. Also I have separately install rapidfuzz using pip install rapidfuzz but still getting the error. My python version is 3.8.0.
Any help is appreciated!

Thanks in advavce!

Issue while training the model.

Hi Folks, Greetings of the day to y'all. It is very commendable of the work that you have done and I highly appreciate the wok.
However, I found an error while running the training code on colab. I request the team to have a look into and get me the resolving steps. I am getting this error after running one epoch.
Below is the error I am finding.

HfHubHTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/models/katanaml-org/invoices-donut-model-v1/commit/main (Request ID: Root=1-64c20bc3-1d8ccb5577b6df8a7e365802;055e011f-9ecd-4145-8055-394d9c223361)

Forbidden: pass create_pr=1 as a query parameter to create a Pull Request

katanaml / sparrow Goto Github PK

sparrow's People

Contributors

Stargazers

Watchers

Forkers

sparrow's Issues

Recommend Projects

Recommend Topics

Recommend Org