
calamari_models's Introduction


Pretrained mixed models to be used with Calamari.

Short description of the available models

  • gt4histocr was trained on the entire GT4HistOCR corpus
  • antiqua/fraktur_historical used the corresponding data from the GT4HistOCR corpus
  • The ligs versions used the same data as the standard antiqua/fraktur_historical models but were refined on a manually enriched subset (roughly 100 lines per book, as far as we recall) containing ligatures etc.
  • fraktur19 is based on the DTA19 subset of the GT4HistOCR corpus and on additional Fraktur19 data; see here for further details
  • uw3-modern-english (formerly antiqua_modern) was trained on the UW3 dataset (link seems to be down)
  • idiotikon originated from a cooperation with Schweizerisches Idiotikon in an effort to OCR the entire dictionary. The model is based on the UW3 model and its codec contains a wide variety of mostly Latin-based characters including many diacritics. Consequently, it may be a good choice to recognize many modern languages.
  • 18th_century_french was trained in cooperation with the MiMoText project of the University of Trier. It is based on double-keyed transcriptions of French novels published between 1750 and 1800.
  • historical_french used data from 17th-, 18th-, and 19th-century French printings collected by Simon Gabay

General comments

  • All models are provided as a voting ensemble trained via calamari-cross-fold-train and containing five individual voters.
  • Most models have been trained (exclusively) on the output of the OCRopus nlbin script (gt4histocr used grayscale images, the rest binary ones) and will consequently work best on similar data.
  • All models have been trained using NFC normalization except the idiotikon one (NFD).
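The NFC/NFD distinction matters when comparing model output against ground truth: the same visible character can be encoded as one precomposed codepoint (NFC) or as a base letter plus combining marks (NFD), and the two encodings do not compare equal. A minimal illustration using Python's standard `unicodedata` module:

```python
import unicodedata

text = "Bücher"  # 'ü' as a single precomposed codepoint

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

# NFC keeps 'ü' as one codepoint; NFD splits it into 'u' + U+0308 (combining diaeresis)
print(len(nfc))  # 6
print(len(nfd))  # 7
print(nfc == nfd)                                 # False: different codepoint sequences
print(unicodedata.normalize("NFC", nfd) == nfc)   # True after re-normalizing
```

In practice this means ground truth and evaluation data for the idiotikon model should be NFD-normalized before computing character error rates, while the other models expect NFC.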

Future work

We are planning to considerably increase the number of available models in the near future, including more period-specific models and models that are specialized for, or at least more robust to, different binarization techniques. Furthermore, a student project will (hopefully) address the task of training various models for modern scripts and languages using (mainly) synthetic data.

If you want to contribute your own models or have any questions or model requests, please don't hesitate to contact us.

calamari_models's People

Contributors

chreul, chwick


calamari_models's Issues

U+EADA with using antiqua_historical_ligs 2020-06-05

When using https://github.com/Calamari-OCR/calamari_models/raw/d61781a9a17e20ca38faf71478185585ea227fd9/antiqua_historical_ligs/0.ckpt.h5 (together with the accompanying *.ckpt.* files)
with the current ocrd_all Docker image and this scan:

https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.210.tif

I get this XML:

...
ſe mit devant les rangs; & approchant de Xanthe, il uſa dune tromperie
qui lui reuit: E ce agir en honnete homme, dit. il, damener un ſecond,

Missing json file in fraktur_19th_century?

It seems that 0.ckpt.json is missing from https://github.com/Calamari-OCR/calamari_models/tree/master/fraktur_19th_century, which gives the following error when loading:

  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 228, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 228, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 102, in __init__
    ckpt = Checkpoint(checkpoint, auto_update=self.auto_update_checkpoints)
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/checkpoint.py", line 20, in __init__
    with open(self.json_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/arbeit/Documents/calamari-models/calamari_official_fraktur_19th_century/0.ckpt.json'

python version of models: "bad marshal data"

When loading Keras models, the Python version needs to match between the system the model was trained on and the system loading the file (cf. keras-team/keras#7440). I stumbled upon this when transferring models for inference to another machine running 3.8 instead of 3.7. Wouldn't it be helpful to include this version in the json and to provide a more useful error message based on that information? Is there a way to load and save the models in a way that updates them to another Python version?
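A lightweight mitigation along the lines suggested above (purely a sketch; this does not extend Calamari's actual checkpoint format, and the sidecar file is hypothetical) is to store the training-time Python version next to the model and compare it before loading, so the mismatch fails early with a clear message instead of an opaque "bad marshal data" error:

```python
import json
import sys

def write_version_stamp(path):
    # Record the interpreter version used at training time in a sidecar JSON file.
    with open(path, "w") as f:
        json.dump({"python": list(sys.version_info[:2])}, f)

def check_version_stamp(path):
    # Compare the recorded version against the current interpreter before loading.
    with open(path) as f:
        trained = tuple(json.load(f)["python"])
    current = tuple(sys.version_info[:2])
    if trained != current:
        raise RuntimeError(
            f"model was trained on Python {trained[0]}.{trained[1]}, "
            f"but this interpreter is {current[0]}.{current[1]}"
        )
```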

forever "Upgrading from version ..." with fraktur_19th_century (via OCR-D)

Thanks!
Just trying it... but it seems to take forever:

12:38:15.117 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari
-recognize -I OCR-D-N11 -O OCR-D-OCR -p {"checkpoint":"/usr/local/ocrd_models/ca
lamari/calamari_models/fraktur_19th_century/*.ckpt.json"}'
12:38:16.414 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N11'] output
_file_grp=['OCR-D-OCR']
Upgrading from version 2
Upgrading from version 3
Upgrading from version 4
Upgrading from version 5
Upgrading from version 6
Upgrading from version 7
Upgrading from version 8
...
Upgrading from version 1738637
Upgrading from version 1738638
Upgrading from version 1738639
...

I am aborting it now... what have I done wrong?

PS: The same command works with https://qurator-data.de/calamari-models/GT4HistOCR/model.tar.xz

Models from UW3 training?

Is there a place where we can find the models that were trained for the published paper? Or must we perform the training ourselves if we wish to use Calamari on modern text?

recommended way to download models

Short of cloning the entire repo or clicking on each individual checkpoint HDF5 and JSON file, is there a simple way to download models individually from this site? (Ideally one that can also be scripted...)

Or could you try to make GH release archives from them?
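One common workaround (not an official download mechanism; the per-model file layout below is an assumption based on the five-voter ensembles, so verify it against the repository listing) is to script downloads via GitHub's raw.githubusercontent.com endpoint. A Python sketch that only constructs the URLs; actually fetching them requires network access:

```python
# Sketch: build raw-file URLs for one model directory in this repository.
# Assumes each model ships five voters as {0..4}.ckpt.json / {0..4}.ckpt.h5.
BASE = "https://raw.githubusercontent.com/Calamari-OCR/calamari_models/master"

def model_urls(model_name, n_voters=5):
    urls = []
    for i in range(n_voters):
        urls.append(f"{BASE}/{model_name}/{i}.ckpt.json")
        urls.append(f"{BASE}/{model_name}/{i}.ckpt.h5")
    return urls

for url in model_urls("antiqua_historical"):
    print(url)
    # e.g. fetch with: urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```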

What version to run the model?

I am trying to run inference on

calamari_models-2.0/uw3-modern-english/0.ckpt

What version of everything should we be using? I am getting all kinds of complaints about mismatches and can't resolve them so far.

  • what python version
  • what tensorflow version
  • what calamari version

Exception: Downgrading of models is not supported (5 to 2). Please upgrade your Calamari instance (currently installed: 1.0.5)

And you also get this when trying to install the latest Calamari:

error: tensorflow 2.9.1 is installed but tensorflow<2.7.0,>=2.4.0 is required by {'tfaip'}

And no version satisfying that requirement is available via pip.
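The error above means the installed TensorFlow (2.9.1) falls outside the range tfaip pins (>=2.4.0, <2.7.0). A small self-contained check (the helper is ours, not part of Calamari or tfaip) that makes the constraint explicit without pulling in a packaging library:

```python
def version_tuple(v):
    """Parse '2.9.1' -> (2, 9, 1); pre-release suffixes are ignored for simplicity."""
    return tuple(int(part) for part in v.split(".")[:3])

def satisfies_tfaip_pin(tf_version):
    """True if tf_version is within tfaip's pin: >=2.4.0 and <2.7.0."""
    return (2, 4, 0) <= version_tuple(tf_version) < (2, 7, 0)

print(satisfies_tfaip_pin("2.6.2"))  # True
print(satisfies_tfaip_pin("2.9.1"))  # False: the version reported in the error
```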

What are these models trained on?

Hello there--

I was wondering if you could disclose any details on what these different models were trained on? Much appreciated!

Best,
James

GT4HistOCR models

"The current model gt4histocr was trained on the entire GT4HistOCR corpus."

A combined model spanning all subcorpora of GT4HistOCR is not a good idea for several reasons:

  1. The subcorpora used different transcription guidelines (labeling similar-looking glyphs with different Unicode characters). The resulting model will therefore get confused about these glyphs.
  2. The sizes of the subcorpora are vastly different. Currently the bulk is made up of 19th-century Fraktur, which will dominate the model and may even crowd out the learning of comparatively rare glyphs in the smaller subcorpora.
  3. The quality of the subcorpora (i.e., the amount of remaining errors in the ground truth) also varies widely. The incunabula and early modern Latin corpora have been checked much more carefully than the others.

For these reasons it is a much better idea to train separate models for the subcorpora.
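Point 1 can be made concrete with a common example: the long s (ſ, U+017F) and the round s look similar in historical printings but are distinct codepoints, so subcorpora that transcribe them differently give the model contradictory labels for near-identical glyphs. Python's `unicodedata` shows they are separate characters that only compatibility normalization (NFKC) folds together:

```python
import unicodedata

long_s, round_s = "\u017F", "s"

print(long_s == round_s)                                  # False: distinct codepoints
print(unicodedata.name(long_s))                           # LATIN SMALL LETTER LONG S
print(unicodedata.normalize("NFKC", long_s) == round_s)   # True: NFKC folds ſ to s
```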
