
calamari_models's Introduction


Pretrained mixed models to be used with Calamari.

Short description of the available models

  • gt4histocr was trained on the entire GT4HistOCR corpus
  • antiqua/fraktur_historical used the corresponding data from the GT4HistOCR corpus
  • The ligs versions used the same data as the standard antiqua/fraktur_historical models but were refined on a manually enriched subset (roughly 100 lines per book, as far as we recall) containing ligatures etc.
  • fraktur19 is based on the DTA19 subset of the GT4HistOCR corpus and on additional Fraktur19 data; see here for further details
  • uw3-modern-english (formerly antiqua_modern) was trained on the UW3 dataset (link seems to be down)
  • idiotikon originated from a cooperation with Schweizerisches Idiotikon in an effort to OCR the entire dictionary. The model is based on the UW3 model and its codec contains a wide variety of mostly Latin-based characters including many diacritics. Consequently, it may be a good choice to recognize many modern languages.
  • 18th_century_french was trained in cooperation with the MiMoText project of the University of Trier. It is based on double-keyed transcriptions of French novels published between 1750 and 1800.
  • historical_french used data from 17th-, 18th-, and 19th-century French printings collected by Simon Gabay

General comments

  • All models are provided as a voting ensemble trained via calamari-cross-fold-train and containing five individual voters.
  • Most models have been trained (exclusively) on the output of the OCRopus nlbin script (gt4histocr used grayscale images, the rest binary ones) and will consequently work best on similar data.
  • All models have been trained using NFC normalization except the idiotikon one (NFD).
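The NFC/NFD distinction matters when comparing model output against ground truth: the same visible character can be encoded as one precomposed codepoint (NFC) or as a base letter plus combining marks (NFD), and the two encodings do not compare equal. A minimal illustration using Python's standard `unicodedata` module:

```python
import unicodedata

text = "Bücher"  # 'ü' as a single precomposed codepoint

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

# NFC keeps 'ü' as one codepoint; NFD splits it into 'u' + U+0308 (combining diaeresis)
print(len(nfc))  # 6
print(len(nfd))  # 7
print(nfc == nfd)                                 # False: different codepoint sequences
print(unicodedata.normalize("NFC", nfd) == nfc)   # True after re-normalizing
```

In practice this means ground truth and evaluation data for the idiotikon model should be NFD-normalized before computing character error rates, while the other models expect NFC.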

Future work

We are planning to considerably increase the number of available models in the near future, including more period-specific models and models that are specialized for, or at least more robust to, different binarization techniques. Furthermore, a student project will (hopefully) address the task of training various models for modern scripts and languages using (mainly) synthetic data.

If you want to contribute your own models or have any questions or model requests, please don't hesitate to contact us.

calamari_models's People

Contributors

chreul, chwick


calamari_models's Issues

U+EADA with using antiqua_historical_ligs 2020-06-05

When using https://github.com/Calamari-OCR/calamari_models/raw/d61781a9a17e20ca38faf71478185585ea227fd9/antiqua_historical_ligs/0.ckpt.h5 (together with the accompanying *.ckpt.* files)
with the current ocrd_all Docker image and this scan:

https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.210.tif

I get this XML:

...
ſe mit devant les rangs; & approchant de Xanthe, il uſa dune tromperie
qui lui reuit: E ce agir en honnete homme, dit. il, damener un ſecond,

Missing json file in fraktur_19th_century?

It seems that 0.ckpt.json is missing from https://github.com/Calamari-OCR/calamari_models/tree/master/fraktur_19th_century, which gives the following error when loading:

  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 228, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 228, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/predictor.py", line 102, in __init__
    ckpt = Checkpoint(checkpoint, auto_update=self.auto_update_checkpoints)
  File "/opt/miniconda3/envs/nteract/lib/python3.7/site-packages/calamari_ocr/ocr/checkpoint.py", line 20, in __init__
    with open(self.json_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/arbeit/Documents/calamari-models/calamari_official_fraktur_19th_century/0.ckpt.json'

python version of models: "bad marshal data"

When loading Keras models, the Python version needs to match between the system the model was trained on and the system loading the file (cf. keras-team/keras#7440). I stumbled upon this when transferring models for inference to another machine running 3.8 instead of 3.7. Wouldn't it be helpful to include this version in the json and to provide a more useful error message based on that information? Is there a way to load and save the models in a way that updates them to another Python version?
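A lightweight mitigation along the lines suggested above (purely a sketch; this does not extend Calamari's actual checkpoint format, and the sidecar file is hypothetical) is to store the training-time Python version next to the model and compare it before loading, so the mismatch fails early with a clear message instead of an opaque "bad marshal data" error:

```python
import json
import sys

def write_version_stamp(path):
    # Record the interpreter version used at training time in a sidecar JSON file.
    with open(path, "w") as f:
        json.dump({"python": list(sys.version_info[:2])}, f)

def check_version_stamp(path):
    # Compare the recorded version against the current interpreter before loading.
    with open(path) as f:
        trained = tuple(json.load(f)["python"])
    current = tuple(sys.version_info[:2])
    if trained != current:
        raise RuntimeError(
            f"model was trained on Python {trained[0]}.{trained[1]}, "
            f"but this interpreter is {current[0]}.{current[1]}"
        )
```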

forever "Upgrading from version ..." with fraktur_19th_century (via OCR-D)

Thanks!
Just trying it... but it seems to take forever:

12:38:15.117 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari
-recognize -I OCR-D-N11 -O OCR-D-OCR -p {"checkpoint":"/usr/local/ocrd_models/ca
lamari/calamari_models/fraktur_19th_century/*.ckpt.json"}'
12:38:16.414 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N11'] output
_file_grp=['OCR-D-OCR']
Upgrading from version 2
Upgrading from version 3
Upgrading from version 4
Upgrading from version 5
Upgrading from version 6
Upgrading from version 7
Upgrading from version 8
...
Upgrading from version 1738637
Upgrading from version 1738638
Upgrading from version 1738639
...

I am aborting it now... what have I done wrong?

PS: The same command works with https://qurator-data.de/calamari-models/GT4HistOCR/model.tar.xz

Models from UW3 training?

Is there a place where we can find the models that were trained for the published paper? Or must we perform the training ourselves if we wish to use Calamari on modern text?

recommended way to download models

Short of cloning the entire repo or clicking on each individual checkpoint HDF5 and JSON file, is there a simple way to download models individually from this site? (Ideally one that can also be scripted...)

Or could you try to make GH release archives from them?
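One common workaround (not an official download mechanism; the per-model file layout below is an assumption based on the five-voter ensembles, so verify it against the repository listing) is to script downloads via GitHub's raw.githubusercontent.com endpoint. A Python sketch that only constructs the URLs; actually fetching them requires network access:

```python
# Sketch: build raw-file URLs for one model directory in this repository.
# Assumes each model ships five voters as {0..4}.ckpt.json / {0..4}.ckpt.h5.
BASE = "https://raw.githubusercontent.com/Calamari-OCR/calamari_models/master"

def model_urls(model_name, n_voters=5):
    urls = []
    for i in range(n_voters):
        urls.append(f"{BASE}/{model_name}/{i}.ckpt.json")
        urls.append(f"{BASE}/{model_name}/{i}.ckpt.h5")
    return urls

for url in model_urls("antiqua_historical"):
    print(url)
    # e.g. fetch with: urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```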

What version to run the model?

I am trying to run inference on

calamari_models-2.0/uw3-modern-english/0.ckpt

What version of everything should we be using? I am getting all kinds of complaints about mismatches and can't resolve them so far.

  • what python version
  • what tensorflow version
  • what calamari version

Exception: Downgrading of models is not supported (5 to 2). Please upgrade your Calamari instance (currently installed: 1.0.5)

And you also get this when trying to install the latest Calamari:

error: tensorflow 2.9.1 is installed but tensorflow<2.7.0,>=2.4.0 is required by {'tfaip'}

And no version satisfying that requirement is available via pip.
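The error above means the installed TensorFlow (2.9.1) falls outside the range tfaip pins (>=2.4.0, <2.7.0). A small self-contained check (the helper is ours, not part of Calamari or tfaip) that makes the constraint explicit without pulling in a packaging library:

```python
def version_tuple(v):
    """Parse '2.9.1' -> (2, 9, 1); pre-release suffixes are ignored for simplicity."""
    return tuple(int(part) for part in v.split(".")[:3])

def satisfies_tfaip_pin(tf_version):
    """True if tf_version is within tfaip's pin: >=2.4.0 and <2.7.0."""
    return (2, 4, 0) <= version_tuple(tf_version) < (2, 7, 0)

print(satisfies_tfaip_pin("2.6.2"))  # True
print(satisfies_tfaip_pin("2.9.1"))  # False: the version reported in the error
```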

What are these models trained on?

Hello there--

I was wondering if you could disclose any details on what these different models were trained on? Much appreciated!

Best,
James

GT4HistOCR models

"The current model gt4histocr was trained on the entire GT4HistOCR corpus."

A combined model spanning all subcorpora of GT4HistOCR is not a good idea for several reasons:

  1. The subcorpora used different transcription guidelines (labeling similar-looking glyphs with different Unicode characters). The resulting model will therefore get confused about these glyphs.
  2. The sizes of the subcorpora are vastly different. Currently the bulk is made up of 19th-century Fraktur, which will dominate the model and may even crowd out the learning of comparatively rare glyphs in the smaller subcorpora.
  3. The quality of the subcorpora (i.e., the amount of remaining errors in the ground truth) also varies widely. The incunabula and early modern Latin corpora have been checked much more carefully than the others.

For these reasons it is a much better idea to train separate models for the subcorpora.
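Point 1 can be made concrete with a common example: the long s (ſ, U+017F) and the round s look similar in historical printings but are distinct codepoints, so subcorpora that transcribe them differently give the model contradictory labels for near-identical glyphs. Python's `unicodedata` shows they are separate characters that only compatibility normalization (NFKC) folds together:

```python
import unicodedata

long_s, round_s = "\u017F", "s"

print(long_s == round_s)                                  # False: distinct codepoints
print(unicodedata.name(long_s))                           # LATIN SMALL LETTER LONG S
print(unicodedata.normalize("NFKC", long_s) == round_s)   # True: NFKC folds ſ to s
```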
