shrutirij / ocr-post-correction

License: Other


ocr-post-correction's Introduction

OCR Post Correction for Endangered Language Texts

This repository contains code for models and experiments from the paper "OCR Post Correction for Endangered Language Texts".

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books. Extracting the text is challenging because there is typically no annotated data to train an OCR system for each endangered language. Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

📌 In the paper, we present a dataset containing annotations for documents in three critically endangered languages: Ainu, Griko, Yakkha.

📌 Our model reduces the recognition error rate by 34% on average, over a state-of-the-art OCR system.

Learn more about the paper here!

OCR Post-Correction

The goal of OCR post-correction is to automatically correct errors in the text output from an existing OCR system.

The existing OCR system is used to obtain a first pass transcription of the input image (example below in the endangered language Griko):

The incorrectly recognized characters in the first pass are then corrected by the post-correction model.
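The improvement from post-correction is usually quantified with a character error rate: the edit distance between a transcription and the gold reference, divided by the reference length. The following is a minimal sketch of that measurement in plain Python. The function names and the toy strings are made up for illustration and are not taken from this repository or its dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_error_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error removed by post-correction."""
    return (cer_before - cer_after) / cer_before


# Toy strings (hypothetical, not from the dataset).
reference = "endangered"
first_pass = "endangcrcd"      # two misrecognized characters
post_corrected = "endangercd"  # one error left after correction

cer_first = character_error_rate(reference, first_pass)
cer_post = character_error_rate(reference, post_corrected)
print(cer_first, cer_post, relative_error_reduction(cer_first, cer_post))
```

In practice the rates would be aggregated over a whole test set rather than a single string.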

Model

As seen in the example above, OCR post-correction is a text-based sequence-to-sequence task.

📌 We use a character-level encoder-decoder architecture with attention and add several adaptations for the low-resource setting. The paper has all the details!

📌 The model is trained in a supervised manner. The training data consists of first pass OCR outputs as the source with corresponding manually corrected transcriptions as the target.

📌 Some books that contain texts in endangered languages also contain translations of the text in another (usually high-resource) language. We incorporate an additional encoder in the model, with a multisource framework, to use the information from these translations if they are available.
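The supervised setup described above pairs each first pass OCR line (source) with its manually corrected transcription (target), both treated as character sequences. The sketch below shows one way such pairs and a character vocabulary could be built. The helper names, the `<unk>` symbol, and the sample strings are assumptions for illustration; the repository's own data loading may differ.

```python
from typing import Dict, List, Tuple

CharPair = Tuple[List[str], List[str]]


def make_char_pairs(src_lines: List[str], tgt_lines: List[str]) -> List[CharPair]:
    """Pair each OCR line with its corrected line, split into characters."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    return [
        (list(src.rstrip("\n")), list(tgt.rstrip("\n")))
        for src, tgt in zip(src_lines, tgt_lines)
    ]


def build_char_vocab(pairs: List[CharPair]) -> Dict[str, int]:
    """Map every character seen in the data to an integer id."""
    vocab = {"<unk>": 0}  # hypothetical placeholder for unseen characters
    for src, tgt in pairs:
        for ch in src + tgt:
            vocab.setdefault(ch, len(vocab))
    return vocab


# Hypothetical one-line dataset: the OCR output confuses "l" with "1".
pairs = make_char_pairs(["ab1\n"], ["abl\n"])
vocab = build_char_vocab(pairs)
print(pairs)
print(vocab)
```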

We provide instructions for both single-source and multisource models:

The single-source model can be used for almost any document and is significantly easier to set up.
The multisource model can only be used if translations are available.

Dataset

This repository contains a sample from our dataset in sample_dataset, which you can use to train the post-correction model. Get the full dataset here!

However, this repository can be used to train OCR post-correction models for documents in any language!

🚀 If you want to use our model with a new set of documents, construct a dataset by following the steps here.

🚀 We'd love to hear about the new datasets and models you build: send us an email at [email protected]!

Running Experiments

Once you have a suitable dataset (e.g., sample_dataset or your own dataset), you can train a model and run experiments on OCR post-correction.

If you have your own dataset, you can use the utils/prepare_data.py script to create train, development, and test splits (see the last step here).
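For intuition, a train/development/test split can be sketched as below. This is not the logic of utils/prepare_data.py, whose options are documented in the repository; the function name, the 10% fractions, and the fixed seed are assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T],
    dev_frac: float = 0.1,   # assumed fraction, adjust to your data
    test_frac: float = 0.1,  # assumed fraction, adjust to your data
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle reproducibly, then cut into train, dev, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_dataset(list(range(100)))
print(len(train), len(dev), len(test))
```

With very small datasets, the rounding in `int(...)` can leave the development or test split empty, so it is worth checking the sizes before training.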

The steps are described below, illustrated with sample_dataset/postcorrection. If using another dataset, simply change the experiment settings to point to your dataset and run the same scripts.

Requirements

Python 3+ is required. Pip can be used to install the packages:

pip install -r postcorr_requirements.txt

Training

The process of training the post-correction model has two main steps:

Pretraining with first pass OCR outputs.
Training with manually corrected transcriptions in a supervised manner.

For a single-source model, modify the experimental settings in train_single-source.sh to point to the appropriate dataset and desired output folder. It is currently set up to use sample_dataset.

Then run

bash train_single-source.sh

For multisource, use train_multi-source.sh.

Log files and saved models are written to the user-specified experiment folder for both the pretraining and training steps. For a list of all available hyperparameters and options, look at postcorrection/constants.py and postcorrection/opts.py.

Testing

For testing with a single-source model, modify the experimental settings in test_single-source.sh. It is currently set up to use sample_dataset.

Then run

bash test_single-source.sh

For multisource, use test_multi-source.sh.

Citation

Please cite our paper if this repository was useful.

@inproceedings{rijhwani-etal-2020-ocr,
    title = "{OCR} {P}ost {C}orrection for {E}ndangered {L}anguage {T}exts",
    author = "Rijhwani, Shruti  and
      Anastasopoulos, Antonios  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.478",
    doi = "10.18653/v1/2020.emnlp-main.478",
    pages = "5931--5942",
}

License


ocr-post-correction's Issues

Dynet dynamic memory allocation

I have installed DyNet with GPU support as described in the docs, and --dynet-mem is set in the train_single-source.sh file. Even so, I get the error below. The full traceback follows.

[dynet] Device Number: 2
[dynet] Device name: GeForce GTX 1080 Ti
[dynet] Memory Clock Rate (KHz): 5505000
[dynet] Memory Bus Width (bits): 352
[dynet] Peak Memory Bandwidth (GB/s): 484.44
[dynet] Memory Free (GB): 11.5464/11.7215
[dynet] Device(s) selected: 2
[dynet] random seed: 2652333402
[dynet] using autobatching
[dynet] allocating memory: 6000MB
[dynet] memory allocation done.
Param, load_model: None
Traceback (most recent call last):
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/multisource_wrapper.py", line 65, in <module>
    pretrainer = PretrainHandler(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 81, in __init__
    self.pretrain_model(pretrain_src1, pretrain_src2, pretrain_tgt, epochs)
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/pretrain_handler.py", line 88, in pretrain_model
    self.seq2seq_trainer.train(
  File "/mnt/data/souvik/sanskrit/ocr-post-correction/postcorrection/seq2seq_trainer.py", line 55, in train
    batch_loss.backward()
  File "_dynet.pyx", line 823, in _dynet.Expression.backward
  File "_dynet.pyx", line 842, in _dynet.Expression.backward
ValueError: Dynet does not support both dynamic increasing of memory pool size, and automatic batching or memory checkpointing. If you want to use automatic batching or checkpointing, please pre-allocate enough memory using the --dynet-mem command line option (details http://dynet.readthedocs.io/en/latest/commandline.html).

TypeError: unsupported operand type(s) for -: 'int' and '_dynet.Expression'

Hi! I've encountered an issue related to scalar-matrix arithmetic operations.
Whenever a multiplication, addition, subtraction, or division involves a float/int and an instance of dynet.Expression, this exception is raised if the float/int is the left operand.
If the dynet.Expression instance is the left operand instead, no exception occurs. I suppose this is due to the strict argument order of the scalar-matrix operators (matrix first, then scalar): https://dynet.readthedocs.io/en/latest/operations.html

Traceback (most recent call last):
  File "postcorrection/multisource_wrapper.py", line 63, in <module>
    pretrainer = PretrainHandler(
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/pretrain_handler.py", line 81, in __init__
    self.pretrain_model(pretrain_src1, pretrain_src2, pretrain_tgt, epochs)
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/pretrain_handler.py", line 88, in pretrain_model
    self.seq2seq_trainer.train(
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/seq2seq_trainer.py", line 53, in train
    losses.append(self.model.get_loss(src1, src2, tgt))
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/multisource_model.py", line 295, in get_loss
    return self.decode_loss(src1, src2, tgt)
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/multisource_model.py", line 278, in decode_loss
    probs, _ = self.get_pointergen_probs(
  File "/Users/fyodorsizov/Documents/git/ocr-post-correction/postcorrection/multisource_model.py", line 204, in get_pointergen_probs
    copy_probs = a_t * (1 - p_gen)
TypeError: unsupported operand type(s) for -: 'int' and '_dynet.Expression'

Have you seen something like this before? Any idea how to fix this other than changing the order of terms inside the code or overriding __mul__, __add__, etc methods?
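The asymmetry described here matches how Python dispatches binary operators: `1 - x` first tries `int.__sub__`, and only succeeds for a custom type if that type defines the reflected method `__rsub__`. The toy classes below reproduce the same TypeError without DyNet. They are hypothetical stand-ins, not DyNet's Expression, and whether a given DyNet build lacks the reflected methods is an assumption. One possible cause, not confirmed in this thread, is the Cython version used to build DyNet, since Cython 3 switched binary operator methods to Python semantics.

```python
class LeftOnly:
    """Toy value type that only handles `self - number`."""

    def __init__(self, value):
        self.value = value

    def __sub__(self, other):
        if isinstance(other, (int, float)):
            return type(self)(self.value - other)
        return NotImplemented


class WithReflected(LeftOnly):
    """Same type, plus the reflected method used for `number - self`."""

    def __rsub__(self, other):
        if isinstance(other, (int, float)):
            return type(self)(other - self.value)
        return NotImplemented


p_gen = LeftOnly(0.25)
print((p_gen - 1).value)          # works: the custom type is on the left

try:
    1 - p_gen                     # int.__sub__ fails, no __rsub__ to fall back on
except TypeError as err:
    print(err)

print((1 - WithReflected(0.25)).value)  # works through __rsub__
```

Under that assumption, rewriting `1 - p_gen` so that the expression is the left operand avoids the reflected call entirely, which is the reordering workaround mentioned in the question.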

Help with installing Dynet - GPU

Hi,
I am facing some issues while installing DyNet with GPU support for CUDA 11.1. Could you tell me whether you used the CPU or the GPU version of DyNet? If GPU, which versions of DyNet and Eigen did you use?

System Specifications:
Cuda Version - 11.1

I have tried the following versions.

Build Command:
To avoid the "Unsupported GPU architecture for compute_30" error at build time, I used the command below.

cmake .. -DEIGEN3_INCLUDE_DIR=../eigen -DPYTHON=$(which python) -DBACKEND=cuda -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.1

Dynet	Eigen	Error
Latest(master branch)	Eigen 3.2	CX11 folder is not present
Latest(master branch)	Eigen 3.3	identifier std::round is undefined in device code
Latest(master branch)	Eigen 3.3.7	identifier std::round is undefined in device code
Latest(master branch)	Eigen 3.4	Error while running make
Dynet 2.0.3	Eigen-2355b22	Unsupported GPU architecture for compute_30 (while running make)
Dynet 2.1	Eigen-b2e267dc99d4.zip	Unsupported GPU architecture for compute_30 (while running make)

a vs "a" without a hat

In the example diagram showing the training model for corrections, I noticed that "ka" is not corrected. How do you make corrections for the two variants of the IPA "a"?

Error installing dependencies

Hi,

I'm trying to run and test this repo, but I'm getting an error when I try to install the dependencies.

Ubuntu 22.04 LTS

pip install -r postcorr_requirements.txt

...

  INFO:root:/usr/bin/g++ -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-310/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.o -L/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -L. -L/usr/lib/x86_64-linux-gnu -Wl,--enable-new-dtags,-R/tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/dynet/ -Wl,--enable-new-dtags,-R/usr/lib/ -ldynet -o /tmp/pip-install-0pfrr3uk/dynet_233344f25cf84b9098f2fbac1c2cd338/build/py3.10-64bit/python/_dynet.cpython-310-x86_64-linux-gnu.so -Wl,-rpath='/usr/lib/',--no-as-needed
  INFO:root:Copying built extensions...
  [100%] Built target pydynet
  INFO:root:Installing...
  Consolidate compiler generated dependencies of target dynet
  [ 98%] Built target dynet
  [ 98%] Built target pydynet_precopy
  [100%] Built target pydynet
  Install the project...
  -- Install configuration: "Release"
  CMake Error at dynet/cmake_install.cmake:46 (file):
    file cannot create directory: /usr/include/dynet.  Maybe need
    administrative privileges.
  Call Stack (most recent call first):
    cmake_install.cmake:47 (include)
  
  
  make: *** [Makefile:100: install] Erro 1
  error: /usr/bin/make install
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for dynet
Failed to build dynet
ERROR: Could not build wheels for dynet, which is required to install pyproject.toml-based projects

I noticed this part of the output:

file cannot create directory: /usr/include/dynet. Maybe need
administrative privileges.

I then tried sudo pip install -r postcorr_requirements.txt and got this output:

Collecting dynet
  Using cached dyNET-2.1.2.tar.gz (509 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting editdistance
  Using cached editdistance-0.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)
Collecting Levenshtein
  Using cached Levenshtein-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (169 kB)
Collecting cython
  Using cached Cython-3.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Requirement already satisfied: numpy in /usr/lib/python3/dist-packages (from dynet->-r postcorr_requirements.txt (line 1)) (1.21.5)
Collecting rapidfuzz<4.0.0,>=3.1.0
  Using cached rapidfuzz-3.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Building wheels for collected packages: dynet
  Building wheel for dynet (pyproject.toml) ... done
  Created wheel for dynet: filename=dyNET-2.1.2-cp310-cp310-linux_x86_64.whl size=3542372 sha256=af2a6936a12d7d17d77059ff66ba6353ab8d7ac543664134e29349e04c741b28
  Stored in directory: /root/.cache/pip/wheels/2d/39/d7/01b76ca1370da9de9825b7051a8fd9aff320b254e2bba7ccce
Successfully built dynet
Installing collected packages: rapidfuzz, editdistance, cython, Levenshtein, dynet
Successfully installed Levenshtein-0.23.0 cython-3.0.6 dynet-2.1.2 editdistance-0.6.2 rapidfuzz-3.5.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Is there a better way to install?

Distributed training and auto-batching with this codebase

Hi @shrutirij thanks for the great work and very well-documented and clean codebase, I greatly appreciate it!

I adapted and tried running this codebase for some conceptually similar experiments and I've observed a few quirks that I wanted to run by you to get your thoughts since I haven't really used Dynet before.

I ran the codebase with --dynet-gpus 8 (after also modifying opts.py to support this arg) and found that although 8 processes are spawned and attached to 8 GPUs, only the first process has > 0% GPU utilization. It appears that this codebase doesn't support distributed training in its current form. Is that accurate? Is there an equivalent to PyTorch's DistributedDataParallel and DistributedSampler that I can use to perform data-parallel training and inference with Dynet? It would greatly speed up my experiments.
It appears that the training time with a CPU is the same as the training time with GPU when using 1 GPU via provision of the --dynet-gpu flag. Is this what you noticed too during your runs? If not, could you suggest how I can get this to run faster with a GPU?
It appears that the Dynet auto-batching feature isn't working, because I tried running the code with and without the --dynet-autobatch 1 flag and the run-time doesn't seem to change. I see the main training loop looks like the following (where minibatch_size is always set to 1 here):

# One parameter update per minibatch of `minibatch_size` examples.
for i in range(0, len(train_data), minibatch_size):
    cur_size = min(minibatch_size, len(train_data) - i)
    losses = []
    dy.renew_cg()  # start a fresh computation graph for this minibatch
    for (src1, src2, tgt) in train_data[i : i + cur_size]:
        losses.append(self.model.get_loss(src1, src2, tgt))
    batch_loss = dy.esum(losses)  # sum the per-example losses
    batch_loss.backward()
    trainer.update()
    epoch_loss += batch_loss.scalar_value()
logging.info("Epoch loss: %0.4f" % (epoch_loss / len(train_data)))

Doesn't this mean that cur_size is always 1, causing the inner for loop to just iterate over a list of size 1 by default? If I were to override minibatch_size to, say, 32, how does Dynet ensure that 1 forward operation occurs per batch of 32 examples instead of 32 separate forward passes?
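The indexing in the loop can be checked in isolation. The sketch below reproduces only the slicing arithmetic, with a hypothetical helper name, and confirms that a minibatch size of 1 yields one example per update. It says nothing about how DyNet merges operations inside a minibatch, which depends on the autobatching described in DyNet's own documentation.

```python
from typing import Iterator, Tuple


def minibatch_slices(num_examples: int, minibatch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, size) for each minibatch, mirroring the loop's indexing."""
    for i in range(0, num_examples, minibatch_size):
        yield i, min(minibatch_size, num_examples - i)


# With minibatch_size=1, every minibatch holds exactly one example.
print(list(minibatch_slices(5, 1)))

# With minibatch_size=32 and 70 examples, the last minibatch is smaller.
print(list(minibatch_slices(70, 32)))
```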

Thanks a lot for your time and thanks again for the great work toward protecting endangered languages!

Great paper

Need help with anything? Code refactoring, etc.?

Segmentation Fault: Pretraining Epoch 0

The training process is interrupted by a segmentation fault during the very first epoch as part of the pretraining process. The error encountered is as follows:

[dynet] random seed: 1678755796
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
[dynet] random seed: 2822143777
[dynet] using autobatching
[dynet] allocating memory: 10000MB
[dynet] memory allocation done.
train_single-source.sh: line 65: 1310309 Segmentation fault (core dumped) python3 postcorrection/multisource_wrapper.py --dynet-mem $dynet_mem --dynet-autobatch 1 --pretrain_src1 $pretrain_src --pretrain_tgt $pretrain_tgt $params --single --vocab_folder $expt_folder/vocab --output_folder $expt_folder --model_name $pretrained_model_name --pretrain_only

I was able to narrow the problem down to the call batch_loss.backward() on line 80 of lm_trainer.py.

The following issue may hint at the potential problem:
https://github.com/clab/dynet/issues/308

Has anyone encountered this problem before?