amazon-science / semimtr-text-recognition

Multimodal Semi-Supervised Learning for Text Recognition (SemiMTR)

License: Apache License 2.0

Python 92.57% Jupyter Notebook 7.43%
computer-vision consistency-regularization contrastive-learning ocr scene-text-recognition semi-supervised-learning text-recognition deep-learning pytorch self-supervised-learning

Multimodal Semi-Supervised Learning for Text Recognition

The official code implementation of SemiMTR.

Paper | Pretrained Models | SeqCLR Paper | Citation | Demo

Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman

We introduce a multimodal semi-supervised learning algorithm for text recognition, which is customized for modern vision-language multimodal architectures. To this end, we present a unified one-stage pretraining method for the vision model, which suits scene text recognition. In addition, we offer a sequential, character-level, consistency regularization in which each modality teaches itself. Extensive experiments demonstrate state-of-the-art performance on multiple scene text recognition benchmarks.

Figures

Figure 1: SemiMTR vision model pretraining: Contrastive learning

Figure 2: SemiMTR model fine-tuning: Consistency regularization
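
The character-level consistency regularization named in Figure 2 can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: the function names, the confidence threshold, and the choice of a KL divergence between a "teacher" prediction (for example, from a weakly augmented view) and a "student" prediction (from a stronger augmentation of the same image) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one character position."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def char_consistency_loss(teacher_logits, student_logits, threshold=0.5):
    """Mean KL(teacher || student) over confidently predicted characters.

    Both arguments are lists of per-character logit vectors
    (sequence length x number of classes). Positions where the teacher's
    top probability is below `threshold` are skipped.
    The threshold and the KL form are assumptions of this sketch.
    """
    total, kept = 0.0, 0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        teacher = softmax(t_logits)  # treated as a fixed target
        student = softmax(s_logits)
        if max(teacher) < threshold:
            continue  # teacher is unsure about this character
        total += sum(
            p * math.log(p / max(q, 1e-12))
            for p, q in zip(teacher, student)
            if p > 0
        )
        kept += 1
    return total / kept if kept else 0.0


if __name__ == "__main__":
    teacher = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
    student = [[0.0, 4.0, 0.0], [0.0, 3.0, 0.0]]
    print(char_consistency_loss(teacher, student))
```

Under this reading, "each modality teaches itself" would mean applying such a loss separately to each modality's own predictions on the two views. The loss is zero when both views agree and grows as the student's per-character distribution drifts from the teacher's.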

Getting Started

Run Demo with Pretrained Model (Open In Colab)

Dependencies

  • Inference and the demo require PyTorch >= 1.7.1.
  • For training and evaluation, install the dependencies:
pip install -r requirements.txt
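
To confirm that an environment meets the PyTorch requirement above, a version check along these lines may help. It is a minimal sketch: the helper names `version_tuple` and `meets_minimum` are made up for this example, and only the `1.7.1` minimum comes from this README.

```python
import re

MIN_TORCH = "1.7.1"  # minimum version stated above


def version_tuple(version):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring any suffix."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in numeric.group().split("."))


def meets_minimum(installed, minimum=MIN_TORCH):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old"
        print(f"PyTorch {torch.__version__}: {status}")
```

Comparing integer tuples rather than raw strings keeps `1.10.0` correctly ordered after `1.7.1`.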

Pretrained Models

Download pretrained models:

Pretrained vision models:

Pretrained language model:

To fine-tune SemiMTR without running the vision and language pretraining yourself, place the above models in a workdir directory, as follows:

workdir
├── semimtr_vision_model_real_l_and_u.pth
├── abinet_language_model.pth
└── semimtr_real_l_and_u.pth
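
Before starting a run, it can be useful to verify that the directory matches the layout above. The following is a small sketch; the file names are copied from the listing, while the helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_CHECKPOINTS = (
    "semimtr_vision_model_real_l_and_u.pth",
    "abinet_language_model.pth",
    "semimtr_real_l_and_u.pth",
)


def missing_checkpoints(workdir="workdir"):
    """Return the expected checkpoint names that are absent from `workdir`."""
    root = Path(workdir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from workdir:", ", ".join(missing))
    else:
        print("All pretrained models are in place.")
```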

SemiMTR Models Accuracy

Training Data      IIIT   SVT    IC13   IC15   SVTP   CUTE   Avg.   | COCO   RCTW   Uber   ArT    LSVT   MLT19  ReCTS  Avg.
Synth (ABINet)     96.4   93.2   95.1   82.1   89.0   89.2   91.2   | 63.1   59.7   39.6   68.3   59.5   85.0   86.7   52.0
Real-L+U           97.0   95.8   96.1   84.7   90.7   94.1   92.8   | 72.2   76.1   58.5   71.6   77.1   90.4   92.4   65.4
Real-L+U+Synth     97.4   96.8   96.5   84.7   92.9   95.1   93.3   | 73.0   75.7   58.6   72.4   77.5   90.4   93.1   65.8
Real-L+U+TextOCR   97.3   97.7   96.9   86.0   92.2   94.4   93.7   | 73.8   77.7   58.6   73.5   78.3   91.3   93.3   66.1
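
To read the table as gains over the synthetic-data baseline, the two "Avg." columns can be compared directly. The snippet below is only arithmetic on the numbers reported above; the helper name `gain_over_baseline` is made up for this example.

```python
# Average accuracies copied from the two "Avg." columns of the table above.
AVERAGES = {
    "Synth (ABINet)": (91.2, 52.0),
    "Real-L+U": (92.8, 65.4),
    "Real-L+U+Synth": (93.3, 65.8),
    "Real-L+U+TextOCR": (93.7, 66.1),
}


def gain_over_baseline(row, baseline="Synth (ABINet)"):
    """Return the difference in both averages, in percentage points."""
    return tuple(
        round(value - base, 1)
        for value, base in zip(AVERAGES[row], AVERAGES[baseline])
    )


if __name__ == "__main__":
    for name in AVERAGES:
        first, second = gain_over_baseline(name)
        print(f"{name:<18} {first:+.1f} {second:+.1f}")
```

For instance, the Real-L+U row is 1.6 points above the baseline on the first average and 13.4 points above it on the second.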

Datasets

  • Download the preprocessed LMDB datasets for training and evaluation. Link
  • For training the language model, download WikiText103. Link
  • The final structure of the data directory can be found in DATA.md.

Training

  1. Pretrain vision model
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_pretrain_vision_model.yaml
    
  2. Pretrain language model
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
    
  3. Train SemiMTR
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/semimtr_finetune.yaml
    

Note:

  • You can set the checkpoint paths of the vision and language models separately to load specific pretrained models, or set them to None to train from scratch.
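
The three training stages above can also be chained from a single script. The sketch below assumes it is run from the repository root and that the fine-tuning config already points at the checkpoints produced by the first two stages; the helper names `stage_command` and `run_stages` are hypothetical, while the config paths and the `CUDA_VISIBLE_DEVICES` value are copied from the commands above.

```python
import os
import subprocess
import sys

# Config paths copied from the three training steps above.
SEMIMTR_STAGES = (
    "configs/semimtr_pretrain_vision_model.yaml",
    "configs/pretrain_language_model.yaml",
    "configs/semimtr_finetune.yaml",
)


def stage_command(config):
    """Build the `python main.py --config ...` command for one stage."""
    return [sys.executable, "main.py", "--config", config]


def run_stages(configs=SEMIMTR_STAGES, gpus="0,1,2,3", dry_run=True):
    """Run the stages in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only returned.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    commands = [stage_command(config) for config in configs]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, env=env, check=True)
    return commands


if __name__ == "__main__":
    for command in run_stages():
        print(" ".join(command))
```

Calling `run_stages(dry_run=False)` executes the stages sequentially. The default dry run only prints the commands, which makes it easy to check them before committing GPU time.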

Training ABINet

  1. Pretrain vision model
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_pretrain_vision_model.yaml
    
  2. Pretrain language model
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/pretrain_language_model.yaml
    
  3. Train ABINet
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/abinet_finetune.yaml
    

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/semimtr_finetune.yaml --run_only_test

Arguments:

  • --checkpoint /path/to/checkpoint: path of the model checkpoint to evaluate
  • --test_root /path/to/dataset: path of the evaluation dataset
  • --model_eval [alignment|vision]: which sub-model to evaluate
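
When evaluating several checkpoints or datasets, the command can be assembled programmatically from the arguments listed above. This is a sketch only: the function name `build_eval_command` is hypothetical, and the checkpoint path in the usage example is a placeholder based on the workdir listing.

```python
import shlex

MODEL_EVAL_CHOICES = ("alignment", "vision")  # values listed above


def build_eval_command(config="configs/semimtr_finetune.yaml",
                       checkpoint=None, test_root=None, model_eval=None):
    """Assemble the evaluation command, adding only the options that are set."""
    command = ["python", "main.py", "--config", config, "--run_only_test"]
    if checkpoint is not None:
        command += ["--checkpoint", checkpoint]
    if test_root is not None:
        command += ["--test_root", test_root]
    if model_eval is not None:
        if model_eval not in MODEL_EVAL_CHOICES:
            raise ValueError(f"model_eval must be one of {MODEL_EVAL_CHOICES}")
        command += ["--model_eval", model_eval]
    return command


if __name__ == "__main__":
    # Placeholder checkpoint path; adjust to your own setup.
    command = build_eval_command(
        checkpoint="workdir/semimtr_real_l_and_u.pth",
        model_eval="vision",
    )
    print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(command))
```

Rejecting unknown `--model_eval` values before launching avoids loading a model only to fail on argument parsing.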

Citation

If you find our method useful for your research, please cite:

@article{aberdam2022multimodal,
  title={Multimodal Semi-Supervised Learning for Text Recognition},
  author={Aberdam, Aviad and Ganz, Roy and Mazor, Shai and Litman, Ron},
  journal={arXiv preprint arXiv:2205.03873},
  year={2022}
}

@inproceedings{aberdam2021sequence,
  title={Sequence-to-sequence contrastive learning for text recognition},
  author={Aberdam, Aviad and Litman, Ron and Tsiper, Shahar and Anschel, Oron and Slossberg, Ron and Mazor, Shai and Manmatha, R and Perona, Pietro},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15302--15312},
  year={2021}
}

Acknowledgements

This implementation is based on the repository ABINet.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Contact

Feel free to contact us if you have any questions: Aviad Aberdam
