Hebrew Handwritten Text Recognizer (OCR)

Hebrew Handwritten Text Recognizer, based on machine learning. Implemented with TensorFlow and OpenCV.
Model is based on Harald Scheidl's SimpleHTR model [1], and CTC-WordBeam algorithm [2].

Getting Started

Prerequisites

Currently HebHTR is only supported on Linux. I've tested it on Ubuntu 18.04.

To run HebHTR you need to compile Harald Scheidl's CTC-WordBeam. To do that, clone the CTC-WordBeam repository, go to the cpp/proj/ directory, and run the script ./buildTF.sh.
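The steps above can be sketched as follows (the repository URL is the upstream CTCWordBeamSearch project referenced in the issues below; the build path follows its docs and may change between versions):

```shell
git clone https://github.com/githubharald/CTCWordBeamSearch.git
cd CTCWordBeamSearch/cpp/proj/
./buildTF.sh   # builds the custom TensorFlow op used by word beam decoding
```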

Quick Start

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('example.png')

# Infer words from image.
text = img.imgToWord(iterations=5, decoder_type='word_beam')

Result:

About the Model

As mentioned, this model was written by Harald Scheidl. It was trained to decode text from images containing a single word. I've trained the model on a Hebrew words dataset.

The model receives an input image of shape 128×32, binary colored. It has 5 CNN layers and 2 RNN layers, and words are finally decoded with the CTC-WordBeam algorithm.

A much more detailed explanation can be found in Harald's article [1].
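As a rough illustration of the geometry, the CNN layers shrink the 128×32 input to a sequence of 32 feature vectors that the RNN and CTC stages consume. The pooling schedule below is taken from the SimpleHTR description and may differ in detail from this repository:

```python
def feature_map_sizes(width=128, height=32,
                      pools=((2, 2), (2, 2), (1, 2), (1, 2), (1, 2))):
    """Trace (width, height) through the per-layer max-pool strides
    (pool sizes assumed from SimpleHTR, not verified against this repo)."""
    sizes = [(width, height)]
    for pw, ph in pools:
        width, height = width // pw, height // ph
        sizes.append((width, height))
    return sizes

print(feature_map_sizes())
# -> [(128, 32), (64, 16), (32, 8), (32, 4), (32, 2), (32, 1)]
```

The final 32×1 feature map is what gives the CTC stage its 32 time steps, one per possible character position.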

All words predicted by this model should fit its input format, i.e. binary-colored images of size 128×32. Therefore, HebHTR normalizes each image to binary color. Then HebHTR resizes it (without distortion) until it either has a width of 128 or a height of 32. Finally, the image is copied into a white target image of size 128×32.

The following figure demonstrates this process:

About the Dataset

I've created a dataset of around 100,000 Hebrew words. Around 50,000 of them are real words, taken from students' scanned exams. Segmentation of those words was done using one of my previous works, which can be found here.
This data was cleaned and labeled manually by me. The other 50,000 words were created artificially, also by me. The word list for creating the artificial words is taken from MILA's Hebrew stopwords lexicon [3]. Overall, the dataset contains 25 different handwriting styles. It also contains digits and punctuation characters.

All words in the dataset were encoded into black and white (binary).
For example:

About the Corpus

The corpus used in the Word Beam search contains around 500,000 unique Hebrew words. I created it using MILA's Arutz 7 corpus [4], TheMarker corpus [5], and HaKnesset corpus [6].

Available Functions

imgToWord

imgToWord(iterations=5, decoder_type='word_beam')

Converts a text-based image to text.

Parameters:

  • iterations (int): Number of dilation iterations to perform on the image. The image is dilated to find the contours of its words. Default value is 5.

  • decoder_type (string): Which decoder to use when inferring a word. There are two decoding options:

    • 'word_beam' - CTC word beam search algorithm.
    • 'best_path' - Determined by taking the model's most likely character at each position.

    Word beam decoding yields significantly better results.

Returns

  • Text decoded by the model from the image (string).

Example usage of this function:

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('example.png')

# Infer words from image.
text = img.imgToWord(iterations=5, decoder_type='word_beam')

Result:


Requirements

  • TensorFlow 1.12.0
  • NumPy 1.16.4
  • OpenCV

References

[1] Harald Scheidl's SimpleHTR model
[2] Harald Scheidl's CTC-WordBeam algorithm
[3] The MILA Hebrew Lexicon
[4] MILA's Arutz 7 corpus
[5] MILA's TheMarker corpus
[6] MILA's HaKnesset corpus


hebhtr's Issues

Access to data

Hi Lotem,
I know it's been a while, but I'm trying my luck.
I'm building a GAN which creates Hebrew handwriting, can you please give a link or something to the full data?

Thank you

Example application to OCR yiddish handwritten letter is not able to successfully execute

Hi Lotem,

I have been trying to use this project to OCR a handwritten letter in Yiddish - we are hoping that once we can read the text better, we might be able to translate it.
I cloned this project and https://github.com/githubharald/CTCWordBeamSearch. I built CTCWordBeamSearch according to their docs and that worked perfectly.
My entire set of changes is available by viewing: https://github.com/Lotemn102/HebHTR/compare/master...gedkott:yiddish-ocr?expand=1
I didn't open a PR to avoid creating noise for you since this is more of an application (with some portions, e.g. requirements.txt, .python-version potentially desired upstream if you'd like).
Additionally, I am running Ubuntu 18.04.6 LTS, 64 bit OS, with 15.3 GiB RAM, Intel® UHD Graphics (CML GT2), and Intel® Core™ i7-10710U CPU @ 1.10GHz × 12.

When I attempted to run a program using your project:

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('./yiddish.png')

# Infer words from image.
text = img.imgToWord(iterations=5, decoder_type='word_beam')

I ran into several issues.

  1. The version of TensorFlow required according to the README.md appears to be available only when running Python 2.7.18 (or lower, presumably). I added a requirements.txt file with the following:
opencv-python == 4.2.0.32
tensorflow == 1.12.0
numpy == 1.16.4

After some experimentation, I found that these were the best versions of each dependency needed. I am happy to supply the requirements.txt back to this project if you'd like.

  2. Python would crash when reading the data files because of invalid encoding:
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    text = img.imgToWord(iterations=5, decoder_type='word_beam')
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/HebHTR.py", line 14, in imgToWord
    model = getModel(decoder_type=decoder_type)
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/predictWord.py", line 27, in getModel
    mustRestore=True)
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/Model.py", line 38, in __init__
    self.setupCTC()
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/Model.py", line 155, in setupCTC
    corpus.encode('utf8'), chars.encode('utf8'),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)

This was resolved by defaulting all encodings to utf-8 during the main.py script execution:

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
  3. Next, the script would fail due to checkpointing with TensorFlow. An excerpt of the stack trace:
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [1,1,512,96] rhs shape= [1,1,512,69]
	 [[node save/Assign_16 (defined at /home/gedalia-kott/hebrew-ocr-projects/HebHTR/Model.py:161)  = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5/RMSProp, save/RestoreV2:16)]]

I can provide more detail if it helps, but this was resolved by deleting the checkpoint, snapshot, and accuracy files from the model directory.

  4. Next, the script would fail when executing:
if self.mustRestore and not latestSnapshot:
            raise Exception('No saved model found in: ' + modelDir)

This was resolved by modifying the invocation in predictWord.py from:

model = Model(open(FilePaths.fnCharList).read(), decoderType,
                  mustRestore=True)

to

model = Model(open(FilePaths.fnCharList).read(), decoderType,
                  mustRestore=False)

I believe this will force the model files to be recreated if the previous ones are not compatible with my environment.

  5. Last, I am not able to run the script because in processFunctions.py, in the method preprocessImageForPrediction, the dimensions of my PNG image form a tuple with three elements (dimension of 3) while the function assumes the image will be a tuple of dimension two, resulting in the following:
Traceback (most recent call last):
  File "main.py", line 12, in <module>
    text = img.imgToWord(iterations=5, decoder_type='word_beam')
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/HebHTR.py", line 15, in imgToWord
    transcribed_words.extend(predictWord(self.original_img, model))
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/predictWord.py", line 32, in predictWord
    return infer(model, image)
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/predictWord.py", line 15, in infer
    img = preprocessImageForPrediction(image, Model.imgSize)
  File "/home/gedalia-kott/hebrew-ocr-projects/HebHTR/processFunctions.py", line 20, in preprocessImageForPrediction
    target[0:newSize[1], 0:newSize[0]] = img
ValueError: could not broadcast input array from shape (32,26,3) into shape (32,26)

I am not sure how to proceed from here.

Would love to get your input on how I can update the project to work for me. I am happy to discuss compensation for your time and effort helping me as well.

Gedalia Kott

Dataset Access

Hi, I'm building a conditional GAN which creates Hebrew handwrite, can I have a link to the full dataset?

Can't use it: error below

Traceback (most recent call last):
  File "/home/yaniv/HebHTR/pic.py", line 7, in <module>
    text = img.imgToWord(iterations=5, decoder_type='word_beam')
  File "/home/yaniv/HebHTR/HebHTR.py", line 14, in imgToWord
    model = getModel(decoder_type=decoder_type)
  File "/home/yaniv/HebHTR/predictWord.py", line 26, in getModel
    model = Model(open(FilePaths.fnCharList).read(), decoderType,
  File "/home/yaniv/HebHTR/Model.py", line 29, in __init__
    self.is_train = tf.placeholder(tf.bool, name='is_train')
AttributeError: module 'tensorflow' has no attribute 'placeholder'
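The error above is typical of running TF1-style code under TensorFlow 2.x, where tf.placeholder was removed. A common workaround (a general TF2 compatibility sketch, not specific to this repository) is the v1 compatibility shim:

```python
# Under TensorFlow 2.x, replacing `import tensorflow as tf` with the
# compatibility shim restores the TF1 graph-mode API:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# tf.placeholder exists again under the shim:
is_train = tf.placeholder(tf.bool, name='is_train')
```

Alternatively, pin tensorflow == 1.12.0 as listed in the Requirements section, which needs an older Python version (see the issue above).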

Hello, can I use this inside an Android app I'm building?

Hello,

I want to use this module to extract handwritten Hebrew text from images taken on an Android phone, for an Android app I'm building.

Would love some help and/or directions on how to use it, if possible at all.

thanks,
