interscript / rababa

Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)

Python 84.42% Ruby 15.53% Shell 0.05%
deep-learning machine-learning pytorch neural-network neural-networks arabic arabic-nlp transformer attention-model hebrew

rababa's Introduction

Interscript: Interoperable Script Conversion Systems, with Ruby and JavaScript runtimes


Introduction

This repository contains interoperable transliteration schemes from:

  • ALA-LC

  • BGN/PCGN

  • ICAO

  • ISO

  • UN (by UNGEGN)

  • Many, many other script conversion system authorities.

The goal is to achieve interoperable transliteration schemes allowing quality comparisons.

Demonstration

Installing Interscript

gem install interscript

Interscript stats

interscript stats | less

Transliteration

cat ara-Arab.txt
interscript -s odni-ara-Arab-Latn-2015 ara-Arab.txt -o ara-Arab-out.txt
cat ara-Arab-out.txt

Diacritization

# First, we need to install rababa
gem install rababa
# Now we can transliterate
interscript -s "var-ara-Arab-Arab-rababa|odni-ara-Arab-Latn-2015" ara-Arab.txt -o ara-Arab-out-tl.txt
cat ara-Arab-out-tl.txt
# Compare this to transliteration without diacritization
cat ara-Arab-out.txt

Reversing

cat rus-rev.txt
interscript -s odni-rus-Latn-Cyrl-2015 rus-rev.txt -o rus.txt
# Note that Latn and Cyrl are reversed
cat rus.txt


Installation

Prerequisites

Interscript depends on Ruby. Once Ruby is installed, the rest is easy. Note that this installation method will not work until Interscript v2 is released; until then, please use the Git checkout described below.

gem install interscript -v "~>2.0"

You can also download a local copy of this Git repository, e.g. for development purposes:

git clone https://github.com/interscript/lcs
cd lcs/ruby
bundle install

Additional prerequisites for Thai systems

If you want to transliterate Thai systems, you will need to install some additional requirements. Please consult: Usage with Secryst.

Usage

Assume you have a file ready in the source script like this:

cat <<EOT > rus-Cyrl.txt
Эх, тройка! птица тройка, кто тебя выдумал? знать, у бойкого народа ты
могла только родиться, в той земле, что не любит шутить, а
ровнем-гладнем разметнулась на полсвета, да и ступай считать версты,
пока не зарябит тебе в очи. И не хитрый, кажись, дорожный снаряд, не
железным схвачен винтом, а наскоро живьём с одним топором да долотом
снарядил и собрал тебя ярославский расторопный мужик. Не в немецких
ботфортах ямщик: борода да рукавицы, и сидит чёрт знает на чём; а
привстал, да замахнулся, да затянул песню — кони вихрем, спицы в
колесах смешались в один гладкий круг, только дрогнула дорога, да
вскрикнул в испуге остановившийся пешеход — и вон она понеслась,
понеслась, понеслась!

Н.В. Гоголь
EOT

You can run interscript on this text using different transliteration systems.

interscript rus-Cyrl.txt \
  --system=bgnpcgn-rus-Cyrl-Latn-1947 \
  --output=bgnpcgn-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=iso-rus-Cyrl-Latn-9-1995 \
  --output=iso-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=icao-rus-Cyrl-Latn-9303 \
  --output=icao-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-2017-bss \
  --output=bas-rus-Latn.txt

It is then easy to see the exact differences in rendering between the systems.

diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt

If you use Interscript from the Git repository, you would call the following command instead of interscript:

# Ensure you are in your Git repository root path
ruby/bin/interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-2017-bss \
  --output=bas-rus-Latn.txt

Adding a transliteration system

Please consult the Map Editing Guide.

Integration with Ruby applications

ISCS system codes

In accordance with ISO/CC 24229, the system code identifying a script conversion system has the following components. For example, bgnpcgn-rus-Cyrl-Latn-1947 consists of:

  • bgnpcgn: the authority identifier
  • rus: an ISO 639-{1,2,3,5} language code that this system applies to (for ISO 639-2, use the (T) code)
  • Cyrl: an ISO 15924 script code, identifying the source script
  • Latn: an ISO 15924 script code, identifying the target script
  • 1947: an identifier unit within the authority to identify this system
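As an illustration only (this is not part of the Interscript API), such a code can be split into its components programmatically. The sketch below assumes the authority-language-source-target-id layout described above.

# Illustrative sketch: split an ISCS system code into its components,
# assuming the "authority-lang-Source-Target-id" layout described above.
# The trailing identifier may itself contain hyphens (e.g. "2017-bss").
def parse_system_code(code: str) -> dict:
    authority, lang, source, target, ident = code.split("-", 4)
    return {
        "authority": authority,    # e.g. "bgnpcgn"
        "language": lang,          # ISO 639 code, e.g. "rus"
        "source_script": source,   # ISO 15924 code, e.g. "Cyrl"
        "target_script": target,   # ISO 15924 code, e.g. "Latn"
        "id": ident,               # authority-internal identifier, e.g. "1947"
    }

print(parse_system_code("bgnpcgn-rus-Cyrl-Latn-1947"))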

Covered languages

Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.

References

Reference documents are located at the interscript-references repository. Some specifications that have distribution limitations may not be reproduced there.

This is a Ribose project. Copyright Ribose.

rababa's People

Contributors

ahmohsen46, gilgameshjw, ronaldtse, webdev778


rababa's Issues

Failing diacritization tests

There are two diacritization tests that are currently failing (I have commented them out for now).

@AhMohsen46 can you help check?

  1) Rababa::Diacritizer diacriticizes # گيله پسمير الجديد 34
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34"
            got: "# گيَلْهُ پسَمِيرٌ الجَدِيدَ 34"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:49:in `block (3 levels) in <top (required)>'

  2) Rababa::Diacritizer diacriticizes 26 سبتمبر العقبة
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "26 سَبْتَمْبَرِ العَقَبَة"
            got: "26 سَبْتَمْبَرَ العَقْبَة"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:49:in `block (3 levels) in <top (required)>'

Finished in 12.58 seconds (files took 0.36641 seconds to load)
6 examples, 2 failures

Literature review

Literature and Codes

As of 06/2021. We review only the more advanced approaches here.
Older solutions used rule-based approaches.
Deep learning was applied to the diacritization problem relatively recently, gradually achieving better results than rule-based approaches.

Mishkal, Arabic text vocalization software
Zerrouki, T.
rule-based library, 2014

Automatic minimal diacritization of Arabic texts
Rehab Alnefaie, Aqil M. Azmi
11.2017

  • MADAMIRA software
  • paper

An Approach for Arabic Diacritization
Ismail Hadjir, Mohamed Abbache, Fatma Zohra Belkredim
06.2019

  • keywords: Hidden Markov Models, Viterbi algorithm
  • article

Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Mohammed Attia
2018

  • keywords: Conditional Random Fields, Arabic dialects...
  • paper

Arabic Text Diacritization Using Deep Neural Networks
Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, Mahmoud Al-Ayyoub
Shakkala library, tensorflow, 04.2019

Highly Effective Arabic Diacritization using Sequence to Sequence Modeling

  • Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, Kareem Darwish
    06.2019
  • keywords: seq2seq(LSTM), NMT, interesting representation units, context window, voting
  • paper

Multi-components System for Automatic Arabic Diacritization
Hamza Abbad, Shengwu Xiong
04.2020

  • keywords: LSTMs, parallel layers for Shadda and Harakat (⇒ pipeline)
  • paper
  • code, tensorflow

Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization
Badr AlKhamissi, Muhammad N. ElNokrashy, and Mohamed Gabr
12.2020

  • keywords: Cross-level attention, Encoder-Decoder (LSTM), Teacher forcing,
  • paper
  • slides
  • code, pytorch

Effective Deep Learning Models for Automatic Diacritization of Arabic Text
Mokthar Ali Hasan Madhfar; Ali Mustafa Qamar
12.2020

  • keywords: embedding, encoder-decoder (LSTM), Highway Nets, Attention, CBHG Module
  • paper
  • code, pytorch

A Deep Belief Network Classification Approach for Automatic Diacritization of Arabic Text
Mohammad Aref Alshraideh, Mohammad Alshraideh and Omar Alkadi
4.2021

  • keywords: DBN built with restricted Boltzmann machines (RBMs), reported superior to LSTMs; Unicode encoding, Borderline-SMOTE
  • paper

Research ideas

Here we list some ideas (circa 2021) mentioned in the recent papers above:

  • Transformer-based Encoders
  • Byte-pair-encodings
  • Improve Injected Hints Method (train with semi diacritised data)
  • More Interpretable Attention Weights
  • Deep belief networks
  • More data and data processing

Compare quality with Farasa

As per #36, AlJazeera Learning also offers an Arabic diacritizer.

It actually calls QCRI's Farasa:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
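For quick testing, the same request can be issued from Python. This is a sketch equivalent of the curl call above; only the URL and the form-encoded text payload are taken from it, and whether the Origin header is actually required is an assumption.

# Sketch: POST the same form-encoded "text" payload to the Farasa endpoint
# used by the AlJazeera Learning page (URL and payload taken from the curl
# command above; the payload decodes to "صفحة التشكيل" plus a newline).
import requests

resp = requests.post(
    "https://farasa-api.qcri.org/msa/webapi/diacritizeV2",
    data={"text": "صفحة التشكيل\n"},
    headers={"Origin": "https://quiz.aljazeera.net"},  # may not be required
)
print(resp.status_code)
print(resp.text)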

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web.

We should do some comparisons with them.

Diacritization failure after latest ONNX model change

@gilgameshjw the tests started failing at commit e869dff.

  1) Rababa::Diacritizer diacriticizes قطر
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "قِطْرَ"
            got: "قَطُر"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

  2) Rababa::Diacritizer diacriticizes abc
     Failure/Error: predicts = @onnx_session.run(nil, batch_data)
     
     OnnxRuntime::Error:
       Non-zero status code returned while running FusedConv node. Name:'fused Conv_92' Status Message: Invalid input shape: {0}
     # ./lib/rababa/diacritizer.rb:113:in `predict_batch'
     # ./lib/rababa/diacritizer.rb:68:in `diacritize_text'
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

  3) Rababa::Diacritizer diacriticizes ‘Iz. Ibrāhīm as-Sa‘danī
     Failure/Error: predicts = @onnx_session.run(nil, batch_data)
     
     OnnxRuntime::Error:
       Non-zero status code returned while running FusedConv node. Name:'fused Conv_92' Status Message: Invalid input shape: {0}
     # ./lib/rababa/diacritizer.rb:113:in `predict_batch'
     # ./lib/rababa/diacritizer.rb:68:in `diacritize_text'
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

Finished in 6.1 seconds (files took 2.67 seconds to load)
6 examples, 3 failures, 2 pending

Failed examples:

rspec ./spec/rababa/diacritizer_spec.rb[1:1] # Rababa::Diacritizer diacriticizes قطر
rspec ./spec/rababa/diacritizer_spec.rb[1:2] # Rababa::Diacritizer diacriticizes abc
rspec ./spec/rababa/diacritizer_spec.rb[1:3] # Rababa::Diacritizer diacriticizes ‘Iz. Ibrāhīm as-Sa‘danī

Investigate "case ending" training

https://arxiv.org/pdf/2002.01207.pdf

From the QCRI people.

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki
Qatar Computing Research Institute. Hamad Bin Khalifa University, Doha. Qatar

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.

For Case Ending:

MSA Results: As the results show, our baseline DNN system outperforms all state-of-the-art systems. Further, adding more features yielded better results overall. Surface-level features resulted in the most gain, followed by POS tags, and lastly stem templates. Further, adding head and tail characters along with a list of sukun words and named entities led to further improvement. Our proposed feature-rich system has a CEER that is approximately 61% lower than any of the state-of-the-art systems.

CA Results: The results show that the POS tagging features led to the most improvements followed by the surface features. Combining all features led to the best results with WER of 2.5%. As we saw for CW diacritics, using our best MSA system to diacritize CA led to significantly lower results with CEER of 8.9%.

Python 3.9: training with CPU fails with double invocation, and no progress

It works with Python 3.7 but not with 3.9: training launches with a double invocation and makes no progress:

me:~/src/interscript/rababa/python (main *): python3.9 train.py --model "cbhg" --config config/cbhg.yml

CONFIGURATION CA_MSA.base.cbhg
- session_name : base
- data_directory : data
- data_type : CA_MSA
- log_directory : log_dir
- load_training_data : True
- load_test_data : False
- load_validation_data : True
- n_training_examples : None
- n_test_examples : None
- n_validation_examples : None
- test_file_name : test.csv
- is_data_preprocessed : False
- data_separator : |
- diacritics_separator : *
- text_encoder : ArabicEncoderWithStartSymbol
- text_cleaner : valid_arabic_cleaners
- max_len : 600
- reconcile : True
- max_steps : 2000000
- learning_rate : 0.001
- batch_size : 32
- adam_beta1 : 0.9
- adam_beta2 : 0.999
- use_decay : True
- weight_decay : 0.0
- embedding_dim : 256
- use_prenet : False
- prenet_sizes : [512, 256]
- cbhg_projections : [128, 256]
- cbhg_filters : 16
- cbhg_gru_units : 256
- post_cbhg_layers_units : [256, 256]
- post_cbhg_use_batch_norm : True
- use_mixed_precision : False
- optimizer_type : Adam
- device : cuda
- evaluate_frequency : 5000
- evaluate_with_error_rates_frequency : 5000
- n_predicted_text_tensorboard : 10
- model_save_frequency : 5000
- train_plotting_frequency : 50000000
- n_steps_avg_losses : [100, 500, 1000, 5000]
- error_rates_n_batches : 10000
- test_model_path : None
- train_resume_model_path : None
- len_input_symbols : 44
- len_target_symbols : 17
- optimizer : OptimizerType.Adam
- git_hash : v0.1.0-34-g4e8ffa9
The model has 15413521 trainable parameters parameters
/usr/local/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
Length of training iterator = 86962
Length of valid iterator = 4676
data loaded
----------------------------------------------------------
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/s]
CONFIGURATION CA_MSA.base.cbhg                                                                                                              | 0/4676 [00:00<?, ?it/s]
- session_name : base                                                                                                                    | 0/2000000 [00:00<?, ?it/s]
- data_directory : data
- data_type : CA_MSA
- log_directory : log_dir
- load_training_data : True
- load_test_data : False
- load_validation_data : True
- n_training_examples : None
- n_test_examples : None
- n_validation_examples : None
- test_file_name : test.csv
- is_data_preprocessed : False
- data_separator : |
- diacritics_separator : *
- text_encoder : ArabicEncoderWithStartSymbol
- text_cleaner : valid_arabic_cleaners
- max_len : 600
- reconcile : True
- max_steps : 2000000
- learning_rate : 0.001
- batch_size : 32
- adam_beta1 : 0.9
- adam_beta2 : 0.999
- use_decay : True
- weight_decay : 0.0
- embedding_dim : 256
- use_prenet : False
- prenet_sizes : [512, 256]
- cbhg_projections : [128, 256]
- cbhg_filters : 16
- cbhg_gru_units : 256
- post_cbhg_layers_units : [256, 256]
- post_cbhg_use_batch_norm : True
- use_mixed_precision : False
- optimizer_type : Adam
- device : cuda
- evaluate_frequency : 5000
- evaluate_with_error_rates_frequency : 5000
- n_predicted_text_tensorboard : 10
- model_save_frequency : 5000
- train_plotting_frequency : 50000000
- n_steps_avg_losses : [100, 500, 1000, 5000]
- error_rates_n_batches : 10000
- test_model_path : None
- train_resume_model_path : None
- len_input_symbols : 44
- len_target_symbols : 17
- optimizer : OptimizerType.Adam
- git_hash : v0.1.0-34-g4e8ffa9
The model has 15413521 trainable parameters parameters
/usr/local/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
Length of training iterator = 86962
Length of valid iterator = 4676
data loaded
----------------------------------------------------------
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/sTraceback (most recent call last):                                                                                                           | 0/4676 [00:00<?, ?it/s]
  File "<string>", line 1, in <module>                                                                                                   | 0/2000000 [00:00<?, ?it/s]
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/me/src/interscript/rababa/python/train.py", line 43, in <module>
    trainer.run()
  File "/Users/me/src/interscript/rababa/python/trainer.py", line 210, in run
    for batch_inputs in repeater(train_iterator):
  File "/Users/me/src/interscript/rababa/python/util/utils.py", line 64, in repeater
    for data in loader:
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/s]
WER/DER : :   0%|                                                                                                                           | 0/4676 [00:00<?, ?it/s]
  0%|                                                                                                                                    | 0/2000000 [00:00<?, ?it/s]

Originally posted by @ronaldtse in #22 (comment)
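The traceback above is the standard multiprocessing "spawn" pitfall: on macOS and Python 3.8+, DataLoader workers are started with the spawn method, which re-imports the main script, so any module-level training code runs again in each worker. Below is a minimal sketch of the guard suggested by the error message; how train.py is actually structured around its entry point is an assumption here.

# Hypothetical sketch: wrap the module-level code of python/train.py in a
# main() function and guard it, so spawn-based DataLoader workers do not
# re-execute the training setup when they re-import the module.
def main():
    # Placeholder for the existing setup in train.py, which ends with
    # trainer.run() according to the traceback above.
    print("build the trainer here, then call trainer.run()")

if __name__ == "__main__":
    main()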

Python 3.7 warnings with CPU training

These warnings don't seem to affect training but probably should be fixed:

[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

/Users/me/.pyenv/versions/3.7.11/lib/python3.7/site-packages/torch/nn/functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ../c10/core/TensorImpl.h:1156.)
  return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)

Originally posted by @ronaldtse in #22 (comment)
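The first warning is PyTorch's usual complaint about changing the intraop thread count after parallel work has begun. A hedged sketch of the usual workaround follows; where exactly this call belongs in the Rababa training code is an assumption.

# Sketch: set the intraop thread count once, at program start, before any
# tensor operation or DataLoader worker runs; calling it later triggers the
# ParallelNative warning quoted above.
import torch

torch.set_num_threads(4)  # example value; tune to the machine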

Farsi

Transliteration in Farsi

With Mahdi, we have identified a number of challenges peculiar to Farsi:

  1. Persian writers may use several different characters for the same underlying character, which requires "normalisation" work, probably with maps.
  2. In practice, spacing is not strict: the same Farsi word can appear with or without spaces between its characters, or written with a ZWNJ (zero-width non-joiner) character.
  3. Transliteration of single words:
    • Mahdi has found large dictionaries of Farsi words with transliterations for their various parts of speech (N, V, ...).
    • The above table is quite extensive and could be used.
    • Research shows that transliteration can be learned better with neural networks than with rules.
    • The resulting transliteration does NOT seem aligned with the Interscript one (probably requiring maps).
  4. Transliteration of several words:
    • In Farsi, words get prefixes/suffixes depending on their position and role in a sentence.
    • As a consequence, we are considering a PoS-tagging approach.
    • PoS tagging: there are algorithms doing this for Farsi; we need to research the available software and possibly compare or even train our own.

Ideas (good and bad)

  • speech-to-text data?
  • learn Farsi → Interscript-like transliteration

Plan

  1. Look for mappings: Farsi → (more or less) Latin
    Done
  2. Stats of collisions and concept validation
    952 collisions for a 50k dictionary, 0.5% at the word level.
    Done, validated
  3. Create a git branch so that Mahdi and Jair can collaborate
    Done
  4. Run the simplest possible transliteration (see the sketch after this list):
    • Mahdi provides the dataset
    • Jair builds a naive map and transliterates (model 0)
    • Ronald, Mahdi, Jair: feedback
  5. Review NLP libraries, codebases and research in Farsi.
  6. Improve (character normalisation, preprocessing and PoS)
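As a concrete illustration of the "naive map" in step 4, model 0 could be as simple as a character-by-character lookup. The mapping below is a made-up placeholder, not an actual Farsi-to-Latin table.

# Illustrative "model 0": a plain character-by-character map.
# The entries below are placeholders only, not a real Farsi-to-Latin table.
NAIVE_MAP = {
    "ا": "a",
    "ب": "b",
    "پ": "p",
    "ت": "t",
}

def transliterate_naive(text: str) -> str:
    # Characters without an entry are passed through unchanged.
    return "".join(NAIVE_MAP.get(ch, ch) for ch in text)

print(transliterate_naive("پا"))  # prints "pa"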

Restructure Rababa Ruby gem

The Ruby gem currently does not follow code best practices (or RuboCop) and needs refactoring.

@webdev778 can you please help clean up the gem structure? Thanks!

Productionize Ruby code for Interscript: https://github.com/interscript/rababa/tree/hebrew

@webdev778 @ronaldtse

Task on Hebrew branch

  • Review and possibly improve the code and the Ruby style...
  • Review and possibly improve the lib/README
  • Create tests and the gem
    The current code can be used in rababa/lib with:

ruby script.rb -m ../models-data/diacritization_model.onnx --t 'מה שלומך'
ruby script.rb -m ../models-data/diacritization_model.onnx --f '../python/data/test/test.txt'

There is also some small sample data under rababa/example.

The Hebrew diacritization model can be downloaded from https://github.com/secryst/rababa-models.

Move test from `reconcile.rb` to RSpec

Original code:

d_tests = [{'original' => '# گيله پسمير الجديد 34',
            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },

           {'original' => 'abc',
            'diacritized' => '',
            'reconciled' => 'abc'},

           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
            'diacritized' => '',
            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},

           {'original' => '26 سبتمبر العقبة',
            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]

d_tests.each {|d| \
    if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
        raise Exception.new('reconcile string not matched')
    end
}

or:

for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
do
    ruby rababa.rb -t "$s" -m '../models-data/diacritization_model.onnx'
done

Test Rababa on GNDB Arabic data

This is easier than #4:

  1. Fetch ara_Arab2Latn_BGN_1956.csv (27.9MB) from https://github.com/interscript/geonames-transliteration-data/releases/download/v20210705/pairs.zip

  2. Run the NNets on the Arabic (SRC_FULL_NAME_RO column)

  3. Run the output of step 2 through the Interscript system (ara_Arab2Latn_BGN_1956 is this one: https://www.interscript.org/systems/bgnpcgn-ara-Arab-Latn-1956)

  4. Compare the DEST_FULL_NAME_RO column of ara_Arab2Latn_BGN_1956.csv with the output of step 3.
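A sketch of this pipeline follows. The column names come from the issue text; the chained "Rababa then BGN/PCGN 1956" system string mirrors the diacritization example earlier in this README and is an assumption here, as is driving the interscript CLI through plain text files.

# Sketch of the GNDB comparison described above (assumptions noted in the text).
import csv
import subprocess

with open("ara_Arab2Latn_BGN_1956.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Step 2 input: one Arabic name per line.
with open("gndb-ara-in.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(row["SRC_FULL_NAME_RO"] for row in rows))

# Steps 2 and 3: diacritize with Rababa, then transliterate with BGN/PCGN 1956.
subprocess.run(
    ["interscript", "-s",
     "var-ara-Arab-Arab-rababa|bgnpcgn-ara-Arab-Latn-1956",
     "gndb-ara-in.txt", "-o", "gndb-ara-out.txt"],
    check=True,
)

# Step 4: compare against the expected romanizations.
with open("gndb-ara-out.txt", encoding="utf-8") as f:
    produced = f.read().splitlines()

matches = sum(
    p.strip() == row["DEST_FULL_NAME_RO"].strip()
    for p, row in zip(produced, rows)
)
print(f"{matches}/{len(rows)} rows match")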

Enable CPU training

Currently fails with no GPU:

  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 852, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 850, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 166, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
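The configuration shown in the other training issues hardcodes device : cuda. A CPU fallback along the following lines would avoid the assertion; where exactly the device is selected in the Rababa code base is an assumption.

# Sketch: select CUDA only when it is actually available, otherwise fall back
# to CPU, instead of hardcoding device = "cuda".
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 8).to(device)  # placeholder model for illustration
print(device)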

Implement reverse diacritization based on GNDB dataset

GNDB dataset: https://github.com/interscript/geonames-transliteration-data/releases

We want to perform the following steps:

  1. For every "unpointed Arabic" and "transliterated Arabic" pair
  2. reverse transliterate the "transliterated Arabic" into "pointed Arabic"

Then, use this data of "unpointed Arabic" to "pointed Arabic" to verify that the Arabic diacriticization models work.

Interscript now supports reverse script conversion (need to read it up).

Ping @AhMohsen46 @gilgameshjw .

Add testing framework for Python

Port tests from Ruby to Python.

d_tests = [{'original' => '# گيله پسمير الجديد 34',
            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },

           {'original' => 'abc',
            'diacritized' => '',
            'reconciled' => 'abc'},

           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
            'diacritized' => '',
            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},

           {'original' => '26 سبتمبر العقبة',
            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]

d_tests.each {|d| \
    if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
        raise Exception.new('reconcile string not matched')
    end
}

or:

for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
do
    ruby rababa.rb -t "$s" -m '../models-data/diacritization_model.onnx'
done
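A hedged sketch of how these cases could look under pytest. The import path for reconcile_strings is an assumption (a hypothetical module name); point it at wherever the reconcile helper actually lives under python/.

# Sketch: port of the reconcile tests above to pytest.
# NOTE: the import path below is hypothetical; adjust it to the real module.
import pytest
from reconcile import reconcile_strings

CASES = [
    ("# گيله پسمير الجديد 34", "يَلِهُ سُمِيْرٌ الجَدِيدُ", "# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34"),
    ("abc", "", "abc"),
    ("‘Iz. Ibrāhīm as-Sa‘danī", "", "‘Iz. Ibrāhīm as-Sa‘danī"),
    ("26 سبتمبر العقبة", "سَبْتَمْبَرِ العَقَبَة", "26 سَبْتَمْبَرِ العَقَبَة"),
]

@pytest.mark.parametrize("original, diacritized, reconciled", CASES)
def test_reconcile_strings(original, diacritized, reconciled):
    assert reconcile_strings(original, diacritized) == reconciled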

Unable to train model in Python

I've been trying to run Rababa to train models, but have been unable to do so due to this error:

Traceback (most recent call last):
  File "/home/runner/work/rababa-models/rababa-models/rababa/python/train.py", line 8, in <module>
    from trainer import (
  File "/home/runner/work/rababa-models/rababa-models/rababa/python/trainer.py", line 15, in <module>
    from diacritizer import CBHGDiacritizer
ImportError: cannot import name 'CBHGDiacritizer' from 'diacritizer' (/home/runner/work/rababa-models/rababa-models/rababa/python/diacritizer.py)

The CBHGDiacritizer code no longer exists in Rababa.

I've uploaded the workflow that attempts the training here:
https://github.com/secryst/rababa-models/blob/main/.github/workflows/train.yml

The full logs can be seen here:
https://github.com/secryst/rababa-models/runs/3218393663?check_suite_focus=true

Improve training dataset

The Rababa models today are trained on the Tashkeela corpus.

In Tashkeela, 98% of the content comes from Shamela.

There are some other additional datasets that are either pointed or can be made into pointed datasets.

Pointed datasets:

AlJazeera Learning also offers an Arabic diacritizer, which we can test against:

The endpoint goes to:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web:

Datasets that could potentially be pointed...:
