interscript / rababa

Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)

Python 84.42% Ruby 15.53% Shell 0.05%
deep-learning machine-learning pytorch neural-network neural-networks arabic arabic-nlp transformer attention-model hebrew

rababa's Introduction

Interscript: Interoperable Script Conversion Systems, with Ruby and JavaScript runtimes


Introduction

This repository contains interoperable transliteration schemes from:

  • ALA-LC

  • BGN/PCGN

  • ICAO

  • ISO

  • UN (by UNGEGN)

  • Many, many other script conversion system authorities.

The goal is to achieve interoperable transliteration schemes allowing quality comparisons.

Demonstration

Installing Interscript

gem install interscript

Interscript stats

interscript stats | less

Transliteration

cat ara-Arab.txt
interscript -s odni-ara-Arab-Latn-2015 ara-Arab.txt -o ara-Arab-out.txt
cat ara-Arab-out.txt

Diacritization

# First, we need to install rababa
gem install rababa
# Now we can transliterate
interscript -s "var-ara-Arab-Arab-rababa|odni-ara-Arab-Latn-2015" ara-Arab.txt -o ara-Arab-out-tl.txt
cat ara-Arab-out-tl.txt
# Compare this to transliteration without diacritization
cat ara-Arab-out.txt

Reversing

cat rus-rev.txt
interscript -s odni-rus-Latn-Cyrl-2015 rus-rev.txt -o rus.txt
# Note that Latn and Cyrl are reversed
cat rus.txt


Installation

Prerequisites

Interscript depends on Ruby. Once Ruby is installed, the rest is easy. Note that this installation method will not work until Interscript v2 is released; until then, please use the Git checkout described below.

gem install interscript -v "~>2.0"

You can also download a local copy of this Git repository, e.g. for development purposes:

git clone https://github.com/interscript/lcs
cd lcs/ruby
bundle install

Additional prerequisites for Thai systems

If you want to transliterate Thai systems, you will need to install some additional requirements. Please consult: Usage with Secryst.

Usage

Assume you have a file ready in the source script like this:

cat <<EOT > rus-Cyrl.txt
Эх, тройка! птица тройка, кто тебя выдумал? знать, у бойкого народа ты
могла только родиться, в той земле, что не любит шутить, а
ровнем-гладнем разметнулась на полсвета, да и ступай считать версты,
пока не зарябит тебе в очи. И не хитрый, кажись, дорожный снаряд, не
железным схвачен винтом, а наскоро живьём с одним топором да долотом
снарядил и собрал тебя ярославский расторопный мужик. Не в немецких
ботфортах ямщик: борода да рукавицы, и сидит чёрт знает на чём; а
привстал, да замахнулся, да затянул песню — кони вихрем, спицы в
колесах смешались в один гладкий круг, только дрогнула дорога, да
вскрикнул в испуге остановившийся пешеход — и вон она понеслась,
понеслась, понеслась!

Н.В. Гоголь
EOT

You can run interscript on this text using different transliteration systems.

interscript rus-Cyrl.txt \
  --system=bgnpcgn-rus-Cyrl-Latn-1947 \
  --output=bgnpcgn-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=iso-rus-Cyrl-Latn-9-1995 \
  --output=iso-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=icao-rus-Cyrl-Latn-9303 \
  --output=icao-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-2017-bss \
  --output=bas-rus-Latn.txt

It is then easy to see the exact differences in rendering between the systems.

diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt

If you use Interscript from the Git repository, you would call the following command instead of interscript:

# Ensure you are in your Git repository root path
ruby/bin/interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-2017-bss \
  --output=bas-rus-Latn.txt

Adding a transliteration system

Please consult the Map Editing Guide.

Integration with Ruby applications

ISCS system codes

In accordance with ISO/CC 24229, the system code identifying a script conversion system has the following components. For example, bgnpcgn-rus-Cyrl-Latn-1947 consists of:

  • bgnpcgn: the authority identifier
  • rus: an ISO 639-{1,2,3,5} language code that this system applies to (for ISO 639-2, use the (T) code)
  • Cyrl: an ISO 15924 script code, identifying the source script
  • Latn: an ISO 15924 script code, identifying the target script
  • 1947: an identifier unit within the authority to identify this system
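As an illustration only (this is not part of the Interscript API), such a code can be split into its components programmatically. The sketch below assumes the authority-language-source-target-id layout described above.

# Illustrative sketch: split an ISCS system code into its components,
# assuming the "authority-lang-Source-Target-id" layout described above.
# The trailing identifier may itself contain hyphens (e.g. "2017-bss").
def parse_system_code(code: str) -> dict:
    authority, lang, source, target, ident = code.split("-", 4)
    return {
        "authority": authority,    # e.g. "bgnpcgn"
        "language": lang,          # ISO 639 code, e.g. "rus"
        "source_script": source,   # ISO 15924 code, e.g. "Cyrl"
        "target_script": target,   # ISO 15924 code, e.g. "Latn"
        "id": ident,               # authority-internal identifier, e.g. "1947"
    }

print(parse_system_code("bgnpcgn-rus-Cyrl-Latn-1947"))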

Covered languages

Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.

References

Reference documents are located at the interscript-references repository. Some specifications that have distribution limitations may not be reproduced there.

This is a Ribose project. Copyright Ribose.

rababa's People

Contributors

ahmohsen46, gilgameshjw, ronaldtse, webdev778


rababa's Issues

Failing diacritization tests

There are two diacritization tests that are currently failing (I have commented them out for now).

@AhMohsen46 can you help check?

  1) Rababa::Diacritizer diacriticizes # گيله پسمير الجديد 34
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34"
            got: "# گيَلْهُ پسَمِيرٌ الجَدِيدَ 34"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:49:in `block (3 levels) in <top (required)>'

  2) Rababa::Diacritizer diacriticizes 26 سبتمبر العقبة
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "26 سَبْتَمْبَرِ العَقَبَة"
            got: "26 سَبْتَمْبَرَ العَقْبَة"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:49:in `block (3 levels) in <top (required)>'

Finished in 12.58 seconds (files took 0.36641 seconds to load)
6 examples, 2 failures

Literature review

Literature and Codes

As of 06/2021. We review only the more advanced approaches here.
Older solutions used rule-based approaches.
Deep learning was applied to the diacritization problem relatively recently, gradually achieving better results than rule-based approaches.

Mishkal, Arabic text vocalization software
Zerrouki, T.
rule-based library, 2014

Automatic minimal diacritization of Arabic texts
Rehab Alnefaie, Aqil M. Azmi
11.2017

  • MADAMIRA software
  • paper

An Approach for Arabic Diacritization
Ismail Hadjir, Mohamed Abbache, Fatma Zohra Belkredim
06.2019

  • keywords: Hidden Markov Models, Viterbi algorithm
  • article

Diacritization of Moroccan and Tunisian Arabic Dialects: A CRF Approach
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Mohammed Attia
2018

  • keywords: Conditional Random Fields, Arabic dialects...
  • paper

Arabic Text Diacritization Using Deep Neural Networks
Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, Mahmoud Al-Ayyoub
Shakkala library, tensorflow, 04.2019

Highly Effective Arabic Diacritization using Sequence to Sequence Modeling

  • Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, Kareem Darwish
    06.2019
  • keywords: seq2seq(LSTM), NMT, interesting representation units, context window, voting
  • paper

Multi-components System for Automatic Arabic Diacritization
Hamza Abbad, Shengwu Xiong
04.2020

  • keywords: LSTMs, parallel layers for Shadda and Harakat (⇒ pipeline)
  • paper
  • code, tensorflow

Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization
Badr AlKhamissi, Muhammad N. ElNokrashy, and Mohamed Gabr
12.2020

  • keywords: Cross-level attention, Encoder-Decoder (LSTM), Teacher forcing,
  • paper
  • slides
  • code, pytorch

Effective Deep Learning Models for Automatic Diacritization of Arabic Text
Mokthar Ali Hasan Madhfar; Ali Mustafa Qamar
12.2020

  • keywords: embedding, encoder-decoder (LSTM), Highway Nets, Attention, CBHG Module
  • paper
  • code, pytorch

A Deep Belief Network Classification Approach for Automatic Diacritization of Arabic Text
Mohammad Aref Alshraideh, Mohammad Alshraideh and Omar Alkadi
4.2021

  • keywords: DBN built with restricted Boltzmann machines (RBMs), reported superior to LSTMs; Unicode encoding, Borderline-SMOTE
  • paper

Research ideas

Here we list some ideas (circa 2021) mentioned in the recent papers above:

  • Transformer-based Encoders
  • Byte-pair-encodings
  • Improve Injected Hints Method (train with semi diacritised data)
  • More Interpretable Attention Weights
  • Deep belief networks
  • More data and data processing

Compare quality with Farasa

As per #36, AlJazeera Learning also offers an Arabic diacritizer.

It actually calls QCRI's Farasa:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
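For quick testing, the same request can be issued from Python. This is a sketch equivalent of the curl call above; only the URL and the form-encoded text payload are taken from it, and whether the Origin header is actually required is an assumption.

# Sketch: POST the same form-encoded "text" payload to the Farasa endpoint
# used by the AlJazeera Learning page (URL and payload taken from the curl
# command above; the payload decodes to "صفحة التشكيل" plus a newline).
import requests

resp = requests.post(
    "https://farasa-api.qcri.org/msa/webapi/diacritizeV2",
    data={"text": "صفحة التشكيل\n"},
    headers={"Origin": "https://quiz.aljazeera.net"},  # may not be required
)
print(resp.status_code)
print(resp.text)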

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web.

We should do some comparisons with them.

Diacritization failure after latest ONNX model change

@gilgameshjw the tests started failing at commit e869dff.

  1) Rababa::Diacritizer diacriticizes قطر
     Failure/Error: expect(diacritizer.diacritize_text(source)).to eq target
     
       expected: "قِطْرَ"
            got: "قَطُر"
     
       (compared using ==)
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

  2) Rababa::Diacritizer diacriticizes abc
     Failure/Error: predicts = @onnx_session.run(nil, batch_data)
     
     OnnxRuntime::Error:
       Non-zero status code returned while running FusedConv node. Name:'fused Conv_92' Status Message: Invalid input shape: {0}
     # ./lib/rababa/diacritizer.rb:113:in `predict_batch'
     # ./lib/rababa/diacritizer.rb:68:in `diacritize_text'
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

  3) Rababa::Diacritizer diacriticizes ‘Iz. Ibrāhīm as-Sa‘danī
     Failure/Error: predicts = @onnx_session.run(nil, batch_data)
     
     OnnxRuntime::Error:
       Non-zero status code returned while running FusedConv node. Name:'fused Conv_92' Status Message: Invalid input shape: {0}
     # ./lib/rababa/diacritizer.rb:113:in `predict_batch'
     # ./lib/rababa/diacritizer.rb:68:in `diacritize_text'
     # ./spec/rababa/diacritizer_spec.rb:43:in `block (3 levels) in <top (required)>'

Finished in 6.1 seconds (files took 2.67 seconds to load)
6 examples, 3 failures, 2 pending

Failed examples:

rspec ./spec/rababa/diacritizer_spec.rb[1:1] # Rababa::Diacritizer diacriticizes قطر
rspec ./spec/rababa/diacritizer_spec.rb[1:2] # Rababa::Diacritizer diacriticizes abc
rspec ./spec/rababa/diacritizer_spec.rb[1:3] # Rababa::Diacritizer diacriticizes ‘Iz. Ibrāhīm as-Sa‘danī

Investigate "case ending" training

https://arxiv.org/pdf/2002.01207.pdf

From the QCRI people.

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki
Qatar Computing Research Institute. Hamad Bin Khalifa University, Doha. Qatar

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.

For Case Ending:

MSA Results: As the results show, our baseline DNN system outperforms all state-of-the-art systems. Further, adding more features yielded better results overall. Surface-level features resulted in the most gain, followed by POS tags, and lastly stem templates. Further, adding head and tail characters along with a list of sukun words and named entities led to further improvement. Our proposed feature-rich system has a CEER that is approximately 61% lower than any of the state-of-the-art systems.

CA Results: The results show that the POS tagging features led to the most improvements followed by the surface features. Combining all features led to the best results with WER of 2.5%. As we saw for CW diacritics, using our best MSA system to diacritize CA led to significantly lower results with CEER of 8.9%.

Python 3.9: training with CPU fails with double invocation, and no progress

It works with Python 3.7 but not with 3.9: training launches with a double invocation and makes no progress:

me:~/src/interscript/rababa/python (main *): python3.9 train.py --model "cbhg" --config config/cbhg.yml

CONFIGURATION CA_MSA.base.cbhg
- session_name : base
- data_directory : data
- data_type : CA_MSA
- log_directory : log_dir
- load_training_data : True
- load_test_data : False
- load_validation_data : True
- n_training_examples : None
- n_test_examples : None
- n_validation_examples : None
- test_file_name : test.csv
- is_data_preprocessed : False
- data_separator : |
- diacritics_separator : *
- text_encoder : ArabicEncoderWithStartSymbol
- text_cleaner : valid_arabic_cleaners
- max_len : 600
- reconcile : True
- max_steps : 2000000
- learning_rate : 0.001
- batch_size : 32
- adam_beta1 : 0.9
- adam_beta2 : 0.999
- use_decay : True
- weight_decay : 0.0
- embedding_dim : 256
- use_prenet : False
- prenet_sizes : [512, 256]
- cbhg_projections : [128, 256]
- cbhg_filters : 16
- cbhg_gru_units : 256
- post_cbhg_layers_units : [256, 256]
- post_cbhg_use_batch_norm : True
- use_mixed_precision : False
- optimizer_type : Adam
- device : cuda
- evaluate_frequency : 5000
- evaluate_with_error_rates_frequency : 5000
- n_predicted_text_tensorboard : 10
- model_save_frequency : 5000
- train_plotting_frequency : 50000000
- n_steps_avg_losses : [100, 500, 1000, 5000]
- error_rates_n_batches : 10000
- test_model_path : None
- train_resume_model_path : None
- len_input_symbols : 44
- len_target_symbols : 17
- optimizer : OptimizerType.Adam
- git_hash : v0.1.0-34-g4e8ffa9
The model has 15413521 trainable parameters parameters
/usr/local/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
Length of training iterator = 86962
Length of valid iterator = 4676
data loaded
----------------------------------------------------------
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/s]
CONFIGURATION CA_MSA.base.cbhg                                                                                                              | 0/4676 [00:00<?, ?it/s]
- session_name : base                                                                                                                    | 0/2000000 [00:00<?, ?it/s]
- data_directory : data
- data_type : CA_MSA
- log_directory : log_dir
- load_training_data : True
- load_test_data : False
- load_validation_data : True
- n_training_examples : None
- n_test_examples : None
- n_validation_examples : None
- test_file_name : test.csv
- is_data_preprocessed : False
- data_separator : |
- diacritics_separator : *
- text_encoder : ArabicEncoderWithStartSymbol
- text_cleaner : valid_arabic_cleaners
- max_len : 600
- reconcile : True
- max_steps : 2000000
- learning_rate : 0.001
- batch_size : 32
- adam_beta1 : 0.9
- adam_beta2 : 0.999
- use_decay : True
- weight_decay : 0.0
- embedding_dim : 256
- use_prenet : False
- prenet_sizes : [512, 256]
- cbhg_projections : [128, 256]
- cbhg_filters : 16
- cbhg_gru_units : 256
- post_cbhg_layers_units : [256, 256]
- post_cbhg_use_batch_norm : True
- use_mixed_precision : False
- optimizer_type : Adam
- device : cuda
- evaluate_frequency : 5000
- evaluate_with_error_rates_frequency : 5000
- n_predicted_text_tensorboard : 10
- model_save_frequency : 5000
- train_plotting_frequency : 50000000
- n_steps_avg_losses : [100, 500, 1000, 5000]
- error_rates_n_batches : 10000
- test_model_path : None
- train_resume_model_path : None
- len_input_symbols : 44
- len_target_symbols : 17
- optimizer : OptimizerType.Adam
- git_hash : v0.1.0-34-g4e8ffa9
The model has 15413521 trainable parameters parameters
/usr/local/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
Length of training iterator = 86962
Length of valid iterator = 4676
data loaded
----------------------------------------------------------
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/sTraceback (most recent call last):                                                                                                           | 0/4676 [00:00<?, ?it/s]
  File "<string>", line 1, in <module>                                                                                                   | 0/2000000 [00:00<?, ?it/s]
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/me/src/interscript/rababa/python/train.py", line 43, in <module>
    trainer.run()
  File "/Users/me/src/interscript/rababa/python/trainer.py", line 210, in run
    for batch_inputs in repeater(train_iterator):
  File "/Users/me/src/interscript/rababa/python/util/utils.py", line 64, in repeater
    for data in loader:
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/usr/local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Eval:   0%|                                                                                                                                 | 0/4676 [00:00<?, ?it/s]
WER/DER : :   0%|                                                                                                                           | 0/4676 [00:00<?, ?it/s]
  0%|                                                                                                                                    | 0/2000000 [00:00<?, ?it/s]

Originally posted by @ronaldtse in #22 (comment)
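The traceback above is the standard multiprocessing "spawn" pitfall: on macOS and Python 3.8+, DataLoader workers are started with the spawn method, which re-imports the main script, so any module-level training code runs again in each worker. Below is a minimal sketch of the guard suggested by the error message; how train.py is actually structured around its entry point is an assumption here.

# Hypothetical sketch: wrap the module-level code of python/train.py in a
# main() function and guard it, so spawn-based DataLoader workers do not
# re-execute the training setup when they re-import the module.
def main():
    # Placeholder for the existing setup in train.py, which ends with
    # trainer.run() according to the traceback above.
    print("build the trainer here, then call trainer.run()")

if __name__ == "__main__":
    main()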

Python 3.7 warnings with CPU training

These warnings don't seem to affect training but probably should be fixed:

[W ParallelNative.cpp:212] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)

/Users/me/.pyenv/versions/3.7.11/lib/python3.7/site-packages/torch/nn/functional.py:652: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ../c10/core/TensorImpl.h:1156.)
  return torch.max_pool1d(input, kernel_size, stride, padding, dilation, ceil_mode)

Originally posted by @ronaldtse in #22 (comment)
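The first warning is PyTorch's usual complaint about changing the intraop thread count after parallel work has begun. A hedged sketch of the usual workaround follows; where exactly this call belongs in the Rababa training code is an assumption.

# Sketch: set the intraop thread count once, at program start, before any
# tensor operation or DataLoader worker runs; calling it later triggers the
# ParallelNative warning quoted above.
import torch

torch.set_num_threads(4)  # example value; tune to the machine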

Farsi

Transliteration in Farsi

With Mahdi, we have identified a number of challenges peculiar to Farsi:

  1. Persian writers may use several different characters for the same underlying character, which requires "normalisation" work, probably with maps.
  2. In practice, spacing is not strict: the same Farsi word can appear with or without spaces between its characters, or written with a ZWNJ (zero-width non-joiner) character.
  3. Transliteration of single words:
    • Mahdi has found large dictionaries of Farsi words with transliterations for their various parts of speech (N, V, ...).
    • The above table is quite extensive and could be used.
    • Research shows that transliteration can be learned better with neural networks than with rules.
    • The resulting transliteration does NOT seem aligned with the Interscript one (probably requiring maps).
  4. Transliteration of several words:
    • In Farsi, words get prefixes/suffixes depending on their position and role in a sentence.
    • As a consequence, we are considering a PoS-tagging approach.
    • PoS tagging: there are algorithms doing this for Farsi; we need to research the available software and possibly compare or even train our own.

Ideas (good and bad)

  • speech-to-text data?
  • learn Farsi → Interscript-like transliteration

Plan

  1. Look for mappings: Farsi → (more or less) Latin
    Done
  2. Stats of collisions and concept validation
    952 collisions for a 50k dictionary, 0.5% at the word level.
    Done, validated
  3. Create a git branch so that Mahdi and Jair can collaborate
    Done
  4. Run the simplest possible transliteration (see the sketch after this list):
    • Mahdi provides the dataset
    • Jair builds a naive map and transliterates (model 0)
    • Ronald, Mahdi, Jair: feedback
  5. Review NLP libraries, codebases and research in Farsi.
  6. Improve (character normalisation, preprocessing and PoS)
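As a concrete illustration of the "naive map" in step 4, model 0 could be as simple as a character-by-character lookup. The mapping below is a made-up placeholder, not an actual Farsi-to-Latin table.

# Illustrative "model 0": a plain character-by-character map.
# The entries below are placeholders only, not a real Farsi-to-Latin table.
NAIVE_MAP = {
    "ا": "a",
    "ب": "b",
    "پ": "p",
    "ت": "t",
}

def transliterate_naive(text: str) -> str:
    # Characters without an entry are passed through unchanged.
    return "".join(NAIVE_MAP.get(ch, ch) for ch in text)

print(transliterate_naive("پا"))  # prints "pa"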

Restructure Rababa Ruby gem

The Ruby gem currently does not follow code best practices (or RuboCop) and needs refactoring.

@webdev778 can you please help clean up the gem structure? Thanks!

Productionize Ruby code for Interscript: https://github.com/interscript/rababa/tree/hebrew

@webdev778 @ronaldtse

Task on Hebrew branch

  • Review and possibly improve the code and the Ruby style...
  • Review and possibly improve the lib/README
  • Create tests and the gem
    The current code can be used in rababa/lib with:

ruby script.rb -m ../models-data/diacritization_model.onnx --t 'מה שלומך'
ruby script.rb -m ../models-data/diacritization_model.onnx --f '../python/data/test/test.txt'

There is also some small sample data under rababa/example.

The Hebrew diacritization model can be downloaded from https://github.com/secryst/rababa-models.

Move test from `reconcile.rb` to RSpec

Original code:

d_tests = [{'original' => '# گيله پسمير الجديد 34',
            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },

           {'original' => 'abc',
            'diacritized' => '',
            'reconciled' => 'abc'},

           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
            'diacritized' => '',
            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},

           {'original' => '26 سبتمبر العقبة',
            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]

d_tests.each {|d| \
    if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
        raise Exception.new('reconcile string not matched')
    end
}

or:

for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
do
    ruby rababa.rb -t "$s" -m '../models-data/diacritization_model.onnx'
done

Test Rababa on GNDB Arabic data

This is easier than #4:

  1. Fetch ara_Arab2Latn_BGN_1956.csv (27.9MB) from https://github.com/interscript/geonames-transliteration-data/releases/download/v20210705/pairs.zip

  2. Run the NNets on the Arabic (SRC_FULL_NAME_RO column)

  3. Run the output of step 2 through the Interscript system (ara_Arab2Latn_BGN_1956 is this one: https://www.interscript.org/systems/bgnpcgn-ara-Arab-Latn-1956)

  4. Compare the DEST_FULL_NAME_RO column of ara_Arab2Latn_BGN_1956.csv with the output of step 3.
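A sketch of this pipeline follows. The column names come from the issue text; the chained "Rababa then BGN/PCGN 1956" system string mirrors the diacritization example earlier in this README and is an assumption here, as is driving the interscript CLI through plain text files.

# Sketch of the GNDB comparison described above (assumptions noted in the text).
import csv
import subprocess

with open("ara_Arab2Latn_BGN_1956.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Step 2 input: one Arabic name per line.
with open("gndb-ara-in.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(row["SRC_FULL_NAME_RO"] for row in rows))

# Steps 2 and 3: diacritize with Rababa, then transliterate with BGN/PCGN 1956.
subprocess.run(
    ["interscript", "-s",
     "var-ara-Arab-Arab-rababa|bgnpcgn-ara-Arab-Latn-1956",
     "gndb-ara-in.txt", "-o", "gndb-ara-out.txt"],
    check=True,
)

# Step 4: compare against the expected romanizations.
with open("gndb-ara-out.txt", encoding="utf-8") as f:
    produced = f.read().splitlines()

matches = sum(
    p.strip() == row["DEST_FULL_NAME_RO"].strip()
    for p, row in zip(produced, rows)
)
print(f"{matches}/{len(rows)} rows match")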

Enable CPU training

Currently fails with no GPU:

  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 852, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 530, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 552, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 850, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 166, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
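The configuration shown in the other training issues hardcodes device : cuda. A CPU fallback along the following lines would avoid the assertion; where exactly the device is selected in the Rababa code base is an assumption.

# Sketch: select CUDA only when it is actually available, otherwise fall back
# to CPU, instead of hardcoding device = "cuda".
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 8).to(device)  # placeholder model for illustration
print(device)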

Implement reverse diacritization based on GNDB dataset

GNDB dataset: https://github.com/interscript/geonames-transliteration-data/releases

We want to perform the following steps:

  1. For every "unpointed Arabic" and "transliterated Arabic" pair
  2. reverse transliterate the "transliterated Arabic" into "pointed Arabic"

Then, use this data of "unpointed Arabic" to "pointed Arabic" to verify that the Arabic diacriticization models work.

Interscript now supports reverse script conversion (need to read it up).

Ping @AhMohsen46 @gilgameshjw .

Add testing framework for Python

Port tests from Ruby to Python.

d_tests = [{'original' => '# گيله پسمير الجديد 34',
            'diacritized' => 'يَلِهُ سُمِيْرٌ الجَدِيدُ',
            'reconciled' => '# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34' },

           {'original' => 'abc',
            'diacritized' => '',
            'reconciled' => 'abc'},

           {'original' => '‘Iz. Ibrāhīm as-Sa‘danī',
            'diacritized' => '',
            'reconciled' => '‘Iz. Ibrāhīm as-Sa‘danī'},

           {'original' => '26 سبتمبر العقبة',
            'diacritized' => 'سَبْتَمْبَرِ العَقَبَة',
            'reconciled' => '26 سَبْتَمْبَرِ العَقَبَة'}]

d_tests.each {|d| \
    if not d['reconciled']==reconcile_strings(d['original'], d['diacritized'])
        raise Exception.new('reconcile string not matched')
    end
}

or:

for s in '# گيله پسمير الجديد 34' 'abc' '‘Iz. Ibrāhīm as-Sa‘danī' '26 سبتمبر العقبة'
do
    ruby rababa.rb -t "$s" -m '../models-data/diacritization_model.onnx'
done
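A hedged sketch of how these cases could look under pytest. The import path for reconcile_strings is an assumption (a hypothetical module name); point it at wherever the reconcile helper actually lives under python/.

# Sketch: port of the reconcile tests above to pytest.
# NOTE: the import path below is hypothetical; adjust it to the real module.
import pytest
from reconcile import reconcile_strings

CASES = [
    ("# گيله پسمير الجديد 34", "يَلِهُ سُمِيْرٌ الجَدِيدُ", "# گيَلِهُ پسُمِيْرٌ الجَدِيدُ 34"),
    ("abc", "", "abc"),
    ("‘Iz. Ibrāhīm as-Sa‘danī", "", "‘Iz. Ibrāhīm as-Sa‘danī"),
    ("26 سبتمبر العقبة", "سَبْتَمْبَرِ العَقَبَة", "26 سَبْتَمْبَرِ العَقَبَة"),
]

@pytest.mark.parametrize("original, diacritized, reconciled", CASES)
def test_reconcile_strings(original, diacritized, reconciled):
    assert reconcile_strings(original, diacritized) == reconciled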

Unable to train model in Python

I've been trying to run Rababa to train models, but have been unable to do so due to this error:

Traceback (most recent call last):
  File "/home/runner/work/rababa-models/rababa-models/rababa/python/train.py", line 8, in <module>
    from trainer import (
  File "/home/runner/work/rababa-models/rababa-models/rababa/python/trainer.py", line 15, in <module>
    from diacritizer import CBHGDiacritizer
ImportError: cannot import name 'CBHGDiacritizer' from 'diacritizer' (/home/runner/work/rababa-models/rababa-models/rababa/python/diacritizer.py)

The CBHGDiacritizer code no longer exists in Rababa.

I've uploaded the workflow that attempts the training here:
https://github.com/secryst/rababa-models/blob/main/.github/workflows/train.yml

The full logs can be seen here:
https://github.com/secryst/rababa-models/runs/3218393663?check_suite_focus=true

Improve training dataset

The Rababa models today are trained on the Tashkeela corpus.

In Tashkeela, 98% of the content comes from Shamela.

There are some other additional datasets that are either pointed or can be made into pointed datasets.

Pointed datasets:

AlJazeera Learning also offers an Arabic diacritizer, which we can test against:

The endpoint goes to:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web:

Datasets that could potentially be pointed...:
