Welcome to Meshakkelaty.ai, an Arabic text diacritization (Tashkeel) engine built with neural and statistical techniques. The project aims to accurately predict and add diacritics to Arabic text, improving readability and comprehension. The Meshakkelaty.ai model achieved first place on Kaggle.
The Meshakkelaty.ai diacritization system employs a dual-model architecture consisting of:
- A neural stacked Bidirectional Long Short-Term Memory (BiLSTM) model that captures sequential dependencies and context within the Arabic text, inspired by this research paper, but on steroids!
- A Statistical Post-Processing model that operates on the output generated by the neural model to further refine the diacritization results, inspired by this research paper
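The two-stage flow described above can be sketched roughly as follows. This is only an illustration of the idea of a neural pass followed by a statistical correction pass, not the project's actual code: the function names (`strip_diacritics`, `build_lookup`, `diacritize`), the word-level lookup rule, and the `neural_predict` callable standing in for the BiLSTM are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Arabic diacritic marks (tashkeel): U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_diacritics(text):
    """Remove all diacritic marks, leaving only the base characters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def build_lookup(diacritized_words):
    """Hypothetical statistical model: map each bare word to its diacritized
    form, keeping only words that appear with exactly one diacritization."""
    forms = defaultdict(Counter)
    for word in diacritized_words:
        forms[strip_diacritics(word)][word] += 1
    return {bare: next(iter(seen)) for bare, seen in forms.items() if len(seen) == 1}


def diacritize(text, neural_predict, lookup):
    """Stage 1: neural prediction. Stage 2: per-word statistical correction."""
    predicted = neural_predict(text)  # stand-in for the BiLSTM output
    corrected = [lookup.get(strip_diacritics(word), word) for word in predicted.split()]
    return " ".join(corrected)


# Small demonstration with an assumed neural error on the first letter.
FATHA, DAMMA = "\u064E", "\u064F"
KATABA = "\u0643" + FATHA + "\u062A" + FATHA + "\u0628" + FATHA  # "kataba"
WRONG = "\u0643" + DAMMA + "\u062A" + FATHA + "\u0628" + FATHA   # simulated mistake

lookup = build_lookup([KATABA, KATABA])
result = diacritize("\u0643\u062A\u0628", lambda text: WRONG, lookup)
```

In this sketch the post-processor only overrides the neural output for words it has seen with a single, unambiguous diacritization, and otherwise leaves the neural prediction untouched.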
To use Meshakkelaty.ai, follow these steps:
- Clone the repository
git clone https://github.com/Omar-Al-Sharif/Meshakkelaty.ai.git
- Install the necessary dependencies:
pip install -r Meshakkelaty.ai/requirements.txt
- Acquire your data and place it in the data directory under the names train.txt and val.txt
- Change to the scripts directory:
cd Meshakkelaty.ai/scripts
- Prepare your data by running the following command
python tokenize_dataset.py
- Train the neural model on your data
python train_neural_model.py
- Train the statistical model on your data
python train_statistical_model.py
- Put your input text inside:
../data/test_input.txt
- Diacritize the input text by running:
python predict.py
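For convenience, the script steps above could be chained from a small driver. The sketch below is not part of the repository: `run_pipeline`, `missing_inputs`, and the up-front check for input files are hypothetical helpers, while the script and file names are taken from the steps listed above. It assumes the scripts take no arguments and are run from inside the scripts directory, as shown.

```python
import subprocess
import sys
from pathlib import Path

# Script names and their order come from the steps above.
PIPELINE = [
    "tokenize_dataset.py",
    "train_neural_model.py",
    "train_statistical_model.py",
    "predict.py",
]

# Files the steps above expect to find in the data directory.
REQUIRED_INPUTS = ["train.txt", "val.txt", "test_input.txt"]


def missing_inputs(data_dir):
    """Return the names of required input files that are not present."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (data_dir / name).is_file()]


def run_pipeline(scripts_dir, runner=subprocess.run):
    """Run every step in order from scripts_dir, stopping at the first failure."""
    scripts_dir = Path(scripts_dir).resolve()
    missing = missing_inputs(scripts_dir.parent / "data")
    if missing:
        raise FileNotFoundError("missing input files in data: " + ", ".join(missing))
    for script in PIPELINE:
        # check=True raises CalledProcessError if a step exits with an error.
        runner([sys.executable, script], cwd=scripts_dir, check=True)


# Example call (not executed here):
# run_pipeline("Meshakkelaty.ai/scripts")
```

Checking for all three input files before starting is a design choice of this sketch: it avoids discovering a missing test_input.txt only after both training steps have finished.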