This project enables distilling multilingual transformers into language-specific students. Features include:
- Adjust any distillation loss
- Any number of students and languages per student
- Change the teacher and student architecture
- Choose between monolingual, bilingual, or multilingual distillation setup
- Component Sharing across students
- Initialization of students from teacher layers
Configure your environment first.
# clone project
git clone https://github.com/MinhDucBui/clkd.git
cd clkd
# [OPTIONAL] create conda environment
conda create -n myenv python=3.9
conda activate myenv
# install requirements
pip install -r requirements.txt
# Please make sure to have the right PyTorch version: install PyTorch according to the instructions at
# https://pytorch.org/get-started/
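The repository also ships a .env.example template for private environment variables (see the directory structure at the end of this README). If your logger needs credentials such as an API key, you can copy the template and fill in your values; which variables are required depends on your logger config, so treat this as an optional step.
# [OPTIONAL] store private environment variables (e.g. logger API keys)
cp .env.example .env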
Download the English and Turkish datasets from CC-100 here. Alternatively, to speed up the download, use the smaller Urdu and Swahili datasets.
# change to data folder
cd data/cc100
# Download English Data (82GB) and Turkish Data (5.4GB)
wget http://data.statmt.org/cc-100/en.txt.xz
wget http://data.statmt.org/cc-100/tr.txt.xz
# Alternative: Urdu (884MB) and Swahili (332MB)
wget http://data.statmt.org/cc-100/ur.txt.xz
wget http://data.statmt.org/cc-100/sw.txt.xz
# Change back to original folder
cd ..
cd ..
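The CC-100 dumps are xz-compressed. This README does not state whether the datamodule reads .xz files directly, so if it expects plain .txt files (check configs/datamodule), decompress them from the project root first, for example:
# decompress the downloaded dumps in place (removes the .xz suffix)
unxz data/cc100/en.txt.xz data/cc100/tr.txt.xz
# alternative: Urdu and Swahili
unxz data/cc100/ur.txt.xz data/cc100/sw.txt.xz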
Execute the main script. The default setting uses the same strategy as MonoShot.
# Choose GPU Device
export CUDA_VISIBLE_DEVICES=0
# execute main script
python run.py
# Alternative: Urdu-Swahili pair
python run.py experiment=monolingual_urdu_swahili
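Because the configuration is managed by Hydra, the standard Hydra multirun flag can launch several experiments in sequence (a generic Hydra feature, not specific to this project):
# run both setups one after another via Hydra multirun
python run.py -m experiment=monolingual,monolingual_urdu_swahili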
Change Distillation Loss
Hydra allows you to easily override any parameter defined in your config. See configs/students/individual/loss for all loss functions.
python run.py students/individual/loss=monoalignment
To construct your own distillation loss, we provide base losses that can be combined into the final loss. Furthermore, we provide all distillation losses used in this thesis here.
Example of constructing a distillation loss from the MLM loss and logit distillation with a cross-entropy (CE) loss, using equal weighting:
_target_: src.loss.loss.GeneralLoss

defaults:
  - base_loss@base_loss.mlm: mlm.yaml
  - base_loss@base_loss.softtargets_ce: softtargets_ce.yaml

base_loss:
  softtargets_ce:
    temperature: 4.0

loss_weighting:
  mlm: 0.5
  softtargets_ce: 0.5
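To use such a config, save it as a new file in configs/students/individual/loss (the file name below is only a placeholder) and select it with the usual Hydra override:
# select the custom loss config (placeholder file name: mlm_softtargets.yaml)
python run.py students/individual/loss=mlm_softtargets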
Change Student Number and Languages
We constructed some default configs for different scenarios:
# monolingual setting with english-turkish language pair
python run.py experiment=monolingual
# monolingual setting with english-basque language pair
python run.py experiment=monolingual_eu
# monolingual setting with english-swahili language pair
python run.py experiment=monolingual_sw
# monolingual setting with english-urdu language pair
python run.py experiment=monolingual_ur
# bilingual setting with english-turkish language pair
python run.py experiment=monolingual_bilingual
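These experiment configs compose with the other overrides shown in this README; for example, you can combine a scenario with a different distillation loss:
# monolingual setting combined with the monoalignment loss
python run.py experiment=monolingual students/individual/loss=monoalignment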
To construct a custom setting, please see the documentation here.
Embedding Sharing across Students
# Share language embeddings only within each student, not across students.
python run.py students.embed_sharing="in_each_model"
To construct a custom setting, please see the documentation here.
Layer Sharing across Students
Please see the documentation here.
Change Student Architecture
# Use the same architecture as the teacher
python run.py students/individual/model=from_teacher
More architectures can be found here.
Student Initialization
By default, students are initialized with weights from the teacher.
# Randomly Initialize Embedding Weights
python run.py students.individual.model.weights_from_teacher.embeddings=False
# Randomly Initialize Layer Weights
python run.py students.individual.model.weights_from_teacher.transformer_blocks=False
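Both overrides can be set in the same call (standard Hydra behavior) to initialize the students entirely at random:
# Randomly initialize both embeddings and transformer blocks
python run.py students.individual.model.weights_from_teacher.embeddings=False students.individual.model.weights_from_teacher.transformer_blocks=False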
The directory structure of the project looks like this:
├── configs                    <- Hydra configuration files
│   ├── callbacks                  <- Callbacks configs
│   ├── collate_fn                 <- Collate functions configs
│   ├── datamodule                 <- Datamodule configs
│   ├── distillation_setup         <- Distillation configs
│   ├── evaluation                 <- Evaluation configs
│   ├── experiment                 <- Experiment configs
│   ├── hydra                      <- Hydra related configs
│   ├── logger                     <- Logger configs
│   ├── students                   <- Student configs
│   ├── teacher                    <- Teacher configs
│   ├── trainer                    <- Trainer configs
│   │
│   └── config.yaml                <- Main project configuration file
│
├── data                       <- Project data
│
├── logs                       <- Logs generated by Hydra and PyTorch Lightning loggers
│
├── src
│   ├── callbacks                  <- Lightning callbacks
│   ├── datamodules                <- Lightning datamodules
│   ├── distillation               <- Distillation setup files
│   ├── evaluation                 <- Evaluation files
│   ├── loss                       <- Loss files
│   ├── models                     <- Lightning models
│   ├── utils                      <- Utility scripts
│   │
│   └── train.py                   <- Training pipeline
│
├── run.py                     <- Run pipeline with chosen experiment configuration
│
├── .env.example               <- Template of the file for storing private environment variables
├── .gitignore                 <- List of files/folders ignored by git
├── .pre-commit-config.yaml    <- Configuration of automatic code formatting
├── setup.cfg                  <- Configurations of linters and pytest
├── Dockerfile                 <- File for building docker container
├── requirements.txt           <- File for installing python dependencies
├── LICENSE
└── README.md