This project enables distilling multilingual transformers into language-specific students. Features include:
- Adjust any distillation loss
- Any number of students and languages per student
- Change the teacher and student architecture
- Choose between monolingual, bilingual, or multilingual distillation setup
- Component Sharing across students
- Initialization of students from teacher layers
Configure your environment first.
# clone project
git clone https://github.com/MinhDucBui/clkd.git
cd clkd
# [OPTIONAL] create conda environment
conda create -n myenv python=3.9
conda activate myenv
# install requirements
pip install -r requirements.txt
# Please make sure to have the right PyTorch version: install PyTorch according to the instructions at
# https://pytorch.org/get-started/
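The repository also ships a .env.example template for private environment variables (see the directory structure at the end of this README). If your logger needs credentials such as an API key, you can copy the template and fill in your values; which variables are required depends on your logger config, so treat this as an optional step.
# [OPTIONAL] store private environment variables (e.g. logger API keys)
cp .env.example .env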
Download the English and Turkish datasets from CC-100 here. Alternatively, to speed up the download, use the smaller Urdu and Swahili datasets.
# change to data folder
cd data/cc100
# Download English Data (82GB) and Turkish Data (5.4GB)
wget http://data.statmt.org/cc-100/en.txt.xz
wget http://data.statmt.org/cc-100/tr.txt.xz
# Alternative: Urdu (884MB) and Swahili (332MB)
wget http://data.statmt.org/cc-100/ur.txt.xz
wget http://data.statmt.org/cc-100/sw.txt.xz
# Change back to original folder
cd ..
cd ..
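The CC-100 dumps are xz-compressed. This README does not state whether the datamodule reads .xz files directly, so if it expects plain .txt files (check configs/datamodule), decompress them from the project root first, for example:
# decompress the downloaded dumps in place (removes the .xz suffix)
unxz data/cc100/en.txt.xz data/cc100/tr.txt.xz
# alternative: Urdu and Swahili
unxz data/cc100/ur.txt.xz data/cc100/sw.txt.xz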
Execute the main script. The default setting uses the same strategy as MonoShot.
# Choose GPU Device
export CUDA_VISIBLE_DEVICES=0
# execute main script
python run.py
# Alternative: Urdu-Swahili pair
python run.py experiment=monolingual_urdu_swahili
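Because the configuration is managed by Hydra, the standard Hydra multirun flag can launch several experiments in sequence (a generic Hydra feature, not specific to this project):
# run both setups one after another via Hydra multirun
python run.py -m experiment=monolingual,monolingual_urdu_swahili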
Change Distillation Loss
Hydra allows you to easily override any parameter defined in your config. See configs/students/individual/loss for all loss functions.
python run.py students/individual/loss=monoalignment
To construct your own distillation loss, we provide base losses that can be combined into the final loss. Furthermore, we provide all distillation losses used in this thesis here.
Example of constructing a distillation loss from the MLM loss and logit distillation with a cross-entropy (CE) loss, using equal weighting:
_target_: src.loss.loss.GeneralLoss

defaults:
  - base_loss@base_loss.mlm: mlm.yaml
  - base_loss@base_loss.softtargets_ce: softtargets_ce.yaml

base_loss:
  softtargets_ce:
    temperature: 4.0

loss_weighting:
  mlm: 0.5
  softtargets_ce: 0.5
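To use such a config, save it as a new file in configs/students/individual/loss (the file name below is only a placeholder) and select it with the usual Hydra override:
# select the custom loss config (placeholder file name: mlm_softtargets.yaml)
python run.py students/individual/loss=mlm_softtargets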
Change Student Number and Languages
We constructed some default configs for different scenarios:
# monolingual setting with english-turkish language pair
python run.py experiment=monolingual
# monolingual setting with english-basque language pair
python run.py experiment=monolingual_eu
# monolingual setting with english-swahili language pair
python run.py experiment=monolingual_sw
# monolingual setting with english-urdu language pair
python run.py experiment=monolingual_ur
# bilingual setting with english-turkish language pair
python run.py experiment=monolingual_bilingual
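These experiment configs compose with the other overrides shown in this README; for example, you can combine a scenario with a different distillation loss:
# monolingual setting combined with the monoalignment loss
python run.py experiment=monolingual students/individual/loss=monoalignment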
To construct a custom setting, please see the documentation here.
Embedding Sharing across Students
# Share language embeddings only within each student, not across students.
python run.py students.embed_sharing="in_each_model"
To construct a custom setting, please see the documentation here.
Layer Sharing across Students
Please see the documentation here.
Change Student Architecture
# Use the same architecture as the teacher
python run.py students/individual/model=from_teacher
More architectures can be found here.
Student Initialization
By default, students are initialized with weights from the teacher.
# Randomly Initialize Embedding Weights
python run.py students.individual.model.weights_from_teacher.embeddings=False
# Randomly Initialize Layer Weights
python run.py students.individual.model.weights_from_teacher.transformer_blocks=False
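Both overrides can be set in the same call (standard Hydra behavior) to initialize the students entirely at random:
# Randomly initialize both embeddings and transformer blocks
python run.py students.individual.model.weights_from_teacher.embeddings=False students.individual.model.weights_from_teacher.transformer_blocks=False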
The directory structure of the project looks like this:
├── configs                    <- Hydra configuration files
│   ├── callbacks                  <- Callbacks configs
│   ├── collate_fn                 <- Collate functions configs
│   ├── datamodule                 <- Datamodule configs
│   ├── distillation_setup         <- Distillation configs
│   ├── evaluation                 <- Evaluation configs
│   ├── experiment                 <- Experiment configs
│   ├── hydra                      <- Hydra related configs
│   ├── logger                     <- Logger configs
│   ├── students                   <- Student configs
│   ├── teacher                    <- Teacher configs
│   ├── trainer                    <- Trainer configs
│   │
│   └── config.yaml                <- Main project configuration file
│
├── data                       <- Project data
│
├── logs                       <- Logs generated by Hydra and PyTorch Lightning loggers
│
├── src
│   ├── callbacks                  <- Lightning callbacks
│   ├── datamodules                <- Lightning datamodules
│   ├── distillation               <- Distillation setup files
│   ├── evaluation                 <- Evaluation files
│   ├── loss                       <- Loss files
│   ├── models                     <- Lightning models
│   ├── utils                      <- Utility scripts
│   │
│   └── train.py                   <- Training pipeline
│
├── run.py                     <- Run pipeline with chosen experiment configuration
│
├── .env.example               <- Template of the file for storing private environment variables
├── .gitignore                 <- List of files/folders ignored by git
├── .pre-commit-config.yaml    <- Configuration of automatic code formatting
├── setup.cfg                  <- Configurations of linters and pytest
├── Dockerfile                 <- File for building docker container
├── requirements.txt           <- File for installing python dependencies
├── LICENSE
└── README.md