View Code? Open in Web Editor NEW

This project forked from timinar/babyllama

Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge.

License: MIT License

Python 68.42% Jupyter Notebook 31.58%

babyllama's Introduction

BabyLlama

Very basic training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge. See our paper for more details.

We perform some basic regex-based cleaning of the dataset and then train a tokenizer on the cleaned dataset. This is performed in cleaning_and_tokenization.ipynb. The notebook assumes that the babylm dataset (/babylm_10M and /babylm_dev) is placed or symlinked in the /data folder. The tokenizer is saved in '/models' folder. We use the same tokenizer for both teacher and student models.

To train the teacher models:

python train.py --config ./config/gpt-705M.yaml

And analogously for llama-360M.yaml. One can also rewrite the learning rate and the model name defined in the config by adding arguments --lr and --model_name respectively. The trained model is saved in the /models folder. Once the two teacher models are trained, run distill-ensemble-pretraining-baby-llama.py to train the student model using the distillation loss. We modified the Trainer from this repository. Notice that it is not optimized to run on multiple GPUs (teachers are placed on a single GPU). With the current settings (model sizes and batch sizes) everything fits on a single 20GB GPU.

Recommend Projects

flyinggh / babyllama Goto Github PK

babyllama's Introduction

BabyLlama

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent