
BabyLlama


[Cover image: BabyLlama and its teachers, as depicted by DALL·E 3]

Very basic training code for BabyLlama, our submission to the strict-small track of the BabyLM challenge. See our paper for more details.

We perform some basic regex-based cleaning of the dataset and then train a tokenizer on the cleaned data. Both steps are performed in cleaning_and_tokenization.ipynb. The notebook assumes that the BabyLM dataset (/babylm_10M and /babylm_dev) is placed or symlinked in the /data folder. The tokenizer is saved in the /models folder. We use the same tokenizer for both the teacher and student models.
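For reference, this kind of tokenizer training can be sketched with the Hugging Face tokenizers library. The vocabulary size, special tokens, and file paths below are illustrative assumptions, not the exact values used in the notebook.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer trained on the cleaned BabyLM text
# (vocabulary size, special tokens, and paths are placeholders).
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"],
)
tokenizer.train(files=["data/babylm_10M_clean.txt"], trainer=trainer)
tokenizer.save("models/tokenizer.json")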

To train the teacher models:

python train.py --config ./config/gpt-705M.yaml

Train the second teacher analogously with llama-360M.yaml. You can also override the learning rate and the model name defined in the config by passing the --lr and --model_name arguments, respectively. The trained model is saved in the /models folder.

Once the two teacher models are trained, run distill-ensemble-pretraining-baby-llama.py to train the student model using the distillation loss. We modified the Trainer from this repository. Note that it is not optimized to run on multiple GPUs (the teachers are placed on a single GPU). With the current settings (model sizes and batch sizes), everything fits on a single 20 GB GPU.
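For example, to override both from the command line (the values here are just illustrative):

python train.py --config ./config/llama-360M.yaml --lr 3e-4 --model_name Llama-360M-custom

The distillation objective combines the usual cross-entropy on the data with a KL-divergence term against the averaged teacher distribution. The following is a minimal sketch of that idea, not the exact code in distill-ensemble-pretraining-baby-llama.py; the temperature, the equal weighting, and the function name are assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=2.0, alpha=0.5):
    # Hard-label cross-entropy between the student predictions and the data.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # Average the teachers' softened distributions to form the ensemble.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the ensemble and the student at temperature T,
    # rescaled by T^2 as is standard in knowledge distillation.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

    return alpha * ce + (1.0 - alpha) * kl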

Llama training speed

During our tests, we found that Llama trains significantly faster than GPT-2: it reaches its minimum eval loss in nearly half the number of epochs that GPT-2 needs. There are two main differences between the models: GPT-2 uses trainable positional embeddings, while Llama employs Rotary Positional Embedding (RoPE, sketched below); additionally, Llama uses SwiGLU instead of plain MLP layers.
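As a rough illustration of that positional-encoding difference, the snippet below applies a rotary embedding to a sequence of vectors. It is a generic sketch (the pairing convention, base frequency, and function name are assumptions), not the implementation used by the Llama or GPT-J configs in this repo.

import torch

def apply_rope(x, base=10000.0):
    # Rotate pairs of channels by a position-dependent angle instead of
    # adding a learned positional vector (as GPT-2 does).
    # x: tensor of shape (seq_len, dim), with dim even.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)             # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos * inv_freq                                                   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Each (x1, x2) pair is rotated by its angle; the exact channel ordering
    # varies between implementations.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)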

To try to isolate these two effects, we also trained GPT-J, which uses RoPE but not SwiGLU (although we used the default settings and did not attempt to make the RoPE implementations match precisely). To make the comparison with GPT-2 more accurate, we enabled weight tying in both Llama and GPT-J (this feature is disabled by default). We performed a grid search for the optimal learning rate, which happened to be the same for all three models, using the 10M BabyLM dataset (strict-small task). We then trained all models on the 100M dataset (strict task; see the *-strict.yaml configs). The result is shown below.

[Figure: eval-loss curves for Llama, GPT-2, and GPT-J on the 100M dataset]

Llama achieves a lower loss than GPT-J, and does so more quickly than GPT-2. It seems that SwiGLU, a gated unit that is quadratic in its inputs, performs better.
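To make the comparison concrete, here is a generic PyTorch sketch of the two feed-forward variants. The hidden sizes, bias choices, and module names are assumptions for illustration, not the exact blocks used by the configs in this repo.

import torch.nn as nn
import torch.nn.functional as F

class VanillaMLP(nn.Module):
    # GPT-2-style feed-forward block: a plain expansion with a GELU nonlinearity.
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.fc_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc_out(F.gelu(self.fc_in(x)))

class SwiGLU(nn.Module):
    # Llama-style feed-forward block: a SiLU-gated branch multiplies a linear
    # branch, which makes the unit quadratic in its inputs.
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))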

babyllama's People

Contributors: timinar, JLTastet


babyllama's Issues

Unable to reproduce???

Hello, I followed the configuration from your paper and used GPT-2 (705M, eval_loss: 3.75) and Llama (360M, eval_loss: 3.746) to distill BabyLlama (58M), but I got eval_loss: 27.985. The gap is very large. What could be the reason?

In addition, I trained BabyLlama (58M) from scratch alone and got eval_loss: 4.023. The following are my eval-loss curves for the above four models (GPT-2 (705M, eval_loss: 3.75), Llama (360M, eval_loss: 3.746), BabyLlama (58M), distilled BabyLlama (58M)):
[eval-loss screenshots and other results attached to the issue]

evaluation code

Hi, did you use the code contained in this repo?

If not, would you mind providing one, especially for (Super)GLUE and MSGS (MCC)?

Evaluate the Llama2-7B

@JLTastet @timinar Excuse me, how should I distill the Llama-2-7B model to obtain a 3.5B Llama-2 model with BabyLlama? At the same time, I want to use a local Llama-2-7B model whose path is /home/Llama-2-7B.
