sramshetty / mixture-of-depths

Mixture of Depths

An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"

Setup

  • First, follow the instructions for setting up your environment for Llama 2 here.
  • Then install the one additional dependency:
pip install einops

Details

  • Implements MoD in Llama 2

  • Follows the paper's configuration, with some assumptions:

    • Route every other layer
    • Training configurations for both proposed causal-inference methods
  • Notes on the auxiliary router for causal inference:

    • Currently, it is trained separately, after MoD Llama is trained.
    • This is a simple task, since high token-prediction accuracy is reached quickly; using a simple dataset simplifies it further.
  • MoD_training.ipynb demonstrates training and was used for the results below.

  • MoD_sampling.ipynb demonstrates generation with each method.
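The routing scheme described above (score tokens per block, process only the top-k) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `MoDBlock`, the capacity fraction, and the assumption that the wrapped block returns its residual delta are all illustrative choices.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of a Mixture-of-Depths wrapper: a linear router scores each
    token, only the top-k tokens are processed by the wrapped block, and
    the rest skip it entirely via the residual path."""

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.125):
        super().__init__()
        self.block = block        # assumed to return its residual delta
        self.router = nn.Linear(dim, 1)
        self.capacity = capacity  # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, dim = x.shape
        k = max(1, int(seq_len * self.capacity))

        scores = self.router(x).squeeze(-1)             # (bsz, seq_len)
        topk_weight, topk_idx = scores.topk(k, dim=-1)  # per-sequence top-k

        # Gather the selected tokens and run only them through the block.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)
        selected = x.gather(1, idx)                     # (bsz, k, dim)
        delta = self.block(selected)                    # (bsz, k, dim)

        # Scale the block output by the router weight (keeping routing
        # differentiable) and scatter it back onto the residual stream;
        # unselected tokens pass through unchanged.
        return x.scatter_add(1, idx, topk_weight.unsqueeze(-1) * delta)
```

Routing every other layer, as the paper does, would mean wrapping alternate transformer blocks in `MoDBlock` and leaving the rest as ordinary dense blocks.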

Results

  • 50 million parameter model
    • C4
      • Baseline after 1 epoch:
        • Loss: 3.73
        • Samples/sec: 6.79
      • MoD w/ Auxiliary Loss after 1 epoch:
        • Loss: 3.81
        • Samples/sec: 8.15
      • MoD w/ Auxiliary Router after 1 epoch:
        • Loss: 4.19
        • Samples/sec: 7.64
    • Tiny Stories
      • Baseline after 5 epochs:
        • Loss: 2.46
        • Samples/sec: 11.22
      • MoD w/ Auxiliary Loss after 5 epochs:
        • Loss: 2.55
        • Samples/sec: 11.33
      • MoD w/ Auxiliary Router after 5 epochs:
        • Loss: 2.48
        • Auxiliary Router Causal Loss: 0.15
        • Samples/sec: 11.54

TODO

  • Validate
  • Sampling methods
    • Auxiliary loss
    • "Second" router

Citations

@misc{raposo2024mixtureofdepths,
    title={Mixture-of-Depths: Dynamically allocating compute in transformer-based language models}, 
    author={David Raposo and Sam Ritter and Blake Richards and Timothy Lillicrap and Peter Conway Humphreys and Adam Santoro},
    year={2024},
    eprint={2404.02258},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

mixture-of-depths's Issues

compute about attention

It seems that in your code all the tokens in a sequence are passed through the transformer (attention and FFN) block?

Normalize topk_weight

Hi,
First of all, thanks for your implementation!
For the selected tokens, you multiply topk_weight by the output of the transformer block (here).

I think that without any normalization, this multiplication can cause the model to assign excessively large values to the hidden state and produce NaNs after the rms_norm layer.

In this implementation they use a softmax to normalize topk_weight, but they also say that the softmax breaks causality, and they mention the auxiliary router as well.

I'm a bit confused about this normalization and the auxiliary router.

Thank you for your time!
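The trade-off the issue raises can be shown in a few lines. This is a sketch, not code from the repository: `scores` stands in for the raw top-k router scores, and both normalizations shown are generic options rather than the repo's actual choice.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(2, 4)  # hypothetical raw top-k router scores

# Softmax normalization: the selected tokens' weights sum to 1, so the
# multiplier is bounded, but each weight depends on the other tokens'
# scores, which breaks causality during autoregressive decoding.
softmax_weight = scores.softmax(dim=-1)

# Sigmoid keeps each weight in (0, 1) using only that token's own score,
# so it remains causal, at the cost of no joint normalization.
sigmoid_weight = scores.sigmoid()
```

This is the tension the paper's auxiliary router is meant to resolve: a separate predictor makes a per-token, causal decision instead of relying on sequence-wide statistics.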

Qs on inference

Here, for autoregressive inference, the input sequence length should be 1. It then seems every token will be routed to the attention and MLP, since it will always be selected among the top-k token weights. That's confusing.
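The degenerate top-k the issue describes is why the paper proposes a per-token decision at inference time. A minimal sketch of that idea, assuming a trained auxiliary router (`aux_router`, `route_token`, and the 0.5 threshold are illustrative names and choices, not the repository's API):

```python
import torch
import torch.nn as nn

dim = 16
aux_router = nn.Linear(dim, 1)  # hypothetical trained auxiliary router

def route_token(x_t: torch.Tensor) -> bool:
    """Per-token causal routing decision for autoregressive decoding:
    instead of a top-k over the sequence (which degenerates when
    seq_len == 1), threshold the auxiliary router's predicted probability
    that this token would have been among the top-k at training time."""
    prob = torch.sigmoid(aux_router(x_t)).item()
    return prob > 0.5

x_t = torch.randn(1, dim)       # a single decoded token's hidden state
use_block = route_token(x_t)    # True -> run attention/MLP, False -> skip
```

With this scheme, a decoded token can genuinely skip a block, so the compute savings survive at inference time.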
