
mega's Introduction

Mega: Moving Average Equipped Gated Attention

This is the PyTorch implementation of the Mega paper. The codebase is based on fairseq v0.9.0.

Mega: Moving Average Equipped Gated Attention

Xuezhe Ma*, Chunting Zhou*, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer

Updates

  1. [Oct 17th 2022] --fp16 training is now supported.
  2. [Jan 10th 2023] Released Mega Image.
  3. [Jan 28th 2023] Released checkpoints for WMT'16 English to German.

Setup

This repository requires Python 3.8+ and PyTorch 1.11+.

# Install from this repo
pip install -e .

For faster training, install NVIDIA's apex library by following the fairseq instructions.
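
For reference, the fairseq documentation has suggested installing apex roughly as follows; the exact build flags may have changed, so check the apex repository for its current instructions:

# Build apex with its CUDA/C++ extensions (flags as documented by fairseq at the time).
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./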

Examples

Model Checkpoints

Task           | Description                          | # params | Download
LRA            | Mega on LRA tasks                    | --       | mega.lra.zip
WMT'16 (En-De) | Mega-base on WMT'16 En-De            | 67M      | mega.wmt16ende.base.zip
WMT'16 (De-En) | Mega-base on WMT'16 De-En            | 67M      | mega.wmt16deen.base.zip
SC-Raw         | Mega-base/big on raw Speech Commands | 300k     | mega.sc.zip
WikiText-103   | Language modeling on WikiText-103    | 252M     | mega.wiki103.zip
Enwiki8        | Language modeling on Enwiki8         | 39M      | mega.enwiki8.zip

Experiments

Code Overview

  1. Mega layer is implemented in fairseq/modules/mega_layer.py.
  2. Mega encoder (LRA) is implemented in fairseq/models/lra/mega_lra_encoder.py.
  3. Mega decoder (LM) is implemented in fairseq/models/mega_lm.py.
  4. Mega encoder-decoder (NMT) is implemented in fairseq/models/mega.py.

Tips

  1. The released model checkpoints were trained with float32.
  2. When training Mega, note the following hyperparameters, which determine the model size. --encoder/decoder-embed-dim is the input embedding dimension. --encoder/decoder-hidden-dim is the expanded dimension of the attention value vectors. --encoder/decoder-ffn-embed-dim is the FFN intermediate dimension. --encoder/decoder-hidden-dim and --encoder/decoder-ffn-embed-dim are usually set to the same value, typically 2 times --encoder/decoder-embed-dim, to obtain a parameter count similar to a Transformer's. --encoder/decoder-z-dim is the shrunken dimension of the attention query/key vectors, usually set to 128 or 1/4 of --encoder/decoder-embed-dim. See the illustrative flag settings after this list.
  3. If you'd like to apply Mega to your own task or data, hyperparameters worth tuning besides the architecture size include the learning rate (lr) and weight decay (wd). We find that tuning wd is a more effective regularizer for Mega (in contrast to tuning dropout rates for Transformers). Suggested wd values are 0.01, 0.05, and 0.1; larger models typically need larger wd (see the appendix of our paper for the hyperparameters we used). For the lr scheduler, linear and cosine lr decay schedules are more effective for Mega than inverse square root decay.
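
To make the relationships above concrete, the snippet below sketches one hypothetical set of flags (illustrative values following the 2x and 1/4 ratios above, not settings taken from our experiments):

# Hypothetical helper variables (names chosen for illustration) collecting the flags
# discussed in tips 2 and 3: hidden dim = ffn dim = 2 x embed dim, z dim = embed dim / 4.
MEGA_SIZE_FLAGS="--encoder-embed-dim 512 --encoder-hidden-dim 1024 \
    --encoder-ffn-embed-dim 1024 --encoder-z-dim 128"
MEGA_REG_FLAGS="--weight-decay 0.05 --lr-scheduler linear_decay"
# Append ${MEGA_SIZE_FLAGS} ${MEGA_REG_FLAGS} to a full train.py command, such as the
# LRA commands shown elsewhere in this document.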

License

Mega is released under the MIT license. The license applies to the model checkpoints as well.

Citation

@article{ma2022mega,
  title={Mega: Moving Average Equipped Gated Attention},
  author={Ma, Xuezhe and Zhou, Chunting and Kong, Xiang and He, Junxian and Gui, Liangke and Neubig, Graham and May, Jonathan and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2209.10655},
  year={2022}
}

mega's People

Contributors

jacqueline-he, kytimmylai, violet-zct


mega's Issues

Regarding the damping factor δ

Hello,

I am not sure whether this is a bug, but I have a question about the damping factor δ after reading the paper and the code.

The paper says that "MEGA allows the damping of the influence of the previous time step". I assume this means that δ is responsible for damping the influence of the previous time step.

The formula (3) in the paper is: $y_t = \alpha \odot x_t + (1 - \alpha \odot \delta) \odot y_{t-1}$
But if $\delta$ is for damping the previous time step, shouldn't it be $y_t = \alpha \odot x_t + ((1 - \alpha) \odot \delta) \odot y_{t-1}$?

With formula (3), δ is between 0 and 1, so $(1 - \alpha \odot \delta) \ge (1 - \alpha)$, which enhances the influence of the previous time step instead of damping it. I thought it was a misplacement of parentheses in the paper. However, I checked the code in this repo and found that formula (3) is also what is implemented (https://github.com/facebookresearch/mega/blob/main/fairseq/modules/exponential_moving_average.py#L79), so formula (3) seems to be the intended formula.
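
As a concrete scalar example (values chosen purely for illustration), take $\alpha = 0.5$ and $\delta = 0.5$: formula (3) gives a coefficient of $1 - \alpha\delta = 0.75$ on $y_{t-1}$, which is larger than both $1 - \alpha = 0.5$ and the alternative coefficient $(1 - \alpha)\delta = 0.25$.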

Could you clarify whether formula (3) is the correct formula? If so, why does it have a damping effect on the previous time steps?
Thanks!

Non-commercial license

Hello,

First of all, nice work on the MEGA paper - I enjoyed reading it, and the results are quite impressive. I have done some initial testing with MEGA, and it seems extremely promising: throughput on my long-document tasks is much higher with MEGA-chunk than I've seen with other linear self-attention mechanisms.

However, I noticed that the license in this repository does not permit commercial use. Is there a possibility of relaxing this license to something more permissive, such as fairseq's MIT License? Thanks!

How to get test accuracy

I have trained several models on tasks from LRA, but I'm not 100% sure how to obtain the accuracy on the test set. Could you provide a command for this?

Lack of src-bin/dict.txt

When I try to train Mega on Long Range Arena, I get an error reporting that there is no "src-bin/dict.txt" in my data path, but there is no 'src-bin' directory in lra.zip. How can I solve this? Thank you.

Traceback (most recent call last):
  File "../train.py", line 14, in <module>
    cli_main()
  File "/tmp/code/fairseq_cli/train.py", line 446, in cli_main
    distributed_utils.call_main(args, main)
  File "/tmp/code/fairseq/distributed_utils.py", line 193, in call_main
    main(args, **kwargs)
  File "/tmp/code/fairseq_cli/train.py", line 78, in main
    task = tasks.setup_task(args)
  File "/tmp/code/fairseq/tasks/__init__.py", line 17, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args, **kwargs)
  File "/tmp/code/fairseq/tasks/long_range_arena.py", line 95, in setup_task
    data_dict = cls.load_dictionary(os.path.join(args.data, 'src-bin', 'dict.txt'))
  File "/tmp/code/fairseq/tasks/long_range_arena.py", line 88, in load_dictionary
    dictionary = Dictionary.load(filename)
  File "/tmp/code/fairseq/data/dictionary.py", line 214, in load
    d.add_from_file(f)
  File "/tmp/code/fairseq/data/dictionary.py", line 227, in add_from_file
    raise fnfe
  File "/tmp/code/fairseq/data/dictionary.py", line 224, in add_from_file
    with PathManager.open(f, "r", encoding="utf-8") as fd:
  File "/tmp/code/fairseq/file_io.py", line 46, in open
    return open(
FileNotFoundError: [Errno 2] No such file or directory: '<my_data_path>/src-bin/dict.txt'

Tokenization for downstream tasks

First of all thank you very much for your work.

I am working on a long text classification task, and given the spectacular results of MEGA on long-sequence modelling I wanted to use it here. The only thing I haven't figured out is how to tokenize my text samples, so I was wondering if someone could explain how to tokenize my text with the dict obtained from a checkpoint such as the LRA one for the text task.
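
For instance, is it a matter of running something like the command below with the dict.txt extracted from the checkpoint archive? This is just my guess at the intended workflow, using fairseq's standard preprocess.py, and the file names are placeholders of my own.

# Binarize my own text files while reusing the dictionary shipped with the checkpoint.
python preprocess.py --only-source \
    --trainpref train.input --validpref valid.input --testpref test.input \
    --srcdict dict.txt --destdir data-bin/my-task/src-bin \
    --workers 8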

Thank you very much for your time

Maximum context length?

Hello, in the paper a context length of 49k tokens was used, which is great, but can this model go any further? What is the maximum context length possible?

The lack of warmup-updates

Hello, it seems that the command to train a model on Path-X does not specify the value of --warmup-updates, and I can't find it in the paper either. Thank you for your help.

# Path-X
model=mega_lra_pf128
python -u train.py ${DATA} \
    --seed $seed --ddp-backend c10d --find-unused-parameters \
    -a ${model} --task lra-image --input-type image --pixel-normalization 0.5 0.5 \
    --encoder-layers 4 --n-dim 16 --chunk-size ${CHUNK} \
    --activation-fn 'silu' --attention-activation-fn 'laplace' \
    --norm-type 'syncbatchnorm' --sen-rep-type 'mp' --encoder-normalize-before \
    --criterion lra_cross_entropy --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --optimizer adam --lr 0.01 --adam-betas '(0.9, 0.98)' --adam-eps 1e-8 --clip-norm 1.0 \
    --dropout 0.0 --attention-dropout 0.0 --act-dropout 0.0 --weight-decay 0.01 \
    --batch-size 2 --sentence-avg --update-freq 8 --max-update 125000 \
    --lr-scheduler linear_decay --total-num-update 125000 --end-learning-rate 0.0 \
    --warmup-updates ${WARMUP} --warmup-init-lr '1e-07' --warmup-power 2 --keep-last-epochs 1 --max-sentences-valid 12 \
    --save-dir ${SAVE} --log-format simple --log-interval 100 --num-workers 0

Any plans to merge with `fairseq`?

Thank you for your work.

I see this is based on a much older version of fairseq.
Do you plan to merge the codebase into upstream fairseq so we can use its newer changes and features?

ONNX support

Hey, could you provide a script to convert the models to ONNX?

Fail for transformer_lra_pf32

Hello. I noticed that you replicated the Transformer (XFM) results on LRA, but I failed when training the Transformer on the Pathfinder dataset with your default settings: the loss was always 1 and the accuracy was always 50%. I've tuned the learning rate, weight decay, and batch size, but nothing worked. The Mega experiment on Pathfinder works well, and the Transformer on the other datasets also works well. Could you share the command you used to train the Transformer on Pathfinder?

Below is the command I used to train on 8 GPUs:

model=transformer_lra_pf32
python -u train.py ${DATA} \
    --seed $seed --ddp-backend c10d --find-unused-parameters \
    -a ${model} --task lra-image --input-type image --pixel-normalization 0.5 0.5 \
    --encoder-layers 1 --encoder-embed-dim 128 --encoder-ffn-embed-dim 128 --chunk-size ${CHUNK} \
    --activation-fn 'silu' --attention-activation-fn 'softmax' \
    --norm-type 'batchnorm' --sen-rep-type 'mp' --encoder-normalize-before \
    --criterion lra_cross_entropy --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --optimizer adam --lr 1e-4 --adam-betas '(0.9, 0.98)' --adam-eps 1e-8 --clip-norm 1.0 \
    --dropout 0.0 --attention-dropout 0.0 --act-dropout 0.0 --weight-decay 0.01 \
    --batch-size 8 --sentence-avg --update-freq 2 --max-update 250000 \
    --lr-scheduler linear_decay --total-num-update 250000 --end-learning-rate 0.0 \
    --warmup-updates 50000 --warmup-init-lr '1e-07' --keep-last-epochs 1 --max-sentences-valid 8 \
    --save-dir ${SAVE} --log-format simple --log-interval 100 --num-workers 0

Attention masking in MovingAverageGatedAttention

In moving_average_gated_attention.py, masking in the softmax method is done like this:

if attn_mask is not None:
    qk = qk + attn_mask

Is this correct? Surely this does not zero out relevant entries. Do we not want to multiply by the attention mask after application of the softmax?
