microsoft / torchscale Goto Github PK

Foundation Architecture for (M)LLMs

License: MIT License

Python 100.00%

computer-vision machine-learning multimodal natural-language-processing pretrained-language-model speech-processing transformer translation

torchscale's Introduction

TorchScale - A Library of Foundation Architectures

TorchScale is a PyTorch library that allows researchers and developers to scale up Transformers efficiently and effectively.

Fundamental research to develop new architectures for foundation models and A(G)I, focusing on modeling generality and capability, as well as training stability and efficiency.

Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
Capability - A Length-Extrapolatable Transformer
Efficiency - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

The Revolution of Model Architecture

BitNet: 1-bit Transformers for Large Language Models
RetNet: Retentive Network: A Successor to Transformer for Large Language Models
LongNet: Scaling Transformers to 1,000,000,000 Tokens

News

December, 2023: LongNet and LongViT released
October, 2023: Update RMSNorm and SwiGLU as the default module in RetNet
November, 2022: TorchScale 0.1.1 released [Paper] [PyPI]

Installation

To install:

pip install torchscale

Alternatively, you can develop it locally:

git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .

For faster training install Flash Attention for Turing, Ampere, Ada, or Hopper GPUs:

pip install flash-attn

or xFormers for Volta, Turing, Ampere, Ada, or Hopper GPUs:

# cuda 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# cuda 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

Getting Started

It takes only several lines of code to create a model with the above fundamental research features enabled. Here is how to quickly obtain a BERT-like encoder:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)

>>> print(model)

We also support the Decoder architecture and the EncoderDecoder architecture:

# Creating a decoder model
>>> from torchscale.architecture.config import DecoderConfig
>>> from torchscale.architecture.decoder import Decoder

>>> config = DecoderConfig(vocab_size=64000)
>>> decoder = Decoder(config)
>>> print(decoder)

# Creating a encoder-decoder model
>>> from torchscale.architecture.config import EncoderDecoderConfig
>>> from torchscale.architecture.encoder_decoder import EncoderDecoder

>>> config = EncoderDecoderConfig(vocab_size=64000)
>>> encdec = EncoderDecoder(config)
>>> print(encdec)

It takes only several lines of code to create a RetNet model:

# Creating a RetNet model
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder

>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)

>>> print(retnet)

For LongNet models (Flash Attention required):

>>> import torch
>>> from torchscale.architecture.config import EncoderConfig, DecoderConfig
>>> from torchscale.model.longnet import LongNetEncoder, LongNetDecoder

# Creating a LongNet encoder with the dilated pattern of segment_length=[2048,4096] and dilated_ratio=[1,2]
>>> config = EncoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)
>>> longnet = LongNetEncoder(config)

# Creating a LongNet decoder with the dilated pattern of segment_length=[2048,4096] and dilated_ratio=[1,2]
>>> config = DecoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)
>>> longnet = LongNetDecoder(config)

Key Features

DeepNorm to improve the training stability of Post-LayerNorm Transformers
- enabled by setting deepnorm=True in the Config class.
- It adjusts both the residual connection and the initialization method according to the model architecture (i.e., encoder, decoder, or encoder-decoder).
SubLN for the model generality and the training stability
- enabled by subln=True. This is enabled by default.
- It introduces another LayerNorm to each sublayer and adjusts the initialization according to the model architecture.
- Note that SubLN and DeepNorm cannot be used in one single model.
X-MoE: efficient and finetunable sparse MoE modeling
- enabled by use_xmoe=True.
- It replaces every 'moe_freq' FeedForwardNetwork layers with the X-MoE layers.
Multiway architecture for multimodality
- enabled by multiway=True.
- It provides a pool of Transformer's parameters used for different modalities.
Extrapolatable position embedding (Xpos)
- enabled by xpos_rel_pos=True.
Relative position bias
- enabled by adjusting rel_pos_buckets and max_rel_pos.
SparseClip: improving the gradient clipping for sparse MoE models
- we provide a sample code that can be easily adapted to the FairSeq (or other) repo.
Retentive Network: A Successor to Transformer for Large Language Models
- created by config = RetNetConfig(vocab_size=64000) and retnet = RetNetDecoder(config).
LongNet: Scaling Transformers to 1,000,000,000 Tokens

Most of the features above can be used by simply passing the corresponding parameters to the config. For example:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000, deepnorm=True, multiway=True)
>>> model = Encoder(config)

>>> print(model)

Examples

We have examples of how to use TorchScale in the following scenarios/tasks:

Language
Vision
- LongViT
- ViT/BEiT [In progress]
Speech
Multimodal
- Multiway Transformers/BEiT-3

We plan to provide more examples regarding different tasks (e.g. vision pretraining and speech recognition) and various deep learning toolkits (e.g. DeepSpeed and Megatron-LM). Any comments or PRs are welcome!

Acknowledgments

Some implementations in TorchScale are either adapted from or inspired by the FairSeq repository and the UniLM repository.

Citations

If you find this repository useful, please consider citing our work:

@article{torchscale,
  author    = {Shuming Ma and Hongyu Wang and Shaohan Huang and Wenhui Wang and Zewen Chi and Li Dong and Alon Benhaim and Barun Patra and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {{TorchScale}: {Transformers} at Scale},
  journal   = {CoRR},
  volume    = {abs/2211.13184},
  year      = {2022}
}

@article{deepnet,
  author    = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
  title     = {{DeepNet}: Scaling {Transformers} to 1,000 Layers},
  journal   = {CoRR},
  volume    = {abs/2203.00555},
  year      = {2022},
}

@article{magneto,
  author    = {Hongyu Wang and Shuming Ma and Shaohan Huang and Li Dong and Wenhui Wang and Zhiliang Peng and Yu Wu and Payal Bajaj and Saksham Singhal and Alon Benhaim and Barun Patra and Zhun Liu and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {Foundation {Transformers}},
  journal   = {CoRR},
  volume    = {abs/2210.06423},
  year      = {2022}
}

@inproceedings{xmoe,
  title={On the Representation Collapse of Sparse Mixture of Experts},
  author={Zewen Chi and Li Dong and Shaohan Huang and Damai Dai and Shuming Ma and Barun Patra and Saksham Singhal and Payal Bajaj and Xia Song and Xian-Ling Mao and Heyan Huang and Furu Wei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=mWaYC6CZf5}
}

@article{retnet,
  author={Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},
  title     = {Retentive Network: A Successor to {Transformer} for Large Language Models},
  journal   = {ArXiv},
  volume    = {abs/2307.08621},
  year      = {2023}
}

@article{longnet,
  author={Jiayu Ding and Shuming Ma and Li Dong and Xingxing Zhang and Shaohan Huang and Wenhui Wang and Nanning Zheng and Furu Wei},
  title     = {{LongNet}: Scaling Transformers to 1,000,000,000 Tokens},
  journal   = {ArXiv},
  volume    = {abs/2307.02486},
  year      = {2023}
}

@article{longvit,
  title     = {When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology},
  author    = {Wenhui Wang and Shuming Ma and Hanwen Xu and Naoto Usuyama and Jiayu Ding and Hoifung Poon and Furu Wei},
  journal   = {ArXiv},
  volume    = {abs/2312.03558},
  year      = {2023}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact Furu Wei and Shuming Ma with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party's policies.

torchscale's People

Contributors

Stargazers

Watchers

Forkers

ustcwhy jeonsworld ml-lab techthiyanes lipiji jingmouren dongcf lipovsek zmzlois yashguptatech anoop-qasolve nhsjgczryf jaydass jaedukseo iammojahid xiaolongguo jiwonseo1212 bekyilma talentsprint2011 veasnachroeun mbrukman mohamedkz sscavengerr hadryan kashif okhoa2-61 jonychoi artofnext nanderoo jrcribb richardsonjf chadkowski luanqx oceanssssssssssss rudymartin luisclark ssahgal lulu-cloud jrahme31 logicsense1 buaahsh chongli9230 sunyt32 dhockaday daphne97 matthewchang vishalsingh17 wenhui0924 kimsoohwan alirezabayatmk zhenyangiacas yz-liu yuezhou-oh yanyang1024 cc6dll whu-dft enockipp wdykas note-liu 00mjk ai-jie01 tingyingwu2010 hertera1 jimmieliu standardgalactic codeaudit hbcbh1999 winnerineast conceptofmind clearme777 fengzhilei kyegomez xiaochunlu usryokousha iff-0303 jinlmsft jonathanrayner njb-ms ethicalsecurity-agency xlggzzz hugo279 xingxingzhang smart-boot klae01 mcx maddyonline apollohuang1 ricklentz gridechelon ard-skelling clairdelunejs azure-arc-0 daniel-007 zyeric agoryuno darcstar-solutions-tech mitzen peytontolbert deeprnd aiworkspace

torchscale's Issues

Installer bug - wrong `apex` package installed

If you pip install torchscale then via requirements it also installs apex from pypi. However, apex on pypi is not Nvidia's apex, but is an unrelated project with many deps. As a result, many other additional deps are also pulled into the installation.

Nvidia's apex is currently not pip-installable, and therefore it should not be listed as a requirement.
cc @shumingma

RetNet : Check consistency of each forward mode

Hello authors,

I'm really happy to see this great work!
I have one question or request about the consistency of output from each forward mode.
I have been comparing three outputs by using below simple code.

import torch
from ret_net import MultiScaleRetention, RetNetRelPos

def test_msr():
    seq_len = 16
    dim = 16
    B = 3
    n_heads = 4
    chunk_size = 4
    x = torch.rand(B, seq_len, dim)
    x = torch.arange(B*seq_len*dim).view(B, seq_len, dim)
    x = x/x.max()
    
    xpos = RetNetRelPos(dim, n_heads, chunk_size)
    layer = MultiScaleRetention(dim, n_heads)
    
    # parallel
    output_p, _ = layer(x, xpos(seq_len, False, False), False, None)
    
    # recurrent
    output_r = []
    incremental_state = {}
    for idx in range(seq_len):
        rpos = xpos(idx+1, True, False)
        xi = x[:, idx, :].unsqueeze(1)
        out_r, incremental_state = layer(xi, rpos, False, incremental_state)
        output_r.append(out_r)
    output_r = torch.concat(output_r, dim=1)
    
    # chunkwise
    output_c, _ = layer(x, xpos(seq_len, False, True), True, None)
    
    check_diff('parallel  - recurrent', output_p, output_r)
    check_diff('parallel  - chunkwise', output_p, output_c)
    check_diff('recurrent - chunkwise', output_r, output_c)
    
def check_diff(name, A, B, eps=1e-6):
    D = A - B
    C = A/B
    print(name, torch.sum(torch.abs(D)))
    idx = torch.abs(D) < eps
    print(idx[0, :, 0])
    print()

And I got below result.

parallel  - recurrent tensor(6.4814e-07, grad_fn=<SumBackward0>)
tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True])

parallel  - chunkwise tensor(5.6081, grad_fn=<SumBackward0>)
tensor([ True,  True,  True,  True, False, False, False, False, False, False,
        False, False, False, False, False, False])

recurrent - chunkwise tensor(5.6081, grad_fn=<SumBackward0>)
tensor([ True,  True,  True,  True, False, False, False, False, False, False,
        False, False, False, False, False, False])

As you might see good agreement between parallel and recurrent results.
But the chunkwise output doesn't agree with both parallel and recurrent, after the 2nd chunk.
Could you give me a hint to understand this?

(I have already pulled the latest main branch)

Thanks a lot,
Masahiro Morinaga

`get_moe_group` 's return is None, when building `class MOELayer(Base)` , using one gpu

Hi,

I want to replace Transformer Encoder with X-MoE Encoder. Below is my configuration:

config = EncoderConfig(
      encoder_embed_dim = 500,
      encoder_layers = 4,
      use_xmoe = True,
      moe_freq = 1,
      moe_top1_expert = True,
      moe_expert_count = 10
  )
  module = Encoder(config)

I faced the below Error:

It is because that dist.is_initialized() = None

Thanks for your help~

AttributeError: 'EncoderConfig' object has no attribute 'decoder_layers'

Hi, I plan to reproduce the results of the WMT-17 translation task as presented in the deepnet paper. Could you please let me know what the command for running the script should be? For example, what should --arch be set to? According to the examples provided in the readme, should I run the following command?

cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
    ${PATH_TO_DATA} \
    --arch mt_base --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --max-tokens 4096 --fp16  --deepnorm

However, when I add --deepnorm to the command from example, it throws an error: AttributeError: 'EncoderConfig' object has no attribute 'decoder_layers'. Could you please advise on the correct command and settings to obtain results similar to Table 1 in the paper? Thank you!

Question on decay factor for attention with xPos

Hello!
I was truly impressed by the paper "A Length-Extrapolatable Transformer". I am most interested in training LLMs designed for large sequence length numbers. You have pointed out an issue of the RoPE that it tends to be unstable as relative distance between the two tokens gets bigger, which leads up to degradation of precision. For the purpose of regularization you have introduced a factor for a function of positional encoding:
$$g_ζ [n] = \sum_{i=0}^{d/2}\cos{nθ_i}ζ_i^n$$, where $ζ_i\in[0,1]$. Here, I have an unresolved question, for which I could not find the answer in the paper or in the code. The thing is that n can be negative, because n is actually some $\hat{m}-\hat{n}$, where $\hat{m}$ is position in Q, and $\hat{n}$ is one in K, so n runs thourgh the interval (-s, s). Because $ζ_i\in[0,1]$, for negative n and bigger relative distances the function $g_ζ [n]$ must grow unlimitedly. Surely, we cannot take the absolute value of n, because in that case we are discarding the important property that the positional embedding does not depend on the offset. Can you please explain how you go about this issue?
Thank you!

About the param `scale_base`

Any advice on the value setting of scale_base(code)? e.g. I want to train a gpt model with window size 2048, and expect it can be extended to 4096 or longer. Would the default value(512) bring too strong long-distance penalty?

Question about is_first_step and Retnet

In the code when is_first_step is True then activate_recurrent is set to False here:
https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/retnet.py#L362

I was wondering what the reason for this is?
Should one us is_first_step=False when using the recurrent mode of Retnet?

Longnet Code Release

Hello all, just wondering whether there is an ETA on the official release of the long net code, it was mentioned here microsoft/unilm#1182 (comment) that the long net code would be released as part of torchscale. Looking forward to the seeing the official implementation!

the meaning of "incremental_state" in RetNet

Hi there~,
Thanks for your great work RetNet. i have encountered a problem when I try to define "incremental_state".
Could you provide me some usage about it or explain more?
Thanks,
Best regards.

pip package does not contain RetNet

The latest release 0.2.0 is from March (see https://pypi.org/project/torchscale/#history), predating the introduction of RetNet in this repo. As such the README is misleading since it is not possible to create a RetNet model with a pip install.
When is the next release planned?

can not download dict.txt

when link https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt

This XML file does not appear to have any style information associated with it. The document tree is shown below.

PublicAccessNotPermitted
Public access is not permitted on this storage account. RequestId:6a16dad1-301e-000f-6855-c0dbf7000000 Time:2023-07-27T06:39:12.1140517Z

question about the number of output_projection

Hello!

I found that there is only one output_projection (nn.Linear(768, 64000)) for masked language modeling. However, as Beit-3 is a multimodal model, should there also be a output head for masked image modeling?

LEX inference support and checkpoint

Hello, thanks for your great work!

I was trying to benchmark your work on LEX, which I found in this fork.
However, that fork repo doesn't have issue feature so I'm posting my questions here.

I tried to test your BCA technique with the LLaMa models. So I implemented the BCA according to the commit I pasted above. However, my model failed to extrapolate when going beyond the block size. So I am wondering if you could provide one checkpoint of your LEX model such that I could test and compare with my codes to see where is the bug?

Thanks!

Retnet parameter dimension

I wonder why we need twice dimensions for $\mathbf{W}_V$

Does Torchscale support vision transformers in vision tasks?

Does the post layernorm and scaling in residual branch and initialization in DeepNet also support vision tasks, like ImageNet classification and mask image modeling?

Question about the recurrent forward of MultiScaleRetention

In the multiscale-retention`s recurrent forward, it looks like the incremental state is not being updated(not returned)[1].

[1]https://github.com/microsoft/torchscale/blob/258eda33083f6361e7305f2a5afd241e381826e1/torchscale/component/multiscale_retention.py#L117C18-L117C18

requires dict.txt and sentencepiece.bpe.model

Could you provide "dict.txt" and "sentencepiece.bpe.model" (i.e., files in doc ''Example: BERT Pretraining'') for further exploring pretraining BERT with deepnorm.

Fairseq version compatible with torchscale

Hi,

Thank you for your great work to develop Torchscale.

I have been trying to use your codebase but it seems like some modules such as "fairseq.models.squad" that were available during the development of Torchscale are not available anymore in the current version of Fairseq.

Can you let me know the version (or the commit) of fairseq you are using to make Torchscale work?

Thanks for your help,
Samy

embed_tokens

In the RetNet model, embed _ tokens is not given, I can 't run the code. When I use this model, what should the parameter token _ embeddings pass ? Or how do I define embed _ tokens ?

Typo in Paper

Missing space in paper. :)

recurrent_forward in MultiScaleRetention

For MultiScaleRetention, the forward function use input X: [B, tgt_len, embed_dim] to construct q: [B, tgt_len, embed_dim], k: [B, tgt_len, embed_dim], v: [B, tgt_len, factor*embed_dim] and g: [B, tgt_len, factor*embed_dim]. For v, if choose to use "recurrent_forward", in line 102 v = v.view(bsz, self.num_heads, self.head_dim, 1), since self.head_dim = self.embed_dim * self.factor // num_heads, it splits factor*embed_dim in to num_heads and head_dim, but where's tgt_len, the output shape [B, num_heads, head_dim, 1] is invalid for the input size [B, tgt_len, factor*embed_dim]

AttributeError: 'EncoderDecoderConfig' object has no attribute 'normalize_output'

I have get two errors:
（1）
from torchscale.architecture.config import EncoderDecoderConfig
from torchscale.architecture.encoder_decoder import EncoderDecoder

config = EncoderDecoderConfig(vocab_size=64000)
encdec = EncoderDecoder(config)
print(encdec)

Traceback (most recent call last):
AttributeError: 'EncoderDecoderConfig' object has no attribute 'normalize_output'

（2）
import torch
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder

config = RetNetConfig(vocab_size=64000)
retnet = RetNetDecoder(config)
print(retnet)

it shows that "Cannot find reference 'RetNetConfig' in 'config.py' " and "Cannot find reference 'retnet' in 'init.py' " and "Unresolved reference 'RetNetDecoder' "

so how I fix them?

[Question] what are the usages of multiway_network.py?

Dear torchscale developers & researchers,

Thank you for sharing the implementation of torchcale public.

I have a question regarding the multiway_network usage in torchscale. In BeitV3.py line 32, i found its the only place a multiway_wrapper is used and it return multiway_networks that split things into 2 and apply 2 modules (in the code example, it's position emb) to it. Does it mean multiway_network only supports splitting a feature to 2 and apply operations?

According to Feedforward_network.py line 55, we could potentially have many FFNs, then it is very likely that moe_counts > 2.

Then, how does a multiway_networks helpful to train a multiway transformer? I guess it should be able to provide number of copies that are equivalent to # of moe_counts, but not only 2.

I think I probably misunderstand some part of code. Could you provide some guidance or reference for me?

Thank you very much!

Is there some example of the paper? e.g., compare of the inference latency

Hi,
Thank you for your great work!
We are intersted in the ability of RetNet. However, when we look through this repository, we can't find the code correspond to the paper's experiments. For example, an example of generate a same length text via RetNet and Transformer-based LLM (such as Llama-7b) of similar size, to compare the Inference Latency, an example of long sequence inference, and so on.

So, can you provide some basic example code of train/inference, which compare the RetNet and the Transformer-based LLM, without the Fairseq?

Any response will be great helpful for us!

does torchscale functionalities can impove modeling generality and capability in case of Session-Based Recommendation system

i am currently working Session-Based Recommendation system problem, indeed this topic in NLP so quite under progress of improvement, and most of technics used to tackle that problem are Matrix Factorization or GNN recently invidia realized a library TransfomerSRC4,
i would like to ask if there's a way to use TorchScale functionalities to address the problem Session-Based Recommendation or not yet

Could not install fairseq

When I follow your instruction to install fairseq as follow
pip install git+https://github.com/shumingma/fairseq.git@moe
I meet the following errors. Do you have any clue about it?

Collecting git+https://github.com/shumingma/fairseq.git@moe
  Cloning https://github.com/shumingma/fairseq.git (to revision moe) to /tmp/pip-req-build-o2u6y6uv
  Running command git clone --quiet https://github.com/shumingma/fairseq.git /tmp/pip-req-build-o2u6y6uv
  Running command git checkout -b moe --track origin/moe
  Switched to a new branch 'moe'
  Branch moe set up to track remote branch moe from origin.
  Resolved https://github.com/shumingma/fairseq.git to commit 5605b44f973b573fa09ff9acad5b6826c9bb75e3
  Running command git submodule update --init --recursive -q
  fatal: reference is not a tree: adb23324c222aad0aad89308e70302d996a5eaeb
  Unable to checkout 'adb23324c222aad0aad89308e70302d996a5eaeb' in submodule path 'fairseq/model_parallel/megatron'
  error: subprocess-exited-with-error

  × git submodule update --init --recursive -q did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git submodule update --init --recursive -q did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Library issues

cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)
I can't install the package either in windows or in Linux systems.

Compatibility with torchsummary

Here are the versions of torchscale and torchsummary in my environment:

torchscale                    0.2.0
torchsummary                  1.5.1

I am using my custom embedding to achieve an Auto-Regression task so I wrap the torchscale.architecture.decoder.Decoder with the following code:

from torchscale.architecture.decoder import Decoder
from torchscale.architecture.config import DecoderConfig
from torchsummary import summary

class LatentDecoder(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        self.decoder = Decoder(**kwargs)
    
    def forward(self, x):
        y = self.decoder(prev_output_tokens=torch.zeros((1,1)),  # Not used when self_attn_relative_position is None
                         token_embeddings=x,
                         eatures_only=True)
        return y

dec_config = DecoderConfig(
    subln=True, # use sublayer normalization
    dropout=0.1,
    drop_path_rate=0.1,
    decoder_layers=6,
    decoder_embed_dim=1024,
    decoder_ffn_embed_dim=2048,
    decoder_attention_heads=8
)

decoder = LatentDecoder(args=dec_config, is_encoder_decoder=True)

When I tried to use torchsummary to get a summary of the model with these codes:

input_size = (16, 4, 1024)  # (batch_size, token_index, embedding_size)
summary(decoder, input_size=input_size)

I got error:

ValueError                                Traceback (most recent call last)
Cell In[18], line 27
     24 input_size = (16, 4, 1024)  # (batch_size, token_index, embedding_size)
     25 decoder = LatentDecoder(args=dec_config, is_encoder_decoder=True)
---> 27 summary(decoder, input_size=input_size)

File ~/.local/lib/python3.11/site-packages/torchsummary/torchsummary.py:72, in summary(model, input_size, batch_size, device)
     68 model.apply(register_hook)
     70 # make a forward pass
     71 # print(x.shape)
---> 72 model(*x)
     74 # remove these hooks
     75 for h in hooks:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

Cell In[18], line 11, in LatentDecoder.forward(self, x)
     10 def forward(self, x):
---> 11     y = self.decoder(prev_output_tokens=torch.zeros((1,1)), token_embeddings=x, features_only=True)
     12     return y

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1538, in Module._call_impl(self, *args, **kwargs)
   1535     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1536     args = bw_hook.setup_input_hook(args)
-> 1538 result = forward_call(*args, **kwargs)
   1539 if _global_forward_hooks or self._forward_hooks:
   1540     for hook_id, hook in (
   1541         *_global_forward_hooks.items(),
   1542         *self._forward_hooks.items(),
   1543     ):

File ~/.local/lib/python3.11/site-packages/torchscale/architecture/decoder.py:437, in Decoder.forward(self, prev_output_tokens, self_attn_padding_mask, encoder_out, incremental_state, features_only, return_all_hiddens, token_embeddings, **kwargs)
    434     if idx not in incremental_state:
    435         incremental_state[idx] = {}
--> 437 x, layer_attn, _, l_aux_i = layer(
    438     x,
    439     encoder_out["encoder_out"] if encoder_out is not None else None,
    440     encoder_out["encoder_padding_mask"]
    441     if encoder_out is not None
    442     else None,
    443     incremental_state[idx] if incremental_state is not None else None,
    444     self_attn_mask=self_attn_mask,
    445     self_attn_padding_mask=self_attn_padding_mask,
    446     self_attn_rel_pos=self_attn_rel_pos_bias,
    447     cross_attn_rel_pos=cross_attn_rel_pos_bias,
    448 )
    449 l_aux.append(l_aux_i)
    450 inner_states.append(x)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1538, in Module._call_impl(self, *args, **kwargs)
   1535     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1536     args = bw_hook.setup_input_hook(args)
-> 1538 result = forward_call(*args, **kwargs)
   1539 if _global_forward_hooks or self._forward_hooks:
   1540     for hook_id, hook in (
   1541         *_global_forward_hooks.items(),
   1542         *self._forward_hooks.items(),
   1543     ):

File ~/.local/lib/python3.11/site-packages/torchscale/architecture/decoder.py:148, in DecoderLayer.forward(self, x, encoder_out, encoder_padding_mask, incremental_state, self_attn_mask, self_attn_padding_mask, self_attn_rel_pos, cross_attn_rel_pos)
    145 if self.normalize_before:
    146     x = self.self_attn_layer_norm(x)
--> 148 x, attn = self.self_attn(
    149     query=x,
    150     key=x,
    151     value=x,
    152     key_padding_mask=self_attn_padding_mask,
    153     incremental_state=incremental_state,
    154     attn_mask=self_attn_mask,
    155     rel_pos=self_attn_rel_pos,
    156 )
    157 x = self.dropout_module(x)
    159 if self.drop_path is not None:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1538, in Module._call_impl(self, *args, **kwargs)
   1535     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1536     args = bw_hook.setup_input_hook(args)
-> 1538 result = forward_call(*args, **kwargs)
   1539 if _global_forward_hooks or self._forward_hooks:
   1540     for hook_id, hook in (
   1541         *_global_forward_hooks.items(),
   1542         *self._forward_hooks.items(),
   1543     ):

File ~/.local/lib/python3.11/site-packages/torchscale/component/multihead_attention.py:75, in MultiheadAttention.forward(self, query, key, value, incremental_state, key_padding_mask, attn_mask, rel_pos)
     65 def forward(
     66     self,
     67     query,
   (...)
     73     rel_pos=None,
     74 ):
---> 75     bsz, tgt_len, embed_dim = query.size()
     76     src_len = tgt_len
     77     assert embed_dim == self.embed_dim, f"query dim {embed_dim} != {self.embed_dim}"

ValueError: too many values to unpack (expected 3)

Then I tried to omit the batch_size dim:

input_size = (4, 1024)  # (token_index, embedding_size)

I then had another error:

IndexError                                Traceback (most recent call last)
Cell In[19], line 27
     24 input_size = (4, 1024)  # (token_index, embedding_size)
     25 decoder = LatentDecoder(args=dec_config, is_encoder_decoder=True)
---> 27 summary(decoder, input_size=input_size)

File ~/.local/lib/python3.11/site-packages/torchsummary/torchsummary.py:72, in summary(model, input_size, batch_size, device)
     68 model.apply(register_hook)
     70 # make a forward pass
     71 # print(x.shape)
---> 72 model(*x)
     74 # remove these hooks
     75 for h in hooks:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

Cell In[19], line 11, in LatentDecoder.forward(self, x)
     10 def forward(self, x):
---> 11     y = self.decoder(prev_output_tokens=torch.zeros((1,1)), token_embeddings=x, features_only=True)
     12     return y

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1538, in Module._call_impl(self, *args, **kwargs)
   1535     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1536     args = bw_hook.setup_input_hook(args)
-> 1538 result = forward_call(*args, **kwargs)
   1539 if _global_forward_hooks or self._forward_hooks:
   1540     for hook_id, hook in (
   1541         *_global_forward_hooks.items(),
   1542         *self._forward_hooks.items(),
   1543     ):

File ~/.local/lib/python3.11/site-packages/torchscale/architecture/decoder.py:437, in Decoder.forward(self, prev_output_tokens, self_attn_padding_mask, encoder_out, incremental_state, features_only, return_all_hiddens, token_embeddings, **kwargs)
    434     if idx not in incremental_state:
    435         incremental_state[idx] = {}
--> 437 x, layer_attn, _, l_aux_i = layer(
    438     x,
    439     encoder_out["encoder_out"] if encoder_out is not None else None,
    440     encoder_out["encoder_padding_mask"]
    441     if encoder_out is not None
    442     else None,
    443     incremental_state[idx] if incremental_state is not None else None,
    444     self_attn_mask=self_attn_mask,
    445     self_attn_padding_mask=self_attn_padding_mask,
    446     self_attn_rel_pos=self_attn_rel_pos_bias,
    447     cross_attn_rel_pos=cross_attn_rel_pos_bias,
    448 )
    449 l_aux.append(l_aux_i)
    450 inner_states.append(x)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1538, in Module._call_impl(self, *args, **kwargs)
   1535     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1536     args = bw_hook.setup_input_hook(args)
-> 1538 result = forward_call(*args, **kwargs)
   1539 if _global_forward_hooks or self._forward_hooks:
   1540     for hook_id, hook in (
   1541         *_global_forward_hooks.items(),
   1542         *self._forward_hooks.items(),
   1543     ):

File ~/.local/lib/python3.11/site-packages/torchscale/architecture/decoder.py:148, in DecoderLayer.forward(self, x, encoder_out, encoder_padding_mask, incremental_state, self_attn_mask, self_attn_padding_mask, self_attn_rel_pos, cross_attn_rel_pos)
    145 if self.normalize_before:
    146     x = self.self_attn_layer_norm(x)
--> 148 x, attn = self.self_attn(
    149     query=x,
    150     key=x,
    151     value=x,
    152     key_padding_mask=self_attn_padding_mask,
    153     incremental_state=incremental_state,
    154     attn_mask=self_attn_mask,
    155     rel_pos=self_attn_rel_pos,
    156 )
    157 x = self.dropout_module(x)
    159 if self.drop_path is not None:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1547, in Module._call_impl(self, *args, **kwargs)
   1545     hook_result = hook(self, args, kwargs, result)
   1546 else:
-> 1547     hook_result = hook(self, args, result)
   1549 if hook_result is not None:
   1550     result = hook_result

File ~/.local/lib/python3.11/site-packages/torchsummary/torchsummary.py:19, in summary.<locals>.register_hook.<locals>.hook(module, input, output)
     17 m_key = "%s-%i" % (class_name, module_idx + 1)
     18 summary[m_key] = OrderedDict()
---> 19 summary[m_key]["input_shape"] = list(input[0].size())
     20 summary[m_key]["input_shape"][0] = batch_size
     21 if isinstance(output, (list, tuple)):

IndexError: tuple index out of range

Why it happens and how should I do to fix it?

Thanks!

testing very large attention windows

This is kind of a simple-minded question, but what do I do if I want to see for myself that I can process a huge attention window using torchscale? Ideally, I'd simply like to be able to run a single script or function that shows that, yes, it works, say with summarization of a large corpus of books.

RetNet: relative position

I believe there is a difference in relative position implemented here, and what is described in the paper. The issue I see is in theta_shift and rotate_every_two

def rotate_every_two(x):
    x1 = x[:, :, :, ::2]
    x2 = x[:, :, :, 1::2]
    x = torch.stack((-x2, x1), dim=-1)
    return x.flatten(-2)  # in einsum notation: rearrange(x, '... d j -> ... (d j)')\

# ...

def theta_shift(x, sin, cos):
    return (x * cos) + (rotate_every_two(x) * sin)

You can see here that theta_shift is applied to q and k, which have input shape (bsz, self.num_heads, tgt_len, self.key_dim) (after transpose).

class MultiScaleRetention(nn.Module):
    # ...
    def forward(
        self,
        x,
        rel_pos,
        chunkwise_recurrent=False,
        incremental_state=None
    ):
        bsz, tgt_len, _ = x.size()
        (sin, cos), inner_mask = rel_pos

        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        g = self.g_proj(x)

        k *= self.scaling
        q = q.view(bsz, tgt_len, self.num_heads, self.key_dim).transpose(1, 2)
        k = k.view(bsz, tgt_len, self.num_heads, self.key_dim).transpose(1, 2)

        qr = theta_shift(q, sin, cos)
        kr = theta_shift(k, sin, cos)

Why does rotate_every_two shuffle elements along the key_dim axis? This is not what was described in the paper (Equations 3, 4)

Relative position embedding should depend only on the sequence position (m, n) and theta parameters. For that reason, I wonder if rotate_every_two is a bug?

EncoderDecoder Configuration Issue

I've identified a problem in the EncoderDecoderConfig class within the architecture module of the torchscale package.

The EncoderDecoderConfig class currently does not contain the normalize_output element. This missing element is causing some functionality of the package to not work as expected. Specifically, when the EncoderDecoder class is used with a EncoderDecoderConfig object, the lack of the normalize_output element can lead to incorrect behavior.

I recommend adding the normalize_output element to the EncoderDecoderConfig class. I believe that this change will resolve the issue and make the EncoderDecoder class function as expected.

Furthermore, I have added a new GitHub Action in pull request #31 to help prevent issues like this in the future.

Retnet training is slow

Hi, when I use retnet's parallel mode to train, it's very slow, I observe the gou memory usage, it's very small, what's going on?
Thank you!

Tasks

Beta Give feedback

No tasks being tracked yet.

Options

some result plots are not show

The plots in two sections, Stability Evaluation / Scaling-up Experiments, are not showing up. Is there any server broken? How about uploading the result to the repository?

Could you please explain the reason behind defining TEMPERATURE_FOR_L_UAX in the code without actually using it?

As the title says, there was only one result.

retnet traning config

Hello,

I have followed the training configuration introduced here (#52) with retnet_medium architecture. I have some questions that I would appreciate if anyone could answer them.

The first is about the initialization. From the RETNET paper https://arxiv.org/abs/2307.08621, I saw that parameters were initialized following deepnet. So I am wondering why in the RetNetConfig it is set to False, and where should I set it True? (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L239)

If I simply add "--deepnorm" in command line, this will be activated together with subln (https://github.com/microsoft/torchscale/blob/main/torchscale/architecture/config.py#L240), then I found the output of each layers getting larger and larger with the layer id increasing.

The second is about the vocabulary. I am newer to fairseq so I am not sure how to deal with a large dataset via fairseq_preprocess. I am trying to use MINIPILE while the dict.txt has 32309612 lines. It seems too large so I am wondering if there is some official recommendation for this part.

The third is about --share-decoder-input-output-embed, Is it recommended? I am sorry if I missed in paper.

Thank you guys in advance:)

Inconsist recurrent and parallel results for RetNet

It seems the recurrent and parallel forward results are quite inconsistent for multiscale retention in the RetNet code. By debugging around for a while, it seems these three lines are quite weird.

A: mask = mask / mask.sum(dim=-1, keepdim=True).sqrt() line 64 in the retnet.py
B: kv = prev_kv * (1 - 1 / scale).view(self.num_heads, 1, 1) + kv / scale.view(self.num_heads, 1, 1) line 108 in the multiscale_retention.py
C: # kv = prev_kv * decay.view(self.num_heads, 1, 1) + kv line 109 in the multiscale_retention.py

If I remove A and B and uncomment C, the recurrent and parallel results become the same. Can you give me some explanation why these are used? Thanks!

There're a confusion in torchscale

When using from torchscale.architecture.config import RetNetConfig , I got a error.

      from torchscale.architecture.config import RetNetConfig
ImportError: cannot import name 'RetNetConfig' from 'torchscale.architecture.config'

But I can import Decoder and EncoderDecoder. help!

scale.sqrt() in the recurrent_forward function of the multiscale_retention module

kv = prev_kv * (1 - 1 / scale).view(self.num_heads, 1, 1) + kv / scale.view(self.num_heads, 1, 1) line 108 in the multiscale_retention.py
should be
kv = prev_kv * (prev_scale.sqrt() * decay / scale.sqrt()).view(self.num_heads, 1, 1) + kv / scale.sqrt().view(self.num_heads, 1, 1)
because line 65 of retnet.py has the sqrt function
mask = mask / mask.sum(dim=-1, keepdim=True).sqrt()

Can't pickle

Hello! I'm running into the following pickling errors and thus Pytorch lightning is unable to checkpoint the model I believe:

import pickle

from torchscale.architecture.encoder import Encoder
from torchscale.architecture.config import EncoderConfig

enc = EncoderConfig()
pickle.dumps(Encoder(enc))

AttributeError: Can't pickle local object 'get_activation_fn.<locals>.<lambda>'

Which is coming from https://github.com/microsoft/torchscale/blob/main/torchscale/component/feedforward_network.py#L88

SMOE or XMOE Network how to "evaluate" and "save and resume"

Hello there,

i'm interested in using the XMOE network and I have some questions regarding how to evaluate its performance on a validation set, how to save checkpoints, and how to resume training from a saved checkpoint.

Evaluation on Validation Set:
Could you please provide some guidance on how to evaluate the XMOE network on a validation set? Also, I'm using the Distributed Data Parallel (DDP) mode, and I'm wondering whether I need to evaluate the XMOE network on all devices or only one device?

Saving Checkpoints:
How can I save the XMOE model's checkpoints during training? What's the recommended way of doing this? Given that each GPU has its own experts and shared parameters, should I save all the parameters on each device or is there an API that can centralize the parameters and save them to avoid redundancy?

Resuming Training from a Saved Checkpoint:
How can I resume training the XMOE model from a saved checkpoint? What's the recommended way of doing this? Is there any specific API or command I should use?

Thank you in advance for your help. I'm looking forward to using the XMOE network in my projects.

xPos cross-attention change

Hey, I noticed compared to the old implementation at https://github.com/sunyt32/torchscale, xPos is no longer used for cross-attention between decoder inputs and encoder outputs. In the old implementation, scaling was simply inverted for that case.

Could you help me out on understanding why the change toward not using xPos (or any positional encoding, as a matter of fact) for cross-attention happened? Does this produce better results than noted in the LeX/xPos paper?

@shumingma @sunyt32

Training & Inference examples for RetNet

Could you provide some Training & Inference examples for RetNet?

About running speed

Thanks for your excellent work!
I have mentioned that torchscale serially executes the operation of mapping x to q, k, and v, in line 84~86 in file torchscale/component/multihead_attention.py. Will this be slower in your approach compared to doing it in parallel? For example, self.qkv_proj=nn.Linear(embed_dim, 3 * embed_dim)

Multi-Scale Retention: Why include position embeddings explicitly?

My question is about the RetNet paper, which leads to the implementation here...

Why include the positional embedding updates directly in the multi-scale retention layer, rather than just applying them to the RetNet inputs?

IMO, this seems overly specific to the language modeling use case. Other applications of retention/attention should be free to use whatever positional embeddings they need/want.

The retention formulation is still self-consistent (i.e. equivalent for parallel, recurrent, chunkwise) without explicitly including positional embeddings in the retention layer. See Equations (1) and (2):

Instead of forcing positional embeddings into the retention formulation, we can just set A equal to the decay matrix D. The parallel/recurrent/chunkwise formulations are still equivalent, and we remove the hard-coded dependence on xPos embeddings in the retention layer.

Conceptually, I'm thinking of how to apply RetNet to other data domains (images, heterogeneous graphs, etc). In those cases, the xPos embeddings are not reflective of the actual position in the sequence (2D position in image, generic position within a graph, etc). Does it make sense to remove the explicit position embedding from the retention layer, or am I missing something?

"sentencepiece.bpe.model" and "dict.txt" in page below seem not available

https://github.com/microsoft/torchscale/blob/main/examples/fairseq/README.md#example-bert-pretraining

RuntimeError: The size of tensor a (5) must match the size of tensor b (2) at non-singleton dimension 0

python train.py \
/home/sc0111/ai/torchscale/wikitext-103/wikitextdone \
--num-workers 4 \
--arch retnet_base \
--task language_modeling \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--max-update 5000 \
--max-tokens 1024

python interactive.py \
/home/sc0111/ai/torchscale/wikitext-103/wikitextdone \
--num-workers 2 \
--path /home/sc0111/ai/torchscale/examples/fairseq/checkpoints/checkpoint_best.pt \
--task language_modeling \
--buffer-size 1024 \
--max-tokens 1024 \
--device-id 0

2023-10-12 19:52:52 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False}, 'common_eval': {'_name': None, 'path': '/home/sc0111/ai/torchscale/examples/fairseq/checkpoints/checkpoint_best.pt', 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': False, 'distributed_num_procs': 1}, 'dataset': {'_name': None, 'num_workers': 1, 'num_workers_valid': 0, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 1024, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': 1024, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_best_checkpoints': False, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'symlink_best_and_last_checkpoints': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': False, 's3_upload_path': None, 'model_parallel_size': 1}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'stats_path': None, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 1, 'input': '-'}, 'model': None, 'task': {'_name': 'language_modeling', 'data': '/home/sc0111/ai/torchscale/wikitext-103/wikitextdone', 'sample_break_mode': 'none', 'tokens_per_sample': 1024, 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'add_bos_token': False, 'max_source_positions': None, 'max_target_positions': None, 'shorten_method': 'none', 'shorten_data_split_list': '', 'pad_to_fixed_length': False, 'pad_to_fixed_bsz': False, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'dataset_impl': None, 'data_buffer_size': 10, 'tpu': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2023-10-12 19:52:52 | INFO | fairseq.tasks.language_modeling | dictionary: 267744 types
2023-10-12 19:52:52 | INFO | fairseq_cli.interactive | loading model(s) from /home/sc0111/ai/torchscale/examples/fairseq/checkpoints/checkpoint_best.pt
2023-10-12 19:52:52 | INFO | fairseq.checkpoint_utils | load_model_ensemble_and_task is_moe=False
2023-10-12 19:53:02 | INFO | fairseq_cli.interactive | NOTE: hypothesis and token scores are output in base 2
2023-10-12 19:53:02 | INFO | fairseq_cli.interactive | Type the input sentence and press return:
hello?

Traceback (most recent call last):
  File "/home/sc0111/ai/torchscale/examples/fairseq/interactive.py", line 11, in <module>
    cli_main()
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq_cli/interactive.py", line 312, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 376, in call_main
    main(cfg, **kwargs)
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq_cli/interactive.py", line 227, in main
    translations = task.inference_step(
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq/tasks/language_modeling.py", line 335, in inference_step
    return generator.generate(
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq/sequence_generator.py", line 182, in generate
    return self._generate(sample, **kwargs)
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq/sequence_generator.py", line 321, in _generate
    lprobs, avg_attn_scores = self.model.forward_decoder(
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/fairseq/sequence_generator.py", line 775, in forward_decoder
    decoder_out = model.decoder.forward(
  File "/home/sc0111/ai/torchscale/examples/fairseq/models/retnet.py", line 251, in forward
    return super().forward(src_tokens, **kwargs)
  File "/home/sc0111/ai/torchscale/torchscale/architecture/retnet.py", line 366, in forward
    x, l_aux_i = layer(
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sc0111/ai/torchscale/torchscale/architecture/retnet.py", line 165, in forward
    x = self.retention(
  File "/home/sc0111/.pyenv/versions/ai/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sc0111/ai/torchscale/torchscale/component/multiscale_retention.py", line 190, in forward
    output = self.recurrent_forward(qr, kr, v, inner_mask, incremental_state)
  File "/home/sc0111/ai/torchscale/torchscale/component/multiscale_retention.py", line 102, in recurrent_forward
    scale = prev_scale * decay + 1
RuntimeError: The size of tensor a (5) must match the size of tensor b (2) at non-singleton dimension 0

Q) Tensor parallel for magneto

When magneto is applied, it is hard to apply tensor parallel(TP). Gathering tensors in prev of subln and scattering after subln cause so much communication cost. Do you have any code or idea how to solve it?

Questions about the implementation of deepnorm

I have a doubt about deepnorm. In the paper, deepnorm_init function use xavier_normal_(x, gain=beta) for "ffn" "v_proj" "out_proj".

However, in the source code of torhscale use xavier_normal_(x, gain=1)/ beta:

        for name, p in self.named_parameters():
            if (
                "fc1" in name
                or "fc2" in name
                or "out_proj" in name
                or "v_proj" in name
            ):
                p.data.mul_(init_scale)

`
Although i know that X ~ N(0,std^2), aX ~ N(0,(a*std)^2), I plot the distribution of both methods using a histogram，the results show some differences between the two methods:

import torch
import matplotlib.pyplot as plt
from torch.nn.init import xavier_normal_
torch.manual_seed(1)

init_scale = 0.343
linear1 = torch.nn.Linear(4096, 512)  # 1  xavier_norm_(x, gain=beta)
linear2 = torch.nn.Linear(4096, 512) # 2 xavier_norm_(x, gain=1) / beta
xavier_normal_(linear1.weight,gain=init_scale)
xavier_normal_(linear2.weight,gain=1)

linear1_weight = linear1.weight.detach().numpy().reshape((-1, ))
linear2_weight = linear2.weight.detach().numpy().reshape((-1, )) / init_scale
plt.figure(figsize=(10, 6))
temp = plt.hist([linear1_weight, linear2_weight], bins=100, rwidth=0.8, histtype="step")
plt.xlabel("value")
plt.ylabel("count")
plt.legend({"1 xavier_norm_(x, gain=beta)", "2 xavier_norm_(x, gain=1)/beta"})

plt.show()

Is my implementation wrong? Which method should I use? I hope someone can enlighten me, thank you！！！

Can Torchscale be applied in point cloud tasks?

Thank you very much for the work, can Torchscale be applied in point cloud tasks？

initialization of qkv

In the paper, the authors mentioned that the initialization followed DeepNet but from the code, it's kind of different. Why is there a mismatch?

def reset_parameters(self):
    nn.init.xavier_uniform_(self.q_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.k_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.v_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.g_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.out_proj.weight)
    nn.init.constant_(self.out_proj.bias, 0.0)