alibaba / megatron-llama

This project forked from nvidia/megatron-lm


Best practice for training LLaMA models in Megatron-LM

License: Other

Shell 1.32% C++ 4.58% Python 93.34% C 0.15% Makefile 0.01% HTML 0.13% Cuda 0.45%
deepspeed distributed-training llama llm megatron-lm pretraining pytorch

megatron-llama's Introduction

Megatron-LLaMA: Easy, Fast and Affordable Training of Your Own LLaMA

As is known to all, LLaMA has become one of the greatest works in the open-source community of large language models (LLMs). LLaMA incorporates optimization techniques such as BPE-based tokenization, pre-normalization, rotary embeddings, the SwiGLU activation function, RMSNorm, and untied embeddings. We have witnessed LLaMA's outstanding results in both objective and subjective evaluations. LLaMA is available in 7B, 13B, 30B, and 65B/70B model sizes. In the open-source community, there have been many successful variants built on LLaMA via continued training / supervised fine-tuning (such as Alpaca, Vicuna, WizardLM, Platypus, Minotaur, Orca, OpenBuddy, Linly, Ziya) and via training from scratch (Baichuan, QWen, InternLM, OpenLLaMA). These works further demonstrate LLaMA's prominent capabilities in tasks such as long-context comprehension, long-context generation, code writing, mathematical problem solving, tool usage, etc.

However, it is often very expensive for developers to try out their own designs on LLaMA, since training or fine-tuning one's own LLM requires powerful computational resources. Typically, GPUs with large memory, or distributed clusters composed of multi-GPU machines, are essential for training LLMs. Megatron-LM is a distributed training solution that integrates tensor parallelism (TP), pipeline parallelism (PP), and sequence parallelism (SP). When training models with tens or hundreds of billions of parameters, it takes full advantage of the hardware, reaching a resource utilization far beyond the publicly available LLaMA implementations (based on Huggingface and DeepSpeed). Nevertheless, native Megatron-LM suffers from the communication bottleneck of its distributed optimizer when training at an extremely large scale.

Therefore, to facilitate the training of LLaMA-based models and reduce the cost of the hardware resources involved, Alibaba has decided to release its internally optimized Megatron-LLaMA training framework to the community. Megatron-LLaMA makes the following contributions:

(i) A standard implementation of LLaMA in Megatron-LLaMA: It is easy to obtain the LLaMA code from Huggingface, but that code does not support the various parallelization methods provided by Megatron-LM. Megatron-LLaMA offers a standard implementation of LLaMA in Megatron-LM, allowing developers to configure the optimization techniques on demand. We will continue to release features such as Alibi and FlashAttention2 in the future.

(ii) Efficient communication-computation parallelism: Similar to DeepSpeed ZeRO Stage 2, Megatron-LM implements a DistributedOptimizer that partitions the gradients and optimizer states, significantly reducing GPU memory usage. However, the solution provided by Megatron-LM does not fully overlap GPU computation with communication, resulting in under-utilization of hardware resources. Building upon the original DistributedOptimizer and ZeRO Stage 2, Megatron-LLaMA proposes a novel approach to sharding the gradients and optimizer states, achieving the following benefits without compromising precision: a) extremely high parallelism between communication and computation; b) highly efficient use of communication bandwidth; c) lower GPU memory usage. Consequently, Megatron-LLaMA enables higher training throughput than vanilla Megatron-LM on the same hardware configuration.

(iii) Utilities: Megatron-LLaMA supplements several utilities and improves the checkpoint mechanism of Megatron-LM, including: a) distributed checkpoint saving/restoring for speedup, together with abstract filesystem interfaces for easily integrating distributed file systems such as HDFS; b) a convenient interface for converting weights from/to the HuggingFace format, facilitating delivery to downstream tasks after pretraining; c) support for tokenizers from the HuggingFace transformers library.

Megatron-LLaMA makes large-scale training of LLaMA models fast, affordable and scalable.

Efficiency and Affordability: The Megatron-LM techniques make LLaMA training fast and affordable. Suppose we train our own LLaMA-13B model on four 8xA100-80GB machines. The following table compares the training cost and model TFLOPS of the DeepSpeed (HF) implementation and the Megatron-LLaMA implementation of LLaMA. According to Azure pricing, Megatron-LLaMA saves $1037 compared to DeepSpeed when consuming 10 billion tokens.

                        DeepSpeed (HF)        Megatron-LLaMA
Training cost           49.7 hours ($5482)    40.3 hours ($4445)
Training Model TFLOPS   146                   180

*The global batch size is set to 2048 via gradient accumulation (GA).

*We enable FlashAttention in the HF/DeepSpeed implementation.

Excellent Scalability: The OverlappedDistributedOptimizer in Megatron-LLaMA achieves high parallelism between computation and communication regardless of the number of gradient accumulation steps. The following table shows the average tokens per second per GPU when we try to reproduce LLaMA training (on 8xA100-80GB machines with 4x200Gbps RDMA interconnect between nodes). On this metric, Megatron-LLaMA with OverlappedDistributedOptimizer reaches a scaling ratio of 0.85 when scaling from 32 GPUs to 512 GPUs, while Megatron-LLaMA with DistributedOptimizer only achieves around 0.7.

                                                     256xA100 80GB      512xA100 80GB
Megatron-LLaMA with OverlappedDistributedOptimizer   1890 (23.9 days)   1845 (12.2 days)
Megatron-LLaMA with DistributedOptimizer             1630 (27.8 days)   1430 (15.8 days)

OverlappedDistributedOptimizer

In vanilla Megatron-LM, users can leverage DistributedOptimizer to partition the gradients and optimizer states and thus reduce GPU memory occupation. After all gradients have been accumulated during GA, DistributedOptimizer performs a ReduceScatter operation to scatter the reduced gradients to the corresponding ranks. Each rank then updates its local parameters and collects the remaining parameters from all other ranks through an AllGather operation. However, we observe significant communication overhead under small GA settings (over 50% of the total time when GA is disabled).
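To make this flow concrete, here is a minimal sketch of the vanilla DistributedOptimizer step using the public torch.distributed collectives. It is illustrative only and not the actual Megatron-LM code; step_on_shard is a hypothetical helper standing in for the sharded optimizer update, and the flat buffers are assumed to be evenly divisible by the data-parallel world size.

    import torch
    import torch.distributed as dist

    def distributed_optimizer_step(flat_grad, flat_param, shard_optimizer, dp_group):
        """Illustrative sketch: one ReduceScatter after gradient accumulation,
        a local update on the owned shard, then an AllGather to rebuild the
        full parameters."""
        world = dist.get_world_size(dp_group)
        rank = dist.get_rank(dp_group)
        shard_numel = flat_grad.numel() // world  # assumes even divisibility

        # Scatter-reduce the accumulated gradients: each rank receives the
        # fully reduced gradients of the shard it owns.
        grad_shard = torch.empty(shard_numel, dtype=flat_grad.dtype, device=flat_grad.device)
        dist.reduce_scatter_tensor(grad_shard, flat_grad, group=dp_group)

        # Update only the locally owned parameter shard.
        param_shard = flat_param.narrow(0, rank * shard_numel, shard_numel)
        shard_optimizer.step_on_shard(param_shard, grad_shard)  # hypothetical helper

        # Re-assemble the full parameter buffer on every rank.
        dist.all_gather_into_tensor(flat_param, param_shard, group=dp_group)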

To mitigate this overhead, we tried to overlap the collective communication with computation following the partition strategy of DeepSpeed ZeRO Stage 2. However, this strategy fails to scale: at large scale it issues too many small Reduce operations, which under-utilize the interconnect bandwidth.

We abstract the above problem into two aspects:

  1. Finding room to overlap communication with computation logically.
  2. Implementing a partition strategy that fully utilizes this room and the interconnect bandwidth, without introducing overhead in terms of communication volume.

We therefore propose OverlappedDistributedOptimizer, which adopts a novel partition strategy for gradients and optimizer states. Its design principles are summarized as follows:

  • Common optimizers such as Adam and SGD update each value in the parameters independently, so there is no need to keep each parameter as a whole.
  • Any single collective communication operation should commit a sufficient amount of data to make full use of the communication bandwidth.
  • No extra communication volume or GPU memory copy should be involved.

Brief introduction to OverlappedDistributedOptimizer

Figure 1. The partition strategy in Megatron-LLaMA

As shown in Figure 1, all parameters are assigned to their respective Buckets during the initialization of OverlappedDistributedOptimizer. Each Bucket contains complete model parameters, and each parameter belongs to exactly one Bucket. Conceptually, each Bucket is divided equally into P shards, where P is the number of ranks in the data-parallel group, and each rank is responsible for one shard. The Buckets are placed in a local queue (the local grad bucket queue) to guarantee the communication order. During training, the data-parallel groups exchange the required gradients at the Bucket level through collective communication.
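As a rough illustration of this partition strategy (a sketch, not the actual Megatron-LLaMA code), the hypothetical helper below groups whole parameters into Buckets of approximately --reduce-bucket-size elements; each Bucket is later flattened and conceptually split into P equal shards, one per data-parallel rank.

    def build_buckets(params, bucket_numel, dp_world_size):
        """Group whole parameters into Buckets of roughly `bucket_numel` elements.
        Each parameter lands in exactly one Bucket; rank r conceptually owns the
        r-th of dp_world_size equal shards of every flattened Bucket."""
        buckets, current, current_numel = [], [], 0
        for p in params:
            current.append(p)
            current_numel += p.numel()
            if current_numel >= bucket_numel:
                buckets.append(current)
                current, current_numel = [], 0
        if current:
            buckets.append(current)
        return buckets  # processed in order via the local grad bucket queue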

Figure 2. The communication mechanism in Megatron-LLaMA

OverlappedDistributedOptimizer incorporates an efficient communication mechanism over the Buckets. It initializes a local buffer called PartitionedParameter whose size equals the total size of all parameters that the current rank is responsible for; the corresponding parameters are taken from the pre-sharded model parameters and copied into PartitionedParameter. In addition, a buffer called PartitionedGradient, with the same size as PartitionedParameter, is created to store the gradients corresponding to PartitionedParameter. Megatron-LLaMA's communication mechanism then mainly consists of the following three procedures:

a) As shown in Figure 2-(i), once a parameter's gradient is obtained, it is copied to the corresponding position in its Bucket. Once all gradients for the parameters in a Bucket have been collected, a single ReduceScatter operation exchanges the gradients, with the corresponding position in PartitionedGradient as the destination.

b) As shown in Figure 2-(ii), each rank updates PartitionedParameter using PartitionedGradient once all ReduceScatter operations have finished.

c) As shown in Figure 2-(iii), each rank reconstructs the full parameters from all other ranks via AllGather over the logical Bucket.
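A compact sketch of steps (a)-(c), again illustrative rather than the actual implementation: the ReduceScatter destination is this rank's slot in PartitionedGradient, and the AllGather destinations are views into the full parameter storage (the ParameterBuffer described below); step_on_shard and bucket_views are hypothetical stand-ins.

    import torch.distributed as dist

    def on_bucket_grads_ready(bucket_grad, partitioned_grad_slot, dp_group):
        # (a) One ReduceScatter per full Bucket; the reduced shard lands directly
        # in this rank's slot of the PartitionedGradient buffer.
        dist.reduce_scatter_tensor(partitioned_grad_slot, bucket_grad, group=dp_group)

    def update_and_regather(partitioned_param, partitioned_grad, shard_optimizer,
                            bucket_views, dp_group):
        # (b) Update the locally owned parameter shard once all ReduceScatters finish.
        shard_optimizer.step_on_shard(partitioned_param, partitioned_grad)  # hypothetical
        # (c) Rebuild the full parameters Bucket by Bucket; each destination view
        # points into the full parameter storage, so no extra copy is needed.
        for full_bucket_view, local_shard_view in bucket_views:
            dist.all_gather_into_tensor(full_bucket_view, local_shard_view, group=dp_group)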

Specifically, we reduce memory copies and GPU memory occupation through the following approaches:

a. During the initialization of OverlappedDistributedOptimizer, a buffer called ParameterBuffer is allocated with the same size as the sum of all parameter sizes, and all model parameters are actually placed in ParameterBuffer. The destination addresses for reconstructing the full parameters via AllGather can directly reference the corresponding positions in ParameterBuffer, which avoids temporary memory allocation and reduces GPU memory copies. (This optimization is inspired by DeepSpeed.)

b. Once the gradients have been copied into the Bucket, the original gradient memory can be released, reducing GPU memory usage. Additionally, the memory for a Bucket can be released after its ReduceScatter operation. On top of this, we introduce a Buffer Alternation mechanism to avoid the memory fragmentation caused by frequent allocation and deallocation.
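The Buffer Alternation idea can be pictured with a small sketch (hypothetical, not the repository's code): instead of allocating and freeing one gradient Bucket buffer per Bucket, two fixed buffers are reused in turn, so allocation happens once and fragmentation from repeated alloc/free cycles is avoided.

    import torch

    class AlternatingBucketBuffers:
        """Reuse two pre-allocated Bucket-sized buffers in turn (sketch only)."""
        def __init__(self, bucket_numel, dtype, device):
            self._buffers = [torch.empty(bucket_numel, dtype=dtype, device=device)
                             for _ in range(2)]
            self._next = 0

        def acquire(self):
            # While one buffer is being ReduceScattered, gradients of the next
            # Bucket can already be copied into the other buffer.
            buf = self._buffers[self._next]
            self._next = 1 - self._next
            return buf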

Using Megatron-LLaMA

Launch a train task

You can use the same launching method as in Megatron-LM Usage. Beyond that, we provide:

A. Weight conversion tool

This tool converts parameters between the Megatron-LLaMA/Megatron-LM format and the Huggingface format.

HuggingFace to Megatron-LLaMA

sh tools/checkpoint_conversion/hf_to_megatron.sh

Megatron-LLaMA to HuggingFace

sh tools/checkpoint_conversion/megatron_to_hf.sh

B. Launching scripts

Single-node launching

sh examples/LLaMA/LLaMA_13_standalone.sh

Distributed launching

sh examples/LLaMA/LLaMA_13_slurm.sh

In particular, we recommend increasing the micro-batch size to fully occupy the GPU memory so that hardware utilization is maximized.

Customized arguments in Megatron-LLaMA

Argument Specification
--overlapped-distributed-optimizer Enable OverlappedDistributedOptimizer. Do not set --use-distributed-optimizer at the same time.
--reduce-bucket-size Set the Bucket size in OverlappedDistributedOptimizer. Defaults to 5e8. A larger Bucket gives higher utilization of the inter-DP-group bandwidth; a smaller Bucket brings more opportunity for overlapping communication with computation.
--tokenizer-type=PretrainedFromHF Use a tokenizer from Huggingface (loaded via transformers.AutoTokenizer).
--distributed-checkpointing Save checkpoint files in a distributed manner.

Megatron-LLaMA supports the canonical data preprocessing and evaluation described in the Megatron-LM library.

Future work

At present, we are actively working on the following items:

  • Release the configurations and optimization schemes for 30B and 65B/70B LLaMA model training
  • Supplement model modifications such as Alibi and FlashAttention2
  • Support LLMs with other model architectures
  • We encourage the community to engage in discussions aimed at making LLaMA training even more accessible, efficient, and cost-effective

License

Megatron-LLaMA is developed by Aicheng Technology, Alibaba Group, and is based on NVIDIA's Megatron-LM project (https://github.com/NVIDIA/Megatron-LM). The code is distributed under the Apache License (Version 2.0). This product contains various third-party components under other open source licenses. See the NOTICE file for more information.

Credits

The following repositories are used in Megatron-LLaMA, either in close-to-original form or as an inspiration:

Megatron-LM

LLaMA

DeepSpeed

megatron-llama's People

Contributors

aklife97, borisfom, boxin-wbx, deepakn94, dstosic, ekmb, eqy, erhoo82, ericharper, hwijeen, hyunwoongko, jaredcasper, jasperzhong, jiemingz, kantneel, ksivaman, kvareddy, li-yi-dong, lmcafee-nvidia, maanug-nv, maximumentropy, mpatwary, nakosung, pxuab, raulpuric, roclark, satpalsr, stas00, tridao, zliucr



megatron-llama's Issues

Question about fp16 support

Since I only have V100 machines at hand, I tried fp16 for training (bf16 is a bit slow).

However, with fp16 the model does not actually seem to train:

this line's check is always True, i.e. inf/nan is detected, so training cannot proceed.

I have run the same dataset with bf16 and did not hit this problem. Setting --initial-loss-scale to a fairly small value does not help either.

Communication is performed on every GA backward pass

Would it be possible to overlap backward with ReduceScatter only on the last GA micro-batch and skip communication on the earlier micro-batches? Right now, communicating on every GA step makes performance worse than DistributedOptimizer.

Training a 13B LLaMA model on 2 nodes only reaches 840 token/sec/GPU

Training a 13B LLaMA model on two A800x80G nodes only reaches 840 token/sec/GPU and I don't know why. The detailed configuration is as follows:
--tensor-model-parallel-size 4
--pipeline-model-parallel-size 1 \
--sequence-parallel
--distributed-timeout-minutes 60
--no-position-embedding
--use-rotary-position-embeddings
--RMSNorm
--layernorm-epsilon 1e-6
--causal-lm
--disable-bias-linear
--swiglu
--micro-batch-size 4
--global-batch-size 1024
--train-iters 50862
--lr 6.0e-5
--lr-decay-iters 32000
--lr-decay-style cosine
--min-lr 6e-6
--weight-decay 0.1
--lr-warmup-fraction .01
--clip-grad 1.0
--override-opt_param-scheduler
--overlapped-distributed-optimizer
--reduce-bucket-size=4e8
--no-gradient-accumulation-fusion
--dataloader-type cyclic \

Hi everyone, I have a question about how the GLOBAL_BATCH_SIZE value is computed and would appreciate your advice.

1. The training script contains the formula GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8)).
What does the final multiplication by 8 stand for?

2. Also, if I do not use the formula in the script and instead assign a fixed value to GLOBAL_BATCH_SIZE, what is the impact of setting it larger or smaller? Is there anything to pay attention to when choosing this value?
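For what it's worth, the formula in LLaMA_13_standalone.sh appears to decompose as sketched below, assuming (as the script's own comment states) that the trailing 8 is the number of gradient-accumulation micro-steps. The values are only a hypothetical example.

    # Hypothetical example values, not a recommendation.
    WORLD_SIZE, TP_SIZE, PP_SIZE = 32, 4, 1
    MICRO_BATCH_SIZE, GA_STEPS = 4, 8

    data_parallel_size = WORLD_SIZE // (TP_SIZE * PP_SIZE)                  # 8
    global_batch_size = data_parallel_size * MICRO_BATCH_SIZE * GA_STEPS    # 256

    # If GLOBAL_BATCH_SIZE is instead fixed to a constant, the number of
    # gradient-accumulation steps per iteration follows from it:
    # ga_steps = GLOBAL_BATCH_SIZE // (data_parallel_size * MICRO_BATCH_SIZE)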

On an 8xA800 machine, enabling overlapped-distributed-optimizer is about 8% slower than enabling use-distributed-optimizer

Running the LLaMA_13_standalone.sh script gives 1962.9 token/sec/GPU. After deleting "--overlapped-distributed-optimizer \ --reduce-bucket-size=2e8 \ --no-gradient-accumulation-fusion \" from the script and replacing it with "--use-distributed-optimizer \", the speed is 2124.7 token/sec/GPU.

Is this expected? Does the conclusion that OverlappedDistributedOptimizer is faster than DistributedOptimizer only hold for 256xA100 80GB? Would the same speedup be observed on 256xA800 80GB?

Small bug in the HF weight conversion code

At this point in the code, the check if config.num_hidden_layers % args.target_tensor_model_parallel_size != 0: is wrong; it should use args.target_pipeline_model_parallel_size instead of args.target_tensor_model_parallel_size.

if config.num_hidden_layers % args.target_tensor_model_parallel_size != 0:
        raise ValueError(
            f"Number of layers ({config.num_hidden_layers}) must be divisible by number of tensor parallelism"
            f" ({args.target_tensor_model_parallel_size})"
        )
    num_layers = config.num_hidden_layers // args.target_pipeline_model_parallel_size

    layer_re = re.compile(r"model.layers\.(\d+)\.([a-z0-9_.]+)\.([a-z]+)")

https://github.com/alibaba/Megatron-LLaMA/blob/main/tools/checkpoint_conversion/llama_checkpoint_conversion.py#L675C47-L675C47

Question: on a single A800*8*80G machine, pretraining LLaMA2-7B on 8 GPUs with TP4-PP1-DP2 vs. TP1-PP1-DP8 gives timings that seem inconsistent

Testing with the same data, why is the average token/sec/GPU of TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2), 6247.2, lower than
the 8707.8 of TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8)? In principle TP4-PP1-DP2 should be slower than TP1-PP1-DP8, since it processes fewer tokens per unit time, so why, in this example, is
the elapsed time per iteration (ms) of TP4-PP1-DP2 5245.2 (shorter),
while that of TP1-PP1-DP8 is 15088.2 (longer)?

Below is the training log for TP4-PP1-DP2 (tensor parallelism 4, data parallelism 2):
iteration 20/ 1000 | consumed samples: 1280 | elapsed time per iteration (ms): 5245.2 | average overall token/sec : 49977.6 | average token/sec/GPU : 6247.2 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.193359E+00 | loss scale: 1.0 | grad norm: 7.081 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 1344 | elapsed time per iteration (ms): 5240.6 | average overall token/sec : 50021.7 | average token/sec/GPU : 6252.7 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.044922E+00 | loss scale: 1.0 | grad norm: 23.787 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 1408 | elapsed time per iteration (ms): 5223.7 | average overall token/sec : 50184.0 | average token/sec/GPU : 6273.0 | learning rate: 6.000E-06 | global batch size: 64 | lm loss: 3.158203E+00 | loss scale: 1.0 | grad norm: 91.750 | number of skipped iterations: 0 | number of nan iterations: 0 |

====================================================================================

Below is the training log for TP1-PP1-DP8 (tensor parallelism 1, data parallelism 8):
iteration 20/ 1000 | consumed samples: 5120 | elapsed time per iteration (ms): 15088.2 | average overall token/sec : 69496.2 | average token/sec/GPU : 8687.0 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.751221E+00 | loss scale: 1.0 | grad norm: 4.944 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 21/ 1000 | consumed samples: 5376 | elapsed time per iteration (ms): 15052.3 | average overall token/sec : 69662.3 | average token/sec/GPU : 8707.8 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.908936E+00 | loss scale: 1.0 | grad norm: 9.224 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 22/ 1000 | consumed samples: 5632 | elapsed time per iteration (ms): 15057.8 | average overall token/sec : 69636.9 | average token/sec/GPU : 8704.6 | learning rate: 6.000E-06 | global batch size: 256 | lm loss: 2.847168E+00 | loss scale: 1.0 | grad norm: 8.277 | number of skipped iterations: 0 | number of nan iterations: 0 |

Converting Megatron-LM weights to HF format

Can a llama2 model trained with the original Megatron-LM be converted using the megatron_to_hf.sh script? I am currently running into problems: some parameter formats differ, so the script errors out. I modified it based on my own understanding, after which the conversion runs, but the converted HF model produces broken output.

Below are my changes. The first two changes should be fine, since the parameter names are quite close; I am not sure about the last one. The parameter shapes match, but after the change the output just repeats a single token.

[screenshots of the three modified code locations]

Are there any problems with the changes above? Could you provide a conversion script that supports Megatron-LM?

forward() missing 1 required positional argument: 'memory_efficient'

Hello, running the training script produces the following error. Where could the problem be?

Traceback (most recent call last):
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/pretrain_llama.py", line 151, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
    iteration = train(forward_step_func,
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/training.py", line 711, in train
    train_step(forward_step_func,
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/training.py", line 426, in train_step
    losses_reduced = forward_backward_func(
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 363, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 218, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/pretrain_llama.py", line 93, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/module.py", line 183, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/llama_model.py", line 114, in forward
    lm_output = self.language_model(
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/language_model.py", line 520, in forward
    encoder_output = self.encoder(
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/transformer.py", line 1303, in forward
    hidden_states = layer(
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/transformer.py", line 789, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/litong/workspace/MyDev/Megatron-LLaMA/megatron/model/fused_layer_norm.py", line 90, in forward
    return FusedRMSNormAffineFunction.apply(input, weight, self.normalized_shape, self.eps)
  File "/home/litong/.conda/envs/megatron-llama/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
TypeError: forward() missing 1 required positional argument: 'memory_efficient'


Error when training a llama-30b model; is llama-30b not supported?

building PretrainedFromHF tokenizer ...
Traceback (most recent call last):
File "/home/llm-deploy/Megatron-LLaMA/pretrain_llama.py", line 119, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/home/llm-deploy/Megatron-LLaMA/megatron/training.py", line 90, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/home/llm-deploy/Megatron-LLaMA/megatron/initialize.py", line 50, in initialize_megatron
set_global_variables(args)
File "/home/llm-deploy/Megatron-LLaMA/megatron/global_vars.py", line 92, in set_global_variables
_ = _build_tokenizer(args)
File "/home/llm-deploy/Megatron-LLaMA/megatron/global_vars.py", line 125, in _build_tokenizer
_GLOBAL_TOKENIZER = build_tokenizer(args)
File "/home/llm-deploy/Megatron-LLaMA/megatron/tokenizer/tokenizer.py", line 46, in build_tokenizer
tokenizer = _AutoTokenizer(args.tokenizer_name_or_path, vocab_extra_ids=args.vocab_extra_ids)
File "/home/llm-deploy/Megatron-LLaMA/megatron/tokenizer/tokenizer.py", line 554, in init
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, **hf_tokenizer_kwargs)
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 652, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 496, in get_tokenizer_config
resolved_config_file = cached_file(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/xxx/megatron-llama-30b-checkpoint'. Use repo_type argument if needed.

Question about the MLP inside the LLaMA decoder layers

Hi developers, I have a question about the code and would like to ask for clarification.

Description

Part 1

  1. While debugging the code, I found that the MLP part of the decoder layers in your LLaMA model uses Megatron-LM's ParallelMLP.
  2. However, this operator has only two linear layers:
    1. self.dense_h_to_4h = tensor_parallel.ColumnParallelLinear
    2. self.dense_4h_to_h = tensor_parallel.RowParallelLinear
  3. The ParallelMLP layer in question is
    class ParallelMLP(MegatronModule):

Part 2

However, in the LLaMA MLP of the HuggingFace transformers code, and in the code provided in LLaMA's official GitHub repository, the MLP (also called the feed-forward layer) has three linear layers:

    1. self.gate_proj = tensor_parallel.ColumnParallelLinear
    2. self.up_proj = tensor_parallel.ColumnParallelLinear
    3. self.down_proj = tensor_parallel.RowParallelLinear

Reference for the HF transformers LLaMA MLP: https://github.com/huggingface/transformers/blob/4557a0dede92ce985576fac478b754d76bba3c18/src/transformers/models/llama/modeling_llama.py#L229

Part 3

  1. The above only briefly lists the structural differences between your MLP and the standard LLaMA MLP; the computation also differs considerably, which I won't enumerate here.

My question:

Is the difference between your MLP computation and the official one mathematically equivalent, or is it a bug?

Thanks to your team for providing such an excellent open-source project; I hope someone can explain this. Thank you!
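For context on the question above: Megatron-LM's ParallelMLP with --swiglu appears to fuse the gate and up projections into a single dense_h_to_4h whose output is split in two before the SiLU gating, which is mathematically the same computation as HuggingFace's gate_proj/up_proj/down_proj. The sketch below (toy sizes, hypothetical and not the repository's code) checks that equivalence numerically.

    import torch
    import torch.nn.functional as F

    hidden, ffn = 256, 688  # toy sizes for illustration only
    x = torch.randn(2, hidden)

    # HF-style LLaMA MLP: three separate projections.
    gate_proj = torch.nn.Linear(hidden, ffn, bias=False)
    up_proj   = torch.nn.Linear(hidden, ffn, bias=False)
    down_proj = torch.nn.Linear(ffn, hidden, bias=False)
    y_hf = down_proj(F.silu(gate_proj(x)) * up_proj(x))

    # Megatron-style SwiGLU MLP: gate and up fused into one wide projection
    # (dense_h_to_4h with 2*ffn outputs), chunked before the activation.
    dense_h_to_4h = torch.nn.Linear(hidden, 2 * ffn, bias=False)
    dense_4h_to_h = torch.nn.Linear(ffn, hidden, bias=False)
    with torch.no_grad():
        # Tie the fused weight to the two HF weights so the outputs can be compared.
        dense_h_to_4h.weight.copy_(torch.cat([gate_proj.weight, up_proj.weight], dim=0))
        dense_4h_to_h.weight.copy_(down_proj.weight)
    g, u = dense_h_to_4h(x).chunk(2, dim=-1)
    y_megatron = dense_4h_to_h(F.silu(g) * u)

    print(torch.allclose(y_hf, y_megatron, atol=1e-5))  # expected: True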

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead

[before the start of training step] datetime: 2023-09-11 11:26:56
Traceback (most recent call last):
File "/home/llm-deploy/Megatron-LLaMA/pretrain_llama.py", line 119, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/home/llm-deploy/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
iteration = train(forward_step_func,
File "/home/llm-deploy/Megatron-LLaMA/megatron/training.py", line 711, in train
train_step(forward_step_func,
File "/home/llm-deploy/Megatron-LLaMA/megatron/training.py", line 426, in train_step
losses_reduced = forward_backward_func(
File "/home/llm-deploy/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 360, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/home/llm-deploy/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 218, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/home/llm-deploy/Megatron-LLaMA/pretrain_llama.py", line 85, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/distributed.py", line 58, in forward
return self.module(*inputs, **kwargs)
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/module.py", line 183, in forward
outputs = self.module(*inputs, **kwargs)
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/llama_model.py", line 113, in forward
lm_output = self.language_model(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/language_model.py", line 520, in forward
encoder_output = self.encoder(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 1303, in forward
hidden_states = layer(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 792, in forward
self.self_attention(
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 651, in forward
context_layer = self._checkpointed_attention_forward(
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 506, in checkpointed_attention_forward
hidden_states = tensor_parallel.checkpoint(
File "/home/llm-deploy/Megatron-LLaMA/megatron/core/tensor_parallel/random.py", line 252, in checkpoint
return CheckpointFunction.apply(function,
File "/home/llm-deploy/Megatron-LLaMA/megatron/core/tensor_parallel/random.py", line 195, in forward
outputs = run_function(*args)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 499, in custom_forward
output
= self.core_attention(query_layer, key_layer,
File "/home/llm-deploy/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/llm-deploy/Megatron-LLaMA/megatron/model/transformer.py", line 315, in forward
value_layer = value_layer.view(value_layer.size(0),
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead

llama2-34b shape mismatch

When training llama2-34b (GQA architecture), I get the following error:

RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
        size mismatch for layers.0.self_attention.query_key_value.weight: copying a param with shape torch.Size([1280, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 8192]).
        size mismatch for layers.1.self_attention.query_key_value.weight: copying a param with shape torch.Size([1280, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 8192]).
        size mismatch for layers.2.self_attention.query_key_value.weight: copying a param with shape torch.Size([1280, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 8192]).
        size mismatch for layers.3.self_attention.query_key_value.weight: copying a param with shape torch.Size([1280, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 8192]).
        size mismatch for layers.4.self_attention.query_key_value.weight: copying a param with shape torch.Size([1280, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 8192]).
        size mismatch for layers.5.self_attention.q

The script is as follows:

--finetune \
    --sequence-parallel \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --num-layers 48 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --seq-length 1024 \
    --max-position-embeddings 16384 \
    --no-position-embedding \
    --use-rotary-position-embeddings \
    --swiglu \
    --ffn-hidden-size 22016 \
    --disable-bias-linear \
    --RMSNorm \
    --layernorm-epsilon 1e-6 \
    --causal-lm \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $TOKENIZER_PATH \
    --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
    --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 1e-4 \
    --lr-decay-iters 10 \
    --lr-warmup-iters 5 \
    --min-lr 1e-5 \
    --override-opt_param-scheduler \
    --lr-decay-style cosine \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --overlapped-distributed-optimizer \
    --reduce-bucket-size=2e8 \
    --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
    --data-impl mmap \
    --data-path ${DATASET} \
    --split 1,0,0 \
    --eval-interval ${EVAL_INTERVAL} \
    --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
    --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
    --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
    --tensorboard-queue-size 1000 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
    --recompute-granularity selective \
    --use-flash-attn \
    --no-save-optim \
    --no-save-rng \

Have you trained models with a GQA architecture? How should the script be configured?
Thanks!

Single-node training fails with a CUDA error

Since I only have V100s at hand, I removed the flash attention option. GPU memory is also insufficient, so I can only run the 7B model.
I first converted the original model to an HF model:

python /usr/local/lib/python3.10/dist-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir <path to llama-2-7b> --model_size 7B --output_dir <save path of 7b-hf>

and then converted the HF model to a Megatron model with your script:

sh tools/checkpoint_conversion/hf_to_megatron.sh

In the configuration, TP was changed to 1:

--target_tensor_model_parallel_size 1 \

I also modified the corresponding parameters in LLaMA_13_standalone.sh, including:

TP_SIZE=1
options=" \
    --finetune \
    --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
    --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008\
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
    --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
    --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
    --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
        --recompute-granularity selective \
    "

The problem now is that training throws an exception:

# This indexSelectLargeIndex error is repeated many times

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [555,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/root/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/root/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
    iteration = train(forward_step_func,
  File "/root/Megatron-LLaMA/megatron/training.py", line 711, in train
    train_step(forward_step_func,
  File "/root/Megatron-LLaMA/megatron/training.py", line 426, in train_step
    losses_reduced = forward_backward_func(
  File "/root/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 360, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/root/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 218, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/root/Megatron-LLaMA/pretrain_llama.py", line 85, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/module.py", line 183, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/llama_model.py", line 113, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/language_model.py", line 520, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/transformer.py", line 1303, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/transformer.py", line 792, in forward
    self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/model/transformer.py", line 552, in forward
    mixed_x_layer, _ = self.query_key_value(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/core/tensor_parallel/layers.py", line 565, in forward
    output_parallel = linear_with_grad_accumulation_and_async_allreduce(
  File "/root/Megatron-LLaMA/megatron/core/tensor_parallel/layers.py", line 420, in linear_with_grad_accumulation_and_async_allreduce
    return LinearWithGradAccumulationAndAsyncCommunication.apply(*args)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/autocast_mode.py", line 98, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/root/Megatron-LLaMA/megatron/core/tensor_parallel/layers.py", line 244, in forward
    output = torch.matmul(total_input, weight.t())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

What could be the cause? Thanks!

RuntimeError: CUDA error: device-side assert triggered

The following error occurs when training a Llama2-7b model:

Traceback (most recent call last):

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in < module >

    pretrain(train_valid_test_datasets_provider, model_provider,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 153, in pretrain

    iteration = train(forward_step_func,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 711, in train

    train_step(forward_step_func,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 426, in train_step

    losses_reduced = forward_backward_func(

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 360, in forward_backward_no_pipelining

    output_tensor = forward_step(forward_step_func, data_iterator,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 222, in forward_step

    output_tensor = loss_func(output_tensor)

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 69, in loss_func

    averaged_loss = average_losses_across_data_parallel_group([loss])

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/utils.py", line 73, in average_losses_across_data_parallel_group

    torch.distributed.all_reduce(averaged_losses,

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce

    work = group.allreduce([tensor], opts)

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3

ncclInternalError: Internal check failed.

Last error:

Cuda failure 'device-side assert triggered'

Traceback (most recent call last):

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in < module >

    pretrain(train_valid_test_datasets_provider, model_provider,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 153, in pretrain

    iteration = train(forward_step_func,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 711, in train

    train_step(forward_step_func,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 426, in train_step

    losses_reduced = forward_backward_func(

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 364, in forward_backward_no_pipelining

    backward_step(grad_scaler, input_tensor, output_tensor,

  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/core/pipeline_parallel/schedules.py", line 284, in backward_step

    torch.autograd.backward(output_tensor[0], grad_tensors=output_tensor_grad[0])

  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward

    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

RuntimeError: CUDA error: device-side assert triggered

CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [10,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [22,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

/opt/conda/conda-bld/pytorch_1670525552843/work/aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [23,0,0] Assertion `t  >= 0 && t <  n_classes` failed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 158) of binary: /opt/conda/bin/python

Traceback (most recent call last):

  File "/opt/conda/bin/torchrun", line 33, in < module >

    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper

    return f(*args, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main

    run(args)

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run

    elastic_launch(

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__

    return launch_agent(self._config, self._entrypoint, list(args))

  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent

    raise ChildFailedError(

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

What could be causing this?
The launch script:

#!/bin/bash

DATASET_1="/code/xx/LLM_mine/reference/Megatron-LLaMA/data/wiki_zh/wiki_zh_content_document"
DATASET_2="<PATH TO THE SECOND DATASET>"
DATASET_3="<PATH TO THE THIRD DATASET>"
DATASET="1.0 ${DATASET_1}"
# 0.3 ${DATASET_2} 0.5 ${DATASET_3}"

TP_SIZE=2
PP_SIZE=1
WORLD_SIZE=4
MICRO_BATCH_SIZE=2
# The int is the number of micro steps of gradient accumulation
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
# GLOBAL_BATCH_SIZE=128

JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

LOAD_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2-7b"
SAVE_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2_7b"
TOKENIZER_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2-7b"
TENSORBOARD_DIR="/code/xx/LLM_mine/reference/Megatron-LLaMA/output"

TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1

# Setting --tensorboard-queue-size to 1 significantly slows down the training
options=" \
    --finetune \
    --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
    --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008 \
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
    --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
    --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
    --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
        --recompute-granularity selective"

torchrun --nnodes 2 --nproc_per_node 2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400 /code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py ${options}

Question about DistributedOptimizer and gradient accumulation

"We observe that the collective communication of DistributedOptimizer introduces a huge extra overhead when gradient accumulation is small. In the extreme case of no gradient accumulation, it introduces an extra overhead of more than 50% of the overall time."

Could you explain in more detail what "small gradient accumulation" means here?

Question about the accuracy of the grad_norm computation when using the distributed optimizer

# Scale grad buffers by '1 / data_parallel_world_size'.
for model in self.models:
    for dtype, gbuf in model._grad_buffers.items():
        gbuf.data /= data_parallel_world_size

# Reduce-scatter all grads.
gbuf_view_items = self.get_model_grad_buffer_dp_views()
for index, (model_index, dtype, gbuf, gbuf_views) \
        in enumerate(gbuf_view_items):
    torch.distributed._reduce_scatter_base(
        gbuf_views[data_parallel_rank],
        gbuf,
        group=data_parallel_group,
    )

This should make each member of the DP group obtain the reduced sum only for the portion of the parameter gradients that it maintains, right?

But if so, isn't the grad_norm computed later in optimizer.step() inaccurate?

Because as far as I can tell, when grad_norm is computed, each member of the DP group squares and sums all gradients of all params in its part of the model, yet only a part of each member's gradients has been summed across the DP group, so the resulting grad_norm looks wrong to me.

Does this problem actually exist?
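For reference, ZeRO-2-style optimizers usually keep the global gradient norm exact despite the sharding by having each rank square-sum only the shard it owns (which is fully reduced) and then all-reducing that scalar across the data-parallel group, as in the sketch below; whether this matches the exact code path in Megatron-LM/Megatron-LLaMA would need to be checked in the repository.

    import torch
    import torch.distributed as dist

    def sharded_global_grad_norm(owned_grad_shard, dp_group):
        """Exact L2 norm of the full gradient when every DP rank holds only the
        fully reduced shard it owns (illustrative sketch)."""
        local_sq_sum = owned_grad_shard.float().pow(2).sum()
        # Sum the per-shard squared norms across the DP group, then take one sqrt.
        dist.all_reduce(local_sq_sum, op=dist.ReduceOp.SUM, group=dp_group)
        return local_sq_sum.sqrt()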

With the NCCL backend in multi-node training, saving the checkpoint after training fails

Traceback (most recent call last):
File "/xxx/Megatron-LLaMA/pretrain_llama.py", line 119, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/xxx/Megatron-LLaMA/megatron/training.py", line 153, in pretrain
iteration = train(forward_step_func,
File "/xxx/Megatron-LLaMA/megatron/training.py", line 759, in train
save_checkpoint_and_time(iteration, model, optimizer,
File "/xxx/Megatron-LLaMA/megatron/training.py", line 679, in save_checkpoint_and_time
save_checkpoint(iteration, model, optimizer, opt_param_scheduler)
File "/xxx/Megatron-LLaMA/megatron/checkpointing.py", line 373, in save_checkpoint
optimizer.save_parameter_state(
File "/xxx/Megatron-LLaMA/megatron/optimizer/overlapped_dist_optimizer.py", line 1000, in save_parameter_state
torch.distributed.gather(
File "/xxx/anaconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2540, in gather
work = group.gather(output_tensors, input_tensors, opts)
RuntimeError: Tensors must be CUDA and dense

After training, converting the saved Megatron-format checkpoint to HF format fails

I ran sh tools/checkpoint_conversion/megatron_to_hf.sh to convert; megatron_to_hf.sh is as follows:
python tools/checkpoint_conversion/llama_checkpoint_conversion.py \
    --convert_checkpoint_from_megatron_to_transformers \
    --load_path "/xxx/megatron-llama-2-7b-checkpoint_TP2" \
    --save_path "/xxx/megatron-llama-2-7b-checkpoint_TP2/megatron-to-hf" \
    --target_tensor_model_parallel_size 2 \
    --target_pipeline_model_parallel_size 1 \
    --target_params_dtype "fp16" \
    --make_vocab_size_divisible_by 1 \
    --print-checkpoint-structure \
    --megatron-path "/xxx/Megatron-LLaMA"

Error message:
Loading Megatron-LM checkpoint arguments from: /xxx/megatron-llama-2-7b-checkpoint_TP2/iter_0001000
Loading Megatron-LM checkpoint arguments from: /xxx/megatron-llama-2-7b-checkpoint_TP2/iter_0001000/mp_rank_00/model_optim_rng.pt
Traceback (most recent call last):
File "/xxx/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py", line 847, in
main()
File "/xxx/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py", line 841, in main
convert_checkpoint_from_megatron_to_transformers(args)
File "/xxx/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py", line 324, in convert_checkpoint_from_megatron_to_transformers
state_dict = torch.load(rank0_checkpoint_path, map_location="cpu")
File "/xxx/anaconda3/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/xxx/anaconda3/lib/python3.10/site-packages/torch/serialization.py", line 1131, in _load
result = unpickler.load()
File "/xxx/anaconda3/lib/python3.10/site-packages/torch/serialization.py", line 1124, in find_class
return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'megatron'

NCCL communication boundary-alignment question?

Does the target_buffer need boundary alignment when it is used for communication? Below I compare how the tensors actually used for communication are created, against DeepSpeed.
Megatron-LLaMA:

        self._partitioned_param = torch.empty(total_size,
                                              device=self._flatted_buffer.device,
                                              dtype=self._flatted_buffer.dtype)

DeepSpeed:

        #align nccl all-gather send buffers to 4-bye boundary
        self.nccl_start_alignment_factor = 2  # 4-byte alignment/sizeof(fp16) = 2
...
          self.bf16_groups_flat.append(
              self._flatten_dense_tensors_aligned(self.bf16_groups[i],
                                                  self.nccl_start_alignment_factor * dp_world_size))

If the boundaries are not aligned, can NCCL communication go wrong?

Unable to import Megatron

When using the weight conversion tool, I specified the correct location of Megatron-LLaMA, but I still get the following error:
Unable to import Megatron, please specify the path to Megatron using --megatron-path. Exiting.

fused_kernels/build/scaled_upper_triang_masked_softmax_cuda.so not found at runtime

Hi, when running the code I hit the error in the title; the corresponding folder only contains a build.ninja file. What could cause this? I tried deleting the whole build folder and re-running, but the problem persists.
The full log is shown below:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

using world size: 2, data-parallel-size: 1, tensor-model-parallel size: 2, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:PretrainedFromHF
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
Disable contiguous grad buffer in DDP, since the optimizer would handle it.
Disable gradient accumulation fusion, since the optimizer would handle it.
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
causal_lm ....................................... True
checkpoint_dir_name ............................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['1', '/data1/projects/Megatron-DeepSpeed-main/my-train-data/my-gpt2_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. cyclic
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_checkpointing ....................... False
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 40
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 13824
finetune ........................................ True
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 16
gradient_accumulation_fusion .................... False
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.0
hidden_size ..................................... 5120
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.01
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
job_name ........................................ LLaMA_tp2_pp1_mbs2_gpus2
kv_channels ..................................... 128
last_bucket_split_count ......................... None
layernorm_epsilon ............................... 1e-06
lazy_mpu_init ................................... None
load ............................................ /data1/projects/Megatron-DeepSpeed-main/llama_33b
local_rank ...................................... None
log_batch_size_to_tensorboard ................... True
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... True
log_validation_ppl_to_tensorboard ............... True
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 6e-05
lr_decay_iters .................................. 10
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 5
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 1
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 2048
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 2
min_loss_scale .................................. 1.0
min_lr .......................................... 6e-06
mmap_warmup ..................................... False
no_load_optim ................................... True
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 40
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlapped_distributed_optimizer ................ True
override_opt_param_scheduler .................... True
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... selective
recompute_method ................................ None
recompute_num_layers ............................ 1
reduce_bucket_size .............................. 200000000.0
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
RMSNorm ......................................... True
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ /data1/projects/Megatron-DeepSpeed-main/save_models
save_interval ................................... 100
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 98,2,0
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 2
tensorboard_dir ................................. /data1/projects/Megatron-DeepSpeed-main/tensorboard_data
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_name_or_path .......................... /data1/projects/Megatron-DeepSpeed-main/33B_tokenizer
tokenizer_type .................................. PretrainedFromHF
train_data_path ................................. None
train_iters ..................................... 1000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. False
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. True
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
valid_data_path ................................. None
variable_seq_lengths ............................ False
verify_grad_order ............................... False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 2
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 8

building PretrainedFromHF tokenizer ...
padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
initializing torch distributed ...
[W ProcessGroupGloo.cpp:694] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W ProcessGroupGloo.cpp:694] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
initialized tensor model parallel with size 2
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/data'

done with dataset index builder. Compilation time: 0.031 seconds
compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.11.1.git.kitware.jobserver-1
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Traceback (most recent call last):
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/pretrain_llama.py", line 119, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/training.py", line 90, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/initialize.py", line 82, in initialize_megatron
_compile_dependencies()
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/initialize.py", line 130, in _compile_dependencies
fused_kernels.load(args)
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/fused_kernels/init.py", line 61, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/fused_kernels/init.py", line 37, in _cpp_extention_load_helper
return cpp_extension.load(
File "/data1/conda/envs/env39/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/data1/conda/envs/env39/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/data1/conda/envs/env39/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 565, in module_from_spec
File "", line 1173, in create_module
File "", line 228, in _call_with_frames_removed
ImportError: /data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/fused_kernels/build/scaled_upper_triang_masked_softmax_cuda.so: cannot open shared object file: No such file or directory
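
When only build.ninja is left behind, the JIT compilation typically failed (or raced across ranks) without the real compiler error being shown. One way to surface it is to build the extension manually, on a single process, with verbose output. This is a sketch under the assumption that the standard Megatron-LM fused_kernels source layout and compile flags are used; the path is taken from the log above:

    # build_softmax_kernel.py -- run alone (not under torchrun) to see the full
    # ninja/nvcc output instead of the silent failure.
    import pathlib
    from torch.utils import cpp_extension

    FUSED = pathlib.Path("/data1/projects/Megatron-DeepSpeed-main/Megatron-LLaMA-main/megatron/fused_kernels")
    (FUSED / "build").mkdir(parents=True, exist_ok=True)

    module = cpp_extension.load(
        name="scaled_upper_triang_masked_softmax_cuda",
        sources=[
            str(FUSED / "scaled_upper_triang_masked_softmax.cpp"),
            str(FUSED / "scaled_upper_triang_masked_softmax_cuda.cu"),
        ],
        build_directory=str(FUSED / "build"),
        extra_cuda_cflags=[
            "-O3", "--use_fast_math",
            "-U__CUDA_NO_HALF_OPERATORS__", "-U__CUDA_NO_HALF_CONVERSIONS__",
            "--expt-relaxed-constexpr", "--expt-extended-lambda",
        ],
        verbose=True,   # prints the actual compiler command lines and errors
    )
    print("built:", module.__file__)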

Question about the CUDA_DEVICE_MAX_CONNECTIONS setting

Regarding CUDA_DEVICE_MAX_CONNECTIONS, there is an incompatibility when using the overlapped distributed optimizer:
Overlapping the TP backward communication with the backward computation requires CUDA_DEVICE_MAX_CONNECTIONS=1.
But overlapping the DP reduce_scatter with the backward computation requires CUDA_DEVICE_MAX_CONNECTIONS>1; with it set to 1, the trace shows no overlap.
Is there a way to resolve this conflict?

Is sequence parallelism compatible?

With OverlappedDistOpt enabled, the communication only overlaps when CUDA_DEVICE_MAX_CONNECTIONS is left unset.
Does that mean sequence parallelism can no longer be used? Sequence parallelism requires this environment variable to be 1:
"Using sequence parallelism requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1"

On 4 nodes with 8×A100 each, --overlapped-distributed-optimizer is much slower than --use-distributed-optimizer

Tested on 4 nodes with 8×A100 each: training with --overlapped-distributed-optimizer is much slower than with --use-distributed-optimizer. The only change between runs was replacing --overlapped-distributed-optimizer with --use-distributed-optimizer in the script. Is this result expected?

TP=2, PP=1 results:
--overlapped-distributed-optimizer
global batch size 2048, micro batch size 4
elapsed time per iteration (ms): 82897.1 | average overall token/sec: 50596.5 | average token/sec/GPU: 1581.1 | GPU memory usage: 89%

--use-distributed-optimizer
global batch size 2048, micro batch size 2
elapsed time per iteration (ms): 64638.7 | average overall token/sec: 64888.4 | average token/sec/GPU: 2027.8 | GPU memory usage: 83.5%

Loss alignment

Hello,

I recently ran LLaMA-7B with both DistributedOptimizer and OverlappedDistributedOptimizer, with tp1-pp1-dp8-zero1. The two losses start to diverge from the third iteration, as shown below:

OverlappedDistributedOptimizer

iter 1: 1.764603
iter 2: 1.693627
iter 3: 1.751950
iter 4: 1.745651
iter 5: 1.781831


DistributedOptimizer

iter 1: 1.764603
iter 2: 1.693627
iter 3: 1.751948
iter 4: 1.745644
iter 5: 1.781826

The only difference between the two runs is that the OverlappedDistributedOptimizer run used --no-contiguous-buffers-in-local-ddp. Both runs used --gradient-accumulation-fusion.

What might cause this?

Shape error after converting HF to Megatron

I first used the conversion script to convert llama2-7b from Huggingface to Megatron; training then fails with a shape mismatch:

Traceback (most recent call last):
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 112, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 384, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler, load_dir=args.checkpoint_dir_name)
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 721, in load_checkpoint
    _load_common_from_state_dict(args, release, state_dict, model, strict=strict)
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 856, in _load_common_from_state_dict
    model[0].load_state_dict(state_dict["model"], strict=strict)
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/model/distributed.py", line 71, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/model/module.py", line 199, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xx/LLM_mine/reference/Megatron-LLaMA/megatron/model/llama_model.py", line 154, in load_state_dict
    self.lm_head.load_state_dict(state_dict['lm_head'], strict=strict)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Linear:
    size mismatch for weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).

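
A 32000-vs-32002 mismatch is usually a vocabulary-padding disagreement between the conversion step and the training run: Megatron pads the tokenizer vocabulary to be divisible by make_vocab_size_divisible_by * tensor_model_parallel_size, and a tokenizer that gained extra special tokens shifts the result too. The sketch below is a simplified version of the padding rule from upstream Megatron-LM's _vocab_size_with_padding, useful for checking which combination of settings produces 32002; the exact cause in this report is not confirmed.

    # Simplified Megatron vocabulary-padding rule: the padded size must be
    # divisible by make_vocab_size_divisible_by * tensor_model_parallel_size.
    def padded_vocab_size(orig_vocab_size: int,
                          make_vocab_size_divisible_by: int,
                          tensor_model_parallel_size: int) -> int:
        after = orig_vocab_size
        multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
        while after % multiple != 0:
            after += 1
        return after

    # 32000 tokens with --make_vocab_size_divisible_by 1 and TP=2 stays 32000:
    print(padded_vocab_size(32000, 1, 2))   # 32000
    # a tokenizer with extra special tokens, or different settings at training
    # time, lands on a different padded size, e.g.:
    print(padded_vocab_size(32001, 1, 2))   # 32002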

Error with the NGC 22.08 environment

Using the NGC 22.08 image, I get: module 'torch.distributed' has no attribute 'reduce_scatter_tensor'.

The image ships with torch 1.13.0a0+d321be6, and reinstalling torch breaks apex.

Two questions:
1) Is there a recommended NGC image?
2) Is there a Dockerfile?
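
reduce_scatter_tensor only appeared in the PyTorch 1.13 release; the 1.13.0a0 pre-release in NGC 22.08 still exposes the older private name _reduce_scatter_base with the same calling convention (and _all_gather_base for all_gather_into_tensor). One workaround, offered as a hedged sketch rather than an officially supported fix, is to alias the old names at startup instead of reinstalling torch and breaking apex:

    # compat_dist.py -- alias pre-1.13 private collectives to the new public
    # names so code written against torch>=1.13 runs on the NGC 22.08 build.
    # The underscore APIs are private and may differ in corner cases across builds.
    import torch.distributed as dist

    if not hasattr(dist, "reduce_scatter_tensor") and hasattr(dist, "_reduce_scatter_base"):
        dist.reduce_scatter_tensor = dist._reduce_scatter_base
    if not hasattr(dist, "all_gather_into_tensor") and hasattr(dist, "_all_gather_base"):
        dist.all_gather_into_tensor = dist._all_gather_base

Importing this shim before the optimizer is constructed (for example, near the top of the training entry point) should make the attribute error go away, assuming these two symbols are the only ones missing.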

Running LLaMA_13_standalone.sh: no training happens and it finishes almost immediately

Using the llama2-7b model, with --ffn-hidden-size changed to 11008:
[after dataloaders are built] datetime: 2023-09-11 07:02:50
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (10401.35, 10402.58)
train/valid/test-data-iterators-setup ..........: (1306.47, 1410.68)
training ...
[after training is done] datetime: 2023-09-11 07:02:50

There is no training loop at all; it jumps straight to done!

Why is CUDA_DEVICE_MAX_CONNECTIONS allowed to differ from 1 when overlapped_distributed_optimizer is used?

    if os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS') != "1" \
        and not args.overlapped_distributed_optimizer:
        if args.sequence_parallel:
            raise RuntimeError(
                "Using sequence parallelism requires setting the environment variable "
                "CUDA_DEVICE_MAX_CONNECTIONS to 1")
        if args.async_tensor_model_parallel_allreduce:
            raise RuntimeError(
                "Using async gradient all reduce requires setting the environment "
                "variable CUDA_DEVICE_MAX_CONNECTIONS to 1")

Could someone explain why, once overlapped_distributed_optimizer is enabled, CUDA_DEVICE_MAX_CONNECTIONS no longer has to be 1?

deepspeed+megatron+llama: have the authors tried this combination?

Hello, I have also been studying Megatron recently and came across Megatron-DeepSpeed. That project does not implement the LLaMA model itself, but it does provide a shell script for pretraining a LLaMA-style architecture. What is the difference between this project and that one?
Also, why does the model need to be converted to Megatron format? I used the Huggingface .bin model directly and it seemed to run successfully.
Thanks for the code 😊😊😊

Can Megatron-LLaMA support INT4-quantized models?

I only have two 3090 GPUs (installed in two machines on the same network). I have the LLaMA2 70B GPTQ int4-quantized model files (about 35 GB) and would like to convert them to Megatron format first; is that possible? I gave it a try:

python ./tools/checkpoint_conversion/llama_checkpoint_conversion.py --load_path "path1" --save_path "output_path2" --target_tensor_model_parallel_size 2 --target_pipeline_model_parallel_size 1 --target_data_parallel_size 1 --make_vocab_size_divisible_by 1 --print-checkpoint-structure --megatron-path "./Megatron_LLaMA"

After the conversion (screenshot omitted), each model_optim_rng.pt is only 2 GB, and the .pt files across the two directories add up to just 4 GB, far less than 35 GB. For comparison, the original unquantized LLaMA2 7B HF model (pytorch_model.bin format) is about 13 GB, and after converting it to Megatron format the .pt files across both directories also total about 13 GB, which looks correct.
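
The 4 GB total suggests the GPTQ tensors (packed int4 weights plus quantization metadata) were not mapped onto the fp16/bf16 parameter layout the converter presumably expects for standard HF LLaMA weights. A quick, generic way to see what actually ended up in the converted files, as a sketch assuming the usual model_optim_rng.pt layout with a top-level 'model' entry (path is a placeholder):

    # inspect_megatron_ckpt.py -- print the tensors stored in a converted
    # checkpoint so dtypes and total size can be compared with the source model.
    import torch

    CKPT = "/xxx/iter_0001000/mp_rank_00/model_optim_rng.pt"   # placeholder path

    state = torch.load(CKPT, map_location="cpu")
    model_state = state.get("model", state)    # fall back if it is a raw state dict

    total_bytes = 0
    def walk(prefix, obj):
        global total_bytes
        if torch.is_tensor(obj):
            total_bytes += obj.numel() * obj.element_size()
            print(f"{prefix}: shape={tuple(obj.shape)} dtype={obj.dtype}")
        elif isinstance(obj, dict):
            for k, v in obj.items():
                walk(f"{prefix}.{k}" if prefix else str(k), v)

    walk("", model_state)
    print(f"total tensor bytes: {total_bytes / 1e9:.2f} GB")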
