
Comments (10)

li-yi-dong commented on June 26, 2024

May you post your launching script?

Double-bear commented on June 26, 2024

May you post your launching script?

Hi @li-yi-dong, this is my script:

python /code/xx/LLM_mine/reference/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py \
--load_path "/code/xx/LLM_mine/model/LLama2/Llama2_7b/llama2_7b" \
--save_path "/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b" \
--target_tensor_model_parallel_size 2 \
--target_pipeline_model_parallel_size 2 \
--target_data_parallel_size 1 \
--target_params_dtype "fp16" \
--make_vocab_size_divisible_by 1 \
--print-checkpoint-structure \
--megatron-path "/code/xx/LLM_mine/reference/Megatron-LLaMA"

And I also ran into another problem:

LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
loading checkpoint from root /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b, release: True, iteration: 0
Traceback (most recent call last):
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 112, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 384, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler, load_dir=args.checkpoint_dir_name)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 721, in load_checkpoint
    _load_common_from_state_dict(args, release, state_dict, model, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 856, in _load_common_from_state_dict
    model[0].load_state_dict(state_dict["model"], strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/distributed.py", line 71, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/module.py", line 199, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/llama_model.py", line 154, in load_state_dict
    self.lm_head.load_state_dict(state_dict['lm_head'], strict=strict)
KeyError: 'lm_head'

load_common_from_state_dict: Could not find arguments in the checkpoint ...

I restarted my task, and both of these issues occurred again.
I also used torch.load to load the state_dict of the Megatron checkpoint, and there is indeed no key named 'lm_head'.
I'm not quite sure where the problem lies.
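
Roughly, this is the check I did (a minimal sketch; it assumes the converted checkpoint uses the usual Megatron release/mp_rank_*/model_optim_rng.pt layout, which may differ depending on the converter version):

import glob
import torch

# Path from my conversion script above
ckpt_root = "/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"

# Print the top-level model keys of every shard to see which ones contain 'lm_head'
for path in sorted(glob.glob(f"{ckpt_root}/release/mp_rank_*/model_optim_rng.pt")):
    state_dict = torch.load(path, map_location="cpu")
    model_keys = list(state_dict.get("model", {}).keys())
    print(path)
    print("  model keys:", model_keys)
    print("  has 'lm_head':", "lm_head" in model_keys)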

li-yi-dong commented on June 26, 2024

May you post your launching script?

The script that launches the training.

Double-bear commented on June 26, 2024

May you post your launching script?

The script that launches the training.

Sorry @li-yi-dong, this is my training script:

#!/bin/bash

DATASET_1="/code/konglingkai/02LLM/data/megatron/llama2_wudao/pretrain_6_text_document"
# DATASET_2="<PATH TO THE SECOND DATASET>"
# DATASET_3="<PATH TO THE THIRD DATASET>"
DATASET="1.0 ${DATASET_1}"
# 0.3 ${DATASET_2} 0.5 ${DATASET_3}

TP_SIZE=2
PP_SIZE=2
WORLD_SIZE=4
MICRO_BATCH_SIZE=2
# The trailing 8 is the number of gradient-accumulation micro-steps (see the worked example after the script)
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
# GLOBAL_BATCH_SIZE=128

JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

LOAD_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
SAVE_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2_7b"
TOKENIZER_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
TENSORBOARD_DIR="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2/tensorboard"

TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1

# Setting --tensorboard-queue-size to 1 significantly slows down the training
options=" \
    --finetune \
    --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
    --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008 \
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
    --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
    --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
    --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
        --recompute-granularity selective "

torchrun --nnodes 2 --nproc_per_node 2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400 /code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py ${options}
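
For clarity, the GLOBAL_BATCH_SIZE expression above works out like this with these settings (a small worked example in Python of the same arithmetic; the trailing 8 is the gradient-accumulation factor from the script comment):

# Same arithmetic as the GLOBAL_BATCH_SIZE line in the script
WORLD_SIZE = 4
MICRO_BATCH_SIZE = 2
TP_SIZE = 2
PP_SIZE = 2
GRAD_ACC_STEPS = 8  # the "* 8" in the shell expression

data_parallel_size = WORLD_SIZE // (TP_SIZE * PP_SIZE)                      # 4 // 4 = 1
global_batch_size = data_parallel_size * MICRO_BATCH_SIZE * GRAD_ACC_STEPS  # 1 * 2 * 8 = 16
print(global_batch_size)  # 16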

13416157913 commented on June 26, 2024

Can LLaMA2-7B run on 2 nodes with 2 GPUs each? On my side, 2 nodes with 4 GPUs each can't run it; I'm not sure whether the problem is in my script or something else.

Double-bear commented on June 26, 2024

Can LLaMA2-7B run on 2 nodes with 2 GPUs each? On my side, 2 nodes with 4 GPUs each can't run it; I'm not sure whether the problem is in my script or something else.

I haven't gotten it running yet; I'm still stuck on this error when loading the model.

li-yi-dong commented on June 26, 2024

It looks like vocab_size was 32000 when converting from HF to Megatron, but vocab_size was 32002 when the model was initialized for training.
Did you modify the tokenizer?

Double-bear commented on June 26, 2024

It looks like vocab_size was 32000 when converting from HF to Megatron, but vocab_size was 32002 when the model was initialized for training. Did you modify the tokenizer?

@li-yi-dong No, I haven't modified it. But I found the cause: my 7B HuggingFace checkpoint contained an extra add_token.json file that added a pad token, so the config was read with a vocab size of 32001, which the tokenizer then padded to 32002. After removing that add_token JSON it worked normally. However, I ran into another problem: when I set pp to 2, loading the converted model fails in load_state_dict because the 'lm_head' key is missing. The .pt file being read indeed does not contain it; it lives in a different .pt file. With tp=2, pp=2, only two of the four .pt files contain lm_head. Am I doing something wrong somewhere?
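
For anyone hitting the same mismatch, a quick way to spot it before converting is to compare the HF config with the tokenizer (a rough sketch using transformers; the path is my local one, and the padding rule assumed here is Megatron's usual rounding up to a multiple of make-vocab-size-divisible-by times the TP size):

from transformers import AutoConfig, AutoTokenizer

# Local HF LLaMA2-7B checkpoint path (from the conversion script above)
hf_path = "/code/xx/LLM_mine/model/LLama2/Llama2_7b/llama2_7b"

config = AutoConfig.from_pretrained(hf_path)
tokenizer = AutoTokenizer.from_pretrained(hf_path)

print("config.vocab_size:", config.vocab_size)  # 32000 for the stock LLaMA2 tokenizer
print("len(tokenizer):", len(tokenizer))        # larger if add_token.json adds extra tokens

# Assumed padding rule: round up to a multiple of
# make-vocab-size-divisible-by * tensor-model-parallel-size (1 * 2 = 2 here),
# which is how an odd vocab size like 32001 ends up as 32002 at training time.
multiple = 1 * 2
padded = ((len(tokenizer) + multiple - 1) // multiple) * multiple
print("padded vocab size:", padded)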

li-yi-dong commented on June 26, 2024

Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon.
We recommend using TP first: the memory savings are more significant and the extra overhead is smaller.

Double-bear commented on June 26, 2024

Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon. We recommend using TP first: the memory savings are more significant and the extra overhead is smaller.

Okay, got it. Thank you, and sorry for the trouble!
