
Comments (10)

li-yi-dong commented on June 26, 2024

May you post your launching script?

Double-bear commented on June 26, 2024

May you post your launching script?

Hi @li-yi-dong, this is my script:

python /code/xx/LLM_mine/reference/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py \
--load_path "/code/xx/LLM_mine/model/LLama2/Llama2_7b/llama2_7b" \
--save_path "/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b" \
--target_tensor_model_parallel_size 2 \
--target_pipeline_model_parallel_size 2 \
--target_data_parallel_size 1 \
--target_params_dtype "fp16" \
--make_vocab_size_divisible_by 1 \
--print-checkpoint-structure \
--megatron-path "/code/xx/LLM_mine/reference/Megatron-LLaMA"

And I also ran into another problem:

LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
loading checkpoint from root /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b, release: True, iteration: 0
Traceback (most recent call last):
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 112, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 384, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler, load_dir=args.checkpoint_dir_name)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 721, in load_checkpoint
    _load_common_from_state_dict(args, release, state_dict, model, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 856, in _load_common_from_state_dict
    model[0].load_state_dict(state_dict["model"], strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/distributed.py", line 71, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/module.py", line 199, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/llama_model.py", line 154, in load_state_dict
    self.lm_head.load_state_dict(state_dict['lm_head'], strict=strict)
KeyError: 'lm_head'

load_common_from_state_dict: Could not find arguments in the checkpoint ...

I restarted my task, and both of these issues occurred again.
I also used torch.load to load the state_dict of the Megatron checkpoint, and there is indeed no key named 'lm_head'.
I'm not quite sure where the problem lies.
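
Roughly, this is the check I did (a minimal sketch; it assumes the converted checkpoint uses the usual Megatron release/mp_rank_*/model_optim_rng.pt layout, which may differ depending on the converter version):

import glob
import torch

# Path from my conversion script above
ckpt_root = "/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"

# Print the top-level model keys of every shard to see which ones contain 'lm_head'
for path in sorted(glob.glob(f"{ckpt_root}/release/mp_rank_*/model_optim_rng.pt")):
    state_dict = torch.load(path, map_location="cpu")
    model_keys = list(state_dict.get("model", {}).keys())
    print(path)
    print("  model keys:", model_keys)
    print("  has 'lm_head':", "lm_head" in model_keys)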

li-yi-dong commented on June 26, 2024

May you post your launching script?

The script that launches the training.

Double-bear commented on June 26, 2024

May you post your launching script?

The script that launches the training.

Sorry @li-yi-dong, this is my training script:

#!/bin/bash

DATASET_1="/code/konglingkai/02LLM/data/megatron/llama2_wudao/pretrain_6_text_document"
# DATASET_2="<PATH TO THE SECOND DATASET>"
# DATASET_3="<PATH TO THE THIRD DATASET>"
DATASET="1.0 ${DATASET_1}"
# 0.3 ${DATASET_2} 0.5 ${DATASET_3}

TP_SIZE=2
PP_SIZE=2
WORLD_SIZE=4
MICRO_BATCH_SIZE=2
# The trailing 8 is the number of gradient-accumulation micro-steps (see the worked example after the script)
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
# GLOBAL_BATCH_SIZE=128

JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

LOAD_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
SAVE_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2_7b"
TOKENIZER_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
TENSORBOARD_DIR="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2/tensorboard"

TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1

# Setting --tensorboard-queue-size to 1 significantly slows down the training
options=" \
    --finetune \
    --sequence-parallel \
        --tensor-model-parallel-size ${TP_SIZE} \
        --pipeline-model-parallel-size ${PP_SIZE} \
    --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --no-position-embedding \
        --use-rotary-position-embeddings \
        --swiglu \
        --ffn-hidden-size 11008 \
        --disable-bias-linear \
        --RMSNorm \
        --layernorm-epsilon 1e-6 \
        --causal-lm \
    --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path $TOKENIZER_PATH \
        --make-vocab-size-divisible-by 1 \
    --init-method-std 0.01 \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --train-iters ${TRAIN_ITERS} \
    --lr 6.0e-5 \
        --lr-decay-iters 10 \
        --lr-warmup-iters 5 \
        --min-lr 6.0e-6 \
        --override-opt_param-scheduler \
        --lr-decay-style cosine \
    --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --clip-grad 1.0 \
        --weight-decay 0.1 \
        --overlapped-distributed-optimizer \
        --reduce-bucket-size=2e8 \
        --no-gradient-accumulation-fusion \
    --dataloader-type cyclic \
        --data-impl mmap \
        --data-path ${DATASET} \
        --split 98,2,0 \
    --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
    --save-interval ${SAVE_INTERVAL} \
        --save ${SAVE_CHECKPOINT_PATH} \
    --load ${LOAD_CHECKPOINT_PATH} \
        --no-load-optim \
    --log-interval ${LOG_INTERVAL} \
    --tensorboard-dir ${TENSORBOARD_DIR} \
        --tensorboard-queue-size 1000 \
        --log-timers-to-tensorboard \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
    --job-name ${JOB_NAME} \
    --bf16 \
    --recompute-activations \
        --recompute-granularity selective "

torchrun --nnodes 2 --nproc_per_node 2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400 /code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py ${options}
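
For clarity, the GLOBAL_BATCH_SIZE expression above works out like this with these settings (a small worked example in Python of the same arithmetic; the trailing 8 is the gradient-accumulation factor from the script comment):

# Same arithmetic as the GLOBAL_BATCH_SIZE line in the script
WORLD_SIZE = 4
MICRO_BATCH_SIZE = 2
TP_SIZE = 2
PP_SIZE = 2
GRAD_ACC_STEPS = 8  # the "* 8" in the shell expression

data_parallel_size = WORLD_SIZE // (TP_SIZE * PP_SIZE)                      # 4 // 4 = 1
global_batch_size = data_parallel_size * MICRO_BATCH_SIZE * GRAD_ACC_STEPS  # 1 * 2 * 8 = 16
print(global_batch_size)  # 16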

13416157913 commented on June 26, 2024

Can LLaMA2-7B run on 2 nodes with 2 GPUs each? On my side, 2 nodes with 4 GPUs each can't run it; I'm not sure whether the problem is in my script or something else.

Double-bear commented on June 26, 2024

Can LLaMA2-7B run on 2 nodes with 2 GPUs each? On my side, 2 nodes with 4 GPUs each can't run it; I'm not sure whether the problem is in my script or something else.

I haven't gotten it running yet; I'm still stuck on this error when loading the model.

li-yi-dong commented on June 26, 2024

It looks like vocab_size was 32000 when converting from HF to Megatron, but vocab_size was 32002 when the model was initialized for training.
Did you modify the tokenizer?

Double-bear commented on June 26, 2024

It looks like vocab_size was 32000 when converting from HF to Megatron, but vocab_size was 32002 when the model was initialized for training. Did you modify the tokenizer?

@li-yi-dong No, I haven't modified it. But I found the cause: my 7B HuggingFace checkpoint contained an extra add_token.json file that added a pad token, so the config was read with a vocab size of 32001, which the tokenizer then padded to 32002. After removing that add_token JSON it worked normally. However, I ran into another problem: when I set pp to 2, loading the converted model fails in load_state_dict because the 'lm_head' key is missing. The .pt file being read indeed does not contain it; it lives in a different .pt file. With tp=2, pp=2, only two of the four .pt files contain lm_head. Am I doing something wrong somewhere?
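
For anyone hitting the same mismatch, a quick way to spot it before converting is to compare the HF config with the tokenizer (a rough sketch using transformers; the path is my local one, and the padding rule assumed here is Megatron's usual rounding up to a multiple of make-vocab-size-divisible-by times the TP size):

from transformers import AutoConfig, AutoTokenizer

# Local HF LLaMA2-7B checkpoint path (from the conversion script above)
hf_path = "/code/xx/LLM_mine/model/LLama2/Llama2_7b/llama2_7b"

config = AutoConfig.from_pretrained(hf_path)
tokenizer = AutoTokenizer.from_pretrained(hf_path)

print("config.vocab_size:", config.vocab_size)  # 32000 for the stock LLaMA2 tokenizer
print("len(tokenizer):", len(tokenizer))        # larger if add_token.json adds extra tokens

# Assumed padding rule: round up to a multiple of
# make-vocab-size-divisible-by * tensor-model-parallel-size (1 * 2 = 2 here),
# which is how an odd vocab size like 32001 ends up as 32002 at training time.
multiple = 1 * 2
padded = ((len(tokenizer) + multiple - 1) // multiple) * multiple
print("padded vocab size:", padded)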

li-yi-dong commented on June 26, 2024

Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon.
We recommend using TP first: the memory savings are more significant and the extra overhead is smaller.

Double-bear commented on June 26, 2024

Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon. We recommend using TP first: the memory savings are more significant and the extra overhead is smaller.

Okay, got it. Thank you, and sorry for the trouble!
