Comments (10)
Could you post your launching script?
from megatron-llama.
Could you post your launching script?
Hi @li-yi-dong, this is my script:
python /code/xx/LLM_mine/reference/Megatron-LLaMA/tools/checkpoint_conversion/llama_checkpoint_conversion.py \
--load_path "/code/xx/LLM_mine/model/LLama2/Llama2_7b/llama2_7b" \
--save_path "/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b" \
--target_tensor_model_parallel_size 2 \
--target_pipeline_model_parallel_size 2 \
--target_data_parallel_size 1 \
--target_params_dtype "fp16" \
--make_vocab_size_divisible_by 1 \
--print-checkpoint-structure \
--megatron-path "/code/xx/LLM_mine/reference/Megatron-LLaMA"
I also ran into another problem:
LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
LocalFileSystem: /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b
loading checkpoint from root /code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b, release: True, iteration: 0
Traceback (most recent call last):
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 112, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/training.py", line 384, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler, load_dir=args.checkpoint_dir_name)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 721, in load_checkpoint
    _load_common_from_state_dict(args, release, state_dict, model, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/checkpointing.py", line 856, in _load_common_from_state_dict
    model[0].load_state_dict(state_dict["model"], strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/distributed.py", line 71, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/module.py", line 199, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/code/xiongxiong/LLM_mine/reference/Megatron-LLaMA/megatron/model/llama_model.py", line 154, in load_state_dict
    self.lm_head.load_state_dict(state_dict['lm_head'], strict=strict)
KeyError: 'lm_head'
load_common_from_state_dict: Could not find arguments in the checkpoint ...
I restarted my task and both of these issues occurred again (the warning above was followed by the same traceback).
I also used torch.load to inspect the Megatron checkpoint's state_dict, and indeed there is no key named 'lm_head'.
I'm not quite sure where the problem lies.
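For reference, a minimal sketch of that kind of inspection (the shard path is an assumption based on Megatron's usual release/mp_rank_* checkpoint layout; adjust it to your converted output):

import torch

# Assumed shard path: Megatron-style checkpoints are usually saved as
# <load_dir>/release/mp_rank_XX[_YYY]/model_optim_rng.pt, one per (TP, PP) rank.
path = "model/llama2_7b/release/mp_rank_00_000/model_optim_rng.pt"
state = torch.load(path, map_location="cpu")
print(list(state.keys()))           # top-level keys, e.g. 'model', 'args', ...
print(list(state["model"].keys()))  # check whether 'lm_head' is present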
from megatron-llama.
Could you post your launching script?
The script that launches the training.
from megatron-llama.
Could you post your launching script?
The script that launches the training.
Sorry @li-yi-dong, this is my training script:
#!/bin/bash
DATASET_1="/code/konglingkai/02LLM/data/megatron/llama2_wudao/pretrain_6_text_document"
# DATASET_2="<PATH TO THE SECOND DATASET>"
# DATASET_3="<PATH TO THE THIRD DATASET>"
DATASET="1.0 ${DATASET_1}"
# 0.3 ${DATASET_2} 0.5 ${DATASET_3}
TP_SIZE=2
PP_SIZE=2
WORLD_SIZE=4
MICRO_BATCH_SIZE=2
# The trailing 8 is the number of gradient-accumulation micro-steps
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
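# e.g. WORLD_SIZE=4, MICRO_BATCH_SIZE=2, TP=2, PP=2: DP = 4/(2*2) = 1, so GLOBAL_BATCH_SIZE = 1*2*8 = 16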
# GLOBAL_BATCH_SIZE=128
JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"
LOAD_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
SAVE_CHECKPOINT_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2_7b"
TOKENIZER_PATH="/code/xx/LLM_mine/reference/Megatron-LLaMA/model/llama2_7b"
TENSORBOARD_DIR="/code/xx/LLM_mine/reference/Megatron-LLaMA/output/llama2/tensorboard"
TRAIN_ITERS=1000
EVAL_ITERS=10
EVAL_INTERVAL=1000
SAVE_INTERVAL=100
LOG_INTERVAL=1
# Setting --tensorboard-queue-size to 1 significantly slows down the training
options=" \
--finetune \
--sequence-parallel \
--tensor-model-parallel-size ${TP_SIZE} \
--pipeline-model-parallel-size ${PP_SIZE} \
--num-layers 32 \
--hidden-size 4096 \
--num-attention-heads 32 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--no-position-embedding \
--use-rotary-position-embeddings \
--swiglu \
--ffn-hidden-size 11008 \
--disable-bias-linear \
--RMSNorm \
--layernorm-epsilon 1e-6 \
--causal-lm \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_PATH \
--make-vocab-size-divisible-by 1 \
--init-method-std 0.01 \
--micro-batch-size ${MICRO_BATCH_SIZE} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--train-iters ${TRAIN_ITERS} \
--lr 6.0e-5 \
--lr-decay-iters 10 \
--lr-warmup-iters 5 \
--min-lr 6.0e-6 \
--override-opt_param-scheduler \
--lr-decay-style cosine \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--overlapped-distributed-optimizer \
--reduce-bucket-size=2e8 \
--no-gradient-accumulation-fusion \
--dataloader-type cyclic \
--data-impl mmap \
--data-path ${DATASET} \
--split 98,2,0 \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--save ${SAVE_CHECKPOINT_PATH} \
--load ${LOAD_CHECKPOINT_PATH} \
--no-load-optim \
--log-interval ${LOG_INTERVAL} \
--tensorboard-dir ${TENSORBOARD_DIR} \
--tensorboard-queue-size 1000 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--job-name ${JOB_NAME} \
--bf16 \
--recompute-activations \
--recompute-granularity selective "
torchrun --nnodes 2 --nproc_per_node 2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400 /code/xx/LLM_mine/reference/Megatron-LLaMA/pretrain_llama.py ${options}
from megatron-llama.
With 2 nodes and 2 GPUs per node, can LLaMA2-7B actually be run? On my side it won't run even with 2 nodes and 4 GPUs per node; I'm not sure whether my script has a problem or it's something else.
from megatron-llama.
With 2 nodes and 2 GPUs per node, can LLaMA2-7B actually be run? On my side it won't run even with 2 nodes and 4 GPUs per node; I'm not sure whether my script has a problem or it's something else.
I haven't gotten it running yet either; I'm still stuck on this error when loading the model.
from megatron-llama.
It looks like vocab_size was 32000 when converting from HF to Megatron, but 32002 when the model was initialized for training.
Did you modify the tokenizer?
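For context, a minimal sketch of the padding rule that can produce 32002 here (this mirrors the logic of Megatron's _vocab_size_with_padding, which rounds the tokenizer vocab up to a multiple of make-vocab-size-divisible-by times the TP size):

def vocab_size_with_padding(orig_vocab_size, divisible_by, tp_size):
    # Mirrors Megatron's _vocab_size_with_padding: pad the tokenizer vocab
    # until it is a multiple of (make-vocab-size-divisible-by * TP size).
    multiple = divisible_by * tp_size
    after = orig_vocab_size
    while after % multiple != 0:
        after += 1
    return after

print(vocab_size_with_padding(32000, 1, 2))  # 32000: already a multiple of 2
print(vocab_size_with_padding(32001, 1, 2))  # 32002: one padding token added

So one extra token on top of the base 32000 vocab, combined with TP=2 padding, lands exactly on 32002.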
from megatron-llama.
It looks like vocab_size was 32000 when converting from HF to Megatron, but 32002 when the model was initialized for training. Did you modify the tokenizer?
@li-yi-dong I didn't modify it, but I found the cause: my 7B HuggingFace directory contained an extra add_token.json file that added a pad token, so reading the config gave a vocab size of 32001, which the tokenizer then padded to 32002. After removing that add_token JSON everything worked. However, I found another problem: when I set PP to 2, load_state_dict cannot find the 'lm_head' key after conversion. The .pt file being read indeed does not contain it; the key sits in different .pt files instead. With tp=2 and pp=2, only two of the four .pt files have lm_head. Could I be doing something wrong somewhere?
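For reference, a minimal sketch of that shard scan (paths are assumptions based on Megatron's release/mp_rank_* layout; with PP enabled, lm_head normally lives only on the last pipeline stage, so only that stage's TP shards would contain the key):

import glob
import torch

# Assumed layout: with tp=2, pp=2 the converter writes four shards,
# release/mp_rank_00_000 ... mp_rank_01_001; only the last PP stage's
# two shards are expected to carry lm_head.
for path in sorted(glob.glob("model/llama2_7b/release/mp_rank_*/model_optim_rng.pt")):
    state = torch.load(path, map_location="cpu")
    print(path, "lm_head" in state.get("model", {}))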
from megatron-llama.
Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon.
We recommend using TP first: it saves more GPU memory with less extra overhead.
from megatron-llama.
Sorry, PP has not been fully tested yet and still has some issues; we will fix them soon. We recommend using TP first: it saves more GPU memory with less extra overhead.
OK, got it. Thank you, and sorry for the trouble!
from megatron-llama.
Related Issues (20)
- fp16 support issue HOT 1
- Every gradient-accumulation backward step requires communication HOT 5
- CUDA_DEVICE_MAX_CONNECTIONS setting issue
- Is sequence parallel supported? HOT 2
- Does Megatron-LLaMA currently support training LLaMA2-70B? HOT 1
- Can INT4-quantized models be used with Megatron-LLaMA? HOT 1
- Small bug in the HF weight-conversion code
- Request for a serving tutorial and example code HOT 1
- Could you share the parameters behind the 32-node throughput in the README? We have not reproduced it HOT 5
- Question about the MLP inside LLaMA's decoder layers HOT 4
- Unable to import Megatron HOT 8
- Converting Megatron-LM weights to HF format HOT 4
- Question about how labels and logits are aligned in LLaMAModel._causal_lm_process HOT 3
- Getting forward() missing 1 required positional argument: 'memory_efficient'
- Question about grad_norm accuracy when using the distributed optimizer HOT 1
- Weight conversion hits: Zarr-based strategies will not be registered because of missing packages
- Attention weight conversion issue HOT 2
- Is training a small LLaMA model (e.g. 1B) from scratch supported? HOT 1
- sh LLaMA2_7B_standalone.sh
- About batch_size