Comments (5)
Could you share the configuration you used to launch training?
Take a look at Megatron's documentation for this option: `--finetune` only changes checkpoint loading, so that only the model weights are loaded, not the optimizer state and so on.
from megatron-llama.
The script below is the configuration we use; it runs fine on a single machine with 4 GPUs:
#!/bin/bash
DATASET_1="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET_2="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET_3="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET="1.0 ${DATASET_1}"
TP_SIZE=2
PP_SIZE=1
WORLD_SIZE=8
MICRO_BATCH_SIZE=4
# The final factor (8) is the number of gradient-accumulation micro-steps
GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
GLOBAL_BATCH_SIZE=128  # overrides the computed value above
JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"
LOAD_CHECKPOINT_PATH="/xxx/megatron-llama-2-7b-checkpoint"
SAVE_CHECKPOINT_PATH="/xxx/megatron-llama-2-7b-checkpoint"
TOKENIZER_PATH="/xxx/megatron-llama-2-7b-checkpoint"
TENSORBOARD_DIR="/xxx/Megatron-LLaMA/tensorboard_dir"
TRAIN_ITERS=4000
EVAL_ITERS=10
EVAL_INTERVAL=4000
SAVE_INTERVAL=4000
LOG_INTERVAL=1
# Setting --tensorboard-queue-size to 1 significantly slows down the training
options="
--finetune
--sequence-parallel
--tensor-model-parallel-size ${TP_SIZE}
--pipeline-model-parallel-size ${PP_SIZE}
--num-layers 32
--hidden-size 4096
--num-attention-heads 32
--seq-length 2048
--max-position-embeddings 4096
--no-position-embedding
--use-rotary-position-embeddings
--swiglu
--ffn-hidden-size 11008
--disable-bias-linear
--RMSNorm
--layernorm-epsilon 1e-6
--causal-lm
--tokenizer-type PretrainedFromHF
--tokenizer-name-or-path $TOKENIZER_PATH
--make-vocab-size-divisible-by 1
--init-method-std 0.01
--micro-batch-size ${MICRO_BATCH_SIZE}
--global-batch-size ${GLOBAL_BATCH_SIZE}
--train-iters ${TRAIN_ITERS}
--lr 6.0e-5
--lr-decay-iters 10
--lr-warmup-iters 5
--min-lr 6.0e-6
--override-opt_param-scheduler
--lr-decay-style cosine
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--weight-decay 0.1
--overlapped-distributed-optimizer
--reduce-bucket-size=2e8
--no-gradient-accumulation-fusion
--dataloader-type cyclic
--data-impl mmap
--data-path ${DATASET}
--split 98,2,0
--eval-interval ${EVAL_INTERVAL}
--eval-iters ${EVAL_ITERS}
--save-interval ${SAVE_INTERVAL}
--save ${SAVE_CHECKPOINT_PATH}
--load ${LOAD_CHECKPOINT_PATH}
--no-load-optim
--log-interval ${LOG_INTERVAL}
--tensorboard-dir ${TENSORBOARD_DIR}
--tensorboard-queue-size 1000
--log-timers-to-tensorboard
--log-batch-size-to-tensorboard
--log-validation-ppl-to-tensorboard
--job-name ${JOB_NAME}
--bf16
--recompute-activations
--recompute-granularity selective
--use-flash-attn
"
export CUDA_VISIBLE_DEVICES=4,5,6,7
torchrun --nproc_per_node=4 --master_port=29501 pretrain_llama.py ${options}
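For reference, the batch-size arithmetic in the script above can be sketched on its own. The values are copied from the script; the gradient-accumulation factor of 8 is the assumption stated in the script's comment:

```shell
#!/bin/bash
# Sketch of how GLOBAL_BATCH_SIZE in the script above is derived.
WORLD_SIZE=8
TP_SIZE=2
PP_SIZE=1
MICRO_BATCH_SIZE=4
GA_STEPS=8   # gradient-accumulation micro-steps (assumption from the comment)

# GPUs not consumed by tensor/pipeline parallelism form data-parallel replicas.
DP_SIZE=$((WORLD_SIZE / (TP_SIZE * PP_SIZE)))                 # 4 replicas
GLOBAL_BATCH_SIZE=$((DP_SIZE * MICRO_BATCH_SIZE * GA_STEPS))
echo "global batch size: ${GLOBAL_BATCH_SIZE}"                # 128
```

With these values the computed global batch size matches the hard-coded 128 in the script.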
With the distributed optimizer, the memory it occupies per GPU shrinks as the data-parallel degree grows.
With TP=2 on two GPUs there is no data parallelism, so the optimizer state cannot be sharded either; it will occupy (7B/2) x 4 x 3 = 42 GB of memory.
With 80 GB of GPU memory, a 7B model should be trainable without TP. If you only have a single 8-GPU machine, we suggest using Megatron's native optimizer; training may be somewhat faster.
Also, we haven't actually run a 7B model ourselves. You're welcome to submit a PR sharing your 7B training setup.
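The claim that the distributed optimizer's per-GPU footprint shrinks with data parallelism can be sketched with a rough model. This assumes 12 bytes of fp32 optimizer state per parameter, sharded evenly across data-parallel ranks; the exact sharding in Megatron-LLaMA may differ:

```shell
#!/bin/bash
# Rough model (assumption): 12 bytes/param of fp32 optimizer state
# (master weights + Adam momentum + Adam variance), split evenly
# across data-parallel ranks by the distributed optimizer.
PARAMS=7000000000   # 7B model
TP=2                # tensor parallelism halves the params held per GPU
for DP in 1 2 4; do
  BYTES=$((PARAMS / TP * 12 / DP))
  # Integer GB, truncated (DP=4 is exactly 10.5 GB).
  echo "DP=${DP}: $((BYTES / 1000000000)) GB of optimizer state per GPU"
done
```

At DP=1 this reproduces the 42 GB figure above; doubling the data-parallel degree halves the per-GPU optimizer state.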
Why is GPU memory usage around 75 GB per card when I run on a single machine with 4 GPUs, rather than the 42 GB you calculated?
| 4 N/A N/A 4037835 C .../anaconda3/bin/python 75721MiB |
| 5 N/A N/A 4037836 C .../anaconda3/bin/python 76173MiB |
| 6 N/A N/A 4037837 C .../anaconda3/bin/python 75891MiB |
| 7 N/A N/A 4037838 C .../anaconda3/bin/python 76087MiB |
With TP=2 on two GPUs there is no data parallelism, so the optimizer state cannot be sharded either; it will occupy (7B/2) x 4 x 3 = 42 GB of memory.
@li-yi-dong how was this formula derived? (7B/2): does dividing by 2 reflect the model-parallel degree of 2? And what do the 4 and 3 that follow stand for?
x4: does that mean full precision takes 4 bytes per value, hence the factor of 4? (But the script sets half precision with --bf16.)
x3: I don't understand this factor at all!
@li-yi-dong could you share the reasoning behind this formula?
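One plausible reading of (7B/2) x 4 x 3 (my interpretation, not confirmed by the author): even under --bf16, a mixed-precision Adam optimizer typically keeps three fp32 tensors per parameter, namely an fp32 master copy of the weights plus Adam's momentum and variance. That gives 4 bytes x 3 copies over the 7B/2 parameters held by each TP=2 shard:

```shell
#!/bin/bash
# Hypothetical breakdown of (7B/2) x 4 x 3 = 42 GB (an interpretation,
# not confirmed by the author).
PARAMS_PER_GPU=$((7000000000 / 2))  # TP=2: each GPU holds half the params
BYTES_PER_ELEM=4                    # fp32: the optimizer keeps fp32 state
                                    # even when training with --bf16
STATE_COPIES=3                      # master weights + Adam m + Adam v
TOTAL_GB=$((PARAMS_PER_GPU * BYTES_PER_ELEM * STATE_COPIES / 1000000000))
echo "${TOTAL_GB} GB"               # 42 GB
```

Note this counts only optimizer state; the bf16 weights, gradients, and activations come on top of it, which may be one reason observed usage exceeds 42 GB.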
Related Issues (20)
- fp16 support issue HOT 1
- Every gradient-accumulation step's backward requires communication HOT 5
- CUDA_DEVICE_MAX_CONNECTIONS setting issue
- Is sequence parallel supported? HOT 2
- Does Megatron-LLaMA currently support training LLaMA2-70B? HOT 1
- Can INT4-quantized models be supported by Megatron-LLaMA? HOT 1
- Minor bug in the HF weight-conversion code
- Requesting a serving tutorial with example code HOT 1
- Could you provide the parameters behind the 32-machine throughput in the README? We haven't reproduced it HOT 5
- Question about the MLP inside LLaMA's decoder layer HOT 4
- Unable to import Megatron HOT 8
- Converting Megatron-LM weights to HF format HOT 4
- Question about the label/logit alignment in LLaMAModel._causal_lm_process HOT 3
- Getting: forward() missing 1 required positional argument: 'memory_efficient'
- Question about grad_norm accuracy when using the distributed optimizer HOT 1
- Problem during weight conversion: Zarr-based strategies will not be registered because of missing packages
- Attention weight conversion issue HOT 2
- Is training a small LLaMA model (e.g. 1B) from scratch supported? HOT 1
- sh LLaMA2_7B_standalone.sh
- About batch_size