
The llama2-7b model fails with out-of-memory on a single machine with 2 GPUs, and also fails on 2 nodes with 8 GPUs total (4 per node). Is this normal? (Setup: A800 x 8 GPUs x 80G) about megatron-llama OPEN

alibaba commented on September 27, 2024
The llama2-7b model fails with out-of-memory on a single machine with 2 GPUs, and also fails on 2 nodes with 8 GPUs total (4 per node). Is this normal? (Setup: A800 x 8 GPUs x 80G)


Comments (5)

li-yi-dong commented on September 27, 2024

Could you share the configuration you used to launch training?

Take a look at Megatron's documentation for these options: --finetune only affects checkpoint loading, where it loads the model weights but not the optimizer state, etc.


13416157913 commented on September 27, 2024

The script below is my configuration; it runs fine on a single machine with 4 GPUs:
#!/bin/bash

DATASET_1="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET_2="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET_3="/xxx/Megatron-LLaMA/corpus_indexed/corpus_indexed_text_document"
DATASET="1.0 ${DATASET_1}"

TP_SIZE=2
PP_SIZE=1
WORLD_SIZE=8
MICRO_BATCH_SIZE=4

# The int is the number of micro steps of gradient accumulation

GLOBAL_BATCH_SIZE=$((($WORLD_SIZE * $MICRO_BATCH_SIZE) / ($TP_SIZE * $PP_SIZE) * 8))
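# The fixed assignment below overrides this computed value (both come out to 128 here).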

GLOBAL_BATCH_SIZE=128

JOB_NAME="LLaMA_tp${TP_SIZE}_pp${PP_SIZE}_mbs${MICRO_BATCH_SIZE}_gpus${WORLD_SIZE}"

LOAD_CHECKPOINT_PATH="/xxx/megatron-llama-2-7b-checkpoint"
SAVE_CHECKPOINT_PATH="/xxx/megatron-llama-2-7b-checkpoint"
TOKENIZER_PATH="/xxx/megatron-llama-2-7b-checkpoint"
TENSORBOARD_DIR="/xxx/Megatron-LLaMA/tensorboard_dir"

TRAIN_ITERS=4000
EVAL_ITERS=10
EVAL_INTERVAL=4000
SAVE_INTERVAL=4000
LOG_INTERVAL=1

# Setting --tensorboard-queue-size to 1 significantly slows down the training

options="
--finetune
--sequence-parallel
--tensor-model-parallel-size ${TP_SIZE}
--pipeline-model-parallel-size ${PP_SIZE}
--num-layers 32
--hidden-size 4096
--num-attention-heads 32
--seq-length 2048
--max-position-embeddings 4096
--no-position-embedding
--use-rotary-position-embeddings
--swiglu
--ffn-hidden-size 11008
--disable-bias-linear
--RMSNorm
--layernorm-epsilon 1e-6
--causal-lm
--tokenizer-type PretrainedFromHF
--tokenizer-name-or-path $TOKENIZER_PATH
--make-vocab-size-divisible-by 1
--init-method-std 0.01
--micro-batch-size ${MICRO_BATCH_SIZE}
--global-batch-size ${GLOBAL_BATCH_SIZE}
--train-iters ${TRAIN_ITERS}
--lr 6.0e-5
--lr-decay-iters 10
--lr-warmup-iters 5
--min-lr 6.0e-6
--override-opt_param-scheduler
--lr-decay-style cosine
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--weight-decay 0.1
--overlapped-distributed-optimizer
--reduce-bucket-size=2e8
--no-gradient-accumulation-fusion
--dataloader-type cyclic
--data-impl mmap
--data-path ${DATASET}
--split 98,2,0
--eval-interval ${EVAL_INTERVAL}
--eval-iters ${EVAL_ITERS}
--save-interval ${SAVE_INTERVAL}
--save ${SAVE_CHECKPOINT_PATH}
--load ${LOAD_CHECKPOINT_PATH}
--no-load-optim
--log-interval ${LOG_INTERVAL}
--tensorboard-dir ${TENSORBOARD_DIR}
--tensorboard-queue-size 1000
--log-timers-to-tensorboard
--log-batch-size-to-tensorboard
--log-validation-ppl-to-tensorboard
--job-name ${JOB_NAME}
--bf16
--recompute-activations
--recompute-granularity selective
--use-flash-attn
"

export CUDA_VISIBLE_DEVICES=4,5,6,7
torchrun --nproc_per_node=4 --master_port=29501 pretrain_llama.py ${options}
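
For the 2-node x 4-GPU setup described in the original question, the same script would typically be launched with torchrun's multi-node flags. The sketch below is only an illustration; the master address, port and rank values are placeholders and are not taken from this thread.

# Hypothetical 2-node x 4-GPU launch: run on each node, with NODE_RANK set per node.
export CUDA_VISIBLE_DEVICES=0,1,2,3
MASTER_ADDR=10.0.0.1     # placeholder: IP of the rank-0 node
MASTER_PORT=29501
NODE_RANK=0              # 0 on the first node, 1 on the second

torchrun --nnodes=2 --nproc_per_node=4 \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    pretrain_llama.py ${options}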


li-yi-dong commented on September 27, 2024

With the distributed optimizer, the memory it uses per GPU shrinks as the data-parallel degree grows.

With TP=2 on two GPUs there is no data parallelism, so the optimizer state cannot be sharded either; it will occupy (7B/2) x 4 x 3 = 42GB of GPU memory.
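
One common reading of this estimate, stated here as an assumption rather than a confirmed breakdown: the 7B parameters are split across TP=2 ranks, optimizer state is kept in fp32 (4 bytes per value), and Adam needs three such tensors per parameter (fp32 master weights, momentum, variance). This counts only optimizer state; the bf16 weights, gradients and activations come on top.

# Back-of-the-envelope check of (7B/2) x 4 x 3; the meaning of the factors is
# an assumption (fp32 master weights + Adam momentum + Adam variance).
PARAMS=7000000000        # ~7B parameters
TP=2                     # tensor-parallel degree
BYTES_PER_VALUE=4        # fp32
STATE_COPIES=3           # master weights, exp_avg, exp_avg_sq
echo "$(( PARAMS / TP * BYTES_PER_VALUE * STATE_COPIES / 10**9 )) GB per rank"   # prints: 42 GB per rank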

With 80GB per GPU, a 7B model should be trainable without TP at all. If you only have a single 8-GPU machine, I'd suggest using Megatron's native optimizer; training will likely be faster.

Also, we haven't actually run the 7B model ourselves, so you're welcome to submit a PR sharing your 7B training setup.


13416157913 commented on September 27, 2024

Why is GPU memory usage around 75GB per card when I run on a single machine with 4 GPUs, rather than the 42GB you calculated?
| 4 N/A N/A 4037835 C .../anaconda3/bin/python 75721MiB |
| 5 N/A N/A 4037836 C .../anaconda3/bin/python 76173MiB |
| 6 N/A N/A 4037837 C .../anaconda3/bin/python 75891MiB |
| 7 N/A N/A 4037838 C .../anaconda3/bin/python 76087MiB |


13416157913 commented on September 27, 2024

With TP=2 on two GPUs there is no data parallelism, so the optimizer state cannot be sharded either; it will occupy (7B/2) x 4 x 3 = 42GB of GPU memory.
@li-yi-dong Where does this formula come from? (7B/2): does that mean the tensor-parallel degree is 2, so we divide by 2? And what do the following x 4 and x 3 stand for?
x 4: does that mean full precision takes 4 bytes per value, hence multiplying by 4? (But the script sets half precision with --bf16.)
x 3: I don't understand this one at all!
@li-yi-dong Could you explain how this formula is derived?

