Comments (11)
For me it also worked with batch size 4 on two A100 80GB GPUs with sequence length 512.
I can't say for certain, but the first thing you should try is setting the batch sizes to 1 and gradient accumulation to 1 as well (i.e. --per_device_train_batch_size 1, --per_device_eval_batch_size 1, --gradient_accumulation_steps 1). That is the configuration with the smallest memory footprint that requires no code changes. Start there.
I encountered a CUDA OOM on a single A100 80G using your training code. Can I fix this by changing anything?
Can you fix the OOM problem? I ran into it as well, using Python 3.8 and PyTorch 1.13.1 on a single A100 80G.
I have the same problem.
Just to swoop in here: if you're using 6.7B or larger, one GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 params * 4 bytes * 3 = 80.4 GB just to store the parameters and optimizer states (even in mixed precision, you want to keep those in fp32).
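A minimal sketch of that arithmetic (assuming a 6.7B-parameter model and counting only the fp32 parameters plus the two Adam optimizer states; gradients and activations are extra):

# Back-of-the-envelope memory estimate from the comment above:
# three fp32 copies per parameter (weights + two Adam moments).
num_params = 6.7e9      # 6.7B-class model
bytes_per_copy = 4      # fp32
copies = 3              # parameters + Adam first and second moments

total_gb = num_params * bytes_per_copy * copies / 1e9
print(f"~{total_gb:.1f} GB")  # ~80.4 GB, already over a single 80 GB A100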
Thanks @dlwh. After switching to two A100 80GB GPUs it worked for me.
Regarding "At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4 GB to store the parameters and optimizer states": why *3?
Adam, the default optimizer, stores two moment estimates that are each the same size as the parameters themselves, so you end up with essentially three fp32 copies of the parameters... plus a fourth, half-sized copy for the gradients (so 14 bytes per parameter in total), and then memory for activations, which can range from "not very much" to many extra GB.
Two A100s or ZeRO CPU offload should fix it right up.
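A sketch of that per-parameter accounting (assuming fp32 weights and Adam states plus fp16 gradients, activations not counted; the 7B figure is just for illustration):

# Per-parameter memory for mixed-precision Adam fine-tuning, as described above.
num_params = 7e9        # LLaMA-7B, for illustration

fp32_weights  = 4       # master copy of the parameters
adam_moment_1 = 4       # first-moment estimate (exp_avg)
adam_moment_2 = 4       # second-moment estimate (exp_avg_sq)
fp16_grads    = 2       # half-precision gradients

bytes_per_param = fp32_weights + adam_moment_1 + adam_moment_2 + fp16_grads  # 14
print(f"~{num_params * bytes_per_param / 1e9:.0f} GB before activations")    # ~98 GB
# ~98 GB is why a single 80 GB A100 OOMs; sharding across 2+ GPUs (FSDP/ZeRO)
# or offloading the optimizer states to CPU gets it under the limit.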
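And for the ZeRO CPU offload route, a minimal config sketch (the field names follow DeepSpeed's config schema; the values and the ds_config.json filename are only illustrative, not the repo's official setup):

import json

# ZeRO stage 2 with the Adam states kept in (pinned) CPU memory.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

With the Hugging Face Trainer that train.py builds on, such a file is typically passed via the --deepspeed argument in place of the FSDP flags.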
Thanks a lot, 2+ GPUs worked fine.
What batch size did those of you who got it working with 2 A100 80GB GPUs use? I have 4 GPUs and it still fails. The reason I'm asking is that mine only works with batch_size set to 1 on 7B, yet the README says:
Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode (per_device_train_batch_size is 4 here; how did they manage that?):
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
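For reference, the effective (global) batch size in that command is the per-device batch times the gradient-accumulation steps times the number of GPUs, so an OOM can usually be traded for wall-clock time by shrinking the per-device batch and growing the accumulation steps. A quick sanity check of that arithmetic (the second configuration is just a hypothetical lower-memory equivalent):

# Effective (global) batch size of the README command above.
per_device_batch = 4
grad_accum_steps = 8
num_gpus = 4
print(per_device_batch * grad_accum_steps * num_gpus)   # 128 sequences per optimizer step

# Hypothetical lower-memory variant with the same effective batch size:
# less activation memory per GPU, more (slower) accumulation steps.
per_device_batch = 1
grad_accum_steps = 32
print(per_device_batch * grad_accum_steps * num_gpus)   # still 128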
That's really strange; the default value for model_max_length is already 512. With batch_size 1 it unfortunately takes forever.
Related Issues (20)
- DeepSpeed compilation (cpu_adam issue)
- incorrect model_max_length HOT 1
- BUG: "labels" information leakage into "input_ids" fields - incorrect attention_mask HOT 1
- [Windows]: RuntimeError: Distributed package doesn't have NCCL built in
- How to provide extra context as a PDF file?
- Python bindings and text classification & summarisation tasks
- Question about padding the input sequence HOT 3
- Wondering how to run inference after finetuning.
- How to finetune with my own private data and then build a chatbot on that? HOT 1
- Utilize regen.json in finetuning
- Loss will suddenly turn 0 during SFT HOT 2
- Confusion about instruction task HOT 1
- AttributeError: 'ModelArguments' object has no attribute 'target_modules'
- Can you release your evaluation code and data?
- How to get the model
- ImportError when using `weight_diff.py` script HOT 2
- weight_diff.py state_dict_recovered[key].add_(state_dict_raw[key]) RuntimeError: The size of tensor a (32001) must match the size of tensor b (32000) at non-singleton dimension 0 HOT 2
- Problems generating my own data offline
- NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet HOT 1
- The argument order of the ROUGE score might be wrong.