Comments (11)

JoelNiklaus commented on May 18, 2024

For me, it also worked with batch size 4 on two A100 80GB GPUs at sequence length 512.

KurtFeynmanGodel commented on May 18, 2024

I can't say for certain, but the first thing you should try is setting the batch sizes to 1 and the gradient accumulation steps to 1 as well. That is the configuration that gives the minimal memory footprint without any code changes. Start there.
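
Concretely, with the stanford_alpaca train.py flags quoted later in this thread, that minimal-footprint configuration would mean overriding the following (a sketch of just the relevant flags; everything else stays as in the README command below):

    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1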

yysjasmine commented on May 18, 2024

I encountered a CUDA OOM on a single A100 80GB using your training code. Can I fix this by changing anything?

Were you able to fix the OOM problem? I ran into it as well, using Python 3.8 and PyTorch 1.13.1 on a single A100 80GB.

JoelNiklaus commented on May 18, 2024

I have the same problem.

dlwh commented on May 18, 2024

Just to swoop in here: if you're using 6.7B or larger, one GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 params * 4 bytes * 3 = 80.4 GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

JoelNiklaus commented on May 18, 2024

Thanks @dlwh. After switching to two A100 80GB GPUs, it worked for me.

zhl5842 commented on May 18, 2024

Just to swoop in here: if you're using 6.7B or larger, one GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 params * 4 bytes * 3 = 80.4 GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

Why the factor of 3?

dlwh commented on May 18, 2024

Adam, the default optimizer, stores two momentum terms, each the same size as the parameters themselves, so essentially three fp32 copies of the parameters (3 * 4 = 12 bytes per parameter)... plus a fourth, half-sized copy for the gradients (so 14 bytes per parameter in total), and then memory for activations, which can range from "not very much" to many extra GB.

Two A100s or ZeRO CPU offload should fix it right up.
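
To make the arithmetic concrete, here is a minimal sketch in Python, assuming exactly the accounting above (fp32 weights plus two fp32 Adam moments, plus half-precision gradients; activations not counted):

    # Rough memory estimate for Adam fine-tuning in mixed precision.
    # fp32 weights + two fp32 Adam moments = 3 * 4 bytes per parameter,
    # plus half-precision gradients = 2 bytes per parameter -> 14 bytes total.
    def adam_memory_gb(n_params: float) -> float:
        params_and_optimizer = n_params * 4 * 3  # weights, exp_avg, exp_avg_sq
        gradients = n_params * 2                 # fp16/bf16 gradients
        return (params_and_optimizer + gradients) / 1e9

    print(adam_memory_gb(6.7e9))  # ~93.8 GB, before any activation memory

So even before activations, a 6.7B model is already past a single 80 GB card, which is why a second GPU or CPU offload is needed.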

yysjasmine commented on May 18, 2024

Thanks a lot, 2+ GPUs worked fine.

maziyarpanahi commented on May 18, 2024

What was the batch size for those who could make it work with 2 A100 80GB GPUs? I have 4 and it still fails. The reason I'm asking is that mine only works with batch_size set to 1 on 7B. However, the README says:

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. (per_device_train_batch_size is 4 here; how did they manage this? See the note after the command.)

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
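
For what it's worth, the global batch size implied by that command is much larger than 4. A minimal sketch of the usual Hugging Face Trainer arithmetic (assuming standard semantics; sequence length does not enter here):

    # Effective (global) batch size for the README command above.
    n_gpus = 4                         # --nproc_per_node
    per_device_train_batch_size = 4    # --per_device_train_batch_size
    gradient_accumulation_steps = 8    # --gradient_accumulation_steps
    print(n_gpus * per_device_train_batch_size * gradient_accumulation_steps)  # 128

Memory-wise, though, only per_device_train_batch_size sequences are resident on each GPU at a time; gradient accumulation raises the effective batch size without raising peak memory, so the OOM question comes down to the per-device value of 4.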

maziyarpanahi commented on May 18, 2024

That's really strange; the default value for model_max_length is already 512. With batch_size 1 it takes forever, unfortunately.
