Comments (11)
For me it also worked with batch size 4 on two A100 80GB GPUs with sequence length 512.
I can't say for certain, but the first thing you should try is setting the batch sizes to 1 and gradient accumulation to 1 as well (i.e. --per_device_train_batch_size 1, --per_device_eval_batch_size 1, --gradient_accumulation_steps 1). That is the configuration with the smallest memory footprint that requires no code changes. Start there.
I encountered a CUDA OOM on a single A100 80G using your training code. Can I fix this by changing anything?
Can you fix the OOM problem? I ran into it as well, using Python 3.8 and PyTorch 1.13.1 on a single A100 80G.
I have the same problem.
Just to swoop in here: if you're using 6.7B or larger, one GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 params * 4 bytes * 3 = 80.4 GB just to store the parameters and optimizer states (even in mixed precision, you want to keep those in fp32).
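A minimal sketch of that arithmetic (assuming a 6.7B-parameter model and counting only the fp32 parameters plus the two Adam optimizer states; gradients and activations are extra):

# Back-of-the-envelope memory estimate from the comment above:
# three fp32 copies per parameter (weights + two Adam moments).
num_params = 6.7e9      # 6.7B-class model
bytes_per_copy = 4      # fp32
copies = 3              # parameters + Adam first and second moments

total_gb = num_params * bytes_per_copy * copies / 1e9
print(f"~{total_gb:.1f} GB")  # ~80.4 GB, already over a single 80 GB A100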
Thanks @dlwh. After switching to two A100 80GB GPUs it worked for me.
Regarding "At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4 GB to store the parameters and optimizer states": why *3?
Adam, the default optimizer, stores two moment estimates that are each the same size as the parameters themselves, so you end up with essentially three fp32 copies of the parameters... plus a fourth, half-sized copy for the gradients (so 14 bytes per parameter in total), and then memory for activations, which can range from "not very much" to many extra GB.
Two A100s or ZeRO CPU offload should fix it right up.
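A sketch of that per-parameter accounting (assuming fp32 weights and Adam states plus fp16 gradients, activations not counted; the 7B figure is just for illustration):

# Per-parameter memory for mixed-precision Adam fine-tuning, as described above.
num_params = 7e9        # LLaMA-7B, for illustration

fp32_weights  = 4       # master copy of the parameters
adam_moment_1 = 4       # first-moment estimate (exp_avg)
adam_moment_2 = 4       # second-moment estimate (exp_avg_sq)
fp16_grads    = 2       # half-precision gradients

bytes_per_param = fp32_weights + adam_moment_1 + adam_moment_2 + fp16_grads  # 14
print(f"~{num_params * bytes_per_param / 1e9:.0f} GB before activations")    # ~98 GB
# ~98 GB is why a single 80 GB A100 OOMs; sharding across 2+ GPUs (FSDP/ZeRO)
# or offloading the optimizer states to CPU gets it under the limit.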
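And for the ZeRO CPU offload route, a minimal config sketch (the field names follow DeepSpeed's config schema; the values and the ds_config.json filename are only illustrative, not the repo's official setup):

import json

# ZeRO stage 2 with the Adam states kept in (pinned) CPU memory.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

With the Hugging Face Trainer that train.py builds on, such a file is typically passed via the --deepspeed argument in place of the FSDP flags.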
Thanks a lot, 2+ GPUs worked fine.
What batch size did those of you who got it working with 2 A100 80GB GPUs use? I have 4 GPUs and it still fails. The reason I'm asking is that mine only works with batch_size set to 1 on 7B, yet the README says:
Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode (per_device_train_batch_size is 4 here; how did they manage that?):
torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
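For reference, the effective (global) batch size in that command is the per-device batch times the gradient-accumulation steps times the number of GPUs, so an OOM can usually be traded for wall-clock time by shrinking the per-device batch and growing the accumulation steps. A quick sanity check of that arithmetic (the second configuration is just a hypothetical lower-memory equivalent):

# Effective (global) batch size of the README command above.
per_device_batch = 4
grad_accum_steps = 8
num_gpus = 4
print(per_device_batch * grad_accum_steps * num_gpus)   # 128 sequences per optimizer step

# Hypothetical lower-memory variant with the same effective batch size:
# less activation memory per GPU, more (slower) accumulation steps.
per_device_batch = 1
grad_accum_steps = 32
print(per_device_batch * grad_accum_steps * num_gpus)   # still 128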
That's really strange; the default value for model_max_length is already 512. With batch_size 1 it unfortunately takes forever.
Related Issues (20)
- DeepSpeed compilation (cpu_adam issue)
- incorrect model_max_length HOT 1
- BUG: "labels" information leakage into "input_ids" fields - incorrect attention_mask HOT 1
- [Windows]: RuntimeError: Distributed package doesn't have NCCL built in
- How to provide extra context as a PDF file?
- Python bindings and text classification & summarisation tasks
- Question about padding the input sequence HOT 3
- Wondering how to run inference after finetuning.
- How to finetune with my own private data and then build a chatbot on that? HOT 1
- Utilize regen.json in finetuning
- Loss will suddenly turn 0 during SFT HOT 2
- Confusion about instruction task HOT 1
- AttributeError: 'ModelArguments' object has no attribute 'target_modules'
- Can you release your evaluation code and data?
- How to get the model
- ImportError when using `weight_diff.py` script HOT 2
- weight_diff.py state_dict_recovered[key].add_(state_dict_raw[key]) RuntimeError: The size of tensor a (32001) must match the size of tensor b (32000) at non-singleton dimension 0 HOT 2
- Problems generating my own data offline
- NotImplementedError: offload_to_cpu=True and NO_SHARD is not supported yet HOT 1
- The argument order of the ROUGE score might be wrong.