
deep-learning-pytorch-huggingface's Introduction

Getting Started with Deep Learning with PyTorch and Hugging Face

This repository contains instructions, examples, and tutorials for getting started with deep learning using PyTorch and Hugging Face libraries such as transformers and datasets.

Training

Requirements

Before you start, make sure you have met the following requirements:

  • AWS Account with quota
  • AWS CLI installed
  • AWS IAM user configured in the CLI with permission to create and manage EC2 instances

Commands

echo 'export PATH="${HOME}/.local/bin:$PATH"' >> ${HOME}/.bashrc 
watch -n0.1 nvidia-smi

deep-learning-pytorch-huggingface's People

Contributors

davidmrau, philschmid, rishav-hub, yao-matrix


deep-learning-pytorch-huggingface's Issues

Deprecation warnings.

Thanks for the great work.

Just a very gentle reminder: the code still works for now, but it may break in the near future because of the deprecation warnings below.

'''
...accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
'''
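For reference, a minimal sketch of the replacement the warning asks for (assuming a recent accelerate release that ships DataLoaderConfiguration):

from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Same values the deprecation warning reports, now passed via the config object.
dataloader_config = DataLoaderConfiguration(
    dispatch_batches=None,
    split_batches=False,
    even_batches=True,
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)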

Regarding the OOM issues with fine-tuning Flan-T5-xl

I tried to do full fine-tuning of Flan-T5-XL, but I keep running into OOM. I used five A5000 cards with 24 GB each, which should be enough in theory, yet I still hit OOM. Do I have to use DeepSpeed? In the explanation I saw the term 'DS offload'; does 'yes' mean that DeepSpeed was not used? Has anyone else run similar experiments? Can you tell me the reason?
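For context, full fine-tuning with Adam keeps roughly 16 bytes of weights, gradients, and optimizer state per parameter, so a ~3B-parameter model needs on the order of 48 GB per GPU under plain data parallelism; sharding those states with DeepSpeed ZeRO-3 (optionally with CPU offload) is the usual way to make it fit on 24 GB cards. A minimal sketch of wiring that into the Trainer, assuming a ZeRO-3 config file like the ones in this repo:

from transformers import TrainingArguments

# Hypothetical paths and values; the key part is the `deepspeed` argument.
training_args = TrainingArguments(
    output_dir="flan-t5-xl-ft",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    deepspeed="configs/ds_flan_t5_z3_offload_bf16.json",  # ZeRO-3 + CPU offload
)
# Launch with `deepspeed --num_gpus=5 train.py ...` so the states are sharded across all five cards.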

Chat Inference Code

Hi,

I just wanted to check: after training a model on my custom data, how should I use it as a chatbot to generate answers to the questions users ask?
Can you please tell me what needs to be changed in the inference part of the code?

Looking forward to your response
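Not an official answer, but a minimal sketch of chat-style inference with a fine-tuned causal LM (the model path and generation settings are placeholders; this assumes the tokenizer ships a chat template, otherwise build the prompt in the same format you used for training):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "path/to/your-finetuned-model"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [{"role": "user", "content": "What is LoRA?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"][len(prompt):].strip())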

ValueError

Hi Phil,

thank you for sharing this useful blog post on how to fine-tune flan-t5-xxl efficiently. I am trying to run your code on Google Colab with a GPU enabled, but I run into a ValueError when trying to load the sharded model "philschmid/flan-t5-xxl-sharded-fp16". This is the error message I am getting

[screenshot of the error message]

when trying to execute this code cell

[screenshot of the code cell]

Any help would be highly appreciated.

Thanks,

Max

Not able to run training/fsdp-qlora-distributed-llama3.ipynb

Hi @philschmid! Thank you for the blog. It's very helpful.

I am trying to reproduce the results as-is: I followed the blog and installed the libraries with the same versions.

I am running into the following issue:
ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Someone mentioned here that setting FSDP_CPU_RAM_EFFICIENT_LOADING=1 should solve it, but this is already set in the torchrun command as per the blog.

I'm pretty much clueless at this point. Any suggestions would be really helpful.
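One thing worth double-checking (a sketch, not a confirmed fix): FSDP can only flatten parameters that share a single dtype, so the 4-bit quant storage dtype, the torch_dtype the model is loaded in, and any non-quantized modules all have to end up as bfloat16.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

compute_dtype = torch.bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_quant_storage=compute_dtype,  # must match torch_dtype so FSDP can flatten
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # model id assumed from the notebook
    quantization_config=quantization_config,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
)
# Inspect the dtypes; a mix of bfloat16 and float32 here is what triggers the FSDP error.
print({p.dtype for p in model.parameters()})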

Llama patch for FlashAttention support fails with use_cache

I came across your llama_patch.py when looking to patch Llama for inference myself, and unless I'm doing something wrong, the implementation fails when use_cache=True and past_key_value is not None.

Specifically, during generation with use_cache=True, in this line query_states will have sequence length 1 while key_states and value_states will have length 1 + past_key_value[0].shape[-2], so these tensors won't stack.

https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/05d83eaa3c2ad6088227fa26dffb097e06439aef/training/utils/llama_patch.py#L76C3-L76C3

I think this is also why the other llama patches referenced in the comments don't support flash attention and the KV cache at the same time. Not sure if there's a clever workaround?
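For anyone following along, a tiny standalone illustration of the shape mismatch (all shapes are hypothetical): with a KV cache the new query covers only one position while the keys/values include the cached positions, so they can only be combined by concatenating along the sequence dimension, not by stacking.

import torch

bsz, n_heads, head_dim, past_len = 1, 32, 128, 10
query_states = torch.randn(bsz, n_heads, 1, head_dim)         # only the new token
past_key = torch.randn(bsz, n_heads, past_len, head_dim)      # cached positions
new_key = torch.randn(bsz, n_heads, 1, head_dim)
key_states = torch.cat([past_key, new_key], dim=2)            # length past_len + 1
print(query_states.shape, key_states.shape)                   # seq lengths 1 vs 11, so they cannot be stacked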

CPU offload when not using offload deepspeed config file

Hi Philipp,

thanks for your awesome blog on training Flan-T5 XXL. I am playing around with it, doing just zero-shot inference with the ds_flan_t5_z3_config_bf16.json DeepSpeed config file. I believe this should not do any offload; however, I see the following in the DeepSpeed logs:

[2023-06-15 17:30:29,691] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.4, git-hash=unknown, git-branch=unknown
[2023-06-15 17:30:58,137] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-06-15 17:30:58,141] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2023-06-15 17:30:58,223] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-06-15 17:30:58,223] [INFO] [utils.py:786:see_memory_usage] MA 20.74 GB  Max_MA 20.74 GB  CA 20.74 GB  Max_CA 21 GB
[2023-06-15 17:30:58,224] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 21.87 GB, percent = 2.9%
Parameter Offload: Total persistent parameters: 503808 in 124 params
[2023-06-15 17:31:08,932] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-06-15 17:31:08,933] [INFO] [utils.py:786:see_memory_usage] MA 2.59 GB  Max_MA 20.77 GB  CA 16.29 GB  Max_CA 21 GB
[2023-06-15 17:31:08,933] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 21.97 GB, percent = 2.9%

I am also seeing logs mentioning trace cache. Is this related to CPU offload?

Invalidate trace cache @ step 0: expected module 2, but got module 0

Thanks again and looking forward to your reply.

Out of Memory: Cannot reproduce T5-XXL run on 8xA10G.

I am trying to reproduce the FLAN-T5-XXL (11B) results from this blog post.

I have an 8xA10G instance. Since the blog shows that you can run FLAN-T5-XXL (11B) training on a 4xA10G setup, I was surprised to see that I get a CUDA OOM error as soon as the first training epoch starts:

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacity of 21.99 GiB of which 723.06 MiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.87 GiB is allocated by PyTorch, and 2.96 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

I have even tried to run at batch=1, but that didn't help, and I have double checked that bf16 is enabled.

Additionally, I have attempted to run the default T5-11B and T5-3B models using Accelerate + DeepSpeed (ZeRO stage 3, bf16) following the instructions from this tutorial, and I also get a CUDA OOM. The only case in which I do not get an OOM is when I run the default T5-Large at batch=1.

I'm not sure where I am going wrong. The error message suggests that something is reserving almost all of the GPU memory before the "real" allocations start (only 723 MiB is free).
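In case it helps anyone hitting the same message, this is roughly how the suggested setting is applied (it only mitigates fragmentation; it will not fix a genuine capacity shortfall):

import os

# Must be set before the first CUDA allocation, i.e. before torch initialises CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported afterwards so the allocator picks the setting up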

Target modules all-linear not found in the base model.

Thank you for these excellent resources!

When running the https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-llms-in-2024-with-trl.ipynb notebook, I get an error on this line of code:

peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

It gives this error:

ValueError: Target modules all-linear not found in the base model. Please check the target modules and try again.

Any ideas?
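A hedged workaround that has resolved similar reports: "all-linear" is only understood by newer peft releases, so either upgrade peft or spell the modules out explicitly (the names below assume a Llama/Mistral-style architecture):

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    # explicit list instead of "all-linear"
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)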

What's the use of "messages" in the DPO step?

Refer to: https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/dpo-align-llms-in-2024-with-trl.ipynb

for prompt in prompts:
  # 👇 No use?
  messages = pipe.tokenizer.apply_chat_template([{"role":"user", "content": prompt}], tokenize=False)
  outputs = pipe(prompt, max_new_tokens=2048, do_sample=True, temperature=1.0, top_k=50, top_p=0.9, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
  print(f"**Prompt**:\n{prompt}\n")
  print(f"**Generated Answer**:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
  print("===" * 10)

The messages variable isn't used here or anywhere afterwards, right?
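It does look unused as written. If the templated prompt were meant to be used, a sketch (reusing the variables from the snippet above) would pass it to the pipeline instead of the raw prompt:

messages = pipe.tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
)
outputs = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=1.0, top_k=50,
               top_p=0.9, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
print(f"**Generated Answer**:\n{outputs[0]['generated_text'][len(messages):].strip()}")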

CUDA OOM error while saving the model

Hi @philschmid! Thanks a lot for your blog on fine-tuning FLAN-T5-XXL with LoRA.
I was trying the same on a custom dataset.
Some more details:

Dataset size = 6k records,
instance_type = AWS's ml.g5.16xlarge.
batch_size = 2,
gradient_accumulation_steps = 2
learning_rate = 1e-3,
num_train_epochs = 1

Training completes with this output -
{'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}

But I get a CUDA OOM error at the point of saving the model.
Error:

ErrorMessage "OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 22.19 GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

Code -

trainer.save_model(os.environ["SM_MODEL_DIR"])
tokenizer.save_pretrained(os.environ["SM_MODEL_DIR"])

Do you have any suggestions on how to solve this error?
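Not a confirmed fix, but a sketch of a common workaround: release the cached allocations held after training before saving, and for a LoRA/PEFT model save only the adapter weights rather than materialising a full state dict (variable names reuse the snippet above):

import gc
import os
import torch

gc.collect()
torch.cuda.empty_cache()  # free cached blocks still held after training/eval

# For a PeftModel this writes only the small adapter weights.
trainer.model.save_pretrained(os.environ["SM_MODEL_DIR"])
tokenizer.save_pretrained(os.environ["SM_MODEL_DIR"])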

Instruction tuning of Llama 2 is significantly slower than the documented 3-hour fine-tuning time on an A10G.

Hi,

First of all, thanks for putting together the nicely formatted code for fine-tuning Llama 2 in 4-bit.
I was able to follow all the steps and set up training of the model (as shown in your tutorial/IPython notebook): https://www.philschmid.de/instruction-tune-llama-2

Your tutorial mentions that the training time on a g5.2xlarge without flash attention is around 3 hours. However, running your code shows a training time of 40 hours! Can you help narrow down the difference/issue?

I am attaching some screenshots. At a high level, I suspect there is a bottleneck in the data loader (since the code is only using 1 CPU core); I did try adding the num_workers flag in TrainingArguments, but that did not help. GPU utilization seems decent.

[screenshots of training progress and CPU/GPU utilization]

Any thoughts @philschmid ?

question about the llama instruction code

Hello. The QLoRA and flash attention code has been very helpful.

I noticed that the code worked fine with Llama 2 7B, but there seems to be a shape issue with 70B. Is that correct?

compute_metrics() function

The compute_metrics function takes a parameter eval_preds, which is unpacked into preds and labels.
Here's the line in question:

preds, labels = eval_preds

What I'm trying to figure out is what exactly eval_preds is. When compute_metrics is passed to the Trainer, no arguments are supplied.

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)

If my understanding is correct, it's a structure that contains the labels we already have in tokenized_dataset["train"]["labels"] and the predicted labels that will be compared against them using the ROUGE score. So where are these labels coming from?

Thanks.
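For what it's worth, a minimal sketch of what the Trainer hands over (assuming the usual Seq2SeqTrainer evaluation loop): eval_preds is an EvalPrediction whose label_ids come from the labels column of eval_dataset and whose predictions are the model outputs for those same examples, so nothing extra has to be passed when registering compute_metrics.

import numpy as np
from transformers import EvalPrediction

def compute_metrics(eval_preds: EvalPrediction):
    preds, labels = eval_preds  # predictions and label_ids, gathered over the eval set
    labels = np.where(labels != -100, labels, 0)  # undo the ignore index before decoding
    return {"num_eval_examples": len(preds)}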

Does this work for Llama2 - Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention?

Thanks Phil for the great post "Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention". When I tried to change falcon to llama2 (I tried all three model sizes), I always got "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)". Should there be more changes than just the model name to make it work? Or will you have a follow-up post about fine-tuning Llama 2 with DeepSpeed + LoRA?

Error (return code -7) when fine-tuning FLAN-T5-XXL on 8x A100

Hi Philipp @philschmid ,

Thank you for the wonderful tutorial. These days I've been using your code to fine-tune Flan-T5-XXL on my own corpus using DeepSpeed, as shown by you here.
I am using the same configuration as given in your code, e.g. 8x A100, 100 CPU cores, 500 GB of memory. I prepared my datasets and got everything ready.

I've got almost everything done, but I am still one step away from success:
when I execute 'deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py', everything seems to go fine initially,
but soon after the 8 GPUs start working (loaded with data), the process is immediately terminated (killed).

$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████| 5/5 [00:34<00:00,  6.91s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:37<00:00,  7.49s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:40<00:00,  8.11s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:41<00:00,  8.37s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:43<00:00,  8.60s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:43<00:00,  8.74s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:44<00:00,  8.95s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:45<00:00,  9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7

I am pretty sure that all the environment requirements are satisfied as you noted there (batch size 8), but I am not sure where the problem lies. I don't understand the meaning of return code -7 (as in the last line above), even after searching the whole Internet.
I am not very knowledgeable about the underlying mechanics of GPU parallelism. From your viewpoint, what could be the most probable factors?

Quantization question:

I am using this code to fine-tune Llama 3 70B on AWS GPUs.
Here we use BitsAndBytesConfig to quantize the model weights and load them in 4-bit (nf4).

    torch_dtype = torch.bfloat16
    quant_storage_dtype = torch.bfloat16

    quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )

    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        quantization_config=quantization_config,
        attn_implementation="sdpa",
        torch_dtype=quant_storage_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
    )

But the output seems to be quant_storage_dtype = torch.bfloat16.

When I then try to merge the LoRA weights into the model later:

from peft import AutoPeftModelForCausalLM
import torch

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
     '/my-checkpoint-40',
     torch_dtype=torch.float16,
     low_cpu_mem_usage=True,
)
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
# Double-check if quantization is still effective
for name, param in merged_model.named_parameters():
    print(name, param.dtype, param.shape)  # This will show the dtype and shape of each parameter

Each layer is stored as quant_storage_dtype = torch.bfloat16.

The sizes of the safetensors for the fine-tuned model add up to 118 GB, while llama3-70b is 127 GB, i.e. only a ~7% reduction for the fine-tuned model.

Maybe a weird question, but: is this model quantized? Is it semi-quantized? Should I quantize it further to reduce the size even more? (I need a smaller model because of the GPUs I have.)

The quant_storage_dtype = torch.bfloat16 confuses me a bit.
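A hedged way to check, rather than an answer: bitsandbytes keeps 4-bit weights in bnb.nn.Params4bit containers whose packed storage uses quant_storage_dtype, so the reported bfloat16 dtype alone does not tell you whether the values are still quantized. Counting those containers in the merged model (reusing merged_model from the snippet above) shows whether anything 4-bit survived the merge:

import bitsandbytes as bnb

def count_4bit_params(model):
    # Params4bit marks weights that are still bitsandbytes 4-bit quantized.
    return sum(isinstance(p, bnb.nn.Params4bit) for p in model.parameters())

print(count_4bit_params(merged_model))  # 0 would mean the merged model is plain bf16/fp16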

Cannot load tokenizer for llama2

Following the instructions from this tutorial (https://www.philschmid.de/instruction-tune-llama-2), I was able to download the model from Hugging Face, but while loading the tokenizer I am facing this issue:

โ€œโ€โ€œOSError: Canโ€™t load tokenizer for โ€˜meta-llama/Llama-2-7b-hfโ€™. If you were trying to load it from โ€˜Models - Hugging Faceโ€™, make sure you donโ€™t have a local directory with the same name. Otherwise, make sure โ€˜meta-llama/Llama-2-7b-hfโ€™ is the correct path to a directory containing all relevant files for a LlamaTokenizer tokenizer.
โ€œโ€โ€

any help is appreciated!

Thanks

Colab notebook fails

Something fails with the rouge_score library:

#------
import evaluate
metric = evaluate.load("rouge")
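A hedged guess, since no traceback is included: evaluate.load("rouge") imports the separate rouge_score package at load time, so a missing install is the usual cause on Colab. A quick check:

import importlib.util

# rouge_score is a separate pip package ("rouge-score") required by the rouge metric.
if importlib.util.find_spec("rouge_score") is None:
    print("rouge_score is not installed; install it before calling evaluate.load('rouge')")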

Sample inference script for FLAN-T5 XXL using DeepSpeed & Hugging Face.

Hi there,

I trained a flan-t5-xxl model following the steps from your blog. The training went well without any issues.

I ran the model for inference with an inference script for a normal Hugging Face seq2seq model, and just called the script with DeepSpeed as:

deepspeed --num_gpus=4 test_flan_t5_xxl.py --model_id /mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/

The output of the model is very random. I believe I am doing something wrong; maybe I need to pass the config file for inference as well.

Could you please provide a sample inference script? My apologies if this is quite trivial; I am fairly new to DeepSpeed.

Looking forward to your response.

Best,
Irshad

Re. fine-tune-llms-in-2024-with-trl.ipynb

from datasets import load_dataset 
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset))

# Test on sample 
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

gave error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[14], line 11
      9 # Test on sample 
     10 prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
---> 11 outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
     13 print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
     14 print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")

File /usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py:208, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
    167 def __call__(self, text_inputs, **kwargs):
    168     """
    169     Complete the prompt(s) given as inputs.
    170 
   (...)
    206           ids of the generated text.
    207     """
--> 208     return super().__call__(text_inputs, **kwargs)

File /usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py:1140, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1132     return next(
   1133         iter(
   1134             self.get_iterator(
   (...)
   1137         )
   1138     )
   1139 else:
-> 1140     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File /usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py:1147, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1145 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1146     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1147     model_outputs = self.forward(model_inputs, **forward_params)
   1148     outputs = self.postprocess(model_outputs, **postprocess_params)
   1149     return outputs

File /usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py:1046, in Pipeline.forward(self, model_inputs, **forward_params)
   1044     with inference_context():
   1045         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1046         model_outputs = self._forward(model_inputs, **forward_params)
   1047         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1048 else:

File /usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py:271, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
    268         generate_kwargs["min_length"] += prefix_length
    270 # BS x SL
--> 271 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
    272 out_b = generated_sequence.shape[0]
    273 if self.framework == "pt":

File /usr/local/lib/python3.10/dist-packages/peft/peft_model.py:1140, in PeftModelForCausalLM.generate(self, **kwargs)
   1138     self.base_model.generation_config = self.generation_config
   1139 try:
-> 1140     outputs = self.base_model.generate(**kwargs)
   1141 except:
   1142     self.base_model.prepare_inputs_for_generation = self.base_model_prepare_inputs_for_generation

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1718, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1701     return self.assisted_decoding(
   1702         input_ids,
   1703         assistant_model=assistant_model,
   (...)
   1714         **model_kwargs,
   1715     )
   1716 if generation_mode == GenerationMode.GREEDY_SEARCH:
   1717     # 11. run greedy search
-> 1718     return self.greedy_search(
   1719         input_ids,
   1720         logits_processor=logits_processor,
   1721         stopping_criteria=stopping_criteria,
   1722         pad_token_id=generation_config.pad_token_id,
   1723         eos_token_id=generation_config.eos_token_id,
   1724         output_scores=generation_config.output_scores,
   1725         return_dict_in_generate=generation_config.return_dict_in_generate,
   1726         synced_gpus=synced_gpus,
   1727         streamer=streamer,
   1728         **model_kwargs,
   1729     )
   1731 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
   1732     if not model_kwargs["use_cache"]:

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:2579, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2576 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2578 # forward pass to get next token
-> 2579 outputs = self(
   2580     **model_inputs,
   2581     return_dict=True,
   2582     output_attentions=output_attentions,
   2583     output_hidden_states=output_hidden_states,
   2584 )
   2586 if synced_gpus and this_peer_finished:
   2587     continue  # don't waste resources running the code we don't need

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    163         output = module._old_forward(*args, **kwargs)
    164 else:
--> 165     output = module._old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:1199, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1197     logits = torch.cat(logits, dim=-1)
   1198 else:
-> 1199     logits = self.lm_head(hidden_states)
   1200 logits = logits.float()
   1202 loss = None

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:160, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    159 def new_forward(module, *args, **kwargs):
--> 160     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    161     if module._hf_hook.no_grad:
    162         with torch.no_grad():

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:293, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    291             if self.weights_map[name].dtype == torch.int8:
    292                 fp16_statistics = self.weights_map[name.replace("weight", "SCB")]
--> 293         set_module_tensor_to_device(
    294             module, name, self.execution_device, value=self.weights_map[name], fp16_statistics=fp16_statistics
    295         )
    297 return send_to_device(args, self.execution_device), send_to_device(
    298     kwargs, self.execution_device, skip_keys=self.skip_keys
    299 )

File /usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py:347, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics)
    345             module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
    346 elif isinstance(value, torch.Tensor):
--> 347     new_value = value.to(device)
    348 else:
    349     new_value = torch.tensor(value, device=device)

NotImplementedError: Cannot copy out of meta tensor; no data!

Error when fine-tuning Flan-T5-XXL on a custom dataset

I'm trying to fine-tune flan-t5-xxl on a custom QA task; thanks for the detailed article on PEFT. However, I'm encountering this error:

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
RuntimeError: Caught RuntimeError in replica 0 on device 0.
RuntimeError: mat1 and mat2 shapes cannot be multiplied (56x44 and 56x4096)

These mat shapes change and are different each time I run trainer.train().

To rule out an issue with my custom dataset, I ran your notebook without changing any code and it still failed with the above matmul error. Sometimes I get:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on
device: cuda:1

Apologies, I searched online to try to resolve this but had no luck.

Versions: transformers==4.27.1, datasets==2.9.0, accelerate==0.17.1, evaluate==0.4.0, bitsandbytes==0.37.1
Sagemaker notebook instance: ml.g5.24xlarge

Problem with preprocess_function() in tutorial

Hello,

I was following your tutorial on fine-tuning a FLAN-T5 model. However, I've encountered an error and I don't understand where it stems from.
At the line:

labels = tokenizer(text_target=sample["headline"], max_length=max_target_length, padding=padding, truncation=True)

I've changed the field names, obviously, to match my dataset, and I get this error:

_TypeError: __call__() missing 1 required positional argument: 'text'_

I'm using this dataset: JulesBelveze/tldr_news on Hugging Face.
The keys are : ['headline', 'content', 'category']

Here's all the relevant code:

model_checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_function(sample, padding="max_length"):

    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["content"]]
       
    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    
    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["headline"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = raw_datasets.map(preprocess_function, batched=True, remove_columns=["headline", "content", "category"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Does DeepSpeed partition the model across multiple GPUs?

I am trying to run your code with the flan-t5-base model. I have one machine with multiple V100s, each with 16 GB of memory, and I ran into the following question:
when I use a single GPU, the GPU memory usage is 11.5 GB;
when I use 4 GPUs, each GPU's memory usage is 11.7 GB.
DeepSpeed says it can partition the model, so I don't understand why the GPU memory usage is almost the same (see the sketch after the config below).

single GPU:

deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

multi GPU:

deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json

ds_config:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
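A sketch that may help answer this, assuming DeepSpeed's bundled estimator is available: ZeRO-3 partitions the parameter, gradient, and optimizer states, but with offload_param/offload_optimizer set to cpu those states mostly live in host memory anyway, so the remaining per-GPU usage is dominated by activations, which scale with per_device_train_batch_size and are not partitioned. The estimator below reports what the model-state side should cost per GPU:

from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
# Prints estimated per-GPU and CPU memory for ZeRO-3, with and without offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)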

Compute metrics while using SFT trainer

Hi @philschmid!

If I would like to compute metrics while training (similar to https://www.philschmid.de/fine-tune-flan-t5), how should I change the compute_metrics function passed to the SFT trainer for Llama v2 (https://www.philschmid.de/instruction-tune-llama-2)?

Would this be the same as:

def compute_metrics(eval_pred, tokenizer):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Specifically, do I still need to do

labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

Is there a way to save the decoded predictions (decoded_preds) while also passing filename=f"{savedir}/{preds}.csv" to the Trainer?

question about DeepSpeedPeftCallback

Hi, I would like to ask about the function of SaveDeepSpeedPeftModelCallback in peft_utils.py. How does it work, and why is a callback needed for DeepSpeed and PEFT?

FLAN-T5 XXL using DeepSpeed fits well for training but gives OOM error for inference.

Hi,

I trained a flan-t5-xxl model following the steps from your blog on a custom NER dataset. The training went well without any issues. I also put a print(labels) line in the postprocess_text function to check the predicted labels during evaluation, and the results were pretty good.

I used 4x A10G 24GB hardware for training with ds_flan_t5_z3_offload_bf16.json deepspeed config file.

Now I wanted to run the model for inference and I used the below deepspeed inference code to do the prediction.

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

zero_optimization = {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": True
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": True
    },
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": True
  }

generator = pipeline('text2text-generation', model='/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26', device=local_rank, max_length=256)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype="bf16",
                                           #replace_with_kernel_inject=False,
                                           zero=zero_optimization)
string = generator("Text: kindly give jack 533 from account ending with 4473 if possible\nIdentify the entities from the following options: global.money * global.time-period * global.city * global.cardinal * global.person-name * global.postal-code * global.region * global.language * global.org * global.email-id * global.phone-number * global.temperature", do_sample=False, max_length=50, num_beams=1, min_length=2)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

I used the same parameters from the config file to initialize the model for inference, but I am getting a CUDA OutOfMemory error. I don't understand how the model fit well for training but needs more memory for inference.

I believe I am doing something wrong. Please suggest any changes I need to make so I can use the trained model for inference.

Thanks

Error when training peft model example

Hi, I am trying to train the example training/peft-flan-t5-int8-summarization.ipynb.
I am using a p3dn.24xlarge (8 GPUs, 96 vCPUs, 768 GB RAM, 256 GB VRAM in total). I am simply trying to run the example directly on the machine exactly as written, but I am getting this error when calling train():

trainer.train()

/home/ubuntu/.local/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  0%| | 0/1155 [00:00<?, ?it/s]
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization

Traceback (most recent call last):
  File ".../transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File ".../accelerate/utils/memory.py", line 124, in decorator
    return function(batch_size, *args, **kwargs)
  File ".../transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File ".../transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
  File ".../transformers/trainer.py", line 2677, in compute_loss
    outputs = model(**inputs)
  File ".../torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File ".../torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File ".../torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File ".../peft/peft_model.py", line 667, in forward
    return self.base_model(
  File ".../transformers/models/t5/modeling_t5.py", line 1667, in forward
    encoder_outputs = self.encoder(
  File ".../transformers/models/t5/modeling_t5.py", line 572, in forward
    attn_output = self.o(attn_output)
  File ".../bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File ".../bitsandbytes/autograd/_functions.py", line 397, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x2 and 1x4096)

fine-tune-llms-in-2024-with-trl.ipynb not producing the expected outputs

I tried to run the code in the fine-tune-llms-in-2024-with-trl.ipynb notebook, but it is not producing the output it should; in other words, the output it generates is garbage. Below is what I am getting:

`
output[0]['generated_text']>

|im_start|>system\nYou are an text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE table_2538117_2 (type VARCHAR, organization VARCHAR)<|im_end|>\n<|im_start|>user\nwhat type of organization is sigma phi omega?<|im_end|>\n<|im_start|>assistant\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n`

And the accuracy is 0%, not 79% as mentioned in the notebook. I have run it multiple times on an A100 machine with 40 GB of GPU RAM and couldn't figure out why it is doing this.

Can you shed some light on why this is happening and how to fix it?

By the way: I don't always get the \n token repeated. Whatever comes between <|im_end|>\n<|im_start|>assistant gets repeated many times (in the above example it happened to be \n).

LLama 2 Flash Attention Patch Not Working For 70B

The flash attention patch seems to be working for Llama 7B and Llama 13B (though I need to confirm more than just a successful backward pass). However, for whatever reason, with Llama 70B I am getting an error like the following:

File "/datadrive/Finetune_LLMs/finetuning_repo/llama_patch.py", line 47, in forward
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 190, 64, 128]' is invalid for input of size 194560

Falcon-180B "forward() got an unexpected keyword argument 'position_ids'"

Hi Phil, and many thanks for the new Falcon-180B tutorial!

I am trying to replicate your experiment, but when I run it I get the error forward() got an unexpected keyword argument 'position_ids'. I saw that, while llama_patch takes position_ids as a parameter, falcon_patch redefines forward() without it. The patch is used only when running the code with flash attention, so I disabled it, but I am not sure it was supposed to run without it (for one thing, flash attention is in the post's title 😅; I also get a Signal 7 (SIGBUS), though I am not sure that is necessarily related). What are your thoughts about this?

Ah, I forgot to say: I am running it from the command line as shown in the Jupyter notebook:

torchrun --nproc_per_node 8 run_ds_lora.py \
  --model_id tiiuae/falcon-180B \
  --dataset_path dolly-processed \
  --output_dir falcon-180b-lora-fa \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --learning_rate 4e-3 \
  --gradient_checkpointing True \
  --gradient_accumulation_steps 8 \
  --bf16 True \
  --tf32 True \
  --use_flash_attn True \
  --lr_scheduler_type "constant_with_warmup" \
  --logging_steps 25 \
  --save_steps 100 \
  --save_total_limit 3 \
  --deepspeed configs/ds_falcon_180b_z3.json

flash attention error on instruction tune llama-2 tutorial on Sagemaker notebook

Thank you for the excellent Blogs!

When running https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/instruction-tune-llama-2-int4.ipynb

I am trying to enable flash attention in a SageMaker notebook on an ml.g5.2xlarge; nvidia-smi tells me I am on CUDA Version: 12.0, but

os.environ["MAX_JOBS"] = "4" 
!pip install flash-attn --no-build-isolation

gives this error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      fatal: not a git repository (or any of the parent directories): .git
      
      
      torch.__version__  = 2.1.0+cu121
      
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-1b2ql47d/flash-attn_2180596c15514b7d9e4d004796412440/setup.py", line 117, in <module>
          raise RuntimeError(
      RuntimeError: FlashAttention is only supported on CUDA 11.6 and above.  Note: make sure nvcc has a supported version by running nvcc -V.
      [end of output]

Is this something you've seen?

Precision Issue

Hi Philipp!

Thanks for this great repo!

I was trying to run Llama 2 instruction tuning following the tutorial. The code ran fine without flash attention, but after I commented the flash attention code back in, I got this error:

"Runtime Error: Flash Attention only support fp16 and bf16 data type", from line 87 in llama_patch.py.

It seems to have something to do with data precision. Could you help me figure it out? Thanks a lot!
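A hedged pointer rather than a confirmed fix: the flash-attn kernels the patch calls into only accept fp16/bf16 tensors, so the base model has to be loaded in one of those dtypes (and kept there through any PEFT preparation) instead of the default float32, for example:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # model id assumed from the tutorial
    torch_dtype=torch.bfloat16,   # flash attention requires fp16 or bf16
    device_map="auto",
)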
