
tiger-ai-lab / mantis

145 stars · 8 forks · 11 issues · 77.77 MB

Official code for the paper "Mantis: Multi-Image Instruction Tuning"

Home Page: https://tiger-ai-lab.github.io/Mantis/

License: Apache License 2.0

Python 86.53% Shell 11.29% Jupyter Notebook 2.18%
language vision fuyu llava-llama3 lmm mantis mllm video vlm multi-image-understanding

mantis's People

Contributors

hexuan21, jdf-prog, wenhuchen


mantis's Issues

Performance drop when using LoRA finetuning

Thanks for your nice work.

I am training mantis-8b-siglip-llama3 with LoRA. However, there appears to be a non-negligible performance drop when finetuning with LoRA. Results on two evaluation benchmarks are given here:

                    Mantis-Eval   NLVR2
Reported in paper   59.45         87.43
LoRA results        51.61         82.06

For pretraining and fine-tuning, I did not modify any training-specific hyperparameters.

Perhaps it is caused by some error when loading the LoRA model during inference. I notice that you did LoRA finetuning in your ablations. Could the authors help with this?
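For context, a minimal sketch of how a LoRA adapter is typically loaded and merged for inference with peft; the model class and paths are placeholders, not this repo's actual loading code:

```python
# Hypothetical sketch: load a base model, attach a LoRA adapter, and merge it
# for inference. The class and the paths are placeholders, not Mantis's API.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-checkpoint")
model = model.merge_and_unload()  # fold LoRA weights back into the base model
model.eval()
```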

Thanks.

Error when training with bash scripts/finetune.sh on a single machine with 4 RTX 4090 GPUs

e7497f3c073c:6956:9753 [1] NCCL INFO comm 0x761d313cac20 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 4000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6958:9750 [3] NCCL INFO comm 0x75ebdb2446d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 6000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6955:9751 [0] NCCL INFO comm 0x73d7493caf90 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6957:9752 [2] NCCL INFO comm 0x7ed4193cafe0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 5000 commId 0x273b00670b73d30e - Init COMPLETE
W0715 08:53:40.701000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6955 closing signal SIGTERM
W0715 08:53:40.704000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6956 closing signal SIGTERM
W0715 08:53:40.704000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6958 closing signal SIGTERM
E0715 08:53:42.146000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 2 (pid: 6957) of binary: /workspace/miniconda3/envs/Mantis/bin/python
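
For context (an observation about the log above, not from the repo): exitcode -9 means the process received SIGKILL, which on a single machine most often points to the host running out of memory (the kernel OOM killer); checking dmesg after the crash usually confirms it. A generic snippet for watching free GPU memory as well (not part of the Mantis codebase):

```python
# Generic check of free/total memory per GPU; not from the Mantis codebase.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```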

Training multimodal models only works with batch_size = 1

Hello! Thanks for open-sourcing the code :)
I'm playing a bit with the code and trying to replicate the pre-training and fine-tuning results, but I'm encountering many problems.

The problems I'm having fall into two groups:

  • Datasets not well-formatted
  • Processor problems with multi-image input

1. Datasets

The LLaVA pre-training script mantis/train/scripts/pretrain_mllava.sh uses the online Hugging Face dataset, but the majority of the data seems to be missing. In the TIGER-Lab/llava-data dataset, llava-pretrain contains only 278,670 images in 660 subfolders, while llava-finetune has only 348,679 images across the folders coco, gqa, ocr_vqa, textvqa, and vg. For the majority of the CC3M images in llava-pretrain, I'm getting:

[PosixPath('.cache/huggingface/datasets/TIGER-Lab___llava-data/llava_pretrain/0.0.0/73ecf20404b651edea601a8b868b2c818f3a7abf5fda24c2d8b0993118c3aea4/train_images/images/00266/002667434.jpg')] does not exist
Error at 100759
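
To quantify how much of the data actually made it to disk, here is a small sketch that counts extracted images per subfolder (the root path is a placeholder for the local dataset cache):

```python
# Hypothetical sketch: count how many images were actually extracted, per subfolder.
from collections import Counter
from pathlib import Path

root = Path("path/to/train_images/images")  # placeholder for the local dataset cache
counts = Counter(p.parent.name for p in root.rglob("*.jpg"))
print(f"{sum(counts.values())} images across {len(counts)} subfolders")
```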

2. AssertionError

Regarding LLaVA fine-tuning, the script raises an error about unsupported multi-image input ("This method only supports a single input, but get 4 inputs"), which seems to come from the LLaVA processor method _right_pad_inputs_with_attention_mask. Why doesn't LLaVA support multi-image input? This error is raised with both llava and mllava as the mllava_type setting, and also when using the CLIP version.

The idefics2 script seems to work with the default settings, downloading the data and starting the run (I'm using the train_idefics2.sh script with default QLoRA finetuning). However, I have noticed that during training the memory usage of some GPUs (in my case 0 and 2) spikes repeatedly, going from 30% to 89%. I have tried to increase per_device_train_batch_size above 1 (I have 40 GB A100 GPUs), but I'm getting strange errors that I didn't get with the original settings (the only change is per_device_train_batch_size=4 on line 139):

Traceback (most recent call last):
  File "Mantis/mantis/train/train_idefics2.py", line 255, in <module>
    main(training_args, data_args, model_args)
  File "Mantis/mantis/train/train_idefics2.py", line 229, in main
    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/transformers/trainer.py", line 1876, in train
    return inner_training_loop(
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/transformers/trainer.py", line 2178, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File ".local/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File ".local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "Mantis/mantis/train/data.py", line 753, in __call__
    batch_encoding = self._right_pad_inputs_with_attention_mask(model_inputs=batch)
  File "Mantis/mantis/train/data.py", line 741, in _right_pad_inputs_with_attention_mask
    assert len(model_inputs) == 1, "This method only supports a single input, but get {} inputs".format(len(model_inputs))
AssertionError: This method only supports a single input, but get 4 inputs

This error seems related to the one I had with LLaVA, which makes me think we should probably modify or create a _right_pad_inputs_with_attention_mask method for the idefics2 processor (as well as for llava), similar to the one made for MFuyu?
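
For illustration, here is a minimal sketch of the kind of right-padding collator I mean; this is an assumption about the fix, not the actual MFuyu implementation from the repo:

```python
# Hedged sketch: right-pad variable-length input_ids/attention_mask across a batch.
import torch

def right_pad_batch(model_inputs, pad_token_id=0):
    # model_inputs: list of dicts, each with a 1-D "input_ids" tensor (assumed layout)
    max_len = max(x["input_ids"].shape[0] for x in model_inputs)
    input_ids, attention_mask = [], []
    for x in model_inputs:
        ids = x["input_ids"]
        pad = torch.full((max_len - ids.shape[0],), pad_token_id, dtype=ids.dtype)
        input_ids.append(torch.cat([ids, pad]))
        attention_mask.append(torch.cat([torch.ones_like(ids), torch.zeros_like(pad)]))
    return {
        "input_ids": torch.stack(input_ids),
        "attention_mask": torch.stack(attention_mask),
    }
```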

Please give some advice.

About the conda env for finetuning

Nice work! Thanks for the contribution.

We are carrying out instruction tuning experiments with Mantis-8B-siglip-llama3. Pretraining and instruction finetuning with LoRA work fine, but full-parameter finetuning does not: the warning below came up and finetuning got stuck. I'm putting this here for others' reference.

Invalidate trace cache @ step 344: expected module 345, but got module 1

Referring to the issue, this might be due to how accelerate or deepspeed is installed. Since there are no version specifications in this repo's setup.py, may we ask which exact versions you used for fine-tuning, for dependencies like torch, accelerate, and deepspeed?
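
In the meantime, here is a quick way to capture the exact versions in an environment for comparison (generic Python, nothing repo-specific):

```python
# Print installed versions of the packages discussed above.
import importlib.metadata as md

for pkg in ("torch", "accelerate", "deepspeed", "transformers"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```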

Thanks in advance.

Question about mantis-eval matching criteria

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in mantis-eval, specifically for the "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, without further parsing (see the edit below). But the prompt template for this type of question also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit:
I just saw that there is parsing of the model's output that only keeps the text after "Final Answer: ". That makes much more sense. However, I noticed that sometimes a model answers correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, I think it also makes sense to make the instruction in the prompt template clearer, for example by adding one more sentence like "Answer the question in a single word or phrase."
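
For concreteness, here is a sketch of the looser matching I have in mind; this is a hypothetical scorer, not mantis-eval's actual code:

```python
# Hypothetical looser matcher: parse after "Final Answer:" and accept the
# reference as a whole word inside a longer answer.
import re

def extract_final_answer(output: str) -> str:
    # Keep only the text after "Final Answer:", falling back to the whole output
    m = re.search(r"Final Answer:\s*(.+)", output, flags=re.IGNORECASE)
    return (m.group(1) if m else output).strip().rstrip(".")

def is_correct(prediction: str, reference: str) -> bool:
    pred = extract_final_answer(prediction).lower()
    ref = reference.strip().lower()
    # Exact match, or the reference appearing as a whole word in a longer answer
    return pred == ref or re.search(rf"\b{re.escape(ref)}\b", pred) is not None
```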

Idefics2 full fine-tuning getting RuntimeError: shape mismatch

I'm working on fine-tuning Idefics2 with multiple images in the instruction.
I followed this script for full fine-tuning: https://github.com/TIGER-AI-Lab/Mantis/blob/89d34077bd87b66eaadc13117add553e3a3d4c0b/mantis/train/scripts/train_idefics2_full.sh

Here is the command:

NCCL_DEBUG=WARN accelerate launch --config_file=./accelerate_configs/accelerate_config_zero3.yaml \
    --machine_rank 0 --main_process_ip 10.29.35.44 --main_process_port 12956 \
    --num_machines 1 --num_processes 8 \
    train_idefics2.py \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --data_config_file custom_data_config.yaml \
    --data_format chat \
    --run_name 240523_idefics2_mantis \
    --output_dir 240523_idefics2_mantis \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 200 \
    --eval_steps 200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --dataloader_num_workers 5 \
    --report_to wandb \
    --do_train \
    --lora_enabled False \
    --qlora_enabled False \
    --dora_enabled False \
    --max_seq_len 512 \
    --fp16 \
    --attn_implementation eager

The error I got is:

[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1677, in forward
[rank0]:     inputs_embeds = self.inputs_merger(
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1564, in inputs_merger
[rank0]:     new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
[rank0]: RuntimeError: shape mismatch: value tensor of shape [256, 4096] cannot be broadcast to indexing result of shape [192, 4096]

Any suggestions on how to fix it?
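
One hypothesis I'm checking (an assumption on my part, not a confirmed diagnosis): idefics2 encodes each image into 64 visual tokens, and 256 = 4 x 64 while 192 = 3 x 64, so --max_seq_len 512 may be truncating one image's <image> placeholder tokens. A quick sanity check:

```python
# Sanity check (hypothetical): does the number of surviving <image> placeholder
# tokens after truncation match the number of encoded images times 64?
import torch

def count_image_tokens(input_ids: torch.Tensor, image_token_id: int) -> int:
    # How many <image> placeholder positions survived tokenization/truncation
    return int((input_ids == image_token_id).sum())

# e.g. count_image_tokens(batch["input_ids"],
#                         processor.tokenizer.convert_tokens_to_ids("<image>"))
```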

Thanks in advance
