
tiger-ai-lab / mantis

145 stars · 8 forks · 11 issues · 77.77 MB

Official code for the paper "Mantis: Multi-Image Instruction Tuning"

Home Page: https://tiger-ai-lab.github.io/Mantis/

License: Apache License 2.0

Python 86.53% Shell 11.29% Jupyter Notebook 2.18%
language vision fuyu llava-llama3 lmm mantis mllm video vlm multi-image-understanding

mantis's People

Contributors

hexuan21, jdf-prog, wenhuchen


mantis's Issues

Performance drop when using LoRA finetuning

Thanks for your nice work.

I am training mantis-8b-siglip-llama3 with LoRA. However, there appears to be a non-negligible performance drop when finetuning with LoRA. Results on two evaluation benchmarks are given here:

                    Mantis-Eval   NLVR2
Reported in paper   59.45         87.43
LoRA results        51.61         82.06

For pretraining and fine-tuning, I did not modify any training-specific hyperparameters.

Perhaps it is caused by some error when loading the LoRA model during inference. I notice that you did LoRA finetuning in your ablations. Could the authors help with this?
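For context, a minimal sketch of how a LoRA adapter is typically loaded and merged for inference with peft; the model class and paths are placeholders, not this repo's actual loading code:

```python
# Hypothetical sketch: load a base model, attach a LoRA adapter, and merge it
# for inference. The class and the paths are placeholders, not Mantis's API.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-checkpoint")
model = model.merge_and_unload()  # fold LoRA weights back into the base model
model.eval()
```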

Thanks.

Error when training with bash scripts/finetune.sh on a single machine with 4 RTX 4090 GPUs

e7497f3c073c:6956:9753 [1] NCCL INFO comm 0x761d313cac20 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 4000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6958:9750 [3] NCCL INFO comm 0x75ebdb2446d0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 6000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6955:9751 [0] NCCL INFO comm 0x73d7493caf90 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3000 commId 0x273b00670b73d30e - Init COMPLETE
e7497f3c073c:6957:9752 [2] NCCL INFO comm 0x7ed4193cafe0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 5000 commId 0x273b00670b73d30e - Init COMPLETE
W0715 08:53:40.701000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6955 closing signal SIGTERM
W0715 08:53:40.704000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6956 closing signal SIGTERM
W0715 08:53:40.704000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 6958 closing signal SIGTERM
E0715 08:53:42.146000 127238712194880 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 2 (pid: 6957) of binary: /workspace/miniconda3/envs/Mantis/bin/python
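
For context (an observation about the log above, not from the repo): exitcode -9 means the process received SIGKILL, which on a single machine most often points to the host running out of memory (the kernel OOM killer); checking dmesg after the crash usually confirms it. A generic snippet for watching free GPU memory as well (not part of the Mantis codebase):

```python
# Generic check of free/total memory per GPU; not from the Mantis codebase.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```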

Training multimodal models only works with batch_size = 1

Hello! Thanks for open-sourcing the code :)
I'm playing a bit with the code and trying to replicate the pre-training and fine-tuning results, but I'm encountering many problems.

The problems I'm having fall into two groups:

  • Datasets not well-formatted
  • Processor problems with multi-image input

1. Datasets

The LLaVA pre-training script mantis/train/scripts/pretrain_mllava.sh uses the online Hugging Face dataset, but the majority of the data seems to be missing. In the TIGER-Lab/llava-data dataset, llava-pretrain contains only 278,670 images in 660 subfolders, while llava-finetune has only 348,679 images across the folders coco, gqa, ocr_vqa, textvqa, and vg. For the majority of the CC3M images in llava-pretrain, I'm getting:

[PosixPath('.cache/huggingface/datasets/TIGER-Lab___llava-data/llava_pretrain/0.0.0/73ecf20404b651edea601a8b868b2c818f3a7abf5fda24c2d8b0993118c3aea4/train_images/images/00266/002667434.jpg')] does not exist
Error at 100759
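
To quantify how much of the data actually made it to disk, here is a small sketch that counts extracted images per subfolder (the root path is a placeholder for the local dataset cache):

```python
# Hypothetical sketch: count how many images were actually extracted, per subfolder.
from collections import Counter
from pathlib import Path

root = Path("path/to/train_images/images")  # placeholder for the local dataset cache
counts = Counter(p.parent.name for p in root.rglob("*.jpg"))
print(f"{sum(counts.values())} images across {len(counts)} subfolders")
```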

2. AssertionError

Regarding LLaVA fine-tuning, the script raises an error about unsupported multi-image input ("This method only supports a single input, but get 4 inputs"), which seems to come from the LLaVA processor method _right_pad_inputs_with_attention_mask. Why doesn't LLaVA support multi-image input? This error is raised with both llava and mllava as the mllava_type setting, and also when using the CLIP version.

The idefics2 script seems to work with the default settings, downloading the data and starting the run (I'm using the train_idefics2.sh script with default QLoRA finetuning). However, I have noticed that during training the memory usage of some GPUs (in my case 0 and 2) spikes repeatedly, going from 30% to 89%. I have tried to increase per_device_train_batch_size above 1 (I have 40 GB A100 GPUs), but I'm getting strange errors that I didn't get with the original settings (the only change is per_device_train_batch_size=4 on line 139):

Traceback (most recent call last):
  File "Mantis/mantis/train/train_idefics2.py", line 255, in <module>
    main(training_args, data_args, model_args)
  File "Mantis/mantis/train/train_idefics2.py", line 229, in main
    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/transformers/trainer.py", line 1876, in train
    return inner_training_loop(
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/transformers/trainer.py", line 2178, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "miniconda3/envs/mantis/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File ".local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File ".local/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File ".local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "Mantis/mantis/train/data.py", line 753, in __call__
    batch_encoding = self._right_pad_inputs_with_attention_mask(model_inputs=batch)
  File "Mantis/mantis/train/data.py", line 741, in _right_pad_inputs_with_attention_mask
    assert len(model_inputs) == 1, "This method only supports a single input, but get {} inputs".format(len(model_inputs))
AssertionError: This method only supports a single input, but get 4 inputs

This error seems related to the one I had with LLaVA, which makes me think we should probably modify or create a _right_pad_inputs_with_attention_mask method for the idefics2 processor (as well as for llava), similar to the one made for MFuyu?
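
For illustration, here is a minimal sketch of the kind of right-padding collator I mean; this is an assumption about the fix, not the actual MFuyu implementation from the repo:

```python
# Hedged sketch: right-pad variable-length input_ids/attention_mask across a batch.
import torch

def right_pad_batch(model_inputs, pad_token_id=0):
    # model_inputs: list of dicts, each with a 1-D "input_ids" tensor (assumed layout)
    max_len = max(x["input_ids"].shape[0] for x in model_inputs)
    input_ids, attention_mask = [], []
    for x in model_inputs:
        ids = x["input_ids"]
        pad = torch.full((max_len - ids.shape[0],), pad_token_id, dtype=ids.dtype)
        input_ids.append(torch.cat([ids, pad]))
        attention_mask.append(torch.cat([torch.ones_like(ids), torch.zeros_like(pad)]))
    return {
        "input_ids": torch.stack(input_ids),
        "attention_mask": torch.stack(attention_mask),
    }
```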

Please give some advice.

About the conda env for finetuning

Nice work! Thanks for the contribution.

We are carrying out instruction tuning experiments with Mantis-8B-siglip-llama3. Pretraining and instruction finetuning with LoRA work fine, but full-parameter finetuning does not: the warning below came up and finetuning got stuck. I'm putting this here for others' reference.

Invalidate trace cache @ step 344: expected module 345, but got module 1

Referring to the issue, this might be due to how accelerate or deepspeed is installed. Since there are no version specifications in this repo's setup.py, may we ask which exact versions you used for fine-tuning, for dependencies like torch, accelerate, and deepspeed?
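
In the meantime, here is a quick way to capture the exact versions in an environment for comparison (generic Python, nothing repo-specific):

```python
# Print installed versions of the packages discussed above.
import importlib.metadata as md

for pkg in ("torch", "accelerate", "deepspeed", "transformers"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```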

Thanks in advance.

Question about mantis-eval matching criteria

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in mantis-eval, specifically for the "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, without further parsing (see the edit below). But the prompt template for this type of question also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit:
I just saw that there is parsing of the model's output that only keeps the text after "Final Answer: ". That makes much more sense. However, I noticed that sometimes a model answers correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, I think it also makes sense to make the instruction in the prompt template clearer, for example by adding one more sentence like "Answer the question in a single word or phrase."
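
For concreteness, here is a sketch of the looser matching I have in mind; this is a hypothetical scorer, not mantis-eval's actual code:

```python
# Hypothetical looser matcher: parse after "Final Answer:" and accept the
# reference as a whole word inside a longer answer.
import re

def extract_final_answer(output: str) -> str:
    # Keep only the text after "Final Answer:", falling back to the whole output
    m = re.search(r"Final Answer:\s*(.+)", output, flags=re.IGNORECASE)
    return (m.group(1) if m else output).strip().rstrip(".")

def is_correct(prediction: str, reference: str) -> bool:
    pred = extract_final_answer(prediction).lower()
    ref = reference.strip().lower()
    # Exact match, or the reference appearing as a whole word in a longer answer
    return pred == ref or re.search(rf"\b{re.escape(ref)}\b", pred) is not None
```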

Idefics2 full fine-tuning getting RuntimeError: shape mismatch

I'm working on fine-tuning Idefics2 with multiple images in the instruction.
I followed this script for full fine-tuning: https://github.com/TIGER-AI-Lab/Mantis/blob/89d34077bd87b66eaadc13117add553e3a3d4c0b/mantis/train/scripts/train_idefics2_full.sh

Here is the command:

NCCL_DEBUG=WARN accelerate launch --config_file=./accelerate_configs/accelerate_config_zero3.yaml \
    --machine_rank 0 --main_process_ip 10.29.35.44 --main_process_port 12956 \
    --num_machines 1 --num_processes 8 \
    train_idefics2.py \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --data_config_file custom_data_config.yaml \
    --data_format chat \
    --run_name 240523_idefics2_mantis \
    --output_dir 240523_idefics2_mantis \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 200 \
    --eval_steps 200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --dataloader_num_workers 5 \
    --report_to wandb \
    --do_train \
    --lora_enabled False \
    --qlora_enabled False \
    --dora_enabled False \
    --max_seq_len 512 \
    --fp16 \
    --attn_implementation eager

The error I got is:

[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1677, in forward
[rank0]:     inputs_embeds = self.inputs_merger(
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1564, in inputs_merger
[rank0]:     new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
[rank0]: RuntimeError: shape mismatch: value tensor of shape [256, 4096] cannot be broadcast to indexing result of shape [192, 4096]

Any suggestions on how to fix it?
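
One hypothesis I'm checking (an assumption on my part, not a confirmed diagnosis): idefics2 encodes each image into 64 visual tokens, and 256 = 4 x 64 while 192 = 3 x 64, so --max_seq_len 512 may be truncating one image's <image> placeholder tokens. A quick sanity check:

```python
# Sanity check (hypothetical): does the number of surviving <image> placeholder
# tokens after truncation match the number of encoded images times 64?
import torch

def count_image_tokens(input_ids: torch.Tensor, image_token_id: int) -> int:
    # How many <image> placeholder positions survived tokenization/truncation
    return int((input_ids == image_token_id).sum())

# e.g. count_image_tokens(batch["input_ids"],
#                         processor.tokenizer.convert_tokens_to_ids("<image>"))
```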

Thanks in advance
