
dvlab-research / mgm

3.1K stars · 26 watchers · 277 forks · 58.48 MB

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

License: Apache License 2.0

Python 85.01% HTML 1.38% JavaScript 1.80% CSS 0.33% Shell 11.48%
Topics: generation, large-language-models, vision-language-model

mgm's People

Contributors

eltociear, julianjuaner, lightmatmul, wcy1122, yanwei-li


mgm's Issues

Why doesn't any stage optimize the vision encoder?

Nice work. From the scripts provided, it seems that neither optimize_vision_tower nor its aux counterpart is ever used.

This leaves me with a couple of questions:

  1. Does optimizing the vision tower give worse results?
  2. If the vision tower is not fine-tuned, how does it learn new visual tokens? (A generic unfreezing sketch follows below.)
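
For reference, unfreezing a vision tower in this kind of LLaVA-style codebase usually comes down to flipping requires_grad before the optimizer is built. A minimal, generic sketch (the accessor names are assumptions for illustration, not necessarily the repo's exact API):

    def unfreeze_vision_tower(model):
        # Assumed accessors; adjust to however the model actually exposes its towers.
        vision_tower = model.get_model().get_vision_tower()
        for p in vision_tower.parameters():
            p.requires_grad = True
        # Return the trainable parameter count as a quick sanity check.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

Whether unfreezing actually helps or hurts downstream results is exactly the open question above.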

AutoTokenizer resolve error

Simply running the demo code as follows fails:

python -m minigemini.serve.cli \
    --model-path YanweiLi/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png

error message:

  File "XXXXX/envs/minigemini/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class GemmaTokenizer does not exist or is not currently imported.

Updating to the newest transformers fixes this problem.
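
For anyone hitting the same error: GemmaTokenizer only ships with transformers 4.38 and later, so upgrading past that version should be enough (the exact minimum version is an assumption):

    pip install -U "transformers>=4.38.0"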

Continue FT from stage 2 with custom data

Hi, I was wondering whether the stage-2 script would be applicable for further fine-tuning from stage 2 with a small custom dataset for domain transfer, or whether we have to write a separate script for this. (A sketch of one possible invocation is at the end of this post.)

Thanks and appreciate any help given!

Regards,

Adriel
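
One plausible way to do this without a new script is to reuse the stage-2 launch (see the 8x7B config quoted later on this page) and simply point it at the stage-2 checkpoint and the custom JSON, with a smaller learning rate and epoch count. The paths, --version value, and hyperparameters below are assumptions, not an official recipe:

    deepspeed minigemini/train/train_mem.py \
        --deepspeed ./scripts/zero3_offload.json \
        --model_name_or_path work_dirs/Mini-Gemini/Mini-Gemini-7B \
        --version v1 \
        --data_path data/custom_domain.json \
        --image_folder data/custom_images \
        --vision_tower model_zoo/OpenAI/clip-vit-large-patch14-336 \
        --vision_tower_aux model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
        --mm_projector_type mlp2x_gelu \
        --image_aspect_ratio pad \
        --bf16 True \
        --output_dir ./work_dirs/custom-domain-ft \
        --num_train_epochs 1 \
        --learning_rate 2e-5 \
        --gradient_checkpointing True \
        --lazy_preprocess True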

3 gradio issues

There are quite a few gradio issues:

  1. function_markdown is undefined
  2. unexpected keyword concurrency_limit
  3. recursive json encoder

The first two were "fixed" by just commenting them out; however, the third issue prevents gradio from working at all, as it immediately crashes the instance with the error below.

/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx/minigemini/serve/gradio_web_server.py", line 371, in build_demo
    gr.Markdown(function_markdown)
/xxx//minigemini/serve/gradio_web_server.py:351: UserWarning: `layout` parameter is deprecated, and it has no effect
  chatbot = gr.Chatbot(
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 472, in <module>
    demo = build_demo(args.embed, concurrency_count=args.concurrency_count)
  File "/xxx//minigemini/serve/gradio_web_server.py", line 394, in build_demo
    regenerate_btn.click(
TypeError: EventListenerMethod.__call__() got an unexpected keyword argument 'concurrency_limit'
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 331, in jsonable_encoder
  return jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 287, in jsonable_encoder
  encoded_value = jsonable_encoder(
File "/xxx//venv/lib64/python3.10/site-packages/fastapi/encoders.py", line 318, in jsonable_encoder
  if isinstance(obj, classes_tuple):
File "/usr/lib64/python3.10/abc.py", line 119, in __instancecheck__
  return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison
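
For what it's worth, the concurrency_limit keyword on event listeners only exists in newer gradio releases, so a mismatch between the version the code was written against and the one installed is a plausible root cause. The least invasive thing to try (whether it also clears the recursion error is not confirmed) is simply:

    pip install -U gradio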

Can you provide a zip of the laion-gpt4v dataset images?

Hi, after downloading the laion-gpt4v images, I only got 11686 images. I am using the JSON index order as the image name to avoid mismatches between the dataset and possibly wrong annotation-to-image mapping. Just to be sure, is the last image index the one shown below?

[screenshot of the last image index]

size mismatch about convnext_large_d_320

[screenshots of the size-mismatch error]

Hello, I downloaded the pre-trained model following the README and placed it in the corresponding location. Which step did I get wrong? Please help.

batch inference

Hi, could you tell me how I can do batch inference?
I have multiple images and a different prompt for each image, so is there a way I can get the output in one go?

AttributeError when running the code: 'list' object has no attribute 'to', at image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(dtype=image_features.dtype, device=image_features.device)

Traceback (most recent call last):
File "/checkpoint/binary/train_package/minigemini/train/train_mem.py", line 14, in
train(attn_implementation="flash_attention_2")
File "/checkpoint/binary/train_package/minigemini/train/train.py", line 1262, in train
trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/checkpoint/binary/train_package/minigemini/model/language_model/mini_gemini_gemma.py", line 87, in forward
) = self.prepare_inputs_labels_for_multimodal(
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 328, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images, images_aux)
File "/checkpoint/binary/train_package/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(
AttributeError: 'list' object has no attribute 'to'
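
A likely cause is that images_aux arrives as a Python list of tensors, which makes the aux tower return a list as well, so the subsequent .to() call fails. One possible workaround, assuming all aux images in the batch share the same resolution (placement and variable names are read off the traceback, not verified against the repo):

    # Stack the aux images into a single batch tensor before calling the aux tower,
    # so the tower returns a tensor that .to() can move/cast.
    if isinstance(images_aux, (list, tuple)):
        images_aux = torch.stack(list(images_aux), dim=0)
    image_aux_features_raw = self.get_model().get_vision_tower_aux()(images_aux).to(
        dtype=image_features.dtype, device=image_features.device)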

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Hello everyone, I am running MiniGemini evaluation on an image with the command:

python -m minigemini.serve.cli  --model-path ./Mini-Gemini-2B/     --image-file replaced_with_path_to_image

then the following OSError emerged:

Traceback (most recent call last):
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 237, in <module>
    main(args)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/builder.py", line 112, in load_pretrained_model
    vision_tower.load_model()
  File "/data1/code_path/Projects/github/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 33, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
  File "home_path/anaconda3/envs/minigeimini/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3144, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory model_zoo/OpenAI/clip-vit-large-patch14-336.

Some people online share a solution: download the corresponding "xxx.index.json" file. But I can't find any "xxx.index.json" file for "clip-vit-large-patch14-336" on the Hugging Face website.

I thought maybe the relative path caused the problem, so I replaced it with an absolute path, but the same error appears.

My environment:
OS: ubuntu 22.04 64 bit
python: 3.10.14
others: other libraries are installed according to MiniGemini's official installation guide.

Does anybody have a solution for this? I would be very grateful, thank you.
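
This error usually just means the local model_zoo/OpenAI/clip-vit-large-patch14-336 directory is missing the actual weight files (e.g. only the config/processor files were copied). One way to repopulate it, assuming the directory is meant to mirror the openai/clip-vit-large-patch14-336 repo on Hugging Face:

    from huggingface_hub import snapshot_download

    # Download the full CLIP checkpoint (weights included) into the path the code expects.
    snapshot_download(
        repo_id="openai/clip-vit-large-patch14-336",
        local_dir="model_zoo/OpenAI/clip-vit-large-patch14-336",
    )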

evaluate MMMU_val of 2B error

Hi, I tried evaluating MMMU_val for the 2B model by running this command

 bash /home/user/minigemini/scripts/gemma/eval/mmmu.sh

but got the error log below, while MMMU_val evaluation on the 7B, 13B and 34B models ran successfully.

Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 207, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/run_llava.py", line 158, in main
    response = call_model_engine(args, sample, model, tokenizer, processor)
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/utils/model_utils_ind.py", line 53, in call_llava_engine_df
    output_ids = model.generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 144, in generate
    return super().generate(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 1648, in generate
    result = self._beam_sample(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/generation/utils.py", line 3402, in _beam_sample
    outputs = self(                      
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/MiniGemini/minigemini/model/language_model/mini_gemini_gemma.py", line 97, in forward
    return super().forward(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 1105, in forward
    outputs = self.model(
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 891, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
  File "/home/user/anaconda2/envs/minigemini/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py", line 984, in _update_causal_mask
    causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
RuntimeError: The size of tensor a (701) must match the size of tensor b (0) at non-singleton dimension 0
scripts/gemma/eval/mmmu.sh: line 31: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
scripts/gemma/eval/mmmu.sh: line 35: MMMU/answers/Mini-Gemini-2B/merge.jsonl: No such file or directory
Traceback (most recent call last):
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 31, in <module>
    main()                               
  File "/home/user/MiniGemini/minigemini/eval/MMMU/eval/eval.py", line 17, in main
    out_samples = [json.loads(line) for line in open(args.result_file)]
FileNotFoundError: [Errno 2] No such file or directory: 'MMMU/answers/Mini-Gemini-2B/merge.jsonl'

Any idea for coping with long video input?

Dear author:
Thanks for publishing the Mini-Gemini paper. Since Gemini 1.5 supports up to an hour of video input at 1 fps sampling, I wonder how to adapt your framework to support long-video training and inference? Thank you.
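
Mini-Gemini itself consumes single images, so any long-video experiment has to start with frame sampling on the user side. A generic 1 fps sampler with OpenCV (nothing here is part of the repo's API) might look like:

    import cv2

    def sample_frames(video_path, fps=1.0):
        # Yield roughly one frame per 1/fps seconds as RGB arrays.
        cap = cv2.VideoCapture(video_path)
        native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        step = max(int(round(native_fps / fps)), 1)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            idx += 1
        cap.release()

How to fit hundreds of such frames into the model's context (token budget, frame-level pooling, etc.) is the open part of the question.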

Possible positional embedding in Patch Info Mining process

Thanks for sharing your great work. I have a small question about the attention process in the patch info mining module. I found that there is no positional embedding for the high-resolution tokens to indicate their position in the corresponding patch. Do you think adding positional embedding here would be helpful and have you tried this?
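
For anyone who wants to experiment with this, a learnable positional embedding added to the high-resolution (aux) tokens before the mining attention is a small, self-contained change. A generic sketch (module and tensor layout are assumptions, not the repo's actual code); it would be applied to the aux tokens before the mining attention is computed:

    import torch
    import torch.nn as nn

    class AuxPosEmbed(nn.Module):
        # Adds a learnable positional embedding to the N high-res tokens inside each patch.
        def __init__(self, num_tokens, dim):
            super().__init__()
            self.pos = nn.Parameter(torch.zeros(1, 1, num_tokens, dim))
            nn.init.trunc_normal_(self.pos, std=0.02)

        def forward(self, embed_aux):  # embed_aux: [B, num_patches, num_tokens, dim]
            return embed_aux + self.pos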

Questions about how to enlarge the base vision tower input resolution

Currently there is an option to use a 2x2 image grid to double the input, but this introduces fairly heavy compute.

I just want to make the input resolution slightly larger, say from 336 -> 448, while keeping the ConvNeXt input resolution the same (although I think it should currently be larger when the base vision tower gets larger).

Is that possible? Could you give me some advice on how to adapt it? (A position-embedding interpolation sketch follows below.)
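
Going from 336 to 448 with a ViT-L/14 tower changes the patch grid from 24x24 to 32x32, so at minimum the CLIP position embeddings need to be interpolated. A generic sketch of just that step (matching the ConvNeXt branch and projector shapes is not covered here):

    import torch
    import torch.nn.functional as F

    def interpolate_pos_embed(pos_embed, old_grid=24, new_grid=32):
        # pos_embed: [1 + old_grid**2, dim] from CLIP; returns [1 + new_grid**2, dim].
        cls_tok, patch_tok = pos_embed[:1], pos_embed[1:]
        dim = patch_tok.shape[-1]
        patch_tok = patch_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
        patch_tok = F.interpolate(patch_tok, size=(new_grid, new_grid),
                                  mode="bicubic", align_corners=False)
        patch_tok = patch_tok.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)
        return torch.cat([cls_tok, patch_tok], dim=0)

The image processor's crop size and the number of visual tokens fed to the projector would have to change accordingly.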

4 bit loading fails

I have tried both the model worker and the CLI; with 4-bit loading enabled, both fail with the error message:

Loading pretrained weights (convnext_large_d_320).
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/xxx/minigemini/serve/cli.py", line 234, in <module>
    main(args)
  File "/xxx/minigemini/serve/cli.py", line 56, in main
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/xxx/minigemini/model/builder.py", line 124, in load_pretrained_model
    model.get_model().initialize_uni_modules(model.config, for_eval=True)
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 213, in initialize_uni_modules
    get_w(projector_weights, 'vision_tower.vision_tower', self.vision_tower, 'vision_tower')
  File "/xxx/minigemini/model/mini_gemini_arch.py", line 209, in get_w
    getattr(main_module, sub_module).to(device=device_type, dtype=weight_type)
  File "/xxx/venv/lib64/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/xxx/venv/lib64/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.uint8

Loading with 8-bit works, but OOMs on my hardware (24 + 24 GB VRAM).
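
The crash comes from initialize_uni_modules casting a module to the dtype of a quantized weight; 4-bit weights report torch.uint8, which nn.Module.to() refuses. A defensive guard around the cast in get_w, sketched from the traceback (exact variable names in the repo may differ):

    # Only pass dtype when it is a floating/complex type; quantized 4-bit/8-bit
    # weights report integer dtypes that nn.Module.to() rejects.
    if weight_type.is_floating_point or weight_type.is_complex:
        getattr(main_module, sub_module).to(device=device_type, dtype=weight_type)
    else:
        getattr(main_module, sub_module).to(device=device_type)

Whether the projector weights then land in a dtype the quantized model can consume is a separate question.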

Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336

Hi, nice work! But when I run the example code python -m minigemini.serve.cli --model-path YanweiLi/Mini-Gemini-13B-HD --image-file Woman.jpg, I get the error ValueError: Not find vision tower: model_zoo/OpenAI/clip-vit-large-patch14-336. How can I solve this problem? Thank you!

Similarities to LLaVA-HR

Congratulations on your great work and solid performance! However, we notice that your core idea and design are highly similar to our previous work LLaVA-HR, especially the dual visual pathways for MLLMs. We would like to see a brief clarification and discussion of our work in your paper. Thank you!

Questions about changing the ViT to a 378 input resolution and getting poor results

Hi, I already tried ViT-336 and ConvNeXt + a Qwen LLM, which worked great and gave really good performance.

But when I try another CLIP ViT model with an input size of 378, keeping everything else the same (including training data), the results are extremely poor.

To be precise:

  1. The loss is lower: normally I get 0.9-1.0, but with the 378 CLIP the loss goes down to 0.7-0.8, yet the inference results are very poor.
  2. The CLIP model I used was Apple's DNFS_vit_G_378 model.
  3. I changed the ConvNeXt input resolution accordingly.

Any idea why? This is really weird: a better and larger ViT gives worse results.

Pretrain data not found in AllaVA

Hi, the pretraining data uses ALLaVA images from both LAION and VFLAN.

But the LAION-part image names are totally different from ALLaVA's image-name format.

I tried to find:

465440.jpeg
320609

They are both used in minigemini_pretrain.json but cannot be found in the ALLaVA images folder.

ls -f images | grep 465440
46544031.jpeg
(base) ➜  allava_laion git:(main) ✗ ls -f images | grep 320609     
132060956.jpeg
43206091.jpeg

Why is that?

Mini-Gemini-2B evaluation error

Hi, I followed the instructions to prepare the data and model, and I get the following error when evaluating Mini-Gemini-2B.

AttributeError: 'OpenCLIPVisionTower' object has no attribute 'vision_stem'

CLI of 2B does not work

How to reproduce:

  1. Install this repo as described in README.md
  2. run the following commands
export HF_ENDPOINT=https://hf-mirror.com

python -m minigemini.serve.cli \
    --model-path ./Mini-Gemini/Mini-Gemini-2B \
    --image-file ./images/demo_gen.png  \
    --debug

Results:

The model does not generate anything since stop_str is set to the EMPTY STRING ''. After fixing this, I also found that the chat history is not preserved in the prompt, which makes the multi-turn conversation results unexpected.

Questions about the code implementation

Hello authors, thank you very much for open-sourcing this work. I have two questions:

1. ConvNeXt drop path

ConvNeXt's drop path is 0.1. Even though you configure the tower as frozen during training, the drop-path branch is still executed. In principle, ConvNeXt should be put in eval mode during training, right? I could not find the relevant code, which seems odd, and I would like to understand this.

2. Code robustness

Training with the CLIP + ConvNeXt combination works without any problem. But I wanted to try SigLIP, and after merely swapping CLIP for SigLIP (with separate means and stds for the two towers), the model produces NaN at iter=2. My annotated code is below:

        # token attention
        embed_query = self.vlm_uni_query_projector(images)
        embed_aux = self.vlm_uni_aux_projector(images_aux)
        embed_value = self.vlm_uni_val_projector(images_aux)
        # TODO with siglip+convnext the first forward is fine, but embed_att becomes NaN,
        # TODO which makes embed_value NaN in the second iteration, so training cannot proceed.
        # TODO probably a feature mismatch; NaN still appears even with everything cast to fp32. Needs further investigation.
        embed_att = embed_query[:, :, None] @ (embed_aux.transpose(-1, -2) / (embed_aux.shape[-1] ** 0.5))
        # print('=xxxx=', torch.any(torch.isnan(embed_query)).item(),
        #       torch.any(torch.isnan(embed_aux)).item(),
        #       torch.any(torch.isnan(embed_value)).item(),
        #       torch.any(torch.isnan(embed_att)).item())
        embed_att = embed_att.nan_to_num()
        embed_feat = (embed_att.softmax(-1) @ embed_value).mean(2)
        # print('=xxcccxx=', torch.any(torch.isnan(embed_feat)).item())
        image_features = images + embed_feat
        return image_features

I wonder what the authors think about this. One detail is different: since SigLIP's input is 384x384 and its output grid is 27x27, I had to set the ConvNeXt input resolution to 864 so that the two feature maps are fully aligned spatially.

Looking forward to your reply.

Question about the number of tokens

For the llama2-7b model, the max token length is 2048. With the stage-2 training parameters, once IMAGE_GRID=2 and IMAGE_GLOBAL=True are set, the two lines
image_features = torch.cat([image_feat_global, image_features], dim=1)
image_aux_features = torch.cat([image_aux_feat_global, image_aux_features], dim=1)
end up producing 2880 image-feature tokens. Doesn't that exceed the limit? Please correct me if my understanding is wrong.
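
For reference, 2880 is consistent with 576 visual tokens per 336-px CLIP crop (ViT-L/14: (336/14)^2 = 576): IMAGE_GRID=2 gives 2 x 2 = 4 crops plus 1 global view, so (4 + 1) x 576 = 2880 tokens, which would indeed overflow a 2048 context unless the max length is raised (the per-crop token count is an assumption based on the default CLIP tower).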

Limitations on current image generating method

Hi, I found that the current image-generation approach makes it hard to use the input image as a reference:

[screenshot of the generation result]

Any thoughts on this?

BTW, I found the loss at this stage of pretraining is very high:

{'loss': 2.573, 'learning_rate': 0.0007701925673852566, 'epoch': 0.34}                                                                                                                                                           
{'loss': 2.8217, 'learning_rate': 0.0007698799612970509, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.5646, 'learning_rate': 0.0007695672062744539, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6857, 'learning_rate': 0.0007692543024900611, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6944, 'learning_rate': 0.0007689412501165496, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6418, 'learning_rate': 0.0007686280493266786, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6801, 'learning_rate': 0.0007683147002932893, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.7245, 'learning_rate': 0.0007680012031893049, 'epoch': 0.34}                                                                                                                                                          
{'loss': 2.6275, 'learning_rate': 0.0007676875581877296, 'epoch': 0.34}

TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

Hi, I'm trying to use MiniGemini outside of the demo environment, but am running into the following error when calling model.generate():

  File "/app/backend/minigemini.py", line 104, in chat_with_images
    output_ids = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/MiniGemini/minigemini/model/language_model/mini_gemini_mixtral.py", line 142, in generate
    return super().generate(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
              ^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiMixtralForCausalLM.forward() got an unexpected keyword argument 'output_router_logits'

I have MiniGemini-2B working, but Mixtral is still giving me some trouble. The only relevant reference I can find is: https://www.opensourceagenda.com/projects/transformers/versions, which mentions output_router_logits was removed in transformers 4.39.0.

I see a similar error with Mini-Gemini-7B:

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'

I'm using:
transformers 4.39.3
accelerate-0.29.1
torch 2.2.2
torchvision 0.17.2

I confess, the dependencies for MiniGemini are a real challenge to integrate, but I hope it can get resolved.

How to get 13K generation-related instructions dataset?

Dear author:

Thanks for your interesting work.

But I am still confused about following questions:

  1. How do we get the 13K generation-related instruction dataset?
  2. How do we tune only the LLM for generation (or tune the LLM for understanding and generation separately)?

Looking forward to your reply~
Thanks!!

performing finetune on top of mini-gemini-8x7b-HD

When performing fine-tuning on top of the Mini-Gemini-8x7B-HD model using the following config:

#!/bin/bash
PRETRAIN_NAME=Mini-Gemini-8x7B-Pretrain
FINETUNE_NAME=Mini-Gemini-8x7B-HD
AUX_SIZE=1536
IMAGE_GRID=2
IMAGE_GLOBAL=True
LR_MULTI="model.mm_projector:2,model.vlm_uni:2"

# delete --hostfile hostfile_4 and change --per_device_train_batch_size if trained on single machine

deepspeed minigemini/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path Mini-Gemini-8x7B-HD \
    --version mistral_instruct \
    --data_path data/minigemini_instruction.json \
    --image_folder data/figures \
    --vision_tower model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --image_grid $IMAGE_GRID \
    --image_global $IMAGE_GLOBAL \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir ./work_dirs/experiment-2/$FINETUNE_NAME \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "steps" \
    --save_steps 2 \
    --save_total_limit 1 \
    --learning_rate 8e-4 \
    --lr_multi $LR_MULTI \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 64 \
    --lazy_preprocess True \
    --report_to wandb

The final model has only 4 safetensors files instead of 20. Why is that? Thanks.

No image generated when running python -m minigemini.serve.cli --gen

Dear author:

Thanks for your interesting work.

When I run the following command with the input "What is unusual about this image?", no image is generated in the output:

python -m minigemini.serve.cli \
    --model-path work_dirs/Mini-Gemini/Mini-Gemini-2B \
    --image-file examples/extreme_ironing.jpg \
    --gen

[screenshot of the CLI output]

I wonder whether the Mini-Gemini-2B model simply lacks the ability to generate images?

And if fine-tuning is needed, which datasets should be used to make the model output <h> ... </h>?

Thanks!!

Training with a customized LLM: loss stays rather high

Hi, I am using Qwen2 4B as the LLM to train the model. The chat-template handling I previously tried with LLaVA works fine, and no warnings appeared.

But when pretraining with minigemini, I got an unexpected loss curve:

{'loss': 1.9597, 'learning_rate': 0.0007436649460805199, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8565, 'learning_rate': 0.0007435934693368483, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0595, 'learning_rate': 0.0007435219860653267, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9304, 'learning_rate': 0.0007434504962678705, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9705, 'learning_rate': 0.0007433789999463957, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9536, 'learning_rate': 0.000743307497102818, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.9405, 'learning_rate': 0.0007432359877390538, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8501, 'learning_rate': 0.0007431644718570192, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8722, 'learning_rate': 0.000743092949458631, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.6984, 'learning_rate': 0.0007430214205458056, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8309, 'learning_rate': 0.0007429498851204598, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8681, 'learning_rate': 0.0007428783431845109, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.9383, 'learning_rate': 0.0007428067947398757, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.8059, 'learning_rate': 0.000742735239788472, 'epoch': 0.36}                                                                                                                                                             
{'loss': 1.8016, 'learning_rate': 0.0007426636783322172, 'epoch': 0.36}                                                                                                                                                            
{'loss': 2.0427, 'learning_rate': 0.0007425921103730288, 'epoch': 0.36}                                                                                                                                                            
{'loss': 1.899, 'learning_rate': 0.0007425205359128248, 'epoch': 0.36}  

It looks stuck at ~1.8 and no longer decreases. What could be the reason? (The data is exactly the same.)

(I tried using the chat template in both pretrain and finetune, the same as Mini-Gemini does.)

images_aux encode error

When using clip.py for inference, I encountered the following error. How should I solve it?

File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/mini_gemini_arch.py", line 255, in encode_images
if images_aux is not None:
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 58, in forward
image_features = self.image_forward(images)
File "/mnt/bn/codegen-finetune/projects/MiniGemini/minigemini/model/multimodal_encoder/clip_encoder.py", line 50, in image_forward
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 917, in forward
return self.vision_model(
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 841, in forward
hidden_states = self.embeddings(pixel_values)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/vl_model/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 187, in forward
embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (2305) must match the size of tensor b (50) at non-singleton dimension 1

issues running cli inference example

command: python -m minigemini.serve.cli --model-path Mini-Gemini-34B-HD --image-file test_image.png

  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 241, in <module>
    main(args)
  File "/home/paperspace/MiniGemini/minigemini/serve/cli.py", line 200, in main
    output_ids = model.generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/minigemini/model/language_model/mini_gemini_llama.py", line 183, in generate
    return super().generate(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1575, in generate
    result = self._sample(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2697, in _sample
    outputs = self(
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/paperspace/MiniGemini/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: MiniGeminiLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'

batching giving weird outputs

Hi, I noticed that when doing batch inference with a static prompt like 'Describe the image', the model gives wrong output like 'in detail', as if it is just doing sentence completion. Whereas if I try a more descriptive prompt, where I tell Mini-Gemini that it is a 'prompt generator', then it goes into sentence-completion mode and gives me an okay-ish response.

However, I have the original image description also, so I tried adding those in the prompt and then asking the model to describe the image, given the information about the image. This works perfectly fine when I just use 1 image at a time.

But when doing batch processing I get completely garbage outputs. To batch, I use padding to get the prompts to the same shape. I do this by changing line 44 in MiniGemini/minigemini/mm_utils.py
to
tokenizer(chunk, padding='max_length', max_length=max_len).input_ids for chunk in prompt.split('<image>')

Could you give me any advice on how to do this effectively?
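
Right-padding between the prompt and the generated tokens is a common cause of garbage outputs in batched decoding. A generic left-padding sketch with an explicit attention mask (standard transformers usage, not MiniGemini-specific; tokenizer, model and prompts are assumed to be already loaded as in cli.py, and the image tensors still need to be batched separately):

    # Left-pad so every prompt ends exactly where generation starts.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    batch = tokenizer(prompts, padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        max_new_tokens=256,
    )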

How many SAM images were used from ShareGPT4v?

I downloaded the ShareGPT4V data used for the fine-tuning part, but I keep getting image-not-found errors; I am running the finetune stage.

Does the finetune stage use the ShareGPT4V pretrain data?

The ShareGPT4V finetune set only uses a very small amount of data from SAM.

Do we have to download the whole 500 GB of sam_000000 - 0000050 images for it?

Training the model hits an error

TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType

I am using Qwen as the LLM and got the above error. What could be the reason? I have tried:

if tokenizer.pad_token_id is None:
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.unk_token
    tokenizer.pad_token_id = tokenizer.encode(tokenizer.pad_token)

This didn't make it work; any help would be appreciated.
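
One likely reason the snippet above still fails: tokenizer.encode() returns a list, so pad_token_id ends up as a list rather than the number that pad_sequence's padding_value expects, and Qwen tokenizers also have no unk_token to fall back on. A safer sketch (using eos as pad is an assumption; pick whatever special token fits your setup):

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # Qwen has no unk_token
    # convert_tokens_to_ids returns a plain int, which padding_value accepts
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)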

Loss suddenly drops to 0 and stays at 0

Starting from the provided stage-1 pretrained model Mini-Gemini-7B-Pretrain and the default config scripts/llama/train/stage_2_full_v7b_672_hr_1536.sh, and training stage 2 with only minigemini_generation_pure_text.json, the loss suddenly drops to 0 during training (after about 10 steps) and stays at 0 for the rest of training.

[screenshot of the loss log]

@yanwei-li @yukang2017 @wcy1122 After printing the intermediate variables, I can see that shift_logits becomes NaN, which makes the loss NaN. Is this normal?

[screenshot of the printed intermediate variables]

datasets preparation

There are some image sources missing in the preparation stage:
sam/images/: 19982
wikiart/images/: 500
share_textvqa/images/: 500
web-celebrity/images/: 498
web-landmark/images/: 500
Could you help add these datasets?

Fixing the pretrain and SFT stage ALLaVA images issue

Hi, as a previous issue raised, the data is wrong for everything sourced from ALLaVA: they have changed the image names.

What's even worse, even when mapping the image names by URL against the latest ALLaVA caption JSON, some images still cannot be found at all in the latest ALLaVA release.

For example:

cat ALLaVA-Caption-LAION-4V.json | grep 'https://slideplayer.it/slide/553401/1/images/40/Delayed+relaxation+filling+pattern.jpg' -C 9

This URL exists in the minigemini data but is gone from the latest ALLaVA images.

Therefore, please help fix the data first; it is urgent. This blocks anyone who wants to reproduce the minigemini training and scores.

Begin training loss a little high

Hi, the loss at the beginning of training is a little high; is this normal?

{'loss': 6.7177, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.7738, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.801, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                              
{'loss': 6.5226, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                                                                                             
{'loss': 6.813, 'learning_rate': 7.633587786259541e-06, 'epoch': 0.0}                                                                                                                                                            
{'loss': 6.6355, 'learning_rate': 1.5267175572519083e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.3212, 'learning_rate': 2.2900763358778628e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 6.1449, 'learning_rate': 3.0534351145038166e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.8368, 'learning_rate': 3.816793893129771e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.9083, 'learning_rate': 4.5801526717557256e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.7246, 'learning_rate': 5.3435114503816794e-05, 'epoch': 0.0}                                                                                                                                                          
{'loss': 5.6311, 'learning_rate': 6.106870229007633e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.7906, 'learning_rate': 6.870229007633588e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.6311, 'learning_rate': 7.633587786259542e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 5.1357, 'learning_rate': 8.396946564885496e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.9726, 'learning_rate': 9.160305343511451e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.7747, 'learning_rate': 9.923664122137405e-05, 'epoch': 0.0}                                                                                                                                                           
{'loss': 4.8886, 'learning_rate': 0.00010687022900763359, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.7224, 'learning_rate': 0.00011450381679389313, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.4193, 'learning_rate': 0.00012213740458015266, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.5041, 'learning_rate': 0.00012977099236641222, 'epoch': 0.0}                                                                                                                                                          
{'loss': 4.3678, 'learning_rate': 0.00013740458015267177, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.3701, 'learning_rate': 0.0001450381679389313, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.505, 'learning_rate': 0.00015267175572519084, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5022, 'learning_rate': 0.00016030534351145037, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2254, 'learning_rate': 0.00016793893129770992, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.339, 'learning_rate': 0.00017557251908396944, 'epoch': 0.01}                                                                                                                                                          
{'loss': 4.5537, 'learning_rate': 0.00018320610687022902, 'epoch': 0.01}                                                                                                                                                         
{'loss': 4.2307, 'learning_rate': 0.00019083969465648857, 'epoch': 0.01}   

How does the loss curve look on your side? (I am using a 4B LLM.)
