dhansmair / flamingo-mini
Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training.
License: MIT License
Is there an example that shows how to perform few-shot multimodal prompting?
Error when running caption = model.generate_captions(processor, images=[image]) in image_captioning.ipynb.
What is the best way to run inference from code?
I also tried the pipeline API, but am missing use_auth_token:
pipe = pipeline(task='image-to-text', model='dhansmair/flamingo')
Thanks in advance! Great work
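For anyone looking for a starting point, here is a minimal inference sketch pieced together from the notebook cells quoted elsewhere on this page; the checkpoint id and the FlamingoProcessor construction are assumptions and may need adjusting to the repository's actual API:

import torch
from PIL import Image
from flamingo_mini import FlamingoModel, FlamingoProcessor

# assumption: 'dhansmair/flamingo-mini' is the intended checkpoint id
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FlamingoModel.from_pretrained('dhansmair/flamingo-mini')
model.to(device)
model.eval()
processor = FlamingoProcessor(model.config)   # assumption: processor is built from the model config

image = Image.open('cat.jpg')                 # any PIL image
caption = model.generate_captions(processor, images=[image])   # as in the image_captioning notebook
print(caption)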
parameters_trainable() and state_dict_trainable() do not include the token embedding matrices. This is problematic since the special token <EOC> is added to the embeddings, but the embedding is not learned.
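A possible workaround, as a sketch only: hand the token embedding matrix to the optimizer explicitly so the <EOC> row can be trained. Here parameters_trainable() is assumed to be a method on the model, and the model.flamingo.lm path is taken from the tracebacks quoted elsewhere on this page.

import torch

# assumption: parameters_trainable() is callable on the FlamingoModel instance
trainable_params = list(model.parameters_trainable())

# the wrapped Hugging Face language model exposes its token embeddings
embeddings = model.flamingo.lm.get_input_embeddings()
embeddings.weight.requires_grad = True

# optimize the usual trainable parameters plus the embedding matrix
optimizer = torch.optim.AdamW(trainable_params + [embeddings.weight], lr=1e-4)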
Hi,
I am new to multimodal models and have several questions about the training process. First, how do you feed the model the training data? Do you insert the image features as tokens, or do you feed the model the new tokens created in the resampler? Is the input the image and the label the text, or do you insert some extra text in the input? Secondly, what kind of hardware did you use to train the model, and how long did training take? Lastly, what are the main differences between this project and the Flamingo published by DeepMind?
Thank you in advance.
Hello dhansmair!
First and foremost: Thank you for your effort in this repository and for making it public. Much appreciated.
Currently, I am simply playing around with flamingo-mini and following the examples you provided. While I understand the differences from the original paper, I would like to ask the following question:
Is it possible to prompt the model with text only input and get a response from the model for the already cached image?
E.g.
1-) An image of a cat is provided to the model.
2-) The model successfully generates a caption.
3-) The user enters a text prompt (I do not know how the formatting should be), such as "What is the color of this cat?".
4-) The model generates a text only response using the already cached cat image.
Do you think such a workflow is possible in the current state of the repository?
Thank you for your help!
Hi, the notebook throws the following error from the final cell, when the "multimodal_prompt" function is called:
TypeError: type object got multiple values for keyword argument 'past_key_values'
Any suggestions?
Error:
TypeError Traceback (most recent call last)
Input In [6], in <module>
1 prompt = "<image>Output: A cat wearing sunglasses.<EOC><image>Output: Elephants walking in the Savanna.<EOC><image>Output: "
----> 2 response = multimodal_prompt(model, processor, prompt, images=[cat_image, elephants_image, flamingo_image], device=device)
3 print('prompt:', prompt)
4 print('output:', response)
Input In [2], in multimodal_prompt(model, processor, prompt, images, device)
9 pixels = processor(images, device=device)['pixel_values']
10 pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)
---> 12 output = model.generate(
13 inputs=input_ids,
14 media_locations=media_locations,
15 attention_mask=attention_mask,
16 pixel_values=pixels,
17 max_length=150,
18 use_cache=True,
19 early_stopping=True,
20 bos_token_id=model.flamingo.lm.config.bos_token_id,
21 eos_token_id=model.flamingo.lm.config.eos_token_id,
22 pad_token_id=model.flamingo.lm.config.eos_token_id
23 )
25 response = processor.tokenizer.batch_decode(output, skip_special_tokens=True)
26 return response[0]
File ~/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1391, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1385 raise ValueError(
1386 f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
1387 " greedy search."
1388 )
1390 # 11. run greedy search
-> 1391 return self.greedy_search(
1392 input_ids,
1393 logits_processor=logits_processor,
1394 stopping_criteria=stopping_criteria,
1395 pad_token_id=generation_config.pad_token_id,
1396 eos_token_id=generation_config.eos_token_id,
1397 output_scores=generation_config.output_scores,
1398 return_dict_in_generate=generation_config.return_dict_in_generate,
1399 synced_gpus=synced_gpus,
1400 **model_kwargs,
1401 )
1403 elif is_contrastive_search_gen_mode:
1404 if generation_config.num_return_sequences > 1:
File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:2176, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
2173 break
2175 # prepare model inputs
-> 2176 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2178 # forward pass to get next token
2179 outputs = self(
2180 **model_inputs,
2181 return_dict=True,
2182 output_attentions=output_attentions,
2183 output_hidden_states=output_hidden_states,
2184 )
File ~/.local/lib/python3.8/site-packages/flamingo_mini/modeling_flamingo.py:513, in FlamingoModel.prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
510 if past is not None:
511 input_ids = input_ids[:, -1:]
--> 513 return dict(
514 input_ids=input_ids,
515 past_key_values=past,
516 media_locations=media_locations,
517 attention_mask=attention_mask,
518 pixel_values=pixel_values,
519 visual_features=visual_features,
520 **kwargs
521 )
TypeError: type object got multiple values for keyword argument 'past_key_values'
Usually LMs use tied weights between the token embedding matrix and the linear output layer lm_head. Right now, the flamingo simply copies the weights, meaning that they occupy twice the memory.
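For comparison, weight tying in PyTorch is a single shared Parameter rather than a copy; a minimal sketch of the difference (the layer sizes here are illustrative only):

import torch.nn as nn

vocab_size, d_model = 50258, 768          # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# copying (what the issue describes): two separate tensors, twice the memory,
# and updates to one are not reflected in the other
lm_head.weight.data.copy_(embed.weight.data)

# tying: both modules reference the same Parameter, so memory is shared
# and the embedding and output projection stay in sync during training
lm_head.weight = embed.weight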
Hi, I'm unsure about this piece of code in MaskedCrossAttention inside gated_cross_attention.py:
media_time = torch.arange(n_media, device=y.device) + 1
# >> David:
# side note: here, text tokens attend to ALL previous visual tokens. If we only want to attend to the
# one image coming before in the text (like in the flamingo paper),
# we need to change >= to == at the line where 'text_to_media_mask' is created.
text_to_media_mask = rearrange(text_time, 'b i -> b 1 i 1') == repeat(media_time, 'j -> 1 1 1 (j m)', m=self.n_visual)
sim = sim.masked_fill(~text_to_media_mask, -torch.finfo(sim.dtype).max)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
It seems you are setting the positions you want to mask out to -torch.finfo(sim.dtype).max (a large negative number), but then finding the largest value sim.amax to normalize by?
I would think it should be:
sim = sim.masked_fill(~text_to_media_mask, torch.finfo(sim.dtype).max)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
Any clarification on this logic is appreciated. Thanks!
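For context, filling masked positions with a large negative value (not a positive one) is the standard numerically stable masked-softmax pattern: after subtracting the row max and applying softmax, those positions receive (near-)zero attention weight. A small self-contained sketch, independent of the repository's code:

import torch

sim = torch.randn(1, 1, 4, 6)                        # attention logits [b, h, i, j]
mask = torch.zeros(1, 1, 4, 6, dtype=torch.bool)
mask[..., :3] = True                                  # only the first 3 key positions may be attended

# masked positions get a very large negative logit ...
sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
# ... subtracting the (unmasked) row max keeps the exponentials in range ...
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
# ... and softmax sends the masked logits to ~0 probability
attn = sim.softmax(dim=-1)
print(attn[..., 3:].max())                            # ~0: masked positions get no attention weight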
The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 2
Anyone else getting this error?
ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)
command-1272036722344618> in <cell line: 1>()
----> 1 from flamingo_mini import FlamingoConfig, FlamingoModel, FlamingoProcessor
2
3 # create a model for training
4 device = ...
5 config = FlamingoConfig(...)
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
169 # Import the desired module. If you’re seeing this while debugging a failed import,
170 # look at preceding stack frames for relevant error information.
--> 171 original_result = python_builtin_import(name, globals, locals, fromlist, level)
172
173 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/__init__.py in <module>
----> 1 from .flamingo_processor import FlamingoProcessor
2 from .configuration_flamingo import FlamingoConfig
3 from .modeling_flamingo import FlamingoModel
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
169 # Import the desired module. If you’re seeing this while debugging a failed import,
170 # look at preceding stack frames for relevant error information.
--> 171 original_result = python_builtin_import(name, globals, locals, fromlist, level)
172
173 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/flamingo_processor.py in <module>
4
5 import torch
----> 6 from transformers import CLIPImageProcessor
7
8 from .configuration_flamingo import FlamingoConfig
ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)
In principle, the FlamingoModel expects the visual_features input to have shape [b N T v d], where b=batch size, N=number of media (images/videos), T=number of frames (T=1 for images, T>1 for videos), v=number of visual features (=number of patches) and d=visual feature dimensionality.
So while in principle it should be able to digest videos, the perceiver resampler is based on lucidrains' implementation https://github.com/lucidrains/flamingo-pytorch/blob/10913abbc8b2ceabb2320560d7d9b85fcb85eee3/flamingo_pytorch/flamingo_pytorch.py#L74 and I haven't checked if it works.
Also, the FlamingoProcessor currently has no functionality to preprocess videos, and I won't add it anytime soon.
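A sketch of shaping image inputs accordingly; the processor call and the einops repeat mirror the notebook cell quoted earlier on this page, the feature sizes at the end are illustrative only, and processor, image and device are assumed to be defined as in the notebook:

import torch
from einops import repeat

# single image per sample: N = 1 medium, T = 1 frame
pixels = processor([image], device=device)['pixel_values']         # [N, c, h, w]
pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)         # [b, N, T, c, h, w]

# precomputed visual features would instead be passed as [b, N, T, v, d],
# e.g. v = number of patches and d = the vision encoder's feature dimension
visual_features = torch.randn(1, 1, 1, 50, 768)                     # illustrative shapes only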
Right now, the token for the "<" character is used to identify the position of an <image> tag in a sentence. This means any other "<" character will also be wrongly identified as a media location.
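A sketch of what such a media-location mask amounts to; the exact token handling in the repository may differ, and the token-id lookup here is an assumption:

# mark every position whose token is "<" as a media location;
# this is why a stray "<" in normal text is also (wrongly) picked up
lt_token_id = processor.tokenizer.convert_tokens_to_ids('<')
media_locations = input_ids == lt_token_id    # bool tensor, same shape as input_ids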
When trying the image captioning example, the following error message is generated for the last cell:
Cell:
caption = model.generate_captions(processor, images=[image])
print('generated caption:')
print(caption)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-5-749dca7b9c67>](https://localhost:8080/#) in <module>
----> 1 caption = model.generate_captions(processor, images=[image])
2 print('generated caption:')
3 print(caption)
5 frames
[/usr/local/lib/python3.8/dist-packages/flamingo_mini/modeling_flamingo.py](https://localhost:8080/#) in prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
511 input_ids = input_ids[:, -1:]
512
--> 513 return dict(
514 input_ids=input_ids,
515 past_key_values=past,
TypeError: type object got multiple values for keyword argument 'past_key_values'
As discussed in lucidrains/flamingo-pytorch#4 (comment), pretraining on CLIP can be problematic.
Can you suggest a checkpoint other than openai/clip-vit-base-patch32 for the vision encoder, something more task-agnostic?
I am struggling with X-ray images.
Thanks for this excellent work.
In the README, you mentioned that the training script is available at https://github.com/dhansmair/flamingo-mini/tree/hf_trainer. However, this URL returns "page not found". Could you please let me know the correct URL for the training script?
It makes sense to always use use_cache=True anyway, so this is a minor problem.
The reason for the error is that inside of .generate(), the created tokens are appended to input_ids, but media_locations is not extended accordingly. media_locations should have the same shape as input_ids.
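A sketch of the kind of fix this implies (hypothetical, not the repository's actual code): whenever generate() appends a new token to input_ids, media_locations would need to be padded with False so both tensors keep the same shape.

import torch

# inside the generation loop, after a new token has been appended to input_ids:
# input_ids:       [batch, seq_len + 1]
# media_locations: [batch, seq_len]  -> pad with False for the new position
pad = torch.zeros(media_locations.size(0), 1, dtype=torch.bool,
                  device=media_locations.device)
media_locations = torch.cat([media_locations, pad], dim=1)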