dhansmair / flamingo-mini
Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training.
License: MIT License
Is there an example that shows how to perform few-shot multimodal prompting?
Error when running caption = model.generate_captions(processor, images=[image]) in image_captioning.ipynb.
What is the best way to run inference from code?
I also tried the pipeline API, but am missing use_auth_token:
pipe = pipeline(task='image-to-text', model='dhansmair/flamingo')
Thanks in advance! Great work
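For anyone looking for a starting point, here is a minimal inference sketch pieced together from the notebook cells quoted elsewhere on this page; the checkpoint id and the FlamingoProcessor construction are assumptions and may need adjusting to the repository's actual API:

import torch
from PIL import Image
from flamingo_mini import FlamingoModel, FlamingoProcessor

# assumption: 'dhansmair/flamingo-mini' is the intended checkpoint id
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FlamingoModel.from_pretrained('dhansmair/flamingo-mini')
model.to(device)
model.eval()
processor = FlamingoProcessor(model.config)   # assumption: processor is built from the model config

image = Image.open('cat.jpg')                 # any PIL image
caption = model.generate_captions(processor, images=[image])   # as in the image_captioning notebook
print(caption)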
parameters_trainable() and state_dict_trainable() do not include the token embedding matrices. This is problematic since the special token <EOC> is added to the embeddings, but the embedding is not learned.
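A possible workaround, as a sketch only: hand the token embedding matrix to the optimizer explicitly so the <EOC> row can be trained. Here parameters_trainable() is assumed to be a method on the model, and the model.flamingo.lm path is taken from the tracebacks quoted elsewhere on this page.

import torch

# assumption: parameters_trainable() is callable on the FlamingoModel instance
trainable_params = list(model.parameters_trainable())

# the wrapped Hugging Face language model exposes its token embeddings
embeddings = model.flamingo.lm.get_input_embeddings()
embeddings.weight.requires_grad = True

# optimize the usual trainable parameters plus the embedding matrix
optimizer = torch.optim.AdamW(trainable_params + [embeddings.weight], lr=1e-4)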
Hi,
I am new to multimodal models and have several questions about the training process. First, how do you feed the model the training data? Do you insert the image features as tokens, or do you feed the model the new tokens created in the resampler? Is the input the image and the label the text, or do you insert some extra text in the input? Secondly, what kind of hardware did you use to train the model, and how long did training take? Lastly, what are the main differences between this project and the Flamingo published by DeepMind?
Thank you in advance.
Hello dhansmair!
First and foremost: Thank you for your effort in this repository and for making it public. Much appreciated.
Currently, I am simply playing around with flamingo-mini and following the examples you provided. While I understand the differences from the original paper, I would like to ask the following question:
Is it possible to prompt the model with text only input and get a response from the model for the already cached image?
E.g.
1-) An image of a cat is provided to the model.
2-) The model successfully generates a caption.
3-) The user enters a text prompt (I do not know how the formatting should be), such as "What is the color of this cat?".
4-) The model generates a text only response using the already cached cat image.
Do you think such a workflow is possible in the current state of the repository?
Thank you for your help!
Hi, the notebook throws the following error from the final cell, when the "multimodal_prompt" function is called:
TypeError: type object got multiple values for keyword argument 'past_key_values'
Any suggestions?
Error:
TypeError Traceback (most recent call last)
Input In [6], in <module>
1 prompt = "<image>Output: A cat wearing sunglasses.<EOC><image>Output: Elephants walking in the Savanna.<EOC><image>Output: "
----> 2 response = multimodal_prompt(model, processor, prompt, images=[cat_image, elephants_image, flamingo_image], device=device)
3 print('prompt:', prompt)
4 print('output:', response)
Input In [2], in multimodal_prompt(model, processor, prompt, images, device)
9 pixels = processor(images, device=device)['pixel_values']
10 pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)
---> 12 output = model.generate(
13 inputs=input_ids,
14 media_locations=media_locations,
15 attention_mask=attention_mask,
16 pixel_values=pixels,
17 max_length=150,
18 use_cache=True,
19 early_stopping=True,
20 bos_token_id=model.flamingo.lm.config.bos_token_id,
21 eos_token_id=model.flamingo.lm.config.eos_token_id,
22 pad_token_id=model.flamingo.lm.config.eos_token_id
23 )
25 response = processor.tokenizer.batch_decode(output, skip_special_tokens=True)
26 return response[0]
File ~/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1391, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1385 raise ValueError(
1386 f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
1387 " greedy search."
1388 )
1390 # 11. run greedy search
-> 1391 return self.greedy_search(
1392 input_ids,
1393 logits_processor=logits_processor,
1394 stopping_criteria=stopping_criteria,
1395 pad_token_id=generation_config.pad_token_id,
1396 eos_token_id=generation_config.eos_token_id,
1397 output_scores=generation_config.output_scores,
1398 return_dict_in_generate=generation_config.return_dict_in_generate,
1399 synced_gpus=synced_gpus,
1400 **model_kwargs,
1401 )
1403 elif is_contrastive_search_gen_mode:
1404 if generation_config.num_return_sequences > 1:
File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:2176, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
2173 break
2175 # prepare model inputs
-> 2176 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2178 # forward pass to get next token
2179 outputs = self(
2180 **model_inputs,
2181 return_dict=True,
2182 output_attentions=output_attentions,
2183 output_hidden_states=output_hidden_states,
2184 )
File ~/.local/lib/python3.8/site-packages/flamingo_mini/modeling_flamingo.py:513, in FlamingoModel.prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
510 if past is not None:
511 input_ids = input_ids[:, -1:]
--> 513 return dict(
514 input_ids=input_ids,
515 past_key_values=past,
516 media_locations=media_locations,
517 attention_mask=attention_mask,
518 pixel_values=pixel_values,
519 visual_features=visual_features,
520 **kwargs
521 )
TypeError: type object got multiple values for keyword argument 'past_key_values'
Usually LMs use tied weights between the token embedding matrix and the linear output layer lm_head. Right now, the flamingo simply copies the weights, meaning that they occupy twice the memory.
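For comparison, weight tying in PyTorch is a single shared Parameter rather than a copy; a minimal sketch of the difference (the layer sizes here are illustrative only):

import torch.nn as nn

vocab_size, d_model = 50258, 768          # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# copying (what the issue describes): two separate tensors, twice the memory,
# and updates to one are not reflected in the other
lm_head.weight.data.copy_(embed.weight.data)

# tying: both modules reference the same Parameter, so memory is shared
# and the embedding and output projection stay in sync during training
lm_head.weight = embed.weight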
Hi, I'm unsure about this piece of code in MaskedCrossAttention inside gated_cross_attention.py:
media_time = torch.arange(n_media, device=y.device) + 1
# >> David:
# side note: here, text tokens attend to ALL previous visual tokens. If we only want to attend to the
# one image coming before in the text (like in the flamingo paper),
# we need to change >= to == at the line where 'text_to_media_mask' is created.
text_to_media_mask = rearrange(text_time, 'b i -> b 1 i 1') == repeat(media_time, 'j -> 1 1 1 (j m)', m=self.n_visual)
sim = sim.masked_fill(~text_to_media_mask, -torch.finfo(sim.dtype).max)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
It seems you are setting the positions you want to mask out to -torch.finfo(sim.dtype).max (a large negative number), but then finding the largest value sim.amax to normalize by?
I would think it should be:
sim = sim.masked_fill(~text_to_media_mask, torch.finfo(sim.dtype).max)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
Any clarification on this logic is appreciated. Thanks!
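For context, filling masked positions with a large negative value (not a positive one) is the standard numerically stable masked-softmax pattern: after subtracting the row max and applying softmax, those positions receive (near-)zero attention weight. A small self-contained sketch, independent of the repository's code:

import torch

sim = torch.randn(1, 1, 4, 6)                        # attention logits [b, h, i, j]
mask = torch.zeros(1, 1, 4, 6, dtype=torch.bool)
mask[..., :3] = True                                  # only the first 3 key positions may be attended

# masked positions get a very large negative logit ...
sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
# ... subtracting the (unmasked) row max keeps the exponentials in range ...
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
# ... and softmax sends the masked logits to ~0 probability
attn = sim.softmax(dim=-1)
print(attn[..., 3:].max())                            # ~0: masked positions get no attention weight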
The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 2
Anyone else getting this error?
ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)
command-1272036722344618> in <cell line: 1>()
----> 1 from flamingo_mini import FlamingoConfig, FlamingoModel, FlamingoProcessor
2
3 # create a model for training
4 device = ...
5 config = FlamingoConfig(...)
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
169 # Import the desired module. If you’re seeing this while debugging a failed import,
170 # look at preceding stack frames for relevant error information.
--> 171 original_result = python_builtin_import(name, globals, locals, fromlist, level)
172
173 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/__init__.py in <module>
----> 1 from .flamingo_processor import FlamingoProcessor
2 from .configuration_flamingo import FlamingoConfig
3 from .modeling_flamingo import FlamingoModel
/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
169 # Import the desired module. If you’re seeing this while debugging a failed import,
170 # look at preceding stack frames for relevant error information.
--> 171 original_result = python_builtin_import(name, globals, locals, fromlist, level)
172
173 is_root_import = thread_local._nest_level == 1
/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/flamingo_processor.py in <module>
4
5 import torch
----> 6 from transformers import CLIPImageProcessor
7
8 from .configuration_flamingo import FlamingoConfig
ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)
In principle, the FlamingoModel expects the visual_features input to have shape [b N T v d], where b=batch size, N=number of media (images/videos), T=number of frames (T=1 for images, T>1 for videos), v=number of visual features (=number of patches) and d=visual feature dimensionality.
So while in principle it should be able to digest videos, the perceiver resampler is based on lucidrains' implementation https://github.com/lucidrains/flamingo-pytorch/blob/10913abbc8b2ceabb2320560d7d9b85fcb85eee3/flamingo_pytorch/flamingo_pytorch.py#L74 and I haven't checked if it works.
Also, the FlamingoProcessor currently has no functionality to preprocess videos, and I won't add it anytime soon.
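A sketch of shaping image inputs accordingly; the processor call and the einops repeat mirror the notebook cell quoted earlier on this page, the feature sizes at the end are illustrative only, and processor, image and device are assumed to be defined as in the notebook:

import torch
from einops import repeat

# single image per sample: N = 1 medium, T = 1 frame
pixels = processor([image], device=device)['pixel_values']         # [N, c, h, w]
pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)         # [b, N, T, c, h, w]

# precomputed visual features would instead be passed as [b, N, T, v, d],
# e.g. v = number of patches and d = the vision encoder's feature dimension
visual_features = torch.randn(1, 1, 1, 50, 768)                     # illustrative shapes only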
Right now, the token for the "<" character is used to identify the position of an <image> tag in a sentence. This means any other "<" character will also be wrongly identified as a media location.
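A sketch of what such a media-location mask amounts to; the exact token handling in the repository may differ, and the token-id lookup here is an assumption:

# mark every position whose token is "<" as a media location;
# this is why a stray "<" in normal text is also (wrongly) picked up
lt_token_id = processor.tokenizer.convert_tokens_to_ids('<')
media_locations = input_ids == lt_token_id    # bool tensor, same shape as input_ids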
When trying the image captioning example, the following error message is generated for the last cell:
Cell:
caption = model.generate_captions(processor, images=[image])
print('generated caption:')
print(caption)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-5-749dca7b9c67>](https://localhost:8080/#) in <module>
----> 1 caption = model.generate_captions(processor, images=[image])
2 print('generated caption:')
3 print(caption)
5 frames
[/usr/local/lib/python3.8/dist-packages/flamingo_mini/modeling_flamingo.py](https://localhost:8080/#) in prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
511 input_ids = input_ids[:, -1:]
512
--> 513 return dict(
514 input_ids=input_ids,
515 past_key_values=past,
TypeError: type object got multiple values for keyword argument 'past_key_values'
As discussed in lucidrains/flamingo-pytorch#4 (comment), pretraining on CLIP can be problematic.
Can you suggest a checkpoint other than openai/clip-vit-base-patch32 for the vision encoder, something more task-agnostic?
I am struggling with X-ray images.
Thanks for this excellent work.
In the README, you mentioned that the training script is available at https://github.com/dhansmair/flamingo-mini/tree/hf_trainer. However, this URL returns "page not found". Could you please let me know the correct URL for the training script?
It makes sense to always use use_cache=True anyway, so this is a minor problem.
The reason for the error is that inside of .generate(), the created tokens are appended to input_ids, but media_locations is not extended accordingly. media_locations should have the same shape as input_ids.
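A sketch of the kind of fix this implies (hypothetical, not the repository's actual code): whenever generate() appends a new token to input_ids, media_locations would need to be padded with False so both tensors keep the same shape.

import torch

# inside the generation loop, after a new token has been appended to input_ids:
# input_ids:       [batch, seq_len + 1]
# media_locations: [batch, seq_len]  -> pad with False for the new position
pad = torch.zeros(media_locations.size(0), 1, dtype=torch.bool,
                  device=media_locations.device)
media_locations = torch.cat([media_locations, pad], dim=1)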