
flamingo-mini's People

Contributors

dhansmair


flamingo-mini's Issues

Prompting example

Is there an example that shows how to perform few-shot multimodal prompting?

Error on image_captioning.ipynb

[screenshot of the error attached: Screenshot 2023-01-23 103244]

Error when running caption = model.generate_captions(processor, images=[image]) in image_captioning.ipynb

What is the best way to run inference from code?

I also tried the pipeline API, but am missing use_auth_token:
pipe = pipeline(task='image-to-text', model='dhansmair/flamingo')
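
For reference, a minimal sketch of running inference directly, without the pipeline API. The checkpoint name 'dhansmair/flamingo-mini' and the FlamingoProcessor constructor signature are assumptions here; adjust to your setup. The generate_captions call itself matches the image_captioning.ipynb example.

from PIL import Image
from flamingo_mini import FlamingoModel, FlamingoProcessor

# load a pretrained checkpoint (name assumed; use whichever checkpoint you have)
model = FlamingoModel.from_pretrained('dhansmair/flamingo-mini')
model.eval()
processor = FlamingoProcessor(model.config)  # constructor signature assumed

image = Image.open('cat.jpg')
caption = model.generate_captions(processor, images=[image])
print(caption)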

Thanks in advance! Great work.

Doubts about training

Hi,

I am new to multimodal models and I have several questions regarding the training process:

1. How do you feed the model the training data? Do you insert the image features as tokens, or do you feed the model the new tokens created in the perceiver resampler? Is the input the image and the label the text, or do you insert some extra text in the input?
2. What kind of hardware did you use to train the model, and how long did training take?
3. What are the main differences between this project and the original Flamingo published by DeepMind?

Thank you in advance.

Text Prompting the model using the cached image

Hello dhansmair!

First and foremost: Thank you for your effort in this repository and for making it public. Much appreciated.

Currently, I am simply playing around with flamingo-mini and following the examples you provided. While I understand the differences from the original paper, I would like to ask the following question:

Is it possible to prompt the model with text only input and get a response from the model for the already cached image?

E.g.

1) An image of a cat is provided to the model.
2) The model successfully generates a caption.
3) The user enters a text prompt (I do not know how it should be formatted), such as "What is the color of this cat?".
4) The model generates a text-only response using the already cached cat image.

Do you think such a workflow is possible in the current state of the repository?
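
For concreteness, a hedged sketch of what steps 3 and 4 might look like using the multimodal_prompt helper from the few-shot notebook. Note that the model does not persist the image between calls, so "cached" here just means passing the same image tensor again; the question-answer prompt format is an assumption, since the model was trained for captioning.

# reuse the same image together with a follow-up question
# (prompt format is a guess, not a documented API)
prompt = "<image>Question: What is the color of this cat? Answer:"
response = multimodal_prompt(model, processor, prompt,
                             images=[cat_image], device=device)
print(response)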

Thank you for your help!

Few-shot example

Hi, the notebook throws the following error in the final cell, when the "multimodal_prompt" function is called:

TypeError: type object got multiple values for keyword argument 'past_key_values'

Any suggestions?

Error:


TypeError                                 Traceback (most recent call last)
Input In [6], in <module>
      1 prompt = "<image>Output: A cat wearing sunglasses.<EOC><image>Output: Elephants walking in the Savanna.<EOC><image>Output: "
----> 2 response = multimodal_prompt(model, processor, prompt, images=[cat_image, elephants_image, flamingo_image], device=device)
      3 print('prompt:', prompt)
      4 print('output:', response)

Input In [2], in multimodal_prompt(model, processor, prompt, images, device)
      9 pixels = processor(images, device=device)['pixel_values']
     10 pixels = repeat(pixels, 'N c h w -> b N T c h w', b=1, T=1)
---> 12 output = model.generate(
     13     inputs=input_ids,
     14     media_locations=media_locations,
     15     attention_mask=attention_mask,
     16     pixel_values=pixels,
     17     max_length=150,
     18     use_cache=True,
     19     early_stopping=True,
     20     bos_token_id=model.flamingo.lm.config.bos_token_id,
     21     eos_token_id=model.flamingo.lm.config.eos_token_id,
     22     pad_token_id=model.flamingo.lm.config.eos_token_id
     23 )
     25 response = processor.tokenizer.batch_decode(output, skip_special_tokens=True)
     26 return response[0]

File ~/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1391, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
   1385         raise ValueError(
   1386             f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
   1387             " greedy search."
   1388         )
   1390     # 11. run greedy search
-> 1391     return self.greedy_search(
   1392         input_ids,
   1393         logits_processor=logits_processor,
   1394         stopping_criteria=stopping_criteria,
   1395         pad_token_id=generation_config.pad_token_id,
   1396         eos_token_id=generation_config.eos_token_id,
   1397         output_scores=generation_config.output_scores,
   1398         return_dict_in_generate=generation_config.return_dict_in_generate,
   1399         synced_gpus=synced_gpus,
   1400         **model_kwargs,
   1401     )
   1403 elif is_contrastive_search_gen_mode:
   1404     if generation_config.num_return_sequences > 1:

File ~/.local/lib/python3.8/site-packages/transformers/generation/utils.py:2176, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
   2173         break
   2175 # prepare model inputs
-> 2176 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2178 # forward pass to get next token
   2179 outputs = self(
   2180     **model_inputs,
   2181     return_dict=True,
   2182     output_attentions=output_attentions,
   2183     output_hidden_states=output_hidden_states,
   2184 )

File ~/.local/lib/python3.8/site-packages/flamingo_mini/modeling_flamingo.py:513, in FlamingoModel.prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
    510 if past is not None:
    511     input_ids = input_ids[:, -1:]
--> 513 return dict(
    514     input_ids=input_ids,
    515     past_key_values=past,
    516     media_locations=media_locations,
    517     attention_mask=attention_mask,
    518     pixel_values=pixel_values,
    519     visual_features=visual_features,
    520     **kwargs
    521 )

TypeError: type object got multiple values for keyword argument 'past_key_values'
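
A plausible fix, sketched here as an assumption about the cause (see also the "text generation not working with transformers > v4.25.1" issue below): newer transformers releases pass past_key_values through model_kwargs, so the overridden prepare_inputs_for_generation receives it twice unless its legacy past parameter is renamed.

# sketch: accept past_key_values instead of the legacy 'past' argument,
# so generate()'s model_kwargs no longer collides with the returned dict
def prepare_inputs_for_generation(self, input_ids, media_locations=None,
                                  attention_mask=None, pixel_values=None,
                                  visual_features=None, past_key_values=None,
                                  **kwargs):
    if past_key_values is not None:
        input_ids = input_ids[:, -1:]  # only the newest token; the rest is cached

    return dict(
        input_ids=input_ids,
        past_key_values=past_key_values,
        media_locations=media_locations,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
        visual_features=visual_features,
        **kwargs,
    )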

Doubt about MaskedCrossAttention

Hi, I'm unsure about this piece of code in MaskedCrossAttention inside gated_cross_attention.py

media_time = torch.arange(n_media, device=y.device) + 1
# >> David:
# side note: here, text tokens attend to ALL previous visual tokens. If We only want to attend to the
# one image coming before in the text (like in the flamingo paper),
# we need to change >= to == at the line where 'text_to_media_mask' is created.
text_to_media_mask = rearrange(text_time, 'b i -> b 1 i 1') == repeat(media_time, 'j -> 1 1 1 (j m)', m=self.n_visual)
sim = sim.masked_fill(~text_to_media_mask, -torch.finfo(sim.dtype).max)

sim = sim - sim.amax(dim=-1, keepdim=True).detach()

It seems you are setting the positions you want to mask out to -torch.finfo(sim.dtype).max (a large negative number), but then finding the largest value with sim.amax to normalize by?

I would think it should be:

sim = sim.masked_fill(~text_to_media_mask, torch.finfo(sim.dtype).max)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()

Any clarification on this logic is appreciated. Thanks!
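
For what it's worth, a small standalone demonstration (not repo code) of why the large negative fill combined with the amax subtraction is the standard numerically stable softmax masking: amax picks the largest unmasked score, and the masked positions end up with ~0 attention weight. Filling with +torch.finfo(sim.dtype).max instead would make the masked positions dominate the softmax.

import torch

sim = torch.tensor([[2.0, 5.0, 3.0]])
mask = torch.tensor([[True, True, False]])  # False = masked out

sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
# amax returns 5.0, the largest *unmasked* score
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
print(sim.softmax(dim=-1))  # tensor([[0.0474, 0.9526, 0.0000]])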

Bugs of evaluation / caption generation

  1. self.visuals should be self.encode_resample_visuals.
  2. The shape of media_locations is not consistent with attention_mask, causing an error at
    sim = sim.masked_fill(~text_to_media_mask, -torch.finfo(sim.dtype).max)
    The size of tensor a (4) must match the size of tensor b (5) at non-singleton dimension 2

ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)

Anyone else getting this error?

ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)

command-1272036722344618> in <cell line: 1>()
----> 1 from flamingo_mini import FlamingoConfig, FlamingoModel, FlamingoProcessor
      2 
      3 # create a model for training
      4 device = ...
      5 config = FlamingoConfig(...)

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    169             # Import the desired module. If you’re seeing this while debugging a failed import,
    170             # look at preceding stack frames for relevant error information.
--> 171             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    172 
    173             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/__init__.py in <module>
----> 1 from .flamingo_processor import FlamingoProcessor
      2 from .configuration_flamingo import FlamingoConfig
      3 from .modeling_flamingo import FlamingoModel

/databricks/python_shell/dbruntime/PythonPackageImportsInstrumentation/__init__.py in import_patch(name, globals, locals, fromlist, level)
    169             # Import the desired module. If you’re seeing this while debugging a failed import,
    170             # look at preceding stack frames for relevant error information.
--> 171             original_result = python_builtin_import(name, globals, locals, fromlist, level)
    172 
    173             is_root_import = thread_local._nest_level == 1

/local_disk0/.ephemeral_nfs/envs/pythonEnv-fbd7ab33-bb5b-4088-ae5e-31771ac6ca13/lib/python3.9/site-packages/flamingo_mini/flamingo_processor.py in <module>
      4 
      5 import torch
----> 6 from transformers import CLIPImageProcessor
      7 
      8 from .configuration_flamingo import FlamingoConfig

ImportError: cannot import name 'CLIPImageProcessor' from 'transformers' (/databricks/python/lib/python3.9/site-packages/transformers/__init__.py)
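
If upgrading is not an option, a possible workaround (a sketch, assuming an older transformers release where the equivalent class was still called CLIPFeatureExtractor; the clean fix is pip install -U transformers):

# fall back to the older class name on transformers versions that
# predate CLIPImageProcessor
try:
    from transformers import CLIPImageProcessor
except ImportError:
    from transformers import CLIPFeatureExtractor as CLIPImageProcessor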

Video Support

In principle, the FlamingoModel expects the visual_features input to have shape [b N T v d], where b = batch size, N = number of media (images/videos), T = number of frames (T=1 for images, T>1 for videos), v = number of visual features (= number of patches), and d = visual feature dimensionality.
So while it should be able to digest videos in principle, the perceiver resampler is based on Lucidrains' implementation https://github.com/lucidrains/flamingo-pytorch/blob/10913abbc8b2ceabb2320560d7d9b85fcb85eee3/flamingo_pytorch/flamingo_pytorch.py#L74 and I haven't checked whether it works with T > 1.
Also, FlamingoProcessor currently has no functionality implemented for preprocessing videos, and I won't add it anytime soon.
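
To make the expected shape concrete, a toy illustration (the sizes v=256 and d=1024 are made-up placeholders, not the model's actual dimensions):

import torch

# [b N T v d]: batch of 1, one video of 8 frames, 256 visual tokens per
# frame, feature dim 1024; T=1 would mark an image instead
b, N, T, v, d = 1, 1, 8, 256, 1024
visual_features = torch.randn(b, N, T, v, d)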

text generation not working with transformers > v4.25.1

When running the image captioning example, the following error is generated by the last cell:

Cell:

caption = model.generate_captions(processor, images=[image])
print('generated caption:')
print(caption)

Error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-5-749dca7b9c67>](https://localhost:8080/#) in <module>
----> 1 caption = model.generate_captions(processor, images=[image])
      2 print('generated caption:')
      3 print(caption)

5 frames

[/usr/local/lib/python3.8/dist-packages/flamingo_mini/modeling_flamingo.py](https://localhost:8080/#) in prepare_inputs_for_generation(self, input_ids, media_locations, attention_mask, pixel_values, visual_features, past, **kwargs)
    511             input_ids = input_ids[:, -1:]
    512 
--> 513         return dict(
    514             input_ids=input_ids,
    515             past_key_values=past,

TypeError: type object got multiple values for keyword argument 'past_key_values'

generate() only works with use_cache=True

It makes sense to always use use_cache=True anyway, so this is a minor problem.
The reason for the error is that inside .generate(), the generated tokens are appended to input_ids, but media_locations is not extended accordingly; media_locations should have the same shape as input_ids.
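
A hedged sketch of the fix just described (not the repo's actual patch): pad media_locations with False whenever generate() has appended new tokens to input_ids, so the two tensors keep the same shape.

import torch

def extend_media_locations(media_locations: torch.Tensor,
                           input_ids: torch.Tensor) -> torch.Tensor:
    # generated tokens are never media, so pad with False on the right
    n_new = input_ids.shape[1] - media_locations.shape[1]
    if n_new > 0:
        pad = torch.zeros(media_locations.shape[0], n_new,
                          dtype=torch.bool, device=media_locations.device)
        media_locations = torch.cat([media_locations, pad], dim=1)
    return media_locations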
