
Comments (3)

yuqi657 commented on May 19, 2024

Could you give a detailed description (in text) of the implementation of your visual abstractor?

from mplug-owl.

LukeForeverYoung commented on May 19, 2024

We keep query_embed in mPLUG_OwlModel and pass it to the Visual Abstractor during the forward pass. The implementation of the Visual Abstractor is similar to the Perceiver in Flamingo, except that we use the same FFNs as LLaMA.
Following mPLUG and mPLUG-2, we apply the abstractor to reduce the length of the visual token sequence and to help the model learn visual knowledge in the language space.
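
To make the description above concrete, here is a minimal NumPy sketch of the general idea: a small set of query vectors cross-attends to the image patch features, followed by a gated (SwiGLU-style) FFN of the kind LLaMA uses. This is not the code from modeling_mplug_owl.py. The class name ToyAbstractorLayer, the single attention head, the missing layer norms, the random weights, and all sizes (including 64 queries) are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def silu(x):
    # SiLU activation, used inside the gated (SwiGLU-style) FFN.
    return x / (1.0 + np.exp(-x))


class ToyAbstractorLayer:
    """Hypothetical layer: single-head cross-attention plus a gated FFN.

    Layer norms and multi-head splitting are omitted to keep the sketch short.
    """

    def __init__(self, dim, hidden, rng):
        s = dim ** -0.5
        self.wq = rng.normal(0, s, (dim, dim))
        self.wk = rng.normal(0, s, (dim, dim))
        self.wv = rng.normal(0, s, (dim, dim))
        self.wo = rng.normal(0, s, (dim, dim))
        self.w_gate = rng.normal(0, s, (dim, hidden))
        self.w_up = rng.normal(0, s, (dim, hidden))
        self.w_down = rng.normal(0, hidden ** -0.5, (hidden, dim))

    def __call__(self, queries, patches):
        # queries: (num_queries, dim); patches: (num_patches, dim)
        q = queries @ self.wq
        k = patches @ self.wk
        v = patches @ self.wv
        # Each query distributes its attention over all image patches.
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        x = queries + (attn @ v) @ self.wo
        # Gated FFN: down(silu(gate(x)) * up(x)), with a residual connection.
        ffn = (silu(x @ self.w_gate) * (x @ self.w_up)) @ self.w_down
        return x + ffn, attn


rng = np.random.default_rng(0)
dim, num_patches, num_queries = 32, 256, 64  # illustrative sizes only

# In a real model this would be a learnable parameter owned by the parent
# model and passed in at forward time; here it is just random.
query_embed = rng.normal(0, 0.02, (num_queries, dim))
patches = rng.normal(size=(num_patches, dim))

layer = ToyAbstractorLayer(dim, hidden=64, rng=rng)
out, attn = layer(query_embed, patches)
print(out.shape)   # (64, 32): one output token per query
print(attn.shape)  # (64, 256): attention of each query over the patches
```

The output length depends only on the number of queries, not on the number of patches, which is what shortens the sequence handed to the language model.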


MAGAer13 commented on May 19, 2024

First of all, thanks for your great work. From the paper, I see that the visual abstractor uses learnable queries. I think it may be similar to the Perceiver in Flamingo or the Q-Former in BLIP-2. However, I cannot find the implementation of the learnable queries in your code (mPLUG_OwlVisualAbstractorEncoder and mPLUG_OwlVisualAbstractorModel in modeling_mplug_owl.py). I am curious about the details of the visual abstractor. In other words, is it similar to the Q-Former or to the Perceiver? The paper does not give these details, and I cannot find them in the code. Thanks again.

Hi, just an additional clarification. The aim of the visual abstractor is to reduce the number of image patches, which would otherwise result in a large number of tokens for the LLM (256 for ViT-L/14 at 224x224 resolution). The maximum context length of LLMs such as LLaMA and BLOOM is 2048 tokens, so 256 is a relatively large share of it. This was not an issue for Flamingo, since it injects visual features through cross-attention, so the purpose is different. In addition, we want to learn useful features such as region or object features from the image, as practiced by mPLUG-2, which leverages a similar idea and verified it by visualizing the attention maps of the learnable queries.
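
The token-budget argument above can be checked with a few lines of arithmetic. The snippet below is only a sketch: the helper num_patch_tokens is hypothetical, the count covers patch tokens only (any extra class token is ignored), and the value of 64 queries is an illustrative assumption, not a number stated in this thread.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


context_len = 2048                        # context length quoted in the thread
patch_tokens = num_patch_tokens(224, 14)  # ViT-L/14 at 224x224
num_queries = 64                          # hypothetical number of learnable queries

print(patch_tokens)                # 256
print(patch_tokens / context_len)  # 0.125, i.e. 12.5% of the context
print(num_queries / context_len)   # 0.03125, i.e. about 3.1% of the context
```

Under these assumptions, feeding the raw patch tokens would consume one eighth of the context for a single image, while a fixed set of query outputs would use a quarter of that.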

