Awesome-Multimodal-Large-Language-Models-With-Grounding

A curated list of Multimodal Large Language Models (also known as Large Vision-Language Models) with grounding ability.

Table of Contents

🔥 Large Vision-Language Model

Grounding

| Format | Description | Papers |
| --- | --- | --- |
| Decoder on latent | leverage a decoder on latent embeddings to ground | PerceptionGPT, NExT-Chat, PSALM, PixelLM, u-LLaVA, GSVA, ChatterBox |
| Output numerical coordinates | directly output numerical coordinate tokens | Shikra, VisionLLM, Ferret, Ferret-v2, CogVLM |
| Output token coordinates | output new tokens added to the vocabulary to refer to positions | Kosmos-2 |
| Pixel space | output in a discrete pixel space encoded by VQGAN | Unified-IO, Unified-IO 2 |
| Proposal retrieval | retrieve from region candidates | LLM-Seg, Kosmos-2, GROUNDHOG |
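
To make the numerical-coordinate format concrete, here is a minimal sketch (illustrative only, not taken from any of the listed repos) of serializing a bbox as normalized text coordinates that a plain text tokenizer can consume, and parsing it back; precision and bracket style vary across papers.

```python
def box_to_text(box, img_w, img_h, precision=3):
    """Serialize an absolute-pixel bbox as normalized text coordinates."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return "[" + ",".join(f"{v:.{precision}f}" for v in norm) + "]"

def text_to_box(text, img_w, img_h):
    """Parse the serialized coordinates back into absolute pixels."""
    v = [float(t) for t in text.strip("[]").split(",")]
    return [v[0] * img_w, v[1] * img_h, v[2] * img_w, v[3] * img_h]

print(box_to_text([64, 48, 320, 240], 640, 480))  # -> [0.100,0.100,0.500,0.500]
```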

Referring

| Format | Description | Papers |
| --- | --- | --- |
| Pooling | leverage Mask Pooling / RoI Pooling / RoI Align to obtain features from the image encoder output | Groma, GPT4RoI, Osprey, PSALM, GROUNDHOG, Ferret, Ferret-v2, PVIT, ChatterBox |
| Numerical coordinates | leverage numerical coordinates for referring (bbox / points sampled from the mask) | Shikra, PerceptionGPT (w/ encoder), NExT-Chat (w/ encoder), CogVLM |
| Token coordinates | add new tokens to the vocabulary to represent spatial positions | Kosmos-2 |

• w/ encoder: an encoder is used to encode the input coordinates.
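
As a rough sketch of the pooling-based referring format above (shapes, input resolution and scale are illustrative assumptions), torchvision's RoI Align can turn one referred box into a single region feature that is then projected into the LLM embedding space:

```python
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 1024, 24, 24)  # (B, C, H, W) image-encoder feature map
boxes = torch.tensor([[0, 40.0, 60.0, 200.0, 180.0]])  # (batch_idx, x1, y1, x2, y2) in pixels

# spatial_scale maps pixel coordinates onto the 24x24 feature grid (assume a 336px input).
region = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=24 / 336)
region_token = region.flatten(2).mean(-1)  # (1, 1024): one feature vector per referred region
```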

Training Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
| --- | --- | --- | --- | --- |
| GRIT | Ferret | VG, Object365, RefCOCOs, Flickr30k-Entities, LLaVA-158K | - | • templates are used to convert existing data<br>• SAM is used to generate masks for free-form referring<br>• GPT-4 is used to generate dialogues with bboxes<br>• GLIPv2 is used to ground groundable nouns in LLaVA-158K<br>• negative mining: generate negative yes/no questions |
| Shikra-RD | Shikra | Flickr30K Entities | 5,922 QA pairs | GPT-4 ==> Referential Dialogue (CoT dialogues with grounding & referring) |
| CB-300K | ChatterBox | VG | 717,075 QA pairs | 4 subsets:<br>• CB-MRG: use ChatGPT to write dialogues with bboxes<br>• CB-LC: extend strict relations (from the scene graph) to multi-turn QA with ChatGPT<br>• CB-REF: REG task<br>• CB-GND: grounding task |
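
As a rough illustration of the template-based conversion used to build these datasets, the sketch below (templates and field names are hypothetical, not the actual GRIT pipeline) turns one detection annotation into a grounding QA pair with normalized box coordinates:

```python
import random

GROUNDING_TEMPLATES = [  # hypothetical templates
    "Where is the {category} in the image? Answer with a bounding box.",
    "Locate the {category}.",
]

def detection_to_qa(sample):
    """Convert one (category, bbox, image size) annotation into a grounding QA pair."""
    question = random.choice(GROUNDING_TEMPLATES).format(category=sample["category"])
    x1, y1, x2, y2 = sample["bbox"]
    w, h = sample["width"], sample["height"]
    answer = f"[{x1 / w:.3f},{y1 / h:.3f},{x2 / w:.3f},{y2 / h:.3f}]"
    return {"question": question, "answer": answer}

print(detection_to_qa({"category": "dog", "bbox": [10, 20, 110, 220], "width": 640, "height": 480}))
```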
Training Recipe

| Model | Recipe |
| --- | --- |
| Ferret | • use LLaVA pretrained weights<br>• SFT on GRIT |
| Ferret-v2 | • image-caption alignment on 1.4M image-text pairs<br>• high-resolution dense alignment with template-based referring & grounding<br>• instruction tuning with GRIT, VQA and OCR data (VQA and OCR are augmented with GLIPv2 bboxes) |
| ChatterBox (trainable: LoRA and the location decoder) | • warm-up training with a visual-grounding-only dataset<br>• instruction tuning with CB-300K |
| GPT4RoI | • use LLaVA pretrained weights<br>• pretrain the region feature extractor with text-region datasets (COCO, RefCOCO, RefCOCO+)<br>• train the connector, region feature extractor and LLM to follow instructions |
Evaluation Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
| --- | --- | --- | --- | --- |
| Ferret-Bench | Ferret | COCO validation set | 120 | • Referring Description: models are asked to describe a referred region based on its interaction with surrounding objects<br>• Referring Reasoning: models need to reason correctly on top of one or more referred regions<br>• Grounding in Conversation: models are required to reason correctly and accurately ground/localize the objects/regions necessary for the reasoning |
Paper List

    GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

    Paper | Github

1. proposes referring for MLLMs by replacing the placeholder <region_i> with the region feature obtained by mask pooling (see the sketch below)
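
A minimal sketch of the placeholder-substitution idea (function name and shapes are hypothetical, not GPT4RoI's actual code): the embedding at each placeholder position in the LLM input is overwritten with the pooled region feature.

```python
import torch

def splice_region_features(input_ids, inputs_embeds, region_feats, region_token_id):
    """Replace the embedding at every region-placeholder position with a pooled region feature.

    input_ids:     (B, T) token ids containing region placeholders
    inputs_embeds: (B, T, D) token embeddings from the LLM embedding table
    region_feats:  (B, N, D) pooled region features, in placeholder order
    """
    out = inputs_embeds.clone()
    for b in range(input_ids.size(0)):
        pos = (input_ids[b] == region_token_id).nonzero(as_tuple=True)[0]
        out[b, pos] = region_feats[b, : pos.numel()]
    return out
```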
    Osprey: Pixel Understanding with Visual Instruction Tuning

    Paper | Github

1. similar to GPT4RoI, Osprey also uses a mask representation to refer to entities in images.
2. it uses mask pooling to extract semantic features from the image encoder and combines them with a location extractor that processes the mask and outputs a spatial token.
    LISA: Reasoning Segmentation via Large Language Model

    Paper | Github

1. adapts an LLM with a mask decoder, trained on segmentation datasets converted to LLM format ==> reasoning segmentation ability naturally emerges (see the sketch below)
2. introduces the ReasonSeg benchmark for reasoning segmentation (queries with complex reasoning requirements)
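
The embedding-as-mask-prompt idea can be sketched as follows (module names and sizes are illustrative, assuming exactly one [SEG] token per response): the LLM hidden state at the [SEG] position is projected and handed to a SAM-style mask decoder as its prompt embedding.

```python
import torch
import torch.nn as nn

hidden_dim, prompt_dim = 4096, 256
seg_projector = nn.Linear(hidden_dim, prompt_dim)  # maps the [SEG] hidden state to a decoder prompt

def seg_prompt_from_llm(last_hidden_states, output_ids, seg_token_id):
    """Pick the hidden state at each sequence's [SEG] token and project it for the mask decoder."""
    batch = torch.arange(output_ids.size(0))
    seg_pos = (output_ids == seg_token_id).float().argmax(dim=1)  # first [SEG] per sequence
    seg_hidden = last_hidden_states[batch, seg_pos]               # (B, hidden_dim)
    return seg_projector(seg_hidden)                              # (B, prompt_dim)
```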
    VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

    Paper | Github

1. a unified interface for vision and vision-language tasks: points for detection, sampled points for instance segmentation ==> an instruction format for training
2. extra tokens & output-format-as-query decoding (faster)
    Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

    Paper | Github | Project

1. creates a unified I/O format (discrete tokens) for all sorts of vision and vision-language tasks
2. uses a T5-like encoder-decoder architecture
    Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

    Paper | Github | Project

1. following Unified-IO v1, creates a unified I/O format (discrete tokens) for all sorts of modalities including images, masks, bboxes and audio
  1. dense masks are all binary, unlike v1 which specifies the color in the text instruction (which the model struggles to follow)
2. proposes 2D Rotary Embedding, QK Normalization and Scaled Cosine Attention to stabilize training and scaling (see the QK-normalization sketch below)
3. Mixture of Denoisers training objectives
4. instruction tuning on 220 tasks drawn from over 120 external datasets
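
QK normalization is a general stabilization trick: normalize queries and keys per head before the dot product so attention logits cannot blow up at scale. A minimal PyTorch sketch (not the paper's actual implementation) is below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with LayerNorm applied to queries and keys before the dot product."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)  # bounded logits -> more stable large-scale training
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, T, -1))
```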
    PixelLM: Pixel Reasoning with Large Multimodal Model

    Paper | Github | Project

1. learnable seg tokens + a light-weight decoder
2. a bunch of tricks:
  1. N x L seg tokens for L levels of multi-scale vision features, with N tokens within each group for better modeling
  2. reweighted loss on regions with overlapping predictions
    PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

    Paper | Github

1. new paradigm: first generate mask proposals, then generate masks and classifications (following Mask2Former)
2. instruction prompt + conditional prompt + candidate mask tokens
  1. three types of conditional prompt: classes, sentences (referring segmentation) and visual cues (points, scribbles, boxes, etc.)
  2. conditional prompt => condition embeddings, candidate mask tokens => mask embeddings
  3. condition embeddings + mask embeddings + image features => Mask2Former decoder => bipartite matching loss + query-based decoding
    LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

    Paper | Github

1. uses SAM to generate mask candidates, then formulates the problem as mask selection (mask classification); see the sketch below
2. introduces the LLM-Seg40K dataset, built by using LLaVA to generate captions and then GPT-4 to generate question-answer pairs
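
Mask selection then reduces to scoring each SAM proposal against an LLM-derived query embedding; the sketch below uses cosine similarity purely for illustration (the names and the exact scoring head are assumptions, not the paper's).

```python
import torch
import torch.nn.functional as F

def select_mask(llm_query_embed, candidate_mask_embeds):
    """Score mask candidates against the LLM query embedding and pick the best one.

    llm_query_embed:       (D,)   embedding derived from the LLM output
    candidate_mask_embeds: (N, D) one pooled feature per SAM-proposed mask
    """
    scores = F.cosine_similarity(candidate_mask_embeds, llm_query_embed.unsqueeze(0), dim=-1)
    return scores.argmax().item(), scores  # index of the selected mask, per-candidate scores
```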
    GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

    Paper | Project

1. disentangles grounding from referring
2. grounding as mask selection: trains a Mask2Former+ to generate mask candidates
3. referring via mask pooling on features
4. introduces the 2.5M M3G2 dataset
    DetGPT: Detect What You Need via Reasoning

    Paper | Github | Project

1. follows LLaVA to tune the VLM for VQA
2. uses Grounding DINO to ground the response generated by the VLM and detect the relevant entities
    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Paper | Github

1. proposes a hybrid region representation for referring: region name + coordinates + mask-pooled features from a spatial-aware visual sampler
2. grounding through bbox coordinates
    Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

    Paper

1. proposes a bunch of improvements over Ferret v1, including:
2. any-resolution (patch-based) processing for higher input resolution
3. a DINOv2 encoder for local feature extraction
4. a High-resolution Dense Alignment stage between the alignment pre-training and instruction tuning
    u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

    Paper | Github

1. proposes to use different decoders for grounding (SAM for segmentation, Grounding DINO for detection)
    GSVA: Generalized Segmentation via Multimodal Large Language Models

    Paper | Github

1. tackles Generalized Referring Expression Segmentation (GRES) with a grounding LLM
  1. multiple objects to ground
  2. need to reject null targets
2. proposes using multiple [SEG] tokens to ground multiple objects (each indicated by the text before its [SEG] token), and a [REJ] token to reject null targets
    NExT-Chat: An LMM for Chat, Detection and Segmentation

    Paper | Github | Project

1. proposes a box encoder-decoder for referring and grounding
2. for grounding, uses a trigger token to indicate the presence of a grounding output and feeds its latent embedding to the box decoder (or a mask decoder, e.g. SAM) for box (mask) generation
3. for referring, uses boxes to represent the referred region and a box encoder to encode the referred boxes into features, which are input to the LLM
4. proposes a cycle-consistency loss to regularize the box encoder-decoder (see the sketch below)
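
A minimal sketch of one direction of that cycle (sizes and the loss choice are illustrative; the paper also constrains the embedding-to-embedding direction): encode a ground-truth box into a latent, decode it back, and penalize the reconstruction error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 4096  # illustrative LLM hidden size
box_encoder = nn.Sequential(nn.Linear(4, hidden), nn.GELU(), nn.Linear(hidden, hidden))
box_decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, 4))

def cycle_consistency_loss(gt_boxes):
    """decode(encode(box)) should reproduce the box."""
    latent = box_encoder(gt_boxes)   # box -> latent referring embedding
    recon = box_decoder(latent)      # latent -> box again
    return F.l1_loss(recon, gt_boxes)

loss = cycle_consistency_loss(torch.rand(8, 4))  # 8 normalized [x1, y1, x2, y2] boxes
```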
    PerceptionGPT: Effectively Fusing Visual Perception into LLM

    Paper

1. similar to NExT-Chat, proposes a box encoder-decoder to encode and decode boxes, but seems to focus only on grounding without referring
2. one possibly intriguing point: a grounding-output indicator <vis> is used to indicate the presence of a grounding output (as usual), but its embedding is replaced by the encoder's output feature in the LLM input.
    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Paper | Github

1. builds a web-scale grounding dataset from web-scale data (COYO-700M, LAION-2B, etc.) and a vision detector (GLIP)
2. following Pix2Seq, divides the image into a P x P grid and introduces P x P new location tokens to represent positions (see the sketch below)
3. uses <box></box> to represent a bbox, with <delim> to separate multiple boxes (if there are several)
4. uses markdown-like grammar to mark grounded text with <p> </p>
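
A minimal sketch of the grid-based location tokens (the token naming below is illustrative, not Kosmos-2's exact vocabulary strings): the image is split into a P x P grid, and a box is written as the location tokens of its top-left and bottom-right cells.

```python
def box_to_location_tokens(box, img_w, img_h, P=32):
    """Map a bbox to two grid-cell location tokens (top-left and bottom-right corners)."""
    x1, y1, x2, y2 = box

    def cell_index(x, y):
        col = min(int(x / img_w * P), P - 1)
        row = min(int(y / img_h * P), P - 1)
        return row * P + col

    return f"<box><loc_{cell_index(x1, y1)}><loc_{cell_index(x2, y2)}></box>"

print(box_to_location_tokens([64, 48, 320, 240], 640, 480))  # -> <box><loc_99><loc_528></box>
```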
    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Paper | Github

1. proposes using normalized box coordinates for unified grounding and referring
2. all normalized boxes are represented as plain text (tokenized directly by the text tokenizer) and fed to the LLM
    Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

    Paper | Github | Project

1. proposes to ground and refer with a set of proposed regions
2. converts a Deformable DETR detection head into a binary classifier to propose RoIs and uses RoIAlign pooling to obtain the region features

🔥 Multi-modality

GroundingGPT: Language Enhanced Multi-modal Grounding Model

    Paper | Github

1. grounding and referring across modalities, expressed in text
  1. bounding boxes as four relative coordinate values: [x1, y1, x2, y2]
  2. video timestamps as two two-digit decimals: {t1, t2}
2. curates datasets for three-stage training
