Awesome-Multimodal-Large-Language-Models-With-Grounding

A curated list of Multimodal Large Language Models (also known as Large Vision-Language Models) with grounding ability.

Table of Contents

🔥 Large Vision-Language Model

Grounding

| Format | Description | Papers |
| --- | --- | --- |
| Decoder on latent | leverage a decoder on latent embeddings to ground | PerceptionGPT, NExT-Chat, PSALM, PixelLM, u-LLaVA, GSVA, ChatterBox |
| Output numerical coordinates | directly output numerical coordinate tokens | Shikra, VisionLLM, Ferret, Ferret-v2, CogVLM |
| Output token coordinates | output new tokens added to the vocabulary to refer to positions | Kosmos-2 |
| Pixel space | output in a discrete pixel space encoded by VQGAN | Unified-IO, Unified-IO 2 |
| Proposal retrieval | retrieve from region candidates | LLM-Seg, Kosmos-2, GROUNDHOG |
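
To make the numerical-coordinate format concrete, here is a minimal sketch (illustrative only, not taken from any of the listed repos) of serializing a bbox as normalized text coordinates that a plain text tokenizer can consume, and parsing it back; precision and bracket style vary across papers.

```python
def box_to_text(box, img_w, img_h, precision=3):
    """Serialize an absolute-pixel bbox as normalized text coordinates."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return "[" + ",".join(f"{v:.{precision}f}" for v in norm) + "]"

def text_to_box(text, img_w, img_h):
    """Parse the serialized coordinates back into absolute pixels."""
    v = [float(t) for t in text.strip("[]").split(",")]
    return [v[0] * img_w, v[1] * img_h, v[2] * img_w, v[3] * img_h]

print(box_to_text([64, 48, 320, 240], 640, 480))  # -> [0.100,0.100,0.500,0.500]
```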

Referring

| Format | Description | Papers |
| --- | --- | --- |
| Pooling | leverage Mask Pooling / RoI Pooling / RoI Align to obtain features from the image encoder output | Groma, GPT4RoI, Osprey, PSALM, GROUNDHOG, Ferret, Ferret-v2, PVIT, ChatterBox |
| Numerical coordinates | leverage numerical coordinates for referring (bbox / points sampled from the mask) | Shikra, PerceptionGPT (w/ encoder), NExT-Chat (w/ encoder), CogVLM |
| Token coordinates | add new tokens to the vocabulary to represent spatial positions | Kosmos-2 |

• w/ encoder: an encoder is used to encode the input coordinates.
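
As a rough sketch of the pooling-based referring format above (shapes, input resolution and scale are illustrative assumptions), torchvision's RoI Align can turn one referred box into a single region feature that is then projected into the LLM embedding space:

```python
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 1024, 24, 24)  # (B, C, H, W) image-encoder feature map
boxes = torch.tensor([[0, 40.0, 60.0, 200.0, 180.0]])  # (batch_idx, x1, y1, x2, y2) in pixels

# spatial_scale maps pixel coordinates onto the 24x24 feature grid (assume a 336px input).
region = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=24 / 336)
region_token = region.flatten(2).mean(-1)  # (1, 1024): one feature vector per referred region
```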

Training Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
| --- | --- | --- | --- | --- |
| GRIT | Ferret | VG, Object365, RefCOCOs, Flickr30k-Entities, LLaVA-158K | - | • templates are used to convert existing data<br>• SAM is used to generate masks for free-form referring<br>• GPT-4 is used to generate dialogues with bboxes<br>• GLIPv2 is used to ground groundable nouns in LLaVA-158K<br>• negative mining: generate negative yes/no questions |
| Shikra-RD | Shikra | Flickr30K Entities | 5,922 QA pairs | GPT-4 ==> Referential Dialogue (CoT dialogues with grounding & referring) |
| CB-300K | ChatterBox | VG | 717,075 QA pairs | 4 subsets:<br>• CB-MRG: use ChatGPT to write dialogues with bboxes<br>• CB-LC: extend strict relations (from the scene graph) to multi-turn QA with ChatGPT<br>• CB-REF: REG task<br>• CB-GND: grounding task |
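
As a rough illustration of the template-based conversion used to build these datasets, the sketch below (templates and field names are hypothetical, not the actual GRIT pipeline) turns one detection annotation into a grounding QA pair with normalized box coordinates:

```python
import random

GROUNDING_TEMPLATES = [  # hypothetical templates
    "Where is the {category} in the image? Answer with a bounding box.",
    "Locate the {category}.",
]

def detection_to_qa(sample):
    """Convert one (category, bbox, image size) annotation into a grounding QA pair."""
    question = random.choice(GROUNDING_TEMPLATES).format(category=sample["category"])
    x1, y1, x2, y2 = sample["bbox"]
    w, h = sample["width"], sample["height"]
    answer = f"[{x1 / w:.3f},{y1 / h:.3f},{x2 / w:.3f},{y2 / h:.3f}]"
    return {"question": question, "answer": answer}

print(detection_to_qa({"category": "dog", "bbox": [10, 20, 110, 220], "width": 640, "height": 480}))
```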
Training Recipe

| Model | Recipe |
| --- | --- |
| Ferret | • use LLaVA pretrained weights<br>• SFT on GRIT |
| Ferret-v2 | • image-caption alignment on 1.4M image-text pairs<br>• high-resolution dense alignment with template-based referring & grounding<br>• instruction tuning with GRIT, VQA and OCR data (VQA and OCR are augmented with GLIPv2 bboxes) |
| ChatterBox (trainable: LoRA and the location decoder) | • warm-up training with a visual-grounding-only dataset<br>• instruction tuning with CB-300K |
| GPT4RoI | • use LLaVA pretrained weights<br>• pretrain the region feature extractor with text-region datasets (COCO, RefCOCO, RefCOCO+)<br>• train the connector, region feature extractor and LLM to follow instructions |
Evaluation Dataset

| Dataset | Source | Data Source | Quantity | Construction Method |
| --- | --- | --- | --- | --- |
| Ferret-Bench | Ferret | COCO validation set | 120 | • Referring Description: models are asked to describe a referred region based on its interaction with surrounding objects<br>• Referring Reasoning: models need to reason correctly on top of one or more referred regions<br>• Grounding in Conversation: models are required to reason correctly and accurately ground/localize the objects/regions necessary for the reasoning |
Paper List

    GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

    Paper | Github

1. proposes referring for MLLMs by replacing the placeholder <region_i> with the region feature obtained by mask pooling (see the sketch below)
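
A minimal sketch of the placeholder-substitution idea (function name and shapes are hypothetical, not GPT4RoI's actual code): the embedding at each placeholder position in the LLM input is overwritten with the pooled region feature.

```python
import torch

def splice_region_features(input_ids, inputs_embeds, region_feats, region_token_id):
    """Replace the embedding at every region-placeholder position with a pooled region feature.

    input_ids:     (B, T) token ids containing region placeholders
    inputs_embeds: (B, T, D) token embeddings from the LLM embedding table
    region_feats:  (B, N, D) pooled region features, in placeholder order
    """
    out = inputs_embeds.clone()
    for b in range(input_ids.size(0)):
        pos = (input_ids[b] == region_token_id).nonzero(as_tuple=True)[0]
        out[b, pos] = region_feats[b, : pos.numel()]
    return out
```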
    Osprey: Pixel Understanding with Visual Instruction Tuning

    Paper | Github

1. similar to GPT4RoI, Osprey also uses a mask representation to refer to entities in images.
2. it uses mask pooling to extract semantic features from the image encoder and combines them with a location extractor that processes the mask and outputs a spatial token.
    LISA: Reasoning Segmentation via Large Language Model

    Paper | Github

1. adapts an LLM with a mask decoder, trained on segmentation datasets converted to LLM format ==> reasoning segmentation ability naturally emerges (see the sketch below)
2. introduces the ReasonSeg benchmark for reasoning segmentation (queries with complex reasoning requirements)
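
The embedding-as-mask-prompt idea can be sketched as follows (module names and sizes are illustrative, assuming exactly one [SEG] token per response): the LLM hidden state at the [SEG] position is projected and handed to a SAM-style mask decoder as its prompt embedding.

```python
import torch
import torch.nn as nn

hidden_dim, prompt_dim = 4096, 256
seg_projector = nn.Linear(hidden_dim, prompt_dim)  # maps the [SEG] hidden state to a decoder prompt

def seg_prompt_from_llm(last_hidden_states, output_ids, seg_token_id):
    """Pick the hidden state at each sequence's [SEG] token and project it for the mask decoder."""
    batch = torch.arange(output_ids.size(0))
    seg_pos = (output_ids == seg_token_id).float().argmax(dim=1)  # first [SEG] per sequence
    seg_hidden = last_hidden_states[batch, seg_pos]               # (B, hidden_dim)
    return seg_projector(seg_hidden)                              # (B, prompt_dim)
```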
    VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

    Paper | Github

1. a unified interface for vision and vision-language tasks: points for detection, sampled points for instance segmentation ==> an instruction format for training
2. extra tokens & output-format-as-query decoding (faster)
    Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

    Paper | Github | Project

1. creates a unified I/O format (discrete tokens) for all sorts of vision and vision-language tasks
2. uses a T5-like encoder-decoder architecture
    Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

    Paper | Github | Project

1. following Unified-IO v1, creates a unified I/O format (discrete tokens) for all sorts of modalities including images, masks, bboxes and audio
  1. dense masks are all binary, unlike v1 which specifies the color in the text instruction (which the model struggles to follow)
2. proposes 2D Rotary Embedding, QK Normalization and Scaled Cosine Attention to stabilize training and scaling (see the QK-normalization sketch below)
3. Mixture of Denoisers training objectives
4. instruction tuning on 220 tasks drawn from over 120 external datasets
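
QK normalization is a general stabilization trick: normalize queries and keys per head before the dot product so attention logits cannot blow up at scale. A minimal PyTorch sketch (not the paper's actual implementation) is below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with LayerNorm applied to queries and keys before the dot product."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)  # bounded logits -> more stable large-scale training
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, T, -1))
```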
    PixelLM: Pixel Reasoning with Large Multimodal Model

    Paper | Github | Project

1. learnable seg tokens + a light-weight decoder
2. a bunch of tricks:
  1. N x L seg tokens for L levels of multi-scale vision features, with N tokens within each group for better modeling
  2. reweighted loss on regions with overlapping predictions
    PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

    Paper | Github

1. new paradigm: first generate mask proposals, then generate masks and classifications (following Mask2Former)
2. instruction prompt + conditional prompt + candidate mask tokens
  1. three types of conditional prompt: classes, sentences (referring segmentation) and visual cues (points, scribbles, boxes, etc.)
  2. conditional prompt => condition embeddings, candidate mask tokens => mask embeddings
  3. condition embeddings + mask embeddings + image features => Mask2Former decoder => bipartite matching loss + query-based decoding
    LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

    Paper | Github

1. uses SAM to generate mask candidates, then formulates the problem as mask selection (mask classification); see the sketch below
2. introduces the LLM-Seg40K dataset, built by using LLaVA to generate captions and then GPT-4 to generate question-answer pairs
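
Mask selection then reduces to scoring each SAM proposal against an LLM-derived query embedding; the sketch below uses cosine similarity purely for illustration (the names and the exact scoring head are assumptions, not the paper's).

```python
import torch
import torch.nn.functional as F

def select_mask(llm_query_embed, candidate_mask_embeds):
    """Score mask candidates against the LLM query embedding and pick the best one.

    llm_query_embed:       (D,)   embedding derived from the LLM output
    candidate_mask_embeds: (N, D) one pooled feature per SAM-proposed mask
    """
    scores = F.cosine_similarity(candidate_mask_embeds, llm_query_embed.unsqueeze(0), dim=-1)
    return scores.argmax().item(), scores  # index of the selected mask, per-candidate scores
```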
    GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

    Paper | Project

1. disentangles grounding from referring
2. grounding as mask selection: trains a Mask2Former+ to generate mask candidates
3. referring via mask pooling on features
4. introduces the 2.5M M3G2 dataset
    DetGPT: Detect What You Need via Reasoning

    Paper | Github | Project

1. follows LLaVA to tune the VLM for VQA
2. uses Grounding DINO to ground the response generated by the VLM and detect the relevant entities
    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Paper | Github

1. proposes a hybrid region representation for referring: region name + coordinates + mask-pooled features from a spatial-aware visual sampler
2. grounding through bbox coordinates
    Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

    Paper

1. proposes a bunch of improvements over Ferret v1, including:
2. any-resolution (patch-based) processing for higher input resolution
3. a DINOv2 encoder for local feature extraction
4. a High-resolution Dense Alignment stage between the alignment pre-training and instruction tuning
    u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

    Paper | Github

1. proposes to use different decoders for grounding (SAM for segmentation, Grounding DINO for detection)
    GSVA: Generalized Segmentation via Multimodal Large Language Models

    Paper | Github

1. tackles Generalized Referring Expression Segmentation (GRES) with a grounding LLM
  1. multiple objects to ground
  2. need to reject null targets
2. proposes using multiple [SEG] tokens to ground multiple objects (each indicated by the text before its [SEG] token), and a [REJ] token to reject null targets
    NExT-Chat: An LMM for Chat, Detection and Segmentation

    Paper | Github | Project

1. proposes a box encoder-decoder for referring and grounding
2. for grounding, uses a trigger token to indicate the presence of a grounding output and feeds its latent embedding to the box decoder (or a mask decoder, e.g. SAM) for box (mask) generation
3. for referring, uses boxes to represent the referred region and a box encoder to encode the referred boxes into features, which are input to the LLM
4. proposes a cycle-consistency loss to regularize the box encoder-decoder (see the sketch below)
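
A minimal sketch of one direction of that cycle (sizes and the loss choice are illustrative; the paper also constrains the embedding-to-embedding direction): encode a ground-truth box into a latent, decode it back, and penalize the reconstruction error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 4096  # illustrative LLM hidden size
box_encoder = nn.Sequential(nn.Linear(4, hidden), nn.GELU(), nn.Linear(hidden, hidden))
box_decoder = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, 4))

def cycle_consistency_loss(gt_boxes):
    """decode(encode(box)) should reproduce the box."""
    latent = box_encoder(gt_boxes)   # box -> latent referring embedding
    recon = box_decoder(latent)      # latent -> box again
    return F.l1_loss(recon, gt_boxes)

loss = cycle_consistency_loss(torch.rand(8, 4))  # 8 normalized [x1, y1, x2, y2] boxes
```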
    PerceptionGPT: Effectively Fusing Visual Perception into LLM

    Paper

1. similar to NExT-Chat, proposes a box encoder-decoder to encode and decode boxes, but seems to focus only on grounding without referring
2. one possibly intriguing point: a grounding-output indicator <vis> is used to indicate the presence of a grounding output (as usual), but its embedding is replaced by the encoder's output feature in the LLM input.
    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Paper | Github

1. builds a web-scale grounding dataset from web-scale data (COYO-700M, LAION-2B, etc.) and a vision detector (GLIP)
2. following Pix2Seq, divides the image into a P x P grid and introduces P x P new location tokens to represent positions (see the sketch below)
3. uses <box></box> to represent a bbox, with <delim> to separate multiple boxes (if there are several)
4. uses markdown-like grammar to mark grounded text with <p> </p>
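
A minimal sketch of the grid-based location tokens (the token naming below is illustrative, not Kosmos-2's exact vocabulary strings): the image is split into a P x P grid, and a box is written as the location tokens of its top-left and bottom-right cells.

```python
def box_to_location_tokens(box, img_w, img_h, P=32):
    """Map a bbox to two grid-cell location tokens (top-left and bottom-right corners)."""
    x1, y1, x2, y2 = box

    def cell_index(x, y):
        col = min(int(x / img_w * P), P - 1)
        row = min(int(y / img_h * P), P - 1)
        return row * P + col

    return f"<box><loc_{cell_index(x1, y1)}><loc_{cell_index(x2, y2)}></box>"

print(box_to_location_tokens([64, 48, 320, 240], 640, 480))  # -> <box><loc_99><loc_528></box>
```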
    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Paper | Github

1. proposes using normalized box coordinates for unified grounding and referring
2. all normalized boxes are represented as plain text (tokenized directly by the text tokenizer) and fed to the LLM
    Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

    Paper | Github | Project

1. proposes to ground and refer with a set of proposed regions
2. converts a Deformable DETR detection head into a binary classifier to propose RoIs and uses RoIAlign pooling to obtain the region features

🔥 Multi-modality

GroundingGPT: Language Enhanced Multi-modal Grounding Model

    Paper | Github

1. grounding and referring across modalities, expressed in text
  1. bounding boxes as four relative coordinate values: [x1, y1, x2, y2]
  2. video timestamps as two two-digit decimals: {t1, t2}
2. curates datasets for three-stage training
