This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently: they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training, VLM transfer learning, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
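The zero-shot prediction paradigm described above can be sketched as follows: a VLM embeds an image and a set of class-name prompts into a shared space, then predicts the class whose prompt embedding is most similar to the image embedding. This is a minimal, illustrative numpy example with toy random embeddings; the encoder outputs are hypothetical stand-ins for the large pre-trained image and text encoders a real VLM (e.g. CLIP) would use.

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit sphere so that dot products
    # equal cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for a VLM's encoder outputs (random here, learned in practice).
rng = np.random.default_rng(0)
image_embedding = l2_normalize(rng.normal(size=(1, 512)))

class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embeddings = l2_normalize(rng.normal(size=(len(class_prompts), 512)))

# Zero-shot classification: pick the class prompt whose text embedding
# is most similar to the image embedding. No task-specific training needed.
similarities = image_embedding @ text_embeddings.T  # shape (1, num_classes)
predicted = class_prompts[int(similarities.argmax())]
print(predicted)
```

Because the class set lives entirely in the text prompts, the same pre-trained model can be pointed at a new recognition task simply by swapping the prompt list.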
If you find our work useful in your research, please consider citing:
@article{zhang2023vision,
title={Vision-Language Models for Vision Tasks: A Survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={arXiv preprint arXiv:2304.00685},
year={2023}
}
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
- SBU Caption [Paper] [Project Page]
- COCO Caption [Paper] [Project Page]
- Yahoo Flickr Creative Commons 100 Million (YFCC100M) [Paper] [Project Page]
- Visual Genome [Paper] [Project Page]
- Conceptual Captions (CC3M) [Paper] [Project Page]
- Localized Narratives [Paper] [Project Page]
- Conceptual 12M (CC12M) [Paper] [Project Page]
- Wikipedia-based Image Text (WIT) [Paper] [Project Page]
- Red Caps [Paper] [Project Page]
- LAION400M [Paper] [Project Page]
- LAION5B [Paper] [Project Page]
- WuKong [Paper] [Project Page]
- MNIST [Project Page]
- Caltech-101 [Project Page]
- PASCAL VOC 2007 Classification [Project Page]
- Oxford 102 Flowers [Project Page]
- CIFAR-10 [Project Page]
- CIFAR-100 [Project Page]
- ImageNet-1k [Project Page]
- SUN397 [Project Page]
- SVHN [Project Page]
- STL-10 [Project Page]
- GTSRB [Project Page]
- KITTI Distance [Project Page]
- IIIT5k [Project Page]
- Oxford-IIIT PETS [Project Page]
- Stanford Cars [Project Page]
- FGVC Aircraft [Project Page]
- Facial Emotion Recognition 2013 [Project Page]
- Rendered SST2 [Project Page]
- Describable Textures (DTD) [Project Page]
- Food-101 [Project Page]
- Birdsnap [Project Page]
- RESISC45 [Project Page]
- CLEVR Counts [Project Page]
- PatchCamelyon [Project Page]
- EuroSAT [Project Page]
- Hateful Memes [Project Page]
- Country211 [Project Page]
- Flickr30k [Project Page]
- COCO Caption [Project Page]
- UCF101 [Project Page]
- Kinetics700 [Project Page]
- RareAct [Project Page]
- COCO 2014 Detection [Project Page]
- COCO 2017 Detection [Project Page]
- LVIS [Project Page]
- ODinW [Project Page]
- PASCAL VOC 2012 Segmentation [Project Page]
- PASCAL Context [Project Page]
- Cityscapes [Project Page]
- ADE20k [Project Page]
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) [Paper][Code]
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (ALIGN) [Paper]
- Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation (OTTER) [Paper][Code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP) [Paper][Code]
- Contrastive Vision-Language Pre-training with Limited Resources (ZeroVL) [Paper][Code]
- FILIP: Fine-grained Interactive Language-Image Pre-Training [Paper]
- Unified Contrastive Learning in Image-Text-Label Space (UniCL) [Paper][Code]
- Florence: A New Foundation Model for Computer Vision [Paper]
- SLIP: Self-supervision meets Language-Image Pre-training [Paper][Code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [Paper]
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [Paper][Code]
- LiT: Zero-Shot Transfer with Locked-image text Tuning [Paper][Code]
- AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities [Paper][Code]
- FLAVA: A Foundational Language And Vision Alignment Model [Paper][Code]
- Large-scale Bilingual Language-Image Contrastive Learning (KELIP) [Paper][Code]
- CoCa: Contrastive Captioners are Image-Text Foundation Models [Paper][Code]
- Non-Contrastive Learning Meets Language-Image Pre-Training (nCLIP) [Paper]
- K-LITE: Learning Transferable Visual Models with External Knowledge [Paper][Code]
- NLIP: Noise-robust Language-Image Pre-training [Paper]
- UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [Paper]
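Many of the pre-training methods listed above (CLIP, ALIGN, DeCLIP, FILIP, etc.) build on a symmetric image-text contrastive objective: matched image-text pairs are pulled together while mismatched pairs in the batch are pushed apart. The sketch below is an illustrative numpy version of that objective, not any one paper's implementation; the batch size, embedding dimension, and fixed temperature are arbitrary choices (real implementations typically learn the temperature and run on GPU tensors).

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: arrays of shape (batch, dim), where row i of each
    array forms a matched image-text pair.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(logits.shape[0])            # matched pairs on the diagonal

    def cross_entropy(lg, lb):
        # Row-wise softmax cross-entropy with the usual max-shift for stability.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(float(loss))
```

Minimizing this loss aligns each image with its own caption against all other captions in the batch, which is what makes the prompt-based zero-shot transfer described earlier possible.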