This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey
[Paper]
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently: they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training, VLM transfer learning, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
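The zero-shot prediction paradigm described above can be sketched as follows: a VLM embeds an image and a set of class-name prompts into a shared space, then predicts the class whose prompt embedding is most similar to the image embedding. This is a minimal, illustrative numpy example with toy random embeddings; the encoder outputs are hypothetical stand-ins for the large pre-trained image and text encoders a real VLM (e.g. CLIP) would use.

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit sphere so that dot products
    # equal cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for a VLM's encoder outputs (random here, learned in practice).
rng = np.random.default_rng(0)
image_embedding = l2_normalize(rng.normal(size=(1, 512)))

class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embeddings = l2_normalize(rng.normal(size=(len(class_prompts), 512)))

# Zero-shot classification: pick the class prompt whose text embedding
# is most similar to the image embedding. No task-specific training needed.
similarities = image_embedding @ text_embeddings.T  # shape (1, num_classes)
predicted = class_prompts[int(similarities.argmax())]
print(predicted)
```

Because the class set lives entirely in the text prompts, the same pre-trained model can be pointed at a new recognition task simply by swapping the prompt list.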
If you find our work useful in your research, please consider citing:
@article{zhang2023vision,
title={Vision-Language Models for Vision Tasks: A Survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={arXiv preprint arXiv:2304.00685},
year={2023}
}
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
- SBU Caption [Paper] [Project Page]
- COCO Caption [Paper] [Project Page]
- Yahoo Flickr Creative Commons 100 Million (YFCC100M) [Paper] [Project Page]
- Visual Genome [Paper] [Project Page]
- Conceptual Captions (CC3M) [Paper] [Project Page]
- Localized Narratives [Paper] [Project Page]
- Conceptual 12M (CC12M) [Paper] [Project Page]
- Wikipedia-based Image Text (WIT) [Paper] [Project Page]
- Red Caps [Paper] [Project Page]
- LAION400M [Paper] [Project Page]
- LAION5B [Paper] [Project Page]
- WuKong [Paper] [Project Page]
- MNIST [Project Page]
- Caltech-101 [Project Page]
- PASCAL VOC 2007 Classification [Project Page]
- Oxford 102 Flowers [Project Page]
- CIFAR-10 [Project Page]
- CIFAR-100 [Project Page]
- ImageNet-1k [Project Page]
- SUN397 [Project Page]
- SVHN [Project Page]
- STL-10 [Project Page]
- GTSRB [Project Page]
- KITTI Distance [Project Page]
- IIIT5k [Project Page]
- Oxford-IIIT PETS [Project Page]
- Stanford Cars [Project Page]
- FGVC Aircraft [Project Page]
- Facial Emotion Recognition 2013 [Project Page]
- Rendered SST2 [Project Page]
- Describable Textures (DTD) [Project Page]
- Food-101 [Project Page]
- Birdsnap [Project Page]
- RESISC45 [Project Page]
- CLEVR Counts [Project Page]
- PatchCamelyon [Project Page]
- EuroSAT [Project Page]
- Hateful Memes [Project Page]
- Country211 [Project Page]
- Flickr30k [Project Page]
- COCO Caption [Project Page]
- UCF101 [Project Page]
- Kinetics700 [Project Page]
- RareAct [Project Page]
- COCO 2014 Detection [Project Page]
- COCO 2017 Detection [Project Page]
- LVIS [Project Page]
- ODinW [Project Page]
- PASCAL VOC 2012 Segmentation [Project Page]
- PASCAL Context [Project Page]
- Cityscapes [Project Page]
- ADE20k [Project Page]
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) [Paper][Code]
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (ALIGN) [Paper]
- Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation (OTTER) [Paper][Code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP) [Paper][Code]
- Contrastive Vision-Language Pre-training with Limited Resources (ZeroVL) [Paper][Code]
- FILIP: Fine-grained Interactive Language-Image Pre-Training [Paper]
- Unified Contrastive Learning in Image-Text-Label Space (UniCL) [Paper][Code]
- Florence: A New Foundation Model for Computer Vision [Paper]
- SLIP: Self-supervision meets Language-Image Pre-training [Paper][Code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [Paper]
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [Paper][Code]
- LiT: Zero-Shot Transfer with Locked-image text Tuning [Paper][Code]
- AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities [Paper][Code]
- FLAVA: A Foundational Language And Vision Alignment Model [Paper][Code]
- Large-scale Bilingual Language-Image Contrastive Learning (KELIP) [Paper][Code]
- CoCa: Contrastive Captioners are Image-Text Foundation Models [Paper][Code]
- Non-Contrastive Learning Meets Language-Image Pre-Training (nCLIP) [Paper]
- K-LITE: Learning Transferable Visual Models with External Knowledge [Paper][Code]
- NLIP: Noise-robust Language-Image Pre-training [Paper]
- UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [Paper]
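Many of the pre-training methods listed above (CLIP, ALIGN, DeCLIP, FILIP, etc.) build on a symmetric image-text contrastive objective: matched image-text pairs are pulled together while mismatched pairs in the batch are pushed apart. The sketch below is an illustrative numpy version of that objective, not any one paper's implementation; the batch size, embedding dimension, and fixed temperature are arbitrary choices (real implementations typically learn the temperature and run on GPU tensors).

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: arrays of shape (batch, dim), where row i of each
    array forms a matched image-text pair.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    labels = np.arange(logits.shape[0])            # matched pairs on the diagonal

    def cross_entropy(lg, lb):
        # Row-wise softmax cross-entropy with the usual max-shift for stability.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(float(loss))
```

Minimizing this loss aligns each image with its own caption against all other captions in the batch, which is what makes the prompt-based zero-shot transfer described earlier possible.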