Vision Language Models for Vision Tasks: A Survey

This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies on various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:

Vision-Language Models for Vision Tasks: A Survey
[Paper]

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single model. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
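To make the zero-shot prediction described above concrete, the sketch below classifies an image by comparing its embedding against embeddings of text prompts built from class names. It is a minimal illustration, not part of the survey; it assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, and the file name example.jpg is hypothetical.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained VLM (assumed checkpoint; any CLIP-style model works).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
class_names = ["cat", "dog", "car"]
# Prompt templates turn class labels into natural-language captions.
texts = [f"a photo of a {name}" for name in class_names]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity logits; softmax over classes yields zero-shot probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))

No task-specific training is involved: swapping in a different list of class names re-targets the same pre-trained model to a new classification task.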

Citation

If you find our work useful in your research, please consider citing:

@article{zhang2023vision,
  title={Vision-Language Models for Vision Tasks: A Survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={arXiv preprint arXiv:2304.00685},
  year={2023}
}

Menu

Datasets

Datasets for VLM Pre-training

Public Datasets

  1. SBU Caption [Paper] [Project Page]
  2. COCO Caption [Paper] [Project Page]
  3. Yahoo Flickr Creative Commons 100 Million (YFCC100M) [Paper] [Project Page]
  4. Visual Genome [Paper] [Project Page]
  5. Conceptual Captions (CC3M) [Paper] [Project Page]
  6. Localized Narratives [Paper] [Project Page]
  7. Conceptual 12M (CC12M) [Paper] [Project Page]
  8. Wikipedia-based Image Text (WIT) [Paper] [Project Page]
  9. Red Caps [Paper] [Project Page]
  10. LAION400M [Paper] [Project Page]
  11. LAION5B [Paper] [Project Page]
  12. WuKong [Paper] [Project Page]

Non-public Datasets

  1. CLIP [Paper]
  2. ALIGN [Paper]
  3. FILIP [Paper]
  4. WebLI [Paper]

Datasets for VLM Evaluation

Image Classification

  1. MNIST [Project Page]
  2. Caltech-101 [Project Page]
  3. PASCAL VOC 2007 Classification [Project Page]
  4. Oxford 102 Flowers [Project Page]
  5. CIFAR-10 [Project Page]
  6. CIFAR-100 [Project Page]
  7. ImageNet-1k [Project Page]
  8. SUN397 [Project Page]
  9. SVHN [Project Page]
  10. STL-10 [Project Page]
  11. GTSRB [Project Page]
  12. KITTI Distance [Project Page]
  13. IIIT5k [Project Page]
  14. Oxford-IIIT PETS [Project Page]
  15. Stanford Cars [Project Page]
  16. FGVC Aircraft [Project Page]
  17. Facial Emotion Recognition 2013 [Project Page]
  18. Rendered SST2 [Project Page]
  19. Describable Textures (DTD) [Project Page]
  20. Food-101 [Project Page]
  21. Birdsnap [Project Page]
  22. RESISC45 [Project Page]
  23. CLEVR Counts [Project Page]
  24. PatchCamelyon [Project Page]
  25. EuroSAT [Project Page]
  26. Hateful Memes [Project Page]
  27. Country211 [Project Page]

Image-Text Retrieval

  1. Flickr30k [Project Page]
  2. COCO Caption [Project Page]

Action Recognition

  1. UCF101 [Project Page]
  2. Kinetics700 [Project Page]
  3. RareAct [Project Page]

Object Detection

  1. COCO 2014 Detection [Project Page]
  2. COCO 2017 Detection [Project Page]
  3. LVIS [Project Page]
  4. ODinW [Project Page]

Semantic Segmentation

  1. PASCAL VOC 2012 Segmentation [Project Page]
  2. PASCAL Context [Project Page]
  3. Cityscapes [Project Page]
  4. ADE20k [Project Page]

Vision-Language Pre-training Methods

Pre-training with Contrastive Objective

  1. Learning Transferable Visual Models From Natural Language Supervision (CLIP) [Paper][Code]
  2. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (ALIGN) [Paper]
  3. Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation (OTTER) [Paper][Code]
  4. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP) [Paper][Code]
  5. Contrastive Vision-Language Pre-training with Limited Resources (ZeroVL) [Paper][Code]
  6. FILIP: Fine-grained Interactive Language-Image Pre-Training [Paper]
  7. Unified Contrastive Learning in Image-Text-Label Space (UniCL) [Paper][Code]
  8. Florence: A New Foundation Model for Computer Vision [Paper]
  9. SLIP: Self-supervision meets Language-Image Pre-training [Paper][Code]
  10. PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [Paper]
  11. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [Paper][Code]
  12. LiT: Zero-Shot Transfer with Locked-image text Tuning [Paper][Code]
  13. AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities [Paper][Code]
  14. FLAVA: A Foundational Language And Vision Alignment Model [Paper][Code]
  15. Large-scale Bilingual Language-Image Contrastive Learning (KELIP) [Paper][Code]
  16. CoCa: Contrastive Captioners are Image-Text Foundation Models [Paper][Code]
  17. Non-Contrastive Learning Meets Language-Image Pre-Training (nCLIP) [Paper]
  18. K-LITE: Learning Transferable Visual Models with External Knowledge [Paper][Code]
  19. NLIP: Noise-robust Language-Image Pre-training [Paper]
  20. UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [Paper]
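The methods listed above share a common core: a symmetric image-to-text and text-to-image contrastive (InfoNCE-style) objective over a batch of paired image and text embeddings. The PyTorch sketch below illustrates that shared objective only; it is not any particular paper's implementation, and names such as image_features and temperature are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity logits, scaled by an (often learnable) temperature.
    logits = image_features @ text_features.t() / temperature
    # Within a batch, the i-th image is paired with the i-th text.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over both matching directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

For a batch of N pairs, this loss pulls the N matched image-text pairs together while pushing apart the N^2 - N mismatched combinations; the methods above differ mainly in how the pairs, encoders, and granularity of alignment are constructed.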

Pre-training with Generative Objective

Pre-training with Alignment Objective

Vision-Language Model Transfer Learning Methods

Vision-Language Model Knowledge Distillation Methods
