Customized curated list of Vision and Language works
| Name | Paper | Code |
| --- | --- | --- |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Text | code |
| MAGVIT: Masked Generative Video Transformer | Text | project page |
| Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | paper | - |
| PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data | paper | - |
| Learning Video Representations from Large Language Models | paper page | code |

| Name | Paper | Code |
| --- | --- | --- |
| REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | paper | - |
| Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning | paper | code |
Contrastive Image-Language Pretraining (CLIP)
| Name | Paper | Code |
| --- | --- | --- |
| CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet | Paper | - |
Miscellaneous but Interesting!
| Name | Paper | Code |
| --- | --- | --- |
| ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding | paper | page |
| Do DALL-E and Flamingo Understand Each Other? | paper | - |
| GPT Takes the Bar Exam | paper | code |