𝓐𝔀𝓮𝓼𝓸𝓶𝓮 𝓣𝓮𝔁𝓽📝-𝓽𝓸-𝓘𝓶𝓪𝓰𝓮🌇

𝓐 𝓬𝓸𝓵𝓵𝓮𝓬𝓽𝓲𝓸𝓷 𝓸𝓯 𝓻𝓮𝓼𝓸𝓾𝓻𝓬𝓮𝓼 𝓸𝓷 𝓽𝓮𝔁𝓽-𝓽𝓸-𝓲𝓶𝓪𝓰𝓮 𝓼𝔂𝓷𝓽𝓱𝓮𝓼𝓲𝓼/𝓶𝓪𝓷𝓲𝓹𝓾𝓵𝓪𝓽𝓲𝓸𝓷 𝓽𝓪𝓼𝓴𝓼.

From: Hierarchical Text-Conditional Image Generation with CLIP Latents

To Do

- Add Best Collection for Awesome-Text-to-Image
- Add Topic Order list and Chronological Order list

Content

- 1. Description
- 2. Quantitative Evaluation Metrics
- 3. Datasets
- 4. Project
- 5. ⏳Recently Focused Papers (FYI)
- 6. Paper With Code
  - Survey
  - Text to Face👨🏻🧒👧🏼🧓🏽
  - Compounding Issues🤔
  - 2022
  - 2021
  - 2020
  - 2019
  - 2018
  - 2017
  - 2016
- 7. Other Related Works
- Contact Me

1. Description

In the last few decades, the fields of Computer Vision (CV) and Natural Language Processing (NLP) have seen several major technological breakthroughs in deep learning research. Recently, researchers have become interested in combining semantic and visual information across these traditionally independent fields. A number of studies have addressed text-to-image synthesis techniques, which translate an input textual description (keywords or sentences) into realistic images.
Papers, code, and datasets for the text-to-image task are available here.

🐌 Markdown Format:

(Conference/Journal Year) Title, First Author et al. [Paper] [Code] [Project]
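
For readers who want to process the list programmatically, the entry format above can be split into fields with a regular expression. This is only a rough sketch: the function name `parse_entry` and the field names are hypothetical, and it assumes the venue sits in the first pair of parentheses and the link labels in trailing square brackets. Topic tags such as `[💬...]` at the start of a title are left inside the title.

```python
import re

# Pattern for one list entry; group names are illustrative, not part of the list's convention.
ENTRY_RE = re.compile(
    r"^\s*(?:-\s*)?"                      # optional list dash
    r"(?P<stars>⭐*)\s*"                  # optional highlight stars
    r"\((?P<venue>[^)]*)\)\s*"            # (Conference/Journal Year)
    r"(?P<body>.*?)"                      # Title, First Author et al.
    r"(?P<links>(?:\s*\[[^\]]*\])*)\s*$"  # trailing [Paper] [Code] [Project] labels
)


def parse_entry(line):
    """Split one entry into venue, title, author, and link labels (None if no match)."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    body = match.group("body").strip().rstrip(",")
    # Titles may contain commas, so the author is taken after the LAST ", ".
    title, sep, author = body.rpartition(", ")
    if not sep:
        title, author = body, ""
    return {
        "stars": len(match.group("stars")),
        "venue": match.group("venue").strip(),
        "title": title.strip(),
        "author": author.strip(),
        "links": re.findall(r"\[([^\]]*)\]", match.group("links")),
    }


entry = parse_entry(
    "- (ICML 2016) Generative Adversarial Text to Image Synthesis, "
    "Scott Reed et al. [Paper] [Code]"
)
print(entry)
```

Splitting on the last comma keeps titles such as "Tell, Draw, and Repeat: ..." intact, at the cost of mis-parsing entries that omit the comma before the author.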

2. Quantitative Evaluation Metrics «🎯Back To Top»

Inception Score (IS) [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Python Code (Tensorflow)] [Ref.Code(AttnGAN)]
Fréchet Inception Distance (FID) [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Python Code (Tensorflow)] [Ref.Code(DM-GAN)]
R-precision [Paper] [Ref.Code(CPGAN)]
L₂ error [Paper]
Learned Perceptual Image Patch Similarity (LPIPS) [Paper] [Python Code]
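
As a rough illustration of what the first two metrics compute, the sketch below implements the standard IS and FID formulas with NumPy only. It is not the code from any of the linked repositories. It assumes the Inception-v3 class probabilities and pooled features have already been extracted elsewhere, and the function names are hypothetical.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """IS = exp(mean_x KL(p(y|x) || p(y))) from an (N, C) array of class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over all samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets, D >= 2."""
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_fake = np.asarray(feats_fake, dtype=np.float64)
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Tr((cov_r cov_f)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_f, which avoids an explicit matrix square root.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)


# Toy usage with random stand-in features (real use would pass Inception-v3 outputs).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
fake = real + 1.0  # same covariance, mean shifted by 1 in each of 4 dimensions
print(frechet_distance(real, fake))
print(inception_score(np.eye(5)))
```

Higher IS and lower FID are better. Published numbers depend on the feature extractor, sample count, and preprocessing, so results from a sketch like this are not comparable with reported scores.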

3. Datasets «🎯Back To Top»

Caltech-UCSD Bird (CUB)

Caltech-UCSD Birds-200-2011 (CUB-200-2011) is an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations.
- Detailed information (Images): ⇒ [Paper] [Website]
  - Number of different categories: 200 (Training: 150 categories. Testing: 50 categories.)
  - Number of bird images: 11,788
  - Annotations per image: 15 Part Locations, 312 Binary Attributes, 1 Bounding Box, Ground-truth Segmentation
- Detailed information (Text Descriptions): ⇒ [Paper] [Website]
  - Descriptions per image: 10 Captions
Oxford-102 Flower

Oxford-102 Flower is a dataset of 102 flower categories. The flowers chosen are those commonly occurring in the United Kingdom. The images have large scale, pose, and light variations.
- Detailed information (Images): ⇒ [Paper] [Website]
  - Number of different categories: 102 (Training: 82 categories. Testing: 20 categories.)
  - Number of flower images: 8,189
- Detailed information (Text Descriptions): ⇒ [Paper] [Download]
  - Descriptions per image: 10 Captions
MS-COCO

COCO is a large-scale object detection, segmentation, and captioning dataset.
- Detailed information (Images): ⇒ [Paper] [Website]
  - Number of different categories: 91
  - Number of images: 120k (Training: 80k. Testing: 40k.)
- Detailed information (Text Descriptions): ⇒ [Paper] [Download]
  - Descriptions per image: 5 Captions
Multi-Modal-CelebA-HQ

Multi-Modal-CelebA-HQ is a large-scale face image dataset for text-to-image generation, text-guided image manipulation, sketch-to-image generation, GANs for face generation and editing, image captioning, and VQA.
- Detailed information (Images & Text Descriptions): ⇒ [Paper] [Website] [Download]
  - Number of images (from CelebA-HQ): 30,000 (Training: 24,000. Testing: 6,000.)
  - Descriptions per image: 10 Captions
- Detailed information (Masks):
  - Number of masks (from CelebAMask-HQ): 30,000 (512 x 512)
- Detailed information (Sketches):
  - Number of Sketches: 30,000 (512 x 512)
- Detailed information (Image with transparent background):
  - Not fully uploaded
CelebA-Dialog

CelebA-Dialog is a large-scale visual-language face dataset. It has two properties: (1) Facial images are annotated with rich fine-grained labels, which classify one attribute into multiple degrees according to its semantic meaning. (2) Each image is accompanied by captions describing the attributes and a user request sample.
- Detailed information (Images & Text Descriptions): ⇒ [Paper] [Website] [Download]
  - Number of identities: 10,177
  - Number of images: 202,599
  - 5 fine-grained attributes annotations per image: Bangs, Eyeglasses, Beard, Smiling, and Age
FFHQ-Text

FFHQ-Text is a small-scale face image dataset with large-scale facial attributes, designed for text-to-face generation & manipulation, text-guided facial image manipulation, and other vision-related tasks.
- Detailed information (Images & Text Descriptions): ⇒ [Paper] [Website] [Download]
  - Number of images (from FFHQ): 760 (Training: 500. Testing: 260.)
  - Descriptions per image: 9 Captions
  - 13 multi-valued facial element groups from coarse to fine.
- Detailed information (BBox): ⇒ [Website]
CelebAText-HQ

CelebAText-HQ is a large-scale face image dataset with large-scale facial attributes, designed for text-to-face generation.
- Detailed information (Images & Text Descriptions): ⇒ [Paper] [Website] [Download]
  - Number of images (from CelebA-HQ): 15,010 (Training: 13,710. Testing: 1,300.)
  - Descriptions per image: 10 Captions
DeepFashion-MultiModal

DeepFashion-MultiModal is a large-scale high-quality human dataset. Human images are annotated with rich multi-modal labels, including human parsing labels, keypoints, densepose, fine-grained attributes and textual descriptions.
- Detailed information (Images & Text Descriptions): ⇒ [Paper] [Website] [Download]
  - Number of images: 44,096, including 12,701 full body images
  - Descriptions per image: 1 Caption

4. Project «🎯Back To Top»

⭐Stable Diffusion. [Awesome Stable-Diffusion] [Researcher Access Form] [Github] [Web UI] [WebUI Docker] [Hugging Face] [DreamStudio Beta]
- Stable Diffusion is a text-to-image model that will empower billions of people to create stunning art within seconds. It is a breakthrough in speed and quality, meaning that it can run on consumer GPUs.
Midjourney. [Documentation] [Homepage]
- Midjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species. There are two ways to experience the tools: the Midjourney Bot and the Web App.
Artflow. [Start Creating!]
- Artflow lets users generate visual content with the help of AI. Create unique avatars with ease and turn any description into a portrait.
craiyon(~~DALL·E Mini~~). [Short Video Explanation] [Blog] [Github] [Huggingface official demo] [Homepage] [min(DALL·E)]
- A free, open-source AI that produces amazing images from text inputs. AI model drawing images from any prompt!
Disco Diffusion. [Github] [Colab]
- A frankensteinian amalgamation of notebooks, models and techniques for the generation of AI Art and Animations.
Aphantasia. [Github]
- This is a text-to-image tool, part of the artwork of the same name. (Aphantasia is the inability to visualize mental images, the deprivation of visual dreams.)
Text2Art. [Try it now!] [Github] [Blog]
- Text2Art is an AI-powered art generator based on VQGAN+CLIP that can generate all kinds of art such as pixel art, drawing, and painting from just text input.
Survey Text Based Image Synthesis [Blog (2021)]

5. ⏳Recently Focused Papers (FYI) «🎯Back To Top»

⭐(arXiv preprint 2022) ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, Zhida Feng et al. [Paper]
- 🍬 ERNIE-ViLG 2.0: a large-scale Chinese text-to-image diffusion model, which progressively upgrades the quality of generated images by: (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) utilizing different denoising experts at different denoising stages. ERNIE-ViLG 2.0 achieves state-of-the-art on MS-COCO with a zero-shot FID score of 6.75.
⭐⭐(arXiv preprint 2022) Prompt-to-Prompt Image Editing with Cross Attention Control, Amir Hertz et al. [Paper] [Code] [Unofficial Code] [Project]
- 🍬 Prompt-to-Prompt Editing: Control the attention maps of the edited image by injecting the attention maps of the original image along the diffusion process. Monitor the synthesis process by editing the textual prompt only, paving the way to a myriad of caption-based editing applications.
⭐⭐(arXiv preprint 2022) Imagen Video: High Definition Video Generation with Diffusion Models, Jonathan Ho et al. [Paper] [Project]
- 🍬 Imagen Video: Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. Imagen Video is not only capable of generating videos of high fidelity, but also has a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding.
⭐⭐(arXiv preprint 2022) Make-A-Video: Text-to-Video Generation without Text-Video Data, Uriel Singer et al. [Paper] [Project] [Short read] [Code]
- 🍬 Make-A-Video: Meta AI's model generates videos from text without paired text-video data. It is presented as the new state-of-the-art method, producing higher-quality and more coherent videos than earlier approaches.
⭐⭐(arXiv preprint 2022) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Nataniel Ruiz et al. [Paper] [Project]
- 🍬 DreamBooth: Given just a few images of a subject as input, DreamBooth fine-tunes a pretrained text-to-image model (Imagen) so that it learns to bind a unique identifier with that specific subject. This enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images.
- 📚 Subject Recontextualization, Text-guided View Synthesis, Appearance Modification, Artistic Rendering (all while preserving the subject's key features)

6. Paper With Code

Survey «🎯Back To Top»
- Text-to-Image Synthesis: A Comparative Study [v1(Digital Transformation Technology)] (2021.08)
- A survey on generative adversarial network-based text-to-image synthesis [v1(Neurocomputing)] (2021.04)
- Adversarial Text-to-Image Synthesis: A Review [v1(arXiv)] (2021.01) [v2(Neural Networks)] (2021.08)
- A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis [v1(arXiv)] (2019.10)
Text to Face👨🏻🧒👧🏼🧓🏽 «🎯Back To Top»
- (arXiv preprint 2022) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, Wanfeng Zheng et al. [Paper]
- (ACMMM 2022) Learning Dynamic Prior Knowledge for Text-to-Face Pixel Synthesis, Jun Peng et al. [Paper]
- (ACMMM 2022) Towards Open-Ended Text-to-Face Generation, Combination and Manipulation, Jun Peng et al. [Paper]
- (BMVC 2022) clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP, Justin N. M. Pinkney et al. [Paper] [Code]
- (arXiv preprint 2022) ManiCLIP: Multi-Attribute Face Manipulation from Text, Hao Wang et al. [Paper]
- (arXiv preprint 2022) Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2, Ali Borji [Paper] [Code] [Data]
- (arXiv preprint 2022) Text-Free Learning of a Natural Language Interface for Pretrained Face Generators, Xiaodan Du et al. [Paper] [Code]
- (Knowledge-Based Systems-2022) CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis, Xiaodong Luo et al. [Paper]
- (Neural Networks-2022) DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis, Xiaodong Luo et al. [Paper]
- (arXiv preprint 2022) Text-to-Face Generation with StyleGAN2, D. M. A. Ayanthi et al. [Paper]
- (CVPR 2022) StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis, Zhiheng Li et al. [Paper] [Code]
- (arXiv preprint 2022) StyleT2F: Generating Human Faces from Textual Description Using StyleGAN2, Mohamed Shawky Sabae et al. [Paper] [Code]
- (CVPR 2022) AnyFace: Free-style Text-to-Face Synthesis and Manipulation, Jianxin Sun et al. [Paper]
- (IEEE Transactions on Network Science and Engineering-2022) TextFace: Text-to-Style Mapping based Face Generation and Manipulation, Xianxu Hou et al. [Paper]
- (FG 2021) Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model, Yutong Zhou et al. [Paper]
- (ACMMM 2021) Multi-caption Text-to-Face Synthesis: Dataset and Algorithm, Jianxin Sun et al. [Paper] [Code]
- (ACMMM 2021) Generative Adversarial Network for Text-to-Face Synthesis and Manipulation, Yutong Zhou. [Paper]
- (WACV 2021) Faces a la Carte: Text-to-Face Generation via Attribute Disentanglement, Tianren Wang et al. [Paper]
- (arXiv preprint 2019) FTGAN: A Fully-trained Generative Adversarial Networks for Text to Face Generation, Xiang Chen et al. [Paper]
Compounding Issues🤔 «🎯Back To Top»
- (arXiv preprint 2022) [💬 Racial Politics] A Sign That Spells: DALL-E 2, Invisual Images and The Racial Politics of Feature Space, Fabian Offert et al. [Paper]
- (arXiv preprint 2022) [💬 Demographic Stereotypes] Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, Federico Bianchi et al. [Paper]
- (arXiv preprint 2022) [💬 Privacy Analysis] Membership Inference Attacks Against Text-to-image Generation Models, Yixin Wu et al. [Paper]
- (arXiv preprint 2022) [💬 Authenticity Evaluation for Fake Images] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models, Zeyang Sha et al. [Paper]
- (arXiv preprint 2022) [💬 Cultural Bias] The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models, Lukas Struppek et al. [Paper]
2022 «🎯Back To Top»
- (arXiv preprint 2022) HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation, Kaiduo Zhang et al. [Paper]
- ⭐⭐(arXiv preprint 2022) eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, Yogesh Balaji et al. [Paper] [Project] [Video]
- (arXiv preprint 2022) [💬Text-Image Consistency] Towards Better Text-Image Consistency in Text-to-Image Generation, Zhaorui Tan et al. [Paper]
- (arXiv preprint 2022) ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, Zhida Feng et al. [Paper]
- ⭐(ECCV 2022) [💬Evaluation Metrics] TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation, Tan M. Dinh et al. [Paper] [Code] [Project]
- (ECCV 2022) [💬Trace+Text→Image] Trace Controlled Text to Image Generation, Kun Yan et al. [Paper]
- (arXiv preprint 2022) [💬Markup→Image] Markup-to-Image Diffusion Models with Scheduled Sampling, Yuntian Deng et al. [Paper] [Code]
- (arXiv preprint 2022) Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation, Ruijun Li et al. [Paper]
- ⭐⭐(arXiv preprint 2022) DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics, Ivan Kapelyukh et al. [Paper] [Project]
- (arXiv preprint 2022) Progressive Denoising Model for Fine-Grained Text-to-Image Generation, Zhengcong Fei et al. [Paper]
- (arXiv preprint 2022) Creative Painting with Latent Diffusion Models, Xianchao Wu [Paper]
- (arXiv preprint 2022) Re-Imagen: Retrieval-Augmented Text-to-Image Generator, Wenhu Chen et al. [Paper]
- (ACMMM 2022) AtHom: Two Divergent Attentions Stimulated By Homomorphic Training in Text-to-Image Synthesis, Zhenbo Shi et al. [Paper]
- (ACMMM 2022) Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image Generation, Xintian Wu et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Aesthetic Image Generation] Best Prompts for Text-to-Image Models and How to Find Them, Nikita Pavlichenko et al. [Paper]
- (ACMMM 2022) AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation, Yiyang Ma et al. [Paper]
- (ACMMM 2022) DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation, Mengqi Huang et al. [Paper]
- (arXiv preprint 2022) [💬Radiology] What Does DALL-E 2 Know About Radiology?, Lisa C. Adams et al. [Paper]
- (arXiv preprint 2022) Prompt-to-Prompt Image Editing with Cross Attention Control, Amir Hertz et al. [Paper] [Code] [Unofficial Code] [Project]
- (arXiv preprint 2022) Text to Image Generation: Leaving no Language Behind, Pedro Reviriego et al. [Paper]
- (arXiv preprint 2022) Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks, Qingrong Cheng et al. [Paper]
- (arXiv preprint 2022) [💬Visual Understanding on Generated Images] How good are deep models in understanding the generated images?, Ali Borji [Paper]
- (arXiv preprint 2022) Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork, Xin Yuan et al. [Paper]
- (arXiv preprint 2022) [💬Hybrid word→Image] Adversarial Attacks on Image Generation With Made-Up Words, Raphaël Millière [Paper]
- (arXiv preprint 2022) Memory-Driven Text-to-Image Generation, Bowen Li et al. [Paper]
- (arXiv preprint 2022) [💬Text-to-Person] T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up, Lin Wu et al. [Paper] [Code]
- (arXiv preprint 2022) LogicRank: Logic Induced Reranking for Generative Text-to-Image Systems, Björn Deiseroth et al. [Paper]
- (arXiv preprint 2022) [💬Text→Layout→Image] Layout-Bridging Text-to-Image Synthesis, Jiadong Liang et al. [Paper]
- (arXiv preprint 2022) DALLE-URBAN: Capturing the urban design expertise of large text to image transformers, Sachith Seneviratne et al. [Paper] [Generated Images]
- (arXiv preprint 2022) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, Rinon Gal et al. [Paper] [Code] [Project]
- (arXiv preprint 2022) [💬Relational Understanding Analysis] Testing Relational Understanding in Text-Guided Image Generation, Colin Conwell [Paper]
- (arXiv preprint 2022) [💬Lighting Consistency Analysis] Lighting (In)consistency of Paint by Text, Hany Farid [Paper]
- (ECCV 2022) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, Chenfei Wu et al. [Paper] [Code]
  - Multimodal Pretrained Model for Multi-tasks🎄: Text-To-Image (T2I), Sketch-to-Image (S2I), Image Completion (I2I), Text-Guided Image Manipulation (TI2I), Text-to-Video (T2V), Video Prediction (V2V), Sketch-to-Video (S2V), Text-Guided Video Manipulation (TV2V)
    
    (From: https://github.com/microsoft/NUWA [2021/11/30])
- (ECCV 2022) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, Oran Gafni et al. [Paper] [Code] [The Little Red Boat Story]
- (arXiv preprint 2022) Exploring Generative Adversarial Networks for Text-to-Image Generation with Evolution Strategies, Victor Costa et al. [Paper]
- (CVPR 2022) Text to Image Generation with Semantic-Spatial Aware GAN, Wentong Liao et al. [Paper] [Code]
- (ICMR 2022) Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis, Pei Dong et al. [Paper]
- (arXiv preprint 2022) Scaling Autoregressive Models for Content-Rich Text-to-Image Generation, Jiahui Yu et al. [Paper] [Code] [Project]
- (Information Sciences-2022) Text-to-Image Synthesis: Starting Composite from the Foreground Content, Zhiqiang Zhang et al. [Paper]
- (Applied Intelligence-2022) Generative adversarial network based on semantic consistency for text-to-image generation, Yue Ma et al. [Paper]
- (ICML 2022) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, Alex Nichol et al. [Paper] [Code]
- ⭐⭐(arXiv preprint 2022) Compositional Visual Generation with Composable Diffusion Models, Nan Liu et al. [Paper] [Code] [Project]
- (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation, Yuming Jiang et al. [Paper] [Code]
- ⭐⭐(arXiv preprint 2022) [Imagen] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Chitwan Saharia et al. [Paper] [Blog]
- (ICME 2022) GR-GAN: Gradual Refinement Text-to-image Generation, Bo Yang et al. [Paper] [Code]
- (CHI 2022) Design Guidelines for Prompt Engineering Text-to-Image Generative Models, Vivian Liu et al. [Paper]
- (Neural Processing Letters-2022) PBGN: Phased Bidirectional Generation Network in Text-to-Image Synthesis, Jianwei Zhu et al. [Paper]
- (Signal Processing: Image Communication-2022) ARRPNGAN: Text-to-image GAN with attention regularization and region proposal networks, Fengnan Quan et al. [Paper] [Code]
- (arXiv preprint 2022) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Ming Ding et al. [Paper] [Code]
- ⭐(OpenAI) [DALL-E 2] Hierarchical Text-Conditional Image Generation with CLIP Latents, Aditya Ramesh et al. [Paper] [Blog] [Risks and Limitations] [Unofficial Code]
- (arXiv preprint 2022) Recurrent Affine Transformation for Text-to-image Synthesis, Senmao Ye et al. [Paper] [Code]
- (AAAI 2022) Interactive Image Generation with Natural-Language Feedback, Yufan Zhou et al. [Paper]
- (IEEE Transactions on Neural Networks and Learning Systems-2022) DR-GAN: Distribution Regularization for Text-to-Image Generation, Hongchen Tan et al. [Paper] [Code]
- (Pattern Recognition Letters-2022) Text-to-image synthesis with self-supervised learning, Yong Xuan Tan et al. [Paper]
- (CVPR 2022) Vector Quantized Diffusion Model for Text-to-Image Synthesis, Shuyang Gu et al. [Paper] [Code]
- (CVPR 2022) Autoregressive Image Generation using Residual Quantization, Doyup Lee et al. [Paper] [Code]
- (CVPR 2022) Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer, Fuxiang Wu et al. [Paper]
- (CVPR 2022) LAFITE: Towards Language-Free Training for Text-to-Image Generation, Yufan Zhou et al. [Paper] [Code]
- (CVPR 2022) DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis, Ming Tao et al. [Paper] [Code]
- (arXiv preprint 2022) DT2I: Dense Text-to-Image Generation from Region Descriptions, Stanislav Frolov et al. [Paper]
- (arXiv preprint 2022) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, Zihao Wang et al. [Paper] [Code]
- (arXiv preprint 2022) OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs, Zhenxing Zhang et al. [Paper]
- (arXiv preprint 2022) DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, Jaemin Cho et al. [Paper] [Code]
- (IEEE Transactions on Network Science and Engineering-2022) Neural Architecture Search with a Lightweight Transformer for Text-to-Image Synthesis, Wei Li et al. [Paper]
- (Neurocomputing-2022) DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation, Zhenxing Zhang et al. [Paper]
- (Knowledge-Based Systems-2022) CJE-TIG: Zero-shot cross-lingual text-to-image generation by Corpora-based Joint Encoding, Han Zhang et al. [Paper]
- (WACV 2022) StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation, Umut Kocasarı et al. [Paper] [Project]
2021 «🎯Back To Top»
- (arXiv preprint 2021) Multimodal Conditional Image Synthesis with Product-of-Experts GANs, Xun Huang et al. [Paper] [Project]
  - Text-to-Image, Segmentation-to-Image, Text+Segmentation/Sketch/Image→Image, Sketch+Segmentation/Image→Image, Segmentation+Image→Image
- (IEEE TCSVT) RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge, Jun Cheng et al. [Paper]
- (ICONIP 2021) TRGAN: Text to Image Generation Through Optimizing Initial Image, Liang Zhao et al. [Paper]
- (NeurIPS 2021) Benchmark for Compositional Text-to-Image Synthesis, Dong Huk Park et al. [Paper] [Code]
- (arXiv preprint 2021) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, Xingchao Liu et al. [Paper] [Code]
- (ICONIP 2021) Self-Supervised Image-to-Text and Text-to-Image Synthesis, Anindya Sundar Das et al. [Paper]
- (arXiv preprint 2021) DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation, Zhenxing Zhang et al. [Paper]
- (Image and Vision Computing) Transformer models for enhancing AttnGAN based text to image generation, S. Naveen et al. [Paper]
- (ACMMM 2021) R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks, Yanyuan Qiao et al. [Paper]
- (ACMMM 2021) Cycle-Consistent Inverse GAN for Text-to-Image Synthesis, Hao Wang et al. [Paper]
- (ACMMM 2021) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, Yupan Huang et al. [Paper] [Code]
- (ACMMM 2021) A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation, Yupan Huang et al. [Paper] [Code]
- (ICCV 2021) Talk-to-Edit: Fine-Grained Facial Editing via Dialog, Yuming Jiang et al. [Paper] [Project] [Code]
- (ICCV 2021) DAE-GAN: Dynamic Aspect-Aware GAN for Text-to-Image Synthesis, Shulan Ruan et al. [Paper] [Supp] [Code]
- (ICIP 2021) Text To Image Synthesis With Erudite Generative Adversarial Networks, Zhiqiang Zhang et al. [Paper]
- (PRCV 2021) MAGAN: Multi-attention Generative Adversarial Networks for Text-to-Image Generation, Xibin Jia et al. [Paper]
- (AAAI 2021) TIME: Text and Image Mutual-Translation Adversarial Networks, Bingchen Liu et al. [Paper] [arXiv Paper]
- (IJCNN 2021) Text to Image Synthesis based on Multi-Perspective Fusion, Zhiqiang Zhang et al. [Paper]
- (arXiv preprint 2021) CRD-CGAN: Category-Consistent and Relativistic Constraints for Diverse Text-to-Image Generation, Tao Hu et al. [Paper]
- (arXiv preprint 2021) Improving Text-to-Image Synthesis Using Contrastive Learning, Hui Ye et al. [Paper] [Code]
- (arXiv preprint 2021) CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders, Kevin Frans et al. [Paper] [Code]
- (ICASSP 2021) Drawgan: Text to Image Synthesis with Drawing Generative Adversarial Networks, Zhiqiang Zhang et al. [Paper]
- (IJCNN 2021) DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation, Zhenxing Zhang et al. [Paper]
- (CVPR 2021) TediGAN: Text-Guided Diverse Image Generation and Manipulation, Weihao Xia et al. [Paper] [Extended Version] [Code] [Dataset] [Colab] [Video]
- (CVPR 2021) Cross-Modal Contrastive Learning for Text-to-Image Generation, Han Zhang et al. [Paper] [Code]
- (NeurIPS 2021) CogView: Mastering Text-to-Image Generation via Transformers, Ming Ding et al. [Paper] [Code] [Demo Website(Chinese)]
- (IEEE Transactions on Multimedia 2021) Modality Disentangled Discriminator for Text-to-Image Synthesis, Fangxiang Feng et al. [Paper] [Code]
- ⭐(arXiv preprint 2021) Zero-Shot Text-to-Image Generation, Aditya Ramesh et al. [Paper] [Code] [Blog] [Model Card] [Colab] [Code(Pytorch)]
- (Pattern Recognition 2021) Unsupervised text-to-image synthesis, Yanlong Dong et al. [Paper]
- (WACV 2021) Text-to-Image Generation Grounded by Fine-Grained User Attention, Jing Yu Koh et al. [Paper] [Code]
- (IEEE TIP 2021) Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis, Yanhua Yang et al. [Paper]
- (IEEE Access 2021) DGattGAN: Cooperative Up-Sampling Based Dual Generator Attentional GAN on Text-to-Image Synthesis, Han Zhang et al. [Paper]
2020 «🎯Back To Top»
- (WIREs Data Mining and Knowledge Discovery 2020) A survey and taxonomy of adversarial neural networks for text-to-image synthesis, Jorge Agnese et al. [Paper]
- (TPAMI 2020) Semantic Object Accuracy for Generative Text-to-Image Synthesis, Tobias Hinz et al. [Paper] [Code]
- (IEEE TIP 2020) KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis, Hongchen Tan et al. [Paper]
- (ACM Trans 2020) End-to-End Text-to-Image Synthesis with Spatial Constrains, Min Wang et al. [Paper]
- (Neural Networks) Image manipulation with natural language using Two-sided Attentive Conditional Generative Adversarial Network, Dawei Zhu et al. [Paper]
- (IEEE Access 2020) TiVGAN: Text to Image to Video Generation With Step-by-Step Evolutionary Generator, Doyeon Kim et al. [Paper]
- (IEEE Access 2020) Dualattn-GAN: Text to Image Synthesis With Dual Attentional Generative Adversarial Network, Yali Cai et al. [Paper]
- (ICCL 2020) VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks, Soyeon Caren Han et al. [Paper] [Code]
- (ECCV 2020) CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis, Jiadong Liang et al. [Paper] [Code]
- (CVPR 2020) RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge, Jun Cheng et al. [Paper]
- (CVPR 2020) CookGAN: Causality based Text-to-Image Synthesis, Bin Zhu et al. [Paper]
- (CVPR 2020 - Workshop) SegAttnGAN: Text to Image Generation with Segmentation Attention, Yuchuan Gou et al. [Paper]
- (IVPR 2020) PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding, Kanish Garg et al. [Paper]
- (COLING 2020) Leveraging Visual Question Answering to Improve Text-to-Image Synthesis, Stanislav Frolov et al. [Paper]
- (IRCDL 2020) Text-to-Image Synthesis Based on Machine Generated Captions, Marco Menardi et al. [Paper]
- (arXiv preprint 2020) MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs, Fangda Han et al. [Paper]
2019 «🎯Back To Top»
- (IEEE TCSVT 2019) Bridge-GAN: Interpretable Representation Learning for Text-to-image Synthesis, Mingkuan Yuan et al. [Paper] [Code]
- (AAAI 2019) Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis, Minfeng Zhu et al. [Paper]
- (AAAI 2019) Adversarial Learning of Semantic Relevance in Text to Image Synthesis, Miriam Cha et al. [Paper]
- (NeurIPS 2019) Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge, Tingting Qiao et al. [Paper] [Code]
- (NeurIPS 2019) Controllable Text-to-Image Generation, Bowen Li et al. [Paper] [Code]
- (CVPR 2019) DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis, Minfeng Zhu et al. [Paper] [Code]
- (CVPR 2019) Object-driven Text-to-Image Synthesis via Adversarial Training, Wenbo Li et al. [Paper] [Code]
- (CVPR 2019) MirrorGAN: Learning Text-to-image Generation by Redescription, Tingting Qiao et al. [Paper] [Code]
- (CVPR 2019) Text2Scene: Generating Abstract Scenes from Textual Descriptions, Fuwen Tan et al. [Paper] [Code]
- (CVPR 2019) Semantics Disentangling for Text-to-Image Generation, Guojun Yin et al. [Paper] [Website]
- (CVPR 2019) Text Guided Person Image Synthesis, Xingran Zhou et al. [Paper]
- (ICCV 2019) Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis, Hongchen Tan et al. [Paper]
- (ICCV 2019) Dual Adversarial Inference for Text-to-Image Synthesis, Qicheng Lao et al. [Paper]
- (ICCV 2019) Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction, Alaaeldin El-Nouby et al. [Paper] [Code]
- (BMVC 2019) MS-GAN: Text to Image Synthesis with Attention-Modulated Generators and Similarity-aware Discriminators, Fengling Mao et al. [Paper]
- (arXiv preprint 2019) GILT: Generating Images from Long Text, Ori Bar El et al. [Paper] [Code]
2018 «🎯Back To Top»
- (TPAMI 2018) StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks, Han Zhang et al. [Paper] [Code]
- (BMVC 2018) MC-GAN: Multi-conditional Generative Adversarial Network for Image Synthesis, Hyojin Park et al. [Paper] [Code]
- (CVPR 2018) AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, Tao Xu et al. [Paper] [Code]
- (CVPR 2018) Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network, Zizhao Zhang et al. [Paper] [Code]
- (CVPR 2018) Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis, Seunghoon Hong et al. [Paper]
- (CVPR 2018) Image Generation from Scene Graphs, Justin Johnson et al. [Paper] [Code]
- (ICLR 2018 - Workshop) ChatPainter: Improving Text to Image Generation using Dialogue, Shikhar Sharma et al. [Paper]
- (ACMMM 2018) Text-to-image Synthesis via Symmetrical Distillation Networks, Mingkuan Yuan et al. [Paper]
- (WACV 2018) C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis, K. J. Joseph et al. [Paper]
- (arXiv preprint 2018) Text to Image Synthesis Using Generative Adversarial Networks, Cristian Bodnar. [Paper]
- (arXiv preprint 2018) Text-to-image-to-text translation using cycle consistent adversarial networks, Satya Krishna Gorti et al. [Paper] [Code]
2017 «🎯Back To Top»
- (ICCV 2017) StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, Han Zhang et al. [Paper] [Code]
- (ICIP 2017) I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation, Hao Dong et al. [Paper] [Code]
- (MLSP 2017) Adversarial nets with perceptual losses for text-to-image synthesis, Miriam Cha et al. [Paper]
2016 «🎯Back To Top»
- (ICML 2016) Generative Adversarial Text to Image Synthesis, Scott Reed et al. [Paper] [Code]
- (NeurIPS 2016) Learning What and Where to Draw, Scott Reed et al. [Paper] [Code]

7. Other Related Works

⭐Multimodality⭐ «🎯Back To Top»
- (arXiv preprint 2022) Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, Xingqian Xu et al. [Paper] [Code] [Hugging Face]
  - 📚Text-to-Image, Image-Variation, Image-to-Text, Disentanglement, Text+Image-Guided Generation, Editable I2T2I
- (arXiv preprint 2022) Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis, Wan-Cyuan Fan et al. [Paper] [Code]
  - 📚Text-to-Image, Scene Graph-to-Image, Layout-to-Image, Unconditional Image Generation
- (arXiv preprint 2022) NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis, Chenfei Wu et al. [Paper] [Code] [Project]
  - 📚Unconditional Image Generation(HD), Text-to-Image(HD), Image Animation(HD), Image Outpainting(HD), Text-to-Video(HD)
- (ECCV 2022) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, Chenfei Wu et al. [Paper] [Code]
  - 📚Text-To-Image, Sketch-to-Image, Image Completion, Text-Guided Image Manipulation, Text-to-Video, Video Prediction, Sketch-to-Video, Text-Guided Video Manipulation
- (ACMMM 2022) Rethinking Super-Resolution as Text-Guided Details Generation, Chenxi Ma et al. [Paper]
  - 📚Text-to-Image, High-resolution, Text-guided High-resolution
- (arXiv preprint 2022) Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation, Ye Zhu et al. [Paper] [Code]
  - 📚Text-to-Image, Dance-to-Music, Class-to-Image
- (arXiv preprint 2022) M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing, Zhikang Li et al. [Paper]
  - 📚Text-to-Image, Unconditional Image Generation, Local-editing, Text-guided Local-editing, In/Out-painting, Style-mixing
- (CVPR 2022) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Yogesh Balaji et al. [Paper] [Code] [Project]
  - 📚Text-to-Video, Independent Multimodal Controls, Dependent Multimodal Controls
- ⭐⭐(CVPR 2022) High-Resolution Image Synthesis with Latent Diffusion Models, Robin Rombach et al. [Paper] [Code] [Stable Diffusion Code]
  - 📚Text-to-Image, Conditional Latent Diffusion, Super-Resolution, Inpainting
- ⭐⭐(arXiv preprint 2022) Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, Peng Wang et al. [Paper] [Code] [Hugging Face]
  - 📚Text-to-Image Generation, Image Captioning, Text Summarization, Self-Supervised Image Classification, [SOTA] Referring Expression Comprehension, Visual Entailment, Visual Question Answering
- (NeurIPS 2021) M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers, Zhu Zhang et al. [Paper]
  - 📚Text-to-Image, Sketch-to-Image, Style Transfer, Image Inpainting, Multi-Modal Control to Image
- (arXiv preprint 2021) ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation, Han Zhang et al. [Paper]
  - A pre-trained 10-billion parameter model: ERNIE-ViLG.
  - A large-scale dataset of 145 million high-quality Chinese image-text pairs.
  - 📚Text-to-Image, Image Captioning, Generative Visual Question Answering
- (arXiv preprint 2021) Multimodal Conditional Image Synthesis with Product-of-Experts GANs, Xun Huang et al. [Paper] [Project]
  - 📚Text-to-Image, Segmentation-to-Image, Text+Segmentation/Sketch/Image → Image, Sketch+Segmentation/Image → Image, Segmentation+Image → Image
- (arXiv preprint 2021) L-Verse: Bidirectional Generation Between Image and Text, Taehoon Kim et al. [Paper] [Code]
  - 📚Text-To-Image, Image-To-Text, Image Reconstruction
- (arXiv preprint 2021) [💬Semantic Diffusion Guidance] More Control for Free! Image Synthesis with Semantic Diffusion Guidance, Xihui Liu et al. [Paper] [Project]
  - 📚Text-To-Image, Image-To-Image, Text+Image → Image
Text+Image/Video → Image/Video «🎯Back To Top»
- (arXiv preprint 2022) Null-text Inversion for Editing Real Images using Guided Diffusion Models, Ron Mokady et al. [Paper] [Project]
- (arXiv preprint 2022) InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al. [Paper] [Project]
- (ECCV 2022) [💬Style Transfer] Language-Driven Artistic Style Transfer, Tsu-Jui Fu et al. [Paper] [Code]
- (arXiv preprint 2022) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, Wanfeng Zheng et al. [Paper]
- (arXiv preprint 2022) DiffEdit: Diffusion-based semantic image editing with mask guidance, Guillaume Couairon et al. [Paper]
- (NeurIPS 2022) One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations, Yiming Zhu et al. [Paper] [Code]
- (BMVC 2022) LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models, Paramanand Chandramouli et al. [Paper]
- (ACMMM 2022) [💬Iterative Language-based Image Manipulation] LS-GAN: Iterative Language-based Image Manipulation via Long and Short Term Consistency Reasoning, Gaoxiang Cong et al. [Paper]
- (ACMMM 2022) [💬Digital Art Synthesis] Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion, Huang Nisha et al. [Paper] [Code]
- (SIGGRAPH Asia 2022) [💬HDR Panorama Generation] Text2Light: Zero-Shot Text-Driven HDR Panorama Generation, Zhaoxi Chen et al. [Paper] [Project] [Code]
- (arXiv preprint 2022) LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data, Jihye Park et al. [Paper] [Project] [Code]
- (ACMMM PIES-ME 2022) [💬3D Semantic Style Transfer] Language-guided Semantic Style Transfer of 3D Indoor Scenes, Bu Jin et al. [Paper] [Code]
- (arXiv preprint 2022) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Nataniel Ruiz et al. [Paper] [Project]
- (arXiv preprint 2022) [💬Face Animation] Language-Guided Face Animation by Recurrent StyleGAN-based Generator, Tiankai Hang et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Fashion Design] ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design, Xujie Zhang et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Image Colorization] TIC: Text-Guided Image Colorization, Subhankar Ghosh et al. [Paper]
- (ECCV 2022) [💬Pose Synthesis] TIPS: Text-Induced Pose Synthesis, Prasun Roy et al. [Paper] [Code] [Project]
- (ACMMM 2022) [💬Person Re-identification] Learning Granularity-Unified Representations for Text-to-Image Person Re-identification, Zhiyin Shao et al. [Paper] [Code]
- (ACMMM 2022) Towards Counterfactual Image Manipulation via CLIP, Yingchen Yu et al. [Paper] [Code]
- (ACMMM 2022) [💬Monocular Depth Estimation] Can Language Understand Depth?, Wangbo Zhao et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Image Style Transfer] Language-Driven Image Style Transfer, Tsu-Jui Fu et al. [Paper]
- (CVPR 2022) [💬Image Segmentation] Image Segmentation Using Text and Image Prompts, Timo Lüddecke et al. [Paper] [Code]
- (CVPR 2022) [💬Video Segmentation] Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation, Wangbo Zhao et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Image Matting] Referring Image Matting, Jizhizi Li et al. [Paper] [Dataset]
- (arXiv preprint 2022) [💬Stylizing Video Objects] Text-Driven Stylization of Video Objects, Sebastian Loeschcke et al. [Paper] [Project]
- (arXiv preprint 2022) DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection, Yunhao Ge et al. [Paper]
- (arXiv preprint 2022) [💬Animating Human Meshes] CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, Kim Youwang et al. [Paper] [Code]
- (arXiv preprint 2022) Blended Latent Diffusion, Omri Avrahami et al. [Paper] [Code] [Project]
- (arXiv preprint 2022) DE-Net: Dynamic Text-guided Image Editing Adversarial Networks, Ming Tao et al. [Paper] [Code]
- (IEEE Transactions on Neural Networks and Learning Systems 2022) [💬Pose-Guided Person Generation] Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation, Deyin Liu et al. [Paper]
- (SIGGRAPH 2022) [💬3D Avatar Generation] AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, Fangzhou Hong et al. [Paper] [Code] [Project]
- ⭐⭐(arXiv preprint 2022) [💬Image & Video Editing] Text2LIVE: Text-Driven Layered Image and Video Editing, Omer Bar-Tal et al. [Paper] [Project]
- (Machine Vision and Applications 2022) Paired-D++ GAN for image manipulation with text, Duc Minh Vo et al. [Paper]
- (CVPR 2022) [💬Hairstyle Transfer] HairCLIP: Design Your Hair by Text and Reference Image, Tianyi Wei et al. [Paper] [Code]
- (CVPR 2022) DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation, Gwanghyun Kim et al. [Paper]
- (CVPR 2022) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, Jianan Wang et al. [Paper] [Project]
- (CVPR 2022) Blended Diffusion for Text-driven Editing of Natural Images, Omri Avrahami et al. [Paper] [Code] [Project]
- (CVPR 2022) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, Zipeng Xu et al. [Paper] [Code]
- (CVPR 2022) [💬Style Transfer] CLIPstyler: Image Style Transfer with a Single Text Condition, Gihyun Kwon et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Multi-person Image Generation] Pose Guided Multi-person Image Generation From Text, Soon Yau Cheong et al. [Paper]
- (arXiv preprint 2022) [💬Image Style Transfer] StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, Peter Schaldenbrand et al. [Paper] [Dataset] [Code] [Demo]
- (arXiv preprint 2022) [💬Image Style Transfer] Name Your Style: An Arbitrary Artist-aware Image Style Transfer, Zhi-Song Liu et al. [Paper]
- (arXiv preprint 2022) [💬3D Avatar Generation] Text and Image Guided 3D Avatar Generation and Manipulation, Zehranaz Canfes et al. [Paper] [Project]
- (arXiv preprint 2022) [💬Image Inpainting] NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN, Minheng Ni et al. [Paper]
- ⭐(arXiv preprint 2021) [💬Text+Image → Video] Make It Move: Controllable Image-to-Video Generation with Text Descriptions, Yaosi Hu et al. [Paper]
- (arXiv preprint 2021) [💬NeRF] CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, Can Wang et al. [Paper] [Code] [Project]
- (arXiv preprint 2021) [💬NeRF] Zero-Shot Text-Guided Object Generation with Dream Fields, Ajay Jain et al. [Paper] [Project]
- (NeurIPS 2021) Instance-Conditioned GAN, Arantxa Casanova et al. [Paper] [Code]
- (ICCV 2021) Language-Guided Global Image Editing via Cross-Modal Cyclic Mechanism, Wentao Jiang et al. [Paper]
- (ICCV 2021) Talk-to-Edit: Fine-Grained Facial Editing via Dialog, Yuming Jiang et al. [Paper] [Project] [Code]
- (ICCVW 2021) CIGLI: Conditional Image Generation from Language & Image, Xiaopeng Lu et al. [Paper] [Code]
- (ICCV 2021) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, Or Patashnik et al. [Paper] [Code]
- (arXiv preprint 2021) Paint by Word, David Bau et al. [Paper]
- ⭐(arXiv preprint 2021) Zero-Shot Text-to-Image Generation, Aditya Ramesh et al. [Paper] [Code] [Blog] [Model Card] [Colab]
- (NeurIPS 2020) Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation, Bowen Li et al. [Paper]
- (CVPR 2020) ManiGAN: Text-Guided Image Manipulation, Bowen Li et al. [Paper] [Code]
- (ACMMM 2020) Text-Guided Neural Image Inpainting, Lisai Zhang et al. [Paper] [Code]
- (ACMMM 2020) Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach, Yahui Liu et al. [Paper]
- (NeurIPS 2018) Text-adaptive generative adversarial networks: Manipulating images with natural language, Seonghyeon Nam et al. [Paper] [Code]
Audio+Text+Image/Video → Image/Video «🎯Back To Top»
- (arXiv preprint 2022) Robust Sound-Guided Image Manipulation, Seung Hyun Lee et al. [Paper]
Layout → Image «🎯Back To Top»
- (CVPR 2022) Modeling Image Composition for Complex Scene Generation, Zuopeng Yang et al. [Paper] [Code]
- (CVPR 2022) Interactive Image Synthesis with Panoptic Layout Generation, Bo Wang et al. [Paper]
- (CVPR 2021 AI for Content Creation Workshop) High-Resolution Complex Scene Synthesis with Transformers, Manuel Jahn et al. [Paper]
- (CVPR 2021) Context-Aware Layout to Image Generation with Enhanced Object Appearance, Sen He et al. [Paper] [Code]
Label-set → Semantic maps «🎯Back To Top»
- (ECCV 2020) Controllable image synthesis via SegVAE, Yen-Chi Cheng et al. [Paper] [Code]
Speech → Image «🎯Back To Top»
- (IEEE/ACM Transactions on Audio, Speech and Language Processing 2021) Generating Images From Spoken Descriptions, Xinsheng Wang et al. [Paper] [Code] [Project]
- (INTERSPEECH 2020) [Extended Version👆] S2IGAN: Speech-to-Image Generation via Adversarial Learning, Xinsheng Wang et al. [Paper]
- (IEEE Journal of Selected Topics in Signal Processing 2020) Direct Speech-to-Image Translation, Jiguo Li et al. [Paper] [Code] [Project]
Text → Visual Retrieval «🎯Back To Top»
- (ACMMM 2022) CAIBC: Capturing All-round Information Beyond Color for Text-based Person Retrieval, Zijie Wang et al. [Paper]
- (AAAI 2022) Cross-Modal Coherence for Text-to-Image Retrieval, Malihe Alikhani et al. [Paper]
- (ECCV RWS 2022) [💬Person Retrieval] See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval, Xiujun Shu et al. [Paper] [Code]
- (ECCV 2022) [💬Text+Sketch→Visual Retrieval] A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch, Patsorn Sangkloy et al. [Paper] [Project]
- (Neurocomputing 2022) TIPCB: A simple but effective part-based convolutional baseline for text-based person search, Yuhao Chen et al. [Paper] [Code]
- (arXiv preprint 2021) [💬Dataset] FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions, David Amat Olóndriz et al. [Paper] [Code]
- (CVPRW 2021) TIED: A Cycle Consistent Encoder-Decoder Model for Text-to-Image Retrieval, Clint Sebastian et al. [Paper]
- (CVPR 2021) T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, Xiaohan Wang et al. [Paper]
- (CVPR 2021) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, Antoine Miech et al. [Paper]
- (IEEE Access 2019) Query is GAN: Scene Retrieval With Attentional Text-to-Image Generative Adversarial Network, Rintaro Yanagi et al. [Paper]
Text → Motion/Shape/Mesh/Object... «🎯Back To Top»
- (arXiv preprint 2022) [💬Human Motion Generation] Human Motion Diffusion Model, Guy Tevet et al. [Paper] [Project] [Code]
- (arXiv preprint 2022) [💬Human Motion Generation] MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model, Mingyuan Zhang et al. [Paper] [Project]
- (arXiv preprint 2022) [💬3D] DreamFusion: Text-to-3D using 2D Diffusion, Ben Poole et al. [Paper] [Project] [Short Read]
- (arXiv preprint 2022) [💬3D Shape] ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation, Zhengzhe Liu et al. [Paper]
- (ECCV 2022) [💬Virtual Humans] Compositional Human-Scene Interaction Synthesis with Semantic Control, Kaifeng Zhao et al. [Paper] [Project] [Code]
- (CVPR 2022) [💬3D Shape] Towards Implicit Text-Guided 3D Shape Generation, Zhengzhe Liu et al. [Paper] [Code]
- (CVPR 2022) [💬Object] Zero-Shot Text-Guided Object Generation with Dream Fields, Ajay Jain et al. [Paper] [Project] [Code]
- (CVPR 2022) [💬Mesh] Text2Mesh: Text-Driven Neural Stylization for Meshes, Oscar Michel et al. [Paper] [Project] [Code]
- (CVPR 2022) [💬Motion] Generating Diverse and Natural 3D Human Motions from Text, Chuan Guo et al. [Paper] [Project] [Code]
- (CVPR 2022) [💬Shape] CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, Aditya Sanghi et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Motion] TEMOS: Generating diverse human motions from textual descriptions, Mathis Petrovich et al. [Paper] [Project] [Code]
Text → Video «🎯Back To Top»
- (arXiv preprint 2022) Imagen Video: High Definition Video Generation with Diffusion Models, Jonathan Ho et al. [Paper] [Project]
- (arXiv preprint 2022) Text-driven Video Prediction, Xue Song et al. [Paper]
- (arXiv preprint 2022) Make-A-Video: Text-to-Video Generation without Text-Video Data, Uriel Singer et al. [Paper] [Project] [Short read] [Code]
- (ECCV 2022) [💬Story Continuation] StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation, Adyasha Maharana et al. [Paper] [Code]
- (arXiv preprint 2022) [💬Story → Video] Word-Level Fine-Grained Story Visualization, Bowen Li et al. [Paper] [Code]
- ⭐(arXiv preprint 2022) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, Wenyi Hong et al. [Paper] [Code]
- (CVPR 2022) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, Yogesh Balaji et al. [Paper] [Code] [Project]
- (arXiv preprint 2022) Video Diffusion Models, Jonathan Ho et al. [Paper] [Project]
- (arXiv preprint 2021) [❌Generation Task] Transcript to Video: Efficient Clip Sequencing from Texts, Ligong Han et al. [Paper] [Project]
- (arXiv preprint 2021) GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions, Chenfei Wu et al. [Paper]
- (arXiv preprint 2021) Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary, Sibo Zhang et al. [Paper]
- (IEEE Access 2020) TiVGAN: Text to Image to Video Generation With Step-by-Step Evolutionary Generator, Doyeon Kim et al. [Paper]
- (IJCAI 2019) Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis, Yogesh Balaji et al. [Paper] [Code]
- (IJCAI 2019) IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation, Kangle Deng et al. [Paper]
- (CVPR 2019) [💬Story → Video] StoryGAN: A Sequential Conditional GAN for Story Visualization, Yitong Li et al. [Paper] [Code]
- (AAAI 2018) Video Generation From Text, Yitong Li et al. [Paper]
- (ACMMM 2017) To create what you tell: Generating videos from captions, Yingwei Pan et al. [Paper]

Contact Me

Yutong ZHOU in Interaction Laboratory, Ritsumeikan University. ლ(╹◡╹ლ)
If you have any questions, please feel free to contact Yutong ZHOU (E-mail: [email protected]).
