This project forked from yumingj/text2human

Code for Text2Human (SIGGRAPH 2022). Paper: Text2Human: Text-Driven Controllable Human Image Generation

Home Page: https://yumingj.github.io/projects/Text2Human.html

License: MIT License

Text2Human - Official PyTorch Implementation

This repository provides the official PyTorch implementation for the following paper:

Text2Human: Text-Driven Controllable Human Image Generation
Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy and Ziwei Liu
In ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022.

From MMLab@NTU, affiliated with S-Lab, Nanyang Technological University, and SenseTime Research.

The lady wears a short-sleeve T-shirt with pure color pattern, and a short and denim skirt.
The man wears a long and floral shirt, and long pants with the pure color pattern.
A lady is wearing a sleeveless pure-color shirt and long jeans.
The man wears a short-sleeve T-shirt with the pure color pattern and a short pants with the pure color pattern.

[Project Page] | [Paper] | [Dataset] | [Demo Video] | [Gradio Web Demo]

Updates

Installation

Clone this repo:

git clone https://github.com/yumingj/Text2Human.git
cd Text2Human

Dependencies:

All dependencies needed to define the environment are listed in environment/text2human_env.yaml. We recommend using Anaconda to manage the Python environment:

conda env create -f ./environment/text2human_env.yaml
conda activate text2human
pip install mmcv-full==1.2.1 -f https://download.openmmlab.com/mmcv/dist/cu101/torch1.7.0/index.html
pip install mmsegmentation==0.9.0
conda install -c huggingface tokenizers=0.9.4
conda install -c huggingface transformers=4.0.0
conda install -c conda-forge sentence-transformers=2.0.0
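
After running the commands above, it can help to confirm that the pinned versions actually ended up in the active environment. The following is a minimal sketch, not part of the repository: the helper name check_pins is hypothetical, it assumes Python 3.8 or newer (for importlib.metadata), and it assumes each package registers metadata under the same name used in the install commands.

```python
from importlib import metadata

# Versions copied from the install commands above.
PINNED = {
    "mmcv-full": "1.2.1",
    "mmsegmentation": "0.9.0",
    "tokenizers": "0.9.4",
    "transformers": "4.0.0",
    "sentence-transformers": "2.0.0",
}


def check_pins(pinned=PINNED, lookup=metadata.version):
    """Return {package: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins().items():
        print(f"{name}: wanted {wanted}, found {found}")
```

An empty result means every pinned package was found at the expected version.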

If it doesn't work, you may need to install the following packages on your own:

(1) Dataset Preparation

In this work, we contribute a large-scale high-quality dataset with rich multi-modal annotations named DeepFashion-MultiModal Dataset. Here we pre-processed the raw annotations of the original dataset for the task of text-driven controllable human image generation. The pre-processing pipeline consists of:

  • align the human body in the center of the images according to the human pose
  • fuse the clothing color and clothing fabric annotations into one texture annotation
  • do some annotation cleaning and image filtering
  • split the whole dataset into the training set and testing set

You can download our processed dataset from this Google Drive. If you want to access the raw annotations, please refer to the DeepFashion-MultiModal Dataset.

After downloading the dataset, unzip the file and put its contents under the datasets folder with the following structure:

./datasets
├── train_images
    ├── xxx.png
    ...
    ├── xxx.png
    └── xxx.png
├── test_images
    % the same structure as in train_images
├── densepose
    % the same structure as in train_images
├── segm
    % the same structure as in train_images
├── shape_ann
    ├── test_ann_file.txt
    ├── train_ann_file.txt
    └── val_ann_file.txt
└── texture_ann
    ├── test
        ├── lower_fused.txt
        ├── outer_fused.txt
        └── upper_fused.txt
    ├── train
        % the same files as in test
    └── val
        % the same files as in test
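
Before sampling or training, the layout above can be checked with a few lines of Python. This is a hedged sketch rather than a script shipped with the repository: missing_dataset_paths is a hypothetical helper, and it only tests that the listed folders and annotation files exist, not that their contents are valid.

```python
from pathlib import Path

# Names copied from the directory tree above.
EXPECTED_DIRS = ["train_images", "test_images", "densepose", "segm"]
SHAPE_FILES = ["train_ann_file.txt", "val_ann_file.txt", "test_ann_file.txt"]
TEXTURE_SPLITS = ["train", "val", "test"]
TEXTURE_FILES = ["lower_fused.txt", "outer_fused.txt", "upper_fused.txt"]


def missing_dataset_paths(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    expected = [Path(d) for d in EXPECTED_DIRS]
    expected += [Path("shape_ann") / f for f in SHAPE_FILES]
    expected += [
        Path("texture_ann") / split / f
        for split in TEXTURE_SPLITS
        for f in TEXTURE_FILES
    ]
    return [p.as_posix() for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths("./datasets")
    print("dataset layout looks complete" if not missing else "\n".join(missing))
```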

(2) Sampling

HuggingFace Demo

Full Web Demo: Hugging Face Spaces

Drawing-to-human: Hugging Face Spaces

Colab

Unofficial Demo implemented by @neverix.

Pretrained Models

Pretrained models can be downloaded from the model zoo. Unzip the file and put its contents under the pretrained_models folder with the following structure:

pretrained_models
├── index_pred_net.pth
├── parsing_gen.pth
├── parsing_token.pth
├── sampler.pth
├── vqvae_bottom.pth
└── vqvae_top.pth
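
A quick way to see which checkpoints are still missing is sketched below. The function missing_checkpoints is a hypothetical helper, not something the repository provides, and the default folder path is taken from the structure above.

```python
from pathlib import Path

# File names copied from the pretrained_models listing above.
CHECKPOINTS = [
    "index_pred_net.pth",
    "parsing_gen.pth",
    "parsing_token.pth",
    "sampler.pth",
    "vqvae_bottom.pth",
    "vqvae_top.pth",
]


def missing_checkpoints(folder="./pretrained_models"):
    """Return the checkpoint file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in CHECKPOINTS if not (folder / name).is_file()]


if __name__ == "__main__":
    print(missing_checkpoints() or "all checkpoints found")
```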

Model Zoo

Model          | Dataset                | Annotations
Standard Model | DeepFashion-MultiModal | Follow the dataset preparation in Step (1)
Extended Model | SHHQ                   | Replace the annotations with the following ones: densepose, segm, shape, texture

Remark: For fair research comparisons, it is suggested to use the standard model.

Generation from Parsing Maps

You can generate images from given parsing maps and pre-defined texture annotations:

python sample_from_parsing.py -opt ./configs/sample_from_parsing.yml

The results are saved in the folder ./results/sampling_from_parsing.

Generation from Poses

You can generate images from given human poses and pre-defined clothing shape and texture annotations:

python sample_from_pose.py -opt ./configs/sample_from_pose.yml

Remarks: The two scripts above generate images without any language interaction. If you want to generate images from text descriptions, use the notebook or our user interface.

User Interface

python ui_demo.py

The descriptions for shapes should use the following format:

<gender>, <sleeve length>, <length of lower clothing>, <outer clothing type>, <other accessories1>, ...

Note: The outer clothing type and accessories can be omitted.

Examples:
man, sleeveless T-shirt, long pants
woman, short-sleeve T-shirt, short jeans

The descriptions for textures should use the following format:

<upper clothing texture>, <lower clothing texture>, <outer clothing texture>

Note: Currently, we only support 5 types of textures, i.e., pure color, stripe/spline, plaid/lattice,
    floral, denim. Your inputs should be restricted to these textures.
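
The two formats above can be illustrated with a small validator. This is only a sketch under stated assumptions and is not how the user interface parses its input: the function names, the returned dictionary keys, and the alias table (for example, accepting "stripe" for "stripe/spline") are all hypothetical, and accepting fewer than three textures is an assumption as well.

```python
# Canonical texture names come from the note above; the aliases are an assumption.
TEXTURE_ALIASES = {
    "pure color": "pure color",
    "stripe": "stripe/spline",
    "spline": "stripe/spline",
    "stripe/spline": "stripe/spline",
    "plaid": "plaid/lattice",
    "lattice": "plaid/lattice",
    "plaid/lattice": "plaid/lattice",
    "floral": "floral",
    "denim": "denim",
}


def split_fields(description):
    """Split a comma-separated description into stripped, non-empty fields."""
    return [part.strip() for part in description.split(",") if part.strip()]


def parse_shape(description):
    """Map '<gender>, <sleeve length>, <lower length>, [outer], [accessories...]'."""
    fields = split_fields(description)
    if len(fields) < 3:
        raise ValueError("expected at least gender, sleeve length, lower clothing length")
    return {
        "gender": fields[0],
        "sleeve_length": fields[1],
        "lower_length": fields[2],
        "outer": fields[3] if len(fields) > 3 else None,  # may be omitted
        "accessories": fields[4:],                         # may be empty
    }


def parse_textures(description):
    """Map '<upper>, <lower>, <outer>' to canonical texture names."""
    fields = split_fields(description)
    if not 1 <= len(fields) <= 3:
        raise ValueError("expected one to three textures: upper, lower, outer")
    slots = ["upper", "lower", "outer"]
    result = {slot: None for slot in slots}
    for slot, field in zip(slots, fields):
        key = field.lower()
        if key not in TEXTURE_ALIASES:
            raise ValueError(f"unsupported texture: {field!r}")
        result[slot] = TEXTURE_ALIASES[key]
    return result
```

For instance, parse_shape("man, sleeveless T-shirt, long pants") fills the three required fields and leaves the outer clothing and accessories empty, while parse_textures raises a ValueError for any texture outside the five supported types.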

(3) Training Text2Human

Stage I: Pose to Parsing

Train the parsing generation network. If you want to skip the training of this network, you can download our pretrained model from here.

python train_parsing_gen.py -opt ./configs/parsing_gen.yml

Stage II: Parsing to Human

Step 1: Train the top level of the hierarchical VQVAE. We provide our pretrained model here. This model is trained by:

python train_vqvae.py -opt ./configs/vqvae_top.yml

Step 2: Train the bottom level of the hierarchical VQVAE. We provide our pretrained model here. This model is trained by:

python train_vqvae.py -opt ./configs/vqvae_bottom.yml

Step 3 & 4: Train the sampler with mixture-of-experts. To train the sampler, we first need to train a model to tokenize the parsing maps. You can access our pretrained parsing tokenization model here.

python train_parsing_token.py -opt ./configs/parsing_token.yml

With the parsing tokenization model, the sampler is trained by:

python train_sampler.py -opt ./configs/sampler.yml

Our pretrained sampler is provided here.

Step 5: Train the index prediction network. We provide our pretrained index prediction network here. It is trained by:

python train_index_prediction.py -opt ./configs/index_pred_net.yml

Remarks: The config files point to our released models wherever a pretrained model is required. If you train the models from scratch, replace these paths with the paths to your own checkpoints. The number of training epochs in each config is set to a large value, so you can pick the best epoch for each model. For reference, our pretrained parsing generation network was trained for 50 epochs, the top-level VQVAE for 135 epochs, the bottom-level VQVAE for 70 epochs, the parsing tokenization network for 20 epochs, the sampler for 95 epochs, and the index prediction network for 70 epochs.
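
The training commands in this section can be collected in one place to make their order explicit. The sketch below is a hypothetical convenience wrapper, not a script from the repository. It defaults to a dry run that only prints each command, because in practice you would pick the best epoch and update the checkpoint paths in the next config before starting the following step.

```python
import subprocess
import sys

# (script, config) pairs in the order they appear in this section.
TRAINING_STEPS = [
    ("train_parsing_gen.py", "./configs/parsing_gen.yml"),
    ("train_vqvae.py", "./configs/vqvae_top.yml"),
    ("train_vqvae.py", "./configs/vqvae_bottom.yml"),
    ("train_parsing_token.py", "./configs/parsing_token.yml"),
    ("train_sampler.py", "./configs/sampler.yml"),
    ("train_index_prediction.py", "./configs/index_pred_net.yml"),
]


def build_commands(python=sys.executable):
    """Return one argument list per training step."""
    return [[python, script, "-opt", config] for script, config in TRAINING_STEPS]


def run_all(dry_run=True):
    """Print every command; execute it too when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```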

(4) Results

Please visit our Project Page to view more results.
You can select the attributes to customize the desired human images.

DeepFashion-MultiModal Dataset

In this work, we also propose DeepFashion-MultiModal, a large-scale high-quality human dataset with rich multi-modal annotations. It has the following properties:

  1. It contains 44,096 high-resolution human images, including 12,701 full-body human images.
  2. For each full-body image, we manually annotate the human parsing labels of 24 classes.
  3. For each full-body image, we manually annotate the keypoints.
  4. We extract DensePose for each human image.
  5. Each image is manually annotated with attributes for both clothes shapes and textures.
  6. We provide a textual description for each image.

Please refer to this repo for more details about our proposed dataset.

Citation

If you find this work useful for your research, please consider citing our paper:

@article{jiang2022text2human,
  title={Text2Human: Text-Driven Controllable Human Image Generation},
  author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
  journal={ACM Transactions on Graphics (TOG)},
  volume={41},
  number={4},
  articleno={162},
  pages={1--11},
  year={2022},
  publisher={ACM New York, NY, USA},
  doi={10.1145/3528223.3530104},
}

Acknowledgments

Part of the code is borrowed from unleashing-transformers, taming-transformers and mmsegmentation.
