
yeungchenwa / fontdiffuser


[AAAI2024] FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning

Home Page: https://yeungchenwa.github.io/fontdiffuser-homepage/

Python 98.93% Shell 1.07%
deep-learning diffusers diffusion font-generation image-generation

fontdiffuser's Introduction

FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning

FontDiffuser_LOGO

arXiv preprint · Gradio demo · Homepage · Code

🔥 Model Zoo · 🛠️ Installation · 🏋️ Training · 📺 Sampling · 📱 Run WebUI

🌟 Highlights

Vis_1 Vis_2

  • We propose FontDiffuser, which can generate unseen characters and styles, and which extends to cross-lingual generation, such as Chinese to Korean.
  • FontDiffuser excels at generating complex characters and handling large style variations, and it achieves state-of-the-art performance.
  • The results generated by FontDiffuser can be fed directly into InstructPix2Pix for decoration, as shown in the figure above.
  • We have released the 💻Hugging Face Demo online! Welcome to try it out!

📅 News

  • 2024.01.27: The phase 2 training code is released.
  • 2023.12.20: Our repository is public! 👏🤗
  • 2023.12.19: 🔥🎉 The 💻Hugging Face Demo is public! Welcome to try it out!
  • 2023.12.16: The Gradio app demo is released.
  • 2023.12.10: The source code with phase 1 training and sampling is released.
  • 2023.12.09: 🎉🎉 Our paper is accepted by AAAI 2024.
  • Previously: Our Recommendations-of-Diffusion-for-Text-Image repo is public; it contains a collection of recent papers on diffusion models for text-image generation tasks. Welcome to check it out!

🔥 Model Zoo

Model         Checkpoint                     Status
FontDiffuser  GoogleDrive / BaiduYun:gexg    Released
SCR           GoogleDrive / BaiduYun:gexg    Released

🚧 TODO List

  • Add phase 1 training and sampling scripts.
  • Add WebUI demo.
  • Push demo to Hugging Face.
  • Add phase 2 training script and checkpoint.
  • Add the pre-training of the SCR module.
  • Combine with InstructPix2Pix.

🛠️ Installation

Prerequisites (Recommended)

  • Linux
  • Python 3.9
  • PyTorch 1.13.1
  • CUDA 11.7

Environment Setup

Clone this repo:

git clone https://github.com/yeungchenwa/FontDiffuser.git

Step 0: Download and install Miniconda from the official website.

Step 1: Create a conda environment and activate it.

conda create -n fontdiffuser python=3.9 -y
conda activate fontdiffuser

Step 2: Install the corresponding version of PyTorch following the instructions here.

# Suggested
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

Step 3: Install the required packages.

pip install -r requirements.txt
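
A quick sanity check that the environment is set up correctly (this simply prints the PyTorch version and whether CUDA is visible):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Expected with the suggested install: 1.13.1+cu117 True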

🏋️ Training

Data Construction

The training data should be organized as the following file tree (data examples are provided in the directory data_examples/train/):

├──data_examples
│   └── train
│       ├── ContentImage
│       │   ├── char0.png
│       │   ├── char1.png
│       │   ├── char2.png
│       │   └── ...
│       └── TargetImage
│           ├── style0
│           │     ├──style0+char0.png
│           │     ├──style0+char1.png
│           │     └── ...
│           ├── style1
│           │     ├──style1+char0.png
│           │     ├──style1+char1.png
│           │     └── ...
│           ├── style2
│           │     ├──style2+char0.png
│           │     ├──style2+char1.png
│           │     └── ...
│           └── ...
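
A minimal sketch of assembling this layout from your own rendered glyph images (the source paths and the style0/char0 names are placeholders; note the style+char naming of the target files):

mkdir -p data_examples/train/ContentImage
mkdir -p data_examples/train/TargetImage/style0
# content glyphs rendered in the source font
cp renders/content/char0.png data_examples/train/ContentImage/char0.png
# target glyphs rendered in each style, named <style>+<char>.png
cp renders/style0/char0.png data_examples/train/TargetImage/style0/style0+char0.png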

Training Configuration

Before running the training scripts (any of the following three modes), set the training configuration, including distributed training, through:

accelerate config
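
accelerate config walks through an interactive questionnaire. If you just want a default single-machine setup, recent versions of accelerate can also write the configuration non-interactively:

accelerate config default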

Training - Pretraining of SCR

Coming Soon ...

Training - Phase 1

sh train_phase_1.sh

Key arguments (a sketch of the underlying launch command follows this list):
  • data_root: The data root, e.g. ./data_examples.
  • output_dir: The directory where training logs and checkpoints are saved.
  • resolution: The resolution of the UNet in our diffusion model.
  • style_image_size: The resolution of the style image; it can differ from resolution.
  • content_image_size: The resolution of the content image; it should be the same as resolution.
  • channel_attn: Whether to use channel attention in the MCA block.
  • train_batch_size: The batch size used during training.
  • max_train_steps: The maximum number of training steps.
  • learning_rate: The learning rate used during training.
  • ckpt_interval: The interval (in steps) between checkpoint saves.
  • drop_prob: The drop probability used for classifier-free guidance training.
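
For reference, train_phase_1.sh essentially wraps an accelerate launch of train.py with the arguments above. A sketch assembled from those flag names (the values shown are illustrative, not recommendations):

accelerate launch train.py \
  --experience_name=FontDiffuser_training_phase_1 \
  --data_root=data_examples \
  --output_dir=outputs/FontDiffuser \
  --resolution=96 \
  --style_image_size=96 \
  --content_image_size=96 \
  --channel_attn=True \
  --train_batch_size=16 \
  --max_train_steps=440000 \
  --learning_rate=1e-4 \
  --ckpt_interval=40000 \
  --drop_prob=0.1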

Training - Phase 2

After the phase 1 training, put the trained checkpoint files (unet.pth, content_encoder.pth, and style_encoder.pth) into the directory phase_1_ckpt. These parameters will be resumed during phase 2.

sh train_phase_2.sh

Key arguments (an illustrative launch follows this list):
  • phase_2: Flag indicating phase 2 training.
  • phase_1_ckpt_dir: The directory holding the model checkpoints saved by phase 1 training.
  • scr_ckpt_path: The checkpoint path of the pre-trained SCR module. You can download it from the 🔥Model Zoo above.
  • sc_coefficient: The coefficient of the style contrastive loss used for supervision.
  • num_neg: The number of negative samples; defaults to 16.
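
The expected checkpoint layout before launching phase 2, followed by an illustrative set of the phase-2-specific flags (the flag names are from the list above; the exact syntax may differ from the shipped script, and the sc_coefficient value is a placeholder):

# checkpoints copied from phase 1 training
# phase_1_ckpt/
#   ├── unet.pth
#   ├── content_encoder.pth
#   └── style_encoder.pth

accelerate launch train.py \
  --phase_2 \
  --phase_1_ckpt_dir=phase_1_ckpt \
  --scr_ckpt_path=scr_210000.pth \
  --sc_coefficient=0.01 \
  --num_neg=16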

📺 Sampling

Step 1 => Prepare the checkpoint

Option (1): Download the checkpoint from GoogleDrive / BaiduYun:gexg, then put the ckpt folder in the root directory; it should include the files unet.pth, content_encoder.pth, and style_encoder.pth.
Option (2): Put your own re-trained checkpoint folder ckpt in the root directory, including the same files unet.pth, content_encoder.pth, and style_encoder.pth.

Step 2 => Run the script

(1) Sampling an image from a content image and a reference image.

sh script/sample_content_image.sh

Key arguments (an illustrative invocation follows this list):
  • ckpt_dir: The directory containing the model checkpoints.
  • content_image_path: The content/source image path.
  • style_image_path: The style/reference image path.
  • save_image: Set True to save the outputs as images.
  • save_image_dir: The image saving directory; the saved files include an out_single.png and an out_with_cs.png.
  • device: The sampling device; GPU acceleration is recommended.
  • guidance_scale: The classifier-free guidance scale used at sampling time.
  • num_inference_steps: The number of inference steps for DPM-Solver++.
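
An illustrative invocation built from the flags above (the script name sample.py, the flag syntax, and the example paths are assumptions; script/sample_content_image.sh is the authoritative entry point):

python sample.py \
  --ckpt_dir=ckpt \
  --content_image_path=content.png \
  --style_image_path=style.png \
  --save_image \
  --save_image_dir=outputs \
  --device=cuda:0 \
  --guidance_scale=7.5 \
  --num_inference_steps=20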

(2) Sampling an image from a content character.
Note: you may need a TTF file that covers a large set of Chinese characters; you can download one from BaiduYun:wrth.

sh script/sample_content_character.sh

Key arguments (an illustrative invocation follows this list):
  • character_input: If set to True, a character string is used as the content/source input.
  • content_character: The content/source character string.
  • The other parameters are the same as in option (1) above.
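
The same sketch switched to character input (again, the script name and flag syntax are assumptions; the character is just an example, and the ttf_path flag is hypothetical, named after the TTF requirement noted above):

python sample.py \
  --ckpt_dir=ckpt \
  --character_input \
  --content_character="龍" \
  --ttf_path=fonts/chinese.ttf \
  --style_image_path=style.png \
  --save_image \
  --save_image_dir=outputs \
  --device=cuda:0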

📱 Run WebUI

(1) Sampling by FontDiffuser

gradio gradio_app.py
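
The gradio CLI runs the app with hot reload. If it is unavailable in your environment, launching the app as a plain Python script should also work (an assumption; the app may expect the checkpoint files in the default locations described in the Sampling section):

python gradio_app.py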


(2) Sampling by FontDiffuser and Rendering by InstructPix2Pix

Coming Soon ...

🌄 Gallery

Characters at the hard level of complexity

vis_hard

Characters at the medium level of complexity

vis_medium

Characters at the easy level of complexity

vis_easy

Cross-Lingual Generation (Chinese to Korean)

vis_korean

💙 Acknowledgement

Copyright

Citation

@inproceedings{yang2024fontdiffuser,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2024}
}

⭐ Star Rising

Star Rising

fontdiffuser's People

Contributors

yeungchenwa


fontdiffuser's Issues

In src/modules/style_encoder.py, the function style_encoder_textedit_addskip_arch raises an error, so phase 1 training cannot start

When running D:\FontDiffuser-main\train_phase_1.sh with Git Bash 2.46.0.windows.1, the following error is reported:

l257737602 MINGW64 /d/FontDiffuser-main
$ sh train_phase_1.sh
D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\accelerate\accelerator.py:406: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
pygame 2.6.0 (SDL 2.28.4, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html
Load the down block DownBlock2D
Load the down block MCADownBlock2D
The style_attention cross attention dim in Down Block 1 layer is 1024
The style_attention cross attention dim in Down Block 2 layer is 1024
Load the down block MCADownBlock2D
The style_attention cross attention dim in Down Block 1 layer is 1024
The style_attention cross attention dim in Down Block 2 layer is 1024
Load the down block DownBlock2D
Load the up block UpBlock2D
Load the up block StyleRSIUpBlock2D
Load the up block StyleRSIUpBlock2D
Load the up block UpBlock2D
Traceback (most recent call last):
File "D:\FontDiffuser-main\train.py", line 272, in <module>
main()
File "D:\FontDiffuser-main\train.py", line 74, in main
style_encoder = build_style_encoder(args=args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\FontDiffuser-main\src\build.py", line 41, in build_style_encoder
style_image_encoder = StyleEncoder(
^^^^^^^^^^^^^
File "D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\diffusers\configuration_utils.py", line 653, in inner_init
init(self, *args, **init_kwargs)
File "D:\FontDiffuser-main\src\modules\style_encoder.py", line 362, in __init__
self.arch = style_encoder_textedit_addskip_arch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 64
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\Users\limin\AppData\Local\Programs\Python\Python312\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "D:\Users\limin\AppData\Local\Programs\Python\Python312\Lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Users\limin\AppData\Local\Programs\Python\Python312\python.exe', 'train.py', '--seed=123', '--experience_name=FontDiffuser_training_phase_1', '--data_root=data_examples', '--output_dir=outputs/FontDiffuser', '--report_to=tensorboard', '--resolution=64', '--style_image_size=64', '--content_image_size=64', '--content_encoder_downsample_size=3', '--channel_attn=True', '--content_start_channel=64', '--style_start_channel=64', '--train_batch_size=16', '--perceptual_coefficient=0.01', '--offset_coefficient=0.5', '--max_train_steps=440000', '--ckpt_interval=40000', '--gradient_accumulation_steps=1', '--log_interval=50', '--learning_rate=1e-4', '--lr_scheduler=linear', '--lr_warmup_steps=10000', '--drop_prob=0.1', '--mixed_precision=no']' returned non-zero exit status 1.

l257737602 MINGW64 /d/FontDiffuser-main

The FontDiffuser-main folder at the time:
Link: https://pan.baidu.com/s/1z2A02skrnFJL-59ZyUKWwA?pwd=py98
Extraction code: py98
Sources of the training material:
寒蝉半圆体 (ChillRound): https://github.com/Warren2060/ChillRound
天珩全字库: http://cheonhyeong.com/Simplified/download.html

About InstructPix2Pix

Hello @yeungchenwa ,

Thank you for your excellent work on the FontDiffuser project! I am very curious about how you combined FontDiffuser and InstructPix2Pix. I have tried different text prompts and parameters but haven’t been able to achieve the desired results. Could you please provide an example or some guidance on how to generate these pictures using InstructPix2Pix?

with_instructpix2pix

Training on English letters

Hi @yeungchenwa, thanks for your excellent work, first of all!
I want to ask some questions about training on English letters.

If I add an English dataset, I need to run both the phase 1 and phase 2 training, right?
How many fonts do you think would be sufficient?
Do I need to train from scratch, or would fine-tuning your model be enough?

Hello, I am very interested in your research after reading your paper. But I have a confusion about the second stage of training.

May I ask how to load the model from the first stage during the second stage of training? Is it the total_model.pth from the first stage? It does not seem to load. The downloaded scr_210000.pth does load, but what is its relationship to the first stage of my own training? Please help answer, thank you very much!!

Where can I find the appendix?

Hello, thank you for the excellent paper. The paper mentions that the classification details for the three complexity levels are given in the appendix, but there is no appendix at the end of the paper. Could you share a link to the appendix? Thank you very much!

InfoNCE

Hello, I would like to ask what the concrete code of the phase 2 contrastive loss (InfoNCE) looks like. Thanks!

About how to generate an SCR_210000.pth file

Hello, how was the scr_210000.pth used during the second training phase saved? How can I extract the corresponding SCR weights from the total_model.pth file generated by my own training? Looking forward to your answer, thank you very much!!!!

Training duration

Hello! Thank you very much for your excellent work!
I would like to know how long the training took when you trained on the full dataset with a 3090.

Request for Korean font pre-trained checkpoints

I am very interested in your repository. The checkpoint file you have provided seems to be trained primarily on Chinese characters, as certain shapes such as the Korean character 'ㅇ' do not render properly.

I was wondering if you happened to have a checkpoint file that has been trained on Korean fonts? Having a model pre-trained on Korean data would be incredibly valuable and helpful for my current project. If such a file is available, I would be most grateful if you could share it with me. If not, I completely understand, and I appreciate you taking the time to consider my request.

Thank you in advance for your help. I look forward to hearing from you.

Question about the resolution.

May I ask whether, in the comparative experiments, all the baselines were trained on the same training set as your method? The paper mentions an image size of 96. Regarding DG-Font, whose resolution is limited to 80, how did you address this? Did you directly resize from 80 to 96 for comparison, or did you use another method? Thank you.

BTW, setting GPU memory aside, can the resolution be set arbitrarily for your method?

Dataset-related questions

Hello, this project is really impressive, and I would like to train it myself. Was there any special consideration in selecting the 424 fonts in the dataset, or were they simply downloaded at random from public font libraries? If they were specially selected, what was the selection strategy?

A few questions about training

Hello authors, thank you very much for your work!
I have two questions about training.
1. Suppose I have handwriting in two main styles, cursive and regular script. Is it better to mix them together for training, or to separate them by style and generate each separately?
2. Is there a recommended range for the loss value during training?

Performance on brush calligraphy

Thanks for your work! I want to train the model on a calligraphy dataset, so I would like to ask how the model performs on brush-written characters.

SCR pre-training script release timeline?

First, thanks for open-sourcing your amazing work. Two weeks ago, I commented under an existing issue requesting the script, but you might have missed my comment. It has been a few months since you announced that the SCR pre-training script would be released, so I just want to check in for any potential updates.

Does it support few-shot?

Does the model's inference support few-shot input, or is there a way for the model to take features from multiple reference characters and then run inference? Nice work, by the way.

Questions about the dataset

Hello, your work is great, but as a beginner I do not know this field very deeply, especially which open-source datasets are available. How, and from what sources, did you construct your dataset? Could you provide a link to an open-source dataset for reference? Thank you very much!

Training error

During training, data parallelism is enabled by default, so the model is saved as a DistributedDataParallel; the model then has no unet attribute, which causes an error. How can I extract the needed model parameters from the parallel model?
