
scalecrafter's Introduction

Hi there 👋

Hello! I am Yingqing He. Nice to meet you!
👨‍💻‍ I am currently a PhD student at HKUST. My research focuses on text-to-video generation and multimodal generation.
📫 How to reach me: [email protected]
📣 Our lab is hiring engineering-oriented research assistants (RA). If you would like to apply, feel free to reach out with your CV!
🧁 Other projects:

  • [CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners Github
  • [ECCV 2024] Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Github
  • Awesome Diffusion Models in High-Resolution Synthesis Github

scalecrafter's People

Contributors

0armaan025, boweijia, chenxwh, eltociear, hkabig, mohitd404, ssyang1999, vinthony, yingqinghe


scalecrafter's Issues

Adding Code_of_Conduct In repo

Since this repo is open source, having a Code of Conduct file becomes important. We propose adding a comprehensive Code of Conduct to the repository to ensure
a safe, respectful, and inclusive environment for all contributors and users. This code will
serve as a guideline for behavior, promoting diversity, reducing conflicts,
and attracting a wider range of perspectives.

Issue type

  • [✅] Docs

@YingqingHe If you were planning this issue, kindly assign it to me! I would love to work on it. Thank you!

Question about settings for SDXL

(Congrats on the great work!)

I was having a look at the default example (2048x2048, i.e. 4x, generation) for SDXL, which uses configs/sdxl_2048x2048.yaml (= Table 10 in the paper) and these dilate settings:

down_blocks.3.resnets.0.conv1:2
down_blocks.3.resnets.0.conv2:2
down_blocks.3.resnets.1.conv1:2
down_blocks.3.resnets.1.conv2:2
up_blocks.0.resnets.0.conv1:2
up_blocks.0.resnets.0.conv2:2
up_blocks.0.resnets.1.conv1:2
up_blocks.0.resnets.1.conv2:2
up_blocks.0.resnets.2.conv1:2
up_blocks.0.resnets.2.conv2:2
up_blocks.0.upsamplers.0.conv:2
mid_block.resnets.0.conv1:2
mid_block.resnets.0.conv2:2
mid_block.resnets.1.conv1:2
mid_block.resnets.1.conv2:2

I feel like there is an issue with the down_blocks.3... keys (aka DB3 in the paper). Namely, it looks like an off-by-one error: there are no such down blocks in the SDXL UNet, so at runtime these keys are simply ignored and the corresponding convolutions are not replaced by the ReDilateConvProcessor.

Instead, the third down block corresponds to down_blocks.2..., so these should be:

down_blocks.2.resnets.0.conv1:2
down_blocks.2.resnets.0.conv2:2
down_blocks.2.resnets.1.conv1:2
down_blocks.2.resnets.1.conv2:2

(The same applies to the other SDXL configs.)
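
A quick way to verify this (a sketch, assuming the pipeline builds SDXL on diffusers' UNet2DConditionModel) is to instantiate the UNet from its config and check which of the configured keys actually resolve to modules:

# Sketch: list which dilation keys exist in the SDXL UNet. Assumes diffusers is installed
# and the Hub config is reachable; only the config is fetched, weights stay random.
from diffusers import UNet2DConditionModel

config = UNet2DConditionModel.load_config(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet = UNet2DConditionModel.from_config(config)
module_names = {name for name, _ in unet.named_modules()}

for key in [
    "down_blocks.3.resnets.0.conv1",  # key from the shipped config
    "down_blocks.2.resnets.0.conv1",  # proposed replacement
]:
    print(key, "->", "found" if key in module_names else "missing")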

Looking forward to your feedback!

SD1.5 inpainting

Wonderful work! I would like to know whether it also works with the inpainting pipeline. Thanks.

Video example?

Could you provide an example on how to use this method for video inference?

Evaluation script error: 'VisionTransformer' object has no attribute 'input_patchnorm'

Traceback (most recent call last):
  File "scripts/evaluation/inference.py", line 137, in <module>
    run_inference(args, gpu_num, rank)
  File "scripts/evaluation/inference.py", line 115, in run_inference
    img_emb = model.get_image_embeds(cond_images)
  File "/home/ubuntu/VideoCrafter/scripts/evaluation/../../lvdm/models/ddpm3d.py", line 691, in get_image_embeds
    img_token = self.embedder(batch_imgs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/VideoCrafter/scripts/evaluation/../../lvdm/modules/encoders/condition.py", line 341, in forward
    z = self.encode_with_vision_transformer(image)
  File "/home/ubuntu/VideoCrafter/scripts/evaluation/../../lvdm/modules/encoders/condition.py", line 348, in encode_with_vision_transformer
    if self.model.visual.input_patchnorm:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'VisionTransformer' object has no attribute 'input_patchnorm'
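
For what it's worth, this looks like an open_clip version mismatch: some VisionTransformer builds do not expose input_patchnorm at all. A possible workaround (a sketch; the helper name is hypothetical and the flag is assumed to default to False when absent) is to guard the attribute access instead of reading it directly:

# Sketch: treat a missing `input_patchnorm` attribute as False instead of raising.
# Hypothetical helper; adapt inside lvdm/modules/encoders/condition.py as needed.
def uses_input_patchnorm(visual) -> bool:
    return bool(getattr(visual, "input_patchnorm", False))

# Usage at the failing line:
#   if uses_input_patchnorm(self.model.visual): ...

Pinning open_clip to the version the repo was developed against would be the cleaner fix.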

xl_controlnet.py error

When I pass a control_img of shape (2048, 2048, 3) as input, I get:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (154x2048 and 768x320)

Adding Contributors Section in readme.md

Why a Contributors section: a "Contributors" section in a repo gives credit to and acknowledges
the people who have helped with the project, fosters a sense of community, and helps others
know whom to contact for questions or issues related to the project.

Issue type

  • [✅] Docs

Expected output: (screenshot attached)

@YingqingHe If you were planning this issue, kindly assign it to me! I would love to work on it. Thank you!

A question

Fantastic work; I can't wait to try it.
Taking SDXL v1.0 base as an example, where the training images are 1024^2: if I want to generate a 4096^2 image, how much slower will it be than generating 1024^2 directly with SDXL?
I am also collecting a batch of Midjourney (MJ) data and plan to fine-tune SDXL v1.0 base as the pretrained model, hoping to reach MJ V5 quality. Two questions:
1. Do you think this idea of fine-tuning SDXL to align with Midjourney is feasible?
2. I only have a single A100, and training on a large dataset is very time-consuming. If I first downscale the collected MJ data to 256 or 512, train at that resolution, and then use your method to generate 1024 or even higher-resolution images, do you think that would work?
Looking forward to your reply.

artifacts in 4k images

Thanks for your awesome repo. I notice that with the default 4K config, the generated image has lots of artifacts. Any comments?

python3 text2image_xl.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
--validation_prompt "a professional photograph of an astronaut riding a horse" \
--seed 23 \
--config ./configs/sdxl_4096x4096.yaml \
--logging_dir ${your-logging-dir}

(generated image attached)

Code for hyperparameter search

How did you find the parameters for your models? Was it some kind of script that enumerates candidates and computes metrics, and if so, do you plan to publish it? I suspect that different hyperparameters will be best for different problems. It would be interesting to test this hypothesis by finding the best parameters for img2img and comparing them with the ones you got for text2img.

About group normalization

Great work! Fixing the seam problem of the tiled decoder with group normalization is a good idea, but why can't I find the relevant code in the repo?

Bug report: VAE tiling VRAM usage & non-square generation speed

  1. Using the --vae_tiling arg seems to use absurdly high VRAM during VAE decoding, more than 2x compared with not using it. Is something wrong? (See the sketch after this list for how upstream diffusers enables tiled decoding.)
  2. Non-square configs seem to generate at a speed close to the larger square one rather than in between (2048x1024 is as slow as 2048x2048, not between 1024x1024 and 2048x2048). Is that intentional?
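
For reference, here is a sketch of how upstream diffusers enables tiled VAE decoding; whether the repo's --vae_tiling flag maps onto exactly this code path is an assumption:

# Sketch: diffusers' built-in tiled/sliced VAE decoding, for comparison with --vae_tiling.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.vae.enable_tiling()   # decode the latent tile by tile to cap peak VRAM
pipe.vae.enable_slicing()  # optionally also split the batch dimension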

Cannot use SD 1.5 and SD 2.1 models

If I try and use SD 1.5 with
python text2image_xl.py --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 --validation_prompt "roses in the rain" --seed 1500289018 --config .\configs\sd1.5_2048x2048.yaml --logging_dir logs
I get the error
OSError: Can't load tokenizer for 'runwayml/stable-diffusion-v1-5'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'runwayml/stable-diffusion-v1-5' is the correct path to a directory containing all relevant files for a CLIPTokenizer tokenizer.

If I try and use SD 2.1 with
python text2image_xl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base --validation_prompt "roses in the rain" --seed 470116415 --config .\configs\sd2.1_2048x2048.yaml --logging_dir logs
I get the error
OSError: Can't load tokenizer for 'stabilityai/stable-diffusion-2-1-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'stabilityai/stable-diffusion-2-1-base' is the correct path to a directory containing all relevant files for a CLIPTokenizer tokenizer.

If I try and use SD XL with
python text2image_xl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --validation_prompt "roses in the rain" --seed 694038880 --config .\configs\sdxl_2048x2048.yaml --logging_dir logs
it works fine.

(generated image: roses in the rain_sdxl_694038880)

What am I missing to use the 1.5 and 2.1 models?
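
One way to narrow this down (a sketch, assuming network access to the Hub and no local folder shadowing the model id) is to load the tokenizer subfolder on its own, since that appears to be the call that fails inside the script:

# Sketch: check that the CLIP tokenizer for SD 1.5 can be fetched by itself.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)
print(type(tok).__name__, tok.vocab_size)

If this fails as well, the problem is the environment (offline cache, a local directory named runwayml, an old transformers version) rather than the repo's configs.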

Bad results

I'm testing your work and got bad results; maybe I did something wrong? Do you get the same results with these parameters?
Before testing, I set up an environment following your requirements.

Example 1:
python3 text2image.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base \
--validation_prompt "A corgi sits on a beach chair on a beautiful beach, with palm trees behind, high details" \
--seed 23 \
--config ./configs/sd2.1_2048x2048.yaml \
--logging_dir ./out
Result: (image attached) has repetition.

Example 2:
python3 text2image_xl.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
--validation_prompt "A corgi sits on a beach chair on a beautiful beach, with palm trees behind, high details" \
--seed 23 \
--config ./configs/sdxl_4096x2048.yaml \
--logging_dir ./out4
Result: (image attached) has blur and repetition.

Example 3:
python3 text2image.py \
--pretrained_model_name_or_path stabilityai/stable-diffusion-2-1-base \
--validation_prompt "A corgi sits on a beach chair on a beautiful beach, with palm trees behind, high details" \
--seed 23 \
--config ./configs/sd2.1_2048x1024.yaml \
--logging_dir ./out5
Result: (image attached) has repetition.

Does this algorithm work with DDIM scheduler only?

I tried to generate images with the DPM++ scheduler but got results with artifacts. I could not find any explanation about the scheduler in the paper and am wondering which schedulers can be used with ScaleCrafter.
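
For experimentation, schedulers can be swapped on a diffusers pipeline as sketched below; whether ScaleCrafter's dilation settings were tuned specifically for DDIM is exactly the open question here:

# Sketch: swapping schedulers on a diffusers SDXL pipeline
# (assumes this mirrors how the repo builds its pipeline).
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler, DPMSolverMultistepScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # scheduler used in the examples
# pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)  # DPM++ variant that showed artifacts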

How should I call my locally stored checkpoints in the script?

It throws an error when I'm trying to call a local model using the method below.

🤷‍♂️ Where am I making a mistake? How should I call my local checkpoints?

The script part for running the code:

python3 text2image_xl.py \
--pretrained_model_name_or_path ./models/juggernautXL_version5.safetensors \
--validation_prompt "a professional photograph of an astronaut riding a steel horse at Mars" \
--seed 23 \
--config ./configs/sdxl_2048x2048.yaml \
--logging_dir ./outputs

⚠ Output:

Traceback (most recent call last):
  File "text2image_xl.py", line 541, in <module>
    main()
  File "text2image_xl.py", line 405, in main
    tokenizer = CLIPTokenizer.from_pretrained(
  File "/home/astroboy/anaconda3/envs/scalecrafter/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1699, in from_pretrained
    raise ValueError(
ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.

Folder structure: (screenshot attached)
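
A possible workaround (a sketch, assuming the checkpoint is an SDXL-format .safetensors that diffusers' from_single_file can parse) is to convert it into a diffusers directory first and pass that directory to the script:

# Sketch: convert a single-file SDXL checkpoint into the directory layout the tokenizer loader expects.
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file("./models/juggernautXL_version5.safetensors")
pipe.save_pretrained("./models/juggernautXL_version5_diffusers")  # hypothetical output path

# Then run the script with:
#   --pretrained_model_name_or_path ./models/juggernautXL_version5_diffusers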

Code for generating linear transforms of convolution dispersion

Thanks for your wonderful work. I couldn't locate the code responsible for producing the linear transforms of convolution dispersion, only the adept transformation. Furthermore, I couldn't identify any instances where convolution dispersion is actually applied, which suggests that all judgment conditions decide against using the transform. Are there any plans to publish the related code? Alternatively, could you offer some guidance on how to access it?

vs hires.fix ?

Very nice work! I have some questions. The paper focuses on solving the problem of object duplication when generating high-resolution images. Although it is possible to generate at a higher resolution, the sharpness of the image does not seem to be solved; does that still need to be addressed by training on high-resolution images? Also, what are the advantages of this approach over the hires. fix used in the web UI? Thank you for your answer.
