guoqincode / open-animateanyone

Unofficial Implementation of Animate Anyone

Python 99.75% Shell 0.25%

open-animateanyone's Issues

Where is the dependency list (requirements.txt)?

Why DWpose instead of densepose?

Two papers report better results with the 'thicker' style of conditioning images:

https://arxiv.org/pdf/2308.03610.pdf (avatarverse)
https://arxiv.org/pdf/2311.16498.pdf (magicanimate)

Any reason you chose DWpose?

Results

Hi, @guoqincode, thanks for your effort in reimplementing this! Could you show some video results as demonstration?

Where is the file "configs/prompts/animation_stage_1.yaml"?

Could you please provide the file "configs/prompts/animation_stage_1.yaml" for the stage 1 animation test?

Question: Why is CFG not set during training but set during inference?

Operations such as cfg_random_null_text or cfg_random_null_ref were not used during the training phase, but guidance_scale: 7.5 and do_classifier_free_guidance=True were set during inference. Is this as expected?
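For readers unfamiliar with the option names in this question, here is a minimal sketch of what condition dropout during training and guidance at inference usually look like. The function names, the 10% drop probability, and the toy values are assumptions for illustration only; this is not the repository's code.

```python
import random

def maybe_drop_condition(cond, null_cond, drop_prob, rng):
    """Training-time condition dropout: return null_cond with probability drop_prob."""
    return null_cond if rng.random() < drop_prob else cond

def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Inference-time classifier-free guidance on two noise predictions."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

rng = random.Random(0)
# Hypothetical 10% dropout of the reference condition.
dropped = sum(maybe_drop_condition("ref", "", 0.1, rng) == "" for _ in range(10_000))
print(f"dropped {dropped} of 10000 conditions")

guided = apply_guidance([0.0, 1.0], [1.0, 1.0], 7.5)
print(guided)
```

If the model never sees the null condition during training, the unconditional branch used at inference is out of distribution, which is the concern this question raises.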

No module named 'models.hack_poseguider'

I tried to run demo.gradio_animate, but the following error was reported. I could not find hack_poseguider under the models folder.

Traceback (most recent call last):
  File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/work/AnimateAnyone-unofficial/demo/gradio_animate.py", line 8, in <module>
    from demo.animate import AnimateAnyone
  File "/home/work/AnimateAnyone-unofficial/demo/animate.py", line 21, in <module>
    from models.hack_poseguider import Hack_PoseGuider as PoseGuider
ModuleNotFoundError: No module named 'models.hack_poseguider'

About animation_stage_1.yaml

Can you provide the file "configs/prompts/animation_stage_1.yaml"?

About masks?

In the TikTok dataset, there is a masks file. Maybe the foreground is trained separately. Have you taken this into account?

No trainable param for unet in stage 1

unet = DDP(unet, device_ids=[local_rank], output_device=local_rank)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in __init__
self._log_and_throw(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
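The error above is raised when a module with no trainable parameters is wrapped in DDP. One common workaround is to wrap only modules that have at least one parameter with requires_grad=True. The helper names below are hypothetical, and the sketch assumes the frozen unet can simply stay unwrapped.

```python
def has_trainable_params(module) -> bool:
    """True if at least one parameter of the module requires a gradient."""
    return any(p.requires_grad for p in module.parameters())

def maybe_wrap_ddp(module, local_rank):
    """Wrap in DistributedDataParallel only when there is something to synchronize."""
    if not has_trainable_params(module):
        return module  # frozen module: DDP would raise the RuntimeError shown above
    # Imported lazily so the frozen path does not need torch.distributed at all.
    from torch.nn.parallel import DistributedDataParallel as DDP
    return DDP(module, device_ids=[local_rank], output_device=local_rank)
```

Whether the unet is meant to be frozen in stage 1 is a separate question; the guard only avoids the crash.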

For classifier-free guidance, would it be better to use a blank condition image for negatives?

        latents_pose = poseguider(pose_condition)
        # latents_pose = rearrange(latents_pose, "(b f) c h w -> b c f h w", f=video_length)
        if do_classifier_free_guidance: latents_pose = latents_pose.repeat(2,1,1,1) # b c h w

Here, instead of repeating, would it be more appropriate to pass zeros through the poseguider and then concatenate the two results?

What about adding time embedding to the pose guider?

As the title mentions, has this crossed your mind?

ReferenceNet initialization warning?

Some weights of the model checkpoint were not used when initializing ReferenceNet:
['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']

Is this correct? The training loss is not decreasing. Result:

The pose condition appears to have no effect.

Loss won't decrease

Is it normal that the loss doesn't decrease when training stage 1?

Good job

Really nice sharing!

About multi-gpu training

It seems that DDP isn't used in train.py.

stage2 training error

Thank you for your work.

During the second stage of training, I keep getting out-of-memory errors. I have 80 GB of GPU memory. The same error is reported on a single card and on multiple cards, even with --train_batch_size set to 1. What went wrong?

error message:
Traceback (most recent call last):
File "/home/work/animate-anyone/train_2nd_stage.py", line 919, in <module>
main(args)
File "/home/work/animate-anyone/train_2nd_stage.py", line 823, in main
model_pred = unet(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 632, in forward
return model_forward(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 620, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_condition.py", line 1011, in forward
sample = upsample_block(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_blocks.py", line 901, in forward
hidden_states = resnet(hidden_states, temb, scale=lora_scale)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/resnet.py", line 340, in forward
hidden_states = self.norm1(hidden_states)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 810.00 MiB (GPU 0; 79.35 GiB total capacity; 76.87 GiB already allocated; 64.19 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
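The numbers in the error message already say something about the cause. The sketch below parses them and computes the gap between reserved and allocated memory. The helper name is hypothetical, and the regular expressions assume the message layout shown above.

```python
import re

def parse_oom(message):
    """Extract the memory figures from a PyTorch CUDA OOM message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    out = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            value, unit = float(match.group(1)), match.group(2)
            out[key] = value / 1024 if unit == "MiB" else value
    return out

msg = ("CUDA out of memory. Tried to allocate 810.00 MiB (GPU 0; 79.35 GiB total capacity; "
       "76.87 GiB already allocated; 64.19 MiB free; 77.66 GiB reserved in total by PyTorch)")
stats = parse_oom(msg)
# Memory held by the caching allocator but not backing any live tensor.
slack = stats["reserved"] - stats["allocated"]
print(f"reserved - allocated = {slack:.2f} GiB")
```

Reserved memory is less than 1 GiB above allocated memory here, so the fragmentation hint (max_split_size_mb) is unlikely to be the main issue. The live tensors themselves nearly fill the card, which points toward reducing activation memory (fewer frames, lower resolution, or gradient checkpointing if the code supports it).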

Is classifier-free guidance embedded in attention 1 correctly?

https://github.com/guoqincode/AnimateAnyone-unofficial/blob/main/models/ReferenceNet_attention.py#L153

In the AnimateAnyone paper, attention1 is responsible for the spatial-attention operation, while encoder_hidden_states is fed into attention2. Is it correct to apply classifier-free guidance to attention2 instead?
Or, for the image condition, is it only necessary to set the unconditional image_embeddings to 0 before they are passed to the unet?

How many steps are needed in stage 1?

I trained stage 1, but the loss does not decrease. Is this correct?

About training optimization

I tried the 8-bit Adam optimizer, and with it I can train stage 1 on a 40 GB A100. I think it helps reduce VRAM usage, but I don't know whether it hurts model performance. What do you think? Did you try 8-bit Adam?

How do you preprocess data such as UBC? Do you split the videos into frames first, and what is the structure of the output dataset?

Is there any way to avoid using torchrun?
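One possible approach, sketched under assumptions: torchrun mainly exports a handful of environment variables that torch.distributed reads with its default env:// initialization. For a single-process run they can be set by hand before the training script initializes the process group. The helper name and the port are hypothetical, and whether the training script needs anything else from the launcher has not been checked.

```python
import os

def set_single_process_dist_env(port="29500"):
    """Populate the variables torchrun would normally export, for a one-process run."""
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": port,
    }
    for key, value in defaults.items():
        # setdefault keeps values from a real launcher if one is in use.
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in defaults}

env = set_single_process_dist_env()
print(env)
```

With these set, running the script with plain python behaves like a world of size 1.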

About multi-GPU training

Thank you for your contributions! There are two questions below:

I have observed that the training duration using two RTX 6000 ada GPUs exceeds the time it takes with a single GPU. Is this an expected phenomenon?
I encountered the phenomenon of gradient explosion during the training process.

Tensor size mismatch in using clip-vit-large-patch14

Hi,

Thanks for sharing your implementation. It really helps the community reproduce Animate Anyone. When I try to train the network with your code, I find that in referencenet_attention the hidden state size of the Stable Diffusion unet is 768, while the CLIP image feature extracted from clip-vit-large-patch14 is 1024, which causes a size mismatch in the network forward pass (the hidden size of clip-vit-base-patch32, however, is 768). Your config yaml file originally used clip-vit-base-patch32 and was recently changed to clip-vit-large-patch14, and you mentioned in another issue that you use clip-vit-large-patch14. Could you elaborate on how your code works with clip-vit-large-patch14? I get errors when I run your training code directly with clip-vit-large-patch14.

Looking forward to your reply! Thanks again for your effort.
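To make the mismatch concrete, here is a small lookup of the relevant sizes. The numbers are, to the best of my knowledge, those in the published Hugging Face configs of the two CLIP checkpoints and the Stable Diffusion 1.x UNet; the dictionary and function names are hypothetical.

```python
# Vision tower width and joint-embedding (projection) width per checkpoint.
CLIP_DIMS = {
    "openai/clip-vit-base-patch32": {"vision_hidden": 768, "projection": 512},
    "openai/clip-vit-large-patch14": {"vision_hidden": 1024, "projection": 768},
}
SD15_CROSS_ATTENTION_DIM = 768  # input width expected by attn2.to_k / to_v

def compatible_outputs(model_name, cross_attention_dim=SD15_CROSS_ATTENTION_DIM):
    """List which CLIP image outputs match cross-attention without an extra projection."""
    dims = CLIP_DIMS[model_name]
    return [name for name, size in dims.items() if size == cross_attention_dim]

for name in CLIP_DIMS:
    print(name, "->", compatible_outputs(name))
```

Under these assumptions, the unprojected vision features of clip-vit-large-patch14 (1024) cannot go straight into a 768-wide cross-attention layer, while its projected image embedding (768) can. An alternative is a learned linear layer from 1024 to 768. Which of these the author intended is not stated in the issue.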

AttributeError: ‘PoseGuider’ object has no attribute ‘module’.

Thanks for sharing this repo @guoqincode.

While trying to run stage 2 training, I get this error on line 550 of train.py: AttributeError: ‘PoseGuider’ object has no attribute ‘module’. Did you mean: ‘modules’?

Do you happen to know why I might be getting this error?
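A likely cause, stated as an assumption: the script accesses poseguider.module, which only exists when the model is wrapped in DistributedDataParallel (or DataParallel). A small helper that works in both cases avoids the error; the name unwrap_model and the toy classes are hypothetical.

```python
def unwrap_model(model):
    """Return the underlying module whether or not it is wrapped in DDP/DataParallel."""
    return getattr(model, "module", model)

class Plain:
    """Stands in for an unwrapped PoseGuider."""

class Wrapper:
    """Stands in for a DDP wrapper, which exposes the original model as .module."""
    def __init__(self, module):
        self.module = module

inner = Plain()
same_when_plain = unwrap_model(inner) is inner
same_when_wrapped = unwrap_model(Wrapper(inner)) is inner
print(same_when_plain, same_when_wrapped)
```

Calls such as poseguider.module.state_dict() would then become unwrap_model(poseguider).state_dict().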

Batch size in training stage 1

I am training the first stage on 8×A800 80G. However, the maximum batch size I can set is 1 per GPU. Is that normal?

Running Inference step 2

I got training results for both stages 1 and 2. Stage 1 inference works but creates a one-second video that repeats the same frame; stage 2 inference does not work. I tried python -m pipelines.animation_stage_2 --config configs/prompts/animation_stage_2.yaml with the config values set correctly. It first threw an import error, which I fixed. Now I get this error:

  from diffusers.pipeline_utils import DiffusionPipeline
loaded temporal unet's pretrained weights from outputs/train_stage_2-2023-12-22T08-59-53
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 244, in <module>
    run(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 233, in run
    main(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 70, in main
    unet = UNet3DConditionModel.from_pretrained_2d(config.pretrained_motion_unet_path, subfolder=None, unet_additional_kwargs=OmegaConf.to_container(inference_config.unet_additional_kwargs), specific_model=config.specific_motion_unet_model)
  File "/workspace/AnimateAnyone-unofficial/models/unet.py", line 457, in from_pretrained_2d
    raise RuntimeError(f"{config_file} does not exist")
RuntimeError: outputs/train_stage_2-2023-12-22T08-59-53/config.json does not exist
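The traceback shows that from_pretrained_2d expects a config.json next to the weights. A quick pre-flight check makes this kind of failure explicit before the pipeline starts. The helper name is hypothetical, and the list of required files is an assumption taken only from the error above.

```python
import tempfile
from pathlib import Path

def missing_checkpoint_files(checkpoint_dir, required=("config.json",)):
    """Return the required file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in required if not (root / name).is_file()]

# Demonstration on a throwaway directory instead of a real training output.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_checkpoint_files(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    after = missing_checkpoint_files(tmp)
print(before, after)
```

If the training output directory holds only weights, one workaround people use is to place the unet config.json of the base model beside them; whether that matches this repository's intended layout is not confirmed here.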

one of the variables needed for gradient computation has been modified by an inplace operation [torch.cuda.FloatTensor [128]] is at version 3; expected version 2 instead

File "train_th.py", line 637, in <module>
main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
File "train_th.py", line 460, in main
latents_pose = poseguider(mask_image)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/functional.py", line 2450, in batch_norm
return torch.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()

It should be here: File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)

I have no idea about that.

How big does the video dataset need to be for good results?

About loss

Why is my loss so strange?
Today I tried the new code, and my loss became NaN:

SparseCausalAttention2D

While reading the code I saw that the standard BasicTransformerBlock from diffusers has been replaced with a modified version that utilizes a new class called SparseCausalAttention2D for the attn1 layer. Could you specify where this class is defined? Or maybe, were you able to successfully train the model without using this class (replacing it with a different one)?

UBC at 512×512 on 8 A100 GPUs: how long does training take? A few days?

My run looks like it may need more than ten days.

Has anyone implemented it in Comfy?

How can I implement this in Comfy UI?

Any results?

I saw you added an inference cmd to the readme.
Do you have any preliminary results?

save unet to state_dict during stage 1 training

about training memory optimization

In the README, you mentioned that you would optimize the training code using DeepSpeed and Accelerate. However, as far as I know, the DeepSpeed functionality integrated into the Accelerate library does not support multi-model training. Do you have any suggestions?

When will you release the inference code and unofficial pre-trained weights? It's urgent.

Very urgent!

Where is spatial attention?

Great work. I have a question about the attention modules (spatial attention, cross-attention, and temporal attention): is the spatial attention that combines the ReferenceNet latent features with the denoising UNet latent features missing? (From the paper: "we replace the self-attention layer with spatial-attention layer. Given a feature map x1 ∈ R^(t×h×w×c) from denoising UNet and x2 ∈ R^(h×w×c) from ReferenceNet, we first copy x2 by t times and concatenate it with x1 along w dimension")
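The shape manipulation in the quoted passage can be sketched with plain nested lists. As I read the paper, self-attention then runs over the widened map and only the first half along w is kept as output. The function names are hypothetical and the attention step itself is omitted; in a tensor library the same step would be a repeat followed by a concatenation along the width axis.

```python
def concat_reference_along_width(x1, x2):
    """x1: [t][h][w][c] denoising features, x2: [h][w][c] reference features.

    x2 is reused for every frame (the "copy t times" step) and appended to each
    row, giving a [t][h][2w][c] feature map.
    """
    return [[row1 + row2 for row1, row2 in zip(frame, x2)] for frame in x1]

def keep_first_half(x, w):
    """After attention, keep only the columns that belong to the denoising UNet."""
    return [[row[:w] for row in frame] for frame in x]

t, h, w, c = 2, 1, 2, 1
x1 = [[[[float(f)] for _ in range(w)] for _ in range(h)] for f in range(t)]
x2 = [[[9.0] for _ in range(w)] for _ in range(h)]

joined = concat_reference_along_width(x1, x2)
shape = (len(joined), len(joined[0]), len(joined[0][0]), len(joined[0][0][0]))
print(shape)
```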

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

I modified the paths in the configuration file to point to my local directories (UBC Fashion Video dataset) and started the training process. However, an error occurred during the process.

/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
### Train Info: train stage 1: image pretrain ###
Some weights of the model checkpoint were not used when initializing ReferenceNet: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']
12/20/2023 01:40:44 - INFO - root - ***** Running training *****
12/20/2023 01:40:44 - INFO - root -   Num examples = 500
12/20/2023 01:40:44 - INFO - root -   Num Epochs = 480
12/20/2023 01:40:44 - INFO - root -   Instantaneous batch size per device = 4
12/20/2023 01:40:44 - INFO - root -   Total train batch size (w. parallel, distributed & accumulation) = 4
12/20/2023 01:40:44 - INFO - root -   Gradient Accumulation steps = 1
12/20/2023 01:40:44 - INFO - root -   Total optimization steps = 60000

  0%|          | 0/60000 [00:00<?, ?it/s]
Steps:   0%|          | 0/60000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 629, in <module>
    main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
  File "train.py", line 492, in main
    referencenet(latents_ref_img, ref_timesteps, encoder_hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet.py", line 1005, in forward
    sample, res_samples = downsample_block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py", line 1086, in forward
    hidden_states = attn(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/transformer_2d.py", line 315, in forward
    hidden_states = block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet_attention.py", line 199, in hacked_basic_transformer_inner_forward
    attn_output = self.attn2(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 417, in forward
    return self.processor(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 1023, in __call__
    key = attn.to_k(encoder_hidden_states, scale=scale)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/lora.py", line 224, in forward
    out = super().forward(hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

Steps:   0%|          | 0/60000 [00:05<?, ?it/s]
[2023-12-20 01:40:55,416] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 412807) of binary: /home/user/miniconda3/envs/animateanyone-unofficial/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/animateanyone-unofficial/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-20_01:40:55
  host      : gpuserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 412807)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

what is the poseguider_checkpoint_path value?

Hello, first of all thanks for your work.

I have some questions. For the second stage of training, the train_stage_2.yaml file contains:
poseguider_checkpoint_path: ""
referencenet_checkpoint_path: ""
What should these two values be? Should referencenet_checkpoint_path point to the model trained in the first stage, or to something else?
I hope to get your reply.

Have you noticed any issue during training related to the denoising timesteps?

Per the title I've been a little perplexed to see that what was denoised well at 30 inference timesteps @ 60k training steps, requires 70 steps @ 100k training steps.

My implementation is slightly different than yours so there could be quite a few things going on. Just curious if you noticed any similar behaviors since you're in the middle of training these days.

Thank you

Is 40 GB VRAM enough for training?

The trained unet parameters are not used

In pipelines/animation_stage_1.py, the unet parameters are loaded from config.pretrained_model_path, not from config.pretrained_unet_path.

About "beta_schedule"

I noticed that you changed beta_schedule from linear to scaled_linear. Is it because the training results are better when using the latter?
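For reference, the two schedules differ only in where the linear interpolation happens. The sketch below follows the definition used by diffusers for "scaled_linear" (linear in the square root of beta, then squared). The beta_start, beta_end, and step count are the usual Stable Diffusion values and are an assumption, since the issue does not state them.

```python
import math

def linspace(start, end, n):
    """n evenly spaced values from start to end, inclusive."""
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]

def linear_betas(beta_start, beta_end, n):
    return linspace(beta_start, beta_end, n)

def scaled_linear_betas(beta_start, beta_end, n):
    # Interpolate sqrt(beta) linearly, then square the result.
    return [b * b for b in linspace(math.sqrt(beta_start), math.sqrt(beta_end), n)]

lin = linear_betas(0.00085, 0.012, 1000)
scl = scaled_linear_betas(0.00085, 0.012, 1000)
print(f"mid-schedule beta: linear={lin[500]:.5f}, scaled_linear={scl[500]:.5f}")
```

Both schedules share their endpoints, but scaled_linear adds noise more slowly in between. Stable Diffusion 1.x was trained with scaled_linear, so matching it is a plausible reason for the change, although the author's reason is not stated here.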

about training memory optimization

I would like to know how to handle the dataset. What is the overall structure of the TikTok dataset, and do I need to preprocess it beforehand using DWPose or OpenPose?

Loss not decreasing in stage 1

After training stage 1 for 30000 steps on the TikTok dataset, I'm getting the following loss curve and images from validation_pipeline. Is this correct?

Which clip encoder is this?

Magicanimate doesn't seem to have it in their pretrained directory. Is it the same as "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" ?

About time embedding in ReferenceNet

In the official paper, the authors say

While ReferenceNet introduces a comparable number of parameters to the denoising UNet, in diffusion-based video generation, all video frames undergo denoising multiple times, whereas ReferenceNet only needs to extract features once throughout the entire process

But in your implementation of inference, the forward of ReferenceNet is performed multiple times.

Would you consider fixing the timestep of ReferenceUnet?
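The suggestion can be sketched as a loop that computes the reference features once, at a fixed timestep, and reuses them at every denoising step. All names and call signatures here are hypothetical stand-ins, not the repository's API, and the fixed timestep of 0 is an assumption.

```python
def denoise(latents, timesteps, reference_net, denoising_unet, ref_latents):
    """Denoising loop that runs the reference network once instead of once per step."""
    ref_features = reference_net(ref_latents, timestep=0)  # fixed timestep, computed once
    for t in timesteps:
        latents = denoising_unet(latents, t, ref_features)
    return latents

calls = {"reference": 0, "unet": 0}

def fake_reference_net(ref_latents, timestep):
    """Counts calls; a real network would return per-layer attention features."""
    calls["reference"] += 1
    return ref_latents

def fake_unet(latents, t, ref_features):
    """Counts calls; a real network would predict noise and update the latents."""
    calls["unet"] += 1
    return latents

denoise([0.0], range(30), fake_reference_net, fake_unet, [1.0])
print(calls)
```

This only works if the reference network is also trained with the same fixed timestep, so that its features do not depend on the denoising step.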

model saving

Hi, it seems that you train the 2D unet, referencenet, and poseguider during the first stage,
but you don't save parameters of 2D unet.

about the result of the first stage

my config:
train_data:
csv_path: ../TikTok_info.csv
video_folder: ../TikTok_dataset/TikTok_dataset
sample_size: 512
sample_stride: 4
sample_n_frames: 16
clip_model_path: openai/clip-vit-base-patch32

gradient_accumulation_steps: 128
batch_size: 1
I use one V100, with optimizer = torch.optim.SGD(trainable_params, lr=learning_rate / gradient_accumulation_steps, momentum=0.9)
Result: shown after 20000 steps

Could it be because my 20,000 steps are actually equivalent to only a little more than 300 steps at a batch size of 64? Or are there other reasons?
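The arithmetic behind this question can be checked directly. The sketch assumes the logged 20,000 steps count individual forward/backward passes (micro-steps) rather than optimizer updates; if the script counts optimizer updates instead, the numbers change. The function name is hypothetical.

```python
def training_budget(micro_steps, batch_size, grad_accum_steps):
    """Translate logged micro-steps into optimizer updates and samples seen."""
    return {
        "effective_batch": batch_size * grad_accum_steps,
        "optimizer_updates": micro_steps // grad_accum_steps,
        "samples_seen": micro_steps * batch_size,
    }

budget = training_budget(micro_steps=20_000, batch_size=1, grad_accum_steps=128)
# Number of steps a batch-size-64 run would need to see the same number of samples.
equivalent_steps_at_64 = budget["samples_seen"] / 64
print(budget, equivalent_steps_at_64)
```

Under that assumption the model has received about 156 weight updates at an effective batch of 128, which matches the "a little more than 300 steps at batch size 64" estimate in terms of samples seen. Note also that the learning rate in the config above is divided by gradient_accumulation_steps, which shrinks each update further.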