open-animateanyone's People

Contributors

dongxuyue, eltociear, guoqincode, zhenzhiwang

open-animateanyone's Issues

Results

Hi, @guoqincode, thanks for your effort in reimplementing this! Could you show some video results as a demonstration?

No module named 'models.hack_poseguider'

I tried to run demo.gradio_animate, but the following error was reported. I could not find hack_poseguider under the models folder.

Traceback (most recent call last):
File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/work/AnimateAnyone-unofficial/demo/gradio_animate.py", line 8, in
from demo.animate import AnimateAnyone
File "/home/work/AnimateAnyone-unofficial/demo/animate.py", line 21, in
from models.hack_poseguider import Hack_PoseGuider as PoseGuider
ModuleNotFoundError: No module named 'models.hack_poseguider'

About masks?

In the TikTok dataset there is a masks file. Maybe the foreground should be trained separately; have you taken this into account?
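
If one wanted to use those masks, a common option is to weight the diffusion loss by the foreground mask so that the person region counts more than the background. This is only a minimal sketch, not the authors' method; the tensor shapes and the weighting scheme are assumptions:

    import torch
    import torch.nn.functional as F

    def masked_diffusion_loss(noise_pred, noise_target, fg_mask, fg_weight=2.0):
        """Weight the per-pixel MSE so foreground (person) pixels count more.

        noise_pred / noise_target: (B, C, H, W) latent-space tensors from the denoising UNet.
        fg_mask: (B, 1, h, w) binary mask from the TikTok dataset.
        """
        # Resize the mask to the latent resolution (nearest is fine for a binary mask).
        mask = F.interpolate(fg_mask.float(), size=noise_pred.shape[-2:], mode="nearest")
        # Background pixels get weight 1, foreground pixels get fg_weight.
        weight = 1.0 + (fg_weight - 1.0) * mask
        return (weight * (noise_pred - noise_target) ** 2).mean()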

No trainable param for unet in stage 1

unet = DDP(unet, device_ids=[local_rank], output_device=local_rank)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in init
self._log_and_throw(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
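
This error means the module handed to DDP has no parameters with requires_grad=True (in stage 1 the denoising UNet may be intentionally frozen, depending on the config). A generic guard, not taken from this repo, is to wrap only modules that actually have trainable parameters:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_if_trainable(module: nn.Module, local_rank: int):
        """Wrap a module in DDP only if it has at least one trainable parameter."""
        trainable = [n for n, p in module.named_parameters() if p.requires_grad]
        if not trainable:
            # Nothing to synchronize: keep the frozen module as-is to avoid the DDP RuntimeError.
            return module
        print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
        return DDP(module, device_ids=[local_rank], output_device=local_rank)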

referencenet initializing warning?

Some weights of the model checkpoint were not used when initializing ReferenceNet:
['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']

Is this correct? The training loss is not decreasing. Result:
[result grid image]

The pose condition seems to have no effect.

stage2 training error

Thank you for your work.

During the second stage of training, I keep getting out-of-memory errors. I have 80 GB of GPU memory, and the same error occurs on a single card and on multiple cards, even with --train_batch_size set to 1. What went wrong?

error message:
Traceback (most recent call last):
File "/home/work/animate-anyone/train_2nd_stage.py", line 919, in
main(args)
File "/home/work/animate-anyone/train_2nd_stage.py", line 823, in main
model_pred = unet(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 632, in forward
return model_forward(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 620, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_condition.py", line 1011, in forward
sample = upsample_block(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_blocks.py", line 901, in forward
hidden_states = resnet(hidden_states, temb, scale=lora_scale)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/resnet.py", line 340, in forward
hidden_states = self.norm1(hidden_states)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 810.00 MiB (GPU 0; 79.35 GiB total capacity; 76.87 GiB already allocated; 64.19 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
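
Not specific to this repo, but the usual diffusers-style knobs for an out-of-memory error like this are gradient checkpointing, memory-efficient attention (if xformers is installed), and the allocator setting suggested in the error message. Whether the custom UNet3DConditionModel exposes exactly these methods is an assumption:

    import os

    # From the error message: reduce allocator fragmentation (set before any CUDA allocation).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    def reduce_memory(unet):
        # Recompute activations in the backward pass instead of storing them.
        unet.enable_gradient_checkpointing()  # diffusers ModelMixin API, assumed available here
        try:
            unet.enable_xformers_memory_efficient_attention()
        except Exception:
            pass  # xformers not installed or not supported by this model
        return unet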

about training optimization

I tried the 8-bit Adam optimizer and can train stage one on a 40 GB A100. I think it helps reduce VRAM usage, but I don't know whether it degrades model performance. What do you think? Did you try 8-bit Adam?
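
For reference, swapping in bitsandbytes' 8-bit AdamW looks like the sketch below; the model is a toy stand-in and the hyperparameters are placeholders, not this repo's settings. The memory saving comes from keeping the optimizer states in 8 bit, and in most reports the quality impact is small, though that is not guaranteed here.

    import torch.nn as nn
    import bitsandbytes as bnb

    model = nn.Linear(320, 320)  # stand-in for the trainable UNet/ReferenceNet/PoseGuider parameters
    trainable_params = [p for p in model.parameters() if p.requires_grad]

    # Drop-in replacement for torch.optim.AdamW with 8-bit optimizer states.
    optimizer = bnb.optim.AdamW8bit(trainable_params, lr=1e-5, betas=(0.9, 0.999), weight_decay=1e-2)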

About multi-GPU training

Thank you for your contributions! I have two questions:

  1. I have observed that training with two RTX 6000 Ada GPUs takes longer than training with a single GPU. Is this expected?
  2. I encountered gradient explosion during training (see the gradient-clipping sketch below).
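
On question 2, gradient clipping is the usual first mitigation for exploding gradients. A self-contained toy sketch, with placeholder values rather than this repo's config:

    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)                      # stand-in for the trainable modules
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    max_grad_norm = 1.0                          # a common default; tune as needed

    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    # Clip the global gradient norm before stepping to damp exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()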

Tensor size mismatch in using clip-vit-large-patch14

Hi,

Thanks for sharing your implementation; it really helps the community reproduce Animate Anyone. When I try to train the network with your code, I find that in referencenet_attention the hidden state size of the Stable Diffusion UNet is 768, while the CLIP image feature extracted from clip-vit-large-patch14 is 1024, which causes a size mismatch in the forward pass (the hidden size of clip-vit-base-patch32, by contrast, is 768). Your config yaml originally used clip-vit-base-patch32 and was recently changed to clip-vit-large-patch14, and you mentioned in another issue that you use clip-vit-large-patch14. Could you elaborate on how your code works with clip-vit-large-patch14? I get errors when I run your training code with it directly.

Looking forward to your reply! Thanks again for your effort.
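
For context, clip-vit-large-patch14's vision tower outputs 1024-dim hidden states while its projected image embedding (and SD 1.x cross-attention) is 768-dim, so the two usual workarounds are to feed the projected image_embeds, or to keep the 1024-dim patch tokens and add a trainable projection. This is only an illustration, not how this repo resolves it:

    import torch
    import torch.nn as nn
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

    clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = Image.new("RGB", (512, 512))                 # placeholder reference image
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    with torch.no_grad():
        out = clip(pixel_values)

    # Option A: the projected embedding is already 768-dim and matches SD 1.x cross-attention.
    image_embeds = out.image_embeds.unsqueeze(1)         # (B, 1, 768)

    # Option B: keep the 1024-dim patch tokens and project them down with a trainable layer.
    proj = nn.Linear(clip.config.hidden_size, 768)       # 1024 -> 768
    patch_tokens = proj(out.last_hidden_state)           # (B, 257, 768)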

batch size in training stage1.

I am training the first stage on 8 A800 80 GB GPUs. However, the maximum batch size per GPU can only be set to 1. Is that normal?

Running Inference step 2

I got training results for both stage 1 and stage 2. Stage-1 inference works but produces a video that repeats the same frame for one second; the stage-2 inference module is not working. I ran python -m pipelines.animation_stage_2 --config configs/prompts/animation_stage_2.yaml with the config values set correctly. It threw an import error, which I fixed; now I get this error:

  from diffusers.pipeline_utils import DiffusionPipeline
loaded temporal unet's pretrained weights from outputs/train_stage_2-2023-12-22T08-59-53
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 244, in <module>
    run(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 233, in run
    main(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 70, in main
    unet = UNet3DConditionModel.from_pretrained_2d(config.pretrained_motion_unet_path, subfolder=None, unet_additional_kwargs=OmegaConf.to_container(inference_config.unet_additional_kwargs), specific_model=config.specific_motion_unet_model)
  File "/workspace/AnimateAnyone-unofficial/models/unet.py", line 457, in from_pretrained_2d
    raise RuntimeError(f"{config_file} does not exist")
RuntimeError: outputs/train_stage_2-2023-12-22T08-59-53/config.json does not exist

one of the variables needed for gradient computation has been modified by an inplace operation [torch.cuda.FloatTensor [128]] is at version 3; expected version 2 instead

File "train_th.py", line 637, in
main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
File "train_th.py", line 460, in main
latents_pose = poseguider(mask_image)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/functional.py", line 2450, in batch_norm
return torch.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()

The problem seems to be here: File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)

I have no idea what is causing it.
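
One way to narrow this down is to rerun the failing step under autograd anomaly detection, which additionally prints the forward-pass stack of the operation whose output was later modified in place. poseguider and mask_image below are the names from the traceback; compute_loss is a placeholder for the rest of the training step:

    import torch

    with torch.autograd.set_detect_anomaly(True):
        latents_pose = poseguider(mask_image)   # names from the traceback above
        loss = compute_loss(latents_pose)       # placeholder for the remaining training step
        loss.backward()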

about loss

[loss curve screenshot]
Why is my loss so strange?
Today I tried the new code, and the loss becomes NaN:
[loss curve screenshot]

SparseCausalAttention2D

While reading the code, I saw that the standard BasicTransformerBlock from diffusers has been replaced with a modified version that uses a new class called SparseCausalAttention2D for the attn1 layer. Could you specify where this class is defined? Or were you able to train the model successfully without this class (replacing it with a different one)?

Any results?

I saw you added an inference command to the README.
Do you have any preliminary results?

about training memory optimization

In the README, you mentioned that you would optimize the training code using DeepSpeed and Accelerate. However, as far as I know, the DeepSpeed functionality integrated into the Accelerate library does not support multi-model training. Do you have any suggestions?
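
For what it's worth, the plain (DDP) Accelerate path does accept several models in a single prepare call, as in the sketch below with toy stand-ins for the repo's three trainable modules; the DeepSpeed integration is where the one-model limitation you mention applies.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Toy stand-ins for the three trainable modules.
    unet, referencenet, poseguider = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
    params = [p for m in (unet, referencenet, poseguider) for p in m.parameters()]
    optimizer = torch.optim.AdamW(params, lr=1e-5)
    loader = DataLoader(TensorDataset(torch.randn(16, 8)), batch_size=4)

    accelerator = Accelerator()
    # Objects come back in the order they were passed in.
    unet, referencenet, poseguider, optimizer, loader = accelerator.prepare(
        unet, referencenet, poseguider, optimizer, loader
    )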

where is spatial attention?

Great work! I have a question about the attention modules (spatial attention, cross-attention, temporal attention): is the spatial attention that fuses the ReferenceNet latent features with the denoising UNet latent features missing? (Quoting the paper: "we replace the self-attention layer with spatial-attention layer. Given a feature map x1 ∈ R^{t×h×w×c} from denoising UNet and x2 ∈ R^{h×w×c} from ReferenceNet, we first copy x2 by t times and concatenate it with x1 along w dimension.")
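
As a reading aid, a minimal sketch of the operation that quote describes (copy the reference feature t times, concatenate along the width axis, run self-attention, keep the first half). It is an illustration of the paper's wording, not this repo's ReferenceNet_attention code:

    import torch
    import torch.nn as nn

    def reference_spatial_attention(x1, x2, attn):
        """x1: (t, h, w, c) from the denoising UNet; x2: (h, w, c) from ReferenceNet."""
        t, h, w, c = x1.shape
        x2 = x2.unsqueeze(0).expand(t, -1, -1, -1)        # copy x2 t times -> (t, h, w, c)
        x = torch.cat([x1, x2], dim=2)                    # concatenate along the w dimension -> (t, h, 2w, c)
        tokens = x.reshape(t, h * 2 * w, c)               # flatten spatial positions into a token sequence
        out, _ = attn(tokens, tokens, tokens)             # self-attention over the concatenated map
        return out.reshape(t, h, 2 * w, c)[:, :, :w, :]   # keep only the x1 half

    t, h, w, c = 2, 8, 8, 64
    attn = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)
    y = reference_spatial_attention(torch.randn(t, h, w, c), torch.randn(h, w, c), attn)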

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

I modified the paths in the configuration file to point to my local directories (UBC Fashion Video dataset) and started training. However, the following error occurred.

/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
### Train Info: train stage 1: image pretrain ###
Some weights of the model checkpoint were not used when initializing ReferenceNet: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']
12/20/2023 01:40:44 - INFO - root - ***** Running training *****
12/20/2023 01:40:44 - INFO - root -   Num examples = 500
12/20/2023 01:40:44 - INFO - root -   Num Epochs = 480
12/20/2023 01:40:44 - INFO - root -   Instantaneous batch size per device = 4
12/20/2023 01:40:44 - INFO - root -   Total train batch size (w. parallel, distributed & accumulation) = 4
12/20/2023 01:40:44 - INFO - root -   Gradient Accumulation steps = 1
12/20/2023 01:40:44 - INFO - root -   Total optimization steps = 60000

  0%|          | 0/60000 [00:00<?, ?it/s]
Steps:   0%|          | 0/60000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 629, in <module>
    main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
  File "train.py", line 492, in main
    referencenet(latents_ref_img, ref_timesteps, encoder_hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet.py", line 1005, in forward
    sample, res_samples = downsample_block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py", line 1086, in forward
    hidden_states = attn(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/transformer_2d.py", line 315, in forward
    hidden_states = block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet_attention.py", line 199, in hacked_basic_transformer_inner_forward
    attn_output = self.attn2(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 417, in forward
    return self.processor(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 1023, in __call__
    key = attn.to_k(encoder_hidden_states, scale=scale)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/lora.py", line 224, in forward
    out = super().forward(hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

Steps:   0%|          | 0/60000 [00:05<?, ?it/s]
[2023-12-20 01:40:55,416] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 412807) of binary: /home/user/miniconda3/envs/animateanyone-unofficial/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/animateanyone-unofficial/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-20_01:40:55
  host      : gpuserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 412807)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

what is the poseguider_checkpoint_path value?

Hello, first of all thanks for your work.

I have some questions about the second stage of training. In the train_stage_2.yaml file,
poseguider_checkpoint_path: ""
referencenet_checkpoint_path: ""
What should these two values be? Should the model trained in the first stage be written into referencenet_checkpoint_path, or something else? I hope to get your reply.

Have you noticed any issue during training related to the denoising timesteps?

Per the title, I've been a little perplexed to see that what denoised well with 30 inference timesteps at 60k training steps now requires 70 steps at 100k training steps.

My implementation is slightly different from yours, so there could be quite a few things going on. I'm just curious whether you noticed any similar behavior, since you're in the middle of training these days.

Thank you

About "beta_schedule"

I noticed that you changed beta_schedule from linear to scaled_linear. Is it because the training results are better when using the latter?
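
For reference, the two options only differ in how the betas are spaced; "scaled_linear" with beta_start 0.00085 and beta_end 0.012 is the schedule Stable Diffusion itself was trained with, which is presumably why it behaves better when fine-tuning from SD weights. A minimal sketch with the standard SD values (not necessarily this repo's exact config):

    from diffusers import DDIMScheduler

    scheduler = DDIMScheduler(
        num_train_timesteps=1000,
        beta_start=0.00085,
        beta_end=0.012,
        beta_schedule="scaled_linear",   # previously "linear"
    )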

about training memory optimization

In the README, you mentioned that you would optimize the training code using DeepSpeed and Accelerate. However, as far as I know, the DeepSpeed functionality integrated into the Accelerate library does not support multi-model training. Do you have any suggestions on using DeepSpeed to optimize memory?

Loss not decreasing in stage 1

After training stage 1 for 30,000 steps on the TikTok dataset, I'm getting the following loss curve and images from validation_pipeline. Is this correct?

[loss curve and validation images]

Which clip encoder is this?

Magicanimate doesn't seem to have it in their pretrained directory. Is it the same as "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" ?

About time embedding in ReferenceNet

In the official paper, the authors say

While ReferenceNet introduces a comparable number of parameters to the denoising UNet, in diffusion-based video generation, all video frames undergo denoising multiple times, whereas ReferenceNet only needs to extract features once throughout the entire process

But in your inference implementation, the ReferenceNet forward pass is performed multiple times.

Would you consider fixing the timestep of the ReferenceNet so that its features only need to be extracted once?
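
A hedged sketch of what "extract features once" could look like at inference: run ReferenceNet a single time with a fixed timestep and let its attention feature banks be read, not re-written, at every denoising step. Only the call signature is taken from the training traceback above; the bank/hook mechanism is an assumption, not this repo's actual API:

    import torch

    @torch.no_grad()
    def cache_reference_features(referencenet, latents_ref, encoder_hidden_states):
        """Run ReferenceNet once with a fixed timestep so its features can be reused."""
        fixed_t = torch.zeros(latents_ref.shape[0], dtype=torch.long, device=latents_ref.device)
        referencenet(latents_ref, fixed_t, encoder_hidden_states)
        # Assumption: the reference-attention modules stash their hidden states during this
        # forward pass; the video UNet then reads those banks at every denoising step.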

model saving

Hi, it seems that you train the 2D UNet, ReferenceNet, and PoseGuider during the first stage,
but you don't save the parameters of the 2D UNet.
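
Illustratively, one way to persist all three stage-1 modules; the file layout is an assumption, not this repo's checkpoint format:

    import os
    import torch

    def save_stage1_checkpoint(save_dir, step, unet, referencenet, poseguider):
        """Save the 2D UNet together with ReferenceNet and PoseGuider."""
        os.makedirs(save_dir, exist_ok=True)
        torch.save(
            {
                "step": step,
                "unet": unet.state_dict(),
                "referencenet": referencenet.state_dict(),
                "poseguider": poseguider.state_dict(),
            },
            os.path.join(save_dir, f"checkpoint-{step}.pt"),
        )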

about the result of the first stage

my config:
train_data:
csv_path: ../TikTok_info.csv
video_folder: ../TikTok_dataset/TikTok_dataset
sample_size: 512
sample_stride: 4
sample_n_frames: 16
clip_model_path: openai/clip-vit-base-patch32

gradient_accumulation_steps: 128
batch_size: 1
using 1 V100; optimizer = torch.optim.SGD(trainable_params, lr=learning_rate / gradient_accumulation_steps, momentum=0.9)
Result after 20,000 steps:
[sample image]

Could it be because my 20,000 steps here are actually only equivalent to a bit more than 300 steps at batch size 64, or is there some other reason?
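
A quick sanity check of that arithmetic, assuming the logged 20,000 "steps" count per-sample data iterations (micro-batches of size 1) rather than optimizer updates:

    micro_batch_size = 1
    grad_accumulation_steps = 128
    logged_steps = 20_000                                        # assumed to be data iterations

    samples_seen = micro_batch_size * logged_steps               # 20,000 samples
    optimizer_updates = logged_steps // grad_accumulation_steps  # 156 parameter updates
    equivalent_bs64_steps = samples_seen / 64                    # 312.5 "steps" at batch size 64
    print(optimizer_updates, equivalent_bs64_steps)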
