helioszhao / make-a-protagonist

Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

Home Page: https://make-a-protagonist.github.io/

License: Apache License 2.0

Python 91.46% C++ 1.36% Cuda 6.94% Shell 0.05% Cython 0.19%

make-a-protagonist's Issues

train

Should we fine-tune the text-to-image diffusion models with visual and textual clues for each video?

name '_C' is not defined

When I try the test, I get the following error:

python experts/grounded_sam_inference.py -d data/ikun/images/0000.jpg -t "a man with a basketball"
/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
  warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")
/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
_IncompatibleKeys(missing_keys=[], unexpected_keys=['label_enc.weight'])
0%| | 0/1 [00:00<?, ?it/s]/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/transformers/modeling_utils.py:866: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
0%| | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/grounded_sam_inference.py", line 208, in <module>
    boxes_filt, pred_phrases, logits_filt = get_grounding_output(
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/grounded_sam_inference.py", line 72, in get_grounding_output
    outputs = model(image[None], captions=[caption])
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py", line 313, in forward
    hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py", line 258, in forward
    memory, memory_text = self.encoder(
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py", line 576, in forward
    output = checkpoint.checkpoint(
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py", line 785, in forward
    src2 = self.self_attn(
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py", line 338, in forward
    output = MultiScaleDeformableAttnFunction.apply(
  File "/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py", line 53, in forward
    output = _C.ms_deform_attn_forward(
NameError: name '_C' is not defined
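For what it's worth, the "Failed to load custom C++ ops" warning at the top of the log means GroundingDINO's compiled extension (`groundingdino._C`) could not be imported, which is why `_C` is later undefined. Below is a minimal, hedged check sketch; the rebuild command in the comment is a typical remedy for this situation, not an official instruction from the repo.

```python
# Minimal sketch: verify whether GroundingDINO's compiled CUDA/C++ extension is importable.
import os

try:
    from groundingdino import _C  # the compiled extension; import fails if it was never built
    print("custom C++/CUDA ops available:", _C is not None)
except Exception as err:
    print("extension missing:", err)
    # Typical remedy (assumption, not from the repo docs): make sure CUDA_HOME points at the
    # same CUDA toolkit your torch build uses, then rebuild GroundingDINO in place, e.g.
    #   cd experts/GroundedSAM/GroundingDINO && pip install -e .
    print("CUDA_HOME =", os.environ.get("CUDA_HOME"))
```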

Improvements in the generation

Hi @HeliosZhao

After going through the complete code and experiments, I see the following issues.

  1. Issues with generation quality and long video generation.
  2. Some of the background from the source video is missing in the target video, even though we use the masks and edit only the protagonist.

Plans to improve:

  1. Can we use TemporalNet from ControlNet as guidance to improve consistency?
  2. Can we use pretrained text-to-video models, or train this architecture on a video dataset, to better learn the patterns? Since the current model is T2I (text-to-image), frame-to-frame consistency is low compared to the source videos.
  3. Can we use any other additional guidance for better generation?
  4. Can we use weighted temporal attention? I see the attention is calculated with a single frame. Can we use a moving weighted average so that information is preserved (an RNN-like architecture), as in the sketch below?

Can you help answer these? Thanks in advance.
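A rough sketch of item 4, purely illustrative and not the repo's implementation: keep an exponential moving average (EMA) of per-frame features so each frame also carries an RNN-like summary of earlier frames. The function name, feature shapes, and blending weights below are all assumed for the example.

```python
import torch

def ema_temporal_features(frames: torch.Tensor, alpha: float = 0.8) -> torch.Tensor:
    """frames: (f, c, h, w) per-frame features; returns the same shape, blended with an
    exponential moving average of past frames (the RNN-like "memory" from item 4)."""
    out = torch.empty_like(frames)
    running = frames[0]
    for t in range(frames.shape[0]):
        running = alpha * running + (1.0 - alpha) * frames[t]  # moving weighted average
        out[t] = 0.5 * (frames[t] + running)                   # inject temporal context
    return out

feats = torch.randn(8, 320, 64, 64)       # hypothetical 8-frame feature stack
print(ema_temporal_features(feats).shape)  # torch.Size([8, 320, 64, 64])
```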

Question: replacing the protagonist with a reference image is not very successful; the results are not great

  1. I trained the yanzi model and replaced the reference image with a picture of Gao Qiqiang, but I could not get the generated video to show Gao Qiqiang well. Later I replaced it with a picture of Trump, which also did not work; only after I also changed the prompt to Trump did the result look acceptable.
  2. I trained on a 39-frame animated sticker (a cat) myself and ran two sets of experiments:
    (1) I replaced the prompt with different animal names, such as dog, lion, and tiger, while keeping the same dog image as the reference. The generated videos matched the different nouns in the prompt and were not influenced by the reference image.
    (2) I then found several images matching those names to use as reference images, but the protagonist in the generated video was not identical to the reference image.

Training parameters: image height/width 480, frames = 39
Inference: source_protagonist: false & source_background: false

Multi GPU

I see a few issues when running the code with multiple GPUs. Is multi-GPU supported?

RuntimeError: Tensor type unknown to einops <class 'str'>

Hi,

Thanks for this repo! I want to ask whether you have encountered the following error before. Thank you!

Traceback (most recent call last):                                                                                                                                   | 0/50 [00:00<?, ?it/s]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 1, in <module>
    import argparse
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 470, in main
    sample = validation_pipeline(image=_ref_image, prompt=prompt, control_image=conditions, generator=generator, latents=ddim_inv_latent, image_embeds=image_embed, masks=masks, prior_latents=prior_embeds, prior_denoised_embeds=prior_denoised_embeds, **validation_data).videos
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in __call__
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in <listcomp>
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 483, in rearrange
    return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 412, in reduce
    return _apply_recipe(recipe, tensor, reduction_type=reduction)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 233, in _apply_recipe
    backend = get_backend(tensor)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py", line 52, in get_backend
    raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
RuntimeError: Tensor type unknown to einops <class 'str'>
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py(52)get_backend()
-> raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
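A minimal reproduction of the error mechanism, for illustration only: `einops.rearrange` accepts only tensor-like inputs, so if any element of `down_block_res_samples` arrives as a plain string (whatever upstream step produced it; a version mismatch is only a guess), `get_backend` fails exactly as in the traceback. The shapes below are made up.

```python
import torch
from einops import rearrange

video_length = 4
ok = torch.randn(video_length * 2, 8, 16, 16)  # (b f) c h w with b = 2
print(rearrange(ok, "(b f) c h w -> b c f h w", f=video_length).shape)  # (2, 8, 4, 16, 16)

bad = "not_a_tensor"  # a stray string in down_block_res_samples triggers the same failure
try:
    rearrange(bad, "(b f) c h w -> b c f h w", f=video_length)
except Exception as err:
    print(type(err).__name__, err)  # Tensor type unknown to einops <class 'str'>
```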

Frame to frame consistency

Thanks for the code. I have edited a few videos and noticed that the frame-to-frame consistency is not as smooth as in the source video. It also looks like the framing is not constant (as if the camera angle is shaking). Do you have any ideas to improve this?

Windows version?

Will there be a Windows version of this? Some packages that the project depends on, such as triton, only support Linux.

license issue

This repository contains code from XMem, which is licensed under GPL-3.0.
Does it constitute a covered work of XMem? If so, I think this whole work can only be used under GPL-3.0 rather than Apache-2.0, unless there is additional permission from the XMem authors.

Impact of masks and depth

What happens if I don't use the masks and depth during inference? What is the impact of the masks/depth on the generated video?

Training very slow after increasing by 1 frame

Hi @HeliosZhao
I see that training with 28 frames takes 7-8 s/iteration, but increasing it to 29 frames takes 44 s/iteration, which is strange. May I know what could have caused this? I see 100% GPU utilisation in both cases, and there is still memory left. I'm training on an A100 at 768x768 resolution; I only changed 'n_sample_frames' in the config.

With 28 frames:

[Screenshots: "Screenshot 2023-08-10 at 10 22 07 AM", "Screenshot 2023-08-10 at 10 22 46 AM"]

With 29 frames:

[Screenshots: "Screenshot 2023-08-10 at 10 26 46 AM", "Screenshot 2023-08-10 at 10 27 08 AM"]

Motion vectors

Hi @HeliosZhao,

How can I use motion vectors (extracted with FFmpeg) as guidance for generation during inference, the way the segmentation map is used in the current pipeline? And would it affect video generation positively?
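A hedged stand-in, not the repo's pipeline: codec motion vectors can be visualised with ffmpeg (`ffmpeg -flags2 +export_mvs -i in.mp4 -vf codecview=mv=pf+bf+bb out.mp4`), but for a per-frame motion map that could be fed as an extra condition (analogous to the segmentation map), dense optical flow is one option. The input path below is hypothetical.

```python
import cv2

cap = cv2.VideoCapture("data/ikun/video.mp4")  # hypothetical input path
ok, prev = cap.read()
assert ok, "could not read the first frame"
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # dense optical flow: (H, W, 2) per-pixel displacement between consecutive frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)
    prev_gray = gray

print(len(flows), "flow maps, shape:", flows[0].shape if flows else None)
```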

Smooth video generation

The generated video is not temporally consistent; it has flickering colours and changing backgrounds. I need suggestions to ensure temporal consistency and style consistency across all frames of the video.

Reference image in training

@HeliosZhao Why is the reference image used in training, and does it make any significant difference if I use a different masked reference image during inference? If it doesn't make any difference, then what is the use of the reference image in training?

progressbar version?

(Text-video) E:\git\Make-A-Protagonist>python experts/xmem_inference.py -d data/bird-forest/images -v bird-forest --mask_dir bird.mask
Output path not provided. By default saving to the mask dir
save_all is forced to be true in generic evaluation mode.
Hyperparameters read from the model weights: C^k=64, C^v=512, C^h=64
Single object mode: False
Traceback (most recent call last):
  File "E:\git\Make-A-Protagonist\experts\xmem_inference.py", line 103, in <module>
    for vid_reader in progressbar(meta_loader, max_value=len(meta_dataset), redirect_stdout=True):
TypeError: 'module' object is not callable

(Text-video) E:\git\Make-A-Protagonist>pip list|grep progressbar
progressbar 2.5
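For context, and only as my reading of the traceback rather than a confirmed fix: the script calls `progressbar(...)` as a function, which exists as a shortcut in the progressbar2 package, whereas in the old `progressbar` 2.5 distribution that name appears to resolve to a submodule, hence "'module' object is not callable". With progressbar2 installed (`pip install progressbar2`), the same calling pattern works:

```python
# Illustrative only: progressbar2 exposes a callable `progressbar` shortcut function.
from progressbar import progressbar

for item in progressbar(range(100), max_value=100, redirect_stdout=True):
    pass  # placeholder for per-item work (per-video readers in xmem_inference.py)
```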
