
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

[Paper] [Website] [Dataset] [Checkpoint]

Abstract

Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence under human evaluation.
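
To make the core idea concrete, here is a minimal sketch of time-aligned conditioning (hypothetical, simplified code; not the actual implementation): the generated frames are split into contiguous chunks, one per scene, and each chunk is conditioned on the embedding of its own scene caption rather than on a single merged caption.

# Sketch: route each scene caption's embedding to that scene's frames.
# caption_embeds: one text embedding per scene; num_frames: total frames.
def time_aligned_embeddings(caption_embeds, num_frames):
    num_scenes = len(caption_embeds)
    frames_per_scene = num_frames // num_scenes
    per_frame = []
    for s in range(num_scenes):
        per_frame.extend([caption_embeds[s]] * frames_per_scene)
    # any leftover frames are conditioned on the last scene's caption
    per_frame.extend([caption_embeds[-1]] * (num_frames - len(per_frame)))
    return per_frame

A merged-captions baseline would instead condition every frame on the embedding of one concatenated caption.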

Examples

Scene 1: Superman is surfing on the waves. Scene 2: Superman falls into the water.
Baseline (Merging Captions) TALC (Ours)
Scene 1: Spiderman is surfing on the waves. Scene 2: Darth Vader is surfing on the same waves.
Baseline (Merging Captions) TALC (Ours)
Scene 1: A stuffed toy is lying on the road. Scene 2: A person enters and picks the stuffed toy.
Baseline (Merging Captions) TALC (Ours)
Scene 1: Red panda is moving in the forest. Scene 2: The red panda spots a treasure chest. Scene 3: The red panda finds a map inside the treasure chest.
Baseline (Merging Captions) TALC (Ours)
Scene 1: A koala climbs a tree. Scene 2: The koala eats the eucalyptus leaves. Scene 3: The koala takes a nap.
Baseline (Merging Captions) TALC (Ours)

Installation

  1. Create the conda environment:
conda create -n talc python=3.10
conda activate talc
  2. Install the dependencies:
pip install -r requirements.txt
conda install -c menpo opencv
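
After installation, a quick sanity check (assuming PyTorch is pulled in by requirements.txt and a CUDA-capable GPU is available) is:
python -c "import torch, cv2; print(torch.cuda.is_available(), cv2.__version__)"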

Inference

  1. We provide a sample command to generate multi-scene (n = 2) videos from the base ModelScopeT2V model using the TALC framework:
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --talc --captions "koala is climbing a tree." "kangaroo is eating fruits."
  2. In the above command, replacing --talc with --merge will generate a separate video scene for each caption and output a merged video.
  3. To perform inference using the merging captions method, you can use:
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --captions "koala is climbing a tree." "kangaroo is eating fruits."
  4. To generate multi-scene videos using the TALC-finetuned model, the command is:
CUDA_VISIBLE_DEVICES=4 python inference.py --outfile test_scene.mp4 --model-name-path talc_finetuned_modelscope_t2v --talc --captions "spiderman surfing in the ocean." "darth vader surfing in the ocean."
  5. In essence, the changes that support the TALC framework live in pipeline_text_to_video_synth.py; a simplified sketch of the idea follows.
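
The sketch below is hypothetical, simplified code (the actual logic is in pipeline_text_to_video_synth.py): each scene caption is encoded separately, rather than encoding one merged caption, so that the per-scene embeddings can later be routed to the frames of the corresponding scene.

# Sketch (hypothetical names): encode each scene caption separately.
def encode_scene_captions(captions, tokenizer, text_encoder, device):
    scene_embeds = []
    for caption in captions:
        inputs = tokenizer(caption, padding="max_length", truncation=True,
                           return_tensors="pt").to(device)
        scene_embeds.append(text_encoder(inputs.input_ids)[0])
    # downstream, the frames of scene i cross-attend to scene_embeds[i]
    # instead of a single merged-caption embedding
    return scene_embeds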

Data

Task Prompts (4 scenes)

  1. Single character under multiple visual contexts - file
  2. Different characters, single context - file
  3. Multi-scene captions from real videos - file

Finetuning Data

  1. We provide the video segments and caption dataset on HF 🤗 - Link.
  2. The data is of the given form, where c1 and c2 are the captions that align with the video segments v1 and v2, respectively:
{'captions': [c1, c2], 'video_segments': [v1, v2]}
  3. We also provide a file that maps each video segment to the number of video frames it contains, calculated using OpenCV. This information is useful for finetuning (see the snippet after this list):
{'video_segment': number_of_video_frames}
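
For reference, the frame counts can be recomputed locally with OpenCV (a sketch; the file paths are placeholders):

import cv2  # installed above via conda install -c menpo opencv

def count_frames(video_path):
    # CAP_PROP_FRAME_COUNT holds the total number of frames in the video
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return num_frames

mapping = {path: count_frames(path) for path in ["v1.mp4", "v2.mp4"]}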

Finetuning

  1. We utilize Hugging Face Accelerate to finetune the model on multiple GPUs.
  2. To set up Accelerate, run accelerate config in the terminal and use the following settings:
- multi-GPU
- (How many machines?) 1
- (..) no
- (Number of GPUs) 3
- (no/fp16/bf16) fp16
  3. Make the relevant changes to config.yaml.
  4. Set up the wandb directory by running wandb init in the terminal. If you want to disable wandb, uncomment os.environ["WANDB_DISABLED"] = "true" in train.py.
  5. Sample run command:
CUDA_VISIBLE_DEVICES=4,5,6 accelerate launch train.py --config config.yaml 
  6. We make changes to unet_3d_condition.py to support the TALC framework; a simplified sketch of the idea follows.
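
In this hypothetical, heavily simplified sketch (the actual edit is in unet_3d_condition.py), cross-attention is restricted so that each scene's frames attend only to that scene's caption embedding:

import torch

def per_scene_cross_attention(frame_feats, scene_embeds, cross_attn):
    # frame_feats: (num_frames, seq_len, dim), assumed to split evenly
    # across scenes; scene_embeds: one text embedding tensor per scene;
    # cross_attn: a cross-attention module (hypothetical signature)
    chunks = torch.chunk(frame_feats, len(scene_embeds), dim=0)
    outputs = [cross_attn(chunk, encoder_hidden_states=embed)
               for chunk, embed in zip(chunks, scene_embeds)]
    return torch.cat(outputs, dim=0)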

Automatic Evaluation

  1. We provide a script to perform automatic evaluation of the generated videos for entity consistency, background consistency, and text adherence.
  2. Sample command for eval.py to evaluate a multi-scene generated video against a two-scene description:
OPENAI_API_KEY=[OPENAI_API_KEY] python eval.py --vidpath video.mp4 --captions "an elephant is standing near the water" "the elephant plays with the water"
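
Evaluations of this kind typically sample frames from each scene of the generated video and ask a vision-language model (via the OpenAI API) to score consistency and text adherence. Here is a sketch of the frame-sampling step (hypothetical code; the actual prompting and scoring logic in eval.py may differ):

import base64
import cv2

def sample_scene_frames(video_path, num_scenes, frames_per_scene=2):
    # Evenly sample frames from each scene's portion of the video and
    # base64-encode them so they can be sent to a vision-language model.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    per_scene = total // num_scenes
    frames = []
    for s in range(num_scenes):
        for k in range(frames_per_scene):
            idx = s * per_scene + (k * per_scene) // frames_per_scene
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                _, buf = cv2.imencode(".jpg", frame)
                frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames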

Acknowledgements

  1. Diffusers Library
  2. T2V Finetuning Repo
