Giter Site home page Giter Site logo

v6p / open-sora-plan Goto Github PK

View Code? Open in Web Editor NEW

This project forked from pku-yuangroup/open-sora-plan

0.0 0.0 0.0 307.9 MB

This project aim to reproduce Sora (Open AI T2V model), but we only have limited resource. We deeply wish the all open source community can contribute to this project.

License: MIT License

Shell 1.66% C++ 0.21% Python 97.79% Cuda 0.33%

open-sora-plan's Introduction

Open-Sora Plan

[Project Page] [中文主页]

slack badge WeChat badge Twitter
License GitHub repo contributors GitHub Commit Pr GitHub issues GitHub closed issues
GitHub repo stars  GitHub repo forks  GitHub repo watchers  GitHub repo size

💪 Goal

This project aims to create a simple and scalable repo, to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI" ) and build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since we have limited resources, we deeply wish all open-source community can contribute to this project. Pull requests are welcome!!!

本项目希望通过开源社区的力量复现Sora,由北大-兔展AIGC联合实验室共同发起,当前我们资源有限仅搭建了基础架构,无法进行完整训练,希望通过开源社区逐步增加模块并筹集资源进行训练,当前版本离目标差距巨大,仍需持续完善和快速迭代,欢迎Pull request!!!

Project stages:

  • Primary
  1. Setup the codebase and train a un-conditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
  • Extensions
  1. Conduct text2video experiments on landscape dataset.
  2. Train the 1080p model on video2text dataset.
  3. Control model with more conditions.

📰 News

[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.

test_ducks.mp4

[2024.03.10] 🚀🚀🚀 This repo supports training a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video with 30FPS (2× interpolated frames and 2× super resolution) under class-condition.

[2024.03.08] We support the training code of text condition with 16 frames of 512x512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.

[2024.03.05] See our latest todo, pull requests are welcome.

[2024.03.04] We re-organizes and modulizes our code to make it easy to contribute to the project, to contribute please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Setup the codebase and train a unconditional model on landscape dataset

  • Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
  • Setup environment. 🤝 Thanks to @nameless1117
  • Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
  • Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
  • Resume from checkpoint.
  • Add Video-VQGAN model, which is borrowed from VideoGPT.
  • Support variable aspect ratios, resolutions, durations training on DiT.
  • Support Dynamic mask input inspired by FiT.
  • Add class-conditioning on embeddings.
  • Incorporating Latte as main codebase.
  • Add VAE model, which is borrowed from Stable Diffusion.
  • Joint dynamic mask input with VAE.
  • Add VQVAE from VQGAN. 🙏 [Need your contribution]
  • Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
  • Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
  • Add sampling script.
  • Add DDP sampling script. ⌛ [WIP]
  • Use accelerate on multi-node. 🤝 Thanks to @sysuyy
  • Incorporate SiT. 🤝 Thanks to @khan-yin
  • Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f

Train models that boost resolution and duration

  • Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
  • Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
  • Compress KV according to PixArt-sigma.
  • Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
  • Train a low dimension Video-AE, whether it is VAE or VQVAE. ⌛ [WIP] 🚀 [Require more computation]
  • Extract offline feature.
  • Train with offline feature.
  • Add frame interpolation model. 🤝 Thanks to @yunyangge
  • Add super resolution model. 🤝 Thanks to @Linzy19
  • Add accelerate to automatically manage training.
  • Joint training with images. 🙏 [Need your contribution]
  • Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
  • Incorporate NaViT. 🙏 [Need your contribution]
  • Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]

Conduct text2video experiments on landscape dataset.

  • Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
  • Finish data loading, pre-processing utils.
  • Add T5 support.
  • Add CLIP support. 🤝 Thanks to @Ytimed2020
  • Add text2image training script.
  • Add prompt captioner.
    • Collect training data.
      • Need video-text pairs with caption. 🙏 [Need your contribution]
      • Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
      • Extract video description by large video-language models. 🙏 [Need your contribution]
      • Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
    • Train a captioner to refine captions. 🚀 [Require more computation]

Train the 1080p model on video2text dataset

  • Looking for a suitable dataset, welcome to discuss and recommend. 🙏 [Need your contribution]
  • Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
  • Finish data loading, and pre-processing utils. ⌛ [WIP]
  • Support memory friendly training.
    • Add flash-attention2 from pytorch.
    • Add xformers. 🤝 Thanks to @jialin-zhao
    • Support mixed precision training.
    • Add gradient checkpoint.
    • Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
    • Train using the deepspeed engine. 🤝 Thanks to @sennnnn
    • Integrate with Colossal-AI for a cheaper, faster, and more efficient. 🙏 [Need your contribution]
  • Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
    • Train with T5 conditioning.
    • Train with CLIP conditioning.
    • Train with CLIP + T5 conditioning (probably costly during training and experiments).

Control model with more condition

  • Load pretrained weights from Latte.
  • Incorporating ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

  1. Clone this repository and navigate to Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  1. Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  1. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
  1. Install optional requirements such as static type checking:
pip install -e '.[dev]'

🗝️ Usage

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Causal Video VAE

Reconstructing

python examples/rec_video_vae.py --rec-path test_video.mp4 --video-path video.mp4 --resolution 512 --num-frames 1440 --sample-rate 1 --sample-fps 24 -
-device cuda --ckpt <Your ckpt>

For more details, please refer to: CausalVideoVAE Report.

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/train.sh

The current resources are only enough for us to do primary experiments on the Sky dataset.

Sampling

sh scripts/sample.sh

Below is a visualization of the sampling results.

12s 256x256 25s 256x256

🚀 Improved Training Performance

In comparison to the original implementation, we implement a selection of training speed acceleration and memory saving features including gradient checkpointing, mixed precision training, and pre-extracted features, xformers, deepspeed. Some data points using a batch size of 1 with a A100:

64×32×32 (origin size: 256×256×256)

gradient checkpointing mixed precision xformers feature pre-extraction deepspeed config compress kv training speed memory
0.64 steps/sec 43G
Zero2 0.66 steps/sec 14G
Zero2 0.66 steps/sec 15G
Zero2 offload 0.33 steps/sec 11G
Zero2 offload 0.31 steps/sec 12G

128×64×64 (origin size: 512×512×512)

gradient checkpointing mixed precision xformers feature pre-extraction deepspeed config compress kv training speed memory
0.08 steps/sec 77G
Zero2 0.08 steps/sec 41G
Zero2 0.09 steps/sec 36G
Zero2 offload 0.07 steps/sec 39G
Zero2 offload 0.07 steps/sec 33G

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

  • Latte: The main codebase we built upon and it is an wonderful video gererated model.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • DiT: Scalable Diffusion Models with Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

🤝 Community contributors

open-sora-plan's People

Contributors

linb203 avatar sennnnn avatar qqingzheng avatar khan-yin avatar junwuzhang19 avatar yunyangge avatar tzy010822 avatar rain305f avatar liuhanchen-github avatar helios-fr avatar simonleegit avatar ytimed2020 avatar linzy19 avatar yanyang1024 avatar kabachuha avatar jpthu17 avatar howardli1984 avatar jason-fan20 avatar mon-ius avatar ruslanperesy avatar yuanli2333 avatar anapple-hub avatar jialin-zhao avatar glgh avatar luo3300612 avatar nameless1117 avatar sysuyy avatar touale avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.