Open-Sora Plan

💪 Goal

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since our resources are limited, we sincerely hope the whole open-source community will contribute to this project. Pull requests are welcome!!!

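This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Rabbitpre AIGC Joint Lab (北大-兔展AIGC联合实验室). With limited resources, we have so far only set up the basic architecture and cannot run full training. We hope the open-source community will help add modules step by step and gather resources for training. The current version is still far from the goal and needs continued improvement and rapid iteration. Pull requests are welcome!!!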

Project stages:

Primary

Set up the codebase and train an unconditional model on a landscape dataset.
Train models that boost resolution and duration.

Extensions

Conduct text2video experiments on a landscape dataset.
Train the 1080p model on a video2text dataset.
Control the model with more conditions.

📰 News

[2024.03.27] 🚀🚀🚀 We released the report on VideoCausalVAE, which supports both images and videos. A reconstructed video is shown below. The text-to-video model is on the way.

test_ducks.mp4

[2024.03.10] 🚀🚀🚀 This repo supports training a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video with 30FPS (2× interpolated frames and 2× super resolution) under class-condition.
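
The time-axis arithmetic behind the 2024.03.10 entry can be checked with a short sketch. It assumes a 4× temporal compression between video frames and latent frames, which is inferred from the size headings in the training-performance tables (64 latent frames for 256 original frames) and is not stated in the entry itself. The function name is hypothetical, and the spatial side is not covered here.

```python
def latent_frames_to_seconds(latent_t, temporal_compression=4, interpolation=2, fps=30):
    """Return the playback duration (seconds) covered by `latent_t` latent frames."""
    # Frames produced by decoding the latent (assumed 4x temporal compression).
    decoded_frames = latent_t * temporal_compression
    # Frame interpolation multiplies the frame count again (2x in the entry above).
    final_frames = decoded_frames * interpolation
    return final_frames / fps


print(latent_frames_to_seconds(225))  # 60.0
```

Under these assumptions, 225 latent frames become 900 decoded frames and 1800 interpolated frames, which is 60 seconds at 30 FPS.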

[2024.03.08] We support text-conditioned training with 16 frames at 512x512. The code is mainly borrowed from Latte.

[2024.03.07] We support training with 128 frames (about 13 seconds at sample rate = 3) at 256x256, or 64 frames (about 6 seconds) at 512x512.
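
The durations in the 2024.03.07 entry follow from the frame count and the sample rate if the source video runs at 30 FPS. That frame rate is an assumption (it matches the 30FPS figure in the 2024.03.10 entry), and `clip_seconds` is a hypothetical helper, not part of this repo.

```python
def clip_seconds(num_frames, sample_rate, source_fps=30.0):
    """Return how many seconds of source video a sampled clip spans."""
    # Keeping every `sample_rate`-th frame means the clip covers
    # num_frames * sample_rate frames of the source video.
    return num_frames * sample_rate / source_fps


print(round(clip_seconds(128, 3), 1))  # 12.8
print(round(clip_seconds(64, 3), 1))   # 6.4
```

Both values round to the "about 13 seconds" and "about 6 seconds" quoted in the entry.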

[2024.03.05] See our latest todo; pull requests are welcome.

[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.

[2024.03.03] We opened some discussions to clarify several issues.

[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.

✊ Todo

Set up the codebase and train an unconditional model on a landscape dataset

Train models that boost resolution and duration

Conduct text2video experiments on landscape dataset.

Train the 1080p model on video2text dataset

Control the model with more conditions

Load pretrained weights from Latte.
Incorporating ControlNet. 🙏 [Need your contribution]

📂 Repo structure (WIP)

├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion         
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils

🛠️ Requirements and Installation

Clone this repository and navigate to the Open-Sora-Plan folder

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan

Install required packages

conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Install optional requirements such as static type checking:

pip install -e '.[dev]'
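
After installing, it can help to confirm which optional acceleration packages are actually visible to the interpreter. The sketch below is not part of this repo. It assumes the import names `flash_attn`, `xformers`, and `deepspeed` for the packages this README mentions, and uses only the standard library to look them up without importing them.

```python
import importlib.util

# Assumed import names for the optional packages mentioned in this README;
# the flash-attn distribution is imported as `flash_attn`.
OPTIONAL_MODULES = ["flash_attn", "xformers", "deepspeed"]


def check_optional_modules(names=OPTIONAL_MODULES):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_optional_modules().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry only means the module was not found in the active environment; it does not say whether the training scripts require it.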

🗝️ Usage

Datasets

Refer to Data.md

Evaluation

Refer to the document EVAL.md.

Causal Video VAE

Reconstructing

python examples/rec_video_vae.py --rec-path test_video.mp4 --video-path video.mp4 --resolution 512 --num-frames 1440 --sample-rate 1 --sample-fps 24 --device cuda --ckpt <Your ckpt>

For more details, please refer to: CausalVideoVAE Report.

VideoGPT VQVAE

Please refer to the document VQVAE.

Video Diffusion Transformer

Training

sh scripts/train.sh

Our current resources are only enough for preliminary experiments on the Sky dataset.

Sampling

sh scripts/sample.sh

Below is a visualization of the sampling results.

[Sample videos: 12s at 256x256 and 25s at 256x256]

🚀 Improved Training Performance

Compared with the original implementation, we add a selection of features that speed up training and save memory: gradient checkpointing, mixed precision training, pre-extracted features, xformers, and deepspeed. Some data points using a batch size of 1 on an A100:

64×32×32 (original size: 256×256×256)

gradient checkpointing	mixed precision	xformers	feature pre-extraction	deepspeed config	compress kv	training speed	memory
✔	✔	✔	✔	❌	❌	0.64 steps/sec	43G
✔	✔	✔	✔	Zero2	❌	0.66 steps/sec	14G
✔	✔	✔	✔	Zero2	✔	0.66 steps/sec	15G
✔	✔	✔	✔	Zero2 offload	❌	0.33 steps/sec	11G
✔	✔	✔	✔	Zero2 offload	✔	0.31 steps/sec	12G
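
To make the trade-offs in the 64×32×32 table easier to compare, the sketch below recomputes memory savings and relative speed against the first row (no deepspeed). The numbers are copied from the table, "G" is read as gigabytes (an assumption), and the names are illustrative only.

```python
# (deepspeed config, compress kv, steps/sec, memory in GB), copied from the table above.
rows = [
    ("none", False, 0.64, 43),
    ("Zero2", False, 0.66, 14),
    ("Zero2", True, 0.66, 15),
    ("Zero2 offload", False, 0.33, 11),
    ("Zero2 offload", True, 0.31, 12),
]


def relative_to_baseline(rows):
    """Return (config, compress_kv, memory saved in %, speed ratio) versus the first row."""
    _, _, base_speed, base_mem = rows[0]
    return [
        (cfg, kv, round(100 * (base_mem - mem) / base_mem, 1), round(speed / base_speed, 2))
        for cfg, kv, speed, mem in rows
    ]


for cfg, kv, saved, ratio in relative_to_baseline(rows):
    print(f"{cfg:14s} compress_kv={kv!s:5s} memory saved: {saved:5.1f}%  speed: {ratio:.2f}x")
```

On these figures, Zero2 cuts memory by roughly two thirds at about the same speed, while Zero2 offload saves somewhat more memory at about half the speed.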

128×64×64 (original size: 512×512×512)

gradient checkpointing	mixed precision	xformers	feature pre-extraction	deepspeed config	compress kv	training speed	memory
✔	✔	✔	✔	❌	❌	0.08 steps/sec	77G
✔	✔	✔	✔	Zero2	❌	0.08 steps/sec	41G
✔	✔	✔	✔	Zero2	✔	0.09 steps/sec	36G
✔	✔	✔	✔	Zero2 offload	❌	0.07 steps/sec	39G
✔	✔	✔	✔	Zero2 offload	✔	0.07 steps/sec	33G
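
The two size headings in this section are consistent with a 4× reduction along time and an 8× reduction along height and width. That reading is inferred from the numbers rather than stated by the authors, so the factors below are assumptions, and `latent_shape` is a hypothetical helper.

```python
def latent_shape(frames, height, width, temporal_factor=4, spatial_factor=8):
    """Map an original (t, h, w) video size to the corresponding latent (t, h, w) size."""
    return (frames // temporal_factor, height // spatial_factor, width // spatial_factor)


print(latent_shape(256, 256, 256))  # (64, 32, 32)
print(latent_shape(512, 512, 512))  # (128, 64, 64)
```

The sketch does not say whether the reduction comes from the autoencoder alone or from the autoencoder plus patching.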

💡 How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

Latte: The main codebase we built upon; it is a wonderful video generation model.
VideoGPT: Video Generation using VQ-VAE and Transformers.
DiT: Scalable Diffusion Models with Transformers.
FiT: Flexible Vision Transformer for Diffusion Model.
Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

See LICENSE for details.
