thudm / cogvideo
3.5K 3.5K 376.0 131.76 MB

Text-to-video generation. The repo for the ICLR 2023 paper "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers".

License: Apache License 2.0

Python 95.61% Shell 4.02% Dockerfile 0.37%

cogvideo's People

Contributors

ak391, mallorbc, wenyihong


cogvideo's Issues

Not working

Device info:
GPU Type: A100, 40G memory
Python 3.8.10 (default, Jun 4 2021, 15:09:15)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:00:07.0 Off | Off |
| N/A 32C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Linux autodl-container-a3d5118ffa-751dc0f2 5.4.0-99-generic #112-Ubuntu SMP Thu Feb 3 13:50:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Error info:
global rank 0 is loading checkpoint /sharefs/cogview-new/cogvideo-stage1/27000/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "cogvideo_pipeline.py", line 793, in
main(args)
File "cogvideo_pipeline.py", line 426, in main
model_stage1, args = InferenceModel_Sequential.from_pretrained(args, 'cogvideo-stage1')
File "/root/miniconda3/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 155, in from_pretrained
load_checkpoint(model, args, load_path=model_path)
File "/root/miniconda3/lib/python3.8/site-packages/SwissArmyTransformer/training/model_io.py", line 162, in load_checkpoint
sd = torch.load(checkpoint_name, map_location='cpu')
File "/root/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 777, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/root/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 282, in init
super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
root@autodl-container-a3d5118ffa-751dc0f2:~/autodl-tmp/CogVideo-main#
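
This RuntimeError usually means the downloaded checkpoint is incomplete or corrupted: PyTorch checkpoints (torch.save since 1.6) are zip archives, and a truncated download has no central directory. A minimal sketch, separate from the repository's code, that checks the archive before retrying torch.load, using the checkpoint path from the log above:

import zipfile

ckpt = "/sharefs/cogview-new/cogvideo-stage1/27000/mp_rank_00_model_states.pt"  # path from the log above

# An interrupted download drops the zip central directory and produces
# exactly this error when torch.load tries to open the archive.
if not zipfile.is_zipfile(ckpt):
    print("Not a valid zip archive: re-download the checkpoint and check free disk space.")
else:
    with zipfile.ZipFile(ckpt) as zf:
        bad = zf.testzip()  # first corrupted member, or None if intact
        if bad:
            print("Corrupted member:", bad)
        else:
            print("Archive looks intact.")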

A segmentation fault was encountered during inference

(CogVideo) C:\Users\SAS\Desktop\CogVideo-main>sh scripts/inference_cogvideo_pipeline.sh
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
> initializing model parallel with size 1
DEBUG:filelock:Attempting to acquire lock 1949198065920 on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Lock 1949198065920 acquired on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Attempting to release lock 1949198065920 on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Lock 1949198065920 released on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
building InferenceModel_Sequential model ...
scripts/inference_cogvideo_pipeline.sh: line 38:  1209 Segmentation fault      MASTER_PORT=${MASTER_PORT} SAT_HOME=/sharefs/cogview-new python cogvideo_pipeline.py --input-source interactive --output-path ./output --parallel-size 1 --both-stages --use-guidance-stage1 --guidance-alpha 3.0 --generate-frame-num 5 --tokenizer-type fake --mode inference --distributed-backend nccl --fp16 --model-parallel-size $MPSIZE --temperature $TEMP --coglm-temperature2 0.89 --top_k $TOPK --sandwich-ln --seed 1234 --num-workers 0 --batch-size 4 --max-inference-batch-size 8 $@

Hi! I'm running into the issue above, and reinstalling icetk did not fix it. I noticed that while the script executes my CPU memory usage keeps rising until it fills all 15.9/15.9 GB, and disk usage also temporarily grows by more than a dozen GB. How much CPU memory and disk space are required to run the model, and is the segmentation fault above caused by this? Thank you!
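
On the CPU-memory question: the pipeline loads the stage-1 checkpoint with map_location='cpu' (see the traceback in the issue above), which materialises the full state dict in RAM, so a 16 GB machine can be exhausted before the GPU is even used. A minimal monitoring sketch, assuming psutil is installed (it is not a repository dependency) and using a placeholder checkpoint path:

import os
import psutil
import torch

proc = psutil.Process(os.getpid())
print(f"RSS before load: {proc.memory_info().rss / 2**30:.2f} GiB")

# Placeholder path: point this at the downloaded stage-1 checkpoint.
sd = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
print(f"RSS after load:  {proc.memory_info().rss / 2**30:.2f} GiB")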

Computational requirements to train CogVideo

Hi,

First of all, great work in developing CogVideo. Could you please share how many GPUs were used and how long it took to train the model?

Thanks
Gaurav

About the computational resources used for training CogVideo.

Hi authors, thanks for sharing the nice work. I'm very interested in it!

Could you provide some information about the computational resources (e.g., how many A100 GPUs) needed to pre-train CogVideo on the 5.4M captioned videos and fine-tune it on UCF-101 and Kinetics-600?

'RuntimeError: CUDA out of memory.' when using an RTX 3080

My GPU is an RTX 3080, but when I run the command sudo sh ./scripts/inference_cogvideo_pipeline.sh, the following error occurs:

RuntimeError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 9.78 GiB total capacity; 9.53 GiB already allocated; 28.31 MiB free; 9.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
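
As the error message itself suggests, setting max_split_size_mb can reduce allocator fragmentation; it has to be set before the first CUDA allocation. A minimal sketch (the value 128 is an arbitrary assumption), with the caveat that this only helps with fragmentation: if the model genuinely needs more than the ~9.78 GiB available on an RTX 3080, it will still run out of memory.

import os

# Must be set before torch creates its first CUDA tensor,
# otherwise the allocator config is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB total")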

Data source

Great work!
I'm curious about the collection of the 5.4M pretraining videos. Were they crawled from the web, or are they a combination of multiple datasets? And are they planned to be released in the future?

Make a new demo on Hugging Face

The old CogVideo Space on Hugging Face was removed, and I'm not using Replicate because it requires a credit card. Please make a new demo of CogVideo.

CUDA out of memory on an RTX 4090

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.56 GiB (GPU 0; 23.68 GiB total capacity; 18.96 GiB already allocated; 2.84 GiB free; 18.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

super-resolution

Hello authors, I'd like to ask about the purpose and concrete procedure of the super-resolution step (in the code I can see it is part of the second stage), but I could not find a corresponding explanation in the paper.

Thank you.

add web demo/model to Hugging Face

Hi, would you be interested in adding CogVideo to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models, datasets, and Spaces (web demos) can be added to a user account or organization, similar to GitHub.

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

And here are guides for adding Spaces, models, and datasets to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces
How to add models: https://huggingface.co/docs/hub/adding-a-model
Uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE exception

Environment:
torch version: 1.13.0+cu117
CUDA: 11.6

detail info:

INFO:root:[Generating First Frame with CogView2]Raw text: 一个男人在滑雪 高清摄影
Traceback (most recent call last):
File "cogvideo_pipeline.py", line 793, in
main(args)
File "cogvideo_pipeline.py", line 736, in main
parent_given_tokens = process_stage1(model_stage1, raw_text, duration=4.0, video_raw_text=raw_text, video_guidance_text="视频",
File "cogvideo_pipeline.py", line 611, in process_stage1
my_filling_sequence(model, args,seq_1st.clone(),
File "cogvideo_pipeline.py", line 225, in my_filling_sequence
logits, *output_per_layers = model(
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 114, in forward
return self.transformer(*args, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 560, in forward
layer_ret = layer(*args, layer_id=torch.tensor(i), **kw_args, **output_cross_layer,
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 330, in forward
return HOOKS_DEFAULT['layer_forward'](self, hidden_states, mask, *args, **kw_args)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/transformer_defaults.py", line 134, in layer_forward_default
attention_output = self.attention(attention_input, mask, **kw_args)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 105, in forward
return self.hooks['attention_forward'](hidden_states, mask, **kw_args)
File "/data/limin.long/CogVideo/models/cogvideo_cache_model.py", line 624, in attention_forward
context_text, context_frame_local_text = attention_localframe_and_text_NAR(
File "/data/limin.long/CogVideo/models/cogvideo_cache_model.py", line 461, in attention_localframe_and_text_NAR
score_any2text = torch.matmul(q0 / math.sqrt(q0.shape[-1]), k0T[..., :text_len])
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
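
CUBLAS_STATUS_INVALID_VALUE on an fp16 batched matmul can indicate a toolchain mismatch (the report shows a torch 1.13.0+cu117 wheel on a system with CUDA 11.6) rather than a bug in the pipeline itself. A minimal smoke test, separate from the repository's code, to check whether half-precision batched matmul works at all in the current environment:

import torch

print("torch:", torch.__version__)            # e.g. 1.13.0+cu117
print("built for CUDA:", torch.version.cuda)  # CUDA runtime the wheel targets
print("device:", torch.cuda.get_device_name(0))

# Same kind of operation as the failing call: a half-precision batched matmul.
a = torch.randn(4, 64, 64, device="cuda", dtype=torch.float16)
b = torch.randn(4, 64, 64, device="cuda", dtype=torch.float16)
print("fp16 matmul OK:", torch.matmul(a, b).shape)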

What code was used for evaluating Fréchet Video Distance (FVD)?

Hi hong and the whole THUDM team, thanks for your hard work and CogVideo seems really interesting!

In the "5.1 Machine Evaluation" section of your paper, you mentioned Inception Score(IS) was calculated using the official code of TGAN-v2, that's nice and handy.
But i can't find out how Fréchet Video Distance(FVD) was evaluated. More specifically, which library or code did you choose for evaluating FVD? I carefully looked into your paper and codebase but didn't find some clue.

Did i miss something? Or could you please give me some hint? Thanks in advance!
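
For reference (this is not the authors' evaluation code): FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, typically extracted with a pretrained I3D network. Given two arrays of precomputed features, a minimal sketch of that distance:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g, disp=False)[0].real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))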

About using a pretrained image model's weights in a video task

Hi! I've read your paper, and it's really interesting work. I'm interested in the method you use to leverage pretrained weights from an image model, and I'd like to try it in my own task. However, your architecture seems designed for autoregressive tasks, whereas I want to use it for video classification.

Could you give me some advice on a proper way to use an image model's pretrained weights in a transformer-based video task?

Code license

What is the license for this code? Could you add a LICENSE file to this repo?

Demonstration data

Thanks for the amazing work!

Can I check where the demonstration dataset comes from? Is any part of it publicly available?

Thanks.

Any description of the dataset used for pre-training?

Hi authors,

Congratulations on your great work! I have read through the paper, but I found no description of the source of the dataset used for pre-training. Could you please share some information on which dataset you used, or how you collected the data for pretraining?

Regards,
DQ

About 3D Swin Attention

In your description of the dual-channel attention, the attention-base's and attention-plus's patches are added together at the end. But in the original 3D Swin Attention, videos are divided into 3D patches, which cannot simply be added to 2D patches. Did you just divide frames into 2D patches and apply the 3D Swin attention method to them?
