thudm / cogvideo
3.5K 3.5K 376.0 131.76 MB

Text-to-video generation. The repo for the ICLR 2023 paper "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers".

License: Apache License 2.0

Python 95.61% Shell 4.02% Dockerfile 0.37%

cogvideo's People

Contributors

ak391, mallorbc, wenyihong


cogvideo's Issues

Not working

Device info:
GPU Type: A100, 40G memory
Python 3.8.10 (default, Jun 4 2021, 15:09:15)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:00:07.0 Off | Off |
| N/A 32C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Linux autodl-container-a3d5118ffa-751dc0f2 5.4.0-99-generic #112-Ubuntu SMP Thu Feb 3 13:50:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Error info:
global rank 0 is loading checkpoint /sharefs/cogview-new/cogvideo-stage1/27000/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "cogvideo_pipeline.py", line 793, in
main(args)
File "cogvideo_pipeline.py", line 426, in main
model_stage1, args = InferenceModel_Sequential.from_pretrained(args, 'cogvideo-stage1')
File "/root/miniconda3/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 155, in from_pretrained
load_checkpoint(model, args, load_path=model_path)
File "/root/miniconda3/lib/python3.8/site-packages/SwissArmyTransformer/training/model_io.py", line 162, in load_checkpoint
sd = torch.load(checkpoint_name, map_location='cpu')
File "/root/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 777, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/root/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 282, in init
super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
root@autodl-container-a3d5118ffa-751dc0f2:~/autodl-tmp/CogVideo-main#
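
This RuntimeError usually means the downloaded checkpoint is incomplete or corrupted: PyTorch checkpoints (torch.save since 1.6) are zip archives, and a truncated download has no central directory. A minimal sketch, separate from the repository's code, that checks the archive before retrying torch.load, using the checkpoint path from the log above:

import zipfile

ckpt = "/sharefs/cogview-new/cogvideo-stage1/27000/mp_rank_00_model_states.pt"  # path from the log above

# An interrupted download drops the zip central directory and produces
# exactly this error when torch.load tries to open the archive.
if not zipfile.is_zipfile(ckpt):
    print("Not a valid zip archive: re-download the checkpoint and check free disk space.")
else:
    with zipfile.ZipFile(ckpt) as zf:
        bad = zf.testzip()  # first corrupted member, or None if intact
        if bad:
            print("Corrupted member:", bad)
        else:
            print("Archive looks intact.")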

A segmentation fault was encountered during inference

(CogVideo) C:\Users\SAS\Desktop\CogVideo-main>sh scripts/inference_cogvideo_pipeline.sh
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
> initializing model parallel with size 1
DEBUG:filelock:Attempting to acquire lock 1949198065920 on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Lock 1949198065920 acquired on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Attempting to release lock 1949198065920 on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
DEBUG:filelock:Lock 1949198065920 released on C:/Users/SAS/anaconda3/Library/sharefs/cogview-new\cogvideo-stage1.zip.lock
building InferenceModel_Sequential model ...
scripts/inference_cogvideo_pipeline.sh: line 38:  1209 Segmentation fault      MASTER_PORT=${MASTER_PORT} SAT_HOME=/sharefs/cogview-new python cogvideo_pipeline.py --input-source interactive --output-path ./output --parallel-size 1 --both-stages --use-guidance-stage1 --guidance-alpha 3.0 --generate-frame-num 5 --tokenizer-type fake --mode inference --distributed-backend nccl --fp16 --model-parallel-size $MPSIZE --temperature $TEMP --coglm-temperature2 0.89 --top_k $TOPK --sandwich-ln --seed 1234 --num-workers 0 --batch-size 4 --max-inference-batch-size 8 $@

Hi! I'm running into the issue above, and reinstalling icetk did not fix it. I noticed that while the script executes my CPU memory usage keeps rising until it fills all 15.9/15.9 GB, and disk usage also temporarily grows by more than a dozen GB. How much CPU memory and disk space are required to run the model, and is the segmentation fault above caused by this? Thank you!
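
On the CPU-memory question: the pipeline loads the stage-1 checkpoint with map_location='cpu' (see the traceback in the issue above), which materialises the full state dict in RAM, so a 16 GB machine can be exhausted before the GPU is even used. A minimal monitoring sketch, assuming psutil is installed (it is not a repository dependency) and using a placeholder checkpoint path:

import os
import psutil
import torch

proc = psutil.Process(os.getpid())
print(f"RSS before load: {proc.memory_info().rss / 2**30:.2f} GiB")

# Placeholder path: point this at the downloaded stage-1 checkpoint.
sd = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
print(f"RSS after load:  {proc.memory_info().rss / 2**30:.2f} GiB")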

Computational requirements to train CogVideo

Hi,

First of all, great work in developing CogVideo. Could you please share how many GPUs were used and how long it took to train the model?

Thanks
Gaurav

About the computational resources used for training CogVideo.

Hi authors, thanks for sharing the nice work. I'm very interested in it!

Could you provide some information about the computational resources (e.g., how many A100 GPUs) needed to pre-train CogVideo on the 5.4M captioned videos and fine-tune it on UCF-101 and Kinetics-600?

'RuntimeError: CUDA out of memory.' when using an RTX 3080

My GPU is an RTX 3080, but when I run the command sudo sh ./scripts/inference_cogvideo_pipeline.sh, the following error occurs:

RuntimeError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 9.78 GiB total capacity; 9.53 GiB already allocated; 28.31 MiB free; 9.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
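
As the error message itself suggests, setting max_split_size_mb can reduce allocator fragmentation; it has to be set before the first CUDA allocation. A minimal sketch (the value 128 is an arbitrary assumption), with the caveat that this only helps with fragmentation: if the model genuinely needs more than the ~9.78 GiB available on an RTX 3080, it will still run out of memory.

import os

# Must be set before torch creates its first CUDA tensor,
# otherwise the allocator config is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB total")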

Data source

Great work!
I'm curious about the collection of the 5.4M pretraining videos. Were they crawled from the web, or are they a combination of multiple datasets? And are they planned to be released in the future?

Make a new demo on Hugging Face

The old CogVideo Space on Hugging Face was removed, and I'm not using Replicate because it requires a credit card. Please make a new demo of CogVideo.

CUDA out of memory on an RTX 4090

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.56 GiB (GPU 0; 23.68 GiB total capacity; 18.96 GiB already allocated; 2.84 GiB free; 18.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

super-resolution

Hello authors, I'd like to ask about the purpose and concrete procedure of the super-resolution step (in the code I can see it is part of the second stage), but I could not find a corresponding explanation in the paper.

Thank you.

add web demo/model to Hugging Face

Hi, would you be interested in adding CogVideo to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models, datasets, and Spaces (web demos) can be added to a user account or organization, similar to GitHub.

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

And here are guides for adding Spaces, models, and datasets to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces
How to add models: https://huggingface.co/docs/hub/adding-a-model
Uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE exception

Environment:
torch version: 1.13.0+cu117
CUDA: 11.6

detail info:

INFO:root:[Generating First Frame with CogView2]Raw text: 一个男人在滑雪 高清摄影
Traceback (most recent call last):
File "cogvideo_pipeline.py", line 793, in
main(args)
File "cogvideo_pipeline.py", line 736, in main
parent_given_tokens = process_stage1(model_stage1, raw_text, duration=4.0, video_raw_text=raw_text, video_guidance_text="视频",
File "cogvideo_pipeline.py", line 611, in process_stage1
my_filling_sequence(model, args,seq_1st.clone(),
File "cogvideo_pipeline.py", line 225, in my_filling_sequence
logits, *output_per_layers = model(
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/base_model.py", line 114, in forward
return self.transformer(*args, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 560, in forward
layer_ret = layer(*args, layer_id=torch.tensor(i), **kw_args, **output_cross_layer,
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 330, in forward
return HOOKS_DEFAULT['layer_forward'](self, hidden_states, mask, *args, **kw_args)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/transformer_defaults.py", line 134, in layer_forward_default
attention_output = self.attention(attention_input, mask, **kw_args)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in call_impl
return forward_call(*input, **kwargs)
File "/data/limin.long/CogVideo/venv/lib/python3.8/site-packages/SwissArmyTransformer/model/transformer.py", line 105, in forward
return self.hooks['attention_forward'](hidden_states, mask, **kw_args)
File "/data/limin.long/CogVideo/models/cogvideo_cache_model.py", line 624, in attention_forward
context_text, context_frame_local_text = attention_localframe_and_text_NAR(
File "/data/limin.long/CogVideo/models/cogvideo_cache_model.py", line 461, in attention_localframe_and_text_NAR
score_any2text = torch.matmul(q0 / math.sqrt(q0.shape[-1]), k0T[..., :text_len])
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
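
CUBLAS_STATUS_INVALID_VALUE on an fp16 batched matmul can indicate a toolchain mismatch (the report shows a torch 1.13.0+cu117 wheel on a system with CUDA 11.6) rather than a bug in the pipeline itself. A minimal smoke test, separate from the repository's code, to check whether half-precision batched matmul works at all in the current environment:

import torch

print("torch:", torch.__version__)            # e.g. 1.13.0+cu117
print("built for CUDA:", torch.version.cuda)  # CUDA runtime the wheel targets
print("device:", torch.cuda.get_device_name(0))

# Same kind of operation as the failing call: a half-precision batched matmul.
a = torch.randn(4, 64, 64, device="cuda", dtype=torch.float16)
b = torch.randn(4, 64, 64, device="cuda", dtype=torch.float16)
print("fp16 matmul OK:", torch.matmul(a, b).shape)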

What code was used for evaluating Fréchet Video Distance (FVD)?

Hi hong and the whole THUDM team, thanks for your hard work and CogVideo seems really interesting!

In the "5.1 Machine Evaluation" section of your paper, you mentioned Inception Score(IS) was calculated using the official code of TGAN-v2, that's nice and handy.
But i can't find out how Fréchet Video Distance(FVD) was evaluated. More specifically, which library or code did you choose for evaluating FVD? I carefully looked into your paper and codebase but didn't find some clue.

Did i miss something? Or could you please give me some hint? Thanks in advance!
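
For reference (this is not the authors' evaluation code): FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, typically extracted with a pretrained I3D network. Given two arrays of precomputed features, a minimal sketch of that distance:

import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g, disp=False)[0].real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))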

About using a pretrained image model's weights in a video task

Hi! I've read your paper, and it's really interesting work. I'm interested in the method you use to leverage pretrained weights from an image model, and I'd like to try it in my own task. However, your architecture seems designed for autoregressive tasks, whereas I want to use it for video classification.

Could you give me some advice on a proper way to use an image model's pretrained weights in a transformer-based video task?

Code license

What is the license for this code? Could you add a LICENSE file to this repo?

Demonstration data

Thanks for the amazing work!

Can I check where the demonstration dataset comes from? Is any part of it publicly available?

Thanks.

Any description of the dataset used for pre-training?

Hi authors,

Congratulations on your great work! I have read through the paper, but I found no description of the source of the dataset used for pre-training. Could you please share some information on which dataset you used, or how you collected the data for pretraining?

Regards,
DQ

About 3D Swin Attention

In your description of the dual-channel attention, the attention-base's and attention-plus's patches are added together at the end. But in the original 3D Swin Attention, videos are divided into 3D patches, which cannot simply be added to 2D patches. Did you just divide frames into 2D patches and apply the 3D Swin attention method to them?
