
haoningwu3639 / storygen

150 stars · 8 watchers · 5 forks · 9.43 MB

[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Home Page: https://haoningwu3639.github.io/StoryGen_Webpage/

License: MIT License

Languages: Python 99.99%, Shell 0.01%
Topics: diffusion-models, visual-storytelling, video-generation

storygen's People

Contributors

haoningwu3639 · verg-avesta


storygen's Issues

StorySalon Dataset

Thanks for your great work!

I'd like to ask whether you are going to release the complete StorySalon dataset (with both prompts and images).

Regards,

prev_prompt_ids.append(tokenizer(prev_prompt, truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.squeeze(0))

The line of code prev_prompt_ids.append(tokenizer(prev_prompt, truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.squeeze(0)) should not include .squeeze(0). The shape of t_prev_prompt_ids in t_prev_prompt_ids = torch.stack(prev_prompt_ids) is (3, 77).
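For illustration, a minimal sketch (not the repository code) of the shape difference with and without .squeeze(0), assuming the CLIP tokenizer used by Stable Diffusion (model_max_length == 77) and three hypothetical previous prompts:

import torch
from transformers import CLIPTokenizer

# Assumption: the standard SD 1.x CLIP tokenizer; the prompts are placeholders.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prev_prompts = ["first prompt", "second prompt", "third prompt"]

with_squeeze, without_squeeze = [], []
for prev_prompt in prev_prompts:
    ids = tokenizer(prev_prompt, truncation=True, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids       # shape (1, 77)
    with_squeeze.append(ids.squeeze(0))                  # shape (77,)
    without_squeeze.append(ids)                          # shape (1, 77)

print(torch.stack(with_squeeze).shape)     # torch.Size([3, 77])
print(torch.stack(without_squeeze).shape)  # torch.Size([3, 1, 77])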

Regarding the difference in memory usage between the first and second training stages, and the need to set the batch size to 1 for training to fit on a 48GB A6000 GPU.
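As a general workaround (not specific to this repository), the per-step batch size can stay at 1 while gradient accumulation restores the effective batch size. A minimal sketch with accelerate, using toy stand-ins for the real model and data:

import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Toy stand-ins: the real training script prepares its own model/optimizer/dataloader.
accelerator = Accelerator(gradient_accumulation_steps=8)   # effective batch size = 1 x 8
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(torch.randn(64, 16), batch_size=1)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):     # gradients are synced every 8 micro-batches
        loss = model(batch).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()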

Help

I am a newbie and I was looking for a project that can convert my audiobook into video based on its transcription. I didn't read the complete paper because it confuses me, so I am asking here for your help. The question is: am I in the right place? Is this project what I am looking for? Can it turn my audiobook into a visual video? If not, could you please point me to another project? Thanks a lot for your time and consideration.

Licensing

Can you add a license for the dataset and code? Also, it is specified that the e-books are under a CC-BY license, but it's unclear whether only YouTube videos under a CC-BY license were downloaded as well. Is this information available?

Downloading private videos

Has anybody had success in downloading the private videos in the list? It seems a lot of the videos were rejected because they are private. What command or script is being used?

inference CUDA out of memory

Thanks for your work! When I run inference.py, the ref_image used in inference.py does not exist, so I changed it as follows:

prompt = "The curly puppy is running after the flying dragon."
prev_p = ["The curly puppy", "The flying dragon."]
ref_image = ["./data/image/00001.png",
             "./data/image/00002.png"]

After I run CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py, it reports CUDA out of memory. My GPU is an 80GB A800. Can you give me some suggestions?
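Some general memory-saving measures for a diffusers-style pipeline (a sketch under assumptions, not the repository's documented fix; the snippet uses a stock Stable Diffusion pipeline so it runs on its own, whereas inference.py builds its own pipeline class):

import torch
from diffusers import StableDiffusionPipeline

# Assumption: the custom pipeline, like stock diffusers pipelines, benefits from
# half precision and attention slicing; swap in the class used by inference.py.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,       # halves weight/activation memory
).to("cuda")
pipe.enable_attention_slicing()      # lower peak memory at some speed cost

with torch.inference_mode():         # no autograd graph during sampling
    image = pipe("The curly puppy is running after the flying dragon.").images[0]
image.save("sample.png")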

inference.py

In inference.py, the setting seems to be autoregressive and requires an initial image. How do we set it up for story generation instead of story continuation?

youtube

youtube-dl --write-auto-sub -o 'file%(title)s.%(ext)s' -f 135 [url]
This command downloads videos from YouTube, but many of them show as private and cannot be downloaded. Can you provide the downloaded files?

stage = 'no' issue

When I use stage = 'no', the following error occurs:
Traceback (most recent call last):
  File "D:\AIGC\StoryGen-master\inference.py", line 139, in <module>
    test(pretrained_model_path,
  File "D:\AIGC\StoryGen-master\inference.py", line 103, in test
    output = pipeline(
  File "C:\Users\chwj0\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\AIGC\StoryGen-master\model\pipeline.py", line 443, in __call__
    for k,v in img_conditions[0].items():
AttributeError: 'NoneType' object has no attribute 'items'
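A possible defensive patch, sketched here under the assumption that img_conditions is simply left as None when stage = 'no' supplies no reference image (a guess at a guard, not the authors' intended fix):

from typing import Optional

def apply_image_conditions(img_conditions: Optional[list]) -> None:
    # Hypothetical guard for the loop at model/pipeline.py line 443:
    # skip image conditioning entirely when no reference image is supplied.
    if img_conditions is None:
        return
    for k, v in img_conditions[0].items():
        print(k, type(v))   # stand-in for the real per-key handling

apply_image_conditions(None)            # no AttributeError
apply_image_conditions([{"cond": 0}])   # prints: cond <class 'int'>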

Mismatch in BasicTransformerBlock

Thank you for your great work; it's quite promising.

However, I've encountered an error when running train_StorySalon_stage1.py. I have downloaded the stable-diffusion-v1-5 UNet checkpoint from this, but I get the error when using the following code to load it:

unet = UNet2DConditionModel.from_pretrained(pretrained_model_path, subfolder="unet")

The error is:

Cannot load <class 'model.unet_2d_condition.UNet2DConditionModel'> from ./ckpt/stable-diffusion-v1-5/ because the following keys are missing: 
 up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, 
down_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.1.transformer_blocks.0.norm4.bias, down_blocks.0.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, mid_block.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, mid_block.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, 
up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight. 
 Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomely initialize those weights or else make sure your checkpoint file is correct.

After inspecting the source code, I noticed the cause of some of these errors. The BasicTransformerBlock module in the checkpoint (Runway's BasicTransformerBlock) does not include attn3. However, the BasicTransformerBlock module in your attention.py does include attn3, leading to a mismatch. Could you advise on how to resolve this issue?
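Following the hint at the end of the error message, the missing attn3/norm4 weights can be randomly initialized at load time (a sketch only; whether random initialization is appropriate for stage-1 training is for the authors to confirm):

from model.unet_2d_condition import UNet2DConditionModel  # repository's custom UNet class

pretrained_model_path = "./ckpt/stable-diffusion-v1-5"     # path from the error above
unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_path,
    subfolder="unet",
    low_cpu_mem_usage=False,   # per the error message: allow missing keys...
    device_map=None,           # ...to be randomly initialized instead of failing
)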

loss=NAN

I encounter loss = NaN during the second stage of training. Can you provide some suggestions? Thank you.
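Some generic measures that often help when a loss turns to NaN (general advice, not the authors' recommendation): lower the learning rate, clip gradients, and fail fast on a non-finite loss. A minimal sketch with toy stand-ins:

import torch

model = torch.nn.Linear(8, 8)                                 # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # try a lower LR first

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()

if not torch.isfinite(loss):
    raise RuntimeError("non-finite loss; inspect the current batch and checkpoint")

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep gradients bounded
optimizer.step()
optimizer.zero_grad()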

How is metadata used in extracting frames?

Can anybody pinpoint how metadata.json is used in extracting the frames? I cannot find where it is used to extract keyframes from the videos. Also, if anyone has already processed the frames from the videos, it would be great if you could share them here.

Dataset

Hello, I have read your paper and think it is very good.
So I attempted to run the code, but unfortunately I encountered an error. Here are the details:

[error screenshot attached in the original issue]

Can I get the train and test data of the StorySalon dataset?

Please, thank you. :)

Unable to download the YouTube data; can you provide the corresponding keyframes?

Hello, I am very interested in your work. I'm having trouble downloading the data and I'd like your help.

I cannot download the "video_url" entries in metadata.json with the following command:

youtube-dl --write-auto-sub -o 'file\%(title)s.%(ext)s' -f 135 https://youtu.be/JBzHBp6kfMo
[youtube] JBzHBp6kfMo: Downloading webpage
ERROR: JBzHBp6kfMo: YouTube said: Unable to extract video data

This video is a private video.
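A hedged workaround rather than the authors' script: yt-dlp (the maintained fork of youtube-dl) can skip private or removed videos instead of aborting. The sketch below assumes metadata.json is a list of entries that each expose a "video_url" field, as described above:

import json
import yt_dlp  # pip install yt-dlp

# Assumption about structure: a list of dicts, each with a "video_url" key.
with open("metadata.json") as f:
    metadata = json.load(f)
urls = [entry["video_url"] for entry in metadata if "video_url" in entry]

ydl_opts = {
    "format": "135",                      # same format id as the original command
    "outtmpl": "file%(title)s.%(ext)s",   # same output template
    "writeautomaticsub": True,            # keep the auto-generated subtitles
    "ignoreerrors": True,                 # skip private/unavailable videos
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)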

About the dataset

Thanks for your wonderful work. I wonder when the dataset will be released.

Access to Model Checkpoint

I tried accessing the model checkpoint link, but it seems that the model checkpoint has not been uploaded. Is there any way for me to access it?

Additionally, I'm very interested in your research, and I'm truly impressed by your efforts and contributions.
