
haoningwu3639 / storygen

150 stars · 8 watchers · 5 forks · 9.43 MB

[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Home Page: https://haoningwu3639.github.io/StoryGen_Webpage/

License: MIT License

Languages: Python 99.99%, Shell 0.01%
Topics: diffusion-models, visual-storytelling, video-generation

storygen's People

Contributors

haoningwu3639 · verg-avesta


storygen's Issues

StorySalon Dataset

Thanks for your great work!

I'd like to ask whether you are going to release the complete StorySalon dataset (with both prompts and images).

Regards,

prev_prompt_ids.append(tokenizer(prev_prompt, truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.squeeze(0))

The line of code prev_prompt_ids.append(tokenizer(prev_prompt, truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.squeeze(0)) should not include .squeeze(0). The shape of t_prev_prompt_ids in t_prev_prompt_ids = torch.stack(prev_prompt_ids) is (3, 77).
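For illustration, a minimal sketch (not the repository code) of the shape difference with and without .squeeze(0), assuming the CLIP tokenizer used by Stable Diffusion (model_max_length == 77) and three hypothetical previous prompts:

import torch
from transformers import CLIPTokenizer

# Assumption: the standard SD 1.x CLIP tokenizer; the prompts are placeholders.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prev_prompts = ["first prompt", "second prompt", "third prompt"]

with_squeeze, without_squeeze = [], []
for prev_prompt in prev_prompts:
    ids = tokenizer(prev_prompt, truncation=True, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids       # shape (1, 77)
    with_squeeze.append(ids.squeeze(0))                  # shape (77,)
    without_squeeze.append(ids)                          # shape (1, 77)

print(torch.stack(with_squeeze).shape)     # torch.Size([3, 77])
print(torch.stack(without_squeeze).shape)  # torch.Size([3, 1, 77])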

Regarding the difference in memory usage between the first and second training stages, and the need to set the batch size to 1 for training to fit on a 48GB A6000 GPU.
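As a general workaround (not specific to this repository), the per-step batch size can stay at 1 while gradient accumulation restores the effective batch size. A minimal sketch with accelerate, using toy stand-ins for the real model and data:

import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Toy stand-ins: the real training script prepares its own model/optimizer/dataloader.
accelerator = Accelerator(gradient_accumulation_steps=8)   # effective batch size = 1 x 8
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(torch.randn(64, 16), batch_size=1)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):     # gradients are synced every 8 micro-batches
        loss = model(batch).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()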

Help

I am a newbie and I was looking for a project that can convert my audiobook into video based on its transcription. I didn't read the complete paper because it confuses me, so I am asking here for your help. The question is: am I in the right place? Is this project what I am looking for? Can it turn my audiobook into a visual video? If not, could you please point me to another project? Thanks a lot for your time and consideration.

Licensing

Can you add a license for the dataset and code? Also, it is specified that the e-books are under a CC-BY license, but it's unclear whether only YouTube videos under a CC-BY license were downloaded as well. Is this information available?

Downloading private videos

Has anybody had success in downloading the private videos in the list? It seems a lot of the videos were rejected because they are private. What command or script is being used?

inference CUDA out of memory

Thanks for your work! When I run inference.py, the ref_image used in inference.py does not exist, so I changed it as follows:

prompt = "The curly puppy is running after the flying dragon."
prev_p = ["The curly puppy", "The flying dragon."]
ref_image = ["./data/image/00001.png",
             "./data/image/00002.png"]

After I run CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py, it reports CUDA out of memory. My GPU is an 80GB A800. Can you give me some suggestions?
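Some general memory-saving measures for a diffusers-style pipeline (a sketch under assumptions, not the repository's documented fix; the snippet uses a stock Stable Diffusion pipeline so it runs on its own, whereas inference.py builds its own pipeline class):

import torch
from diffusers import StableDiffusionPipeline

# Assumption: the custom pipeline, like stock diffusers pipelines, benefits from
# half precision and attention slicing; swap in the class used by inference.py.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,       # halves weight/activation memory
).to("cuda")
pipe.enable_attention_slicing()      # lower peak memory at some speed cost

with torch.inference_mode():         # no autograd graph during sampling
    image = pipe("The curly puppy is running after the flying dragon.").images[0]
image.save("sample.png")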

inference.py

In inference.py, the setting seems to be autoregressive and requires an initial image. How do we set it up for story generation instead of story continuation?

youtube

youtube-dl --write-auto-sub -o 'file%(title)s.%(ext)s' -f 135 [url]
This command downloads videos from YouTube, but many of them show as private and cannot be downloaded. Can you provide the downloaded files?

stage = 'no' issue

When I use stage = 'no', the following error occurs:
Traceback (most recent call last):
  File "D:\AIGC\StoryGen-master\inference.py", line 139, in <module>
    test(pretrained_model_path,
  File "D:\AIGC\StoryGen-master\inference.py", line 103, in test
    output = pipeline(
  File "C:\Users\chwj0\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\AIGC\StoryGen-master\model\pipeline.py", line 443, in __call__
    for k,v in img_conditions[0].items():
AttributeError: 'NoneType' object has no attribute 'items'
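A possible defensive patch, sketched here under the assumption that img_conditions is simply left as None when stage = 'no' supplies no reference image (a guess at a guard, not the authors' intended fix):

from typing import Optional

def apply_image_conditions(img_conditions: Optional[list]) -> None:
    # Hypothetical guard for the loop at model/pipeline.py line 443:
    # skip image conditioning entirely when no reference image is supplied.
    if img_conditions is None:
        return
    for k, v in img_conditions[0].items():
        print(k, type(v))   # stand-in for the real per-key handling

apply_image_conditions(None)            # no AttributeError
apply_image_conditions([{"cond": 0}])   # prints: cond <class 'int'>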

Mismatch in BasicTransformerBlock

Thank you for your great work; it's quite promising.

However, I've encountered an error when running train_StorySalon_stage1.py. I have downloaded the stable-diffusion-v1-5 UNet checkpoint from this, but I get the error when using the following code to load it:

unet = UNet2DConditionModel.from_pretrained(pretrained_model_path, subfolder="unet")

The error is:

Cannot load <class 'model.unet_2d_condition.UNet2DConditionModel'> from ./ckpt/stable-diffusion-v1-5/ because the following keys are missing: 
 up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, 
down_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.1.transformer_blocks.0.norm4.bias, down_blocks.0.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, mid_block.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, mid_block.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, 
up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight. 
 Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomely initialize those weights or else make sure your checkpoint file is correct.

After inspecting the source code, I noticed the cause of some of these errors. The BasicTransformerBlock module in the checkpoint (Runway's BasicTransformerBlock) does not include attn3. However, the BasicTransformerBlock module in your attention.py does include attn3, leading to a mismatch. Could you advise on how to resolve this issue?
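Following the hint at the end of the error message, the missing attn3/norm4 weights can be randomly initialized at load time (a sketch only; whether random initialization is appropriate for stage-1 training is for the authors to confirm):

from model.unet_2d_condition import UNet2DConditionModel  # repository's custom UNet class

pretrained_model_path = "./ckpt/stable-diffusion-v1-5"     # path from the error above
unet = UNet2DConditionModel.from_pretrained(
    pretrained_model_path,
    subfolder="unet",
    low_cpu_mem_usage=False,   # per the error message: allow missing keys...
    device_map=None,           # ...to be randomly initialized instead of failing
)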

loss=NAN

I encounter loss = NaN during the second stage of training. Can you provide some suggestions? Thank you.
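Some generic measures that often help when a loss turns to NaN (general advice, not the authors' recommendation): lower the learning rate, clip gradients, and fail fast on a non-finite loss. A minimal sketch with toy stand-ins:

import torch

model = torch.nn.Linear(8, 8)                                 # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # try a lower LR first

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()

if not torch.isfinite(loss):
    raise RuntimeError("non-finite loss; inspect the current batch and checkpoint")

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # keep gradients bounded
optimizer.step()
optimizer.zero_grad()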

How is metadata used in extracting frames?

Can anybody pinpoint how metadata.json is used in extracting the frames? I cannot find where it is used to extract keyframes from the videos. Also, if anyone has already processed the frames from the videos, it would be great if you could share them here.

Dataset

Hello, I have read your paper and think it is very good.
So I attempted to run the code, but unfortunately I encountered an error. Here are the details:

[error screenshot attached in the original issue]

Can I get the train and test data of the StorySalon dataset?

Please, thank you. :)

Unable to download the YouTube data; can you provide the corresponding keyframes?

Hello, I am very interested in your work. I'm having trouble downloading the data and I'd like your help.

I cannot download the "video_url" entries in metadata.json with the following command:

youtube-dl --write-auto-sub -o 'file\%(title)s.%(ext)s' -f 135 https://youtu.be/JBzHBp6kfMo
[youtube] JBzHBp6kfMo: Downloading webpage
ERROR: JBzHBp6kfMo: YouTube said: Unable to extract video data

This video is a private video.
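A hedged workaround rather than the authors' script: yt-dlp (the maintained fork of youtube-dl) can skip private or removed videos instead of aborting. The sketch below assumes metadata.json is a list of entries that each expose a "video_url" field, as described above:

import json
import yt_dlp  # pip install yt-dlp

# Assumption about structure: a list of dicts, each with a "video_url" key.
with open("metadata.json") as f:
    metadata = json.load(f)
urls = [entry["video_url"] for entry in metadata if "video_url" in entry]

ydl_opts = {
    "format": "135",                      # same format id as the original command
    "outtmpl": "file%(title)s.%(ext)s",   # same output template
    "writeautomaticsub": True,            # keep the auto-generated subtitles
    "ignoreerrors": True,                 # skip private/unavailable videos
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)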

About the dataset

Thanks for your wonderful work. I wonder when the dataset will be released.

Access to Model Checkpoint

I tried accessing the model checkpoint link, but it seems that the model checkpoint has not been uploaded. Is there any way for me to access it?

Additionally, I'm very interested in your research, and I'm truly impressed by your efforts and contributions.
