haoningwu3639 / storygen
[CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
Home Page: https://haoningwu3639.github.io/StoryGen_Webpage/
License: MIT License
Thanks for your great work!
I'd like to ask whether you are going to release the complete StorySalon dataset (with both prompts and images).
Regards,
The line of code `prev_prompt_ids.append(tokenizer(prev_prompt, truncation=True, padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.squeeze(0))` should not include `.squeeze(0)`. With it, the shape of `t_prev_prompt_ids` in `t_prev_prompt_ids = torch.stack(prev_prompt_ids)` is (3, 77).
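For reference, a minimal standalone sketch of the shape difference (the example prompts and the tokenizer checkpoint below are illustrative assumptions, not taken from the repository):

```python
import torch
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")  # model_max_length == 77
prev_prompts = ["a curly puppy", "a flying dragon", "a sunny meadow"]        # three previous prompts

with_squeeze, without_squeeze = [], []
for p in prev_prompts:
    ids = tokenizer(p, truncation=True, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids          # shape (1, 77)
    with_squeeze.append(ids.squeeze(0))                     # shape (77,)
    without_squeeze.append(ids)                             # shape (1, 77)

print(torch.stack(with_squeeze).shape)     # torch.Size([3, 77])
print(torch.stack(without_squeeze).shape)  # torch.Size([3, 1, 77])
```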
Could you comment on the difference in memory usage between the first and second stages of training? I found it necessary to set the batch size to 1 in order for training to fit on a 48GB A6000 GPU.
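If a larger effective batch size is needed while keeping the per-step batch at 1, gradient accumulation is a common workaround; the sketch below is a generic example built on Hugging Face Accelerate (the model, optimizer, and data are placeholders, not the repository's training code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)          # effective batch = 4 x 1
model = torch.nn.Linear(8, 1)                                     # stand-in for the UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=1)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    with accelerator.accumulate(model):                           # only steps every 4 batches
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```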
I am a newbie, and I was looking for a project that can convert my audiobook into a video based on its transcription. I haven't read the complete paper because it confuses me, so I am asking here for help. My question is: am I in the right place? Is this project what I am looking for, i.e. can it turn my audiobook into a visual video? If not, could you please point me to another project? Thanks a lot for your time and consideration.
Can you add a license for the dataset and code? Also, it is stated that the eBooks are under a CC-BY license, but it's unclear whether only YouTube videos under a CC-BY license were downloaded as well. Is this information available?
Has anybody had success in downloading the private videos in the list? It seems a lot of the videos were rejected because they are private. What command or script is being used?
Thanks for your work! When I run inference.py, the ref_image paths in inference.py do not exist, so I changed them as follows:
prompt = "The curly puppy is running after the flying dragon."
prev_p = ["The curly puppy", "The flying dragon."]
ref_image = ["./data/image/00001.png",
             "./data/image/00002.png"]
After I run CUDA_VISIBLE_DEVICES=0 accelerate launch inference.py, it shows CUDA out of memory. My GPU is an 80GB A800. Can you give me some suggestions?
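One common cause is running everything in float32; a half-precision loading sketch follows (purely illustrative, using the standard diffusers/transformers classes and the ./ckpt/stable-diffusion-v1-5 path as stand-ins, not the repository's exact inference.py):

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel

model_path = "./ckpt/stable-diffusion-v1-5"  # placeholder checkpoint path

# Hypothetical: load the sub-modules in fp16 instead of fp32 to roughly halve memory use.
vae = AutoencoderKL.from_pretrained(model_path, subfolder="vae", torch_dtype=torch.float16).to("cuda")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=torch.float16).to("cuda")
unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")

# Run generation without building autograd graphs.
with torch.inference_mode():
    ...  # call the StoryGen pipeline here with prompt / prev_p / ref_image
```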
In inference.py, the setting seems to be autoregressive and requires an initial image. How do we set it to story generation instead of continuation?
youtube-dl --write-auto-sub -o 'file%(title)s.%(ext)s' -f 135 [url]
This line of code downloads videos from YouTube, but many of them show as private videos and cannot be downloaded. Can you provide the downloaded files?
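As an aside, youtube-dl is no longer actively maintained and often fails on current YouTube pages with extraction errors; the drop-in fork yt-dlp accepts the same flags, e.g. `yt-dlp --write-auto-sub -o 'file%(title)s.%(ext)s' -f 135 [url]` (this is a general suggestion, not the repository's documented procedure, and genuinely private videos still cannot be downloaded with it either).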
When I use stage = 'no', the following error occurs:
Traceback (most recent call last):
  File "D:\AIGC\StoryGen-master\inference.py", line 139, in <module>
    test(pretrained_model_path,
  File "D:\AIGC\StoryGen-master\inference.py", line 103, in test
    output = pipeline(
  File "C:\Users\chwj0\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\AIGC\StoryGen-master\model\pipeline.py", line 443, in __call__
    for k, v in img_conditions[0].items():
AttributeError: 'NoneType' object has no attribute 'items'
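From the traceback, pipeline.py appears to iterate over img_conditions[0].items() even when no image condition is supplied. A minimal defensive guard could look like the sketch below; this is inferred from the traceback alone and is not the repository's actual code:

```python
# Hypothetical guard around model/pipeline.py line 443:
if img_conditions is not None:
    for k, v in img_conditions[0].items():
        ...  # existing image-conditioning logic
else:
    pass  # skip image conditioning when stage = 'no' provides no reference images
```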
Thank you for your great work; it's quite promising.
However, I encountered an error when running train_StorySalon_stage1.py. I have downloaded the stable-diffusion-v1-5 UNet checkpoint, but I get the error below when using the following code to load it:
unet = UNet2DConditionModel.from_pretrained(pretrained_model_path, subfolder="unet")
The error is:
Cannot load <class 'model.unet_2d_condition.UNet2DConditionModel'> from ./ckpt/stable-diffusion-v1-5/ because the following keys are missing:
up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_v.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn3.to_k.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_v.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_v.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, 
down_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.norm4.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.1.transformer_blocks.0.norm4.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.2.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm4.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm4.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.norm4.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, down_blocks.1.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn3.to_q.weight, down_blocks.0.attentions.1.transformer_blocks.0.norm4.bias, down_blocks.0.attentions.1.transformer_blocks.0.norm4.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.0.transformer_blocks.0.attn3.to_k.weight, mid_block.attentions.0.transformer_blocks.0.norm4.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn3.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_q.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm4.bias, mid_block.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn3.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn3.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn3.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn3.to_out.0.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn3.to_out.0.bias, 
up_blocks.2.attentions.0.transformer_blocks.0.attn3.to_k.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_out.0.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn3.to_v.weight.
Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomely initialize those weights or else make sure your checkpoint file is correct.
After inspecting the source code, I noticed the cause of part of the errors: the BasicTransformerBlock module in the checkpoint (runwayml's BasicTransformerBlock) does not include attn3, whereas the BasicTransformerBlock in your attention.py does, leading to a mismatch. Could you advise on how to resolve this issue?
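For what it's worth, since attn3/norm4 are layers this repository adds on top of the vanilla SD 1.5 UNet, one way past the loading error is to follow the hint in the message and let the extra layers be randomly initialized. A sketch, assuming a reasonably recent diffusers version (whether randomly initializing these layers matches the authors' intended training setup is not guaranteed):

```python
from model.unet_2d_condition import UNet2DConditionModel

# Allow keys present in the modified architecture but absent from the SD 1.5
# checkpoint (attn3 / norm4) to be randomly initialized instead of raising.
unet = UNet2DConditionModel.from_pretrained(
    "./ckpt/stable-diffusion-v1-5",
    subfolder="unet",
    low_cpu_mem_usage=False,
    device_map=None,
)
```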
I encountered loss = NaN during the second stage of training. Can you provide some suggestions? Thank you.
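Common first checks for NaN losses are skipping non-finite batches, clipping gradients, and lowering the learning rate; the sketch below is a generic illustration with placeholder model and data, not the repository's training loop:

```python
import torch

model = torch.nn.Linear(8, 1)                                 # stand-in for the diffusion UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)    # a lower LR often helps stability

for _ in range(100):                                          # placeholder training loop
    x, y = torch.randn(4, 8), torch.randn(4, 1)               # placeholder batch
    loss = torch.nn.functional.mse_loss(model(x), y)

    # Skip non-finite losses instead of propagating NaNs into the weights.
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        continue

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip exploding gradients
    optimizer.step()
    optimizer.zero_grad()
```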
Can anybody pinpoint how metadata.json is used in extracting the frames? I cannot find where it is used to extract keyframes from the videos. Also, if anyone has already processed the frames from the videos, it would be great if they could share them here.
Hello, I am very interested in your work. I'm having trouble downloading the data and I'd like your help.
I cannot download "video_url" in metadata.json through the following command.
youtube-dl --write-auto-sub -o 'file\%(title)s.%(ext)s' -f 135 https://youtu.be/JBzHBp6kfMo
[youtube] JBzHBp6kfMo: Downloading webpage
ERROR: JBzHBp6kfMo: YouTube said: Unable to extract video data
This video is a private video.
Thanks for your wonderful work. I wonder when the dataset will be released.
I tried accessing the model checkpoint link, but it seems that the model checkpoint has not been uploaded. Is there any way for me to access it?
Additionally, I'm very interested in your research, and I'm truly impressed by your efforts and contributions.