lucidrains / nuwa-pytorch
Implementation of NÜWA, state-of-the-art attention network for text-to-video synthesis, in Pytorch
License: MIT
The output contains a lot of negative numbers and very small decimals (like 5e-1), but the loss decreases normally during training.
Is that a normal situation? How can I make the result viewable?
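A minimal sketch of one common way to inspect such outputs, assuming the negative values come from VQ-GAN reconstructions that live roughly in the [-1, 1] range (the function name and tensor layout here are my own assumptions, not from the repo):

```python
import torch
from torchvision.utils import save_image

def save_reconstruction(recon: torch.Tensor, path: str = 'recon.png'):
    # recon assumed to be (batch, 3, height, width) with values roughly in [-1, 1]
    img = recon.detach().float().add(1).div(2).clamp(0, 1)  # map [-1, 1] -> [0, 1]
    save_image(img, path)                                   # writes a viewable PNG grid
```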
Hi,
First, thanks a lot for the amazing work!
I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the NUWA model will be trained on?
What I mean is, if I want to train NUWA to generate sports videos from text, do I also need to train the VQ-GAN on a sports dataset?
Thanks a lot
I don't understand the complex einops 😢
Can someone give the 3DNA pseudocode to illustrate what's going on 🤗
(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)
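Not the repo's actual einops implementation, but a brute-force sketch of what 3D Nearby Attention (3DNA) does as I read the paper: each query token at position (t, h, w) of the video token grid only attends to keys within a local 3D window around it. This version builds an explicit O(N²) neighborhood mask just to illustrate the idea; names and the `window` parameter are my own.

```python
import torch

def nearby_attention_3d(q, k, v, video_shape, window=(1, 3, 3)):
    """
    q, k, v:      (batch, heads, N, dim_head), where N = T * H * W video tokens
    video_shape:  (T, H, W) token grid of the video
    window:       half-extent of the neighborhood along (time, height, width)
    """
    T, H, W = video_shape
    device = q.device

    # (t, h, w) coordinates of every token in the flattened sequence
    grid = torch.stack(torch.meshgrid(
        torch.arange(T, device=device),
        torch.arange(H, device=device),
        torch.arange(W, device=device),
        indexing='ij'), dim=-1).reshape(-1, 3)            # (N, 3)

    # token j is "nearby" token i if it is within the window on every axis
    dist = (grid[:, None, :] - grid[None, :, :]).abs()    # (N, N, 3)
    wt, wh, ww = window
    nearby = (dist[..., 0] <= wt) & (dist[..., 1] <= wh) & (dist[..., 2] <= ww)

    # standard scaled dot-product attention, restricted to the local neighborhood
    sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * q.shape[-1] ** -0.5
    sim = sim.masked_fill(~nearby, float('-inf'))
    attn = sim.softmax(dim=-1)
    return torch.einsum('b h i j, b h j d -> b h i d', attn, v)
```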
I'm not considering training, just the video generation inference task.
Thank you very much for any advice!
:D
I'm confused that, in the forward() function of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from the generate() function.
frame_embeddings = self.video_transformer(
frame_embeddings, # calculated from ground-truth video
context = text_embeds,
context_mask = text_mask
)
So when training NUWA, the loss comes from the logits. But the logits come not only from the text, but also from the ground-truth video (a single transformer pass, different from the auto-regressive loop in the generate function). Is that some kind of cheating during training? Or should I generate the logits the same way as in generate(), and then calculate the loss to train?
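For reference, a minimal sketch of the standard teacher-forcing pattern this question is describing (not the repo's exact code; the `transformer` call signature and names are illustrative): the ground-truth tokens are fed in a single causally-masked pass, and the loss compares each position's logits against the *next* ground-truth token, so the model never consumes its own predictions during training.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(transformer, video_tokens, text_embeds):
    # video_tokens: (batch, seq) ground-truth discrete codes from the VQ-GAN
    inp, target = video_tokens[:, :-1], video_tokens[:, 1:]  # shift targets by one
    logits = transformer(inp, context=text_embeds)           # causal masking inside
    # logits: (batch, seq - 1, vocab); cross_entropy wants (batch, vocab, seq - 1)
    return F.cross_entropy(logits.transpose(1, 2), target)
```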
Hi lucidrains! Thanks for providing a great repo that makes the NUWA paper easy to understand.
I have a question as follows:
In the NUWA paper, we can see that the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, as I see it, the video token sequence should fully self-attend in the Encoder, right? And then the outputs condition the Decoder.
The Decoder you provide has causal self-attention and text conditioning, as we expected. But by the definition in the paper, the conditioning contains both the text condition and the 3DNA condition, and these two together condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper.
The Encoder in your repo is only a text encoder, so the video does not pass through an encoder before conditioning the Decoder.
Looking forward to your reply! Thanks!
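To make the structure under discussion concrete, here is my own minimal sketch of a decoder block with causal self-attention over video tokens plus cross-attention to a text context, using plain `nn.MultiheadAttention` rather than the repo's classes; this is only an illustration of the conditioning path, not the repo's implementation.

```python
import torch
from torch import nn

class DecoderBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: (batch, n, dim) video tokens; context: (batch, m, dim) text embeddings
        n = x.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)

        q = self.norm1(x)
        h, _ = self.self_attn(q, q, q, attn_mask=causal)   # causal self-attention
        x = x + h

        h, _ = self.cross_attn(self.norm2(x), context, context)  # text conditioning
        x = x + h

        return x + self.ff(self.norm3(x))
```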
https://nuwa-infinity.microsoft.com/#/
For doing very long sequences (images and videos)