lucidrains / nuwa-pytorch

Implementation of NÜWA, a state-of-the-art attention network for text-to-video synthesis, in PyTorch

License: MIT License

Python 100.00%
artificial-intelligence deep-learning transformers attention-mechanism text-to-video text-to-audio

nuwa-pytorch's Issues

Question about generated videos?

There are a lot of negative numbers and very small decimals (like 5e-1) in the generated videos, but the loss decreases normally during training.
Is that normal? How can I make the result viewable?
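
For reference, here is a minimal sketch of making such an output viewable, assuming the sampled video tensor is float-valued in roughly [-1, 1] (common for VAE-style reconstructions); the tensor shape and file names below are hypothetical, not the repo's API:

import torch
from torchvision.utils import save_image

# hypothetical sampled output: (batch, frames, channels, height, width),
# floats roughly in [-1, 1] -- negative values and small decimals are expected
video = torch.randn(1, 10, 3, 256, 256).clamp(-1., 1.)

# rescale [-1, 1] -> [0, 1] so the frames can be written as image files
video = (video + 1.) * 0.5

for i, frame in enumerate(video[0]):
    save_image(frame, f'frame_{i:03d}.png')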

Type of dataset for training VQ-GAN

Hi,

First, thanks a lot for the amazing work!
I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the NUWA model will be trained on?
What I mean is: if I want to train NUWA to generate sports videos from text, do I also need to train the VQ-GAN on a sports dataset?

Thanks a lot
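
Not an official answer, but since the VQ-GAN codebook determines what the second-stage model can reconstruct, training it on frames from the target domain is the usual practice. Below is a minimal sketch of a frame dataset for that purpose, assuming the frames have already been extracted from the sports videos (paths and sizes are placeholders; this is not the repo's trainer API):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms as T

class FrameDataset(Dataset):
    # serves individual frames (e.g. pre-extracted with ffmpeg) from a
    # folder of domain-specific videos, for VQ-GAN training
    def __init__(self, folder, image_size = 256):
        self.paths = sorted(Path(folder).glob('**/*.png'))
        self.transform = T.Compose([
            T.Resize(image_size),
            T.CenterCrop(image_size),
            T.ToTensor(),  # floats in [0, 1]
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.transform(Image.open(self.paths[idx]).convert('RGB'))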

Pseudocode for 3DNA?

I can't make sense of the complex einops 😢

Can someone give pseudocode for 3DNA to illustrate what's going on? 🤗 (See the sketch below.)

(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)
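
Not the repo's implementation, but here is a rough, loop-based sketch of what 3D Nearby Attention (3DNA) does: each query position attends only to keys inside a local (frames × height × width) window around it, instead of the full token grid. Heads, masking, and the causal constraint are omitted for clarity:

import torch

def nearby_attn_3d(q, k, v, window = (3, 3, 3)):
    # q, k, v: (batch, T, H, W, dim) token grids
    b, T, H, W, d = q.shape
    wt, wh, ww = (s // 2 for s in window)
    out = torch.empty_like(q)

    for t in range(T):
        for h in range(H):
            for w in range(W):
                # clip the local neighborhood at the grid boundaries
                ts, te = max(t - wt, 0), min(t + wt + 1, T)
                hs, he = max(h - wh, 0), min(h + wh + 1, H)
                ws, we = max(w - ww, 0), min(w + ww + 1, W)

                keys = k[:, ts:te, hs:he, ws:we].reshape(b, -1, d)
                vals = v[:, ts:te, hs:he, ws:we].reshape(b, -1, d)
                query = q[:, t, h, w].unsqueeze(1)  # (b, 1, d)

                attn = (query @ keys.transpose(-1, -2)) * d ** -0.5
                out[:, t, h, w] = (attn.softmax(dim = -1) @ vals).squeeze(1)
    return out

The point of 3DNA is that the cost per query scales with the window size rather than with T * H * W; the einops in the repo is presumably doing the same gathering of local neighborhoods, just vectorized.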

Questions about the forward() function in NUWA, please.

I'm confused that, in the forward() function of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from the generate() function.

frame_embeddings = self.video_transformer(
    frame_embeddings,  # calculated from the ground-truth video
    context = text_embeds,
    context_mask = text_mask
)

So when training NUWA, the loss comes from the logits. But the logits come not only from the text, but also from the ground-truth video (in a single transformer pass, unlike the token-by-token autoregressive loop in generate()). Is that some kind of cheating during training? Or should I generate the logits the same way as in generate() and then compute the loss?
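
Not cheating: this is standard teacher forcing for autoregressive models. The causal attention mask guarantees that the logits at position i depend only on ground-truth tokens before i (plus the text), so training in one parallel pass optimizes the same next-token objective that generate() samples from one token at a time. A self-contained sketch of the pattern (generic PyTorch, not the repo's code):

import torch
import torch.nn.functional as F

# hypothetical sizes: vocab of 1000 video tokens, sequence of 16, dim 64
tokens = torch.randint(0, 1000, (1, 16))  # ground-truth video tokens
embed = torch.nn.Embedding(1000, 64)
attn = torch.nn.MultiheadAttention(64, 4, batch_first = True)
to_logits = torch.nn.Linear(64, 1000)

x = embed(tokens)
n = x.shape[1]
# True entries are masked out: position i may attend only to positions <= i
causal_mask = torch.triu(torch.ones(n, n, dtype = torch.bool), diagonal = 1)

out, _ = attn(x, x, x, attn_mask = causal_mask)
logits = to_logits(out)

# shift by one: the logits at position i are scored against token i + 1,
# which the masked attention has never seen
loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), tokens[:, 1:])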

Why does the video not pass through the encoder?

Hi, lucidrains! Thanks for providing a great repo that makes the NUWA paper easy to understand.
I have a question as follows:
In the NUWA paper, we can see that the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, in my view, the video token sequence should fully self-attend in the Encoder, right? And then the outputs condition the Decoder.
The Decoder you provide is as follows.
[Screenshot 2022-05-12, 11:07 AM: decoder code]
It has causal self-attention and text conditioning, as expected. But by the definition in the paper, the condition contains both the text condition and the 3DNA condition, and these two together condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper.
The Encoder in your repo is only the text encoder; the video does not pass through an encoder to condition the Decoder.

Looking forward to your reply! Thanks!
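
For illustration, here is a minimal sketch of the decoder pattern under discussion: causal self-attention over the video tokens plus cross-attention to the text embeddings. This is generic PyTorch (layer norms omitted), not the repo's code; in this arrangement the video tokens never pass through an encoder, and they condition the prediction only through the decoder's causal self-attention:

import torch
from torch import nn

class DecoderBlock(nn.Module):
    def __init__(self, dim = 64, heads = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, video_tokens, text_embeds):
        n = video_tokens.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype = torch.bool, device = video_tokens.device), 1)
        x = video_tokens
        x = x + self.self_attn(x, x, x, attn_mask = causal)[0]   # causal self-attention
        x = x + self.cross_attn(x, text_embeds, text_embeds)[0]  # condition on text
        return x + self.ff(x)

block = DecoderBlock()
out = block(torch.randn(1, 16, 64), torch.randn(1, 8, 64))  # video tokens, text embeds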
