lucidrains / nuwa-pytorch
Implementation of NÜWA, state-of-the-art attention network for text-to-video synthesis, in Pytorch
License: MIT
The output contains a lot of negative numbers and very small decimals (like 5e-1), but the loss decreases normally during training.
Is that a normal situation? How can I make the result viewable?
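A minimal sketch of one common way to inspect such outputs, assuming the negative values come from VQ-GAN reconstructions that live roughly in the [-1, 1] range (the function name and tensor layout here are my own assumptions, not from the repo):

```python
import torch
from torchvision.utils import save_image

def save_reconstruction(recon: torch.Tensor, path: str = 'recon.png'):
    # recon assumed to be (batch, 3, height, width) with values roughly in [-1, 1]
    img = recon.detach().float().add(1).div(2).clamp(0, 1)  # map [-1, 1] -> [0, 1]
    save_image(img, path)                                   # writes a viewable PNG grid
```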
Hi,
First, thanks a lot for the amazing work!
I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the NUWA model will be trained on?
What I mean is, if I want to train NUWA to generate sports videos from text, do I also need to train the VQ-GAN on a sports dataset?
Thanks a lot
I don't understand the complex einops 😢
Can someone give the 3DNA pseudocode to illustrate what's going on 🤗
(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)
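Not the repo's actual einops implementation, but a brute-force sketch of what 3D Nearby Attention (3DNA) does as I read the paper: each query token at position (t, h, w) of the video token grid only attends to keys within a local 3D window around it. This version builds an explicit O(N²) neighborhood mask just to illustrate the idea; names and the `window` parameter are my own.

```python
import torch

def nearby_attention_3d(q, k, v, video_shape, window=(1, 3, 3)):
    """
    q, k, v:      (batch, heads, N, dim_head), where N = T * H * W video tokens
    video_shape:  (T, H, W) token grid of the video
    window:       half-extent of the neighborhood along (time, height, width)
    """
    T, H, W = video_shape
    device = q.device

    # (t, h, w) coordinates of every token in the flattened sequence
    grid = torch.stack(torch.meshgrid(
        torch.arange(T, device=device),
        torch.arange(H, device=device),
        torch.arange(W, device=device),
        indexing='ij'), dim=-1).reshape(-1, 3)            # (N, 3)

    # token j is "nearby" token i if it is within the window on every axis
    dist = (grid[:, None, :] - grid[None, :, :]).abs()    # (N, N, 3)
    wt, wh, ww = window
    nearby = (dist[..., 0] <= wt) & (dist[..., 1] <= wh) & (dist[..., 2] <= ww)

    # standard scaled dot-product attention, restricted to the local neighborhood
    sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * q.shape[-1] ** -0.5
    sim = sim.masked_fill(~nearby, float('-inf'))
    attn = sim.softmax(dim=-1)
    return torch.einsum('b h i j, b h j d -> b h i d', attn, v)
```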
I'm not considering training, just the video generation inference task.
Thank you very much for any advice!
:D
I'm confused that, in the forward() function of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from the generate() function.
frame_embeddings = self.video_transformer(
frame_embeddings, # calculated from ground-truth video
context = text_embeds,
context_mask = text_mask
)
So when training NUWA, the loss comes from the logits. But the logits come not only from the text, but also from the ground-truth video (a single transformer pass, different from the auto-regressive loop in the generate function). Is that some kind of cheating during training? Or should I generate the logits the same way as in generate(), and then calculate the loss to train?
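For reference, a minimal sketch of the standard teacher-forcing pattern this question is describing (not the repo's exact code; the `transformer` call signature and names are illustrative): the ground-truth tokens are fed in a single causally-masked pass, and the loss compares each position's logits against the *next* ground-truth token, so the model never consumes its own predictions during training.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(transformer, video_tokens, text_embeds):
    # video_tokens: (batch, seq) ground-truth discrete codes from the VQ-GAN
    inp, target = video_tokens[:, :-1], video_tokens[:, 1:]  # shift targets by one
    logits = transformer(inp, context=text_embeds)           # causal masking inside
    # logits: (batch, seq - 1, vocab); cross_entropy wants (batch, vocab, seq - 1)
    return F.cross_entropy(logits.transpose(1, 2), target)
```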
Hi lucidrains! Thanks for providing a great repo that makes the NUWA paper easy to understand.
I have a question as follows:
In the NUWA paper, we can see that the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, as I see it, the video token sequence should fully self-attend in the Encoder, right? And then the outputs condition the Decoder.
The Decoder you provide has causal self-attention and text conditioning, as we expected. But by the definition in the paper, the conditioning contains both the text condition and the 3DNA condition, and these two together condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper.
The Encoder in your repo is only a text encoder, so the video does not pass through an encoder before conditioning the Decoder.
Looking forward to your reply! Thanks!
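To make the structure under discussion concrete, here is my own minimal sketch of a decoder block with causal self-attention over video tokens plus cross-attention to a text context, using plain `nn.MultiheadAttention` rather than the repo's classes; this is only an illustration of the conditioning path, not the repo's implementation.

```python
import torch
from torch import nn

class DecoderBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: (batch, n, dim) video tokens; context: (batch, m, dim) text embeddings
        n = x.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)

        q = self.norm1(x)
        h, _ = self.self_attn(q, q, q, attn_mask=causal)   # causal self-attention
        x = x + h

        h, _ = self.cross_attn(self.norm2(x), context, context)  # text conditioning
        x = x + h

        return x + self.ff(self.norm3(x))
```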
https://nuwa-infinity.microsoft.com/#/
For doing very long sequences (images and videos)