StyleGAN2 has fascinated us with the amazing quality of images it can generate. A demonstration of its power can be seen on thispersondoesnotexist.com.
People have come up with a way to create nice videos that morph from one image to another. Because the interpolation happens in the latent space, this is called a "Latent Walk". A demonstration of this can be seen in this video, in which a beach is being generated.
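At its core, such a latent walk is just a loop over points on a line between two latent vectors, each of which is fed to the generator. The sketch below only illustrates the idea; `generate_image`, `noise_dim`, and the number of frames are placeholders, not taken from any particular repository.

```python
# Conceptual latent-walk sketch: linearly interpolate between two random latent
# vectors and render one frame per interpolation step.
import numpy as np

noise_dim = 512  # latent dimensionality (typical for StyleGAN2, but an assumption here)

def generate_image(z):
    # Placeholder for a pretrained generator call (e.g. StyleGAN2); the latent
    # vector is returned unchanged just to keep the sketch runnable.
    return z

z_start = np.random.randn(noise_dim)  # latent code of the first image
z_end = np.random.randn(noise_dim)    # latent code of the second image

frames = []
for t in np.linspace(0.0, 1.0, num=60):      # 60 in-between frames
    z = (1.0 - t) * z_start + t * z_end      # walk along the line in latent space
    frames.append(generate_image(z))         # each frame morphs a bit further
```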
This is already very cool, but there's even more. People wanted these videos to react to music, which is called an "Audioreactive Latent Walk". A demonstration of this can be seen in the video "Audio-reactive Latent Interpolations with StyleGAN2-ada" by the YouTuber Nerdy Rodent (the demo starts at 5:50).
The YouTube channel of Tim Hawkey features some more really impressive videos, mainly generated with Stable Diffusion.
Several libraries and pretrained models are needed to create such a video. This includes the standard ML libraries like numpy, tensorflow/pytorch and opencv, and of course a pretrained StyleGAN2 model. Potentially, the StyleGAN2 model could be exchanged for a Stable Diffusion model, as there have already been some attempts to create videos using Stable Diffusion.
Computing power can be accessed using either Google Colab or Lambda Labs for faster generation (we recommend this for the Stable Diffusion video generation).
We started by trying out the following technologies:
Right at the beginning, we wanted to start the project with full energy and try out the latest technologies, so our first approach was to try StyleGAN 3. Unfortunately, we quickly found out that it requires too much computing power on the one hand, and that on the other hand there are very few examples and little documentation available.
While one of us was busy with StyleGAN 3, the other was trying to get Stable Diffusion running. This worked fine and we were able to make it audioreactive. Unfortunately, when not run on a paid GPU cloud, it took too long to tune the parameters to get good images, and this carried through to generating the whole video.
After talking to our teacher, we looked around for an alternative in StyleGAN 2 and stumbled across this GitHub repo.
From the repo mentioned above, we took the maua StyleGAN approach, fixed it, and played around with it. Even though it was StyleGAN 2, it still needed a lot of computing power, which cost quite a lot to "only" run some experiments on Google Colab.
maua_stylegan_song_1.mp4
maua_stylegan_song_2.mp4
As we struggled with most other models, we decided to use a small model to test everything with, and we were even able to run BigGAN locally on our laptops. So we went with this and found a great GitHub repository, which we used to get started.
biggan.mp4
Our final product is a mixture of BigGAN and Stable Diffusion. First, we take the music input and create a video with BigGAN. We then take the frames of the BigGAN video as inputs for a Stable Diffusion Img2Img model and create a new video with Stable Diffusion.
stable_diffusion_it_2.mp4
This video was generated with the following settings:
truncation = 0.7
extra_detail = 0.9
max_frequency_level = 11000
low_frequency_skip = 16
frequency_band_growth_rate = 1.01
smoothing_factor = 0.1
iterations = 2
seed = 42
prompt = "Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, legendary, fantasy, trippy, LSD"
num_steps = 10
unconditional_guidance_scale = 7.5
temperature = 0.0
batch_size = 1
input_image = img
input_image_strength = 0.5
stable_diffusion_it_1.mp4
This one was generated with the following settings:
truncation = 0.7
extra_detail = 0.9
max_frequency_level = 11000
low_frequency_skip = 16
frequency_band_growth_rate = 1.01
smoothing_factor = 0.1
iterations = 1
seed = 42
prompt = "Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, beautiful"
num_steps = 15
unconditional_guidance_scale = 7.5
temperature = 0.0
batch_size = 1
input_image = img
input_image_strength = 0.6
- Cut the audio into small pieces
- Use an audio FFT to get a spectrogram for each piece from step 1
- Summarize the frequencies from the spectrogram
- Compute a weighted sum of random vectors based on the spectrogram strengths (each vector has noise_dim as its dimensionality; see the sketch after this list)
- Use the noise to create an image / a prompt class for interpolation
- Use diffusion to create the final image
- Join the images together and add the music to them for the output video
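To make steps 1-4 more concrete, here is a minimal sketch of how the audio could be turned into one noise vector per video frame. It assumes librosa for loading the audio and plain numpy for the FFT; the exact band splitting, weighting, and smoothing in the notebook may differ.

```python
# Sketch of steps 1-4: audio chunks -> spectra -> band strengths -> weighted noise vectors.
import numpy as np
import librosa

frames_per_second = 30
noise_dim = 128         # dimensionality of each random vector (BigGAN noise size, an assumption)
num_bands = 16          # number of frequency bands / random vectors (an assumption)
smoothing_factor = 0.1  # plays the same role as the setting listed above

audio, sample_rate = librosa.load("song.mp3", sr=None, mono=True)  # placeholder input path
samples_per_frame = int(sample_rate / frames_per_second)

rng = np.random.default_rng(42)
band_vectors = rng.standard_normal((num_bands, noise_dim))  # one random vector per band

noise_per_frame = []
previous = np.zeros(noise_dim)
for start in range(0, len(audio) - samples_per_frame, samples_per_frame):
    chunk = audio[start:start + samples_per_frame]        # step 1: small audio piece
    spectrum = np.abs(np.fft.rfft(chunk))                 # step 2: magnitude spectrum
    bands = np.array_split(spectrum, num_bands)           # step 3: summarize into bands
    strengths = np.array([band.mean() for band in bands])
    strengths /= strengths.sum() + 1e-8                   # normalize the band strengths
    noise = strengths @ band_vectors                      # step 4: weighted sum of random vectors
    noise = (1 - smoothing_factor) * previous + smoothing_factor * noise  # simple smoothing (assumption)
    previous = noise
    noise_per_frame.append(noise)  # later fed to BigGAN as its noise input
```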
You can find the file saved as diffused-biggan.ipynb in the root of this repository. We recommend uploading that file to Google Colab or Lambda Labs for faster image generation.
setup.mp4
If you want to try to run the file locally, there are the following points to consider:
- Install PyTorch with GPU support
- The same applies to TensorFlow
Additionally, you need to install:
- numpy
- matplotlib
In the file you can easily change the following parameters:
- Input path of the music file (we recommend using music from Pixabay.com as it's royalty-free)
- Output path of the video file
- BigGAN labels that will be used for the images in the video
- Truncation: the vector length that will be used within BigGAN
- Extra detail: defines how detailed the output image will be
- Maximum frequency level: frequencies above this value are not considered anymore
- Low frequency skip: same as the maximum frequency level, but for low frequencies
- Frequency band growth rate: the factor by which the frequency bands are increased
- Smoothing factor: defines how the random vectors are mixed
- Iterations: the number of times the smoothing algorithm is applied
The video from BigGAN is not cached in an MP4 format or similar. If you want to do this, you can use the same procedure as for Stable Diffusion. You have to initialise a video writer and then write each created frame into it and finally generate the video with the release function.
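A minimal sketch of that procedure with OpenCV's VideoWriter could look like the following; the frame size, fps, output path, and the float-to-uint8 conversion are assumptions, not values from the notebook.

```python
# Cache generated frames as an MP4 using OpenCV: open a writer, write each frame, release.
import cv2
import numpy as np

fps = 30
width, height = 512, 512
# Stand-in for the frames produced by BigGAN (floats in [0, 1], RGB).
biggan_frames = [np.random.rand(height, width, 3) for _ in range(fps)]

writer = cv2.VideoWriter("biggan_cached.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"),
                         fps, (width, height))

for frame in biggan_frames:
    frame = np.clip(frame * 255, 0, 255).astype(np.uint8)   # convert to 8-bit
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))    # OpenCV expects BGR channel order

writer.release()  # finalizes the MP4 file
```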
For the Stable Diffusion part, you can change the following parameters:
- Prompt: the prompt that will be used to generate the Stable Diffusion image
- Number of interpolation steps: how many steps we use to interpolate between two images
- Number of steps to generate the image: how many steps we use to generate each image
- Unconditional guidance scale: how closely the image should follow the prompt
- Input image strength: how much the output differs from the input image (see the call sketch after this list)
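As a sketch of how these parameters fit together, here is an Img2Img call in the style of the stable-diffusion-tensorflow package, whose generate() keywords match the settings listed earlier; the exact import and wrapper used in the notebook are an assumption on our part.

```python
# Img2Img call with the parameters described above (values taken from the second video's settings).
from stable_diffusion_tf.stable_diffusion import StableDiffusion

generator = StableDiffusion(img_height=512, img_width=512)

frame = generator.generate(
    prompt="Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, beautiful",
    num_steps=15,                      # steps used to generate the image
    unconditional_guidance_scale=7.5,  # how closely the image should follow the prompt
    temperature=0.0,
    batch_size=1,
    seed=42,
    input_image="biggan_frame.png",    # in the notebook, a BigGAN frame (img) is passed here
    input_image_strength=0.6,          # how much the output may differ from the input image
)
```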
If the video generation takes too long, you can cancel the second-to-last cell while it is running and then run the last cell; it will generate a video from the images produced up to that point.
- Check the Plexatics YouTube channel for more examples