StyleGAN2 has fascinated us with the amazing quality of images it can generate. A demonstration of its power can be seen on thispersondoesnotexist.com.
People have come up with a way to create nice videos that morph from one image to another. Because the interpolation happens in the latent space, this is called a "Latent Walk". A demonstration of this can be seen in this video, in which a beach is being generated.
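At its core, such a latent walk is just a loop over points on a line between two latent vectors, each of which is fed to the generator. The sketch below only illustrates the idea; `generate_image`, `noise_dim`, and the number of frames are placeholders, not taken from any particular repository.

```python
# Conceptual latent-walk sketch: linearly interpolate between two random latent
# vectors and render one frame per interpolation step.
import numpy as np

noise_dim = 512  # latent dimensionality (typical for StyleGAN2, but an assumption here)

def generate_image(z):
    # Placeholder for a pretrained generator call (e.g. StyleGAN2); the latent
    # vector is returned unchanged just to keep the sketch runnable.
    return z

z_start = np.random.randn(noise_dim)  # latent code of the first image
z_end = np.random.randn(noise_dim)    # latent code of the second image

frames = []
for t in np.linspace(0.0, 1.0, num=60):      # 60 in-between frames
    z = (1.0 - t) * z_start + t * z_end      # walk along the line in latent space
    frames.append(generate_image(z))         # each frame morphs a bit further
```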
This is already very cool, but there's even more. People wanted these videos to react to music, which is called an "Audioreactive Latent Walk". A demonstration of this can be seen in the video "Audio-reactive Latent Interpolations with StyleGAN2-ada" by the YouTuber Nerdy Rodent (the demo starts at 5:50).
The YouTube channel of Tim Hawkey features some more really impressive videos, mainly generated with Stable Diffusion.
Several libraries and pretrained models are needed to create such a video. This includes the standard ML libraries like numpy, tensorflow/pytorch and opencv, and of course a pretrained StyleGAN2 model. Potentially, the StyleGAN2 model could be exchanged for a Stable Diffusion model, as there have already been some attempts to create videos using Stable Diffusion.
Computing power can be accessed using either Google Colab or Lambda Labs for faster generation (we recommend this for the Stable Diffusion video generation).
We started by trying out the following technologies:
Right at the beginning, we wanted to start the project with full energy and try out the latest technologies, so our first approach was to try StyleGAN 3. Unfortunately, we quickly found out that it requires too much computing power on the one hand, and that on the other hand there are very few examples and little documentation available.
While one of us was busy with StyleGAN 3, the other was trying to get Stable Diffusion running. This worked fine and we were able to make it audioreactive. Unfortunately, when not run on a paid GPU cloud, it took too long to tune the parameters to get good images, and this carried through to generating the whole video.
After talking to our teacher, we looked around for an alternative in StyleGAN 2 and stumbled across this GitHub repo.
From the repo mentioned above, we took the maua StyleGAN approach, fixed it, and played around with it. Even though it was StyleGAN 2, it still needed a lot of computing power, which cost quite a lot to "only" run some experiments on Google Colab.
maua_stylegan_song_1.mp4
maua_stylegan_song_2.mp4
As we struggled with most other models, we decided to use a small model to test everything with, and we were even able to run BigGAN locally on our laptops. So we went with this and found a great GitHub repository, which we used to get started.
biggan.mp4
Our final product is a mixture of BigGAN and Stable Diffusion. First, we take the music input and create a video with BigGAN. We then take the frames of the BigGAN video as inputs for a Stable Diffusion Img2Img model and create a new video with Stable Diffusion.
stable_diffusion_it_2.mp4
This video was generated with the following settings:
truncation = 0.7
extra_detail = 0.9
max_frequency_level = 11000
low_frequency_skip = 16
frequency_band_growth_rate = 1.01
smoothing_factor = 0.1
iterations = 2
seed = 42
prompt = "Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, legendary, fantasy, trippy, LSD"
num_steps = 10
unconditional_guidance_scale = 7.5
temperature = 0.0
batch_size = 1
input_image = img
input_image_strength = 0.5
stable_diffusion_it_1.mp4
This one was generated with the following settings:
truncation = 0.7
extra_detail = 0.9
max_frequency_level = 11000
low_frequency_skip = 16
frequency_band_growth_rate = 1.01
smoothing_factor = 0.1
iterations = 1
seed = 42
prompt = "Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, beautiful"
num_steps = 15
unconditional_guidance_scale = 7.5
temperature = 0.0
batch_size = 1
input_image = img
input_image_strength = 0.6
- Cut the audio into small pieces
- Use an audio FFT to get a spectrogram for each piece from step 1
- Summarize the frequencies from the spectrogram
- Compute a weighted sum of random vectors based on the spectrogram strengths (each vector has noise_dim as its dimensionality; see the sketch after this list)
- Use the noise to create an image / a prompt class for interpolation
- Use diffusion to create the final image
- Join the images together and add the music to them for the output video
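To make steps 1-4 more concrete, here is a minimal sketch of how the audio could be turned into one noise vector per video frame. It assumes librosa for loading the audio and plain numpy for the FFT; the exact band splitting, weighting, and smoothing in the notebook may differ.

```python
# Sketch of steps 1-4: audio chunks -> spectra -> band strengths -> weighted noise vectors.
import numpy as np
import librosa

frames_per_second = 30
noise_dim = 128         # dimensionality of each random vector (BigGAN noise size, an assumption)
num_bands = 16          # number of frequency bands / random vectors (an assumption)
smoothing_factor = 0.1  # plays the same role as the setting listed above

audio, sample_rate = librosa.load("song.mp3", sr=None, mono=True)  # placeholder input path
samples_per_frame = int(sample_rate / frames_per_second)

rng = np.random.default_rng(42)
band_vectors = rng.standard_normal((num_bands, noise_dim))  # one random vector per band

noise_per_frame = []
previous = np.zeros(noise_dim)
for start in range(0, len(audio) - samples_per_frame, samples_per_frame):
    chunk = audio[start:start + samples_per_frame]        # step 1: small audio piece
    spectrum = np.abs(np.fft.rfft(chunk))                 # step 2: magnitude spectrum
    bands = np.array_split(spectrum, num_bands)           # step 3: summarize into bands
    strengths = np.array([band.mean() for band in bands])
    strengths /= strengths.sum() + 1e-8                   # normalize the band strengths
    noise = strengths @ band_vectors                      # step 4: weighted sum of random vectors
    noise = (1 - smoothing_factor) * previous + smoothing_factor * noise  # simple smoothing (assumption)
    previous = noise
    noise_per_frame.append(noise)  # later fed to BigGAN as its noise input
```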
You can find the file saved as diffused-biggan.ipynb in the root of this repository. We recommend uploading that file to Google Colab or Lambda Labs for faster image generation.
setup.mp4
If you want to try to run the file locally, there are the following points to consider:
- Install PyTorch with GPU support
- The same applies to TensorFlow
Additionally, you need to install:
- numpy
- matplotlib
In the file you can easily change the following parameters:
- Input path of the music file (we recommend using music from Pixabay.com as it's royalty-free)
- Output path of the video file
- BigGAN labels that will be used for the images in the video
- Truncation: the vector length that will be used within BigGAN
- Extra detail: defines how detailed the output image will be
- Maximum frequency level: frequencies above this value are not considered anymore
- Low frequency skip: same as the maximum frequency level, but for low frequencies
- Frequency band growth rate: the factor by which the frequency bands are increased
- Smoothing factor: defines how the random vectors are mixed
- Iterations: the number of times the smoothing algorithm is applied
The video from BigGAN is not cached in an MP4 format or similar. If you want to do this, you can use the same procedure as for Stable Diffusion. You have to initialise a video writer and then write each created frame into it and finally generate the video with the release function.
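A minimal sketch of that procedure with OpenCV's VideoWriter could look like the following; the frame size, fps, output path, and the float-to-uint8 conversion are assumptions, not values from the notebook.

```python
# Cache generated frames as an MP4 using OpenCV: open a writer, write each frame, release.
import cv2
import numpy as np

fps = 30
width, height = 512, 512
# Stand-in for the frames produced by BigGAN (floats in [0, 1], RGB).
biggan_frames = [np.random.rand(height, width, 3) for _ in range(fps)]

writer = cv2.VideoWriter("biggan_cached.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"),
                         fps, (width, height))

for frame in biggan_frames:
    frame = np.clip(frame * 255, 0, 255).astype(np.uint8)   # convert to 8-bit
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))    # OpenCV expects BGR channel order

writer.release()  # finalizes the MP4 file
```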
For the Stable Diffusion part, you can change the following parameters:
- Prompt: the prompt that will be used to generate the Stable Diffusion image
- Number of interpolation steps: how many steps we use to interpolate between two images
- Number of steps to generate the image: how many steps we use to generate each image
- Unconditional guidance scale: how closely the image should follow the prompt
- Input image strength: how much the output differs from the input image (see the call sketch after this list)
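As a sketch of how these parameters fit together, here is an Img2Img call in the style of the stable-diffusion-tensorflow package, whose generate() keywords match the settings listed earlier; the exact import and wrapper used in the notebook are an assumption on our part.

```python
# Img2Img call with the parameters described above (values taken from the second video's settings).
from stable_diffusion_tf.stable_diffusion import StableDiffusion

generator = StableDiffusion(img_height=512, img_width=512)

frame = generator.generate(
    prompt="Photorealistic, epic, focused, sharp, cinematic lighting, 4k, 8k, octane rendering, beautiful",
    num_steps=15,                      # steps used to generate the image
    unconditional_guidance_scale=7.5,  # how closely the image should follow the prompt
    temperature=0.0,
    batch_size=1,
    seed=42,
    input_image="biggan_frame.png",    # in the notebook, a BigGAN frame (img) is passed here
    input_image_strength=0.6,          # how much the output may differ from the input image
)
```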
If the video generation takes too long, you can cancel the second-to-last cell while it is running and then run the last cell; it will generate a video from the images produced up to that point.
- Check the Plexatics YouTube channel for more examples