
Comments (6)

teticio commented on May 28, 2024

Hi

Thanks for your comment. I'll try to answer your questions.

  • I am not surprised by the increase in memory requirements, although I haven't looked into it in detail. In fact, there are fewer blocks in the conditional case. Intuitively, imagine you trained a separate model for each genre of music: that would be a multiplicative effect. Doing it conditionally is more compact than that, but it still requires more parameters than the unconditional case. I also think that the amount of data you need increases because you need to have representations corresponding to different areas of the encoding space. As a result, I have to say the results I obtained were not great.
  • Good observation. They should probably be normalized in the training / inference scripts. That might lead to better results than those I obtained. Feel free to make a PR for this.

Let me know how you get on.

Rob


RochMollero commented on May 28, 2024

Thanks for the quick answer.

  • 'I also think that the amount of data you need increases because you need to have representations corresponding to different areas of the encoding space.'

Hmm, is it? I don't really want to generate more data in my case, but rather to 'organise' it, in particular so I can choose between 'low volume' and 'high volume' samples within the same dataset using just one volume variable. Overall, that would not add more data to learn from.

I have a limited amount of compute resources / time, and I honestly don't feel confident tweaking the CrossAttention UNet model / layers here. Would you know of another lib I could quickly try for conditional generation? What's the SOTA here in 2024 for a human-sized model?

Back on this lib, I've also seen that since 2021 many more schedulers claiming better results have appeared (https://huggingface.co/docs/diffusers/v0.3.0/en/api/schedulers). Do you think they can be plugged into this lib out of the box? Any recommendation for one in particular?

  • 'Feel free to make a PR for this.'

I'd be happy to make a PR, but to do it properly I'd have to produce before-and-after results, and I don't think I have good enough data (or the compute resources / time) for such a test. Would it be fine if I made a PR that just adds a warning to audio_encode.py or train_unet.py (or both)?

By the way, thanks for making this library available. Overall it works quite well and I'm getting very realistic results in my current work.


teticio commented on May 28, 2024

Well, you could use the encoder to cluster your samples (e.g., with t-SNE or similar) and organize them that way. My approach of using an audio encoder rather than a text encoder - which would require a text description to accompany each audio clip - is not very common, so I don't think you will find alternatives easily. However, if you want to condition on text (which could be a genre predicted by a classifier based on my audio encoding), there are other repos around, but I can't say whether they are any easier to work with. My aim was to make this as accessible as possible (is that what you mean by a human-sized model?).
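
For example, something along these lines (just a sketch: it assumes the encodings are the usual {filename: vector} dict saved with pickle, and the file name is only illustrative):

    import pickle

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE

    # load the audio encodings produced by the encoding script
    with open("encodings.p", "rb") as f:  # illustrative path
        encodings = pickle.load(f)

    names = list(encodings.keys())
    X = np.stack([np.ravel(v) for v in encodings.values()])

    # 2D embedding for visual inspection of how the samples group together
    X_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

    # or cluster directly in the encoding space and use the cluster id to organise the samples
    labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)

You could then pick a representative encoding per cluster (or per volume level, in your case) to condition on.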

That said, I'm not sure I can help you much more at the moment.

Regarding the PR: I'm not too keen on the warning idea. It should be fairly straightforward to add a line in audio_encode that divides each vector by the numpy.linalg.norm of that vector. But you are right that I would then have to retrain the conditional model and compare, and I don't have the time for that at the moment. So I would suggest that you make the change locally. In any case, I don't expect it to make a huge difference.
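
For reference, the change I have in mind amounts to something like this (variable names are illustrative, not the exact ones in audio_encode.py):

    import numpy as np

    def normalize(encoding):
        norm = np.linalg.norm(encoding)
        return encoding / norm if norm > 0 else encoding  # avoid dividing a zero vector by zero

    encodings = {name: normalize(vector) for name, vector in encodings.items()}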


teticio commented on May 28, 2024

Ah and yes, you can drop in schedulers - it should work fine with them.
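
Something like the usual diffusers pattern should do it (a sketch only; swap in whichever scheduler and model you are using, the model id below is just an example):

    from diffusers import DDIMScheduler, DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256")  # example model id
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)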


RochMollero commented on May 28, 2024

Indeed, I just need to condition on a few variables, so most libs / models are overkill. I started investigating the model and here's what I've seen and done so far:

  • I noticed that, compared to DownBlock2D, CrossAttnDownBlock2D adds an entire call to a Transformer2DModel. This class makes two attention calls, one for 'self-attention' and one for 'cross-attention', so it seems to roughly triple the computation load per layer.

So I decided to do two things:

  • use the option "only_cross_attention", with the idea that the compute load per CrossAttnDownBlock2D call would then only be 2x that of DownBlock2D;
  • reduce the number of layers from 4 to 3, so the resulting UNet would be roughly equivalent to the non-conditional UNet with 6 layers.

This gives:

    model = UNet2DConditionModel(
        sample_size=resolution if vqvae is None else latent_resolution,
        in_channels=1 if vqvae is None else vqvae.config["latent_channels"],
        out_channels=1 if vqvae is None else vqvae.config["latent_channels"],
        layers_per_block=2,
        block_out_channels=(128, 256, 512),
        only_cross_attention=True,
        down_block_types=(
            "CrossAttnDownBlock2D",
            "CrossAttnDownBlock2D",
            "DownBlock2D",
        ),
        up_block_types=(
            "UpBlock2D",
            "CrossAttnUpBlock2D",
            "CrossAttnUpBlock2D",
        ),
        cross_attention_dim=list(encodings.values())[0].shape[-1],
    )
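
For context, the encodings then go into the forward pass as encoder_hidden_states, roughly like this (shapes are illustrative and assume vqvae is None):

    import torch

    batch_size = 8
    sample = torch.randn(batch_size, 1, 64, 64)        # noisy mel spectrograms
    timesteps = torch.randint(0, 1000, (batch_size,))  # diffusion timesteps
    # one encoding vector per sample, passed as a length-1 "token" sequence
    encoder_hidden_states = torch.randn(batch_size, 1, model.config.cross_attention_dim)

    noise_pred = model(sample, timesteps, encoder_hidden_states=encoder_hidden_states).sample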

Then I tried it on the small example you give, with mels of size 64x64, and got surprising results:

  • If I use the non-conditional UNet (6 layers) with batch size 1 and gradient accumulation 16, it uses 2800 MB of GPU RAM.
  • If I use the conditional UNet (3 layers) with batch size 1, gradient accumulation 16 AND only_cross_attention=False, it uses 6694 MB of GPU RAM.
  • If I use the conditional UNet (3 layers) with batch size 1, gradient accumulation 16 AND only_cross_attention=True, it uses 2370 MB of GPU RAM.

So far so good, it all seems logical, and I was very optimistic about trying bigger mels. But first I tried a bigger batch size and hit a problem:

  • If I use the non-conditional UNet (6 layers) with batch size 8 and gradient accumulation 2, it uses 4334 MB of GPU RAM.
  • If I use the conditional UNet (3 layers) with batch size 8, gradient accumulation 2 AND only_cross_attention=False, it crashes on my 2080 Ti, which is what I was describing in my first message.
  • If I use the conditional UNet (3 layers) with batch size 8, gradient accumulation 2 AND only_cross_attention=True, it uses 7132 MB of GPU RAM.

So my interpretation is that the transformer layer doesn't scale well with batch size, and that is currently limiting me, since batch size is directly linked to training speed.

As a matter of fact, I also tried a bigger mel size (256) and see a quite similar scaling issue. Basically, on 2 V100s with the bigger mels I go from 4 GB of RAM usage per GPU with batch size 1 to 26 GB (!) with batch size 2. So it seems the transformer layer doesn't scale well with mel size either 😞
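
A rough back-of-envelope of what I think is going on: self-attention over an H x W feature map materialises an (H*W) x (H*W) attention matrix per head, so the memory grows linearly with batch size (with a large constant) and quadratically with spatial resolution. The numbers below are purely illustrative (8 heads, fp32, attention at full resolution), not measurements:

    # size of the attention matrices for one attention layer over an H x W feature map
    def attn_matrix_bytes(batch, heads, h, w, bytes_per_element=4):
        seq_len = h * w
        return batch * heads * seq_len * seq_len * bytes_per_element

    print(attn_matrix_bytes(8, 8, 64, 64) / 2**20)    # ~4096 MiB for a 64x64 map
    print(attn_matrix_bytes(8, 8, 256, 256) / 2**20)  # 16x more tokens -> 256x more memory

In practice the attention sits at downsampled resolutions, and memory-efficient attention implementations (attention slicing, xFormers) reduce this, but the quadratic dependence on the number of tokens seems to be the underlying issue.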

I'm still investigating, but what do you think of these results? Do you think it makes sense to reduce the number of layers to 3 and use 'only_cross_attention'? Do you have any idea how I could further 'tame' the Transformer2D call in terms of RAM usage and scaling, so that it's closer to the original U-Net?

Regards,
Roch

P.S. I will do a PR later; I think I will add a normalisation of the encodings in the generation script, if that's good for you (that should in fact be sufficient, without retraining).


teticio commented on May 28, 2024

Hi. Thanks for doing all of this and taking the trouble to post it here. To be honest, you have gone further than I did with this kind of experiment. I don't know how good the results will be, so the thing to do is try it out. I've since upgraded my GPU to 2x 4090s, so I haven't been so VRAM constrained lately. I'd suggest you look at what people have to say in the image generation space, as there is generally more activity there.

Regarding bigger mels, I found that the mel inversion was a limiting factor. Apart from being slow, the Griffin-Lim method produces a lot of sound artefacts like warbling. I think a better approach is to use some kind of neural mel inversion such as HIFIGAN or similar. It also means you can work with, say, 256 frequency bins and train that much faster.

