Comments (16)
Glad you liked it. Around 400 files were used. (You can load the dataset into a pandas dataframe and do a "unique" on the filename.) If you count the number of rows (I think there were around 20,000), this gives you the total length: 5s * 20,000 = 100,000s, or about 27h, which works out to roughly 4 minutes per track on average.
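For reference, a minimal sketch of that check, assuming the hub copy of the dataset (teticio/audio-diffusion-256) and that the filename column is called audio_file - check the schema of your version:

```python
from datasets import load_dataset

# Assumed dataset id and column names; drop the large image column
# before converting to pandas.
ds = load_dataset("teticio/audio-diffusion-256", split="train")
df = ds.remove_columns("image").to_pandas()

print(df["audio_file"].nunique())  # number of source tracks (~400)
print(len(df))                     # number of 5-second slices (~20,000)
print(len(df) * 5 / 3600)          # total length in hours (~27)
```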
from audio-diffusion.
I did it the following way; please check it and comment:
I have some mp3 music recordings. I made a total of around 5000 clips out of those recordings by splitting each recording into 5-second segments.
Then I ran the following command to get the spectrogram data:

```
python scripts/audio_to_images.py --resolution 256,256 --hop_length 1024 --input_dir Splitted_mp3s --output_dir spectrogram_data-splitted-mp3-256
```
Then, to train the model on my dataset, I executed:

```
accelerate launch scripts/train_unet.py --dataset_name spectrogram_data-splitted-mp3-256 --hop_length 1024 --output_dir models/audio-diffusion-splitted-mp3-256/ --train_batch_size 2 --num_epochs 100 --gradient_accumulation_steps 8 --save_images_epochs 100 --save_model_epochs 1 --scheduler ddim --learning_rate 1e-4 --lr_warmup_steps 500 --mixed_precision no
```

The training is in progress.
Is this the correct way to train the model? Please let me know.
from audio-diffusion.
Best not to split the mp3s yourself, as the split is not exactly 5 seconds. The audio_to_images script will do this for you - just provide a folder of regular mp3s. It should still work OK. What you have done looks correct otherwise.
from audio-diffusion.
I initially trained without splitting, but it gave clumsy and noisy outputs. Now I have completed training with the split clips as well, yet the outputs are still bad!
I am doing the following to test the trained model:

```python
from IPython.display import Audio, display
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion('/home/deepak/mansion/AD/audio-diffusion/models/audio-diffusion-splitted-mp3-256')
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()
display(image)
display(Audio(audio, rate=sample_rate))
```

Please give me some suggestions for getting clean outputs.
from audio-diffusion.
Also, is there a way to evaluate this model through some metrics (e.g., checking how close the generated music is to the training data)?
from audio-diffusion.
It's a bit hard to say without being able to see your model. You could consider pushing it (with the tensorboard logs, which should be included by default) to the Hugging Face hub; then I could look at it. One thing you can do is use the test_mel.ipynb notebook to load an example from your test dataset (make sure you set the Mel parameters to match those used in generation - i.e., hop_length 1024) and see how the recreated mp3 sounds. It is also possible that you don't have enough data, but I can't say, as I haven't tried with fewer than 20,000 samples.
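For reference, a hedged sketch of that reconstruction check outside the notebook, using the Mel class from the audiodiffusion package (constructor and method names may differ between versions - check audiodiffusion/mel.py):

```python
from audiodiffusion.mel import Mel

# The Mel parameters must match those used to build the dataset.
mel = Mel(x_res=256, y_res=256, hop_length=1024)
mel.load_audio("example.mp3")        # any clip from the training data (assumed filename)
image = mel.audio_slice_to_image(0)  # mel spectrogram of the first slice
audio = mel.image_to_audio(image)    # Griffin-Lim reconstruction, to compare by ear
```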
Regarding your second question about metrics, you can run `tensorboard --logdir=.` and see the loss curves and generated samples per epoch as training progresses. Note that the losses measure how well the model is able to reconstruct an audio after noising and denoising; they don't measure the quality of samples generated by denoising pure noise (which is the generative process).
from audio-diffusion.
Yeah, the test_mel.ipynb notebook is also not recreating the mp3s accurately. What might be the problem?
Also, for your dataset, the iterations per epoch run like 20000/20000, right?
from audio-diffusion.
So I would not recommend hop_length=1024: use the default (leave it blank or put 512). The higher hop_length was for low-resolution cases; I can't remember the details, but you can see my tensorboard here: https://huggingface.co/teticio/audio-diffusion-256/tensorboard. I did 100 epochs. Before you do any training, make sure you can get a decent-quality reconstruction of an audio sample from a mel image. Again, if you push your dataset to HF, I can download it and try it out, but try to solve it yourself first. Good luck, and let me know how you get on.
from audio-diffusion.
PS: note that the first epochs have very quiet audio samples in the tensorboard because I was not normalizing them at first.
from audio-diffusion.
That's a really supportive reply. I will keep trying.
If possible, please also try it out yourself. This is the link to my data directory containing the mp3 files:
https://drive.google.com/file/d/1lRYkvEzfpsiCc5byTBBl9nFbmeNnAnJg/view?usp=share_link
from audio-diffusion.
@teticio Also, please let me know how to generate longer samples from the pre-trained model.
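(For anyone else looking for this: a hedged sketch of the out-painting approach from the repo's notebooks - generate a clip, then repeatedly regenerate with the tail of the previous clip masked in so consecutive chunks join up. Argument names such as mask_start_secs should be checked against your version of audiodiffusion.)

```python
import numpy as np
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion("teticio/audio-diffusion-256")
_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()

overlap_secs = 2  # how much of the previous chunk to condition on
track = audio
for _ in range(5):  # append 5 more chunks
    # The first overlap_secs of the new clip are masked to match the
    # tail of the previous one, so the chunks can be stitched together.
    _, (_, audio) = audio_diffusion.generate_spectrogram_and_audio_from_audio(
        raw_audio=audio[-overlap_secs * sample_rate:],
        mask_start_secs=overlap_secs,
    )
    track = np.concatenate([track, audio[overlap_secs * sample_rate:]])
```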
from audio-diffusion.
@teticio I would like to reproduce your model by training on your dataset. Could you please provide it?
I can see it at https://huggingface.co/datasets/teticio/audio-diffusion-256/tree/main/data, but it is in parquet format. How can I get mp3 files from it?
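(As far as I can tell, the parquet shards hold a Hugging Face dataset of mel-spectrogram images plus the source filename and slice index, not mp3s, so the original audio cannot be recovered exactly; at best you can reconstruct approximate audio with Griffin-Lim. A hedged sketch, assuming the Mel class and an image column:)

```python
from datasets import load_dataset
from audiodiffusion.mel import Mel

ds = load_dataset("teticio/audio-diffusion-256", split="train")
mel = Mel(x_res=256, y_res=256, hop_length=512)  # assumed params for this dataset
audio = mel.image_to_audio(ds[0]["image"])       # PIL image -> numpy waveform
```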
from audio-diffusion.
@teticio I pushed my dataset to the Hugging Face hub; it can be found at https://huggingface.co/datasets/deepak-newzera/spectrogram_data_max_music_dataset-1
from audio-diffusion.
The dataset at deepak-newzera/spectrogram_data_max_music_dataset-1 is a newly created one. I have 180 music recordings. From each recording, I took 8-second clips as 0s to 8s, 1s to 9s, and so on. This way I expanded the dataset to around 15,000 8-second clips and trained the model on it. Now I can hear somewhat better music outputs from this trained model.
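For reference, a minimal sketch of that sliding-window augmentation, assuming pydub (not part of the repo) for mp3 handling; sliding_clips is a hypothetical helper:

```python
import os
from pydub import AudioSegment

def sliding_clips(path, clip_secs=8, stride_secs=1):
    """Yield overlapping clips: 0-8s, 1-9s, 2-10s, ..."""
    audio = AudioSegment.from_mp3(path)  # pydub slices in milliseconds
    clip_ms, stride_ms = clip_secs * 1000, stride_secs * 1000
    for start in range(0, len(audio) - clip_ms + 1, stride_ms):
        yield audio[start:start + clip_ms]

os.makedirs("clips", exist_ok=True)
for i, clip in enumerate(sliding_clips("recording.mp3")):
    clip.export(f"clips/recording_{i:04d}.mp3", format="mp3")
```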
But I have a doubt. While producing output music (running model inference), the progress bar iterates from 0 to 1000 when your model is used, but in the case of my model it iterates from 0 to 50 only. What does this signify? Does it affect the quality of the output?
from audio-diffusion.