
Dataset constraints about audio-diffusion (16 comments, CLOSED)

teticio commented on May 27, 2024
Dataset constraints


Comments (16)

teticio commented on May 27, 2024

Glad you liked it. Around 400 files were used. (You can load the dataset into a pandas dataframe and do a "unique" on the filename.) If you count the number of rows (I think there were around 20,000), this tells you the total length: 5 s × 20,000 = 100,000 s ≈ 27 h, or about 4 minutes per track on average.
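For instance, something along these lines (a minimal sketch, assuming the hub copy of the dataset and an "audio_file" column naming the source mp3; check the actual schema):

from datasets import load_dataset

# load the spectrogram dataset and count source tracks vs. slices
ds = load_dataset("teticio/audio-diffusion-256", split="train")
df = ds.to_pandas()
print(df["audio_file"].nunique())  # distinct source tracks (~400)
print(len(df))                     # number of 5-second slices (~20,000)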


deepak-newzera commented on May 27, 2024

I did it the following way; please check it and comment on it:

I have some mp3 music recordings. I split each recording into 5-second clips, giving around 5,000 clips in total.
Then I ran the command python scripts/audio_to_images.py --resolution 256,256 --hop_length 1024 --input_dir Splitted_mp3s --output_dir spectrogram_data-splitted-mp3-256 to get the spectrogram data.

Then I ran the command accelerate launch scripts/train_unet.py --dataset_name spectrogram_data-splitted-mp3-256 --hop_length 1024 --output_dir models/audio-diffusion-splitted-mp3-256/ --train_batch_size 2 --num_epochs 100 --gradient_accumulation_steps 8 --save_images_epochs 100 --save_model_epochs 1 --scheduler ddim --learning_rate 1e-4 --lr_warmup_steps 500 --mixed_precision no to train the model with my dataset. The training is in progress.

Is this the correct way to train the model? Please let me know.


teticio commented on May 27, 2024

Best not to split the mp3s yourself, as your splits will not be exactly 5 seconds. The audio_to_images script will do this for you: just provide a folder of regular, full-length mp3s, as in the example below. It should still work OK. What you have done looks correct otherwise.
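For example (the directory names here are hypothetical; the flags otherwise mirror your command):

python scripts/audio_to_images.py --resolution 256,256 --hop_length 1024 --input_dir full_length_mp3s --output_dir spectrogram_data-256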


deepak-newzera commented on May 27, 2024

I initially did the training without splitting. But it gave clumsy and noisy outputs. Now I have completed training with splitting as well, yet the outputs are still bad!
I am doing the following to test the trained model:
from IPython.display import Audio, display
from audiodiffusion import AudioDiffusion

# load the locally trained model and generate one spectrogram/audio pair
audio_diffusion = AudioDiffusion('/home/deepak/mansion/AD/audio-diffusion/models/audio-diffusion-splitted-mp3-256')
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()
display(image)
display(Audio(audio, rate=sample_rate))

Please give me some suggestions for getting clean outputs.


deepak-newzera commented on May 27, 2024

Also, is there a way to evaluate this model with some metrics? (Like checking how close the generated music is to the training data.)


teticio commented on May 27, 2024

It's a bit hard to say without being able to see your model. You could consider pushing it (with the tensorboard logs, which should be included by default) to the Hugging Face hub; then I could look at it. One thing you can do is use the test_mel.ipynb notebook to load an example from your test dataset (make sure you set the Mel parameters to match those used in generation, i.e., hop_length 1024) and see how the recreated mp3 sounds. It is also possible that you don't have enough data, but I can't say, as I didn't try with fewer than 20,000 samples.
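In outline, that round-trip check looks something like this (a minimal sketch rather than the notebook verbatim; the Mel constructor arguments and the mp3 path are assumptions):

from audiodiffusion.mel import Mel

mel = Mel(x_res=256, y_res=256, hop_length=1024)  # parameters must match the dataset
mel.load_audio("Splitted_mp3s/example.mp3")       # hypothetical file name
image = mel.audio_slice_to_image(0)               # spectrogram of the first slice
audio = mel.image_to_audio(image)                 # approximate reconstruction (phase is estimated)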

Regarding your second question about metrics: you can run tensorboard --logdir=. and see the loss curves and the samples generated per epoch as training progresses. The losses measure how well the model reconstructs audio after noising and denoising; they don't measure the quality of samples generated by denoising pure noise (which is the generative process).


deepak-newzera commented on May 27, 2024

Yeah, test_mel.ipynb is also not recreating the mp3s accurately. What might be the problem?
Also, for your dataset, each epoch iterates over all the samples (20000/20000), right?


teticio commented on May 27, 2024

So I would not recommend hop_length=1024: use the default (leave it blank or put 512). The higher hop_length was for low-resolution cases; I can't remember the details, but you can see my tensorboard here: https://huggingface.co/teticio/audio-diffusion-256/tensorboard. I did 100 epochs. Before you do any training, make sure you can get a decent-quality reconstruction of an audio sample from a mel image. Again, if you push your dataset to HF, I can download it and try it out, but try to solve it yourself first. Good luck, and let me know how you get on.


teticio commented on May 27, 2024

PS: note that the first epochs have very quiet audio samples in the tensorboard, because I was not normalizing them at first.


deepak-newzera commented on May 27, 2024

That's a really supportive reply. I will keep trying.
If possible, please also try it out yourself. This is the link to my data directory containing the mp3 files:
https://drive.google.com/file/d/1lRYkvEzfpsiCc5byTBBl9nFbmeNnAnJg/view?usp=share_link


deepak-newzera commented on May 27, 2024

@teticio Also, please let me know how to generate longer samples from the pre-trained model.


deepak-newzera commented on May 27, 2024

@teticio I would like to reproduce your model by training on your dataset. Could you please provide it?
I can see it at https://huggingface.co/datasets/teticio/audio-diffusion-256/tree/main/data, but it is in parquet format. How can I get mp3 files from it?
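For what it's worth, the parquet files load with the datasets library, but the rows hold spectrogram images rather than mp3s, so audio can only be reconstructed approximately via Mel. A sketch, assuming the image column is called "image" and default Mel settings:

from datasets import load_dataset
from audiodiffusion.mel import Mel

ds = load_dataset("teticio/audio-diffusion-256", split="train")
mel = Mel(x_res=256, y_res=256)             # assumed to match the dataset's settings
audio = mel.image_to_audio(ds[0]["image"])  # approximate inversion of one slice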


deepak-newzera commented on May 27, 2024

@teticio I pushed my dataset to the HF hub; it can be found at https://huggingface.co/datasets/deepak-newzera/spectrogram_data_max_music_dataset-1


teticio commented on May 27, 2024


deepak-newzera commented on May 27, 2024

The dataset at deepak-newzera/spectrogram_data_max_music_dataset-1 is newly created. I have 180 music recordings, and from each recording I took overlapping 8-second clips (0s to 8s, 1s to 9s, and so on). This expanded the dataset to around 15,000 8-second clips, and the model trained on it now produces noticeably better music outputs.

But I have a question. While producing output music (running model inference), the progress bar iterates from 0 to 1000 when your model is used, but with my model it iterates from 0 to 50 only. What does this signify? Does it affect the quality of the output?
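(For context: in the diffusers library, DDPM sampling defaults to 1,000 denoising steps while DDIM defaults to 50, which would account for the difference, since the model above was trained with --scheduler ddim. A hedged sketch, assuming generate_spectrogram_and_audio accepts a steps argument in this version of the code:)

# hypothetical: request a different number of denoising steps at inference time
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(steps=1000)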


teticio commented on May 27, 2024

