Comments (16)
Glad you liked it. Around 400 files were used. (You can load the dataset into a pandas dataframe and do a "unique" on the filename.) If you count the number of rows (I think there were around 20,000), this gives you the total length: 5s * 20,000 = 100,000s, or about 27h, which works out to roughly 4 minutes per track on average.
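For reference, a minimal sketch of that check, assuming the hub copy of the dataset (teticio/audio-diffusion-256) and that the filename column is called audio_file - check the schema of your version:

```python
from datasets import load_dataset

# Assumed dataset id and column names; drop the large image column
# before converting to pandas.
ds = load_dataset("teticio/audio-diffusion-256", split="train")
df = ds.remove_columns("image").to_pandas()

print(df["audio_file"].nunique())  # number of source tracks (~400)
print(len(df))                     # number of 5-second slices (~20,000)
print(len(df) * 5 / 3600)          # total length in hours (~27)
```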
from audio-diffusion.
I did it the following way; please check it and comment:
I have some mp3 music recordings. I made a total of around 5000 clips out of those recordings by splitting each recording into 5-second segments.
Then I ran the following command to get the spectrogram data:

```
python scripts/audio_to_images.py --resolution 256,256 --hop_length 1024 --input_dir Splitted_mp3s --output_dir spectrogram_data-splitted-mp3-256
```
Then, to train the model on my dataset, I executed:

```
accelerate launch scripts/train_unet.py --dataset_name spectrogram_data-splitted-mp3-256 --hop_length 1024 --output_dir models/audio-diffusion-splitted-mp3-256/ --train_batch_size 2 --num_epochs 100 --gradient_accumulation_steps 8 --save_images_epochs 100 --save_model_epochs 1 --scheduler ddim --learning_rate 1e-4 --lr_warmup_steps 500 --mixed_precision no
```

The training is in progress.
Is this the correct way to train the model? Please let me know.
from audio-diffusion.
Best not to split the mp3s yourself, as the split is not exactly 5 seconds. The audio_to_images script will do this for you - just provide a folder of regular mp3s. It should still work OK. What you have done looks correct otherwise.
from audio-diffusion.
I initially trained without splitting, but it gave clumsy and noisy outputs. Now I have completed training with the split clips as well, yet the outputs are still bad!
I am doing the following to test the trained model:

```python
from IPython.display import Audio, display
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion('/home/deepak/mansion/AD/audio-diffusion/models/audio-diffusion-splitted-mp3-256')
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()
display(image)
display(Audio(audio, rate=sample_rate))
```

Please give me some suggestions for getting clean outputs.
from audio-diffusion.
Also, is there a way to evaluate this model through some metrics (e.g., checking how close the generated music is to the training data)?
from audio-diffusion.
It's a bit hard to say without being able to see your model. You could consider pushing it (with the tensorboard logs, which should be included by default) to the Hugging Face hub; then I could look at it. One thing you can do is use the test_mel.ipynb notebook to load an example from your test dataset (make sure you set the Mel parameters to match those used in generation - i.e., hop_length 1024) and see how the recreated mp3 sounds. It is also possible that you don't have enough data, but I can't say, as I haven't tried with fewer than 20,000 samples.
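For reference, a hedged sketch of that reconstruction check outside the notebook, using the Mel class from the audiodiffusion package (constructor and method names may differ between versions - check audiodiffusion/mel.py):

```python
from audiodiffusion.mel import Mel

# The Mel parameters must match those used to build the dataset.
mel = Mel(x_res=256, y_res=256, hop_length=1024)
mel.load_audio("example.mp3")        # any clip from the training data (assumed filename)
image = mel.audio_slice_to_image(0)  # mel spectrogram of the first slice
audio = mel.image_to_audio(image)    # Griffin-Lim reconstruction, to compare by ear
```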
Regarding your second question about metrics, you can run `tensorboard --logdir=.` and see the loss curves and generated samples per epoch as training progresses. Note that the losses measure how well the model is able to reconstruct an audio after noising and denoising; they don't measure the quality of samples generated by denoising pure noise (which is the generative process).
from audio-diffusion.
Yeah, the test_mel.ipynb notebook is also not recreating the mp3s accurately. What might be the problem?
Also, for your dataset, the iterations per epoch run like 20000/20000, right?
from audio-diffusion.
So I would not recommend hop_length=1024: use the default (leave it blank or put 512). The higher hop_length was for low-resolution cases; I can't remember the details, but you can see my tensorboard here: https://huggingface.co/teticio/audio-diffusion-256/tensorboard. I did 100 epochs. Before you do any training, make sure you can get a decent-quality reconstruction of an audio sample from a mel image. Again, if you push your dataset to HF, I can download it and try it out, but try to solve it yourself first. Good luck, and let me know how you get on.
from audio-diffusion.
PS: note that the first epochs have very quiet audio samples in the tensorboard because I was not normalizing them at first.
from audio-diffusion.
That's a really supportive reply. I will keep trying.
If possible, please also try it out yourself. This is the link to my data directory containing the mp3 files:
https://drive.google.com/file/d/1lRYkvEzfpsiCc5byTBBl9nFbmeNnAnJg/view?usp=share_link
from audio-diffusion.
@teticio Also, please let me know how to generate longer samples from the pre-trained model.
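(For anyone else looking for this: a hedged sketch of the out-painting approach from the repo's notebooks - generate a clip, then repeatedly regenerate with the tail of the previous clip masked in so consecutive chunks join up. Argument names such as mask_start_secs should be checked against your version of audiodiffusion.)

```python
import numpy as np
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion("teticio/audio-diffusion-256")
_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()

overlap_secs = 2  # how much of the previous chunk to condition on
track = audio
for _ in range(5):  # append 5 more chunks
    # The first overlap_secs of the new clip are masked to match the
    # tail of the previous one, so the chunks can be stitched together.
    _, (_, audio) = audio_diffusion.generate_spectrogram_and_audio_from_audio(
        raw_audio=audio[-overlap_secs * sample_rate:],
        mask_start_secs=overlap_secs,
    )
    track = np.concatenate([track, audio[overlap_secs * sample_rate:]])
```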
from audio-diffusion.
@teticio I would like to reproduce your model by training on your dataset. Could you please provide it?
I can see it at https://huggingface.co/datasets/teticio/audio-diffusion-256/tree/main/data, but it is in parquet format. How can I get mp3 files from it?
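(As far as I can tell, the parquet shards hold a Hugging Face dataset of mel-spectrogram images plus the source filename and slice index, not mp3s, so the original audio cannot be recovered exactly; at best you can reconstruct approximate audio with Griffin-Lim. A hedged sketch, assuming the Mel class and an image column:)

```python
from datasets import load_dataset
from audiodiffusion.mel import Mel

ds = load_dataset("teticio/audio-diffusion-256", split="train")
mel = Mel(x_res=256, y_res=256, hop_length=512)  # assumed params for this dataset
audio = mel.image_to_audio(ds[0]["image"])       # PIL image -> numpy waveform
```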
from audio-diffusion.
@teticio I pushed my dataset to the Hugging Face hub; it can be found at https://huggingface.co/datasets/deepak-newzera/spectrogram_data_max_music_dataset-1
from audio-diffusion.
The dataset at deepak-newzera/spectrogram_data_max_music_dataset-1 is a newly created one. I have 180 music recordings. From each recording, I took 8-second clips as 0s to 8s, 1s to 9s, and so on. This way I expanded the dataset to around 15,000 8-second clips and trained the model on it. Now I can hear somewhat better music outputs from this trained model.
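For reference, a minimal sketch of that sliding-window augmentation, assuming pydub (not part of the repo) for mp3 handling; sliding_clips is a hypothetical helper:

```python
import os
from pydub import AudioSegment

def sliding_clips(path, clip_secs=8, stride_secs=1):
    """Yield overlapping clips: 0-8s, 1-9s, 2-10s, ..."""
    audio = AudioSegment.from_mp3(path)  # pydub slices in milliseconds
    clip_ms, stride_ms = clip_secs * 1000, stride_secs * 1000
    for start in range(0, len(audio) - clip_ms + 1, stride_ms):
        yield audio[start:start + clip_ms]

os.makedirs("clips", exist_ok=True)
for i, clip in enumerate(sliding_clips("recording.mp3")):
    clip.export(f"clips/recording_{i:04d}.mp3", format="mp3")
```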
But I have a doubt. While producing output music (running model inference), the progress bar iterates from 0 to 1000 when your model is used, but in the case of my model it iterates from 0 to 50 only. What does this signify? Does it affect the quality of the output?
from audio-diffusion.