kamalesh0406 / audio-classification

Pytorch code for "Rethinking CNN Models for Audio Classification"

audio-classification's Introduction

Rethinking CNN Models for Audio Classification

This repository contains the PyTorch code for our paper Rethinking CNN Models for Audio Classification. The experiments are conducted on three datasets: ESC-50, UrbanSound8K, and GTZAN.

Preprocessing

The preprocessing is done separately to save time during the training of the models.

For ESC-50:

python preprocessing/preprocessingESC.py --csv_file /path/to/file.csv --data_dir /path/to/audio_data/ --store_dir /path/to/store_spectrograms/ --sampling_rate 44100

For UrbanSound8K:

python preprocessing/preprocessingUSC.py --csv_file /path/to/csv_file/ --data_dir /path/to/audio_data/ --store_dir /path/to/store_spectrograms/

For GTZAN:

python preprocessing/preprocessingGTZAN.py --data_dir /path/to/audio_data/ --store_dir /path/to/store_spectrograms/ --sampling_rate 22050

Training the Models

The configurations for training the models are provided in the config folder. The sample_config.json explains the details of all the variables in the configurations. The command for training is:

python train.py --config_path /config/your_config.json

audio-classification's Issues

About the article current state

Hello! Thanks for your work. Has this article been published in a journal yet?

About datasetaug

In dataloaders/datasetaug.py (lines 23-29):

sample = value
limits = ((-2, 2), (0.9, 1.2))

if self.mode=="train":
	pitch_shift = np.random.randint(limits[0][1], limits[0][1] + 1)
	time_stretch = np.random.random() * (limits[1][1] - limits[1][0]) + limits[1][0]
	new_audio = librosa.effects.time_stretch(librosa.effects.pitch_shift(sample, self.sr, pitch_shift), time_stretch)

Is there a bug in pitch_shift?
With pitch_shift = np.random.randint(limits[0][1], limits[0][1] + 1), the pitch shift is always 2, because both bounds are taken from limits[0][1].
I think it should be pitch_shift = np.random.randint(limits[0][0], limits[0][1] + 1), so that the pitch shift ranges from -2 to 2 as expected.
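
The difference between the two calls can be checked in isolation. This is a minimal sketch: the fixed seed and the 1000-draw loop are added here for illustration only and are not part of the repository.

```python
import numpy as np

limits = ((-2, 2), (0.9, 1.2))
np.random.seed(0)  # fixed seed (illustrative) so the check is repeatable

# Current call: low and high both come from limits[0][1], so the
# half-open range is [2, 3) and only the value 2 can be drawn.
fixed = {int(np.random.randint(limits[0][1], limits[0][1] + 1)) for _ in range(1000)}

# Proposed call: low comes from limits[0][0], so the half-open range
# is [-2, 3) and every integer from -2 to 2 can be drawn.
varied = {int(np.random.randint(limits[0][0], limits[0][1] + 1)) for _ in range(1000)}

print(sorted(fixed))
print(sorted(varied))
```

np.random.randint excludes its upper bound, which is why the + 1 is needed to include 2 in the proposed call.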

matrix normalization

How did you normalize the (3,128,250) inputs? No normalization is applied when the audio is preprocessed.
Does DenseNet normalize its inputs? If so, where?

Thanks

train.py is running but no outputs

Salam Kamalesh,

I added a print command before "with tqdm(total=len(data_loader)) as t:" and it was the last output in the console. I stopped running the file after more than 5 hours and nothing changed, unfortunately.

Do you have any idea why this might happen?

Thanks a lot!!

Error occurs when running the classification for 'UrbanSound8k' with normalization

Thank you for the great work.

When we tried to run the UrbanSound8K classification task with augmentation, an error occurred (in fold 2):

"Padding size should be less than the dimension 2 of the samples."

It was raised by the following code in "datasetaug.py":

spec = torchaudio.transforms.MelSpectrogram(sample_rate=self.sr, n_fft=self.fft, win_length=window_length, hop_length=hop_length, n_mels=self.melbins)(clip)

Is this normal, or am I running a different version of librosa?

Thank you.
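
One possible cause, stated as an assumption rather than a confirmed diagnosis: the short-time Fourier transform behind MelSpectrogram reflect-pads the signal by n_fft // 2 samples on each side by default, and reflect padding fails when the clip is shorter than that. Some clips may become very short after time stretching. A minimal workaround sketch follows; the helper name pad_to_min_length and the value n_fft = 2048 are hypothetical, and the helper works on a NumPy array before it is converted to a tensor.

```python
import numpy as np


def pad_to_min_length(clip, min_length):
    """Zero-pad a 1-D clip on the right so it has at least min_length samples."""
    shortfall = min_length - clip.shape[-1]
    if shortfall <= 0:
        # Clip is already long enough; return it unchanged.
        return clip
    return np.pad(clip, (0, shortfall), mode="constant")


n_fft = 2048  # hypothetical FFT size; use the value from your own config
short_clip = np.zeros(500, dtype=np.float32)  # stand-in for a very short clip
padded = pad_to_min_length(short_clip, n_fft)
print(padded.shape)
```

Padding to at least n_fft samples keeps the clip longer than the n_fft // 2 reflect padding, at the cost of trailing silence in the spectrogram.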

Accuracy of each fold

Sorry to disturb you. I forked your code and tried using 'resnet' to classify samples from the 'UrbanSound8k' dataset. However, the accuracy on the first fold is 76.9, which is far below the 84.76% reported in [1]. Is this normal? Could you make your per-fold results public? I look forward to your reply.

[1] Palanisamy, Kamalesh, et al. “Rethinking CNN Models for Audio Classification.” ArXiv Preprint ArXiv:2007.11154, 2020.

About the Integrated Gradients

Hi,
Thanks for your contribution. I am interested in your paper and am trying to run the scripts. The paper mentions integrated gradients results, which are impressive. Could you provide the related code?

Thanks

About paper

Hello, thanks for your contribution. I would like to know the current status of your paper. Has it been accepted?

Prediction/Inference for novel data

Do you provide a utility/function for inference on novel data, that is, a way to apply a trained model to a previously unseen audio file?

Question regarding json file

Hello Kamalesh,

I am interested in your paper and am trying to run your solution. I have a question about the UrbanSound8K config file: the number of folds is set to 1. Why did you choose this?

About GPU Utilization

Thanks for your great work!
I tried to run your project, but training is very slow. GPU utilization is very low, only 1%, while GPU memory usage looks normal at about 4GB.
I don't know why this happens. Looking forward to your answer!