CLAP (Contrastive Language-Audio Pretraining)

Experiments on the CLAP model for the DL course at IMTA (2023-2024)

Installation

To download all the datasets, run the dataset.sh script.

Usage

cf. this page

Datasets

We use the following datasets:

ESC-50: 50 classes of environmental sounds, 2,000 samples, 5 seconds each.
UrbanSound8K: 10 classes of urban sounds, 8,732 samples, up to 4 seconds each.
FMA-Small: 8 genres of music, 8,000 samples, 30 seconds each.
AudioSet: 527 classes of sounds, 2,084,320 samples, 10 seconds each. However, we only use a subset of a few classes (see figure below).

Experiments

Last audio processed

Image of the last audio processed by the model (from the ESC-50 dataset).

A few experiment results on the ESC-50 dataset

Running the main.py script over the whole ESC-50 dataset on a GTX 1060 uses 1321 MiB of the 6144 MiB of GPU RAM and takes less than 20 minutes to complete.

Confusion matrix of the model over the ESC-50 dataset (raw labels)

Confusion matrix of the model over the ESC-50 dataset (augmented labels)

We also tried augmenting the labels of the ESC-50 dataset by turning single words into full sentences. For example, the label "dog" becomes "A dog is barking". The idea is to give the model more context about what each sound means.

This improved accuracy by more than 10%, and the confusion matrix looks cleaner.
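The label augmentation described above can be sketched as a small lookup with a generic fallback. This is only an illustration, not the code used in this repository: the `LABEL_SENTENCES` table, the `augment_label` helper, and the fallback template are assumptions. Only the "dog" example comes from this README.

```python
# Hypothetical hand-written sentences; only the "dog" entry comes from this README.
LABEL_SENTENCES = {
    "dog": "A dog is barking",
}


def augment_label(label: str) -> str:
    """Turn a raw class label into a full-sentence text prompt."""
    key = label.strip().lower()
    if key in LABEL_SENTENCES:
        return LABEL_SENTENCES[key]
    # Generic fallback template (an assumption): underscores become spaces.
    readable = key.replace("_", " ")
    return f"This is a sound of {readable}"


# Raw labels are replaced by sentences before being passed to the text encoder.
prompts = [augment_label(label) for label in ["dog", "crying_baby"]]
print(prompts)
```

A hand-written sentence per class gives the most context, while the fallback keeps the pipeline usable for classes that have no dedicated sentence yet.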

t-SNE visualization of the ESC-50 dataset + labels

A few experiment results on the UrbanSound8K dataset

On 2000 samples of the UrbanSound8K dataset, the model takes about 35 minutes to run on a GTX1060.

Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top 1 accuracy)

Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top 3 accuracy)
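For reference, top 1 and top 3 accuracy differ only in how many of the highest-scoring classes count as a hit. The sketch below is a minimal pure-Python illustration, assuming the model yields one similarity score per class for each clip. The function names and the toy scores are made up and are not taken from main.py.

```python
def top_k_accuracy(scores, targets, k=1):
    """Fraction of samples whose true class is among the k highest scores.

    scores: one list of per-class similarity scores per sample.
    targets: the true class index of each sample.
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def confusion_matrix(scores, targets, num_classes):
    """counts[t][p] = number of samples of true class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=lambda c: row[c])
        counts[target][predicted] += 1
    return counts


# Toy scores for 4 clips and 4 classes (made-up numbers).
scores = [
    [0.90, 0.05, 0.03, 0.02],  # true class 0, ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 1, ranked second
    [0.10, 0.20, 0.30, 0.40],  # true class 0, ranked last
    [0.10, 0.60, 0.20, 0.10],  # true class 1, ranked first
]
targets = [0, 1, 0, 1]

print(top_k_accuracy(scores, targets, k=1))  # 0.5
print(top_k_accuracy(scores, targets, k=3))  # 0.75
print(confusion_matrix(scores, targets, num_classes=4))
```

The confusion matrix is built from the top 1 prediction only, so a clip that counts as a top 3 hit can still appear off the diagonal.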

t-SNE visualization of the UrbanSound8K dataset + labels

A few experiment results on the FMA-Small dataset

The accuracy on the FMA-Small dataset is very low, which we think might be related to poor labels. We tried augmenting the labels, but this did not improve the accuracy by much.

t-SNE visualization of the FMA-Small dataset + labels

There are some clusters, but the labels do not describe them very accurately. The model is nonetheless suitable for sound retrieval.
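Sound retrieval with a contrastive model typically means ranking audio clips by the cosine similarity between their embeddings and the embedding of a text query. The sketch below illustrates that ranking step in pure Python. It assumes the embeddings have already been computed. The `retrieve` helper, the clip names, and the 2-D vectors are hypothetical and do not come from this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, audio_embeddings, top_n=3):
    """Return the names of the top_n clips most similar to the query."""
    ranked = sorted(
        audio_embeddings,
        key=lambda name: cosine_similarity(query_embedding, audio_embeddings[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy 2-D embeddings standing in for real audio embeddings (made-up values).
audio_embeddings = {
    "clip_a": [1.0, 0.0],
    "clip_b": [0.7, 0.7],
    "clip_c": [0.0, 1.0],
}
query = [0.9, 0.1]  # stands in for the embedding of a text query

print(retrieve(query, audio_embeddings, top_n=2))  # ['clip_a', 'clip_b']
```

Retrieval only needs nearby clips to sound alike, so it can work even when the class labels separate poorly.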



A few experiment results on the AudioSet dataset

Confusion matrix of the model over the AudioSet dataset (~600 samples, augmented labels, top 1 accuracy)

t-SNE visualization of the AudioSet dataset + labels
