dl_2023_clap's Introduction

CLAP (Contrastive Language-Audio Pretraining)

Experiments on the CLAP model for the DL course at IMTA (2023-2024)

Installation

To download all the datasets, run the dataset.sh script.

Usage

cf. this page

Datasets

We use the following datasets:

  • ESC-50: 50 classes of environmental sounds, 2000 samples, 5 seconds each.
  • UrbanSound8K: 10 classes of urban sounds, 8732 samples, up to 4 seconds each.
  • FMA-Small: 8 genres of music, 8000 samples, 30 seconds each.
  • AudioSet: 527 classes of sounds, 2,084,320 samples, 10 seconds each. However, we only use a subset covering a few classes (see the figure below).

Experiments

Last audio processed

Image of the last audio processed by the model (from the ESC-50 dataset).

last audio

A few experimental results on the ESC-50 dataset

Running the main.py script over the whole ESC-50 dataset on a GTX 1060 uses 1321 MiB of the 6144 MiB of GPU memory and takes less than 20 minutes to complete.

Confusion matrix of the model over the ESC-50 dataset (raw labels)

Confusion matrix

Confusion matrix of the model over the ESC-50 dataset (augmented labels)

We also tried augmenting the labels of the ESC-50 dataset by turning single words into full sentences. For example, the label dog becomes A dog is barking. The idea is to give the model more context about the meaning of each sound.
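The label augmentation described above could be sketched as follows. This is a minimal illustration, not the repository's actual code: `SENTENCES`, `augment_label`, the fallback template, and the underscore handling are all assumptions. Only the dog example comes from the text.

```python
# Hand-written sentences per label (hypothetical table).
# Only the "dog" entry is taken from the README; others would be added by hand.
SENTENCES = {
    "dog": "A dog is barking",
}


def augment_label(label: str) -> str:
    """Turn a raw class label into a full sentence for the text encoder."""
    if label in SENTENCES:
        return SENTENCES[label]
    # Fallback template (an assumption): replace underscores with spaces
    # and wrap the words in a generic sentence.
    words = label.replace("_", " ").strip()
    return f"This is a sound of {words}."


# The augmented sentences, not the raw labels, are then given to the model.
prompts = [augment_label(name) for name in ["dog", "crying_baby"]]
print(prompts)
```

Hand-written sentences tend to carry more context than a generic template, so the table is consulted first and the template is only a fallback.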

Confusion matrix

Accuracy improved by more than 10%, and the confusion matrix looks cleaner.

t-SNE visualization of the ESC-50 dataset + labels

t-SNE visualization

A few experimental results on the UrbanSound8K dataset

Running the model on 2000 samples of the UrbanSound8K dataset takes about 35 minutes on a GTX 1060.

Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top 1 accuracy)

Confusion matrix

Confusion matrix of the model over the UrbanSound8K dataset (2000 samples, augmented labels, top 3 accuracy)

Confusion matrix
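The two matrices above differ only in the accuracy criterion: with top-k accuracy, a prediction counts as correct when the true class is among the k highest-scoring labels. A minimal sketch, assuming the per-class scores (for example audio-text similarities) are already available as plain lists; `top_k_accuracy` and the toy scores are illustrative, not taken from the repository:

```python
def top_k_accuracy(scores, targets, k=1):
    """Fraction of samples whose true class is among the k best-scoring classes.

    scores:  list of per-sample lists, one score per class (higher is better)
    targets: list of true class indices, one per sample
    """
    if not scores:
        return 0.0
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy scores for 3 samples and 3 classes (made up for illustration).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
targets = [1, 1, 0]
print(top_k_accuracy(scores, targets, k=1))
print(top_k_accuracy(scores, targets, k=3))
```

Top-3 accuracy is always at least as high as top-1 accuracy, which is worth keeping in mind when comparing the two matrices.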

t-SNE visualization of the UrbanSound8K dataset + labels

t-SNE visualization

A few experimental results on the FMA-Small dataset

The accuracy on the FMA-Small dataset is very low; we think this may be due to poor labels. We tried augmenting the labels, but this did not improve the accuracy by much.

t-SNE visualization of the FMA-Small dataset + labels

t-SNE visualization

Some clusters are visible, but the labels are not very accurate. The embeddings are nevertheless suitable for sound retrieval.

A few experimental results on the AudioSet dataset

Confusion matrix of the model over the AudioSet dataset (~600 samples, augmented labels, top 1 accuracy)

Confusion matrix

t-SNE visualization of the AudioSet dataset + labels

t-SNE visualization

dl_2023_clap's People

Contributors

jonathanlys01, jovillios

dl_2023_clap's Issues

Add Live music retrieval

Add a "live" script that instantiates the model, loads it, and waits for user input (text) to retrieve sounds or music from cached features for fast inference. It should output the paths to the matching sounds or play them directly (check for audio output on remote machines).
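The retrieval step of such a script might look like the sketch below. It assumes the audio embeddings have already been computed and cached as a mapping from file path to vector, and that the text query has been embedded by the model. The function names, file paths, and two-dimensional vectors are hypothetical placeholders, and cosine similarity is an assumed ranking criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, cache, k=3):
    """Return the paths of the k cached sounds closest to the query."""
    ranked = sorted(
        cache.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [path for path, _ in ranked[:k]]


# Placeholder cache: in the real script these vectors would be the
# precomputed audio features, and the query would come from the text input.
cache = {
    "sounds/a.wav": [1.0, 0.0],
    "sounds/b.wav": [0.0, 1.0],
    "sounds/c.wav": [0.7, 0.7],
}
print(retrieve([1.0, 0.1], cache, k=2))
```

Because the audio features are cached, each query only costs one text embedding and one pass over the cache, which is what keeps the loop interactive.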
