This repository collects code and data for disentangling the latent space of the sound-event recognition model Yamnet into materials and actions. The data consists of 60K one-second sound snippets, each labeled with the material and action involved in producing the sound. Yamnet is a 14-layer convolutional neural network that maps a sound's spectrogram to 521 common-sense auditory event classes.
This code has been tested on Windows, Linux and Mac. Setup succeeded on x86 architectures, but ARM architectures produced too many version conflicts. If you are using an Apple-silicon Mac, you are therefore advised to work in a different environment, e.g. Google Colab. Make sure you are using Python 3.9.x or 3.10.x. It is also recommended to have no more than the basic Python packages installed, to prevent version conflicts with the packages installed here. In your terminal, enter the root directory of the downloaded repository and execute the line below. In Colab, start a new code cell with a percent sign (%) and paste the line below directly after it. Then restart your code editor if you are on your local machine, or the notebook's kernel if you are in Colab.
pip install .
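Before installing, it may help to confirm that the active interpreter falls within the tested range. The snippet below is a minimal sketch using only the standard library; the helper name `is_supported` is hypothetical and not part of this repository.

```python
import sys

# Versions named above as tested: Python 3.9.x and 3.10.x.
SUPPORTED_MINORS = {(3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is one of the tested ones.

    Hypothetical helper for illustration; not part of the repository.
    """
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    if is_supported():
        print(f"Python {current} is within the tested range.")
    else:
        print(f"Python {current} is outside the tested range (3.9.x or 3.10.x).")
```

The same check can be run in a Colab code cell to see which Python version the runtime provides.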
The pre-processing code first passes the sounds through Yamnet to obtain the latent representation at each layer. It then uses principal component analysis to reduce the dimensionality of these representations. Researchers who are interested in verifying or adjusting this part of the code can use the Preprocess.ipynb file as a starting point. This notebook walks the user through downloading the GitHub repository, installing dependencies, running the pre-processing scripts and saving the results. It can, for instance, be opened in Google Colab.
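The dimensionality-reduction step can be illustrated with a small NumPy sketch. It is not the repository's pre-processing code: the function name `pca_reduce`, the synthetic array standing in for one layer's activations, and the choice of 10 components are all assumptions made for illustration.

```python
import numpy as np


def pca_reduce(activations, n_components):
    """Project the rows of `activations` onto their top principal components.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    # Center each feature so the SVD captures variance, not the mean offset.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(activations) - 1)
    return centered @ components.T, explained_variance


rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "sounds" x 64 flattened activations for one layer.
layer_activations = rng.normal(size=(200, 64))
reduced, variance = pca_reduce(layer_activations, n_components=10)
print(reduced.shape)  # (200, 10)
```

In practice the same reduction would be applied separately to the activations of each layer, since the layers differ in size.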
Researchers who are interested in verifying or adjusting the actual analysis can skip the pre-processing and use the pre-processed data stored in this GitHub repository. The main analysis is demonstrated in Process.ipynb. It covers the classification of the sounds into materials and actions at each of Yamnet's latent layers, as well as the disentanglement of the latent space using an invertible flow model. The latent representations of sounds are then perturbed, and the resulting systematic changes in Yamnet's output are demonstrated. The notebook provides numerous statistical tests and figures for these analyses.
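The idea of classifying sounds at each latent layer can be sketched as a simple per-layer probe. The example below is only an illustration under stated assumptions: it uses a nearest-centroid classifier as a stand-in (the notebook's actual classifier may differ), the layer names and three classes are hypothetical, and the features are synthetic, with class separation chosen arbitrarily and not reflecting any result from this repository.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on the training set and score held-out accuracy."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test point to every centroid.
    distances = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = classes[distances.argmin(axis=1)]
    return float((predictions == test_y).mean())


rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)  # three hypothetical material classes
# Synthetic per-layer features; the separation values are arbitrary assumptions.
layers = {
    f"layer_{i}": rng.normal(size=(300, 8)) + separation * labels[:, None]
    for i, separation in enumerate([0.0, 0.5, 3.0])
}

accuracy_per_layer = {}
for name, features in layers.items():
    # First 200 rows train the probe, the remaining 100 are held out.
    accuracy_per_layer[name] = nearest_centroid_accuracy(
        features[:200], labels[:200], features[200:], labels[200:]
    )
    print(name, round(accuracy_per_layer[name], 2))
```

Comparing the held-out accuracy across layers indicates where in the network a given attribute becomes linearly separable, which is the kind of question a per-layer analysis addresses.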