
Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".

License: BSD 3-Clause "New" or "Revised" License

Python 87.45% Shell 12.55%
audio audio-processing audio-classification speech-classification self-supervised-learning


SSAST: Self-Supervised Audio Spectrogram Transformer

News

March, 2022: We released a new preprint CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification, where we proposed a knowledge distillation based method to further improve the AST model performance without changing its architecture. This method can be applied in the fine-tuning stage of SSAST.

Feb 2022: I will present SSAST at AAAI 2022 at 12:00 PM - 1:45 PM (EST) on February 25th and then 7:45 PM - 9:30 PM (EST) on February 27th.

Introduction

Illustration of AST.

This repository contains the official implementation (in PyTorch) of the Self-Supervised Audio Spectrogram Transformer (SSAST) proposed in the AAAI 2022 paper SSAST: Self-Supervised Audio Spectrogram Transformer (Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass; MIT CSAIL). [Slides]

SSAST is the first patch-based joint discriminative and generative self-supervised learning framework, and also the first self-supervised learning framework for AST. SSAST significantly boosts AST performance on all downstream tasks we evaluated, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. SSAST can be used as a drop-in replacement for the previous ImageNet (supervised) pretrained AST, and has the advantages of 1) using no labeled data; 2) supporting flexible patch sizes and shapes, whereas ImageNet pretraining only supports square patches; and 3) better performance on many tasks, in particular speech tasks.

Citing

Please cite our paper if you find this repository useful.

@inproceedings{gong2022ssast,
  title={SSAST: Self-Supervised Audio Spectrogram Transformer},
  author={Gong, Yuan and Lai, Cheng-I and Chung, Yu-An and Glass, James},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={36},
  number={10},
  pages={10699--10709},
  year={2022}
}
@inproceedings{gong21b_interspeech,
  author={Yuan Gong and Yu-An Chung and James Glass},
  title={{AST: Audio Spectrogram Transformer}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={571--575},
  doi={10.21437/Interspeech.2021-698}
}

Getting Started

Prepare the Environment

Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

cd ssast/ 
python3 -m venv venvssast
source venvssast/bin/activate
pip install -r requirements.txt 

Where is the code?

The SSAST model and pretraining function code is in src/models/ast_models.py.
The self-supervised pretraining scripts are src/pretrain/{run_mask_{frame,patch}, run_mask_{frame,patch}_tiny}.sh, which call src/run.py, which then calls src/traintest_mask.py, which then calls src/models/ast_models.py.
The fine-tuning scripts are in src/finetune/; for the PSLA experiments, these scripts call src/run.py, which then calls src/traintest.py, which then calls src/models/ast_models.py.
The data preparation samples are in src/prep_data.

SSAST Model

The SSAST model script is in src/models/ast_models.py.

ASTModel(label_dim=527,
         fshape=16, tshape=16, fstride=10, tstride=10,
         input_fdim=128, input_tdim=1024, model_size='base',
         pretrain_stage=True, load_pretrained_mdl_path=None)

Parameters:

label_dim : The number of classes; only needs to be specified in the fine-tuning stage.
fshape: The side length of the patch on the frequency dimension.
tshape: The side length of the patch on the time dimension.
fstride: The stride of patch splitting on the frequency dimension; for 16*16 patches, fstride=16 means no overlap and fstride=10 means an overlap of 6.
tstride: The stride of patch splitting on the time dimension; for 16*16 patches, tstride=16 means no overlap and tstride=10 means an overlap of 6.
input_fdim: The number of frequency bins of the input spectrogram.
input_tdim: The number of time frames of the input spectrogram.
model_size: The model size of AST, should be one of [tiny, small, base] (default: base).
pretrain_stage: Set to True in the self-supervised pretraining stage and False in the fine-tuning stage.
load_pretrained_mdl_path: The path of the pretrained model used for fine-tuning. Only needed when pretrain_stage=False.

Methods:

forward(x, task, cluster=True, mask_patch=400)
The entry method of the class that calls fine-tuning and pretraining methods. Parameters:

  • x: the input spectrogram in shape [batch_size, time_frame_num, frequency_bin_num]. Note: the input spectrogram should be normalized with the dataset mean and std, see here.
  • task: the pretraining or fine-tuning task, should be one of [ft_avgtok, ft_cls, pretrain_mpc, pretrain_mpg], see below for details.
  • cluster: set to True to use the cluster patch masking strategy.
  • mask_patch: the number of patches to mask, only needed in the pretraining stage.

finetuningavgtok(x): fine-tune the model using the average of the outputs of all tokens as the clip representation. Returns a tensor of shape [batch_size, label_dim].

finetuningcls(x): fine-tune the model using the output of the cls token as the clip representation. Returns a tensor of shape [batch_size, label_dim].

mpc(x, mask_patch=mask_patch, cluster=cluster): pretrain the model with mask_patch masked patches using the discriminative objective. Returns the accuracy and the NCE loss of the pretext task.

mpg(x, mask_patch=mask_patch, cluster=cluster): pretrain the model with mask_patch masked patches using the generative objective. Returns the mean squared error of the pretext task.

Example:

import torch
from ast_models import ASTModel  # assumes src/models is on your path; adjust the import to your setup

# pretraining stage
# suppose you have an unlabeled dataset with avg length of 1024 frames (i.e., 10.24s)
input_tdim = 1024
# create a 16*16 patch based AST model for pretraining.
# note, we don't use patch split overlap in pretraining, so fstride=fshape and tstride=tshape
ast_mdl = ASTModel(
             fshape=16, tshape=16, fstride=16, tstride=16,
             input_fdim=128, input_tdim=input_tdim, model_size='base',
             pretrain_stage=True)
# # alternatively, create a frame based AST model
# ast_mdl = ASTModel(
#              fshape=128, tshape=2, fstride=128, tstride=2,
#              input_fdim=128, input_tdim=input_tdim, model_size='base',
#              pretrain_stage=True)

# do pretraining, see src/traintest_mask.py for our full pretraining code
# input in shape [batch_size, input_tdim, input_fdim]
test_input = torch.zeros([10, input_tdim, 128])
# mask 100 patches for both discriminative and generative loss
acc, nce_loss = ast_mdl(test_input, task='pretrain_mpc', mask_patch=100)
mse_loss = ast_mdl(test_input, task='pretrain_mpg', mask_patch=100)
loss = nce_loss + 10 * mse_loss
# do back propagate and update the model, etc
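# a minimal sketch of the update step, assuming a plain Adam optimizer;
# the official pretraining loop is in src/traintest_mask.py
optimizer = torch.optim.Adam(ast_mdl.parameters(), lr=1e-4)
optimizer.zero_grad()
loss.backward()
optimizer.step()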

# after pretraining, save the pretrained model.
# the code is designed for a torch.nn.DataParallel model
ast_mdl = torch.nn.DataParallel(ast_mdl)
torch.save(ast_mdl.state_dict(), './test_mdl.pth')

# fine-tuning stage
# now you have a labeled dataset on which you want to fine-tune AST
# suppose the avg length is 100 frames (1s) and there are 35 classes
# the fshape and tshape must be the same in pretraining and fine-tuning,
# but fstride and tstride can be different in pretraining and fine-tuning
# using smaller strides improves the performance but also increases the computational overhead
# set pretrain_stage to False since we are now in the fine-tuning stage
# and provide the path of the pretrained model you want to load
input_tdim = 100  # the fine-tuning data length can be different from the pretraining data length
ast_mdl = ASTModel(label_dim=35,
             fshape=16, tshape=16, fstride=10, tstride=10,
             input_fdim=128, input_tdim=input_tdim, model_size='base',
             pretrain_stage=False, load_pretrained_mdl_path='./test_mdl.pth')
# # alternatively, use a frame based AST model
# ast_mdl = ASTModel(label_dim=35,
#              fshape=128, tshape=2, fstride=128, tstride=1,
#              input_fdim=128, input_tdim=input_tdim, model_size='base',
#              pretrain_stage=False, load_pretrained_mdl_path='./test_mdl.pth')

# do finetuning, see src/traintest.py for our finetuning code
test_input = torch.zeros([10, input_tdim, 128])
prediction = ast_mdl(test_input, task='ft_avgtok')
# the output should be in shape [batch_size, label_dim]
print(prediction.shape)
# calculate the loss, do back propagate, etc
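# a minimal sketch of the loss computation, assuming single-label classification with
# cross-entropy and integer targets; the official fine-tuning loop is in src/traintest.py
criterion = torch.nn.CrossEntropyLoss()
dummy_labels = torch.randint(0, 35, [10])  # hypothetical targets for the 35 classes
loss = criterion(prediction, dummy_labels)
loss.backward()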

Data Preparation

For both pretraining and fine-tuning, our dataloader requires two files:

  • A json file containing the path of each audio file and its corresponding label.
    • Self-supervised pretraining does not need any labels, but our current version of dataloader.py needs label information to run, so you need to use a dummy label for the pretraining data. Below is an example json file.
 {
    "data": [
        {
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--4gqARaEJE_0.000.flac",
            "labels": "/m/068hy,/m/07q6cd_,/m/0bt9lr,/m/0jbk"
        },
        {
            "wav": "/data/sls/audioset/data/audio/eval/_/_/--BfvyPmVMo_20.000.flac",
            "labels": "/m/03l9g"
        },
      // ... many audio files
        {
            "wav": "/data/sls/audioset/data/audio/eval/_/0/-0BIyqJj9ZU_30.000.flac",
            "labels": "/m/07rgt08,/m/07sq110,/t/dd00001"
        }
    ]
}
  • A csv file containing label information. The labels should be consistent with those in the json file.
    • Again, even for self-supervised pretraining, a dummy csv file is needed.
index,mid,display_name
0,/m/07rwj00,"dog"
1,/m/07rwj01,"rooster"
2,/m/07rwj02,"pig"
...
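
For illustration, below is a minimal sketch (not one of our official prep scripts) of how such a json/csv pair with a single dummy label could be generated for self-supervised pretraining; the directory and file names are hypothetical.

import csv
import json
import os

wav_dir = '/data/my_unlabeled_audio'  # hypothetical folder of audio files
entries = [{'wav': os.path.join(wav_dir, f), 'labels': '/m/dummy'}
           for f in sorted(os.listdir(wav_dir)) if f.endswith(('.wav', '.flac'))]
with open('pretrain_data.json', 'w') as fp:
    json.dump({'data': entries}, fp, indent=1)

with open('dummy_label.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['index', 'mid', 'display_name'])
    writer.writerow([0, '/m/dummy', 'dummy'])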

Examples: we provide our scripts to prepare data for a set of datasets.

  • Librispeech: We have librispeech preparation code in src/prep_data/librispeech/prep_librispeech.py.
  • AudioSet: You will need to download and process the AudioSet data by yourself, as AudioSet consists of YouTube videos; please see here.
  • FSD50K: FSD50K is not used in the paper, but it is AudioSet-like, so the AudioSet recipe can be adapted for it.
  • ESC-50: in src/prep_data/esc50/prep_esc.py
  • Speechcommands V2-35: in src/prep_data/speechcommands/prep_sc.py
  • Combining multiple datasets: see src/prep_data/mix_pretraining_data for our code to combine librispeech and AudioSet (used in the paper).

Self-Supervised Pretraining

Reproduce our experiments

The pretraining scripts are in src/pretrain/; we provide scripts to pretrain tiny/base and patch-based/frame-based AST models. The one we use for our main model in the paper is src/pretrain/run_mask_patch.sh. The scripts were tested on 4 GTX TITAN GPUs with 12GB memory. Please prepare the data as described in Data Preparation.

Pretrain on custom dataset

First, prepare the data files (the json and csv file) as described in Data Preparation.
Second, modify our pretraining scripts in src/pretrain/. Unless you have a resource constraint, it is always better to pretrain a base model than a tiny/small model. If your downstream task is speech, we suggest training a frame-based SSAST (i.e., follow run_mask_frame.sh); if the downstream task is audio, we suggest training a patch-based SSAST (i.e., follow run_mask_patch.sh). It is also good to train and compare both.

In src/pretrain/run_mask_{frame,patch}.sh, basically the only things that need to be changed are the data-related variables:

# your data json files
tr_data=/data/sls/scratch/yuangong/sslast2/src/prep_data/audioset_librispeech.json
te_data=/data/sls/scratch/yuangong/audioset/datafiles/eval_data.json
# normalization stats, i.e., the mean and std of the entire dataset.
# if your custom dataset is also speech/audio, it is fine to use the same norm stats as ours.
# check https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py
dataset_mean=-4.2677393
dataset_std=4.5689974
# audio length in frames; the dataloader cuts/pads all audios to this length
target_length=1024
# the number of frequency bins of your spectrogram. 
# if you want to train a frame-based SSAST, you need to set fshape equal to num_mel_bins
num_mel_bins=128
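
If you need to compute your own normalization stats, the snippet below is a minimal sketch (following the idea of the get_norm_stats.py script linked above, not a copy of it), assuming 16 kHz mono audio listed in your training json; the json path is hypothetical.

import json
import torch
import torchaudio

wav_list = [d['wav'] for d in json.load(open('pretrain_data.json'))['data']]  # hypothetical path
feats = []
for wav in wav_list[:2000]:  # a subset is usually enough for a stable estimate
    waveform, sr = torchaudio.load(wav)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
    feats.append(fbank)
feats = torch.cat(feats, dim=0)
print('dataset_mean=%f, dataset_std=%f' % (feats.mean().item(), feats.std().item()))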

Fine-tuning on Downstream Tasks

PSLA training pipeline experiments

  • ESC-50: We suggest starting from the ESC-50 experiments, as our recipe is almost one-click (i.e., the script handles data downloading, data processing, pretrained model downloading, training, and evaluation). Check src/finetune/esc50/{run_esc_patch, run_esc_frame}.sh for fine-tuning patch-based and frame-based SSAST, respectively. To run, just cd src/finetune/esc50 and then sbatch run_esc_{patch,frame}.sh (slurm users) or ./run_esc_{patch,frame}.sh (local users).
  • Speech Commands V2-35: Check src/finetune/speechcommands_v2_35/run_sc.sh. It is also one-click and handles everything. Just cd src/finetune/speechcommands_v2_35, then sbatch run_sc.sh (slurm users) or ./run_sc.sh (local users).
  • AudioSet: Check src/finetune/audioset/run_as.sh. Since AudioSet consists of YouTube videos, you will need to prepare the data yourself. Note that our experiment uses the label-enhanced balanced AudioSet; see the PSLA training pipeline for how we enhance the labels.

SUPERB training pipeline experiments

Note: The SUPERB package has changed since our experiments. In the latest version, the new get_downsample_rate is not friendly to patch-based AST, as patch-based AST does not process the spectrogram in a frame-by-frame manner. If you want to reproduce our experiments on patch-based AST, please download the old version of SUPERB; or, if you are only interested in frame-based AST (which performs better on speech tasks), you can use the latest version of SUPERB without problems.

We provide everything needed to reproduce our SUPERB experiments. The scripts are in src/finetune/superb/.

First, download and install the SUPERB package [old, works for both patch-based and frame-based SSAST] [latest, only works for frame-based SSAST].

cd yoursuperbpath/s3prl-master/
pip install -e ./

Second, modify the paths in src/finetune/unstream/ast/hubconf.py to the absolute path of your pretrained SSAST model; you can use our pretrained model.

Then, copy our src/finetune/unstream/ast to the SUPERB upstream directory:

cp -r src/finetune/unstream/ast yoursuperbpath/s3prl-master/s3prl/upstream/

Third, change the dataset path in src/finetune/superb/{speechcommands_v1, iemocap, voxceleb}/{config_ks.yaml, config_er.yaml, config_sid.yaml}.

Then, copy the downstream code and configuration to the SUPERB directory, e.g., for the speech commands task:

cp src/finetune/superb/speechcommands_v1/{run_ks.sh,config.yaml} yoursuperbpath/s3prl-master/s3prl/

Finally, run the training script:

cd yoursuperbpath/s3prl-master/s3prl/
# for local user
./run_ks.sh
# or, for slurm user
sbatch run_ks.sh

You can find the result logs in yoursuperbpath/s3prl-master/s3prl/exp/expname/log.log

Fine-tune on custom dataset

It is easy to fine-tune on a new dataset. In general, the PSLA training pipeline is stronger. You can start from any of the AudioSet, ESC-50, and Speech Commands recipes (run_sc.sh, run_esc_{frame,patch}.sh, run_as.sh) and search the hyper-parameters. The only thing you need to modify is the shell script. For speech tasks, we suggest fine-tuning a frame-based SSAST; for audio tasks, we suggest fine-tuning a patch-based SSAST.

Pretrained-Models

We provide the following self-supervised pretrained models. All models are trained on full AudioSet + Librispeech unless otherwise indicated. Click the model name to download. The tiny models should be able to be pretrained and fine-tuned on an 8GB GPU with a reasonable batch size.

For best performance, you should use one of the following two models: patch-based AST is better for audio tasks, while frame-based AST is better for speech tasks.

| Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
|---|---|---|---|---|---|---|---|
| SSAST-Base-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Base (89M) | 59.9 | 79.5 |
| SSAST-Base-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Base (89M) | 57.6 | 84.0 |

The following models do not have the best performance; we release them for analysis purposes and for low-resource devices.

| Model Name | Data | Pretrain fshape | Pretrain tshape | #Masked Patches | Model Size | Avg Audio Performance | Avg Speech Performance |
|---|---|---|---|---|---|---|---|
| SSAST-Base-Patch-250 | AudioSet + Librispeech | 16 | 16 | 250 | Base (89M) | 58.6 | 79.5 |
| SSAST-Base-Frame-250 | AudioSet + Librispeech | 128 | 2 | 250 | Base (89M) | 55.6 | 81.6 |
| SSAST-Small-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Small (23M) | 58.1 | 78.2 |
| SSAST-Tiny-Patch-400 | AudioSet + Librispeech | 16 | 16 | 400 | Tiny (6M) | 53.3 | 75.7 |
| SSAST-Tiny-Frame-400 | AudioSet + Librispeech | 128 | 2 | 400 | Tiny (6M) | 47.8 | untested |

The following models are used in our ablation study in Table 2 of the paper. They do not have the best performance; we release them for analysis purposes only.

Ablation Study Table

We use the 16x16 patch-based AST pretrained with 400 masked patches and the joint discriminative and generative objectives on both AudioSet-2M and Librispeech as the base model, and then change one factor at a time.

| ID in Ablation Study | Model | Download |
|---|---|---|
| 1 | 100 Masked Patches | Link |
| 2 | Only Discriminative Objective | Link |
| 3 | Only Generative Objective | Link |
| 4 | Pretrained w/ AudioSet-20K | Link |
| 5 | Pretrained w/ AudioSet-2M | Link |
| 6 | Pretrained Librispeech960 | Link |

The above links are Dropbox direct download links (i.e., wget works); no Dropbox registration or sign-in is needed. If you don't have access to Dropbox, use a VPN, or use the OneDrive links or the Tencent Weiyun links.

Contact

If you have a question, please open an issue (preferred) or send me an email at [email protected].


ssast's Issues

Which python version are you using?

Which python version are you using?
Because with python 3.7 I have:

#0 147.0 Collecting torch==1.9.0
#0 147.0   Downloading torch-1.9.0-cp37-cp37m-manylinux2014_aarch64.whl (49.1 MB)
#0 195.3      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.1/49.1 MB 1.6 MB/s eta 0:00:00
#0 195.6 ERROR: Ignored the following versions that require a different python version: 0.40.0 Requires-Python >=3.8; 0.40.0rc1 Requires-Python >=3.8; 0.40.1 Requires-Python >=3.8; 0.40.1rc1 Requires-Python >=3.8; 0.41.0 Requires-Python >=3.8; 0.41.0rc1 Requires-Python >=3.8; 0.57.0 Requires-Python >=3.8; 0.57.0rc1 Requires-Python >=3.8; 0.57.1 Requires-Python >=3.8; 0.57.1rc1 Requires-Python >=3.8; 0.58.0 Requires-Python >=3.8; 0.58.0rc1 Requires-Python >=3.8; 0.58.0rc2 Requires-Python >=3.8; 1.1.0 Requires-Python >=3.8; 1.1.1 Requires-Python >=3.8; 1.1.2 Requires-Python >=3.8; 1.1.3 Requires-Python >=3.8; 1.10.0 Requires-Python <3.12,>=3.8; 1.10.0rc1 Requires-Python <3.12,>=3.8; 1.10.0rc2 Requires-Python <3.12,>=3.8; 1.10.1 Requires-Python <3.12,>=3.8; 1.11.0 Requires-Python <3.13,>=3.9; 1.11.0rc1 Requires-Python <3.13,>=3.9; 1.11.0rc2 Requires-Python <3.13,>=3.9; 1.11.1 Requires-Python <3.13,>=3.9; 1.11.2 Requires-Python <3.13,>=3.9; 1.11.3 Requires-Python <3.13,>=3.9; 1.2.0 Requires-Python >=3.8; 1.2.0rc1 Requires-Python >=3.8; 1.2.1 Requires-Python >=3.8; 1.2.2 Requires-Python >=3.8; 1.22.0 Requires-Python >=3.8; 1.22.1 Requires-Python >=3.8; 1.22.2 Requires-Python >=3.8; 1.22.3 Requires-Python >=3.8; 1.22.4 Requires-Python >=3.8; 1.23.0 Requires-Python >=3.8; 1.23.0rc1 Requires-Python >=3.8; 1.23.0rc2 Requires-Python >=3.8; 1.23.0rc3 Requires-Python >=3.8; 1.23.1 Requires-Python >=3.8; 1.23.2 Requires-Python >=3.8; 1.23.3 Requires-Python >=3.8; 1.23.4 Requires-Python >=3.8; 1.23.5 Requires-Python >=3.8; 1.24.0 Requires-Python >=3.8; 1.24.0rc1 Requires-Python >=3.8; 1.24.0rc2 Requires-Python >=3.8; 1.24.1 Requires-Python >=3.8; 1.24.2 Requires-Python >=3.8; 1.24.3 Requires-Python >=3.8; 1.24.4 Requires-Python >=3.8; 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 1.3.0 Requires-Python >=3.8; 1.3.0rc1 Requires-Python >=3.8; 1.3.1 Requires-Python >=3.8; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11; 1.9.0 Requires-Python >=3.8,<3.12; 1.9.0rc1 Requires-Python >=3.8,<3.12; 1.9.0rc2 Requires-Python >=3.8,<3.12; 1.9.0rc3 Requires-Python >=3.8,<3.12; 1.9.1 Requires-Python >=3.8,<3.12; 1.9.2 Requires-Python >=3.8; 1.9.3 Requires-Python >=3.8; 3.6.0 Requires-Python >=3.8; 3.6.0rc1 Requires-Python >=3.8; 3.6.0rc2 Requires-Python >=3.8; 3.6.1 Requires-Python >=3.8; 3.6.2 Requires-Python >=3.8; 3.6.3 Requires-Python >=3.8; 3.7.0 Requires-Python >=3.8; 3.7.0rc1 Requires-Python >=3.8; 3.7.1 Requires-Python >=3.8; 3.7.2 Requires-Python >=3.8; 3.7.3 Requires-Python >=3.8; 3.8.0 Requires-Python >=3.9; 3.8.0rc1 Requires-Python >=3.9
#0 195.6 ERROR: Could not find a version that satisfies the requirement torchaudio==0.9.0 (from versions: 0.10.0, 0.10.1, 0.10.2, 0.11.0, 0.12.0, 0.12.1, 0.13.0, 0.13.1)
#0 195.6 ERROR: No matching distribution found for torchaudio==0.9.0
#0 195.9 
#0 195.9 [notice] A new release of pip is available: 23.0.1 -> 23.2.1
#0 195.9 [notice] To update, run: pip install --upgrade pip
------
Dockerfile:11
--------------------
   9 |     
  10 |     # Install the required packages specified in requirements.txt
  11 | >>> RUN pip install -r requirements.txt
  12 |     
  13 |     # Expose port 80 to allow communication with the container
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -r requirements.txt" did not complete successfully: exit code: 1

With python 3.8 I have:

#0 155.6   Downloading torch-1.9.0-cp38-cp38-manylinux2014_aarch64.whl (49.0 MB)
#0 194.1      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.0/49.0 MB 1.2 MB/s eta 0:00:00
#0 194.4 ERROR: Ignored the following versions that require a different python version: 1.11.0 Requires-Python <3.13,>=3.9; 1.11.0rc1 Requires-Python <3.13,>=3.9; 1.11.0rc2 Requires-Python <3.13,>=3.9; 1.11.1 Requires-Python <3.13,>=3.9; 1.11.2 Requires-Python <3.13,>=3.9; 1.11.3 Requires-Python <3.13,>=3.9; 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 3.8.0 Requires-Python >=3.9; 3.8.0rc1 Requires-Python >=3.9
#0 194.4 ERROR: Could not find a version that satisfies the requirement torchaudio==0.9.0 (from versions: 0.10.0, 0.10.1, 0.10.2, 0.11.0, 0.12.0, 0.12.1, 0.13.0, 0.13.1, 2.0.0, 2.0.1, 2.0.2, 2.1.0)
#0 194.4 ERROR: No matching distribution found for torchaudio==0.9.0
#0 194.7 
#0 194.7 [notice] A new release of pip is available: 23.0.1 -> 23.2.1
#0 194.7 [notice] To update, run: pip install --upgrade pip
------
Dockerfile:11
--------------------
   9 |     
  10 |     # Install the required packages specified in requirements.txt
  11 | >>> RUN pip install -r requirements.txt
  12 |     
  13 |     # Expose port 80 to allow communication with the container
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -r requirements.txt" did not complete successfully: exit code: 1

And 3.9:

#0 99.26 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9
#0 99.26 ERROR: Could not find a version that satisfies the requirement torchaudio==0.9.0 (from versions: 0.10.0, 0.10.1, 0.10.2, 0.11.0, 0.12.0, 0.12.1, 0.13.0, 0.13.1, 2.0.0, 2.0.1, 2.0.2, 2.1.0)
#0 99.26 ERROR: No matching distribution found for torchaudio==0.9.0
#0 99.53 
#0 99.53 [notice] A new release of pip is available: 23.0.1 -> 23.2.1
#0 99.53 [notice] To update, run: pip install --upgrade pip
------
Dockerfile:11
--------------------
   9 |     
  10 |     # Install the required packages specified in requirements.txt
  11 | >>> RUN pip install -r requirements.txt
  12 |     
  13 |     # Expose port 80 to allow communication with the container
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -r requirements.txt" did not complete successfully: exit code: 1

Are the model weights loaded into the ASTModel with stochasticity?

I have a best_audio_model.pth file that is the output of a pretrained SSAST model. My specific .pth file can be found here for reproducibility. Note that this is the result of a tiny model fit to 5 of the classes found in the ESC-50 data set; I chose 5 classes to keep the computation and plotting simple.

The following code (MWE) reveals that the weights are loaded differently in two model initializations, despite the parameters and data being the same. Specifically, the mlp_head.1.weight (and potentially more) appear to be different.

import torch
import os
import ast
import pickle
import sys
import time
sys.path.insert(0, 'ssast/src/models')
from ast_models import ASTModel
os.chdir('ssast/src')
import dataloader
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json

## to minimize stochasticity in model evaluation
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## initialize the model (ensure parameters match those used during training)
model_1 = ASTModel(label_dim = 5,fshape=16, tshape=16, fstride=16, tstride=16,
                       input_fdim=128, input_tdim=512, model_size='tiny', pretrain_stage=False, load_pretrained_mdl_path='./best_audio_model.pth')

## set the model to evaluation mode
model_1.eval()

model_2 = ASTModel(label_dim = 5,fshape=16, tshape=16, fstride=16, tstride=16,
                       input_fdim=128, input_tdim=512, model_size='tiny', pretrain_stage=False, load_pretrained_mdl_path='./best_audio_model.pth')

## set the model to evaluation mode
model_2.eval()

###
### Model Comparison
###

def are_models_identical(model_1, model_2):
    model_1_state_dict = model_1.state_dict()
    model_2_state_dict = model_2.state_dict()

    for p in model_1_state_dict:
        if p not in model_2_state_dict:
            print(f"Missing parameter {p} in model 2")
            return False
        if not torch.equal(model_1_state_dict[p], model_2_state_dict[p]):
            print(f"Difference found in parameter: {p}")
            return False
    
    # Check if model_2 has extra parameters not in model_1
    for p in model_2_state_dict:
        if p not in model_1_state_dict:
            print(f"Extra parameter {p} in model 2 not found in model 1")
            return False

    return True

# Example usage
identical = are_models_identical(model_1, model_2)
print(f"Models are {'identical' if identical else 'different'}")

print("Model 1 'mlp_head.1.weight' mean:", model_1.state_dict()['mlp_head.1.weight'].mean())
print("Model 1 'mlp_head.1.weight' std:", model_1.state_dict()['mlp_head.1.weight'].std())

print("Model 2 'mlp_head.1.weight' mean:", model_2.state_dict()['mlp_head.1.weight'].mean())
print("Model 2 'mlp_head.1.weight' std:", model_2.state_dict()['mlp_head.1.weight'].std())

Forgive me if my assumption is incorrect, but I would assume that the weights should be identical for model_1 and model_2. Additionally, I would assume that model_1(inputs, 'ft_avgtok') and model_2(inputs, 'ft_avgtok') would give the same output for identical inputs, which they do not. Is there something simple that I'm missing here?

Once again, thank you for all of your help, Yuan. I greatly appreciate it and am sorry to keep bothering you.

sed

Hello, can this project be used for sound event detection, i.e., take variable-length audio as input and output frame-level detection results for each chunk?

Potential bug/sub-optimal implementation of final output activation and losses

Dear Yuan and authors,

Love the paper + repo. I am currently implementing SSAST as an ablation study in an audio classification paper which I am working on for the Alan Turing Institute and UK Gov. I have come across a potential small issue with the choice of combination of final activation function at inference time and the loss used at training time - please correct me if I am wrong.

Here we can see the compatible pairing of cross-entropy loss (softmax + negative log-likelihood) and binary cross-entropy loss (sigmoid + binary cross-entropy):

ssast/src/traintest.py

Lines 102 to 105 in 888bb6a

if args.loss == 'BCE':
    loss_fn = nn.BCEWithLogitsLoss()
elif args.loss == 'CE':
    loss_fn = nn.CrossEntropyLoss()

This is all fine and the training scheme is not an issue. However, at inference time a sigmoid activation is applied over the output dims (#classes) irrespective of whether cross-entropy or binary cross-entropy was the chosen loss at training time:

ssast/src/traintest.py

Lines 310 to 323 in 888bb6a

# compute output
audio_output = audio_model(audio_input, args.task)
audio_output = torch.sigmoid(audio_output)
predictions = audio_output.to('cpu').detach()
A_predictions.append(predictions)
A_targets.append(labels)
# compute the loss
labels = labels.to(device)
if isinstance(args.loss_fn, torch.nn.CrossEntropyLoss):
    loss = args.loss_fn(audio_output, torch.argmax(labels.long(), axis=1))
else:
    loss = args.loss_fn(audio_output, labels)

I have 2 (minor) issues with this.

  1. Cross-entropy loss is for multiclass classification and therefore normalises the output (with softmax) to have a probability mass of 1 over all the classes, of which one is chosen as the prediction. However, at inference time, the outputs of the network are fed through a sigmoid activation, assuming a Bernoulli distribution over each class. This means that the probabilities over all classes at inference time do not sum to 1. In my eyes this is training as if the problem is multiclass but evaluating as if the scheme is multilabel. I believe this will limit performance. This scheme is of course fine when the loss is binary cross-entropy.
  2. At inference time, when the loss is calculated, a sigmoid activation is applied to the outputs before they are passed to the loss function. This is effectively applying sigmoid then softmax for the cross-entropy loss, and sigmoid twice for the binary cross-entropy. Again, as this is at inference time, it will not affect the training of the model, but it will affect the reported inference-time loss.

Do you agree with these points? If so, a simple change like the following should suffice:

# compute output
audio_output = audio_model(audio_input, args.task)
audio_output_for_loss = audio_output
if isinstance(args.loss_fn, torch.nn.CrossEntropyLoss):
    audio_output = torch.nn.functional.softmax(audio_output, dim=-1)
else:
    audio_output = torch.sigmoid(audio_output)
predictions = audio_output.to('cpu').detach()
A_predictions.append(predictions)
A_targets.append(labels)
# compute the loss
labels = labels.to(device)
if isinstance(args.loss_fn, torch.nn.CrossEntropyLoss):
    loss = args.loss_fn(audio_output_for_loss, torch.argmax(labels.long(), axis=1))
else:
    loss = args.loss_fn(audio_output_for_loss, labels)

Many thanks in advance for your time!

fine-tuning sample rate

When using pre-trained models for fine-tuning, should the fine-tuning training set have a specific sample rate, like 16 kHz?

Model fails to converge on transfer to audio backtesting problem

Dear Yuan and authors,
First of all, thank you for your paper. Recently, I transferred your pre-trained model to a regression task for personality computing. After appending several fully connected layers to your original model, the predicted values stay within a very small interval at a low level during training and never change meaningfully. Have you done any related regression experiments? What are the possible reasons for this problem?
Sorry to bother you with my question, and thank you very much for reading it.

yang

Librispeech960 Pretrained Model

Hey, I'm curious why you don't have a Librispeech960-pretrained frame-based model. I saw you were recommending frame-based models. Do you have a frame-based model pretrained on Librispeech?

Dataset mean / stdev normalization

Hi Yuan!

Thanks again for this next iteration of the model - it was the improvement that I was hoping for in our task!

I have a quick question regarding normalization. You mention in your AST paper:

We also normalize the input audio spectrogram so that the dataset mean and standard deviation are 0 and 0.5, respectively.

The same scheme is also used here, and I wonder, since this is done at the dataloader level and the whole fine-tuning follows, how much of an impact does it have on the final performance? Is it mostly for the sake of speeding up convergence in training?

I am creating some versions of the model that may be used in systems with periodic retraining, and I am trying to decide whether I should care about recalculating the statistics every time we train (or even at all). My fine-tuning that used the values from ESC-50 performed really well nevertheless.

Cheers,
Michał

Trouble with pure inference

Hello!

Firstly, thanks for this great work!

I managed to modify the AudioSet fine-tuning script and fine-tuned a model on a new audio binary classification task. I started with the "Tiny" Patch model and used a batch size of 2. The resulting predictions on the evaluation set looked very promising!

I'm now trying to write an inference script that takes the saved model and performs inference, and I'm running into some trouble. Which method do I actually need to call for pure inference? The documentation seems to describe only pretraining and fine-tuning, not inference.

More pressing, I can't actually get the model to load. I am trying to load the best_audio_model.pth as follows:

input_tdim = 1024
ast_mdl = ASTModel(label_dim=2,
                   fshape=16,
                   tshape=16,
                   fstride=10,
                   tstride=10,
                   input_fdim=128,
                   input_tdim=input_tdim,
                   model_size='tiny',
                   pretrain_stage=False,
                   load_pretrained_mdl_path=MODEL)

however this results in the errors:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[/content/ssast/src/models/ast_models.py](https://localhost:8080/#) in __init__(self, label_dim, fshape, tshape, fstride, tstride, input_fdim, input_tdim, model_size, pretrain_stage, load_pretrained_mdl_path)
    146                 p_fshape, p_tshape = sd['module.v.patch_embed.proj.weight'].shape[2], sd['module.v.patch_embed.proj.weight'].shape[3]
--> 147                 p_input_fdim, p_input_tdim = sd['module.p_input_fdim'].item(), sd['module.p_input_tdim'].item()
    148             except:

KeyError: 'module.p_input_fdim'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
1 frames
[/content/ssast/src/models/ast_models.py](https://localhost:8080/#) in __init__(self, label_dim, fshape, tshape, fstride, tstride, input_fdim, input_tdim, model_size, pretrain_stage, load_pretrained_mdl_path)
    147                 p_input_fdim, p_input_tdim = sd['module.p_input_fdim'].item(), sd['module.p_input_tdim'].item()
    148             except:
--> 149                 raise  ValueError('The model loaded is not from a torch.nn.Dataparallel object. Wrap it with torch.nn.Dataparallel and try again.')
    150 
    151             print('now load a SSL pretrained models from ' + load_pretrained_mdl_path)

ValueError: The model loaded is not from a torch.nn.Dataparallel object. Wrap it with torch.nn.Dataparallel and try again.

Is there anything obvious I'm missing or doing wrong? Would appreciate any guidance on how to load this model, and also perform an inference on a new .wav file. Thanks!

I've been stuck on the first epoch in a tiny model for 8+ hours using 300 5s spectrograms for training. Is this normal?

I appreciate your patience with my questions/bugs in my previous post. I assume the answer to this is much quicker; perhaps I'm specifying an argument incorrectly.

I am trying to run a tiny patch-based model on my laptop (cpu: Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz; gpu: Intel UHD Graphics 630). I have been in the first epoch for 8+ hours now, which does not seem right. My run.sh file is below for reference. I have switched from a base to a tiny model, decreased the epoch iterations and batch sizes, and modified the mean, std, and target_length appropriately. Additionally, I have significantly reduced the amount of spectrograms in the training data set. The original ESC-50 has 1600 spectrograms; I am only using the first 180 to train.

run.sh:

#!/bin/bash
#SBATCH -p sm
#SBATCH -x sls-sm-1,sls-2080-[3],sls-1080-3,sls-sm-5
##SBATCH -p gpu
##SBATCH -x sls-titan-[0-2]
#SBATCH --gres=gpu:1
#SBATCH -c 4
#SBATCH -n 1
#SBATCH --mem=48000
#SBATCH --job-name="ssast_pretrain"
#SBATCH --output=./slurm_log/log_%j.txt

set -x
# comment this line if not running on sls cluster
#. /data/sls/scratch/share-201907/slstoolchainrc
source /data/sls/scratch/yuangong/sslast2/sslast2/bin/activate
export TORCH_HOME=../../pretrained_models
mkdir exp
mkdir slurm_log

task=pretrain_joint
mask_patch=400 # maybe reduce ??

# ESC-50
dataset=esc-50
tr_data=./src/prep_data/esc50/data/datafiles/esc_train_data_1_reduced.json
te_data=./src/prep_data/esc50/data/datafiles/esc_eval_data_1.json
dataset_mean=3.693319320678711
dataset_std=64.5123519897461
target_length=50 # for 5 seconds
num_mel_bins=128

model_size=tiny
# no patch split overlap
fshape=16
tshape=16
fstride=${fshape}
tstride=${tshape}
# no class balancing as it implicitly uses label information
bal=none
batch_size=10 # was 24
lr=1e-4
# learning rate decreases if the pretext task performance does not improve on the validation set
lr_patience=22
epoch=10
# no spectrogram masking
freqm=0
timem=0
# no mixup training
mixup=0

exp_dir=./exp/mask01-${model_size}-f${fshape}-t${tshape}-b$batch_size-lr${lr}-m${mask_patch}-${task}-${dataset}

## be sure to modify the label-csv file directory!!
CUDA_CACHE_DISABLE=1 python -W ignore ../run.py --dataset ${dataset} \
--data-train ${tr_data} --data-val ${te_data} --exp-dir $exp_dir \
--label-csv ./src/prep_data/esc50/esc_class_labels_indices.csv \
--lr $lr --n-epochs ${epoch} --batch-size $batch_size --save_model False \
--freqm $freqm --timem $timem --mixup ${mixup} --bal ${bal} \
--tstride $tstride --fstride $fstride --fshape ${fshape} --tshape ${tshape} \
--dataset_mean ${dataset_mean} --dataset_std ${dataset_std} --target_length ${target_length} --num_mel_bins ${num_mel_bins} \
--model_size ${model_size} --mask_patch ${mask_patch} --n-print-steps 100 \
--task ${task} --lr_patience ${lr_patience} --epoch_iter 50 # was 4000

In my terminal, I've been stuck on the following output:

Creating experiment directory: ./exp/mask01-tiny-f16-t16-b10-lr1e-4-m400-pretrain_joint-esc-50
Now starting self-supervised pretraining for 10 epochs
Now running on : cpu
Total parameter number is : 5.952594000 million
Total trainable parameter number is : 5.952592000 million
current #steps=0, #epochs=1
start training...
2024-02-11 09:09:01.792826
Another data point.
warm-up learning rate is 0.000000

I apologize for asking several questions in the span of a week and appreciate your commitment to open science. I'm hoping that this is my last post for you!

Loss curves

Hey, great work!
I just wanted to ask whether you might have the loss curves of your runs so that I can compare with my experiments a little bit?

about nce loss cal

Hi, I am wondering if dim should be 0 at

self.softmax = nn.Softmax(dim=-1)
self.lsoftmax = nn.LogSoftmax(dim=-1)

total = torch.mm(encode_samples[i], torch.transpose(pred[i], 0, 1)) # e.g. size 100*100
correct += torch.sum(torch.eq(torch.argmax(self.softmax(total), dim=0), torch.arange(0, mask_patch, device=x.device))) # correct is a tensor
nce += torch.sum(torch.diag(self.lsoftmax(total))) # nce is a tensor

total[j, i] is the inner product of x_j and c_i^T, so taking the softmax over the last dim of total normalizes with respect to c_i, not x_i, which is not consistent with Equation 1 of the paper.

AttributeError: 'ASTModel' object has no attribute 'unfold'

I'm relatively new to Python, so this may be a simple error to debug. However, I'm working through the example in the README to understand what the output of these models is like. My python version is 3.9.18 on macOS Sonoma 14.2.1 with M1 Max chip; I tried to get python 3.7.4, but that is not feasible with the M1 chip -- hopefully, 3.9.18 will suffice and is not the cause of this error.

Below is my MWE:

###
### Module Management
###

## needed to use ASTModel class
import sys
sys.path.insert(0, './')
from ast_models import *

###
### Example from Repo
###

# pretraining stage
# suppose you have an unlabeled dataset with avg length of 1024 frames (i.e., 10.24s)
input_tdim = 1024
# create a 16*16 patch based AST model for pretraining.
# note, we don't use patch split overlap in pretraining, so fstride=fshape and tstride=tshape
ast_mdl = ASTModel(
             fshape=16, tshape=16, fstride=16, tstride=16,
             input_fdim=128, input_tdim=input_tdim, model_size='base',
             pretrain_stage=True)
# # alternatively, create a frame based AST model
# ast_mdl = ASTModel(
#              fshape=128, tshape=2, fstride=128, tstride=2,
#              input_fdim=128, input_tdim=input_tdim, model_size='base',
#              pretrain=True)

# do pretraining, see src/traintest_mask.py for our full pretraining code
# input in shape [batch_size, input_tdim, input_fdim]
test_input = torch.zeros([10, input_tdim, 128])
# mask 100 patches for both discriminative and generative loss
acc, nce_loss = ast_mdl(test_input, task='pretrain_mpc', mask_patch=100)
mse_loss = ast_mdl(test_input, task='pretrain_mpg', mask_patch=100)
loss = nce_loss + 10 * mse_loss
# do back propagate and update the model, etc

# after pretraining, save the pretrained model.
# the code is designed for Dataparallel model
ast_mdl = torch.nn.DataParallel(ast_mdl)
torch.save(ast_mdl.state_dict(), './test_mdl.pth')

# fine-tuning stage
# now you have a labeled dataset you want to finetune AST on
# suppose the avg length is 100 frames (1s) and there are 35 classes
# the fshape and tshape must be same in pretraining and finetuning
# but fstride and tstride can be different in pretraining and finetuning
# using smaller strides improves the performance but also increase the computational overhead
# set pretrain_stage as False since now is in the finetuning stage
# provide the path of the pretrained model you want to load
input_tdim = 100  # fine-tuning data length can be different with pretraining data length
ast_mdl = ASTModel(label_dim=35,
             fshape=16, tshape=16, fstride=10, tstride=10,
             input_fdim=128, input_tdim=input_tdim, model_size='base',
             pretrain_stage=False, load_pretrained_mdl_path='./test_mdl.pth')
# # alternatively, use a frame based AST model
# ast_mdl = ASTModel(label_dim=35,
#              fshape=128, tshape=2, fstride=128, tstride=1,
#              input_fdim=128, input_tdim=input_tdim, model_size='base',
#              pretrain_stage=False, load_pretrained_mdl_path='./test_mdl.pth')

# do finetuning, see src/traintest.py for our finetuning code
test_input = torch.zeros([10, input_tdim, 128])
prediction = ast_mdl(test_input, task='ft_avgtok')
# output should in shape [batch_size, label_dim]
print(prediction.shape)
# calculate the loss, do back propagate, etc

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
test_input = torch.zeros([1, input_tdim, 128]).to(device)
acc, nce = ast_mdl(test_input, task='pretrain_mpc', mask_patch=100)
# you can visualize the mask
pred, masked = ast_mdl(test_input, task='visualize_mask', mask_patch=100)
plt.imshow(masked[0,0])
plt.show()

Here, both calls to ast_mdl() result in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/schwob/.pyenv/versions/3.9.18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/schwob/.pyenv/versions/3.9.18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/schwob/Desktop/ssast/src/models/ast_models.py", line 448, in forward
    return self.mpc(x, mask_patch=mask_patch, cluster=cluster)
  File "/Users/schwob/Desktop/ssast/src/models/ast_models.py", line 293, in mpc
    input = self.unfold(x).transpose(1, 2)
  File "/Users/schwob/.pyenv/versions/3.9.18/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'ASTModel' object has no attribute 'unfold'

The unfolding error only occurs once we define ast_mdl as a fine-tuning model.

Ultimately, my goal is to plot the probabilities of class assignment for the test input, so if there's a simpler way to do this, I am all ears.

Thank you for your time.

A question about ESC's AC

Hi, why is the ESC-50 accuracy around 80% in the SSAST paper, while around 95% was reported in the AST paper?

The model loaded is not from a torch.nn.Dataparallel object

Hi Yuan, I fine-tuned the SSAST model on my own data and am trying to load the model for demonstration purposes. For some reason, when I load the model path of the best_audio_model.pth that is generated after fine-tuning, I get the following error from ast_models.py:
ValueError: The model loaded is not from a torch.nn.Dataparallel object. Wrap it with torch.nn.Dataparallel and try again.
I'm not really sure why I am getting this error, as I am following the instructions shown in the example under the if __name__ == "__main__" statement. What am I missing? Shouldn't the created models already be wrapped with DataParallel, as seen in traintest.py?

Target length

Hi Yuan,

Thanks again for this great work; I have been using both this and the original AST model for some downstream tasks. I am currently looking into some other time-series data and was wondering if there was a particular reason you chose 10 seconds for the audio length during AudioSet pretraining. Why not 5 seconds, or 15? Did you consult any specific resources to reach this conclusion, or is it more arbitrary?

Thanks,
Karim

Embeddings without fine tuning

Dear @YuanGongND ,

Thanks for sharing the code for your work - which is awesome btw - and for the excellent documentation.

I modified your wrapper scripts as you suggested and ran pretraining on my own dataset of ~900 or so speakers and ~1500 or so samples.

However, I'm not really interested in proceeding with a fine-tuning stage and instead would like to extract the (mean-pooled?) final embedding for each sample, i.e., instead of predicting a label I just want the embedding.

Is this functionality relatively simple to hack onto your existing code?

Hugo

ssast pretrained solely on Audioset-2M

In the paper there is a section where you compare the model trained self-supervised only on Audioset-2M to the default model where you combine it with librispeech. Could you please kindly share the self-supervised pretrained model on Audioset-2M alone?

Thanks in advance.

performance and loss of the frame-based model

I also wanted to ask if you still have the result.csv from the frame-based base model trained on AudioSet-2M/Librispeech, i.e., mask01-base-f128-t2-b24-lr1e-4-m400-pretrain_joint-asli. If you still have it, we would be very thankful if you could upload it here. We are currently trying to reproduce the pretraining results, so that we can then build on them and fine-tune for speaker verification.

Best Regards,
Fabian

audio file length

Hi Yuan

We have another question: what was the length of the audio files you used? In the paper it is written that they are 10 seconds, but with 10 seconds the resulting spectrograms (from torchaudio.compliance.kaldi.fbank) are 998 frames (with frame_shift set to 10 ms and frame_length set to 25 ms), and thus the remaining 26 frames are zero-padded by the dataloader (if the target_length is set to 1024).


Best Regards,
Fabian

Extract audio representations for future use

Hi,

First of all, thank you all for the impressive work and for making the code and models available to the community. I would like to use the SSAST models to extract audio embeddings. Specifically, I'm interested in writing a script that accomplishes the following:

  1. Convert the waveform to a spectrogram using specific parameters.
  2. Perform a forward pass through the SSAST model to obtain embeddings for each "frame" or "patch". I would like to have N embeddings instead of just the average.

In previous issues, you pointed out some mean/variance normalization and a way to extract average-pooled tokens (e.g., commenting out the mlp head in ASTModel's forward). Do you have any suggestions about how to do points (1) and (2)?

Thanks
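
A minimal sketch of step (1), assuming 16 kHz mono audio, the torchaudio Kaldi fbank front end (10 ms frame shift, 128 mel bins) discussed elsewhere in these issues, and the AudioSet+Librispeech normalization stats from the pretraining scripts; the wav path is hypothetical.

import torch
import torchaudio

waveform, sr = torchaudio.load('example.wav')  # hypothetical 16 kHz mono file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
# normalize toward dataset mean 0 and std 0.5, using the AudioSet+Librispeech stats
fbank = (fbank - (-4.2677393)) / (4.5689974 * 2)
x = fbank.unsqueeze(0)  # [1, time_frame_num, 128], the shape expected by ASTModel.forward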

I want to use frame-level ssast just for frame-level audio token extraction


In your ast_models.py, cluster is set to True by default.

But to use frame-level SSAST, cluster should be False. Do I have to turn it off?

If I want to use your pretrained frame-level SSAST for audio token extraction, is the output of self.v.norm(x), excluding the first token, what I have to use in the finetuningcls function? Because the first one is the cls token.

One more thing I wonder: could I get the part of the fbank that corresponds to specific video frames? A mel spectrogram allows this, but I don't know whether the fbank features do.

Use SSAST pretrained model to inference

@YuanGongND I used the SSAST pretrained model for inference, but got different results every time, and every score in the results is close to the others. What is the reason for this?
[{"label": "Electric toothbrush", "score": 0.849498987197876}, {"label": "Blender", "score": 0.8397527933120728}, {"label": "Tambourine", "score": 0.8310427665710449}, {"label": "Race car, auto racing", "score": 0.8218237161636353}, {"label": "Pink noise", "score": 0.8042027354240417}, {"label": "Writing", "score": 0.7958802580833435}, {"label": "Singing", "score": 0.7875975966453552}, {"label": "Telephone dialing, DTMF", "score": 0.7849113941192627}, {"label": "Ambulance (siren)", "score": 0.7678646445274353}, {"label": "Country", "score": 0.7541956901550293}]

My code is as follows:

def model_fn(model_dir):
    """
    Load the model and set weights
    """

    # Load the model
    input_tdim = 200
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    checkpoint_path = f'{model_dir}/SSAST-Tiny-Patch-400.pth'

    # fstride, tstride = int(checkpoint_path.split('/')[-1].split('_')[1]), int(
    #     checkpoint_path.split('/')[-1].split('_')[2].split('.')[0])

    ast_mdl = ASTModel(label_dim=527, fshape=16, tshape=16, fstride=10, tstride=10, input_tdim=input_tdim,
                       model_size='tiny', pretrain_stage=False, load_pretrained_mdl_path=checkpoint_path)

    audio_model = torch.nn.DataParallel(ast_mdl)
    checkpoint = load_modified_checkpoint(checkpoint_path, audio_model, device)
    audio_model.load_state_dict(checkpoint)
    audio_model = audio_model.to(device)
    audio_model.eval()

    labels = load_label(f'{model_dir}/class_labels_indices.csv')

    return audio_model, labels

def predict_fn(input_data, model):
    """
    The predict_fn is invoked with the return value of input_fn.
    """
    audio_model, labels = model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    input_tdim = 200
    feats = make_features(input_data, mel_bins=128, target_length=input_tdim)
    feats_data = feats.expand(1, input_tdim, 128).to(device)

    with torch.no_grad():
        output = audio_model(feats_data, task='ft_cls')
        output = torch.sigmoid(output)

    result_output = output.data.cpu().numpy()[0]
    sorted_indexes = np.argsort(result_output)[::-1]

    top_k = 10
    top_k_labels = [(labels[idx], result_output[idx]) for idx in sorted_indexes[:top_k]]

    return top_k_labels

Host on Huggingface

Have you ever considered hosting the pre-trained SSAST model checkpoints on the Huggingface Hub?

SSAST for embedding model

Hi Yuan,
We're reaching out again regarding our Bachelor's thesis on speaker recognition. We're facing a challenge in implementing SSAST as an embedding model trained with a contrastive loss, such as triplet loss. Since speaker recognition poses an open-set problem, where the number of speaker classes isn't predetermined, we need to determine a suitable dimension for the embedding. Additionally, we are considering adjustments to the multilayer perceptron (MLP) head to accommodate this. During your work on SSAST, did you come across any insights that could lead to recommendations for us?
Thanks, Andrin

experiment setup

Hi,
Thank you for your nice work. May I know if the experiment setup is the same as in the AST paper, i.e., each experiment is run three times and the mean results are reported?

Data Loading Logs

Hello Yuan

Thanks for the great work and for open-sourcing everything!

I have a question about traintest_mask.py: shouldn't end_time be updated after the evaluation? I think the way it is currently implemented leads to a very large number for per_sample_data_time directly after the evaluation:


At first I thought it was because of caching or something similar, but I think it is because the evaluation time is also included in per_sample_data_time in the first iteration after the evaluation.

Best Regards,
Fabian

Stereo audio

Would this also work for stereo (i.e. 2 channel) audio?

I wonder how best to adapt the code to this (especially since the timm parts have been trimmed down from 3 channels to 1 channel anyway).
