
Official implementation of "Prefix Tuning for Automated Audio Captioning" (ICASSP 2023)

Home Page: https://prefixaac.github.io

audio-captioning deep-learning icassp2023 pytorch-implementation

prefix_aac_icassp2023's Introduction

About the source code

This repository contains a PyTorch implementation of the ICASSP 2023 paper "Prefix Tuning for Automated Audio Captioning".
[project page] [paper]

[Pipeline overview figure]


Model preparation

Downloading the audio encoder pre-trained on AudioSet

  1. Move to AAC_Prefix/PANNs
  2. Type in the command below
gdown https://drive.google.com/file/d/1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing --fuzzy

Downloading the pre-trained GPT2 from Huggingface

  1. Move to ClipCap_forAAC
  2. Type in the command below
gdown https://drive.google.com/file/d/15ASmIoWg0ac6qm0ixdiVwh88e8EA2MZ7/view?usp=share_link --fuzzy

  3. Unzip the zip file

Downloading the pre-trained model for this work

  1. Move to ClipCap_forAAC
  2. Type in the command below
gdown https://drive.google.com/file/d/1y2yeK7eO5DFY8n9l9QfiVRwv6GZLEnFA/view?usp=share_link --fuzzy
  3. Unzip the zip file

Dataset download

Download the Clotho dataset

  1. Move to Clotho/clotho_audio_files
  2. Type in the command below
gdown https://drive.google.com/file/d/1kOuZrOs1yuOwlOky7ZohVVeiVwYQg1V0/view?usp=sharing --fuzzy
  3. Unzip the zip file

Download the AudioCaps dataset

  1. Move to AudioCaps
  2. Type in the command below
gdown https://drive.google.com/file/d/15ODyZmXDu_gwl-GcgQ6i_dBIeLKPG5-S/view?usp=sharing --fuzzy
  3. Unzip the zip file


Download the audio caption evaluation tools

  1. Move to coco_caption
  2. Type in the command below
sh get_stanford_models.sh 
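The download steps in the sections above can be collected into one script. This is only a sketch, not part of the repository: the DRY_RUN guard and the use of gdown's -O flag (in place of the cd steps) are my additions, and it assumes gdown is installed (pip install gdown) and the target directories exist. Each downloaded zip still has to be unzipped in place, as the steps describe.

```shell
#!/bin/sh
# Dry-run sketch of the download steps above.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to download.
DRY_RUN=${DRY_RUN:-1}
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Audio encoder pre-trained on AudioSet -> AAC_Prefix/PANNs
run gdown -O AAC_Prefix/PANNs/ 'https://drive.google.com/file/d/1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing' --fuzzy

# Pre-trained GPT2 and the paper's models -> ClipCap_forAAC (unzip both after download)
run gdown -O ClipCap_forAAC/ 'https://drive.google.com/file/d/15ASmIoWg0ac6qm0ixdiVwh88e8EA2MZ7/view?usp=share_link' --fuzzy
run gdown -O ClipCap_forAAC/ 'https://drive.google.com/file/d/1y2yeK7eO5DFY8n9l9QfiVRwv6GZLEnFA/view?usp=share_link' --fuzzy

# Clotho audio -> Clotho/clotho_audio_files, AudioCaps -> AudioCaps (unzip after download)
run gdown -O Clotho/clotho_audio_files/ 'https://drive.google.com/file/d/1kOuZrOs1yuOwlOky7ZohVVeiVwYQg1V0/view?usp=sharing' --fuzzy
run gdown -O AudioCaps/ 'https://drive.google.com/file/d/15ODyZmXDu_gwl-GcgQ6i_dBIeLKPG5-S/view?usp=sharing' --fuzzy

# Stanford models for the caption evaluation tools (the README runs this from inside coco_caption)
run sh coco_caption/get_stanford_models.sh
```

Saved as, say, download_all.sh (a hypothetical name), it can be previewed with `sh download_all.sh` and executed for real with `DRY_RUN=0 sh download_all.sh`.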

Train the model

# If you are using GPT2 Tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> # AudioCaps&Clotho Dataset

# If you are using custom Tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> <vocab_size> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> <vocab_size> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> <vocab_size> # AudioCaps&Clotho Dataset

Evaluate the model

Update (2023-12-06): Please use Evaluation_network.ipynb for evaluation; the evaluation methods have been incorporated into that notebook.

# If you use gpt2 that was pre-trained by Huggingface
python3 Evaluation_AudioCaps.py <model_name> <epoch_number>
python3 Evaluation_Clotho.py <model_name> <epoch_number>

# If you use a custom tokenizer that was trained by us
python3 Evaluation_AudioCaps.py <model_name> <epoch_number> <vocab_size>
python3 Evaluation_Clotho.py <model_name> <epoch_number> <vocab_size>

Inference (generate captions using the models from Table 1 of the paper)

python3 Inference.py <table_num> <setting_num> <audio_file_path>

# table_num = 1 : Evaluation on Clotho
# table_num = 2 : Evaluation on AudioCaps

# setting_num = 1 : train dataset == test dataset
# setting_num = 2 : train dataset != test dataset
# setting_num = 3 : combined datasets (Clotho & AudioCaps); needs to be tested with compressed audio

# Example
python3 Inference.py 1 1 ./test.wav

Citation

@inproceedings{kim2023prefix,
  title={Prefix tuning for automated audio captioning},
  author={Kim, Minkyu and Sung-Bin, Kim and Oh, Tae-Hyun},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

prefix_aac_icassp2023's People

Contributors

minguinho26, sbkim052


prefix_aac_icassp2023's Issues

About Evaluation files

Hello, Nice work!!!
When replicating your work, I found that the evaluation files mentioned in the README do not exist:

Evaluation_AudioCaps.py and Evaluation_Clotho.py

Are these two files just using the eval_model() in Train.py, or something else?
If so, could you release these two files?

Thank you for your early reply!!!

About Prefix vector in ablation study

Hello, I have another question about replicating the ablation study on the mapping network.

    if self.temporal_prefix_length > 0:
        # temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature).view(-1, self.temporal_prefix_length, self.gpt_embedding_size)
        temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature).view(-1, self.temporal_prefix_length, self.gpt_embedding_size).to(self.device)
    elif self.global_prefix_length + self.temporal_prefix_length == 0:
        temporal_feature = temporal_feature.permute(0, 2, 1, 3).contiguous()
        temporal_feature = torch.reshape(temporal_feature, (temporal_feature.size()[0], temporal_feature.size()[1], -1))
        temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature)

    if self.global_prefix_length > 0:
        global_prefix_vector = self.global_mappingnetwork(global_feature).view(-1, self.global_prefix_length, self.gpt_embedding_size)
        global_prefix_vector = global_prefix_vector.to("cuda:1")
    elif self.global_prefix_length + self.temporal_prefix_length == 0:
        global_prefix_vector = self.global_mappingnetwork(global_feature)
        global_prefix_vector = global_prefix_vector.view(global_feature.size()[0], 11, 768)  # [bsize, 11, 768]
        global_prefix_vector = global_prefix_vector.to("cuda:1")

        

    if self.temporal_prefix_length > 0 and self.global_prefix_length == 0:
        prefix_vectors = temporal_prefix_vector
    elif self.temporal_prefix_length == 0 and self.global_prefix_length > 0:
        prefix_vectors = global_prefix_vector
    else:
        prefix_vectors = torch.cat(
            (temporal_prefix_vector, global_prefix_vector), dim=1)

If I set self.global_prefix_length > 0 and self.temporal_prefix_length = 0, does that mean the method with a mapping network and the global feature (and vice versa)?
And if I set self.global_prefix_length = 0 and self.temporal_prefix_length = 0, does that mean the method without mapping networks, i.e. the two features are fed into GPT2 directly?

Thank you for your early reply!
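As background for the question above: the torch.cat((temporal_prefix_vector, global_prefix_vector), dim=1) branch simply stacks the two prefixes along the sequence dimension before they reach GPT2. A shape-only sketch in numpy (the batch size and temporal prefix length are illustrative; the global length 11 and embedding size 768 are the values hard-coded in the quoted snippet):

```python
import numpy as np

batch, embed = 4, 768               # 768 = GPT2 embedding size used in the snippet
temporal_len, global_len = 15, 11   # temporal_len is illustrative; 11 is hard-coded above

temporal_prefix = np.zeros((batch, temporal_len, embed))  # [bsize, temporal_len, 768]
global_prefix = np.zeros((batch, global_len, embed))      # [bsize, 11, 768]

# torch.cat((temporal, global), dim=1) corresponds to concatenating on axis 1:
prefix_vectors = np.concatenate((temporal_prefix, global_prefix), axis=1)
print(prefix_vectors.shape)  # (4, 26, 768): prefixes stacked along the sequence axis
```

So whichever branches fire, the result handed to GPT2 is one prefix sequence whose length is the sum of the active prefix lengths.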

requirements

Could you share the requirements for the running environment (e.g. a requirements.txt)?

Pretrained Models for reproducibility

Hi,

First of all, your project is great and I am working with it. However, I am unable to reproduce the results reported in the ICASSP paper. It would be great if you could share your pre-trained models for reproducibility.

Thanks!

About pretrained model

Hello,
I would like to know: when using Huggingface's GPT2 model, is the .pt file you shared via
Google Drive from your own training, or does it correspond to the gpt2 weights on the Huggingface website?

Thank you for your early reply.

Can't find Inference.py

Thank you for sharing your codes.

However, I can't find Inference.py.
Could you upload this file?

ERROR in the Ablation study

I'm sorry for bothering you.
I ran into a problem when I set both temporal_prefix_size and global_prefix_size to 0; I understand this setting means there is no mapping network in the model.
The error message is: "the size of tensor a (88) must match the size of tensor b (31) at non-singleton dimension 3"

I do not know why this happened.
I tried to figure it out, but I failed.
I would appreciate your early response!
Thank you very much!
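The quoted error is a shape-broadcasting mismatch: two tensors disagree at a non-singleton dimension (88 vs 31), so they cannot be combined elementwise. A minimal numpy reproduction of the same class of failure (all shapes here are hypothetical except the mismatched sizes 88 and 31 from the error message; numpy raises ValueError where PyTorch raises RuntimeError):

```python
import numpy as np

# Hypothetical 4-D tensors whose trailing dimensions disagree: 88 vs 31.
scores = np.zeros((1, 12, 88, 88))
mask = np.zeros((1, 1, 1, 31))

try:
    scores + mask  # neither 88 nor 31 is 1, so broadcasting fails
except ValueError as err:
    print(err)
```

Such mismatches typically mean one tensor (e.g. an attention mask) was built for a different sequence length than the other, which is consistent with the prefix lengths changing in the ablation setting.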
