
Official implementation of "Prefix Tuning for Automated Audio Captioning" (ICASSP 2023)

Home Page: https://prefixaac.github.io

audio-captioning deep-learning icassp2023 pytorch-implementation

prefix_aac_icassp2023's Introduction

About the source code

This repository contains a PyTorch implementation of the ICASSP 2023 paper "Prefix Tuning for Automated Audio Captioning".
[project page] [paper]

[Pipeline overview figure]


Model preparation

Downloading the audio encoder pre-trained on AudioSet

  1. Move to AAC_Prefix/PANNs
  2. Type in the command below
gdown https://drive.google.com/file/d/1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing --fuzzy

Downloading the pre-trained GPT2 from Huggingface

  1. Move to ClipCap_forAAC
  2. Type in the command below
gdown https://drive.google.com/file/d/15ASmIoWg0ac6qm0ixdiVwh88e8EA2MZ7/view?usp=share_link --fuzzy

  3. Unzip the zip file

Downloading the pre-trained model for this work

  1. Move to ClipCap_forAAC
  2. Type in the command below
gdown https://drive.google.com/file/d/1y2yeK7eO5DFY8n9l9QfiVRwv6GZLEnFA/view?usp=share_link --fuzzy
  3. Unzip the zip file

Dataset download

Download the Clotho dataset

  1. Move to Clotho/clotho_audio_files
  2. Type in the command below
gdown https://drive.google.com/file/d/1kOuZrOs1yuOwlOky7ZohVVeiVwYQg1V0/view?usp=sharing --fuzzy
  3. Unzip the zip file

Download the AudioCaps dataset

  1. Move to AudioCaps
  2. Type in the command below
gdown https://drive.google.com/file/d/15ODyZmXDu_gwl-GcgQ6i_dBIeLKPG5-S/view?usp=sharing --fuzzy
  3. Unzip the zip file


Download the audio caption evaluation tools

  1. Move to coco_caption
  2. Type in the command below
sh get_stanford_models.sh 
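The download steps in the sections above can be collected into one script. This is only a sketch, not part of the repository: the DRY_RUN guard and the use of gdown's -O flag (in place of the cd steps) are my additions, and it assumes gdown is installed (pip install gdown) and the target directories exist. Each downloaded zip still has to be unzipped in place, as the steps describe.

```shell
#!/bin/sh
# Dry-run sketch of the download steps above.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to download.
DRY_RUN=${DRY_RUN:-1}
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Audio encoder pre-trained on AudioSet -> AAC_Prefix/PANNs
run gdown -O AAC_Prefix/PANNs/ 'https://drive.google.com/file/d/1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing' --fuzzy

# Pre-trained GPT2 and the paper's models -> ClipCap_forAAC (unzip both after download)
run gdown -O ClipCap_forAAC/ 'https://drive.google.com/file/d/15ASmIoWg0ac6qm0ixdiVwh88e8EA2MZ7/view?usp=share_link' --fuzzy
run gdown -O ClipCap_forAAC/ 'https://drive.google.com/file/d/1y2yeK7eO5DFY8n9l9QfiVRwv6GZLEnFA/view?usp=share_link' --fuzzy

# Clotho audio -> Clotho/clotho_audio_files, AudioCaps -> AudioCaps (unzip after download)
run gdown -O Clotho/clotho_audio_files/ 'https://drive.google.com/file/d/1kOuZrOs1yuOwlOky7ZohVVeiVwYQg1V0/view?usp=sharing' --fuzzy
run gdown -O AudioCaps/ 'https://drive.google.com/file/d/15ODyZmXDu_gwl-GcgQ6i_dBIeLKPG5-S/view?usp=sharing' --fuzzy

# Stanford models for the caption evaluation tools (the README runs this from inside coco_caption)
run sh coco_caption/get_stanford_models.sh
```

Saved as, say, download_all.sh (a hypothetical name), it can be previewed with `sh download_all.sh` and executed for real with `DRY_RUN=0 sh download_all.sh`.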

Train the model

# If you are using GPT2 Tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> # AudioCaps&Clotho Dataset

# If you are using custom Tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> <vocab_size> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> <vocab_size> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> <vocab_size> # AudioCaps&Clotho Dataset

Evaluate the model

Update (2023-12-06): Please use Evaluation_network.ipynb for evaluation; the evaluation methods have been incorporated into that notebook.

# If you use gpt2 that was pre-trained by Huggingface
python3 Evaluation_AudioCaps.py <model_name> <epoch_number>
python3 Evaluation_Clotho.py <model_name> <epoch_number>

# If you use a custom tokenizer that was trained by us
python3 Evaluation_AudioCaps.py <model_name> <epoch_number> <vocab_size>
python3 Evaluation_Clotho.py <model_name> <epoch_number> <vocab_size>

Inference (generate captions using the models from Table 1 of the paper)

python3 Inference.py <table_num> <setting_num> <audio_file_path>

# table_num = 1 : Evaluation on Clotho
# table_num = 2 : Evaluation on AudioCaps

# setting_num = 1 : train dataset == test dataset
# setting_num = 2 : train dataset != test dataset
# setting_num = 3 : combined datasets (Clotho & AudioCaps); needs to be tested with compressed audio

# Example
python3 Inference.py 1 1 ./test.wav

Citation

@inproceedings{kim2023prefix,
  title={Prefix tuning for automated audio captioning},
  author={Kim, Minkyu and Sung-Bin, Kim and Oh, Tae-Hyun},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

prefix_aac_icassp2023's People

Contributors

minguinho26, sbkim052


prefix_aac_icassp2023's Issues

About Evaluation files

Hello, Nice work!!!
When replicating your work, I found that the evaluation files mentioned in the README do not exist:

Evaluation_AudioCaps.py and Evaluation_Clotho.py

Are these two files just using the eval_model() in Train.py, or something else?
If so, could you release these two files?

Thank you for your early reply!!!

About Prefix vector in ablation study

Hello, I have another question about replicating the ablation study on the mapping network.

    if self.temporal_prefix_length > 0:
        # temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature).view(-1, self.temporal_prefix_length, self.gpt_embedding_size)
        temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature).view(-1, self.temporal_prefix_length, self.gpt_embedding_size).to(self.device)
    elif self.global_prefix_length + self.temporal_prefix_length == 0:
        temporal_feature = temporal_feature.permute(0, 2, 1, 3).contiguous()
        temporal_feature = torch.reshape(temporal_feature, (temporal_feature.size()[0], temporal_feature.size()[1], -1))
        temporal_prefix_vector = self.temporal_mappingnetwork(temporal_feature)

    if self.global_prefix_length > 0:
        global_prefix_vector = self.global_mappingnetwork(global_feature).view(-1, self.global_prefix_length, self.gpt_embedding_size)
        global_prefix_vector = global_prefix_vector.to("cuda:1")
    elif self.global_prefix_length + self.temporal_prefix_length == 0:
        global_prefix_vector = self.global_mappingnetwork(global_feature)
        global_prefix_vector = global_prefix_vector.view(global_feature.size()[0], 11, 768)  # [bsize, 11, 768]
        global_prefix_vector = global_prefix_vector.to("cuda:1")

        

    if self.temporal_prefix_length > 0 and self.global_prefix_length == 0:
        prefix_vectors = temporal_prefix_vector
    elif self.temporal_prefix_length == 0 and self.global_prefix_length > 0:
        prefix_vectors = global_prefix_vector
    else:
        prefix_vectors = torch.cat(
            (temporal_prefix_vector, global_prefix_vector), dim=1)

If I set self.global_prefix_length > 0 and self.temporal_prefix_length = 0, does that mean the method with a mapping network and the global feature (and vice versa)?
And if I set self.global_prefix_length = 0 and self.temporal_prefix_length = 0, does that mean the method without mapping networks, i.e. the two features are fed into GPT2 directly?

Thank you for your early reply!
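As background for the question above: the torch.cat((temporal_prefix_vector, global_prefix_vector), dim=1) branch simply stacks the two prefixes along the sequence dimension before they reach GPT2. A shape-only sketch in numpy (the batch size and temporal prefix length are illustrative; the global length 11 and embedding size 768 are the values hard-coded in the quoted snippet):

```python
import numpy as np

batch, embed = 4, 768               # 768 = GPT2 embedding size used in the snippet
temporal_len, global_len = 15, 11   # temporal_len is illustrative; 11 is hard-coded above

temporal_prefix = np.zeros((batch, temporal_len, embed))  # [bsize, temporal_len, 768]
global_prefix = np.zeros((batch, global_len, embed))      # [bsize, 11, 768]

# torch.cat((temporal, global), dim=1) corresponds to concatenating on axis 1:
prefix_vectors = np.concatenate((temporal_prefix, global_prefix), axis=1)
print(prefix_vectors.shape)  # (4, 26, 768): prefixes stacked along the sequence axis
```

So whichever branches fire, the result handed to GPT2 is one prefix sequence whose length is the sum of the active prefix lengths.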

requirements

Could you share the requirements for the running environment (e.g. a requirements.txt)?

Pretrained Models for reproducibility

Hi,

First of all, your project is great and I am working with it. However, I am unable to reproduce the results reported in the ICASSP paper. It would be great if you could share your pre-trained models for reproducibility.

Thanks!

About pretrained model

Hello,
I would like to know: when using Huggingface's GPT2 model, is the .pt file you shared via
Google Drive from your own training, or does it correspond to the gpt2 weights on the Huggingface website?

Thank you for your early reply.

Can't find Inference.py

Thank you for sharing your codes.

However, I can't find Inference.py.
Could you upload this file?

ERROR in the Ablation study

I'm sorry for bothering you.
I ran into a problem when I set both temporal_prefix_size and global_prefix_size to 0; I understand this setting means there is no mapping network in the model.
The error message is: "the size of tensor a (88) must match the size of tensor b (31) at non-singleton dimension 3"

I do not know why this happened.
I tried to figure it out, but I failed.
I would appreciate your early response!
Thank you very much!
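The quoted error is a shape-broadcasting mismatch: two tensors disagree at a non-singleton dimension (88 vs 31), so they cannot be combined elementwise. A minimal numpy reproduction of the same class of failure (all shapes here are hypothetical except the mismatched sizes 88 and 31 from the error message; numpy raises ValueError where PyTorch raises RuntimeError):

```python
import numpy as np

# Hypothetical 4-D tensors whose trailing dimensions disagree: 88 vs 31.
scores = np.zeros((1, 12, 88, 88))
mask = np.zeros((1, 1, 1, 31))

try:
    scores + mask  # neither 88 nor 31 is 1, so broadcasting fails
except ValueError as err:
    print(err)
```

Such mismatches typically mean one tensor (e.g. an attention mask) was built for a different sequence length than the other, which is consistent with the prefix lengths changing in the ablation setting.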
