
multimodal-end2end-sparse's Introduction

Multimodal End-to-End Sparse Model for Emotion Recognition

CC BY 4.0

[Paper] accepted at NAACL 2021:

Multimodal End-to-End Sparse Model for Emotion Recognition, by Wenliang Dai *, Samuel Cahyawijaya *, Zihan Liu, Pascale Fung.

Paper Abstract

Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline, first extracting feature representations for each single modality with hand-crafted algorithms and then performing end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain performance with around half the computation in the feature extraction part.

If your work is inspired by our paper or code, please cite it. Thanks!

@inproceedings{dai-etal-2021-multimodal,
    title = "Multimodal End-to-End Sparse Model for Emotion Recognition",
    author = "Dai, Wenliang  and
      Cahyawijaya, Samuel  and
      Liu, Zihan  and
      Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.417",
    doi = "10.18653/v1/2021.naacl-main.417",
    pages = "5305--5316",
    abstract = "Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline by first extracting feature representations for each single modality with hand crafted algorithms, and then performing end-to-end learning with extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extracting algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain the performance with around half less computation in the feature extraction part of the model.",
}

You can also check out our blog here 😊.

Dataset

As mentioned in our paper, one of our contributions is that we reorganize two datasets (IEMOCAP and CMU-MOSEI) to enable training from the raw data. To the best of our knowledge, prior to our work, papers using these two datasets were based on pre-extracted features, and we did not find a way to map those features back to the raw data. Therefore, we did a heavy reorganization of these datasets (refer to Section 3 of the paper for more details).

The raw data can be downloaded from CMU-MOSEI (~120GB) and IEMOCAP (~16.5GB). However, for IEMOCAP, you need to request permission from the original authors first; after that, we can give you the passcode for downloading.

We provide two Python scripts in the ./preprocessing folder as examples of processing the raw data. Alternatively, you can download our processed data for training directly, as shown in the section below.
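
For a rough picture of what such preprocessing involves, here is a minimal, illustrative sketch (not the repository's actual ./preprocessing scripts) that samples video frames at a fixed millisecond interval with OpenCV and computes a mel-spectrogram with torchaudio; the library choices, paths, and parameter values are assumptions:

# Illustrative sketch only -- not the repo's ./preprocessing scripts.
import cv2
import torchaudio

def sample_frames(video_path, interval_ms=500):
    """Keep roughly one frame every `interval_ms` milliseconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(fps * interval_ms / 1000.0)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def load_mel_spectrogram(wav_path):
    """Load a waveform and compute its mel-spectrogram."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)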

Preparation

Dataset

To run our code directly, you can download the processed data from here (88.6G). Unzip it, and the tree structure of the data directory looks like this:

./data
- IEMOCAP_HCF_FEATURES
- IEMOCAP_RAW_PROCESSED
- IEMOCAP_SPLIT
- MOSEI_RAW_PROCESSED
- MOSEI_HCF_FEATURES
- MOSEI_SPLIT
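
A quick sanity check that the unzipped archive matches this layout (assuming it was extracted to ./data, the path that would be passed via --datapath):

# Sanity check for the expected data layout; the ./data location is an assumption.
import os

expected = [
    "IEMOCAP_HCF_FEATURES", "IEMOCAP_RAW_PROCESSED", "IEMOCAP_SPLIT",
    "MOSEI_RAW_PROCESSED", "MOSEI_HCF_FEATURES", "MOSEI_SPLIT",
]
for name in expected:
    path = os.path.join("./data", name)
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")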

Environment

Example commands

Train the MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100

Train the sparse MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=2 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.8 --text-model-size=base --text-max-len=100

Baselines

LF_RNN

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=1 --model=lf_rnn --num-emotions=6 --hand-crafted --clip=2

LF_TRANSFORMER

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=0 --model=lf_transformer --num-emotions=6 --hand-crafted --clip=2

CLI

usage: main.py [-h] -bs BATCH_SIZE -lr LEARNING_RATE [-wd WEIGHT_DECAY] -ep
               EPOCHS [-es EARLY_STOP] [-cu CUDA] [-cl CLIP] [-sc] [-se SEED]
               [--loss LOSS] [--optim OPTIM] [--text-lr-factor TEXT_LR_FACTOR]
               [-mo MODEL] [--text-model-size TEXT_MODEL_SIZE]
               [--fusion FUSION] [--feature-dim FEATURE_DIM]
               [-st SPARSE_THRESHOLD] [-hfcs HFC_SIZES [HFC_SIZES ...]]
               [--trans-dim TRANS_DIM] [--trans-nlayers TRANS_NLAYERS]
               [--trans-nheads TRANS_NHEADS] [-aft AUDIO_FEATURE_TYPE]
               [--num-emotions NUM_EMOTIONS] [--img-interval IMG_INTERVAL]
               [--hand-crafted] [--text-max-len TEXT_MAX_LEN]
               [--datapath DATAPATH] [--dataset DATASET] [-mod MODALITIES]
               [--valid] [--test] [--ckpt CKPT] [--ckpt-mod CKPT_MOD]
               [-dr DROPOUT] [-nl NUM_LAYERS] [-hs HIDDEN_SIZE] [-bi] [--gru]

Multimodal End-to-End Sparse Model for Emotion Recognition

optional arguments:
  -h, --help            show this help message and exit
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size
  -lr LEARNING_RATE, --learning-rate LEARNING_RATE
                        Learning rate
  -wd WEIGHT_DECAY, --weight-decay WEIGHT_DECAY
                        Weight decay
  -ep EPOCHS, --epochs EPOCHS
                        Number of epochs
  -es EARLY_STOP, --early-stop EARLY_STOP
                        Early stop
  -cu CUDA, --cuda CUDA
                        CUDA device number
  -cl CLIP, --clip CLIP
                        Gradient clipping value
  -sc, --scheduler      Use a learning rate scheduler with the optimizer
  -se SEED, --seed SEED
                        Random seed
  --loss LOSS           loss function
  --optim OPTIM         optimizer function: adam/sgd
  --text-lr-factor TEXT_LR_FACTOR
                        Factor to scale the learning rate of the text model
  -mo MODEL, --model MODEL
                        Which model
  --text-model-size TEXT_MODEL_SIZE
                        Size of the pre-trained text model
  --fusion FUSION       How to fuse modalities
  --feature-dim FEATURE_DIM
                        Dimension of features output by each modality model
  -st SPARSE_THRESHOLD, --sparse-threshold SPARSE_THRESHOLD
                        Threshold of sparse CNN layers
  -hfcs HFC_SIZES [HFC_SIZES ...], --hfc-sizes HFC_SIZES [HFC_SIZES ...]
                        Hand crafted feature sizes
  --trans-dim TRANS_DIM
                        Dimension of the transformer after CNN
  --trans-nlayers TRANS_NLAYERS
                        Number of layers of the transformer after CNN
  --trans-nheads TRANS_NHEADS
                        Number of heads of the transformer after CNN
  -aft AUDIO_FEATURE_TYPE, --audio-feature-type AUDIO_FEATURE_TYPE
                        Hand crafted audio feature types
  --num-emotions NUM_EMOTIONS
                        Number of emotions in data
  --img-interval IMG_INTERVAL
                        Interval to sample image frames
  --hand-crafted        Use hand crafted features
  --text-max-len TEXT_MAX_LEN
                        Max length of text after tokenization
  --datapath DATAPATH   Path of data
  --dataset DATASET     Use which dataset
  -mod MODALITIES, --modalities MODALITIES
                        Which modalities to use
  --valid               Only run validation
  --test                Only run test
  --ckpt CKPT           Path of checkpoint
  --ckpt-mod CKPT_MOD   Load which modality of the checkpoint
  -dr DROPOUT, --dropout DROPOUT
                        dropout
  -nl NUM_LAYERS, --num-layers NUM_LAYERS
                        num of layers of LSTM
  -hs HIDDEN_SIZE, --hidden-size HIDDEN_SIZE
                        hidden vector size of LSTM
  -bi, --bidirectional  Use Bi-LSTM
  --gru                 Use GRU rather than LSTM
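
For reference, a minimal sketch of how a few of the options above pair their short and long flags in argparse; this is illustrative only, not the repository's actual main.py, and the default values here are copied from the example commands rather than taken from the real defaults:

# Illustrative argparse sketch; not the repo's actual CLI definition.
import argparse

parser = argparse.ArgumentParser(
    description="Multimodal End-to-End Sparse Model for Emotion Recognition")
parser.add_argument("-bs", "--batch-size", type=int, required=True, help="Batch size")
parser.add_argument("-lr", "--learning-rate", type=float, required=True, help="Learning rate")
parser.add_argument("-ep", "--epochs", type=int, required=True, help="Number of epochs")
parser.add_argument("-mod", "--modalities", type=str, default="tav",
                    help="Which modalities to use (presumably t=text, a=audio, v=visual)")
parser.add_argument("-st", "--sparse-threshold", type=float, default=0.8,
                    help="Threshold of sparse CNN layers")
parser.add_argument("-mo", "--model", type=str, default="mme2e", help="Which model")

args = parser.parse_args()
print(args.batch_size, args.learning_rate, args.modalities, args.model)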

multimodal-end2end-sparse's People

Contributors

samuelcahyawijaya, wenliangdai


multimodal-end2end-sparse's Issues

string indices must be integers

Hi, when I was running the code, I encountered the following error:
cls_feature = last_hidden_state[:, 0]
TypeError: string indices must be integers

I don't know the reason; I'm looking forward to your answer!
Thank you!

No positional encoding in transformer implementation

Thank you very much for the code you shared.

In the paper, you say that you use a Transformer to encode the sequential representation of the visual (and acoustic) sequences. However, I am confused that there is no positional encoding in the transformer_encoder.py file (the line "# inputs = self.pos_encoder(inputs)" is commented out). Could you clarify this?
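
For reference, a standard sinusoidal positional encoding module, of the kind such a pos_encoder usually implements, is sketched below (batch-first inputs assumed); whether this matches what the authors intended for transformer_encoder.py is an assumption:

# Standard sinusoidal positional encoding (illustrative; assumes batch-first inputs).
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]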

Hey, about your processed data

I was trying to run your code, but I have trouble downloading the preprocessed data: it is so large that downloading it from your link keeps failing. Could you please offer other download links, such as Baidu Netdisk? Thanks.

about the model implementation

Hi, thank you for the great work!
I have a question about the model names in the paper versus your implementation.
Do MME2E and MME2E_Sparse in the code correspond to FE2E and MESM, respectively?
Also, if that's the case, I wonder why FE2E works better than MESM (Tables 3 and 4 in the paper) even with less cross-modal interaction (as MME2E has no cross-modal operation other than the multimodal fusion in the final layer; is it perhaps because of the FLOPs?).
Thank you!

about visual data extraction

We got problems when extracting the visual data: the results we get for IEMOCAP are all zeros. The extraction works for other videos but fails for IEMOCAP. We tried downloading a fresh copy of the IEMOCAP dataset and got the same erroneous result. Could you help us fix this?

about MOSEI

Hello! The link to the raw CMU-MOSEI dataset [http://immortal.multicomp.cs.cmu.edu/raw_datasets/CMU_MOSEI.zip] is no longer available; could you please share it?

accuracy mismatch

I tried to run the code with the following arguments:

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100

The paper says that we should get an accuracy of 84.5%, but the best accuracy that I am getting is 76.9% on the IEMOCAP dataset. Could you please post the exact arguments to reproduce the results of the paper?

training error

Hello, I have a question about this code. After we tokenize the text transcription into a tensor, using the model 'mme2e' produces the error below. Thanks for your help.
Traceback (most recent call last):
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/main.py", line 155, in
trainer.train()
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/trainers/emotiontrainer.py", line 46, in train
train_stats, train_thresholds = self.train_one_epoch()
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/trainers/emotiontrainer.py", line 169, in train_one_epoch
logits = self.model(imgs, imgLens, specgrams, specgramLens, text)
File "/home5/HouSJ/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/models/e2e.py", line 90, in forward
text_cls = self.T(text, get_cls=True)
File "/home5/HouSJ/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/models/e2e_t.py", line 16, in forward
logits, hidden_states = self.albert(**text, output_hidden_states=True)
TypeError: AlbertModel object argument after ** must be a mapping, not Tensor
Train: 0%| | 0/646 [00:00<?, ?it/s]
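
For reference, the error indicates that the `text` reaching self.albert(**text, ...) is a plain tensor, whereas a HuggingFace model unpacked with ** expects the tokenizer's mapping output (input_ids, attention_mask, ...). A minimal sketch of the expected input format, with the checkpoint name being an assumption:

# Sketch of the expected text input: a dict-like BatchEncoding, not a raw tensor.
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")  # checkpoint name assumed
model = AlbertModel.from_pretrained("albert-base-v2")

sentences = ["I am so happy today", "this is frustrating"]
text = tokenizer(sentences, padding=True, truncation=True,
                 max_length=100, return_tensors="pt")  # a mapping, so **text works

outputs = model(**text, output_hidden_states=True)
cls_feature = outputs.last_hidden_state[:, 0]  # features at the [CLS] position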

can't reproduce the paper's performance

Hi, Thank you very much for the code you shared.

I tried to run the code with the following arguments:

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.7 --text-model-size=base --text-max-len=100 --dataset="iemocap"

The paper says that we should get an accuracy of 84.4%, but the best accuracy that I am getting is 83.0% on the IEMOCAP dataset. Could you please post the exact arguments to reproduce the results of the paper? How can I get the paper's results?

(I think "p" equals "st" argument. So I set "st=0.7")

ACC            ang    exc    fru    hap    neu    sad    average
paper (p=0.7)  88.2   88.3   74.9   89.5   77.0   88.6   84.4
mine  (p=0.7)  88.8   80.6   77.9   90.4   74.5   86.0   83.0
mine  (p=0.9)  88.3   87.5   76.7   90.2   72.1   90.6   84.2

best regards,
Juyeon Kim

OneDrive password

Dear professor, I want to download the IEMOCAP dataset that you shared on OneDrive. It seems to need a password; could you send it to me? Thank you.

Questions about Evaluation

There are two functions (eval_mosei_emo, eval_iemocap) in '/utils/evaluate.py'. My question is: what is the difference between these two functions? I see that both use weighted accuracy, but in the paper you report Acc and WAcc separately for both the IEMOCAP and MOSEI datasets. In the code you provided, do you use eval_iemocap by default for both datasets as long as the loss is BCE?
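
For reference, one common definition of weighted accuracy in this line of work is balanced accuracy over the positive and negative class of each emotion; whether this exactly matches the repository's eval_iemocap / eval_mosei_emo is an assumption:

# One common (assumed) definition of per-emotion weighted accuracy.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """y_true, y_pred: binary {0, 1} arrays for a single emotion class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    p = max(np.sum(y_true == 1), 1)  # avoid division by zero
    n = max(np.sum(y_true == 0), 1)
    return 0.5 * (tp / p + tn / n)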
