
multimodal-end2end-sparse's Introduction

Multimodal End-to-End Sparse Model for Emotion Recognition

CC BY 4.0

[Paper] accepted at NAACL 2021:

Multimodal End-to-End Sparse Model for Emotion Recognition, by Wenliang Dai *, Samuel Cahyawijaya *, Zihan Liu, Pascale Fung.

Paper Abstract

Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline, first extracting feature representations for each single modality with hand-crafted algorithms and then performing end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain performance with around half the computation in the feature extraction part.

If your work is inspired by our paper or code, please cite it. Thanks!

@inproceedings{dai-etal-2021-multimodal,
    title = "Multimodal End-to-End Sparse Model for Emotion Recognition",
    author = "Dai, Wenliang  and
      Cahyawijaya, Samuel  and
      Liu, Zihan  and
      Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.417",
    doi = "10.18653/v1/2021.naacl-main.417",
    pages = "5305--5316",
    abstract = "Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline by first extracting feature representations for each single modality with hand crafted algorithms, and then performing end-to-end learning with extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extracting algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain the performance with around half less computation in the feature extraction part of the model.",
}

You can also check out our blog here 😊.

Dataset

As mentioned in our paper, one of our contributions is that we reorganize two datasets (IEMOCAP and CMU-MOSEI) to enable training from the raw data. To the best of our knowledge, prior to our work, papers using these two datasets were based on pre-extracted features, and we did not find a way to map those features back to the raw data. Therefore, we did a heavy reorganization of these datasets (refer to Section 3 of the paper for more details).

The raw data can be downloaded from CMU-MOSEI (~120GB) and IEMOCAP (~16.5GB). However, for IEMOCAP, you need to request permission from the original authors first; after that, we can give you the passcode for downloading.

We provide two Python scripts in the ./preprocessing folder as examples of processing the raw data. Alternatively, you can download our processed data for training directly, as shown in the section below.
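
For a rough picture of what such preprocessing involves, here is a minimal, illustrative sketch (not the repository's actual ./preprocessing scripts) that samples video frames at a fixed millisecond interval with OpenCV and computes a mel-spectrogram with torchaudio; the library choices, paths, and parameter values are assumptions:

# Illustrative sketch only -- not the repo's ./preprocessing scripts.
import cv2
import torchaudio

def sample_frames(video_path, interval_ms=500):
    """Keep roughly one frame every `interval_ms` milliseconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(fps * interval_ms / 1000.0)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def load_mel_spectrogram(wav_path):
    """Load a waveform and compute its mel-spectrogram."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)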

Preparation

Dataset

To run our code directly, you can download the processed data from here (88.6G). Unzip it, and the tree structure of the data directory looks like this:

./data
- IEMOCAP_HCF_FEATURES
- IEMOCAP_RAW_PROCESSED
- IEMOCAP_SPLIT
- MOSEI_RAW_PROCESSED
- MOSEI_HCF_FEATURES
- MOSEI_SPLIT
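
A quick sanity check that the unzipped archive matches this layout (assuming it was extracted to ./data, the path that would be passed via --datapath):

# Sanity check for the expected data layout; the ./data location is an assumption.
import os

expected = [
    "IEMOCAP_HCF_FEATURES", "IEMOCAP_RAW_PROCESSED", "IEMOCAP_SPLIT",
    "MOSEI_RAW_PROCESSED", "MOSEI_HCF_FEATURES", "MOSEI_SPLIT",
]
for name in expected:
    path = os.path.join("./data", name)
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")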

Environment

Example commands

Train the MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100

Train the sparse MME2E

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=2 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.8 --text-model-size=base --text-max-len=100

Baselines

LF_RNN

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=1 --model=lf_rnn --num-emotions=6 --hand-crafted --clip=2

LF_TRANSFORMER

python main.py -lr=5e-4 -ep=60 -mod=tav -bs=32 --early-stop=8 --loss=bce --cuda=0 --model=lf_transformer --num-emotions=6 --hand-crafted --clip=2

CLI

usage: main.py [-h] -bs BATCH_SIZE -lr LEARNING_RATE [-wd WEIGHT_DECAY] -ep
               EPOCHS [-es EARLY_STOP] [-cu CUDA] [-cl CLIP] [-sc] [-se SEED]
               [--loss LOSS] [--optim OPTIM] [--text-lr-factor TEXT_LR_FACTOR]
               [-mo MODEL] [--text-model-size TEXT_MODEL_SIZE]
               [--fusion FUSION] [--feature-dim FEATURE_DIM]
               [-st SPARSE_THRESHOLD] [-hfcs HFC_SIZES [HFC_SIZES ...]]
               [--trans-dim TRANS_DIM] [--trans-nlayers TRANS_NLAYERS]
               [--trans-nheads TRANS_NHEADS] [-aft AUDIO_FEATURE_TYPE]
               [--num-emotions NUM_EMOTIONS] [--img-interval IMG_INTERVAL]
               [--hand-crafted] [--text-max-len TEXT_MAX_LEN]
               [--datapath DATAPATH] [--dataset DATASET] [-mod MODALITIES]
               [--valid] [--test] [--ckpt CKPT] [--ckpt-mod CKPT_MOD]
               [-dr DROPOUT] [-nl NUM_LAYERS] [-hs HIDDEN_SIZE] [-bi] [--gru]

Multimodal End-to-End Sparse Model for Emotion Recognition

optional arguments:
  -h, --help            show this help message and exit
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size
  -lr LEARNING_RATE, --learning-rate LEARNING_RATE
                        Learning rate
  -wd WEIGHT_DECAY, --weight-decay WEIGHT_DECAY
                        Weight decay
  -ep EPOCHS, --epochs EPOCHS
                        Number of epochs
  -es EARLY_STOP, --early-stop EARLY_STOP
                        Early stop
  -cu CUDA, --cuda CUDA
                        CUDA device number
  -cl CLIP, --clip CLIP
                        Gradient clipping value
  -sc, --scheduler      Use a learning rate scheduler with the optimizer
  -se SEED, --seed SEED
                        Random seed
  --loss LOSS           loss function
  --optim OPTIM         optimizer function: adam/sgd
  --text-lr-factor TEXT_LR_FACTOR
                        Factor to scale the learning rate of the text model
  -mo MODEL, --model MODEL
                        Which model
  --text-model-size TEXT_MODEL_SIZE
                        Size of the pre-trained text model
  --fusion FUSION       How to fuse modalities
  --feature-dim FEATURE_DIM
                        Dimension of features output by each modality model
  -st SPARSE_THRESHOLD, --sparse-threshold SPARSE_THRESHOLD
                        Threshold of sparse CNN layers
  -hfcs HFC_SIZES [HFC_SIZES ...], --hfc-sizes HFC_SIZES [HFC_SIZES ...]
                        Hand crafted feature sizes
  --trans-dim TRANS_DIM
                        Dimension of the transformer after CNN
  --trans-nlayers TRANS_NLAYERS
                        Number of layers of the transformer after CNN
  --trans-nheads TRANS_NHEADS
                        Number of heads of the transformer after CNN
  -aft AUDIO_FEATURE_TYPE, --audio-feature-type AUDIO_FEATURE_TYPE
                        Hand crafted audio feature types
  --num-emotions NUM_EMOTIONS
                        Number of emotions in data
  --img-interval IMG_INTERVAL
                        Interval to sample image frames
  --hand-crafted        Use hand crafted features
  --text-max-len TEXT_MAX_LEN
                        Max length of text after tokenization
  --datapath DATAPATH   Path of data
  --dataset DATASET     Use which dataset
  -mod MODALITIES, --modalities MODALITIES
                        Which modalities to use
  --valid               Only run validation
  --test                Only run test
  --ckpt CKPT           Path of checkpoint
  --ckpt-mod CKPT_MOD   Load which modality of the checkpoint
  -dr DROPOUT, --dropout DROPOUT
                        dropout
  -nl NUM_LAYERS, --num-layers NUM_LAYERS
                        num of layers of LSTM
  -hs HIDDEN_SIZE, --hidden-size HIDDEN_SIZE
                        hidden vector size of LSTM
  -bi, --bidirectional  Use Bi-LSTM
  --gru                 Use GRU rather than LSTM
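
For reference, a minimal sketch of how a few of the options above pair their short and long flags in argparse; this is illustrative only, not the repository's actual main.py, and the default values here are copied from the example commands rather than taken from the real defaults:

# Illustrative argparse sketch; not the repo's actual CLI definition.
import argparse

parser = argparse.ArgumentParser(
    description="Multimodal End-to-End Sparse Model for Emotion Recognition")
parser.add_argument("-bs", "--batch-size", type=int, required=True, help="Batch size")
parser.add_argument("-lr", "--learning-rate", type=float, required=True, help="Learning rate")
parser.add_argument("-ep", "--epochs", type=int, required=True, help="Number of epochs")
parser.add_argument("-mod", "--modalities", type=str, default="tav",
                    help="Which modalities to use (presumably t=text, a=audio, v=visual)")
parser.add_argument("-st", "--sparse-threshold", type=float, default=0.8,
                    help="Threshold of sparse CNN layers")
parser.add_argument("-mo", "--model", type=str, default="mme2e", help="Which model")

args = parser.parse_args()
print(args.batch_size, args.learning_rate, args.modalities, args.model)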

multimodal-end2end-sparse's People

Contributors

samuelcahyawijaya, wenliangdai


multimodal-end2end-sparse's Issues

string indices must be integers

Hi, when I was running the code, I encountered the following error:
cls_feature = last_hidden_state[:, 0]
TypeError: string indices must be integers

I don't know the reason; I'm looking forward to your answer!
Thank you!

No positional encoding in transformer implementation

Thank you very much for the code you shared.

In the paper, you say that you use a Transformer to encode the sequential representation of the visual (and acoustic) sequences. However, I am confused that there is no positional encoding in the transformer_encoder.py file (the line "# inputs = self.pos_encoder(inputs)" is commented out). Could you clarify this?
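
For reference, a standard sinusoidal positional encoding module, of the kind such a pos_encoder usually implements, is sketched below (batch-first inputs assumed); whether this matches what the authors intended for transformer_encoder.py is an assumption:

# Standard sinusoidal positional encoding (illustrative; assumes batch-first inputs).
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]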

Hey, about your processed data

I was trying to run your code, but I have trouble downloading the preprocessed data: it is so large that downloading it from your link keeps failing. Could you please offer other download links, such as Baidu Netdisk? Thanks.

about the model implementation

Hi, thank you for the great work!
I have a question about the model names in the paper versus your implementation.
Do MME2E and MME2E_Sparse in the code correspond to FE2E and MESM, respectively?
Also, if that's the case, I wonder why FE2E works better than MESM (Tables 3 and 4 in the paper) even with less cross-modal interaction (as MME2E has no cross-modal operation other than the multimodal fusion in the final layer; is it perhaps because of the FLOPs?).
Thank you!

about visual data extraction

We got problems when extracting the visual data: the results we get for IEMOCAP are all zeros. The extraction works for other videos but fails for IEMOCAP. We tried downloading a fresh copy of the IEMOCAP dataset and got the same erroneous result. Could you help us fix this?

about MOSEI

Hello! The link to the raw CMU-MOSEI dataset [http://immortal.multicomp.cs.cmu.edu/raw_datasets/CMU_MOSEI.zip] is no longer available; could you please share it?

accuracy mismatch

I tried to run the code with the following arguments:

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 --text-model-size=base --text-max-len=100

The paper says that we should get an accuracy of 84.5%, but the best accuracy that I am getting is 76.9% on the IEMOCAP dataset. Could you please post the exact arguments to reproduce the results of the paper?

training error

Hello, I have a question about this code. After we tokenize the text transcription into a tensor, using the model 'mme2e' produces the error below. Thanks for your help.
Traceback (most recent call last):
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/main.py", line 155, in
trainer.train()
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/trainers/emotiontrainer.py", line 46, in train
train_stats, train_thresholds = self.train_one_epoch()
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/trainers/emotiontrainer.py", line 169, in train_one_epoch
logits = self.model(imgs, imgLens, specgrams, specgramLens, text)
File "/home5/HouSJ/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/models/e2e.py", line 90, in forward
text_cls = self.T(text, get_cls=True)
File "/home5/HouSJ/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home5/HouSJ/code/Multimodal-End2end-Sparse-main/src/models/e2e_t.py", line 16, in forward
logits, hidden_states = self.albert(**text, output_hidden_states=True)
TypeError: AlbertModel object argument after ** must be a mapping, not Tensor
Train: 0%| | 0/646 [00:00<?, ?it/s]
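
For reference, the error indicates that the `text` reaching self.albert(**text, ...) is a plain tensor, whereas a HuggingFace model unpacked with ** expects the tokenizer's mapping output (input_ids, attention_mask, ...). A minimal sketch of the expected input format, with the checkpoint name being an assumption:

# Sketch of the expected text input: a dict-like BatchEncoding, not a raw tensor.
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")  # checkpoint name assumed
model = AlbertModel.from_pretrained("albert-base-v2")

sentences = ["I am so happy today", "this is frustrating"]
text = tokenizer(sentences, padding=True, truncation=True,
                 max_length=100, return_tensors="pt")  # a mapping, so **text works

outputs = model(**text, output_hidden_states=True)
cls_feature = outputs.last_hidden_state[:, 0]  # features at the [CLS] position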

can't reproduce the paper's performance

Hi, Thank you very much for the code you shared.

I tried to run the code with the following arguments:

python main.py -lr=5e-5 -ep=40 -mod=tav -bs=8 --img-interval=500 --early-stop=6 --loss=bce --cuda=3 --model=mme2e_sparse --num-emotions=6 --trans-dim=64 --trans-nlayers=4 --trans-nheads=4 --text-lr-factor=10 -st=0.7 --text-model-size=base --text-max-len=100 --dataset="iemocap"

The paper says that we should get an accuracy of 84.4%, but the best accuracy that I am getting is 83.0% on the IEMOCAP dataset. Could you please post the exact arguments to reproduce the results of the paper? How can I get the paper's results?

(I think "p" equals "st" argument. So I set "st=0.7")

ACC            ang    exc    fru    hap    neu    sad    average
paper (p=0.7)  88.2   88.3   74.9   89.5   77.0   88.6   84.4
mine  (p=0.7)  88.8   80.6   77.9   90.4   74.5   86.0   83.0
mine  (p=0.9)  88.3   87.5   76.7   90.2   72.1   90.6   84.2

best regards,
Juyeon Kim

OneDrive password

Dear professor, I want to download the IEMOCAP dataset that you shared on OneDrive. It seems to need a password; could you send it to me? Thank you.

Questions about Evaluation

There are two functions (eval_mosei_emo, eval_iemocap) in '/utils/evaluate.py'. My question is: what is the difference between these two functions? I see that both use weighted accuracy, but in the paper you report Acc and WAcc separately for both the IEMOCAP and MOSEI datasets. In the code you provided, do you use eval_iemocap by default for both datasets as long as the loss is BCE?
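
For reference, one common definition of weighted accuracy in this line of work is balanced accuracy over the positive and negative class of each emotion; whether this exactly matches the repository's eval_iemocap / eval_mosei_emo is an assumption:

# One common (assumed) definition of per-emotion weighted accuracy.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """y_true, y_pred: binary {0, 1} arrays for a single emotion class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    p = max(np.sum(y_true == 1), 1)  # avoid division by zero
    n = max(np.sum(y_true == 0), 1)
    return 0.5 * (tp / p + tn / n)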
