smil-spcras / emolips

License: GNU General Public License v3.0

Topics: automatic-speech-recognition, emotional-speech, lip-reading, visual-speech-recognition


EMOLIPS: Two-Level Approach for Emotional Speech Lip-Reading

[Demo videos: NEU, 6-Emotions, Valence, and Binary strategies]

We propose a two-level approach for emotional speech recognition based on visual speech data processing (EMOLIPS). At the first level, we recognize an emotion class or valence as a basis for further analysis. At the second level, we apply one of three emotional lip-reading strategies: (1) 6-Emotions, (2) Valence, and (3) Binary (emotional vs. neutral data). The approach leverages recent advances in deep learning: we use a 2DCNN-LSTM architecture for facial emotion recognition and a 3DCNN-BiLSTM architecture for phrase recognition from lip movements.
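Conceptually, the first-level prediction selects which second-level model is used. A minimal sketch of this routing logic for the Valence strategy is shown below; the helper names, the callable model wrappers, and the layout of the emotion-to-valence mapping are illustrative assumptions, not the repository's actual API:

```python
# Hypothetical sketch of the two-level EMOLIPS decision logic (Valence strategy).
# The model wrappers below are placeholders: any object callable on a clip of
# frames and returning a label would do. They are NOT the repository's classes.

# The first level groups the six CREMA-D emotion labels into three valences.
VALENCE_OF = {
    "anger": "negative", "disgust": "negative", "fear": "negative", "sad": "negative",
    "neutral": "neutral",
    "happy": "positive",
}

def two_level_predict(face_frames, lip_frames, emotion_model, lip_models):
    """emotion_model: EMO-2DCNN-LSTM-like classifier returning an emotion label.
    lip_models: dict mapping each valence to a LIP-3DCNN-BiLSTM-like phrase classifier."""
    emotion = emotion_model(face_frames)       # level 1: emotion recognition on the face
    valence = VALENCE_OF[emotion]              # map the emotion to its valence
    phrase = lip_models[valence](lip_frames)   # level 2: valence-specific lip-reading
    return emotion, valence, phrase
```

For the 6-Emotions strategy the dictionary key would simply be the emotion itself, and for the Binary strategy everything except "neutral" would map to a single "emotional" model.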

We conducted our experiments on the emotional CREMA-D corpus (Cao et al., 2014), which contains 12 scripted phrases uttered with different emotions: anger, disgust, fear, happy, neutral, and sad.

Table 1 compares the phrase recognition accuracy of our approach (three strategies) with the Baseline (a model trained only on neutral phrases). These results were obtained using our emotional model (EMO-2DCNN-LSTM). Our experiments show that using three models trained on phrases spoken with different valences (negative, neutral, and positive) yields a gain in accuracy of 6% over a model trained only on neutral valence. Thus, the phrase recognition accuracy on the CREMA-D corpus with the two-level Valence strategy was 90.2%. At the same time, using six emotional models, each trained on phrases spoken with one of the six emotions, achieves an accuracy of 86.8% (see Table 1, Acc = 90.2% versus Acc = 86.8% versus Acc = 83.4%).

Table 1. Comparison of the phrase recognition accuracy of our approach (three strategies) with the Baseline (model trained only on NE phrases). Accuracy (Acc) shows phrase recognition performance without taking emotions/valence into account.

| Metric | Baseline | 6-Emotions | Valence | Binary |
|--------|----------|------------|---------|--------|
| Acc    | 83.4     | 86.8       | 90.2    | 90.1   |

This result indicates that combining four emotions (anger, disgust, fear, sad) into one negative valence increases recognition accuracy, because within this valence the facial features in the lip area differ significantly from those of the other two valences, which simplifies phrase recognition. This is also confirmed by the result of combining the two opposite valences (negative and positive) into one class: that merge decreases recognition accuracy by about 1% (see Table 2, mAcc = 92.7% versus mAcc = 91.3%).

Table 2. Comparison of the phrase recognition accuracy of our approach (three strategies) with the Baseline (model trained only on NE phrases). These results were obtained under the assumption that the emotional model makes its predictions with high accuracy. The mean accuracy (mAcc) assesses phrase recognition within each emotion/valence class; a small computation sketch is given after the table.

| Metric | Baseline | 6-Emotions | Valence | Binary |
|--------|----------|------------|---------|--------|
| mAcc   | 83.6     | 90.7       | 92.7    | 91.3   |
| Acc    | 83.4     | 90.7       | 91.6    | 90.8   |
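To make the distinction between the two metrics concrete, here is a small sketch of how they could be computed; the actual evaluation code in the repository may differ, and the sample format is an assumption:

```python
# Illustrative computation of Acc and mAcc as described in the table captions.
# Acc: overall phrase recognition accuracy over all samples.
# mAcc: mean of the per-class (emotion or valence) phrase recognition accuracies.
from collections import defaultdict

def acc_and_macc(samples):
    """samples: iterable of (true_phrase, predicted_phrase, emotion_or_valence)."""
    correct, total = 0, 0
    per_class = defaultdict(lambda: [0, 0])          # class -> [correct, total]
    for true_phrase, pred_phrase, cls in samples:
        hit = int(true_phrase == pred_phrase)
        correct += hit
        total += 1
        per_class[cls][0] += hit
        per_class[cls][1] += 1
    acc = correct / total
    macc = sum(c / t for c, t in per_class.values()) / len(per_class)
    return acc, macc
```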

In this GitHub repository we release for public use (scientific purposes only) the EMO-2DCNN-LSTM model and the 8 LIP-3DCNN-BiLSTM models obtained as a result of our experiments.

To train new lip-reading models, you should get acquainted with the file train.py.
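The exact architecture and hyperparameters are defined in the repository code; purely as an orientation, a LIP-3DCNN-BiLSTM phrase classifier of the kind described above might be sketched in Keras as follows (the use of Keras, the input shape, the layer sizes, and the 12-phrase output are assumptions for illustration, not the released configuration):

```python
# Rough Keras sketch of a 3DCNN-BiLSTM lip-reading classifier.
# All shapes and layer sizes are illustrative; see train.py for the real setup.
from tensorflow.keras import layers, models

def build_lip_model(frames=30, height=88, width=88, channels=3, num_phrases=12):
    inputs = layers.Input(shape=(frames, height, width, channels))
    x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)   # pool only spatially, keep time
    x = layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    # Collapse the spatial dimensions per frame; the temporal axis feeds the BiLSTM.
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    outputs = layers.Dense(num_phrases, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_lip_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```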

To predict emotions and phrases for all videos in your folder, you should run the command `python run.py --path_video video/ --path_save report/`.
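Conceptually, such a script iterates over the videos in the given folder, runs the two-level prediction on each, and writes a report. A minimal sketch of that loop is shown below; the file extension, the report format, and the `predict_video` callable are assumptions, not the actual behaviour of run.py:

```python
# Hypothetical batch-prediction loop over a folder of videos, writing a CSV report.
# This only illustrates the idea behind `run.py --path_video video/ --path_save report/`;
# the real script's inputs, outputs, and file formats may differ.
import csv
from pathlib import Path

def batch_predict(path_video: str, path_save: str, predict_video) -> None:
    Path(path_save).mkdir(parents=True, exist_ok=True)
    rows = []
    for video in sorted(Path(path_video).glob("*.mp4")):   # assumed extension
        emotion, valence, phrase = predict_video(video)     # placeholder callable
        rows.append({"video": video.name, "emotion": emotion,
                     "valence": valence, "phrase": phrase})
    with open(Path(path_save) / "report.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video", "emotion", "valence", "phrase"])
        writer.writeheader()
        writer.writerows(rows)
```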

To produce a new video file with visualization of the emotion and phrase predictions, you should run the command `python visualization.py`. Examples of test videos are provided in the repository.

Citation

If you use EMOLIPS in your research, please consider citing the following works. Here are example BibTeX entries:

@article{RYUMIN2023MDPI,
  title         = {EMOLIPS: Two-Level Approach for Emotional Speech Lip-Reading},
  author        = {Dmitry Ryumin and Elena Ryumina and Denis Ivanko},
  journal       = {Mathematics},
  year          = {2023},
}
@inproceedings{RYUMINA2022SPECOM,
  title         = {Emotional speech recognition based on lip-reading},
  author        = {Elena Ryumina and Denis Ivanko},
  booktitle     = {24th International Conference on Speech and Computer (SPECOM), Lecture Notes in Computer Science},
  volume        = {13721},
  pages         = {616-625},
  year          = {2022},
  doi           = {10.1007/978-3-031-20980-2_52},
}
@article{RYUMINA2022,
  title         = {In Search of a Robust Facial Expressions Recognition Model: A Large-Scale Visual Cross-Corpus Study},
  author        = {Elena Ryumina and Denis Dresvyanskiy and Alexey Karpov},
  journal       = {Neurocomputing},
  year          = {2022},
  volume        = {514},
  pages         = {435-450},
  doi           = {10.1016/j.neucom.2022.10.013},
  url           = {https://www.sciencedirect.com/science/article/pii/S0925231222012656},
}


Contributors

dmitryryumin, elenaryumina


Issues

Data Sets for CREMA-D

Hi,

My name is David and, first of all, congratulations on your good work. It is really interesting.

I am also working on lip-reading and I would like to do a similar case study for my PhD thesis. However, even building on your code, I am not able to obtain the same training, validation, and test sets that you describe in your paper. For this reason, I wanted to ask whether it would be possible to obtain three CSVs indicating just the video IDs (e.g. "1076_MTI_SAD_XX") for each subset. These IDs would be enough, since I can retrieve the rest of the information using my own scripts. It would be so helpful!

Thanks in advance. Best regards from Spain,

David
