gesture_recognition_attbilstm

Introduction

Hand gestures are one of the most natural ways for humans to express their thoughts. They have potential applications in interfaces for Virtual Reality and Augmented Reality, as well as in sign language recognition. With the successful application of deep learning models to image analysis tasks such as image classification and object detection, it has become possible to recognize hand gestures with deep learning models. Dynamic hand gesture recognition, a branch of the video classification problem, remains challenging in many ways.

Motivation

  • In Fall 2019, we investigated how we could optimize gesture recognition methods using attention mechanisms.
  • We hypothesized that attention mechanisms optimize the training of deep neural networks for continuous gesture recognition from video.

Proposed Framework

To validate this hypothesis, we designed a framework consisting of a 2D-CNN and a BiLSTM with attention (Figs. 1-2).

Fig. 1. Overall Architecture

Fig. 2. BiLSTM with Attention (Att.BiLSTM)

In Fig. 2, the BiLSTM includes X_t, which represents the embedding vector extracted by the CNN, and W_t, which represents the Bi-LSTM weights, where t is the time frame. From X_t and W_t, we compute the attention-added BiLSTM vectors using Eqs. (1) and (2).

In Eq. (1), W_t is the set of BiLSTM output vectors and X_t is the embedding feature extracted by the CNN. Using Eq. (1), we obtain the attention weights A_t. In Eq. (2), the attention weight A_t is added to the BiLSTM's weights. We then obtain X'_t, which is fed to the softmax layer.
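Eqs. (1) and (2) are not reproduced in this text, so the following is only a minimal sketch of one common way to apply attention over per-frame BiLSTM outputs: score each frame, normalize the scores with a softmax to get weights A_t, and take the weighted sum. It is not necessarily the formulation used in this repository. The names `attention_pool` and `query` and the toy vectors are hypothetical and exist only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    outputs: Sequence[Sequence[float]],  # one BiLSTM output vector per frame t
    query: Sequence[float],              # hypothetical learned scoring vector
) -> Tuple[List[float], List[float]]:
    """Return (attention weights A_t, attention-weighted summary vector)."""
    # Score each frame with a dot product against the scoring vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in outputs]
    # Normalize the scores over time so the weights sum to 1.
    weights = softmax(scores)
    # Weighted sum of the frame vectors, one output dimension at a time.
    dim = len(outputs[0])
    pooled = [sum(a * h[i] for a, h in zip(weights, outputs)) for i in range(dim)]
    return weights, pooled


# Toy input (assumption): 3 frames with 4-dimensional outputs.
frames = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.9],
    [0.2, 0.1, 0.0, 0.3],
]
query = [1.0, 1.0, 1.0, 1.0]

weights, pooled = attention_pool(frames, query)
print("A_t:", [round(a, 3) for a in weights])
print("pooled:", [round(v, 3) for v in pooled])
```

In this toy input the second frame has the largest activations, so it receives the largest weight and dominates the pooled vector. In a trained model the scoring vector would be learned jointly with the rest of the network.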

Dataset and model training

  • We conducted this experiment using the 20BN-Jester Dataset V1, which consists of 27 gesture classes. A total of 118,562 videos were used for the training set and 14,787 videos for the validation set.

  • We trained two models: Baseline (2D-CNN + BiLSTM without attention) and our AttBiLSTM (2D-CNN + BiLSTM with attention). For the 2D-CNN, we used a pre-trained 2D-ResNet18.

  • The Bi-LSTM hyperparameters were an embedding size of 128, one layer, and a hidden layer size of 256. All models were trained with a batch size of 64. The input image size was 112x112x3.
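As a rough aid for setting up a comparable run, the sketch below collects the stated hyperparameters and derives the number of batches per epoch and the tensor shapes one would expect at each stage. Several things are assumptions rather than facts from this repository: the names `TrainConfig`, `batches_per_epoch`, and `tensor_shapes` are hypothetical, the number of frames sampled per video is not stated above (16 is a placeholder), the channel-first layout is a guess, and the BiLSTM is assumed to concatenate its forward and backward states.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Values stated in the text above.
    num_classes: int = 27
    embedding_size: int = 128
    lstm_layers: int = 1
    hidden_size: int = 256
    batch_size: int = 64
    image_shape: Tuple[int, int, int] = (112, 112, 3)  # height, width, channels
    train_videos: int = 118_562
    val_videos: int = 14_787
    # Assumption: frames per video is not given in the text; 16 is a placeholder.
    frames_per_video: int = 16


def batches_per_epoch(num_videos: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_videos / batch_size)


def tensor_shapes(cfg: TrainConfig) -> Dict[str, Tuple[int, ...]]:
    """Expected shapes at each stage, assuming channel-first frames."""
    height, width, channels = cfg.image_shape
    b, t = cfg.batch_size, cfg.frames_per_video
    return {
        "frames": (b, t, channels, height, width),
        "cnn_embedding": (b, t, cfg.embedding_size),
        # Assumption: forward and backward hidden states are concatenated.
        "bilstm_output": (b, t, 2 * cfg.hidden_size),
        "logits": (b, cfg.num_classes),
    }


cfg = TrainConfig()
train_batches = batches_per_epoch(cfg.train_videos, cfg.batch_size)
val_batches = batches_per_epoch(cfg.val_videos, cfg.batch_size)
shapes = tensor_shapes(cfg)

print("train batches per epoch:", train_batches)
print("validation batches per epoch:", val_batches)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

With a batch size of 64, the 118,562 training videos give 1,853 batches per epoch (the last one holding 34 videos), and the 14,787 validation videos give 232 batches.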

Results

Table 1. Accuracy (%) for 27-class gesture classification (at epoch 50)

Model       Training Acc.   Validation Acc.
Baseline    37.15           28.10
AttBiLSTM   76.15           64.12

Fig. 3. Training and validation accuracies during 50 epochs

We demonstrate that attention substantially improves model accuracy: at epoch 50, validation accuracy rises from 28.10 for the Baseline to 64.12 for AttBiLSTM. Simply adding the attention equations to the model made training far more effective, yielding much better gesture recognition accuracy even with a simple architecture.
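The size of the improvement can be read directly from Table 1. The short sketch below only redoes that arithmetic: the absolute gain in percentage points, the relative gain on the validation set, and the gap between training and validation accuracy for each model. The helper names `gain` and `gap` are hypothetical.

```python
# Accuracy values (%) copied from Table 1.
results = {
    "Baseline": {"train": 37.15, "val": 28.10},
    "AttBiLSTM": {"train": 76.15, "val": 64.12},
}


def gain(split: str) -> float:
    """Absolute gain of AttBiLSTM over Baseline, in percentage points."""
    return round(results["AttBiLSTM"][split] - results["Baseline"][split], 2)


def gap(model: str) -> float:
    """Training minus validation accuracy, in percentage points."""
    return round(results[model]["train"] - results[model]["val"], 2)


val_ratio = round(results["AttBiLSTM"]["val"] / results["Baseline"]["val"], 2)

print("training gain (points):", gain("train"))
print("validation gain (points):", gain("val"))
print("validation ratio:", val_ratio)
print("train/val gap, Baseline:", gap("Baseline"))
print("train/val gap, AttBiLSTM:", gap("AttBiLSTM"))
```

Attention adds about 36 points of validation accuracy, roughly 2.3 times the Baseline. The training/validation gap is also somewhat wider for AttBiLSTM (about 12 points versus 9).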

Limitations & Insights

  • We used 2D CNNs even though the input data was 3D video, because we did not have enough resources to train a model on such a large video dataset.
  • Even so, we show that the attention mechanism substantially improves the performance of gesture recognition models.

Code Usage

  • Edit opts.py per your data.
  • Run main_normal_attention.py for "AttBiLSTM".
  • Run main_non_attention.py for "Baseline".

Acknowledgement

  • Jihye Moon and Dr. Chen created this code for the "CSE 5095 Advances in Deep Learning" class project in Dec. 2019.
    • The class instructor was Dr. Ding. Many thanks for his great teaching!
  • We adapted opts.py and some code from 3D-ResNets-PyTorch to build the ResNet modules for the video frames.
  • If you have any questions, please feel free to contact me at [email protected]!
