Hand gestures are one of the most natural ways humans express their thoughts. They have potential applications in interfaces for Virtual Reality and Augmented Reality, as well as in sign language recognition. With the successful application of deep learning models to image analysis tasks such as image classification and object detection, it has become possible to recognize hand gestures with deep learning models. Dynamic hand gesture recognition, a branch of the video classification problem, remains challenging in many ways.
- In Fall 2019, we investigated how we could optimize gesture recognition methods using attention mechanisms.
- We hypothesized that attention mechanisms optimize the training of deep neural networks for continuous gesture recognition from video.
To test this hypothesis, we designed a framework consisting of a 2D-CNN and a BiLSTM with attention (Figs. 1-2).
Fig. 1. Overall Architecture
Fig. 2. BiLSTM with Attention (Att.BiLSTM)
In Fig. 2, X_t is the embedding vector extracted by the CNN and W_t is the set of BiLSTM output vectors, where t indexes the time frame. From X_t and W_t, we compute the attention-augmented BiLSTM vectors using Eqs. (1) and (2).
Eq. (1) yields the attention weights A_t. In Eq. (2), the attention weights A_t are combined with W_t to obtain X'_t, which is fed to the softmax layer.
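The role of the attention weights can be illustrated with a small framework-free sketch. Eqs. (1) and (2) are not reproduced in this text, so the code assumes one common formulation: a score per frame from a learned vector, a softmax over time to get A_t, and a weighted sum of the BiLSTM outputs. The names `attention_pool` and `query` are hypothetical and are not taken from this repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, query):
    """Collapse per-frame BiLSTM outputs into one attended vector.

    outputs: list of T vectors W_t (one per frame), each a list of d floats
    query:   hypothetical learned vector of d floats (an assumption)
    """
    # One scalar relevance score per frame: query . tanh(W_t)
    scores = [
        sum(q * math.tanh(w) for q, w in zip(query, w_t))
        for w_t in outputs
    ]
    # Attention weights A_t, which sum to 1 over the T frames
    weights = softmax(scores)
    dim = len(outputs[0])
    # Weighted sum over time (assumed form of the attended vector)
    pooled = [
        sum(a * w_t[i] for a, w_t in zip(weights, outputs))
        for i in range(dim)
    ]
    return weights, pooled


# Toy input: three frames with two features each
outputs = [[0.1, 0.0], [2.0, 1.5], [0.2, -0.1]]
query = [1.0, 1.0]
weights, pooled = attention_pool(outputs, query)
```

In this toy input the second frame has the largest activations, so it receives the largest attention weight and dominates the pooled vector. In the actual model the same steps would be tensor operations inside the network, with the scoring parameters learned during training.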
- We conducted this experiment on the 20BN-Jester Dataset V1, which consists of 27 gesture classes. We used 118,562 videos for training and 14,787 videos for validation.
- We trained two models: Baseline (2D-CNN + BiLSTM without attention) and our AttBiLSTM (2D-CNN + BiLSTM with attention). For the 2D-CNN, we used a pre-trained ResNet18.
- The BiLSTM hyperparameters were an embedding size of 128, one layer, and a hidden size of 256. All models were trained with a batch size of 64. The input image size was 112x112x3.
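As a rough check of how these settings fit together, the sketch below traces tensor shapes through the pipeline. It is an illustration under assumptions, not the repository's code: `Config`, `trace_shapes`, and `frames_per_clip` are hypothetical names, the clip length of 16 is a placeholder because the text does not state it, and the CNN features are assumed to be projected to the 128-dimensional embedding before the BiLSTM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    batch_size: int = 64
    image_size: tuple = (112, 112, 3)  # height, width, channels
    embedding_size: int = 128          # per-frame CNN embedding fed to the BiLSTM
    lstm_hidden_size: int = 256
    lstm_layers: int = 1
    num_classes: int = 27              # gesture classes in the dataset
    frames_per_clip: int = 16          # assumption: not stated in the text


def trace_shapes(cfg):
    """Return the expected tensor shape at each stage of the pipeline."""
    height, width, channels = cfg.image_size
    batch, frames = cfg.batch_size, cfg.frames_per_clip
    return {
        # Channels-first frames, as PyTorch image models expect
        "video_batch": (batch, frames, channels, height, width),
        "cnn_embeddings": (batch, frames, cfg.embedding_size),
        # A bidirectional LSTM concatenates forward and backward states
        "bilstm_outputs": (batch, frames, 2 * cfg.lstm_hidden_size),
        "class_scores": (batch, cfg.num_classes),
    }


shapes = trace_shapes(Config())
for stage, shape in shapes.items():
    print(f"{stage}: {shape}")
```

The point to notice is that a hidden size of 256 gives 512-dimensional BiLSTM outputs per frame, which is what any attention layer and the final classifier would operate on.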
Table 1. Accuracy scores for 27-class gesture classification (at epoch 50)
Model | Training Acc. (%) | Validation Acc. (%) |
---|---|---|
Baseline | 37.15 | 28.10 |
AttBiLSTM | 76.15 | 64.12 |
Fig. 3. Training and validation accuracies during 50 epochs
Attention substantially improves model accuracy: at epoch 50, AttBiLSTM reaches 64.12 validation accuracy versus 28.10 for the Baseline (Table 1). Simply adding the attention equations made training far more effective, yielding much better gesture recognition accuracy even with this simple architecture.
- We used 2D CNNs even though the input data is 3D video, because we did not have enough resources to train our model on such a large video dataset.
- Even so, we show that the attention mechanism significantly improves the performance of gesture recognition models.
- Edit opts.py per your data.
- Run main_normal_attention.py for "AttBiLSTM".
- Run main_non_attention.py for "Baseline".
- Jihye Moon and Dr. Chen created this code for the "CSE 5095 Advances in Deep Learning" class project in Dec. 2019.
- The class instructor was Dr. Ding. Many thanks to him for his great teaching!
- We adapted opts.py and some code from 3D-ResNets-PyTorch to build the ResNet modules for the video frames.
- If you have any questions, please feel free to contact me at [email protected]!