
Lipreading using Temporal Convolutional Networks

I worked on this as part of my final-year project. The project uses the LRW dataset for single-word prediction over 500 classes. The model can be trained from scratch.

I used Windows with an Nvidia GeForce RTX 3070 GPU for training. A few notes for running the code on Windows are listed below, alongside the excellent documentation from the original authors.

  1. A conda environment is recommended, as some of the libraries conflict (or at least conflicted during my work) when using virtualenv.
  2. When providing paths, use "\\" rather than "\", or use "/".
  3. During preprocessing I had to reuse a couple of files twice because their .txt files were missing, and preprocessing does not work properly if there is a gap in the index sequence of the files.
  4. I changed "queue-length" from 30 to 25 because I wanted to test the existing test video files.
  5. There was a device conflict when running the prediction file, so changing the default device to 'cpu' solved the problem (see the sketch below).
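A minimal sketch of the device change from note 5, assuming the prediction script picks its device with a line like the one below (the variable name device is illustrative, not the script's actual code):

import torch

# Assumption: the prediction script originally defaulted to the GPU, e.g.
# device = torch.device('cuda')
# Forcing CPU avoided the device conflict I hit on Windows:
device = torch.device('cpu')

# Alternatively, fall back to CPU only when no GPU is available:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# model.to(device)  # move the loaded model to the chosen device before inference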

Content

Deep Lipreading

Model Zoo

Citation

License

Contact

Deep Lipreading

How to install environment

  1. Clone the repository into a directory. We refer to that directory as TCN_LIPREADING_ROOT.
git clone --recursive https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks.git
  2. Install all required packages.
pip install -r requirements.txt
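If you follow the conda recommendation from the notes above, a typical environment setup might look like this (the environment name and Python version are illustrative choices, not requirements of the repository):

conda create -n lipreading python=3.8
conda activate lipreading
pip install -r requirements.txt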

How to prepare dataset

  1. Download our pre-computed landmarks from GoogleDrive or BaiduDrive (key: kumy) and unzip them to $TCN_LIPREADING_ROOT/landmarks/ folder.

  2. Pre-process mouth ROIs using the script crop_mouth_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/visual_data/.

  3. Pre-process audio waveforms using the script extract_audio_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/audio_data/.

  4. Download a pre-trained model from Model Zoo and put the model into the $TCN_LIPREADING_ROOT/models/ folder.
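Related to note 3 above: before preprocessing, it can help to verify that no timestamp (.txt) files are missing from the annotation directory. A minimal sketch, assuming the standard LRW layout of <WORD>/<split>/<WORD>_<index>.mp4 with a matching .txt file per video (the directory path is a placeholder):

import os

annotation_dir = '<ANNONATION-DIRECTORY>'  # placeholder, as used in the commands below

missing = []
for word in sorted(os.listdir(annotation_dir)):
    for split in ('train', 'val', 'test'):
        split_dir = os.path.join(annotation_dir, word, split)
        if not os.path.isdir(split_dir):
            continue
        # Every video should have a matching timestamp file.
        for name in os.listdir(split_dir):
            if name.endswith('.mp4'):
                txt = os.path.splitext(name)[0] + '.txt'
                if not os.path.exists(os.path.join(split_dir, txt)):
                    missing.append(os.path.join(word, split, txt))

print(len(missing), 'missing annotation files')
for path in missing:
    print(path)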

How to train

  1. Train a visual-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY>
  2. Train an audio-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --modality raw_audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY>

We refer to the original LRW directory that contains the timestamp (.txt) files as <ANNONATION-DIRECTORY>.

  3. Resume from last checkpoint.

You can pass the checkpoint path (.pth.tar) <CHECKPOINT-PATH> to the --model-path argument and set --init-epoch to 1 to resume training.
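For instance, resuming a visual-only run from a saved checkpoint might look like the following (all paths are placeholders, as in the commands above):

CUDA_VISIBLE_DEVICES=0 python main.py --config-path <MODEL-JSON-PATH> \
                                      --model-path <CHECKPOINT-PATH> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --init-epoch 1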

How to test

  1. Evaluate the visual-only performance (lipreading).
CUDA_VISIBLE_DEVICES=0 python main.py --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --test
  2. Evaluate the audio-only performance.
CUDA_VISIBLE_DEVICES=0 python main.py --modality raw_audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY> \
                                      --test

How to extract embeddings

We assume you have cropped the mouth patches and saved them to <MOUTH-PATCH-PATH>. The mouth embeddings will be saved in .npz format.

  • To extract 512-D feature embeddings from the top of ResNet-18:
CUDA_VISIBLE_DEVICES=0 python main.py --extract-feats \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --mouth-patch-path <MOUTH-PATCH-PATH> \
                                      --mouth-embedding-out-path <OUTPUT-PATH>
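Once extraction finishes, the saved .npz archive can be inspected with NumPy. A minimal sketch (the output filename is whatever you passed as <OUTPUT-PATH>; the name below is illustrative):

import numpy as np

# Load the archive written by the extraction command above (illustrative filename).
data = np.load('mouth_embeddings.npz')

# Print each stored array and its shape; the feature dimension should be 512
# for embeddings taken from the top of ResNet-18.
for key in data.files:
    print(key, data[key].shape)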

Model Zoo

We plan to include more models in the future. We use a sequence of 29 frames with a size of 88 by 88 pixels to compute the FLOPs.

Architecture                Acc. (%)   FLOPs (G)   URL                                     Size (MB)
Audio-only
resnet18_mstcn(adamw)       98.9       3.72        GoogleDrive or BaiduDrive (key: xt66)   111
resnet18_mstcn              98.5       3.72        GoogleDrive or BaiduDrive (key: 3n25)   111
Visual-only
resnet18_mstcn(adamw_s3)    87.9       10.31       GoogleDrive or BaiduDrive (key: j5tw)   139
resnet18_mstcn              85.5       10.31       GoogleDrive or BaiduDrive (key: um1q)   139
snv1x_tcn2x                 84.6       1.31        GoogleDrive or BaiduDrive (key: f79d)   35
snv1x_dsmstcn3x             85.3       1.26        GoogleDrive or BaiduDrive (key: 86s4)   36
snv1x_tcn1x                 82.7       1.12        GoogleDrive or BaiduDrive (key: 3caa)   15
snv05x_tcn2x                82.5       1.02        GoogleDrive or BaiduDrive (key: ej9e)   32
snv05x_tcn1x                79.9       0.58        GoogleDrive or BaiduDrive (key: devg)   11

Citation

If you find this code useful in your research, please consider citing the following papers:

@INPROCEEDINGS{ma2020towards,
  author={Ma, Pingchuan and Martinez, Brais and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Towards Practical Lipreading with Distilled and Efficient Models},
  year={2021},
  pages={7608-7612},
  doi={10.1109/ICASSP39728.2021.9415063}
}

@INPROCEEDINGS{martinez2020lipreading,
  author={Martinez, Brais and Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Lipreading Using Temporal Convolutional Networks},
  year={2020},
  pages={6319-6323},
  doi={10.1109/ICASSP40776.2020.9053841}
}

License

Note that the code may only be used for comparative or benchmarking purposes. The code is supplied under the License for non-commercial purposes only.

Contact

Pingchuan Ma (pingchuan.ma16[at]imperial.ac.uk)
