This project is forked from smeetrs/deep_avsr.

A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.

License: MIT License

Deep Audio-Visual Speech Recognition

The repository contains a PyTorch reproduction of the Deep Audio-Visual Speech Recognition paper. We train three models, Audio-Only (AO), Video-Only (VO), and Audio-Visual (AV), on the LRS2 dataset for the speech-to-text transcription task.

Requirements

System packages:

ffmpeg==2.8.15
python==3.6.9

Python packages:

editdistance==0.5.3
matplotlib==3.1.1
numpy==1.18.1
opencv-python==4.2.0
pytorch==1.2.0
scipy==1.3.1
tqdm==4.42.1

CUDA 10.0 (if NVIDIA GPU is to be used):

cudatoolkit==10.0
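
A possible way to set up a matching environment, assuming conda is used for PyTorch/CUDA and pip for the remaining packages; exact build and patch numbers may need adjusting for your platform:

    conda install pytorch==1.2.0 cudatoolkit=10.0 -c pytorch
    pip install editdistance==0.5.3 matplotlib==3.1.1 numpy==1.18.1 opencv-python==4.2.0.* scipy==1.3.1 tqdm==4.42.1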

Project Structure

The structure of the audio_only, video_only and audio_visual directories is as follows:

Directories

/checkpoints: Temporary directory to store intermediate model weights and plots while training. Gets automatically created.

/data: Directory containing the LRS2 Main and Pretrain dataset class definitions and other required data-related utility functions.

/final: Directory to store the final trained model weights and plots. If available, place the pre-trained model weights in the models subdirectory.

/models: Directory containing the class definitions for the models.

/utils: Directory containing function definitions for calculating CER/WER, greedy search/beam search decoders, and preprocessing of data samples. Also contains functions to train and evaluate the model (an illustrative greedy-decoding sketch follows this Directories list).
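
For illustration, a minimal greedy (best-path) CTC decoder of the kind such a utility module typically provides, written in plain NumPy. The function name and the blank-index convention here are assumptions for this sketch, not the repository's actual API:

    import numpy as np

    def ctc_greedy_decode(log_probs, blank=0):
        """Best-path CTC decoding: take the most likely class per frame,
        collapse consecutive repeats, then drop blank tokens."""
        best_path = np.argmax(log_probs, axis=1)      # most likely class per frame
        decoded, previous = [], blank
        for token in best_path:
            if token != previous and token != blank:  # collapse repeats, skip blanks
                decoded.append(int(token))
            previous = token
        return decoded

    # toy example: 4 frames, 3 classes, class 0 treated as the CTC blank
    toy = np.log(np.array([[0.6, 0.3, 0.1],
                           [0.2, 0.7, 0.1],
                           [0.2, 0.7, 0.1],
                           [0.1, 0.2, 0.7]]))
    print(ctc_greedy_decode(toy))                     # -> [1, 2]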

Files

checker.py: File containing checker/debug functions for testing all the modules and functions in the project, as well as any other checks to be performed.

config.py: File to set the configuration options and hyperparameter values. An illustrative sketch of such options is given at the end of this Files section.

demo.py: Python script for generating predictions with the specified trained model for all the data samples in the specified demo directory.

preprocess.py: Python script for preprocessing all the data samples in the dataset.

pretrain.py: Python script for pretraining the model on the pretrain set of the LRS2 dataset using curriculum learning.

test.py: Python script to test the trained model on the test set of the LRS2 dataset.

train.py: Python script to train the model on the train set of the LRS2 dataset.
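
As an illustration of the kind of options config.py holds, here is a hedged sketch; only PRETRAIN_NUM_WORDS is taken from the training steps in this README, while the other names and values are placeholders that may not match the actual file:

    # Illustrative configuration sketch only; actual option names and defaults
    # in config.py may differ. PRETRAIN_NUM_WORDS is the option referenced in
    # the training steps below.
    args = dict()
    args["DATA_DIRECTORY"] = "/path/to/LRS2"  # placeholder: dataset location
    args["PRETRAIN_NUM_WORDS"] = 1            # words per sample for curriculum learning
    args["BATCH_SIZE"] = 32                   # placeholder training hyperparameter
    args["LEARNING_RATE"] = 1e-4              # placeholder training hyperparameter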

Some results

We report the Word Error Rate (WER) achieved by the models on the test set of the LRS2 dataset with both Greedy Search and Beam Search (with a language model) decoding. Results are given for clean audio and noisy audio (0 dB SNR), as well as for cases where only one of the modalities is used in the Audio-Visual model. A sketch of how WER/CER can be computed is given after the table.

Operation Mode          AO/VO Model                 AV Model
                        Greedy      Beam (+LM)      Greedy      Beam (+LM)
Clean Audio
    AO                  11.4%       8.3%            12.0%       8.2%
    VO                  61.8%       55.3%           56.3%       49.2%
    AV                  -           -               10.3%       6.8%
Noisy Audio (0 dB SNR)
    AO                  62.5%       54.0%           59.0%       50.7%
    AV                  -           -               29.1%       22.1%
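
The WER values above are word-level edit distances normalised by the reference length; a minimal sketch of WER/CER using the editdistance package listed in the requirements (the repository's own utility functions may differ in detail):

    import editdistance

    def wer(reference, hypothesis):
        """Word Error Rate: word-level edit distance / number of reference words."""
        ref_words, hyp_words = reference.split(), hypothesis.split()
        return editdistance.eval(ref_words, hyp_words) / max(len(ref_words), 1)

    def cer(reference, hypothesis):
        """Character Error Rate: character-level edit distance / reference length."""
        return editdistance.eval(reference, hypothesis) / max(len(reference), 1)

    print(wer("the cat sat", "the cat sat down"))  # -> 0.333...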

How To Use

If you plan to train the models, download the complete LRS2 dataset from here; for custom datasets, keep the specifications and folder structure similar to those of the LRS2 dataset (a rough layout sketch is given below).
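
As a rough guide, the publicly distributed LRS2 dataset is laid out approximately as sketched below; verify the exact structure against the dataset's own documentation before relying on this:

    LRS2/
        main/          # train/val/test clips: <id>/<clip>.mp4 with a matching <clip>.txt transcript
        pretrain/      # longer pretraining clips, same <id>/<clip>.mp4 + .txt layout
        pretrain.txt   # split lists naming the samples in each set
        train.txt
        val.txt
        test.txt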

Steps are provided below to either train the models or to use the trained models directly for inference; an example command sequence is sketched after the training steps.

Training

Set the configuration options in the config.py file before each of the following steps as required. Comments have been provided for each option. Also, check the Training Details section below as a guide for training the models from scratch.

  1. Run the preprocess.py script to preprocess and generate the required files for each sample.

  2. Run the pretrain.py script for one iteration of curriculum learning. Repeat this, changing the PRETRAIN_NUM_WORDS option in the config.py file each time, to perform multiple iterations of curriculum learning.

  3. Run the train.py script to finally train the model on the train set.

  4. Once the model is trained, run the test.py script to obtain the performance of the trained model on the test set.

  5. Run the demo.py script to use the model to make predictions for each sample in a demo directory. Read the specifications for the sample in the demo.py file.
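
Putting the steps together, a typical run might look like the command sequence below, with config.py edited between invocations as described above; the number of pretraining iterations and the PRETRAIN_NUM_WORDS schedule are left to you:

    python preprocess.py   # step 1: preprocess all samples
    python pretrain.py     # step 2: one curriculum-learning iteration (edit PRETRAIN_NUM_WORDS and repeat)
    python train.py        # step 3: train on the LRS2 train set
    python test.py         # step 4: evaluate on the LRS2 test set
    python demo.py         # step 5: generate predictions for the demo directory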
