End-to-End Models for Robust Speech Recognition

Requirements | Instructions | Experiments | Paper | Datasets

This repository contains the code for our upcoming paper An Investigation of End-to-End Models for Robust Speech Recognition at ICASSP 2021.

Introduction

End-to-end models for robust automatic speech recognition (ASR) have not been sufficiently explored in prior work. With end-to-end models, one option is to preprocess the input speech using speech enhancement techniques (such as SEVCAE, Deep Xi, and DEMUCS) and train the model on the enhanced speech. Another is to pass the noisy speech as input and modify the model architecture to adapt to it (via techniques such as data augmentation, multi-task learning, and adversarial training). In this work, we make a first attempt at a systematic comparison of these two approaches for end-to-end robust ASR.

Requirements

  • Docker: Version 19.03.1, build 74b1e89
  • nvidia-docker
  • apex==0.1
  • numpy==1.16.3
  • torch==1.1.0
  • tqdm==4.31.1
  • librosa==0.7.0
  • scipy==1.3.1

Instructions

  1. Clone deepspeech.pytorch and check out commit e73ccf6. This is the stable commit used in all our experiments.
  2. Build the Docker image from the Dockerfile provided in this directory, then run it via the bash entrypoint, using the commands below. This Dockerfile is the same as the one in your deepspeech.pytorch folder, except that the instructions in that folder's README.md have been modified.
sudo docker build -t deepspeech2.docker .
sudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --entrypoint=/bin/bash --net=host --ipc=host deepspeech2.docker
  3. Install all the requirements using pip install -r requirements.txt
  4. Clone this repository inside the Docker container, in the directory /workspace/, and install the remaining requirements.
  5. Download the LibriSpeech dataset (optional; used only for training) as well as our custom noise datasets.
  6. Preparing Manifests: The data used in deepspeech.pytorch must be listed in .csv files called manifests, with two columns: the path to a .wav file and the path to a .txt file. Each .wav file is a speech clip, and each .txt file contains its transcript in upper case. For LibriSpeech, use data/librispeech.py in deepspeech.pytorch. Similarly, manifests for the noisy speech in our test set can be prepared by retrieving the transcripts using the file IDs embedded in the names of the files in the noisy test set. The file names are in the format: [file ID]_[Noise Type]_[SNR]db.wav.
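The manifest-building step above can be sketched as a small script. This is an illustrative sketch, not code from the repository: the directory layout and the helper name build_manifest are assumptions, and it assumes each transcript lives in a .txt file named after the clean file ID.

```python
import csv
import re
from pathlib import Path

# Pattern for the noisy test file names: [file ID]_[Noise Type]_[SNR]db.wav
NAME_RE = re.compile(r"(?P<file_id>.+)_(?P<noise>[^_]+)_(?P<snr>-?\d+)db\.wav$")

def build_manifest(wav_dir, txt_dir, manifest_path):
    """Write a two-column manifest CSV: path to .wav file, path to .txt file."""
    rows = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        m = NAME_RE.match(wav.name)
        if m is None:
            continue
        # The transcript is looked up by the original (clean) file ID.
        txt = Path(txt_dir) / (m.group("file_id") + ".txt")
        rows.append([str(wav), str(txt)])
    with open(manifest_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

For example, a clip named 1089-134686-0000_babble_10db.wav would be paired with the transcript 1089-134686-0000.txt.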

Experiments

Here is an example of training the baseline system on LibriSpeech only. It shows the usage of the basic training arguments, along with the pretrained baseline model.

Front-End Speech Enhancement

We explored three methods of front-end speech enhancement: SEVCAE, Deep Xi, and DEMUCS. The base models were taken from the aforementioned official repositories. These speech enhancement models were fine-tuned using noise samples from our custom dataset. After this, mix the clean speech from train-clean-100 of LibriSpeech with our train-noise samples and store the enhanced outputs (.wav files). These are used to fine-tune DeepSpeech 2 via the Code/trainEnhanced.py file. The dependent files include:

Code/trainEnhanced.py
 ├── model.py (change utils.py accordingly)
 ├── data/data_loader.py
 ├── test.py 
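The noise-mixing step described above can be sketched as follows. This is a minimal numpy sketch, not code from the repository; the function name mix_at_snr is an assumption, and it uses the standard power-ratio definition of SNR.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise sample into a clean waveform at a target SNR (in dB).

    The noise is tiled/trimmed to the clip length, then rescaled so that
    the clean-to-noise power ratio equals 10 ** (snr_db / 10).
    """
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise
```

Running this over train-clean-100 clips at the desired SNR levels produces the noisy .wav files referred to above.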

Data-Augmentation Training

We describe two variants of data-augmentation training (DAT): Vanilla DAT and Soft-Freeze DAT. The training file for this experiment is Code/trainTLNoisy.py, where Vanilla DAT corresponds to the argument --layer-scale 1 and Soft-Freeze DAT corresponds to --layer-scale 0.5 (the default). To train the model, supply the path to the noise dataset using the --noise-dir argument. The other --noise-* arguments control the level of noisiness in the data. To control which layers are affected by the Soft-Freeze DAT method, modify frozen_parameters on line 217. The dependent files include:

Code/trainTLNoisy.py
 ├── model.py (change utils.py accordingly)
 ├── data/data_loader_noisy.py
 ├── test_noisy.py 
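One plausible reading of the --layer-scale argument is sketched below: rather than freezing the selected layers outright, their learning rate is scaled down. This is an assumption for illustration (the names BASE_LR and make_param_groups are hypothetical); consult Code/trainTLNoisy.py for the actual mechanism.

```python
BASE_LR = 3e-4  # hypothetical base learning rate

def make_param_groups(named_layers, frozen_parameters, layer_scale):
    """Build optimizer parameter groups with per-layer learning rates.

    Layers listed in `frozen_parameters` are "soft-frozen": with
    layer_scale=1 (Vanilla DAT) they train at the full rate, while
    layer_scale=0.5 (Soft-Freeze DAT) halves their learning rate.
    """
    groups = []
    for name, params in named_layers.items():
        scale = layer_scale if name in frozen_parameters else 1.0
        groups.append({"params": params, "lr": BASE_LR * scale})
    return groups
```

The resulting groups can be passed directly to a PyTorch optimizer, which accepts a list of per-group learning rates.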

Multi-Task Learning

We use the Code/trainMTLNoisy.py file to train the models in the multi-task learning setup. The four main hyperparameters in our setup are: the relative weights of the hybrid loss (λ), the scale of the cross-entropy loss (η), the annealing factor of this scale, and the position of the auxiliary noise classifier. These hyperparameters are set using the --mtl-lambda, --scale, --scale-anneal, and --rnn-split arguments, respectively. The other hyperparameters of the training file are self-explanatory. The dependent files include:

Code/trainMTLNoisy.py
 ├── model_split.py (change utils.py accordingly)
 ├── data/data_loader_noisy.py
 ├── test_noisy.py 
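The hybrid loss can be sketched as below. The exact combination used in Code/trainMTLNoisy.py may differ; this sketch assumes λ (--mtl-lambda) interpolates between the CTC loss and the η-scaled (--scale) cross-entropy loss, with η decayed once per epoch by --scale-anneal.

```python
def hybrid_loss(ctc_loss, ce_loss, mtl_lambda, scale):
    """Hybrid MTL objective: lambda-weighted CTC (ASR) loss plus the
    eta-scaled cross-entropy loss of the auxiliary noise classifier."""
    return mtl_lambda * ctc_loss + (1.0 - mtl_lambda) * scale * ce_loss

def anneal_scale(scale, scale_anneal):
    """Decay the cross-entropy scale eta at the end of each epoch."""
    return scale * scale_anneal
```

With λ = 1 the objective reduces to plain CTC training, so the noise classifier only contributes when λ < 1.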

Adversarial Training

We use the Code/trainDiffAdvNoisy.py file to train the models in the adversarial training setup. The key hyperparameters here are: the learning-rate scale factor of the feature extractor (λf), the learning-rate scale factor of the recognition model (λr), the learning-rate scale factor of the noise classifier (λn), and the position of the discriminator (noise) classifier. These hyperparameters are set using the --lr-factor, --recog-factor, --noise-factor, and --rnn-split arguments, respectively. To use only linear layers in the noise classifier, which in our experience works better, use --only-fc True. The dependent files include:

Code/trainDiffAdvNoisy.py
 ├── model_split_adversary.py (change utils.py accordingly)
 ├── data/data_loader_noisy.py
 ├── test_noisy.py 
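The role of the three scale factors can be sketched as below. Both function names are hypothetical illustrations, not code from the repository: the first shows per-component learning rates, and the second shows the usual gradient-reversal trick of adversarial training, in which the noise-classifier gradient is flipped before reaching the feature extractor so the learned features become noise-invariant.

```python
def component_learning_rates(base_lr, lr_factor, recog_factor, noise_factor):
    """Per-component learning rates derived from one base rate:
    --lr-factor, --recog-factor, and --noise-factor scale the updates of
    the feature extractor, recognition model, and noise classifier."""
    return {
        "feature_extractor": base_lr * lr_factor,
        "recognizer": base_lr * recog_factor,
        "noise_classifier": base_lr * noise_factor,
    }

def grad_reverse(grads, scale):
    """Flip (and scale) the discriminator gradients flowing back into
    the feature extractor, as in gradient-reversal adversarial training."""
    return [-scale * g for g in grads]
```

Setting a small --lr-factor relative to --noise-factor lets the discriminator adapt quickly while the shared features change slowly.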

Testing on Noisy Speech

The following command is used to evaluate the performance on the test noisy speech:

python test.py --test-manifest-path [path to noisy speech] --SNR-start 0 --SNR-stop 20 --SNR-step 5
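The three --SNR-* arguments define the sweep of SNR levels evaluated; a sketch of that sweep is below, assuming the stop value is inclusive (so 0-20 dB in 5 dB steps gives five levels). The helper name is hypothetical.

```python
def snr_sweep(start, stop, step):
    """SNR levels (in dB) implied by --SNR-start/--SNR-stop/--SNR-step,
    assuming the stop value is included in the sweep."""
    return list(range(start, stop + step, step))
```

With the command above, the model would be evaluated at 0, 5, 10, 15, and 20 dB.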

Datasets

We use the Librispeech Dataset for clean speech as well as our custom noise datasets for noisy speech.

Paper

If you use our techniques, dataset, or code in this repository, please consider citing our paper:

@article{prasad2021investigation,
  title={An Investigation of End-to-End Models for Robust Speech Recognition},
  author={Prasad, Archiki and Jyothi, Preethi and Velmurugan, Rajbabu},
  journal={CoRR},
  volume={abs/2102.06237},
  year={2021},
  archivePrefix={arXiv},
  eprint={2102.06237}
}
