
This project is forked from sooftware/kospeech


Open Source Project for Korean End-to-End (E2E) Automatic Speech Recognition (ASR) in PyTorch for Deep Learning Researchers

Home Page: https://sooftware.github.io/KoSpeech/

License: Apache License 2.0



KoSpeech: Open Source Project for Korean End-to-End Automatic Speech Recognition in PyTorch


Soohwan Kim1,2, Seyoung Bae1, Cheolhwang Won1, Suwon Park1*

1Elcomm, Kwangwoon Univ. 2Spoken Language Lab, Sogang Univ.

* denotes the advisor of this work.

End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which consist of an acoustic model, a language model, and a pronunciation model, require separate training of these components, each of which can be complex.

For example, training an acoustic model is a multi-stage process of model training and time alignment between the acoustic feature sequence and the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline, using models that operate at low audio frame rates. This reduces training and decoding time and allows joint optimization with downstream processing such as natural language understanding.

Korean.ver

Intro

KoSpeech is a project for end-to-end (E2E) automatic speech recognition implemented in PyTorch.
KoSpeech has modularized and extensible components for LAS models, training and evaluation, checkpoints, parsing, etc.
We appreciate any kind of feedback or contribution.

We used the KsponSpeech corpus, which contains 1,000 hours of Korean speech data.
At present our model has recorded an 87.99% CRR, and we are working toward a higher recognition rate.
Our model has also recorded a 92.0% CRR on the Kaldi-zeroth corpus.

(CRR: Character Recognition Rate)
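
For reference, CRR can be computed from the character-level edit distance between a reference transcript and a hypothesis. The sketch below is illustrative and assumes CRR = (1 - CER) * 100, where CER is the edit distance divided by the reference length; KoSpeech's own metric code may differ in detail.

# Illustrative CRR computation (not KoSpeech's exact metric code).
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (r != h))    # substitution
    return dp[len(hyp)]

def crr(ref: str, hyp: str) -> float:
    """Character recognition rate in percent."""
    return (1.0 - edit_distance(ref, hyp) / max(len(ref), 1)) * 100.0

print(crr("안녕하세요", "안냥하세요"))  # 80.0 (one substituted character out of five)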

Features

We have referred to many papers to develop the best model possible, and we have tried to make the code as efficient and easy to use as possible. If you run into any inconvenience, please let us know anytime; we will respond as soon as possible.

Roadmap

Our architecture is based on a sequence-to-sequence (Seq2seq) model with attention.

The attention mechanism helps the model find the alignment between the speech and the transcript. We provide two options, location-aware attention and multi-head attention: location-aware attention was proposed in the Attention Based Models for Speech Recognition paper, and multi-head attention was proposed in the Attention Is All You Need paper. You can choose between the two with the attn_mechanism option. Please check this page.
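
For illustration, a minimal location-aware attention module could look like the following. This is a sketch of the idea from the Chorowski et al. paper, not a copy of KoSpeech's own attention classes; the layer names and sizes here are assumptions.

# Sketch of location-aware attention: the score for each encoder frame depends
# on the decoder state, the encoder output, and a 1-D convolution over the
# previous alignment (Chorowski et al., 2015). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    def __init__(self, dim: int, conv_channels: int = 10, kernel_size: int = 11):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim, bias=False)
        self.value_proj = nn.Linear(dim, dim, bias=False)
        self.loc_conv = nn.Conv1d(1, conv_channels, kernel_size, padding=kernel_size // 2)
        self.loc_proj = nn.Linear(conv_channels, dim, bias=False)
        self.score_proj = nn.Linear(dim, 1)

    def forward(self, query, values, prev_align):
        # query: (B, 1, D) decoder state, values: (B, T, D) encoder outputs,
        # prev_align: (B, T) attention weights from the previous decoding step.
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)     # (B, T, C)
        energy = self.score_proj(torch.tanh(
            self.query_proj(query) + self.value_proj(values) + self.loc_proj(loc)
        )).squeeze(-1)                                                    # (B, T)
        align = F.softmax(energy, dim=-1)
        context = torch.bmm(align.unsqueeze(1), values)                   # (B, 1, D)
        return context, align

# Example shapes: batch of 2, 100 encoder frames, hidden size 512.
q, v = torch.randn(2, 1, 512), torch.randn(2, 100, 512)
ctx, align = LocationAwareAttention(512)(q, v, torch.zeros(2, 100))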

We mainly referred to the following papers.

「Listen, Attend and Spell」

「Attention Based Models for Speech Recognition」

「State-of-the-art Speech Recognition with Sequence-to-Sequence Models」

「SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition」

If you want to study audio features, we recommend this paper.

「Voice Recognition Using MFCC Algorithm」

Our model architecture is as follows.

ListenAttendSpell(
  (listener): Listener(
    (extractor): VGGExtractor(
      (cnn): MaskCNN(
        (sequential): Sequential(
          (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): Hardtanh(min_val=0, max_val=20, inplace=True)
          (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): Hardtanh(min_val=0, max_val=20, inplace=True)
          (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (8): Hardtanh(min_val=0, max_val=20, inplace=True)
          (9): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (11): Hardtanh(min_val=0, max_val=20, inplace=True)
          (12): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
        )
      )
    )
    (rnn): LSTM(2560, 256, num_layers=3, batch_first=True, dropout=0.3, bidirectional=True)
  )
  (speller): Speller(
    (embedding): Embedding(2038, 512)
    (input_dropout): Dropout(p=0.3, inplace=False)
    (rnn): LSTM(512, 512, num_layers=2, batch_first=True, dropout=0.3)
    (attention): MultiHeadAttention(
      (query_proj): Linear(in_features=512, out_features=512, bias=True)
      (value_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (fc1): Linear(in_features=1024, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=2038, bias=True)
  )
)
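
A quick sanity check of the printed shapes: assuming 80 mel-filterbank input features (n_mels 80), the two 2x2 max-pool layers in the VGG extractor halve the frequency axis twice (80 → 40 → 20), and the last convolution outputs 128 channels, so the listener's LSTM sees 128 × 20 = 2560 features per frame.

# Sanity check of the listener's LSTM input size shown in the printout above.
# Assumption: the model was built with n_mels = 80.
n_mels = 80
freq_bins_after_pooling = n_mels // 2 // 2      # two 2x2 max-pools: 80 -> 40 -> 20
vgg_output_channels = 128                       # last Conv2d in the extractor
lstm_input_size = vgg_output_channels * freq_bins_after_pooling
print(lstm_input_size)                          # 2560, matching LSTM(2560, 256, ...)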

KoSpeech

The kospeech module has modularized and extensible components for LAS models, trainer, evaluator, checkpoints, etc.
In addition, kospeech enables training in a variety of environments with simple option settings.

  • Options
usage: main.py [-h] [--mode] [--sample_rate]
               [--window_size] [--stride] [--n_mels]
               [--normalize] [--del_silence] [--input_reverse]
               [--feature_extract_by] [--time_mask_para] [--freq_mask_para]
               [--time_mask_num] [--freq_mask_num]
               [--use_bidirectional] [--hidden_dim]
               [--dropout] [--num_heads] [--label_smoothing]
               [--listener_layer_size] [--speller_layer_size] [--rnn_type]
               [--extractor] [--activation]
               [--attn_mechanism] [--teacher_forcing_ratio]
               [--dataset_path] [--data_list_path]
               [--label_path] [--init_uniform] [--spec_augment]
               [--noise_augment] [--noiseset_size]
               [--noise_level] [--use_cuda]
               [--batch_size] [--num_workers]
               [--num_epochs] [--init_lr]
               [--high_plateau_lr] [--low_plateau_lr] [--valid_ratio]
               [--max_len] [--max_grad_norm]
               [--rampup_period] [--decay_threshold] [--exp_decay_period]
               [--teacher_forcing_step] [--min_teacher_forcing_ratio]
               [--seed] [--save_result_every]
               [--checkpoint_every] [--print_every] [--resume]

We are constantly updating the progress of the project on the Wiki page. Please check this page.

Installation

This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtualenv or conda).

Prerequisites

  • NumPy: pip install numpy (refer here if you have trouble installing NumPy).
  • PyTorch: refer to the PyTorch website to install the version appropriate for your environment.
  • Pandas: pip install pandas (refer here if you have trouble installing Pandas).
  • Matplotlib: pip install matplotlib (refer here if you have trouble installing Matplotlib).
  • librosa: pip install librosa (refer here if you have trouble installing librosa).
  • torchaudio: pip install torchaudio (refer here if you have trouble installing torchaudio).
  • tqdm: pip install tqdm (refer here if you have trouble installing tqdm).

Install from source

Currently we only support installation from source using setuptools. Check out the source code and run the following commands:

pip install -r requirements.txt
python bin/setup.py build
python bin/setup.py install

Get Started

Step 1: Data Preprocessing

You can preprocess the KsponSpeech corpus by referring here, or refer to this page.
This documentation contains information regarding the preprocessing of KsponSpeech.

Step 2: Run main.py

  • Default setting
$ ./run.sh
  • Custom setting
python ./bin/main.py --batch_size 32 --num_workers 4 --num_epochs 20  --use_bidirectional \
                     --input_reverse --spec_augment --noise_augment --use_cuda --hidden_dim 256 \
                     --dropout 0.3 --num_heads 8 --label_smoothing 0.1 \
                     --listener_layer_size 5 --speller_layer_size 3 --rnn_type gru \
                     --high_plateau_lr $HIGH_PLATEAU_LR --teacher_forcing_ratio 1.0 --valid_ratio 0.01 \
                     --sample_rate 16000 --window_size 20 --stride 10 --n_mels 80 --normalize --del_silence \
                     --feature_extract_by torchaudio --time_mask_para 70 --freq_mask_para 12 \
                     --time_mask_num 2 --freq_mask_num 2 --save_result_every 1000 \
                     --checkpoint_every 5000 --print_every 10 --init_lr 1e-15  --init_uniform  \
                     --mode train --dataset_path /data3/ --data_list_path ./data/data_list/xxx.csv \
                     --max_grad_norm 400 --rampup_period 1000 --max_len 80 --decay_threshold 0.02 \
                     --exp_decay_period  160000 --low_plateau_lr 1e-05 --noiseset_size 1000 \
                     --noise_level 0.7 --attn_mechanism loc --teacher_forcing_step 0.05 \
                     --min_teacher_forcing_ratio 0.7

You can train the model with the commands above.
If you want to train with the default settings, use the Default setting command.
If you want to train with custom settings, designate hyperparameters with the Custom setting command.

Step 3: Run eval.py

  • Default setting
$ ./eval.sh
  • Custom setting
python ./bin/eval.py --dataset_path dataset_path --data_list_path data_list_path \
                     --mode eval --use_cuda --batch_size 32 --num_workers 4 \
                     --use_beam_search --k 5 --print_every 100 \
                     --sample_rate 16000 --window_size 20 --stride 10 --n_mels 80 --feature_extract_by librosa \
                     --normalize --del_silence --input_reverse

Now you have a model that you can use to predict on new data. We do this by running beam search (or greedy search).
As with training, you can choose between the Default setting and the Custom setting.
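
To make the distinction concrete, here is an illustrative beam-search sketch (not KoSpeech's eval code). The hypothetical step(seq) function stands in for the speller: given the tokens decoded so far, it returns log-probabilities for the next token. Greedy search is simply the k = 1 case, keeping only the single most likely token at every step.

# Illustrative beam search over a hypothetical step(seq) -> log-probs function.
import math

def beam_search(step, sos: int, eos: int, k: int = 5, max_len: int = 80):
    beams = [([sos], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                   # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            for token, log_prob in enumerate(step(seq)):
                candidates.append((seq + [token], score + log_prob))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]                           # best-scoring hypothesis

# Toy usage with a fake 4-token vocabulary where token 3 is <eos>.
toy_log_probs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)]
print(beam_search(lambda seq: toy_log_probs, sos=0, eos=3, k=2, max_len=5))  # [0, 3]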

Checkpoints

Checkpoints are organized by experiments and timestamps as shown in the following file structure.

save_dir
+-- checkpoints
|  +-- YYYY_mm_dd_HH_MM_SS
|  |  +-- trainer_states.pt
|  |  +-- model.pt

You can resume and load from checkpoints.
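
For reference, saving and restoring files in the layout above can be done with plain torch.save/torch.load. The sketch below is generic PyTorch usage under that assumed layout, not KoSpeech's own Checkpoint class, which may store additional trainer state.

# Generic checkpoint handling matching the directory layout above (illustrative).
import os
import time
import torch

def save_checkpoint(save_dir, model, optimizer, epoch):
    path = os.path.join(save_dir, "checkpoints", time.strftime("%Y_%m_%d_%H_%M_%S"))
    os.makedirs(path, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(path, "model.pt"))
    torch.save({"optimizer": optimizer.state_dict(), "epoch": epoch},
               os.path.join(path, "trainer_states.pt"))

def load_checkpoint(path, model, optimizer=None):
    model.load_state_dict(torch.load(os.path.join(path, "model.pt"), map_location="cpu"))
    states = torch.load(os.path.join(path, "trainer_states.pt"), map_location="cpu")
    if optimizer is not None:
        optimizer.load_state_dict(states["optimizer"])
    return states["epoch"]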

Incorporating External Language Model in Performance Test

We describe how to incorporate an external language model in the performance test.
If you are interested in this content, please check here.
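
One common way to do this is shallow fusion (Kannan et al., 2018, listed in the references): during decoding, each candidate token is scored by the ASR log-probability plus a weighted language-model log-probability. The snippet below sketches that scoring rule; it is an assumption for illustration, not necessarily the exact scheme linked above, and the weight 0.3 is an arbitrary example.

# Illustrative shallow-fusion scoring (assumed approach; weight is an example).
import torch

def shallow_fusion(asr_log_probs: torch.Tensor,
                   lm_log_probs: torch.Tensor,
                   lm_weight: float = 0.3) -> torch.Tensor:
    """Combine per-token log-probabilities from the ASR decoder and the external LM."""
    return asr_log_probs + lm_weight * lm_log_probs

# The decoder (greedy or beam search) then ranks candidates by the fused
# score instead of the ASR score alone.
asr = torch.randn(1, 2038).log_softmax(dim=-1)   # 2038 = output vocabulary size above
lm = torch.randn(1, 2038).log_softmax(dim=-1)
next_token = shallow_fusion(asr, lm).argmax(dim=-1)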

Troubleshoots and Contributing

If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please go to our Gitter, or contact [email protected].

We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.

Code Style

We follow PEP 8 for code style. In particular, the style of docstrings is important for generating documentation.
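
For example, a documentation-friendly docstring might look like the following. The function name is hypothetical and the exact docstring convention used in the project may differ.

def char_distance(ref: str, hyp: str) -> int:
    """
    Compute the character-level edit distance between a reference and a hypothesis.

    Args:
        ref (str): ground-truth transcript
        hyp (str): transcript predicted by the model

    Returns:
        int: minimum number of character insertions, deletions, and substitutions
    """
    ...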

References

Ilya Sutskever et al. Sequence to Sequence Learning with Neural Networks arXiv: 1409.3215

Dzmitry Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate arXiv: 1409.0473

Jan Chorowski et al. Attention Based Models for Speech Recognition arXiv: 1506.07503

William Chan et al. Listen, Attend and Spell arXiv: 1508.01211

Dario Amodei et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin arXiv: 1512.02595

Takaaki Hori et al. Advances in Joint CTC-Attention based E2E ASR with a Deep CNN Encoder and RNN-LM arXiv: 1706.02737

Ashish Vaswani et al. Attention Is All You Need arXiv: 1706.03762

Chung-Cheng Chiu et al. State-of-the-art Speech Recognition with Sequence-to-Sequence Models arXiv: 1712.01769

Anjuli Kannan et al. An Analysis Of Incorporating An External LM Into A Seq2seq Model arXiv: 1712.01996

Daniel S. Park et al. SpecAugment: A Simple Data Augmentation Method for ASR arXiv: 1904.08779

Rafael Muller et al. When Does Label Smoothing Help? arXiv: 1906.02629

Jung-Woo Ha et al. ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for ASR of Contact Centers arXiv: 2004.09367

Citing

@github{
  title = {KoSpeech},
  author = {Soohwan Kim, Seyoung Bae, Cheolhwang Won, Suwon Park},
  publisher = {GitHub},
  docs = {https://sooftware.github.io/KoSpeech/},
  url = {https://github.com/sooftware/KoSpeech},
  year = {2020}
}

