zzw922cn / automatic_speech_recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

License: MIT License

Python 98.71% Shell 1.02% Dockerfile 0.27%
automatic-speech-recognition tensorflow timit-dataset feature-vector phonemes data-preprocessing rnn audio deep-learning lstm

automatic_speech_recognition's Introduction

Automatic-Speech-Recognition

End-to-end automatic speech recognition system implemented in TensorFlow.

Recent Updates

  • Support TensorFlow r1.0 (2017-02-24)
  • Support dropout for dynamic rnn (2017-03-11)
  • Support running from a shell script (2017-03-11)
  • Support automatic evaluation every few training epochs (2017-03-11)
  • Fix bugs for character-level automatic speech recognition (2017-03-14)
  • Improve some function APIs for reusability (2017-03-14)
  • Add scaling for data preprocessing (2017-03-15)
  • Add reusable support for LibriSpeech training (2017-03-15)
  • Add simple n-gram model for random generation or statistical use (2017-03-23)
  • Improve some code for pre-processing and training (2017-03-23)
  • Replace TABs with blanks and add nist2wav converter script (2017-04-20)
  • Add some data preparation code (2017-05-01)
  • Add WSJ corpus standard preprocessing by s5 recipe (2017-05-05)
  • Restructuring of the project. Updated train.py for usage convenience (2017-05-06)
  • Finish feature modules for TIMIT, LibriSpeech and WSJ; support training on LibriSpeech (2017-05-14)
  • Remove some unnecessary code (2017-07-22)
  • Add DeepSpeech2 implementation code (2017-07-23)
  • Fix some bugs (2017-08-06)
  • Add Layer Normalization RNN for efficiency (2017-08-06)
  • Add Mandarin Speech Recognition support (2017-08-06)
  • Add Capsule Network Model (2017-12-12)
  • Release 1.0.0 version (2017-12-14)
  • Add Language Modeling Module (2017-12-25)
  • Will support TF1.12 soon (2019-10-17)

Recommendation

If you want to replace the feed_dict operation with a TensorFlow multi-threaded FIFOQueue input pipeline, you can refer to my repo TensorFlow-Input-Pipeline for more example code. In my own experience, a FIFOQueue input pipeline can improve training speed in some cases.
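
A minimal sketch of that pattern, assuming the TF r1.x API (load_utterances is a hypothetical generator yielding (features, labels) pairs, and the shapes are illustrative):

import threading
import tensorflow as tf  # TF r1.x API

# Placeholders are used only by the feeding thread, not by the training step.
feat_ph = tf.placeholder(tf.float32, shape=[None, 39])
label_ph = tf.placeholder(tf.int32, shape=[None])

queue = tf.FIFOQueue(capacity=32, dtypes=[tf.float32, tf.int32])
enqueue_op = queue.enqueue([feat_ph, label_ph])
feats, labels = queue.dequeue()  # build the model graph on these tensors

def feed(sess):
    # load_utterances() is hypothetical: it yields (feature_matrix, label_vector).
    for f, l in load_utterances():
        sess.run(enqueue_op, feed_dict={feat_ph: f, label_ph: l})

with tf.Session() as sess:
    threading.Thread(target=feed, args=(sess,), daemon=True).start()
    # ... run training ops built from feats/labels here ...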

If you want to explore the history of speech recognition, I have collected the significant papers in the ASR field since 1981. You can read the paper list in my repo awesome-speech-recognition-papers; download links for all papers are provided. I will update it every week with new papers on speech recognition, speech synthesis and language modelling. I hope we won't miss any important papers in the speech domain.

All my public repos will keep being updated in the future. Thanks for your stars!

Install and Usage

Currently only Python 3.5 is supported.

This project depends on scikits.audiolab, which requires libsndfile to be installed on your system. Clone the repository to your preferred directory and install using:

sudo pip3 install -r requirements.txt
sudo python3 setup.py install

To use, simply run the following command:

python main/timit_train.py [-h] [--mode MODE] [--keep [KEEP]] [--nokeep]
                      [--level LEVEL] [--model MODEL] [--rnncell RNNCELL]
                      [--num_layer NUM_LAYER] [--activation ACTIVATION]
                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]
                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]
                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]
                      [--lr LR] [--dropout_prob DROPOUT_PROB]
                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]
                      [--logdir LOGDIR]

optional arguments:
  -h, --help            show this help message and exit
  --mode MODE           set whether to train or test
  --keep [KEEP]         set whether to restore a model, when test mode, keep
                        should be set to True
  --nokeep
  --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will
                        be supported soon
  --model MODEL         set the model to use, DBiRNN, BiRNN, ResNet..
  --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...
  --num_layer NUM_LAYER
                        set the layers for rnn
  --activation ACTIVATION
                        set the activation to use, sigmoid, tanh, relu, elu...
  --optimizer OPTIMIZER
                        set the optimizer to use, sgd, adam...
  --batch_size BATCH_SIZE
                        set the batch size
  --num_hidden NUM_HIDDEN
                        set the hidden size of rnn cell
  --num_feature NUM_FEATURE
                        set the size of input feature
  --num_classes NUM_CLASSES
                        set the number of output classes
  --num_epochs NUM_EPOCHS
                        set the number of epochs
  --lr LR               set the learning rate
  --dropout_prob DROPOUT_PROB
                        set probability of dropout
  --grad_clip GRAD_CLIP
                        set the threshold of gradient clipping
  --datadir DATADIR     set the data root directory
  --logdir LOGDIR       set the log directory

Instead of configuring on the command line, you can also set the arguments above directly in timit_train.py.

Besides, you can also run main/run.sh for both training and testing in one go; see run_timit.sh for details.
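
For example, a phoneme-level training run might look like this (all flag values here are illustrative; see the help text above for the full set):

python main/timit_train.py --mode train --level phn --model DBiRNN --rnncell lstm --num_layer 2 --batch_size 32 --num_epochs 200 --datadir /path/to/timit_features --logdir /path/to/logs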

Performance

PER of a dynamic BLSTM on the TIMIT database, with only casual tuning because time was limited:

[figure: PER curves on TIMIT]

LibriSpeech recognition result without LM

Label:

it was about noon when captain waverley entered the straggling village or rather hamlet of tully veolan close to which was situated the mansion of the proprietor

Prediction:

it was about noon when captain wavraly entered the stragling bilagor of rather hamlent of tulevallon close to which wi situated the mantion of the propriater

Label:

the english it is evident had they not been previously assured of receiving the king would never have parted with so considerable a sum and while they weakened themselves by the same measure have strengthened a people with whom they must afterwards have so material an interest to discuss

Prediction:

the onglish it is evident had they not being previously showed of receiving the king would never have parted with so considerable a some an quile they weakene themselves by the same measure haf streigth and de people with whom they must afterwards have so material and interest to discuss

Label:

one who writes of such an era labours under a troublesome disadvantage

Prediction:

one how rights of such an er a labours onder a troubles hom disadvantage

Label:

then they started on again and two hours later came in sight of the house of doctor pipt

Prediction:

then they started on again and two hours laytor came in sight of the house of doctor pipd

Label:

what does he want

Prediction:

whit daes he want

Label:

there just in front

Prediction:

there just infront

Label:

under ordinary circumstances the abalone is tough and unpalatable but after the deft manipulation of herbert they are tender and make a fine dish either fried as chowder or a la newberg

Prediction:

under ordinary circumstancesi the abl ony is tufgh and unpelitable but after the deftominiculation of hurbourt and they are tender and make a fine dish either fride as choder or alanuburg

Label:

by degrees all his happiness all his brilliancy subsided into regret and uneasiness so that his limbs lost their power his arms hung heavily by his sides and his head drooped as though he was stupefied

Prediction:

by degrees all his happiness ill his brilliancy subsited inter regret and aneasiness so that his limbs lost their power his arms hung heavily by his sides and his head druped as though he was stupified

Label:

i am the one to go after walt if anyone has to i'll go down mister thomas

Prediction:

i have the one to go after walt if ety wod hastu i'll go down mister thommas

Label:

i had to read it over carefully as the text must be absolutely correct

Prediction:

i had to readit over carefully as the tex must be absolutely correct

Label:

with a shout the boys dashed pell mell to meet the pack train and falling in behind the slow moving burros urged them on with derisive shouts and sundry resounding slaps on the animals flanks

Prediction:

with a shok the boy stash pale mele to meek the pecktrait ane falling in behind the slow lelicg burs ersh tlan with deressive shouts and sudery resounding sleps on the animal slankes

Label:

i suppose though it's too early for them then came the explosion

Prediction:

i suppouse gho waths two early for them then came the explosion

Content

This is a powerful library for automatic speech recognition. It is implemented in TensorFlow and supports training on CPU or GPU. The library contains the following components and models that you can use to train your own system:

  • Data Pre-processing
  • Acoustic Modeling
    • RNN
    • BRNN
    • LSTM
    • BLSTM
    • GRU
    • BGRU
    • Dynamic RNN
    • Deep Residual Network
    • Seq2Seq with attention decoder
    • etc.
  • CTC Decoding (see the sketch after this list)
  • Evaluation (mapping similar phonemes)
  • Saving or Restoring Model
  • Mini-batch Training
  • Training with GPU or CPU with TensorFlow
  • Logging of epoch time and error rate to disk
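
As a rough illustration of the CTC decoding and evaluation items above, here is a minimal sketch using the TF r1.x CTC ops; the placeholder shapes and class count are illustrative assumptions, not this library's exact code:

import tensorflow as tf  # TF r1.x API

num_classes = 40  # e.g. 39 phonemes + 1 CTC blank (illustrative)
logits = tf.placeholder(tf.float32, [None, None, num_classes])  # [time, batch, classes]
seq_lens = tf.placeholder(tf.int32, [None])      # true length of each utterance
sparse_labels = tf.sparse_placeholder(tf.int32)  # target index sequences

# CTC loss averaged over the batch
loss = tf.reduce_mean(tf.nn.ctc_loss(sparse_labels, logits, seq_lens))

# Beam-search CTC decoding; decoded[0] is a SparseTensor of predictions
decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_lens)

# Label error rate (PER/CER) as normalized edit distance
error_rate = tf.reduce_mean(
    tf.edit_distance(tf.cast(decoded[0], tf.int32), sparse_labels))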

Implementation Details

Data preprocessing

TIMIT corpus

The original TIMIT database contains 6300 utterances, but the 'SA' sentences are read by every speaker, which would introduce a bad bias into our speech recognition system. Therefore, we removed all 'SA' files from the original dataset, obtaining a new TIMIT dataset of 5040 utterances: a standard training set of 3696 and a test set of 1344.
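
A sketch of how that filtering might be done when walking the TIMIT tree (the helper name and paths are illustrative; TIMIT names the dialect sentences SA1/SA2):

import os

def list_timit_wavs(timit_root):
    """Collect TIMIT .wav paths, skipping the 'SA' dialect sentences,
    which are read by every speaker and would bias the model."""
    wavs = []
    for dirpath, _, filenames in os.walk(timit_root):
        for fn in filenames:
            if fn.lower().endswith('.wav') and not fn.lower().startswith('sa'):
                wavs.append(os.path.join(dirpath, fn))
    return wavs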

Automatic speech recognition transcribes a raw audio file into character sequences; the preprocessing stage converts a raw audio file into feature vectors of several frames. We first split each audio file into 20ms Hamming windows with an overlap of 10ms, and then calculate the 12 mel-frequency cepstral coefficients, appending an energy variable to each frame. This results in a vector of length 13. We then calculate the delta coefficients and delta-delta coefficients, attaining a total of 39 coefficients for each frame. In other words, each audio file is split into frames using the Hamming window function, and each frame is converted into a feature vector of length 39 (to obtain feature vectors of a different length, modify the settings in the file timit_preprocess.py).
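
The repo's calcmfcc.py and timit_preprocess.py implement this pipeline; the following is a minimal, roughly equivalent sketch using the python_speech_features package (the input file name is illustrative):

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

rate, signal = wavfile.read('utterance.wav')  # one raw audio file

# 20ms Hamming windows with a 10ms step; numcep=13 with appendEnergy=True
# gives 12 cepstral coefficients plus log energy per frame.
feat = mfcc(signal, samplerate=rate, winlen=0.020, winstep=0.010,
            numcep=13, appendEnergy=True, winfunc=np.hamming)

d1 = delta(feat, 2)                    # delta coefficients
d2 = delta(d1, 2)                      # delta-delta coefficients
features = np.hstack([feat, d1, d2])   # shape: (numFrames, 39)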

In the folder data/mfcc, each file is a timeLength×39 feature matrix for one audio file; in the folder data/label, each file is the label vector corresponding to its mfcc file.

If you want to set your own data preprocessing, you can edit calcmfcc.py or timit_preprocess.py.

The original TIMIT dataset contains 61 phonemes. We use all 61 phonemes for training and evaluation, but when scoring we map the 61 phonemes down to 39 phonemes for better performance, following the paper Speaker-independent phone recognition using hidden Markov models. The mapping details are as follows:

Original Phoneme(s) Mapped Phoneme
iy iy
ix, ih ix
eh eh
ae ae
ax, ah, ax-h ax
uw, ux uw
uh uh
ao, aa ao
ey ey
ay ay
oy oy
aw aw
ow ow
er, axr er
l, el l
r r
w w
y y
m, em m
n, en, nx n
ng, eng ng
v v
f f
dh dh
th th
z z
s s
zh, sh zh
jh jh
ch ch
b b
p p
d d
dx dx
t t
g g
k k
hh, hv hh
bcl, pcl, dcl, tcl, gcl, kcl, q, epi, pau, h# h#
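
For reference, the same mapping can be written as a small Python dict (a sketch derived directly from the table above; the repo applies this mapping in its evaluation code):

# Groups of original phonemes that collapse into one scored phoneme;
# phonemes not listed here map to themselves.
GROUPS = {
    'ix': ['ix', 'ih'],
    'ax': ['ax', 'ah', 'ax-h'],
    'uw': ['uw', 'ux'],
    'ao': ['ao', 'aa'],
    'er': ['er', 'axr'],
    'l':  ['l', 'el'],
    'm':  ['m', 'em'],
    'n':  ['n', 'en', 'nx'],
    'ng': ['ng', 'eng'],
    'zh': ['zh', 'sh'],
    'hh': ['hh', 'hv'],
    'h#': ['bcl', 'pcl', 'dcl', 'tcl', 'gcl', 'kcl', 'q', 'epi', 'pau', 'h#'],
}
PHONE_MAP = {src: dst for dst, srcs in GROUPS.items() for src in srcs}

def score_phoneme(phn):
    return PHONE_MAP.get(phn, phn)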

LibriSpeech corpus

LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech. It can be downloaded from here.

To preprocess the LibriSpeech data, download the dataset from the above-mentioned link, extract it, and run the following:

cd feature/libri
python libri_preprocess.py -h 
usage: libri_preprocess [-h]
                        [-n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}]
                        [-m {mfcc,fbank}] [--featlen FEATLEN] [-s]
                        [-wl WINLEN] [-ws WINSTEP]
                        path save

Script to preprocess libri data

positional arguments:
  path                  Directory of LibriSpeech dataset
  save                  Directory where preprocessed arrays are to be saved

optional arguments:
  -h, --help            show this help message and exit
  -n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}, --name {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}
                        Name of the dataset
  -m {mfcc,fbank}, --mode {mfcc,fbank}
                        Mode
  --featlen FEATLEN     Features length
  -s, --seq2seq         set this flag to use seq2seq
  -wl WINLEN, --winlen WINLEN
                        specify the window length of feature
  -ws WINSTEP, --winstep WINSTEP
                        specify the window step length of feature

The processed data will be saved in the "save" path.
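
For example, to extract MFCC features for the train-clean-100 subset (run from feature/libri as above; paths are illustrative):

python libri_preprocess.py -n train-clean-100 -m mfcc ~/data/LibriSpeech ~/data/libri_features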

To train the model, run the following:

python main/libri_train.py -h 
usage: libri_train.py [-h] [--task TASK] [--train_dataset TRAIN_DATASET]
                      [--dev_dataset DEV_DATASET]
                      [--test_dataset TEST_DATASET] [--mode MODE]
                      [--keep [KEEP]] [--nokeep] [--level LEVEL]
                      [--model MODEL] [--rnncell RNNCELL]
                      [--num_layer NUM_LAYER] [--activation ACTIVATION]
                      [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]
                      [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]
                      [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]
                      [--lr LR] [--dropout_prob DROPOUT_PROB]
                      [--grad_clip GRAD_CLIP] [--datadir DATADIR]
                      [--logdir LOGDIR]

optional arguments:
  -h, --help            show this help message and exit
  --task TASK           set task name of this program
  --train_dataset TRAIN_DATASET
                        set the training dataset
  --dev_dataset DEV_DATASET
                        set the development dataset
  --test_dataset TEST_DATASET
                        set the test dataset
  --mode MODE           set whether to train, dev or test
  --keep [KEEP]         set whether to restore a model, when test mode, keep
                        should be set to True
  --nokeep
  --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will
                        be supported soon
  --model MODEL         set the model to use, DBiRNN, BiRNN, ResNet..
  --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...
  --num_layer NUM_LAYER
                        set the layers for rnn
  --activation ACTIVATION
                        set the activation to use, sigmoid, tanh, relu, elu...
  --optimizer OPTIMIZER
                        set the optimizer to use, sgd, adam...
  --batch_size BATCH_SIZE
                        set the batch size
  --num_hidden NUM_HIDDEN
                        set the hidden size of rnn cell
  --num_feature NUM_FEATURE
                        set the size of input feature
  --num_classes NUM_CLASSES
                        set the number of output classes
  --num_epochs NUM_EPOCHS
                        set the number of epochs
  --lr LR               set the learning rate
  --dropout_prob DROPOUT_PROB
                        set probability of dropout
  --grad_clip GRAD_CLIP
                        set the threshold of gradient clipping, -1 denotes no
                        clipping
  --datadir DATADIR     set the data root directory
  --logdir LOGDIR       set the log directory

where the "datadir" is the "save" path used in preprocess stage.

Wall Street Journal corpus

TODO

Core Features

  • Dynamic RNN (GRU, LSTM)
  • Residual Network (deep CNN)
  • CTC Decoding
  • TIMIT Phoneme Edit Distance (PER)

Future Work

  • Release pretrained English ASR model
  • Add Attention Mechanism
  • Add Speaker Verification
  • Add TTS

License

MIT

Contact Us

If this program is helpful to you, please give us a star or a fork to encourage us to keep it updated. Thank you! Any issues or pull requests are appreciated.

Collaborators:

zzw922cn

deepxuexi

hiteshpaul

xxxxyzt

automatic_speech_recognition's People

Contributors

azizcode92, brianlan, cunnie, hanseokhyeon, nq555222, revolter, terminalwitchcraft, xxxxyzt, zzw922cn


automatic_speech_recognition's Issues

Pretrained Model

Hi,

Is it possible to share your pretrained models (checkpoint, meta files)? That way we could evaluate the performance without training the model ourselves.

Thanks,

I have a problem with the mfcc

Today I happened to notice that the data in the npy files produced by your feature-extraction code is not 39×1 but 39×n (n differs per utterance: 292, 370, etc.). I had assumed your preprocessing produced a single feature vector of length 39 per utterance, since the other speech-recognition feature extraction I have seen yields one 39-dimensional vector. Why is the extracted matrix so large? Is there a later step that converts it into a single length-39 feature vector? I could not find one in your code. Any guidance would be much appreciated.

Question: TensorFlow devices are created at every step. Isn't it inefficient?

I found that training our model with a Tesla P100 GPU is not any faster than training it with an ordinary CPU.

According to these console outputs, it seems TensorFlow devices are created at every step.
(I am using 24 GPUs and global_step increases by 24 at each step.)

Is it necessary to do so?
Is it costly to create a Tensorflow device?
Who is in charge of the device creation, TensorFlow or us?

2017-06-15 15:54:40.224138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating
TensorFlow device (/gpu:0)
-> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:36 phn mode, global_step:34755.0,total:4620,batch:40/144,epoch:11/200,train loss=82.963,mean train PER=0.013
Model has been saved in /home/chenjiasheng/log/timit/phn/save
2017-06-15 15:54:46.530416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:42 phn mode, global_step:34793.0,total:4620,batch:41/144,epoch:11/200,train loss=104.673,mean train PER=0.016
2017-06-15 15:54:50.600103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:46 phn mode, global_step:34818.0,total:4620,batch:42/144,epoch:11/200,train loss=59.382,mean train PER=0.009

Dead links in project page

A lot of the links on the first page of this repo return 404 errors. Nearly all links pointing inside the project are dead; those pointing to external resources are OK.

Not getting the specified accuracy

I do not get the accuracy you mentioned under "LibriSpeech recognition result without LM". Can you state the parameters you used and the exact training dataset names you used from the LibriSpeech corpus at http://www.openslr.org/12/?

unable to import utils

Hi, this is an awesome project, thanks a lot!
I use the most recent version but still encounter the "No module named utils.utils" problem. I saw it mentioned in a closed issue as well. Any idea how to solve this?

Have a Problem in timit_train.py

Hi,
when I try to run the command:
python3 timit_train.py --mode train --level cha --batch_size 8

it produces the following error:

2018-01-14 14:23:26.356409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-01-14 14:23:26.356440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
Epoch 1 ...
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "timit_train.py", line 255, in
runner.run()
File "timit_train.py", line 183, in run
feed_dict=feedDict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'capsule_cnn_layer_2/conv_kernel/read', defined at:
File "timit_train.py", line 255, in
runner.run()
File "timit_train.py", line 139, in run
model = model_fn(args, maxTimeSteps)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 114, in init
self.build_graph(self.args, self.maxTimeSteps)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 153, in build_graph
output = capLayer(output, [2, 2], (1,1,1,1), args.num_iter)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 87, in call
self._num_channelsself._num_capsulesself._output_vector_len], dtype=tf.float32)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 805, in _get_single_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 213, in init
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 356, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 125, in identity
return gen_array_ops.identity(input, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2071, in identity
"Identity", input=input, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Is there any other parameter that I should set?

Thanks!

error when running libri_train

Hello, when I run the libri_train code on the preprocessed LibriSpeech dataset, the following error appears:

Initializing
Epoch 1 ...
Traceback (most recent call last):
File "libri_train.py", line 281, in
runner.run()
File "libri_train.py", line 216, in run
feed_dict=feedDict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 961, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1720, 64, 39) for Tensor u'Placeholder:0', which has shape '(1720, 64, 60)'

It seems to be a phoneme-mapping problem. How can I solve it? Thanks for answering.

Error when running on the libri corpus

Hello!
When running your code from GitHub, libri_preprocess.py ran successfully, but running the training code afterwards produced an error, as shown below:
[gpu3@localhost main]$ python libri_train.py
Initializing
Epoch 1 ...
Traceback (most recent call last):
File "libri_train.py", line 281, in
runner.run()
File "libri_train.py", line 216, in run
feed_dict=feedDict)
File "/home/gpu3/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/home/gpu3/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 961, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1731, 64, 39) for Tensor u'Placeholder:0', which has shape '(1731, 64, 60)
Could you please tell me how to solve this?

How to evaluate?

Is there a CLI command for evaluating an utterance using one of the trained models?

why is "if er / batch_size == 1.0:" necessary?

Hi! I found "if er / batch_size == 1.0:" is necessary in the *_train.py file. If I delete it, then the errorRate will stay at 1.0 forever. I am very confused. Why this happens? 3ks~

SyntaxError: Missing parentheses in call to 'print'

Hi, after I followed the instructions and installed the packages, I got the error below when running the train command. Any idea? Thanks.

File "/usr/local/lib/python3.5/dist-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/utils/visualization.py", line 24
print 'Just mono files'
^
SyntaxError: Missing parentheses in call to 'print'

timit_preprocess.py

I can't read TIMIT; the problem is ValueError: File format 'NIST'... not understood. Can you help me?

Training on TIMIT Corpus

Hello

I am trying to use this library to train on the TIMIT corpus for phoneme classification. I am facing multiple issues while running the preprocessing script and the train script. It would be great if you could provide a step-by-step guide on how to run it. The main problem seems to be an inability to correctly import packages; also, the preprocessing script throws an error that the input directory doesn't exist, while it certainly does.

Thank you

Pretrained Model

I was wondering if a pretrained model could be made available. Specifically, I was hoping for the model used to generate the librispeech examples in the readme.

Thanks!

Running into many problems

I am a beginner, so I may run into many problems I cannot solve. So far I have hit:
1. In the module imports, core_rnn and impl could not be imported; I commented those lines out to continue.
2. return data_lists_to_batches([np.load(os.path.join(mfccPath, fn)) for fn in os.listdir(mfccPath)],
OSError: [Errno 2] No such file or directory: '/home/pony/github/data/timit/phn/train/mfcc'
I am stuck here and don't know what to do.

BN is suggested to be applied immediately before ReLU, not after.

layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True)

As in (Laurent et al., 2015), there are two ways of applying
BatchNorm to the recurrent operation. A natural extension
is to insert a BatchNorm transformation, B(), immediately
before every non-linearity as follows:
h[l, t] = f(B(W[l]*h[l-1, t] + U[l]*h[l, t-1]))

In this case the mean and variance statistics are accumulated
over a single time-step of the minibatch. We did not
find this to be effective.
An alternative (sequence-wise normalization) is to batch
normalize only the vertical connections. The recurrent
computation is given by
h[l, t] = f(B(W[l]*h[l-1, t]) + U[l]*h[l, t-1])

So should we set the activation of rnn_cell to None and apply the ReLU activation immediately after BN?
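
For illustration, a single recurrent step with sequence-wise BatchNorm (normalizing only the vertical connection, as in the quote above) might look like this sketch under the TF r1.x API; the function and its shapes are assumptions, not code from this repo:

import tensorflow as tf  # TF r1.x API

def bn_rnn_step(h_below, h_prev, W, U, is_training):
    # Sequence-wise variant: BatchNorm only the input (vertical) projection,
    # then add the untouched recurrent term and apply the non-linearity:
    #   h[l, t] = f(B(W[l]*h[l-1, t]) + U[l]*h[l, t-1])
    vertical = tf.layers.batch_normalization(tf.matmul(h_below, W),
                                             training=is_training)
    return tf.nn.relu(vertical + tf.matmul(h_prev, U))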

incorrect __init__.py

File Automatic_Speech_Recognition/speechvalley/feature/libri/init.py contain invalid file name

from speechvalley.feature.libri.libri_proprecess import preprocess, wav2feature

should be

from speechvalley.feature.libri.libri_preprocess import preprocess, wav2feature

which language is it for?

Sorry, I am a freshman in speech recognition.
I saw that "timit phonemes, it is 62; if timit characters, it is 29".
I want to know: is it for English?
If for Chinese, how many phonemes and characters should there be?
Really, thanks!

Chinese speech recognition

@zzw922cn I read the code and found that the timit files decompose English into phoneme- or character-level units and train them against the acoustic feature vectors, while the libri files train on vectors converted directly from the numeric encodings of the English text. What about Chinese, then? My understanding is that the acoustic feature vectors are trained directly against whole Chinese words. Maybe my understanding of the code is still incomplete, but could the author clear up this confusion for me?

package/lib requirements

I checked the requirements.txt, and it states that this project requires the following libraries or frameworks:

tabulate==0.7.7
theano==0.9.0
xlwt==1.2.0

Is theano really required, and what is it for? It will not be supported in the future, so I would like to confirm this.

Also, the other two packages look odd to me. What are they used for: creating tabular data and Excel files? Are they essential to this project?

The size of a model

Hi, I am wondering about the size of the model.
Hundreds of MB, or GB?
Can you tell me?
Thanks very much!

About the audiolab in this project

Sorry to bother you again, but I would really like to know how important the external lib scikits.audiolab is to this project.

I want to use Python 3, and I see that your code is not that hard to modify to support Python 3. However, scikits.audiolab only supports Python 2, hence my question. Can I find a replacement for it? Or where (in what functions) is the lib used in your project?

Thank you very much!

DeepSpeech2 model ValueError: Shape must be rank 4 but is rank 3 for 'Conv2D' (op: 'Conv2D') with input shapes:

Traceback (most recent call last):
File "main/timit_train.py", line 255, in
runner.run()
File "main/timit_train.py", line 144, in run
model = model_fn(args, maxTimeSteps)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 120, in init
self.build_graph(args, maxTimeSteps)
File "/mnt/Automatic_Speech_Recognition-master/utils/utils.py", line 32, in wrapper
result = func(*args, **kwargs)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 149, in build_graph
output_fc = build_deepSpeech2(self.args, maxTimeSteps, self.inputX, self.cell_fn, self.seqLengths)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 58, in build_deepSpeech2
layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 403, in conv2d
data_format=data_format, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2338, in create_op
set_shapes_for_outputs(ret)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1719, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1669, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
debug_python_shape_fn, require_shape_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 676, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Shape must be rank 4 but is rank 3 for 'Conv2D' (op: 'Conv2D') with input shapes: [778,32,39], [41,11,1,32].

need to preprocess data?

How many epochs are appropriate for LibriSpeech?

I ran an experiment on the LibriSpeech dataset: I trained for 15 epochs but got a 0.46 CER, and the results do not seem good. Could you tell me how many epochs are appropriate for LibriSpeech?

Small data

Just wondering if this also works well with small data. Say I have only a few hundred training instances and 20-30 test instances, because my task is quite specific, for one person only.

About reproducing the TIMIT experiment

Hello. While reproducing the DBRNN TIMIT experiment, the program fails to run (tensorflow==1.1.0, tensorflow-gpu==1.1.0; the feature-extraction step has already completed). How can this problem be solved?
(tensorflow35)jtang@nelslip-k40-server219:~/tfcode/Automatic_Speech_Recognition/speechvalley/main$ ./run_timit.sh
loop index: 2
test mode...
load_data...
load_data in 2.1332616806030273 s
build_graph...
Traceback (most recent call last):
File "timit_train.py", line 248, in
runner.run()
File "timit_train.py", line 136, in run
model = model_fn(args, maxTimeSteps)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/models/dynamic_brnn.py", line 87, in init
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/utils/utils.py", line 25, in wrapper
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/models/dynamic_brnn.py", line 118, in build_graph
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 167, in truncated_normal
shape_tensor = _ShapeTensor(shape)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 42, in _ShapeTensor
return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 639, in convert_to_tensor
as_ref=False)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 704, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 444, in make_tensor_proto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 444, in
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got 128

Deep Speech 2 build_graph not called

Hello,

In the Deep Speech model you defined in "models/deepSpeech2.py", the function build_graph is never called, which results in the following error:
"AttributeError: 'DeepSpeech2' object has no attribute 'var_trainable_op'"
Comparing with "dynamic_brnn.py", you should add "self.build_graph(args, maxTimeSteps)" in the class initializer.

Thanks,

New tf.Session() for each subdir?

Looking into the code, the model initializes a tf.Session() for every subdirectory (4000 samples). This doesn't make sense (at least to me). Also, there is no relation between the models for each of the subdirectories. I am referring to the ./Automatic_Speech_Recognition/main/libri_train.py file. Please explain.

build_deepSpeech2 function bug?

The first three conv layers expect the input to be laid out as [batch, freq_bin, time_len, in_channels]:

''' Parameters:

          maxTimeSteps: maximum time steps of input spectrogram power
          inputX: spectrogram power of audios, [batch, freq_bin, time_len, in_channels]
          seqLengths: lengths of samples in a mini-batch
   '''
# 3 2-D convolution layers
    layer1_filter = tf.get_variable('layer1_filter', shape=(41, 11, 1, 32), dtype=tf.float32)
    layer1_stride = [1, 2, 2, 1]
    layer2_filter = tf.get_variable('layer2_filter', shape=(21, 11, 32, 32), dtype=tf.float32)
    layer2_stride = [1, 2, 1, 1]
    layer3_filter = tf.get_variable('layer3_filter', shape=(21, 11, 32, 96), dtype=tf.float32)
    layer3_stride = [1, 2, 1, 1]
    layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
    layer1 = tf.layers.batch_normalization(layer1, training=args.is_training)
    layer1 = tf.contrib.layers.dropout(layer1, keep_prob=args.keep_prob[0], is_training=args.is_training)

    layer2 = tf.nn.conv2d(layer1, layer2_filter, layer2_stride, padding='SAME')
    layer2 = tf.layers.batch_normalization(layer2, training=args.isTraining)
    layer2 = tf.contrib.layers.dropout(layer2, keep_prob=args.keep_prob[1], is_training=args.is_training)

    layer3 = tf.nn.conv2d(layer2, layer3_filter, layer3_stride, padding='SAME')
    layer3 = tf.layers.batch_normalization(layer3, training=args.isTraining)
    layer3 = tf.contrib.layers.dropout(layer3, keep_prob=args.keep_prob[2], is_training=args.is_training)

However, the RNN layers expect the input to be laid out as [max_time, batch_size, ...]:

    # 4 recurrent layers
    # inputs must be [max_time, batch_size ,...]
    layer4_cell = cell_fn(args.num_hidden, activation=args.activation)
    layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True) 
    layer4 = tf.layers.batch_normalization(layer4, training=args.isTraining)
    layer4 = tf.contrib.layers.dropout(layer4, keep_prob=args.keep_prob[3], is_training=args.is_training)

And I don't see any transpose being made to put time on the first axis, so after the conv layers the tensor is organized as [batch, freq_bin, time_len, 96], where time is the third axis. So is this a bug?

Unable to reproduce the experimental results

Hello. I ran the TIMIT experiment with this code, but the error rate I get on the test set is around 0.35, far from the results in the chart on the project page. Also, the x-axis of the train/test error-rate curves in that chart is not clearly labeled: is it epochs, or something else? The parameters I am using are the defaults in the script: a 2-layer BLSTM with learning rate 0.0001. So I would like to ask how to reproduce the results shown in the figure. Are there other tricks, or different model parameters?
Many thanks

TRAIN CONFIGS

Thanks a lot for your amazing ASR repository, the best implementation of DeepSpeech there is. I would really appreciate it if you could help me with one issue.

I am trying to train Librispeech with your repo but my results are far from your results.

Could you please tell me the configs with which you got those results (#layers, lr, activation, rnncell, etc.)?

No module named utils

On running run_timit.sh I am getting the following error:

saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$ sudo pip install utils
The directory '/home/saurabh/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/saurabh/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied: utils in /usr/local/lib/python2.7/dist-packages
saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$ ./run_timit.sh
loop index: 2
Traceback (most recent call last):
  File "timit_train.py", line 27, in <module>
    from utils.utils import load_batched_data
ImportError: No module named utils
saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$ 

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Traceback (most recent call last):
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/libri/libri_preprocess.py", line 176, in
mode=mode, feature_len=feature_len, seq2seq=seq2seq, save=True)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/libri/libri_preprocess.py", line 90, in wav2feature
feat = calcfeat_delta_delta(sig,rate,win_length=win_len,win_step=win_step,mode=mode,feature_len=feature_len)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 68, in calcfeat_delta_delta
feat = calcMFCC(signal,samplerate,win_length,win_step,feature_len,filters_num,NFFT,low_freq,high_freq,pre_emphasis_coeff,cep_lifter,appendEnergy,mode=mode) # first get the 13 standard MFCC coefficients
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 118, in calcMFCC
feat,energy=fbank(signal,samplerate,win_length,win_step,filters_num,NFFT,low_freq,high_freq,pre_emphasis_coeff)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 151, in fbank
high_freq=high_freq or samplerate/2
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

When I run libri_preprocess.py, the above error occurs. How can I solve it?

Training on Isolated words

Hi,
I have been trying to train words like command, backspace, one, two, etc.
The preprocessing and training went well,
but the result of testing is what I suppose is a long sequence of phonemes.
Any suggestion to correct the output? I am attaching the output obtained:
cha_result.txt

License?

Hi, I was wondering if you could specify a license for usage?
