
g2p-seq2seq's Introduction


Sequence-to-Sequence G2P toolkit

The tool performs Grapheme-to-Phoneme (G2P) conversion using the Transformer model from the Tensor2Tensor toolkit [1]. Many approaches to sequence modeling and transduction problems use recurrent neural networks; the Transformer architecture instead eschews recurrence and relies entirely on an attention mechanism to draw global dependencies between input and output [2].

The implementation is based on TensorFlow, which allows efficient training on both CPU and GPU.

Installation

The tool requires TensorFlow version 1.8.0 or higher and Tensor2Tensor version 1.6.6 or higher. Please see the TensorFlow installation guide for TensorFlow installation details and the Tensor2Tensor guide for Tensor2Tensor installation details.
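A quick way to check that compatible versions are installed (a small sketch, assuming both packages are already on the Python path):

  import pkg_resources
  import tensorflow as tf

  # Print the installed versions; the tool expects TensorFlow >= 1.8.0
  # and Tensor2Tensor >= 1.6.6.
  print("TensorFlow:", tf.__version__)
  print("Tensor2Tensor:", pkg_resources.get_distribution("tensor2tensor").version)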

The g2p-seq2seq package itself uses setuptools, so you can follow the standard installation process:

sudo python setup.py install

You can also run the tests

python setup.py test

The runnable script g2p-seq2seq is installed in the /usr/local/bin folder by default (you can adjust this with setup.py options if needed). Make sure this folder is included in your PATH so you can run the script from the command line.

Running G2P

A pretrained 3-layer Transformer model with 256 hidden units is available for download from the CMUSphinx website. Unpack the model after downloading. The model is trained on the CMU English dictionary:

wget -O g2p-seq2seq-cmudict.tar.gz https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/g2p-seq2seq-model-6.2-cmudict-nostress.tar.gz/download
tar xf g2p-seq2seq-cmudict.tar.gz

The easiest way to check how the tool works is to run it in interactive mode and type words:

$ g2p-seq2seq --interactive --model_dir model_folder_path
...
> hello
...
Pronunciations: [HH EH L OW]
...
>

To generate pronunciations for an English word list with a trained model, run

  g2p-seq2seq --decode your_wordlist --model_dir model_folder_path [--output decode_output_file_path]

The wordlist is a text file with one word per line.

If you wish to list the top N decoding variants, set the --return_beams flag and specify --beam_size:

  g2p-seq2seq --decode your_wordlist --model_dir model_folder_path --return_beams --beam_size number_returned_beams [--output decode_output_file_path]

To evaluate the word error rate (WER) of a trained model, run

  g2p-seq2seq --evaluate your_test_dictionary --model_dir model_folder_path

The test dictionary should be a dictionary in standard format:

hello HH EH L OW
bye B AY

You may also calculate the word error rate considering all top N decoded results. In this case, a word counts as an error only if none of the decoded pronunciations matches the ground-truth pronunciation of the word.
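A minimal sketch of that rule (illustrative only; references maps each word to its ground-truth phone sequence and nbest maps each word to its list of top N decoded phone sequences):

  def wer_over_nbest(references, nbest):
    """Count a word as an error only if none of its N-best pronunciations matches."""
    errors = 0
    for word, ref_phones in references.items():
      beams = nbest.get(word, [])
      if not any(beam == ref_phones for beam in beams):
        errors += 1
    return float(errors) / len(references)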

Training G2P system

To train a G2P model you need a dictionary (one word and its phone sequence per line). See an example dictionary.

  g2p-seq2seq --train train_dictionary.dic --model_dir model_folder_path

You can set a maximum number of training epochs:

  "--max_epochs" - Maximum number of training epochs (Default: 0).
     If 0, train until no improvement is observed.

It is a good idea to experiment with the following parameters:

  "--size" - Size of each model layer (Default: 256).

  "--num_layers" - Number of layers in the model (Default: 3).

  "--filter_size" - Size of the feed-forward (filter) layer (Default: 512).

  "--num_heads" - Number of attention heads in the multi-head attention mechanism (Default: 4).

You can explicitly specify development and test datasets:

  "--valid" - Development dictionary (Default: created from train_dictionary.dic)
  "--test" - Test dictionary (Default: created from train_dictionary.dic)

Otherwise, the program will split the training dictionary itself. In the directory with the training data you will then find three data files with the extensions ".train", ".dev" and ".test".
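If you want to reproduce such a split yourself, a simple random split could look like this (an illustrative sketch; the proportions are assumptions, not the tool's exact scheme):

  import random

  def split_dictionary(lines, dev_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle dictionary lines and split them into train/dev/test lists."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_dev = int(len(lines) * dev_frac)
    n_test = int(len(lines) * test_frac)
    dev = lines[:n_dev]
    test = lines[n_dev:n_dev + n_test]
    train = lines[n_dev + n_test:]
    return train, dev, test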

If you have a raw dictionary with stress markers (for example, the CMU English dictionary), you may set the following parameter when launching training (a rough sketch of the cleanup follows below):

  "--cleanup" - Set to True to clean stress markers and comments out of the dictionary.

If you need to continue training a saved model, just point --model_dir at the directory with the existing model:

  g2p-seq2seq --train train_dictionary.dic --model_dir model_folder_path

And, if you want to start training from scratch:

  "--reinit" - Overwrite the model in model_folder_path.

Also, to solve the inverse problem:

  "--p2g" - Run the program in phoneme-to-grapheme conversion mode.

The differences in pronunciation between short and long words can be significant, so seq2seq models apply a bucketing technique to account for this. On the other hand, splitting the data into too many buckets can worsen the final results, because each bucket may then contain too few examples. To get better results, you may tune the following three parameters, which change the number and size of the buckets; a sketch of the boundary computation follows the list below:

  "--min_length_bucket" - the size of the minimal bucket (Default: 6)
  "--max_length" - maximal possible length of words or maximal number of phonemes in pronunciations (Default: 30)
  "--length_bucket_step" - multiplier that controls the number of length buckets in the data. The buckets have maximum lengths from min_bucket_length to max_length, increasing by factors of length_bucket_step (Default: 1.5)

After training the model, you may freeze it:

  g2p-seq2seq --model_dir model_folder_path --freeze

File "frozen_model.pb" will appear in "model_folder_path" directory after launching previous command. And now, if you run one of the decoding modes, The program will load and use this frozen graph.

Word error rate on CMU dictionary data sets

System                                 WER (CMUdict PRONASYL 2007), %   WER (CMUdict latest*), %
Baseline WFST (Phonetisaurus)          24.4                             33.89
Transformer (num_layers=3, size=256)   20.6                             30.2
* These results are for a dictionary without stress.

References


[1] Lukasz Kaiser. "Accelerating Deep Learning Research with the Tensor2Tensor Library." In Google Research Blog, 2017.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need." arXiv preprint arXiv:1706.03762, 2017.


g2p-seq2seq's Issues

WER 0.46?

I ran with 2 layers and 512 units but got nowhere close to the reported result.
Is this execution correct?
python -u g2p.py --train ../../cmudict/cmudict.dict --size 512
Preparing G2P data
Creating vocabularies in /tmp
Creating vocabulary /tmp/vocab.phoneme
Creating vocabulary /tmp/vocab.grapheme
Reading development and training data.
Creating 2 layers of 512 units.
Reading model parameters from /tmp/translate.ckpt-200
global step 400 learning rate 0.5000 step-time 2.78 perplexity 7.83
eval: bucket 0 perplexity 4.88
eval: bucket 1 perplexity 6.30
eval: bucket 2 perplexity 12.34
global step 600 learning rate 0.5000 step-time 2.71 perplexity 4.34
eval: bucket 0 perplexity 2.48
eval: bucket 1 perplexity 2.96
eval: bucket 2 perplexity 4.78
global step 800 learning rate 0.5000 step-time 2.63 perplexity 2.72
eval: bucket 0 perplexity 1.75
eval: bucket 1 perplexity 2.15
eval: bucket 2 perplexity 3.45
global step 1000 learning rate 0.5000 step-time 2.56 perplexity 2.26
eval: bucket 0 perplexity 1.65
eval: bucket 1 perplexity 1.84
eval: bucket 2 perplexity 3.17
global step 1200 learning rate 0.5000 step-time 2.68 perplexity 2.00
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.69
eval: bucket 2 perplexity 2.57
global step 1400 learning rate 0.5000 step-time 2.86 perplexity 1.84
eval: bucket 0 perplexity 1.48
eval: bucket 1 perplexity 1.70
eval: bucket 2 perplexity 2.15
global step 1600 learning rate 0.5000 step-time 3.40 perplexity 1.76
eval: bucket 0 perplexity 1.65
eval: bucket 1 perplexity 1.67
eval: bucket 2 perplexity 2.18
global step 1800 learning rate 0.5000 step-time 3.65 perplexity 1.71
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.79
eval: bucket 2 perplexity 2.04
global step 2000 learning rate 0.5000 step-time 2.68 perplexity 1.56
eval: bucket 0 perplexity 1.30
eval: bucket 1 perplexity 1.53
eval: bucket 2 perplexity 1.83
global step 2200 learning rate 0.5000 step-time 3.33 perplexity 1.61
eval: bucket 0 perplexity 1.50
eval: bucket 1 perplexity 1.66
eval: bucket 2 perplexity 1.70
global step 2400 learning rate 0.5000 step-time 3.01 perplexity 1.52
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.47
eval: bucket 2 perplexity 1.79
global step 2600 learning rate 0.5000 step-time 3.09 perplexity 1.53
eval: bucket 0 perplexity 1.34
eval: bucket 1 perplexity 1.57
eval: bucket 2 perplexity 1.90
global step 2800 learning rate 0.5000 step-time 2.92 perplexity 1.49
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.67
eval: bucket 2 perplexity 1.85
global step 3000 learning rate 0.5000 step-time 2.82 perplexity 1.44
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.55
eval: bucket 2 perplexity 1.81
global step 3200 learning rate 0.5000 step-time 2.68 perplexity 1.43
eval: bucket 0 perplexity 1.49
eval: bucket 1 perplexity 1.35
eval: bucket 2 perplexity 1.87
global step 3400 learning rate 0.5000 step-time 2.90 perplexity 1.41
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.56
eval: bucket 2 perplexity 1.73
global step 3600 learning rate 0.5000 step-time 2.79 perplexity 1.40
eval: bucket 0 perplexity 1.27
eval: bucket 1 perplexity 1.32
eval: bucket 2 perplexity 1.59
global step 3800 learning rate 0.5000 step-time 2.87 perplexity 1.38
eval: bucket 0 perplexity 1.52
eval: bucket 1 perplexity 1.46
eval: bucket 2 perplexity 1.52
global step 4000 learning rate 0.5000 step-time 2.74 perplexity 1.36
eval: bucket 0 perplexity 1.49
eval: bucket 1 perplexity 1.41
eval: bucket 2 perplexity 1.83
global step 4200 learning rate 0.5000 step-time 2.80 perplexity 1.37
eval: bucket 0 perplexity 1.23
eval: bucket 1 perplexity 1.36
eval: bucket 2 perplexity 1.58
global step 4400 learning rate 0.5000 step-time 2.94 perplexity 1.36
eval: bucket 0 perplexity 1.58
eval: bucket 1 perplexity 1.53
eval: bucket 2 perplexity 1.73
global step 4600 learning rate 0.5000 step-time 3.16 perplexity 1.35
eval: bucket 0 perplexity 1.25
eval: bucket 1 perplexity 1.54
eval: bucket 2 perplexity 1.58
global step 4800 learning rate 0.5000 step-time 2.74 perplexity 1.33
eval: bucket 0 perplexity 1.44
eval: bucket 1 perplexity 1.60
eval: bucket 2 perplexity 1.72
global step 5000 learning rate 0.5000 step-time 2.77 perplexity 1.33
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.38
eval: bucket 2 perplexity 1.60
global step 5200 learning rate 0.5000 step-time 2.97 perplexity 1.32
eval: bucket 0 perplexity 1.29
eval: bucket 1 perplexity 1.41
eval: bucket 2 perplexity 1.66
global step 5400 learning rate 0.5000 step-time 2.77 perplexity 1.30
eval: bucket 0 perplexity 1.31
eval: bucket 1 perplexity 1.52
eval: bucket 2 perplexity 1.45
global step 5600 learning rate 0.5000 step-time 2.80 perplexity 1.30
eval: bucket 0 perplexity 1.31
eval: bucket 1 perplexity 1.28
eval: bucket 2 perplexity 1.75
global step 5800 learning rate 0.5000 step-time 2.64 perplexity 1.29
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.33
eval: bucket 2 perplexity 1.41
global step 6000 learning rate 0.5000 step-time 2.76 perplexity 1.28
eval: bucket 0 perplexity 1.26
eval: bucket 1 perplexity 1.39
eval: bucket 2 perplexity 1.48
global step 6200 learning rate 0.5000 step-time 2.55 perplexity 1.28
eval: bucket 0 perplexity 1.37
eval: bucket 1 perplexity 1.37
eval: bucket 2 perplexity 1.67
global step 6400 learning rate 0.5000 step-time 2.68 perplexity 1.26
eval: bucket 0 perplexity 1.23
eval: bucket 1 perplexity 1.50
eval: bucket 2 perplexity 1.44
global step 6600 learning rate 0.5000 step-time 2.98 perplexity 1.26
eval: bucket 0 perplexity 1.12
eval: bucket 1 perplexity 1.54
eval: bucket 2 perplexity 1.47
global step 6800 learning rate 0.5000 step-time 2.87 perplexity 1.26
eval: bucket 0 perplexity 1.22
eval: bucket 1 perplexity 1.29
eval: bucket 2 perplexity 1.56
global step 7000 learning rate 0.5000 step-time 2.81 perplexity 1.26
eval: bucket 0 perplexity 1.22
eval: bucket 1 perplexity 1.45
eval: bucket 2 perplexity 1.54
global step 7200 learning rate 0.5000 step-time 2.76 perplexity 1.25
eval: bucket 0 perplexity 1.35
eval: bucket 1 perplexity 1.46
eval: bucket 2 perplexity 1.40
global step 7400 learning rate 0.5000 step-time 3.06 perplexity 1.24
eval: bucket 0 perplexity 1.18
eval: bucket 1 perplexity 1.26
eval: bucket 2 perplexity 1.48
global step 7600 learning rate 0.5000 step-time 3.15 perplexity 1.25
eval: bucket 0 perplexity 1.47
eval: bucket 1 perplexity 1.31
eval: bucket 2 perplexity 1.50
global step 7800 learning rate 0.5000 step-time 3.13 perplexity 1.24
eval: bucket 0 perplexity 1.50
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.46
global step 8000 learning rate 0.5000 step-time 2.76 perplexity 1.23
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.37
eval: bucket 2 perplexity 1.47
global step 8200 learning rate 0.5000 step-time 2.64 perplexity 1.22
eval: bucket 0 perplexity 1.30
eval: bucket 1 perplexity 1.25
eval: bucket 2 perplexity 1.59
global step 8400 learning rate 0.5000 step-time 2.38 perplexity 1.23
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.45
global step 8600 learning rate 0.5000 step-time 2.53 perplexity 1.21
eval: bucket 0 perplexity 1.42
eval: bucket 1 perplexity 1.33
eval: bucket 2 perplexity 1.39
global step 8800 learning rate 0.5000 step-time 2.58 perplexity 1.21
eval: bucket 0 perplexity 1.21
eval: bucket 1 perplexity 1.31
eval: bucket 2 perplexity 1.50
global step 9000 learning rate 0.5000 step-time 2.88 perplexity 1.21
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.30
eval: bucket 2 perplexity 1.57
global step 9200 learning rate 0.5000 step-time 3.03 perplexity 1.21
eval: bucket 0 perplexity 1.47
eval: bucket 1 perplexity 1.45
eval: bucket 2 perplexity 1.38
global step 9400 learning rate 0.5000 step-time 2.77 perplexity 1.20
eval: bucket 0 perplexity 1.39
eval: bucket 1 perplexity 1.29
eval: bucket 2 perplexity 1.55
global step 9600 learning rate 0.5000 step-time 2.86 perplexity 1.19
eval: bucket 0 perplexity 1.53
eval: bucket 1 perplexity 1.35
eval: bucket 2 perplexity 1.46
global step 9800 learning rate 0.5000 step-time 2.87 perplexity 1.19
eval: bucket 0 perplexity 1.43
eval: bucket 1 perplexity 1.43
eval: bucket 2 perplexity 1.80
global step 10000 learning rate 0.5000 step-time 2.74 perplexity 1.18
eval: bucket 0 perplexity 1.36
eval: bucket 1 perplexity 1.50
eval: bucket 2 perplexity 1.45
Training process stopped.
Beginning calculation word error rate (WER) on test sample.
WER : 0.469490521327
Accuracy : 0.530509478673

Supported languages.

Does it support the Arabic language? If not, what alternative method could be used for text-to-phoneme conversion of Arabic text?

Review training stop criteria

Many consecutive steps show the same perplexity before training ends. The number of steps could therefore be reduced significantly if we stopped once we see the same perplexity about 4 times in a row; it should not affect the accuracy.
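A minimal sketch of such a stopping rule (illustrative; eval_perplexities is assumed to be the list of evaluation perplexities collected so far):

  def should_stop(eval_perplexities, patience=4, tol=1e-3):
    """Stop when the last `patience` eval perplexities are essentially identical."""
    if len(eval_perplexities) < patience:
      return False
    recent = eval_perplexities[-patience:]
    return max(recent) - min(recent) < tol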

Float division by zero

g2p-seq2seq --evaluate NEWARABIC/test.wordlist --model NEWARABIC
Creating 2 layers of 64 units.
Reading model parameters from NEWARABIC
Beginning calculation word error rate (WER) on test sample.
Words : 0
Errors: 0
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 81, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 348, in evaluate
ZeroDivisionError: float division by zero

When I decode the same wordlist, it works fine.

No need for two-pass loops

Here you can do it in a single pass, and there is no need for the intermediate list:

  lst = []
  for line in inp_dictionary:
    lst.append(line.strip().split())

  graphemes, phonemes = [], []
  for line in lst:
    if len(line)>1:
      graphemes.append(list(line[0]))
      phonemes.append(line[1:])
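A single-pass version might look like this (a sketch reusing the names from the snippet above):

  graphemes, phonemes = [], []
  for line in inp_dictionary:
    parts = line.strip().split()
    if len(parts) > 1:
      graphemes.append(list(parts[0]))
      phonemes.append(parts[1:])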

Where is the Train Model file?

I used the following command to train a G2P model:
python g2p.py --train /home/cmudict.dict --model /home/MyModel --max_steps 8400

here is the log:

Preparing G2P data
Creating vocabularies in /home/MyModel
Creating vocabulary /home/MyModel/vocab.phoneme
Creating vocabulary /home/MyModel/vocab.grapheme
Reading development and training data.
Creating 2 layers of 64 units.
........
Reading model parameters from /home/MyModel/translate.ckpt-8200
global step 8400 learning rate 0.4901 step-time 3.43 perplexity 1.37
  eval: bucket 0 perplexity 1.46
  eval: bucket 1 perplexity 1.29
  eval: bucket 2 perplexity 1.47
Training process stopped.
Beginning calculation word error rate (WER) on test sample.
WER :  0.4961492891
Accuracy :  0.5038507109

In MyModel directory there are so many generated files present, but there is no "model" file.

translate.ckpt-200
translate.ckpt-200.meta
translate.ckpt-400
translate.ckpt-400.meta
translate.ckpt-600
translate.ckpt-600.meta
translate.ckpt-7200
translate.ckpt-7200.meta
translate.ckpt-7400
translate.ckpt-7400.meta
translate.ckpt-7600
translate.ckpt-7600.meta
translate.ckpt-7800
translate.ckpt-7800.meta
translate.ckpt-8000
translate.ckpt-8000.meta
translate.ckpt-8200
translate.ckpt-8200.meta
translate.ckpt-8400
translate.ckpt-8400.meta
model.params
vocab.phoneme
vocab.grapheme
translate.ckpt-8600
translate.ckpt-8600.meta
translate.ckpt-8800
checkpoint
translate.ckpt-8800.meta

Where do I get that "model" file?
Or do I have to rename the file translate.ckpt-8800 to "model"?

Provide a link to reference dictionary

Also update the error rates in the README. The Phonetisaurus error rate on this set is also 24.4%; on the latest cmudict it is 33.89%. Provide our results on the latest cmudict as well.

Last word in the list is skipped

[shmyrev@alpha g2p_seq2seq]$ cat > word.list
hello
world
how
are 
you
[shmyrev@alpha g2p_seq2seq]$ python g2p.py --model /home/shmyrev/cmudict-g2p-model --decode word.list
HH EH L OW
W ER L D
HH AW
AA R

The last word is missing.

Moreover, each line of the output should contain the word, not just the phonemes. It should create a ready-to-use dictionary:

[shmyrev@alpha g2p_seq2seq]$ python g2p.py --model /home/shmyrev/cmudict-g2p-model --decode word.list
hello HH EH L OW
world W ER L D
how HH AW
are AA R
you Y UW

PER?

Is it possible to get the phone error rate in addition to the word error rate?
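Phone error rate is usually the total edit distance between decoded and reference phone sequences divided by the total number of reference phones; a minimal sketch of that computation (not part of the tool):

  def edit_distance(hyp, ref):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
      cur = [i]
      for j, r in enumerate(ref, 1):
        cur.append(min(prev[j] + 1,              # deletion
                       cur[j - 1] + 1,           # insertion
                       prev[j - 1] + (h != r)))  # substitution or match
      prev = cur
    return prev[-1]

  def phone_error_rate(pairs):
    """pairs: iterable of (decoded_phones, reference_phones) lists."""
    edits = sum(edit_distance(h, r) for h, r in pairs)
    total = sum(len(r) for _, r in pairs)
    return float(edits) / total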

Train with short dictionaries

Traceback (most recent call last):
  File "g2p.py", line 442, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/default/_app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "g2p.py", line 425, in main
    g2p_model.train(g2p_params, FLAGS.train, FLAGS.valid, FLAGS.test)
  File "g2p.py", line 243, in train
    self.__run_evals()
  File "g2p.py", line 269, in __run_evals
    self.valid_set, bucket_id)
  File "/usr/lib/python2.7/site-packages/tensorflow/models/rnn/translate/seq2seq_model.py", line 252, in get_batch
    encoder_input, decoder_input = random.choice(data[bucket_id])
  File "/usr/lib64/python2.7/random.py", line 274, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
IndexError: list index out of range

Check size for embedding layer

When you convert letter and phoneme symbols to numerical ids, isn't it confusing for the model to train with integers as class labels? Would it be better to use one-hot encoding, or maybe even letter embeddings, to make the distances between letters or phonemes more meaningful?
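For what it's worth, in typical seq2seq and Transformer implementations the integer ids are not fed to the network directly; they only index a learned embedding table, which already gives trainable letter/phoneme vectors. A rough TensorFlow 1.x sketch (the sizes are made up for illustration):

  import tensorflow as tf

  vocab_size, embed_dim = 45, 64                     # assumed sizes
  ids = tf.placeholder(tf.int32, shape=[None])       # integer grapheme/phoneme ids
  embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
  vectors = tf.nn.embedding_lookup(embedding, ids)   # each id becomes a trainable vector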

Zero accuracy

I run

python /home/ubuntu/g2p-seq2seq/g2p_seq2seq/g2p.py --train cmudict.dict --num_layers 4 --size 64 --model model

I get

WER :  0.964269283852
Accuracy :  0.0357307161478

coding problem

Dear All,

I got an encoding error in the test phase (training and interactive phases were both fine). My training dictionary is a mixture of cmudict (ASCII) and Chinese (UTF-8) lexicons. What should I do? Should I convert all cmudict entries to UTF-8?

Thanks a lot in advance!

Here is the log:

global step 91200 learning rate 0.2425 step-time 0.13 perplexity 1.02
Training done.
Creating 2 layers of 512 units.
Reading model parameters from g2p-seq2seq-oc16
Beginning calculation word error rate (WER) on test sample.
Traceback (most recent call last):
File "/home/liao/anaconda3/envs/python2.7/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 67, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 234, in train
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 347, in evaluate
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 323, in calc_error
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 279, in decode_word
UnicodeEncodeError: 'ascii' codec can't encode character u'\u86c8' in position 9: ordinal not in range(128)

Here is my training dictionary:

瘦西湖 sh ou4 x i1 h u2
睃 s uo1
supercuts S UW1 P ER0 K AH2 T S
电报机 d ian4 b ao4 j i1
galka G AE1 L K AH0
知 zh ix4
Unipus Y UW1 N IH0 P AH0 S
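One common workaround in Python 2 (an assumption, not a confirmed fix for this version of the tool) is to read and write all dictionary files explicitly as UTF-8 instead of relying on the default ASCII codec, e.g.:

  import io

  def write_utf8(path, lines):
    """Write unicode lines as UTF-8 so mixed ASCII/Chinese entries do not raise UnicodeEncodeError."""
    with io.open(path, "w", encoding="utf-8") as out:
      for line in lines:          # lines are assumed to be unicode strings
        out.write(line + u"\n")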

Fix bad code patterns

Never convert an integer to a string only to convert it back to an integer later; this is very inefficient.

Never join list items into a string only to split them and join them again later.

Remove code that is not used.

Avoid redundant dictionary construction

If you only need the direct and reversed dictionaries, it is better to change this method:

def initialize_vocabulary(vocabulary_path):
  """Initialize vocabulary from file.
  We assume the vocabulary is stored one-item-per-line, so a file:
    d
    c
  will result in a vocabulary {"d": 0, "c": 1}, and this function will
  also return the reversed-vocabulary ["d", "c"].

To a method with an optional reverse parameter:

def load_vocab(vocabulary_path, reverse=False)

This method should return only one vocabulary, direct or reversed, based on the optional flag.
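A sketch of the proposed method (illustrative only):

  def load_vocab(vocabulary_path, reverse=False):
    """Load a one-item-per-line vocabulary; return {item: id}, or [item, ...] if reverse."""
    with open(vocabulary_path) as f:
      items = [line.strip() for line in f]
    if reverse:
      return items                                   # position in the list is the id
    return dict((item, idx) for idx, item in enumerate(items))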

test accuracy with trained model

I want to test the accuracy of the trained model on cmudict.
Are there any standard training, validation, and test dictionaries for this task?

How is it compared fairly in papers if there are no standard partitions?
Thanks a lot for this code.

Running question for this command (g2p-seq2seq --interactive --model model_folder_path)

sam@speechws13:~/g2p-seq2seq-master$ g2p-seq2seq --interactive --model g2p-seq2seq-cmudict/g2p-seq2seq-cmudict/modle
Traceback (most recent call last):
  File "/usr/local/bin/g2p-seq2seq", line 9, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/__init__.py", line 23, in <module>
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 36, in <module>
ImportError: No module named data_utils

How can I fix this problem?
Thank you~

Logic is not clear

  create_vocabulary(ph_vocab_path, train_ph)
  create_vocabulary(gr_vocab_path, train_gr)

  # Initialize vocabularies.
  ph_vocab = initialize_vocabulary(ph_vocab_path, False)
  gr_vocab = initialize_vocabulary(gr_vocab_path, False)

Why do you need to initialize the vocabulary after you have created it? The logic should be more straightforward: first build the vocabulary, then save it; then there is no need to reload it again.
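In other words, create_vocabulary could simply return the vocabulary it just built, something like (a sketch using the names from the snippet above):

  def create_vocabulary(vocab_path, sequences):
    """Build a symbol vocabulary from training sequences, save it, and return it."""
    vocab = {}
    for seq in sequences:
      for symbol in seq:
        vocab.setdefault(symbol, len(vocab))
    with open(vocab_path, "w") as out:
      for symbol in sorted(vocab, key=vocab.get):
        out.write(symbol + "\n")
    return vocab

  ph_vocab = create_vocabulary(ph_vocab_path, train_ph)
  gr_vocab = create_vocabulary(gr_vocab_path, train_gr)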

Time to restore saved model?

In g2p.py I added time.time() calls around the command

self.model.saver.restore(self.session, os.path.join(self.model_dir, "model"))

to see how long it takes to load a pretrained model for decoding words. With a model trained with 512 nodes I get:

Time to load model: 2.53336596489

with only 64 nodes I don't get much savings:

Time to load model: 2.50763916969

which, according to the Python time module, is in seconds. That seems really slow. I am using the CPU instead of the GPU because, if we end up including a similar NN model in our software, we won't have any GPU power on our servers. Still, when I compare it with our current OpenFst implementation of an n-gram model, that one takes only 300 ms (0.3 s) to load in C++.

It may be faster if I can restore the saved model from C++, but I'll have to look into writing code to allow that.

training model did not improve accuracy!

I was training a new model on the CMUSphinx dictionary with --max_steps 10000 --size 512 --num_layers 3 --learning_rate 0.5. After the training finished, I got this output from the trained model:

a
M HH HH HH HH UH UH UH UH UH
b
M M UH UH UH UH UH UH UH UH
c
M M M UH UH UH UH UH UH UH
d
M M M M UH UH UH UH UH UH
hello
M HH HH HH HH UW UW UW UW M M M M M M
aa
HH HH HH HH HH HH HH HH UH UH

Is there anything wrong with my approach?

This was the last output of the training run:
global step 10000 learning rate 0.4000 step-time 3.48 perplexity 1.15
eval: bucket 0 perplexity 1.40
eval: bucket 1 perplexity 1.25
eval: bucket 2 perplexity 1.34

Cleanup main function

Move

    train_gr, train_ph = data_utils.split_to_grapheme_phoneme(train_dic)
    valid_gr, valid_ph = data_utils.split_to_grapheme_phoneme(valid_dic)
    test_gr, test_ph = data_utils.split_to_grapheme_phoneme(test_dic)

from the main function to the train function.

TypeError in g2p-seq2seq --interactive

Hi,
thanks so much for this great project!

I have it running in --decode mode but run into this error in --interactive mode, where I receive this message:

$ sudo g2p-seq2seq --interactive --model g2p-seq2seq-cmudict
Creating 2 layers of 512 units.
Reading model parameters from g2p-seq2seq-cmudict
> hello
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/g2p_seq2seq-5.0.0a0-py3.5.egg/g2p_seq2seq/app.py", line 78, in main
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/g2p_seq2seq-5.0.0a0-py3.5.egg/g2p_seq2seq/g2p.py", line 308, in interactive
TypeError: decoding str is not supported

Sorry if this is a newbie error. Any help much appreciated :)

List index out of range

I am trying to do everything right but this error still persists.

Creating 2 layers of 64 units.
Created model with fresh parameters.
global step 200 learning rate 0.5000 step-time 3.09 perplexity 1.57
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 67, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 217, in train
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 253, in __run_evals
File "/usr/local/lib/python2.7/dist-packages/tensorflow/models/rnn/translate/seq2seq_model.py", line 250, in get_batch
encoder_input, decoder_input = random.choice(data[bucket_id])
File "/usr/lib/python2.7/random.py", line 275, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
