
syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis".

Python 100.00%
expressive-speech-synthesis expressive-tacotron global-style-tokens gst-tacotron tacotron

gst-tacotron's People

Contributors

cobr123, jsonko, kokimame, npuichigo, syang1993


gst-tacotron's Issues

Training stalls for many seconds while creating a new queue of data

Hi,
When I start training, the per-step time is good and the GPU is used, but after 32 steps (with _batches_per_group=32 in the datafeeder) GPU utilization drops to 0, and only after many seconds does the data queue become ready and training resume.
I looked at datafeeder.py and it uses threading to fill the data queue, so why does it stall training? How can I increase GPU utilization?
I increased _batches_per_group in datafeeder.py, but that only makes the stall longer: with _batches_per_group=128, training stalls for 89 seconds (see screenshot).

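A likely cause is that the feeder builds an entire group of batches synchronously before anything can be enqueued. A minimal, repo-agnostic sketch of the usual fix, assuming nothing about this datafeeder's internals: prepare batches in a background thread so the training loop only ever waits on a bounded queue (make_batch and train_step are hypothetical placeholders):

import threading
import queue

prepared_batches = queue.Queue(maxsize=8)  # bounded, so memory stays flat

def producer(make_batch, stop_event):
    # Keep the queue topped up while the GPU consumes batches; put() blocks
    # whenever the queue is full, so this thread never races ahead.
    while not stop_event.is_set():
        prepared_batches.put(make_batch())

def training_loop(make_batch, train_step, num_steps):
    stop = threading.Event()
    threading.Thread(target=producer, args=(make_batch, stop), daemon=True).start()
    for _ in range(num_steps):
        batch = prepared_batches.get()  # returns immediately if the producer keeps up
        train_step(batch)
    stop.set()  # producer is a daemon thread, so a blocked put() won't prevent exit

With this shape, loading overlaps training instead of alternating with it, which is what the 0% GPU utilization gaps suggest is happening.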

Throws "data must be floating-point" exception after 1k steps

Running on the LJ dataset.
This is the line where it breaks:

audio.save_wav(waveform, os.path.join(log_dir, 'step-%d-audio.wav' % step))

Starting new training run at commit: None
Generated 32 batches of size 32 in 39.301 sec
Step 1 [43.557 sec/step, loss=0.84572, avg_loss=0.84572]
Step 2 [23.415 sec/step, loss=0.85437, avg_loss=0.85004]
........
........
Step 998 [2.387 sec/step, loss=0.14099, avg_loss=0.14424]
Step 999 [2.387 sec/step, loss=0.14100, avg_loss=0.14422]
Step 1000 [2.380 sec/step, loss=0.14311, avg_loss=0.14418]
Writing summary at step: 1000
Saving checkpoint to: /media/iedc-beast/Disk 1/test/gst-tacotron-master/logs-tacotron/model.ckpt-1000
Saving audio and alignment...
Exiting due to exception: data must be floating-point
Traceback (most recent call last):
File "train.py", line 115, in train
audio.save_wav(waveform, os.path.join(log_dir, 'step-%d-audio.wav' % step))
File "/media/iedc-beast/Disk 1/test/gst-tacotron-master/util/audio.py", line 16, in save_wav
librosa.output.write_wav(path, wav.astype(np.int16), hparams.sample_rate)
File "/usr/local/lib/python3.5/dist-packages/librosa/output.py", line 223, in write_wav
util.valid_audio(y, mono=False)
File "/usr/local/lib/python3.5/dist-packages/librosa/util/utils.py", line 159, in valid_audio
raise ParameterError('data must be floating-point')
librosa.util.exceptions.ParameterError: data must be floating-point
2018-11-24 16:41:57.082342: W tensorflow/core/kernels/queue_base.cc:277] _0_datafeeder/input_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[{{node datafeeder/input_queue_enqueue}} = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/input_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/linear_targets_0_2)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/media/iedc-beast/Disk 1/test/gst-tacotron-master/datasets/datafeeder.py", line 75, in run
self._enqueue_next_group()
File "/media/iedc-beast/Disk 1/test/gst-tacotron-master/datasets/datafeeder.py", line 97, in _enqueue_next_group
self._session.run(self._enqueue_op, feed_dict=feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[{{node datafeeder/input_queue_enqueue}} = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/input_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/linear_targets_0_2)]]

Caused by op 'datafeeder/input_queue_enqueue', defined at:
File "train.py", line 153, in
main()
File "train.py", line 149, in main
train(log_dir, args)
File "train.py", line 58, in train
feeder = DataFeeder(coord, input_path, hparams)
File "/media/iedc-beast/Disk 1/test/gst-tacotron-master/datasets/datafeeder.py", line 46, in init
self._enqueue_op = queue.enqueue(self._placeholders)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 339, in enqueue
self._queue_ref, vals, name=scope)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3978, in queue_enqueue_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3259, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1747, in init
self._traceback = tf_stack.extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
[[{{node datafeeder/input_queue_enqueue}} = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/input_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/linear_targets_0_2)]]
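The traceback shows save_wav casting to int16 before calling librosa.output.write_wav, which validates that its input is floating-point. A hedged fix sketch for util/audio.py: write the int16 samples with scipy instead, whose wavfile.write accepts integer arrays (the hparams import mirrors what the traceback implies the module already uses):

import numpy as np
from scipy.io import wavfile

from hparams import hparams

def save_wav(wav, path):
    # Normalize into the int16 range, then write with scipy, which (unlike
    # librosa.output.write_wav) accepts integer sample arrays.
    wav = wav * (32767 / max(0.01, np.max(np.abs(wav))))
    wavfile.write(path, hparams.sample_rate, wav.astype(np.int16))

Alternatively, keeping librosa, passing wav.astype(np.float32) instead of np.int16 should satisfy the same check.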

Path for Reference Audio

If I want to implement the second paper by setting the use_gst hyperparameter to false, where do I provide the path to the reference audio for training?

Check failed: dnnReLUCreateBackward_F32

Hello :D

I'm trying to use gst-tacotron with the Blizzard Challenge 2013 dataset.

When I try to train, I hit this check-failed error (the same happens with the use_gst=True option).

I'm just trying to train, so I didn't change the base code. Could I get some ideas for solving this problem?

Here is my log:


1. When I use the gst_false option:
gst-tacotron_gst_false# python train.py
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/root/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Checkpoint path: /data/workspace/blizzardchllenge/gst-tacotron_gst_false/logs-tacotron/model.ckpt
Loading training data from: /data/workspace/blizzardchllenge/gst-tacotron_gst_false/training/train.txt
Using model: tacotron
Hyperparameters:
adam_beta1: 0.9
adam_beta2: 0.999
attention_depth: 256
batch_size: 32
cleaners: english_cleaners
decay_learning_rate: True
embed_depth: 256
encoder_depth: 256
frame_length_ms: 50
frame_shift_ms: 12.5
griffin_lim_iters: 60
initial_learning_rate: 0.002
max_iters: 1000
min_level_db: -100
num_freq: 1025
num_gst: 10
num_heads: 4
num_mels: 80
outputs_per_step: 2
power: 1.5
preemphasis: 0.97
prenet_depths: [256, 128]
ref_level_db: 20
reference_depth: 128
reference_filters: [32, 32, 64, 64, 128, 128]
rnn_depth: 256
sample_rate: 16000
style_att_dim: 128
style_att_type: mlp_attention
style_embed_depth: 256
use_cmudict: False
use_gst: False
Loaded metadata for 9725 examples (20.13 hours)
Initialized Tacotron model. Dimensions:
text embedding: 256
style embedding: 128
prenet out: 128
encoder out: 384
attention out: 256
concat attn & out: 640
decoder cell out: 256
decoder out (2 frames): 160
decoder out (1 frame): 80
postnet out: 256
linear out: 1025
Starting new training run at commit: None
Generated 32 batches of size 32 in 90.126 sec
Step 1 [139.284 sec/step, loss=0.87672, avg_loss=0.87672]
Step 2 [130.008 sec/step, loss=0.97632, avg_loss=0.92652]
Step 3 [141.618 sec/step, loss=0.98165, avg_loss=0.94490]
Step 4 [194.484 sec/step, loss=0.99856, avg_loss=0.95831]
Step 5 [177.694 sec/step, loss=0.95613, avg_loss=0.95788]
2019-12-03 09:52:09.674825: F tensorflow/core/kernels/mkl_relu_op.cc:328] Check failed: dnnReLUCreateBackward_F32(&mkl_context.prim_relu_bwd, __null, mkl_context.lt_grad, mkl_context.lt_grad, negative_slope) == E_SUCCESS (-1 vs. 0)
Aborted (core dumped)


2. When I use the gst_true option:

gst-tacotron_gst_true/training/train.txt
Using model: tacotron
Hyperparameters:
adam_beta1: 0.9
adam_beta2: 0.999
attention_depth: 256
batch_size: 32
cleaners: english_cleaners
decay_learning_rate: True
embed_depth: 256
encoder_depth: 256
frame_length_ms: 50
frame_shift_ms: 12.5
griffin_lim_iters: 60
initial_learning_rate: 0.002
max_iters: 1000
min_level_db: -100
num_freq: 1025
num_gst: 10
num_heads: 4
num_mels: 80
outputs_per_step: 2
power: 1.5
preemphasis: 0.97
prenet_depths: [256, 128]
ref_level_db: 20
reference_depth: 128
reference_filters: [32, 32, 64, 64, 128, 128]
rnn_depth: 256
sample_rate: 16000
style_att_dim: 128
style_att_type: mlp_attention
style_embed_depth: 256
use_cmudict: False
use_gst: True
Loaded metadata for 9725 examples (20.13 hours)
WARNING:tensorflow:From /data/workspace/blizzardchllenge/gst-tacotron_gst_true/models/multihead_attention.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Initialized Tacotron model. Dimensions:
text embedding: 256
style embedding: 256
prenet out: 128
encoder out: 512
attention out: 256
concat attn & out: 768
decoder cell out: 256
decoder out (2 frames): 160
decoder out (1 frame): 80
postnet out: 256
linear out: 1025
Starting new training run at commit: None
Generated 32 batches of size 32 in 2.063 sec
Step 1 [190.721 sec/step, loss=0.87613, avg_loss=0.87613]
Step 2 [105.134 sec/step, loss=0.78472, avg_loss=0.83042]
Step 3 [87.687 sec/step, loss=0.86729, avg_loss=0.84271]
Step 4 [81.866 sec/step, loss=0.88327, avg_loss=0.85285]
Step 5 [73.656 sec/step, loss=0.85281, avg_loss=0.85284]
Step 6 [76.789 sec/step, loss=0.87447, avg_loss=0.85645]

2019-12-03 10:45:01.889230: F tensorflow/core/kernels/mkl_relu_op.cc:328] Check failed: dnnReLUCreateBackward_F32(&mkl_context.prim_relu_bwd, __null, mkl_context.lt_grad, mkl_context.lt_grad, negative_slope) == E_SUCCESS (-1 vs. 0)
Aborted (core dumped)


thank you :D

Sample Alignment Graph

Hi,

Can you share the alignment graphs you obtain for your audio samples? For most of my alignments, the range of the y-axis is about half that of the x-axis. Is there a reason why this happens? In Keithito's repo, the shared alignment graphs have a 1:1 scale; in other words, the range of the x-axis and the y-axis is the same.

Style Token Layer implementation question

Thanks a lot for this nice implementation. But in section 3.2.2 of the original paper (https://arxiv.org/pdf/1803.09017.pdf), the authors mention that:

we found that applying a tanh activation to GSTs before applying attention led to greater token diversity.

So I am a little confused by this part of your implementation:

style_embeddings = tf.nn.tanh(style_attention.multi_head_attention()) # [N, 1, 256]

Shouldn't we first apply the tanh activation to the GST embeddings and then compute the multi-head attention weights, instead of first computing the weighted sum of GSTs and then applying tanh?
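For reference, a sketch of the proposed ordering — apply tanh to the token embeddings before attention and drop the outer tf.nn.tanh. This mirrors the tacotron.py snippet quoted further down this page, so treat it as the questioner's suggestion rather than confirmed upstream behavior:

style_attention = MultiheadAttention(
    tf.expand_dims(refnet_outputs, axis=1),             # query [N, 1, 128]
    tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0),
                    [batch_size, 1, 1])),               # tanh(GSTs) as keys/values
    num_heads=hp.num_heads,
    num_units=hp.style_att_dim,
    attention_type=hp.style_att_type)
style_embeddings = style_attention.multi_head_attention()  # [N, 1, 256], no outer tanh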

GMM Attention

I am currently at SLT2018 and talked to Daisy Stanton about some problems I had training another Tacotron implementation on my data. She mentioned it is crucial to use GMM attention. I started to implement it, but then I found there is already a models/gmm_attention_wrapper.py in gst-tacotron.
Is this GMM attention working, and if so, how can it be turned on?

Thanks and kind regards
Ernst

Reference Encoder Padding

How do we ensure that the padding of the reference mel spectrogram is taken into account when the reference encoder is applied to a batch of mels?
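One plausible approach (a sketch, not this repo's code): track each reference mel's true length through the six stride-2 convolutions and pass it to the GRU as sequence_length, so the final state ignores padded frames. mel_lengths and conv_out are hypothetical names here:

import tensorflow as tf

def downsampled_lengths(mel_lengths, num_conv_layers=6, stride=2):
    # Each stride-2, 'same'-padded conv halves the time axis, rounding up.
    lengths = mel_lengths
    for _ in range(num_conv_layers):
        lengths = tf.floordiv(lengths + stride - 1, stride)
    return lengths

# Inside the reference encoder, after the conv stack (conv_out: [N, T', D]):
#   _, ref_state = tf.nn.dynamic_rnn(
#       GRUCell(hp.reference_depth), conv_out,
#       sequence_length=downsampled_lengths(mel_lengths), dtype=tf.float32)
# dynamic_rnn carries the state through the padded steps unchanged, so
# ref_state reflects only the valid frames.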

What is in the reference audio path?

Dear @syang1993, thank you for your efforts and for kindly sharing this with us.

I have one question about this project. Correct me if I'm wrong.

As I understand it, during training the reference mel is the target mel spectrogram.
When synthesizing, we need to pass the reference audio path, but I don't understand what should be in that path: reference mel spectrograms for every type of audio (angry, happy, sad...), just one type, or just a single mel spectrogram? Are they exported NumPy arrays (*.npy)?

Thank you so much, and thank you again.
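Judging by the eval.py invocations elsewhere in these issues, --reference_audio takes the path to a single wav file (not a folder of .npy files); the reference mel is presumably computed from that wav at synthesis time. A hedged usage example (checkpoint and paths are placeholders):

python eval.py --checkpoint logs-tacotron/model.ckpt-200000 --text "Text to synthesize." --reference_audio /path/to/reference.wav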

poor alignment when conditioned on reference audios

First of all, thanks very much for taking the time to implement this. I have listened to the audio samples here and the results are amazing. However, I am unable to replicate this behavior. Could you please help?

I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (voice samples attached: SampleAudios.zip). The encoder and decoder align well during training. However, during inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.

The following is from training step 200000:

step-200000-align

However, when I evaluated the 203K checkpoint, conditioned on the reference audio discussed above, I get the following:

eval-203000_ref-sample2-align

Without conditioning (i.e., with random style weights):

eval-203000_ref-randomweight-align

Even the style transfer in the voice does not make much difference.

Please find attached the zipped file with voice samples.

My Questions:

  • Is there anything I can change to get better-quality audio and alignment? Thanks for your help in advance.

  • Can you please share the pre-trained model you used to generate the audio samples here?

No clear speech

Hi,
I trained the Tacotron GST with the default hyperparameters (batch size 12 for memory reasons) for 6 days (232,000 iterations) on a Titan X on the Blizzard 2013 dataset. The alignments are only partly linear, and the end of the text repeats several times; see the enclosed file. The ending, plus the word "test", is not understandable.
Is this caused by the batch size?
eval-232000_ref-ca-bb-42-08-align
Thanks and kind regards
Ernst

core dumped error when preprocessing ljspeech dataset

I have the LJSpeech dataset in /database/LJSpeech-1.0 and am trying to preprocess it as follows:

sudo python3 preprocess.py --dataset ljspeech

This results in the following:

2018-11-15 22:40:18.182581: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)

Wondering if others have run into this, or if it's just an issue with my machine.

Using pre-trained model of Keithito's tacotron implementation

As @syang1993 mentioned in the README, this is based on Keithito's implementation of Tacotron. I compared the training scripts of the two projects (this one and Keithito's implementation) and they seem pretty similar. So I was wondering if there is any way I could use Keithito's pre-trained model here to synthesize speech. Right now I am getting the following error while attempting to do so:
-> "tensorflow.python.framework.errors_impl.NotFoundError: Key model/inference/Multihead-attention/attention_b not found in checkpoint"
-> "[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]"

I'd appreciate it if someone could throw some light on how to do this, if it is possible in the first place.

Thanks.
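In general, one can restore just the subset of variables that exists in the old checkpoint and leave the new GST variables (such as the Multihead-attention ones named in the error) freshly initialized, provided the shared variables still have matching shapes. A hedged sketch:

import tensorflow as tf

checkpoint_path = 'pretrained/model.ckpt'  # placeholder: Keithito checkpoint

# List what the old checkpoint actually contains.
reader = tf.train.NewCheckpointReader(checkpoint_path)
ckpt_vars = set(reader.get_variable_to_shape_map().keys())

# Restore only the matching variables; anything GST-specific stays at its
# fresh initialization and gets learned during fine-tuning.
restorable = [v for v in tf.global_variables() if v.op.name in ckpt_vars]
saver = tf.train.Saver(var_list=restorable)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize everything first
    saver.restore(sess, checkpoint_path)         # then overwrite matched variables

Whether the result sounds reasonable is another question: the style embedding is concatenated into the encoder outputs here, so downstream weights may still need substantial fine-tuning.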

Please update link to Blizzard data

Could you change the link in "This set was trained using the Blizzard 2013 dataset with and without global style tokens (GSTs)." to point directly to the data rather than to the Challenge page? A few people don't read the page carefully enough and think they need to sign up for the Challenge to get the data, which is not the case. Here's the direct link to use: http://www.cstr.ed.ac.uk/projects/blizzard/2013/lessac_blizzard2013/

many thanks,
Simon King (Blizzard Challenge organiser)

training time

@syang1993 It's nice work! :-) The latest demo samples are impressive. How long does it take to train the GST-Tacotron model using the segmented Blizzard 2013 dataset? And could you let me know the GPU specification?

Some problems when preprocessing ljspeech dataset

Hi. Thank you very much for your nice job.

I installed TensorFlow 1.4 with GPU support and the other packages mentioned in requirements.txt. However, when I tried to run python3 preprocess.py --dataset ljspeech, I got the following problem:
[screenshot]

Then I upgraded numpy to the newest version. The problem mentioned above was solved, but it brought up a new one: ValueError: operands could not be broadcast together with shapes (1,1025) (0,).

Could you please tell me how to solve this?

Thank you very much!

Tone transfer

I want to know whether this model only learns the rhythm of the utterance you provide, rather than the tone. Can I use this model to imitate the tone of someone's speech from a single sentence?

Eval on soft voices

I noticed that regardless of how soft the reference voice is, the output is always loud. Are we really able to capture the style token if we can't detect what is loud and what is soft?

how long will I wait for the Blizzard dataset confirmation?

Hi syang, thanks very much for your posts! I want to download the Blizzard 2013 dataset and have registered on the website http://www.cstr.ed.ac.uk/projects/blizzard/2013/lessac_blizzard2013/. Afterwards I received an email about my registration which said I should wait 1 to 2 weeks to be confirmed, but I'd like to use the dataset right away. How long did you wait for the dataset confirmation? Thank you very much! ^_^

Shape of linear_outputs is not the same as during training

I have trained the model with English data. Training converged and it generates good wav samples from the training data at checkpoint time. While evaluating the trained model, however, it takes too much time (for the reason below) and produces noisy output.

linear_outputs = tf.layers.dense(post_outputs, hp.num_freq) # [N, T_out, F]
The shape of linear_outputs is [1, 200000, 1025] while evaluating, compared to [1, <200, 1025] while training, which causes griffin_lim to take far too long to generate the wav. Can someone please explain why this is the case?

The model is hard to converge on LJSpeech

Hi! Thanks for your contribution! I have trained the model on the LJSpeech dataset with your code, but I found that the loss does not converge with your default hparams. Here are some results from TensorBoard. Could you give me some advice?

  1. batch_size=32 lr=0.002

  2. batch_size=32 lr=0.001

  3. batch_size=64 lr=0.001

  4. batch_size=64 lr=0.0006

  5. batch_size=32 lr=0.0001

  6. batch_size=32 lr=0.00002

Finally, the model seems to converge, but the alignment is not good; step-51000-align.png looks like this.
Should I keep training, or kill this process and try other hparams? Can you give me some advice?

Pretrained Model

Can anyone share their pretrained model? That would be of much help.

poor alignment with test out-of-collection data

Hi @syang1993, I ran a 170K+ step training and the output on training data is the best among the many Tacotron implementations I've tried.
But I notice that with out-of-collection data the output alignment is always a mess. I tried Rayhane-mamah's repo (good rhythm but bad voice); there, out-of-collection data always gets good rhythm. Do you have any advice on that? The reference audio is from the training data.

The trained alignment map:

The test alignment map with out-of-collection data:

Eval from checkpoints

Thanks for your nice work. I have trained the model on the Blizzard 2013 dataset. The synthesized files from the 185k and 385k checkpoints are available at the following link. I used samples from LJ Speech (LJ001-0001.wav) and Nancy (nancy.wav) as reference files for checking the performance. I also included the model checkpoint files and the audio files at each step (step-185000-audio.wav, step-385000-audio.wav).
https://www.dropbox.com/sh/jhcynw65o1tmj7r/AABJN4cBotdbs-A5-Rk89vt0a?dl=0
Any idea how to improve the shaky voice?

multi head attention

Hi, again.
The code is currently set to use mlp_attention.
Did the uploaded audio demo samples use mlp_attention?
Have you ever experimented with multi-head attention? Do you have any audio samples?

Thanks.

Error in eval.py

Hello guys, when I run python eval.py --checkpoint /media/guo/Program/log/gst-tacotron/logs-tacotron/model.ckpt-12000 --text "hello text" --reference_audio /media/guo/...
I get this error. What should the reference_audio path be? Thank you.

Training with custom data

Curious if others have achieved reasonable results training on custom data. I've tried training the model on data from https://github.com/aomv/voiceloop-in-the-wild-experiments/tree/master/data/donald-trump/data (which has audio files and transcriptions a few seconds in length, totalling around a couple of hours), making a metadata.csv file in the same format as the LJSpeech dataset.

While I've trained for several hours with a steadily decreasing loss, the graphs indicate the model is not learning properly. I've also failed to generate intelligible audio, at least without using a reference audio (I tried several times).

step-34000-align

eval-34000_ref-randomweight-align

Training Multi-Speaker Model.

Hi,
I had success using this model with single-speaker data,
but I am not sure how to scale it to a multi-speaker setting.
Is it just a matter of changing the data, or should there be some changes to the model?
Has anyone tried this before? (A common approach is sketched below.)
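A common minimal extension (a standard technique, not code from this repo): add a learned speaker embedding and concatenate it onto the encoder outputs at every timestep, so attention and the decoder are conditioned on speaker identity. speaker_ids and encoder_outputs are placeholders in this sketch:

import tensorflow as tf

num_speakers = 10        # assumption: number of speakers in your corpus
speaker_embed_dim = 64   # assumption: embedding size

# speaker_ids: [N] int32 tensor fed alongside the text inputs.
speaker_table = tf.get_variable(
    'speaker_embedding', [num_speakers, speaker_embed_dim], dtype=tf.float32)
speaker_embed = tf.nn.embedding_lookup(speaker_table, speaker_ids)   # [N, 64]

# Broadcast over time and append to the encoder outputs [N, T_in, D].
tiled = tf.tile(tf.expand_dims(speaker_embed, 1),
                [1, tf.shape(encoder_outputs)[1], 1])
encoder_outputs = tf.concat([encoder_outputs, tiled], axis=-1)

The data side then just needs a speaker id column in the metadata so the feeder can emit speaker_ids with each batch.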

Why is there some silence in the synthesized wav file when we use reference-audio generation?

Hello, thanks for your brilliant work!

I used the LJSpeech dataset to train a model and synthesized several sentences with some reference audio. But I found that there is some silence in the middle of the resulting evaluation wav file if we comment out the audio.find_endpoint call in synthesizer.py. It looks like this:
eval-73000_ref-15_am_m-3.zip
The command line is as follows:

python eval.py --checkpoint logs-tacotron/model.ckpt-73000 --reference_audio ../expressive_tacotron/ref1/15_am_m.wav

In fact, the reference audio is derived from expressive_tacotron, and there is no silence in that wav file.

Do you know why this happens? Do we have to use the reference mel during training to improve the situation?

Mumbling in synthesis

Hey, thanks for the implementation @syang1993!

I'm using this code to implement another paper and I've bumped into some issues during synthesis. I get good alignment during training and the interim synthesised results sound good; however, during evaluation the synthesis is very unpredictable and sometimes fails to produce understandable speech. It sounds more like mumbling. It's not only on long utterances, but sometimes on short and mid-length texts too. I'm attaching a few alignment plots and audio examples.

I was wondering if you've come across this before and whether you have any tips on where I should look to fix this issue. I've trained the model using the multi-head attention; do you reckon the GMM attention would improve things a lot?
eval-320000_ref-frankenstein_chp_13-4-align
eval-320000_ref-frankenstein_chp_13-3-2-align
mumbling_samples.zip

poor alignment when synthesizing long sentences

Thank you for your work! It helps a lot.
I want to ask whether your alignment is good when synthesizing sentences of more than 10 words, say about 20 words. The paper says 'the model fails when conditioned on the shorter source phrases, [but] successfully aligns when conditioned on the longest input.' The reference audios I used are about 20 words, but the model only works well when synthesizing shorter sentences. Attached please find some samples. By the way, I use Nancy and Blizzard 2017 for training.
Could you give me some suggestions? Thank you.
samples.zip

Where do you insert or import the wav files of a model's voice for training?

Hi, I have been looking at many different Tacotron repos here and there, and I get a bit confused as I am rather a novice. I see the same instructions on all the repos, but none of them are specific about where in the process I import the audio recordings of my voice so that they can be trained into a new model. Where do I put the wav files of my voice? How long should the samples be? Where in this process do the scripts read my recorded voice files? Any help will be greatly appreciated. I managed to get the Real-Time Voice Cloning toolbox to work, but even then the only thing I can do is use the toolbox for a quick demo, not fully train. (A sketch of the expected dataset layout follows.)
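For Keithito-style repos such as this one, the wavs are never imported explicitly; preprocess.py reads an LJSpeech-layout dataset folder: a wavs/ directory plus a metadata.csv whose lines look like id|raw text|normalized text, with each clip a few seconds long and ideally hours of speech in total. A hypothetical helper that builds metadata.csv from paired wav/txt files — the txt/ layout is an assumption, so adapt it to however your transcripts are stored:

import os

dataset_dir = 'MyVoice'  # placeholder path
wav_dir = os.path.join(dataset_dir, 'wavs')

with open(os.path.join(dataset_dir, 'metadata.csv'), 'w', encoding='utf-8') as out:
    for name in sorted(os.listdir(wav_dir)):
        if not name.endswith('.wav'):
            continue
        clip_id = name[:-4]
        with open(os.path.join(dataset_dir, 'txt', clip_id + '.txt'),
                  encoding='utf-8') as f:
            text = f.read().strip()
        # LJSpeech metadata format: id|raw transcript|normalized transcript
        out.write('%s|%s|%s\n' % (clip_id, text, text))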

Preprocessing blizzard 2013 data

Hi,

Thank you for your contribution.
I've tried to train on the Blizzard 2013 dataset with your model.
While preprocessing the data, I encountered an error because _max_out_length is None in datasets/blizzard2013.py (line 8).

Can I set this value to 700, as defined in blizzard.py for the Blizzard dataset? (See the sketch below.)

Thanks in advance
Jaeseung Ko
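A guess at the intended fix, simply mirroring blizzard.py (treat 700 as a starting point rather than a tuned value):

# datasets/blizzard2013.py, line 8
_max_out_length = 700  # presumably used to skip utterances that are too long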

Error in datafeeder.py

Traceback (most recent call last):
File "/speech/Demo/gst-tacotron-master/datasets/datafeeder.py", line 75, in run
self._enqueue_next_group()
File "/speech/Demo/gst-tacotron-master/datasets/datafeeder.py", line 87, in _enqueue_next_group
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "/speech/Demo/gst-tacotron-master/datasets/datafeeder.py", line 87, in <listcomp>
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "/speech/Demo/gst-tacotron-master/datasets/datafeeder.py", line 105, in _get_next_example
meta = self._metadata[self._offset]
IndexError: list index out of range

I got the above error in datafeeder.py when I ran train.py. I initially got an 'Incompatible shapes' error, so I increased the max_iters parameter from 1000 to 2000.
What changes should I make to resolve the above error?
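For context, a feeder like this is expected to wrap its offset at the end of an epoch, roughly as sketched below; if the IndexError still fires with such a guard in place, check that training/train.txt is non-empty and that _metadata isn't mutated while the feeder thread runs:

import random

def _get_next_example(self):
    # Wrap the offset back to zero at the end of an epoch instead of
    # indexing past the end of the metadata list.
    if self._offset >= len(self._metadata):
        self._offset = 0
        random.shuffle(self._metadata)
    meta = self._metadata[self._offset]
    self._offset += 1
    return meta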

preprocessing the training data

Thank you very much for your nice work.
I have a problem preprocessing the training data. The transcript file for the Blizzard 2013 segmented data is a file named prompts.gui, which can be found here:
https://www.dropbox.com/s/6ugwnbqgwlfvxvl/prompts.gui?dl=0
I was wondering what the metadata.train file should look like. It seems that I need to clean up the attached file to match the expected training format. Is it possible to upload your cleaned-up metadata-train file, the converter from prompts.gui to metadata-train, or the desired format of the metadata.train file?
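In lieu of the author's converter, a hypothetical sketch: it assumes each prompt in prompts.gui occupies a fixed group of three lines (utterance id, text, markup), which you should verify against your copy of the file and adjust. The output mimics the LJSpeech-style id|text|normalized-text metadata format:

with open('prompts.gui', encoding='utf-8') as fin, \
     open('metadata.train', 'w', encoding='utf-8') as fout:
    lines = [l.strip() for l in fin if l.strip()]
    for i in range(0, len(lines) - 2, 3):
        utt_id, text = lines[i], lines[i + 1]  # lines[i + 2] = markup, skipped
        fout.write('%s|%s|%s\n' % (utt_id, text, text))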

data feeder error

After running the code for about 200 steps, I run into the following error. I can't figure out why; I feel like it should be an easy fix.

self._session.run(self._enqueue_op, feed_dict=feed_dict)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[Node: datafeeder/input_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/input_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/linear_targets_0_2)]]

Caused by op 'datafeeder/input_queue_enqueue', defined at:
File "train.py", line 153, in
main()
File "train.py", line 149, in main
train(log_dir, args)
File "train.py", line 58, in train
feeder = DataFeeder(coord, input_path, hparams)
File "/home/fakarim/projects/gst-tacotron/datasets/datafeeder.py", line 46, in init
self._enqueue_op = queue.enqueue(self._placeholders)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py", line 327, in enqueue
self._queue_ref, vals, name=scope)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2777, in _queue_enqueue_v2
timeout_ms=timeout_ms, name=name)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/fakarim/anaconda3/envs/gst-tacotron/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

CancelledError (see above for traceback): Enqueue operation was cancelled
[[Node: datafeeder/input_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/input_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/linear_targets_0_2)]]

Problem training as a plain Tacotron 1 script

Thanks for your great work, but I found that if I set the hyperparameter use_gst=False and run, the behavior seems different from my understanding of Tacotron 1. The relevant part of the tacotron.py code is here:

      if reference_mel is not None:
        # Reference encoder
        refnet_outputs = reference_encoder(
          reference_mel, 
          filters=hp.reference_filters, 
          kernel_size=(3,3),
          strides=(2,2),
          encoder_cell=GRUCell(hp.reference_depth),
          is_training=is_training)                                                 # [N, 128]
        self.refnet_outputs = refnet_outputs                                       

        if hp.use_gst:
          # Style attention
          style_attention = MultiheadAttention(
            tf.expand_dims(refnet_outputs, axis=1),                                   # [N, 1, 128]
            tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size,1,1])),            # [N, hp.num_gst, 256/hp.num_heads]   
            num_heads=hp.num_heads,
            num_units=hp.style_att_dim,
            attention_type=hp.style_att_type)

          style_embeddings = style_attention.multi_head_attention()                   # [N, 1, 256]
        else:
          style_embeddings = tf.expand_dims(refnet_outputs, axis=1)                   # [N, 1, 128]
      else:
        print("Use random weight for GST.")
        random_weights = tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0, dtype=tf.float32)
        random_weights = tf.nn.softmax(random_weights, name="random_weights")
        style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
        style_embeddings = tf.reshape(style_embeddings, [1, 1] + [hp.num_heads * gst_tokens.get_shape().as_list()[1]])

The original Tacotron 1 shouldn't train with the reference encoder part, right?
However, your code passes the non-GST-mode data into reference_encoder, which seems strange.
Maybe we can swap the two if conditions to make it correct:

if hp.use_gst:
  if reference_mel is not None:
    # reference encoder + style attention, as above
  else:
    # random style-token weights, as above
else:
  # plain Tacotron 1: skip the reference encoder entirely

THANKS

Unable to reproduce results

Hi, I am using this exact code with the same hyperparameters, but the results produced are nowhere close to the sample results shown in this repository.
I have tried training using both the LJSpeech and Blizzard datasets. The output from both models has some noise present in it. What could be the possible reason?
