fatchord / WaveRNN
WaveRNN Vocoder + TTS
Home Page: https://fatchord.github.io/model_outputs/
License: MIT License
Hi @fatchord, I trained with your implementation, but generation is slower than in your demo. What could be wrong? Thanks!
Generating: 1/3
43101/43200 -- Speed: 638 samples/sec
Generating: 2/3
56901/57000 -- Speed: 1075 samples/sec
Generating: 3/3
110501/110600 -- Speed: 1067 samples/sec
You mentioned "a much simpler version that I will upload soon". Can that version run in real time on CPU?
I have reduced your model to a smaller one (only 0.95M parameters).
I think the upsampling network can be replaced by a single torch.nn.Upsample operation with scale_factor equal to hop_length and mode='linear'.
I took a network trained on LJSpeech data and looked at the output of the upsampling layer. The upsampled mels from the upsampling network closely match (up to an arbitrary scaling factor) the values I get by linearly interpolating between the original mel values.
The auxiliary network is still needed, though.
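The replacement described above can be sketched in a few lines. The `hop_length` value here is an assumption (275 is a common choice for 22.05 kHz setups), not necessarily this repo's hparam:

```python
import torch
import torch.nn as nn

# Assumed hop length between mel frames; use whatever the model was trained with.
hop_length = 275

# A mel spectrogram batch: (batch, n_mels, n_frames).
mels = torch.randn(1, 80, 10)

# Replace the learned upsampling network with plain linear interpolation.
# nn.Upsample with mode='linear' expects a 3-D (batch, channels, length)
# input, which matches the mel layout directly.
upsample = nn.Upsample(scale_factor=hop_length, mode='linear', align_corners=False)
upsampled = upsample(mels)

print(upsampled.shape)  # (1, 80, 10 * hop_length)
```

Since the learned upsampler reportedly only reproduces linear interpolation up to a scale factor, this swap removes parameters without obviously changing what the conditioning network sees.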
Hi, I'm thinking of building WaveRNN as part of a master's project, and I was wondering what the training time is like. I only have a single GTX 1060 3GB, so I was a bit concerned that training time would make this unrealistic. Also, I have done a term of machine learning modules, but we didn't really cover conditioning networks on features of the input data (for example speaker id, current phoneme, syllable, word, etc.). Could you please point me towards any information/literature on this?
Any help is much appreciated. Thanks!
The file data/female_vocal_op8_8.wav is mentioned in the second notebook, but is not available in the repo.
Hello,
I was wondering if there is any implementation for conditioning on mel spectrograms (from, say, Tacotron 2)?
Have you observed improvements after fine-tuning with GTA features? In my experiments, the quality of the generated speech doesn't improve compared to the model trained only on ground-truth mel spectrograms.
Hi fatchord, great work. May I ask, for the TTS samples that you posted, did you use the vocoder trained on the ground truths or on the output of a seq2seq text-to-features model?
Is it possible to adapt this code to train on multiple GPUs? Specifically the Alternative model?
Hello,
Thanks for the great work!
Any plans for subscaling WaveRNN implementation? Or is the current WaveRNN implementation already fast enough (compared to say WaveNet generation)?
Have you tried subscale generation?
"SRU is a recurrent unit that can run over 10 times faster than cuDNN LSTM, without loss of accuracy tested on many tasks." Sounds very exciting. I think we can try it. Here is a PyTorch implementation: https://github.com/taolei87/sru and the paper: https://arxiv.org/abs/1709.02755.
Hi,
The Alternative model is easy to train and gives good results.
I also trained the WaveRNN from your NB1/2/3, but the quality is not good even after 7 days; the loss stays above 5. Comparing it with your Alternative model, I wonder if adding NN upsampling or a ResNet would make it better. Would you share your thoughts? Thanks.
Thank you very much for sharing this! I'm trying to run your NB4a and NB4b code. NB4a imports the dsp library, which is not in this repo. Would you please include it or point me to it? Thanks!
hi,
I tried to train a new model using my own wav files.
I get an error at the preprocess.py step.
How is that supposed to work?
I tried to train the Tacotron model on top of your pretrained LJ checkpoint. I just ran train_tacotron.py, but when I run gen_tacotron.py I get the following:
Initialising WaveRNN Model...
Trainable Parameters: 4.481M
Loading Weights: "checkpoints/lj.wavernn/latest_weights.pyt"
Initialising Tacotron Model...
Trainable Parameters: 11.078M
Loading Weights: "checkpoints/lj.tacotron/latest_weights.pyt"
+---------+----------+---+-----------------+----------------+-----------------+
| WaveRNN | Tacotron | r | Generation Mode | Target Samples | Overlap Samples |
+---------+----------+---+-----------------+----------------+-----------------+
| 804k | 197k | 1 | Batched | 11000 | 550 |
+---------+----------+---+-----------------+----------------+-----------------+
| Generating 1/6
Floating point exception (core dumped)
Any ideas on how I can go on debugging this?
Can we use WaveRNN to build a TTS system? What is the input? Thank you.
Hi @fatchord , thank you so much for sharing your great work!
I'm trying to train your alternate model. What I've noticed is that, at each epoch, the loss of the first batch is very different (usually much smaller) than the loss of subsequent batches. Is this normal? Why is this the case? Does that mean the model after the first batch is significantly better since it has a small loss?
hi!
Can you tell me what the data type of x is when you load the data:
m = np.load(f'{self.path}mel/{file}.npy')
x = np.load(f'{self.path}quant/{file}.npy')
In my case, I normalized the wave data by 32768 and got float values like 0.001, 0.02, 0.34, ..., but in NB4b:
coarse = np.stack(coarse).astype(np.int64)
astype(np.int64) converts all the values (0.001, 0.02, 0.34, ...) to zero, so the model can't train, because the targets are all zero.
Can you tell me how to fix it? Thanks.
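One possible fix, sketched under the assumption that the waveform is already scaled to [-1, 1]: quantise to integer class indices first, then cast. The bit depth and function name here are illustrative, not taken from the repo:

```python
import numpy as np

bits = 9  # assumed bit depth; the repo's alternative model is described as 9-bit

def quantise(x, bits):
    """Map a float waveform in [-1, 1] to integer classes in [0, 2**bits - 1].

    Casting floats straight to int64 truncates them all to zero; quantise
    first, then cast, so each sample becomes a valid class index for the loss.
    """
    return ((x + 1.0) * (2 ** bits - 1) / 2).astype(np.int64)

wav = np.array([-1.0, 0.001, 0.02, 0.34, 1.0])
coarse = quantise(wav, bits)
print(coarse)  # integer class indices spanning 0 .. 511
```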
Here are some samples: https://fatchord.github.io/model_outputs/ but I wonder what the model takes as input. Is it a ground-truth mel spectrogram, or one predicted by something like Tacotron 2?
Hi,
Great work. Thanks for sharing.
I'm trying to run the repo with a 16 kHz sampling rate, but after a few epochs of GPU training it crashes every time. I feel I should adapt the network, but I couldn't find a solution yet.
Can you share your way of doing this?
In the picture, the last 2 dense layers have a skip connection, but your code does not.
Thanks a lot for this implementation. It would be great to provide pre-trained models for the generated samples.
Hi, your new model sounds very good. Any chance you will write it up in a paper/blog post? What's the new vocoder, is it more WaveRNN-like or WaveNet-like?
In your Fit a 30min Sample notebook, I see your GRU update is: hidden_coarse = u_coarse * hidden_coarse + (1. - u_coarse) * e_coarse
Shouldn't it be: hidden_coarse = u_coarse * e_coarse + (1. - u_coarse) * hidden_coarse
By the way, thank you for your dedication; your work is great!
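For reference, both update conventions appear in GRU write-ups (PyTorch's own GRUCell documentation uses h' = (1 - z) * n + z * h). Since the gate comes from a learned sigmoid, either form can represent the other by relabelling the gate. A quick numerical sketch of that equivalence:

```python
import numpy as np

# The two update rules differ only in which role the gate u plays:
#   form A: h_new = u * h_old + (1 - u) * e      (u acts as a "keep" gate)
#   form B: h_new = u * e     + (1 - u) * h_old  (u acts as an "update" gate)
# Form A with gate u is identical to form B with gate 1 - u, so a trained
# network can learn either convention.
rng = np.random.default_rng(0)
h_old, e, u = rng.random(4), rng.random(4), rng.random(4)

form_a = u * h_old + (1 - u) * e
form_b_relabelled = (1 - u) * e + (1 - (1 - u)) * h_old  # form B with gate 1-u

assert np.allclose(form_a, form_b_relabelled)
```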
Have you tried training your model with the sparsification method? How were the results?
Hi, thanks for such a good implementation of WaveRNN. I am working on integrating this WaveRNN implementation with Tacotron for TTS, and I got good results much faster than WaveNet, but still much slower than real time (10 s of audio in nearly 3 minutes or so).
Currently I see that this model gives around 1500 samples/sec on my GTX 1080 Ti. But in the WaveRNN paper they claim 96,000 samples/sec by optimising WaveRNN-896 for a P100 GPU, and they even show subscale running on a mobile CPU. Is it possible to optimise this WaveRNN to that level, so that we get real-time sampling at least on GPU?
I plan to merge WaveRNN with TTS - https://github.com/mozilla/TTS, if you don't mind.
hi,
how can I set the length of the generated files?
In my case they are always 10 s, the same as the samples the model was trained on.
I would rather have 2 s generated files.
Just for getting audio from text.
Hi @fatchord Great work on the code and architecture.
I was hoping if it's possible for you to share the pre-trained model so I can study the model on my end before training on my own dataset. Would be of great help, thanks!
# tail -f log.txt -n 5000
Initialising Model...
Trainable Parameters: 4.481M
Loading Model: "checkpoints/9bit_mulaw/latest_weights.pyt"
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [611,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [202,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [301,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [402,0,0] Assertion `t >= 0 && t < n_classes` failed.
| Epoch: 1/157 (345/3205) | Loss: 4.921 | 0.54 steps/s | Step: 0k | Traceback (most recent call last):
File "train.py", line 89, in <module>
train_loop(model, optimiser, train_set, test_set, lr)
File "train.py", line 43, in train_loop
loss.backward()
File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
It seems CUDA 10 is needed, but my current environment cannot support CUDA 10. Do you have any idea how to run this without CUDA 10?
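For what it's worth, the repeated `Assertion 't >= 0 && t < n_classes' failed` lines in that log usually indicate a target class index outside the model's output range (for instance a bit-depth mismatch between the quantised data and the output layer) rather than a CUDA-version problem; running on CPU typically surfaces a clearer error. A minimal reproduction, with assumed sizes:

```python
import torch
import torch.nn.functional as F

n_classes = 512                    # e.g. a 9-bit output layer (assumption)
logits = torch.randn(4, n_classes)
log_probs = F.log_softmax(logits, dim=1)

good = torch.tensor([0, 100, 300, 511])
F.nll_loss(log_probs, good)        # valid targets: no error

bad = torch.tensor([0, 100, 300, 512])  # 512 >= n_classes
try:
    F.nll_loss(log_probs, bad)     # on CPU this raises instead of asserting
except (IndexError, RuntimeError) as err:
    print("out-of-range target:", err)
```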
Hi, I managed to make the implementation of WaveRNN much faster by allowing it to use cuDNN's implementation of GRU:
https://github.com/mkotha/WaveRNN
The code is based on yours, although I have heavily modified it.
I'd like to make the above code publicly available, either dedicated to the public domain or released under an open source license. However, I realize that you didn't release your code under an open source license. Would it be possible for me to get permission to release the code?
Hi,
I get a strange error when I launch the preprocess.py file.
The folder with my wav files is found and no other files are contained there.
How can I get through this error?
/WaveRNN$ python preprocess.py
7690 wav files found in "/home/ubuntu/wav/"
Traceback (most recent call last):
File "preprocess.py", line 52, in
text_dict = ljspeech(path)
File "/home/ubuntu/WaveRNN/utils/text/recipes.py", line 9, in ljspeech
assert len(csv_file) == 1
AssertionError
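Judging from the traceback, the ljspeech recipe looks for a transcript CSV next to the audio. A sketch of what the failing check appears to do (the function name and error message are hypothetical reconstructions):

```python
from pathlib import Path

def find_metadata_csv(path):
    """Hypothetical reconstruction of the check that trips in recipes.py.

    The ljspeech recipe expects exactly one transcript .csv (LJSpeech ships
    a metadata.csv) alongside the wavs. A folder containing only wav files
    yields zero csv matches and fails the `len(csv_file) == 1` assert.
    """
    csv_file = list(Path(path).glob('*.csv'))
    assert len(csv_file) == 1, (
        f'expected exactly 1 metadata csv in {path}, found {len(csv_file)}')
    return csv_file[0]
```

So the error likely means the wav folder needs a single LJSpeech-style metadata CSV, or a different recipe for datasets without one.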
@fatchord, thanks for your work on this. The samples you have are fantastic, and the model converges really quickly.
How do I go about creating the mel as the input? Do I need to train another model that produces mels and pipe that as the input? Or should I be able to take any wav file, construct a mel, and pass that as the input?
Example result: https://soundcloud.com/user-565970875/ljspeech-logistic-wavernn
Here is the branch: https://github.com/erogol/WaveRNN/tree/mold if you'd like to try it. The model was trained with TTS spectrograms on the LJSpeech dataset. The models will be released soon.
@fatchord would you prefer to have the trained WaveRNN model here, or better to have a new repository for this?
Hi - I tried your alternate model and it worked well right away, so I am thankful for your work.
But I noticed the output of your melspectrogram() function often clips to 1.0 on LJSpeech data.
(Of course, it might be my bad implementation.)
It also seems the code is similar to keithito/tacotron. In Keith's version he later changed one line to
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
in response to an issue reported by Rafael Valle. I wonder whether this difference was intentional or not (or maybe not relevant).
Thanks.
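To illustrate why subtracting ref_level_db matters, here is a sketch of the keithito-style normalisation; the ref_level_db and min_level_db values are that project's common defaults, assumed rather than taken from this repo:

```python
import numpy as np

ref_level_db, min_level_db = 20, -100  # assumed defaults

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def normalize(s):
    # Squash dB values into [0, 1]; anything above 0 dB clips at 1.0.
    return np.clip((s - min_level_db) / -min_level_db, 0, 1)

loud_frame = np.array([2.0])  # a linear-mel magnitude above 1

without_ref = normalize(amp_to_db(loud_frame))              # clips to 1.0
with_ref = normalize(amp_to_db(loud_frame) - ref_level_db)  # stays below 1.0
print(without_ref, with_ref)
```

Without the ref_level_db subtraction, loud frames exceed the normalisation range and get clipped, which matches the clipping-to-1.0 behaviour described above.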
Your samples sound great.
I think your samples were generated by your alternative model (9-bit), not by the original WaveRNN (16-bit).
Is that right?
Any hints on how to use NB2 to train a dataset (say a directory with multiple audio files) and then use the trained model to generate one of those samples?
Thank you in advance
I am companding my target waveform with mu-law before quantization. However, I am not sure whether I should expand the signal during the autoregressive generation, or leave it as is and expand it once the entire signal is generated.
I see that the restoration of the quantized signal happens here:
sample = torch.sign(sample) * (1 / (2 ** bits - 1)) * ((1 + (2 ** bits - 1)) ** torch.abs(sample) - 1)
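For context, that line has the form of mu-law expansion with mu = 2**bits - 1. A round-trip sketch of the companding pair in NumPy (the bit depth is an assumption):

```python
import numpy as np

mu = 2 ** 9 - 1  # assumed 9-bit companding

def mu_compress(x, mu):
    """Mu-law companding of a waveform in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu):
    """Inverse of mu_compress; same algebraic form as the line quoted above."""
    return np.sign(y) * (1.0 / mu) * ((1.0 + mu) ** np.abs(y) - 1.0)

x = np.linspace(-1, 1, 11)
assert np.allclose(mu_expand(mu_compress(x, mu), mu), x)  # exact round trip
```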
Thanks for this great implementation @fatchord! On a P100 I can generate only about 1600 samples/second i.e. much slower than real time. Is this expected or have I done something wrong?
It looks like this implementation is using 10 res blocks so maybe this is expected? Is there any way to make it 4x real time like the WaveRNN paper does?
Note I am talking only about the vocoder, not the tacotron part (i.e. mel spectrogram -> wav)
Could you please explain the pad variable and how it is helpful?
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
Why is randint used here?
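A sketch of what that line appears to be doing in the collate step, with illustrative sizes (seq_len, pad, and the mel lengths here are assumptions): randint picks a random start frame per item, so a different window of each mel/audio pair is seen every epoch.

```python
import numpy as np

seq_len, pad = 5, 2  # assumed: frames per training window, context padding
mels = [np.random.rand(80, 40), np.random.rand(80, 60)]  # (n_mels, n_frames)

# Latest start frame that still leaves room for a full window in each mel.
max_offsets = [m.shape[-1] - (seq_len + 2 * pad) for m in mels]

# The quoted line: a random start frame per item for crop diversity.
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]

crops = [m[:, o:o + seq_len + 2 * pad] for m, o in zip(mels, mel_offsets)]
print([c.shape for c in crops])  # every crop is (80, seq_len + 2*pad)
```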