fatchord / WaveRNN
WaveRNN Vocoder + TTS
Home Page: https://fatchord.github.io/model_outputs/
License: MIT License
Hi @fatchord, I trained with your implementation, but generation is slower than in your demo. What could be wrong? Thanks!
Generating: 1/3
43101/43200 -- Speed: 638 samples/sec
Generating: 2/3
56901/57000 -- Speed: 1075 samples/sec
Generating: 3/3
110501/110600 -- Speed: 1067 samples/sec
You mentioned "a much simpler version that I will upload soon". Can that version run in real time on CPU?
I have reduced your model to a smaller one (only 0.95M parameters).
I think the upsampling network can be replaced by a single torch.nn.Upsample operation with scale_factor equal to hop_length and mode='linear'.
I took a network trained on LJSpeech data and looked at the output of the upsampling layer. The upsampled mels from the upsampling network closely match (up to an arbitrary scaling factor) the values I get by linearly interpolating between the original mel values.
The auxiliary network is still needed, though.
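The replacement described above can be sketched in a few lines. The `hop_length` value here is an assumption (275 is a common choice for 22.05 kHz setups), not necessarily this repo's hparam:

```python
import torch
import torch.nn as nn

# Assumed hop length between mel frames; use whatever the model was trained with.
hop_length = 275

# A mel spectrogram batch: (batch, n_mels, n_frames).
mels = torch.randn(1, 80, 10)

# Replace the learned upsampling network with plain linear interpolation.
# nn.Upsample with mode='linear' expects a 3-D (batch, channels, length)
# input, which matches the mel layout directly.
upsample = nn.Upsample(scale_factor=hop_length, mode='linear', align_corners=False)
upsampled = upsample(mels)

print(upsampled.shape)  # (1, 80, 10 * hop_length)
```

Since the learned upsampler reportedly only reproduces linear interpolation up to a scale factor, this swap removes parameters without obviously changing what the conditioning network sees.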
Hi, I'm thinking of building WaveRNN as part of a master's project, and I was wondering what the training time is like. I only have a single GTX 1060 3GB, so I was a bit concerned that training time would make this unrealistic. Also, I have done a term of machine learning modules, but we didn't really cover conditioning networks on features of the input data (for example speaker id, current phoneme, syllable, word, etc.). Could you please point me towards any information/literature on this?
Any help is much appreciated. Thanks!
The file data/female_vocal_op8_8.wav is mentioned in the second notebook, but is not available in the repo.
Hello,
I was wondering if there is any implementation for conditioning on mel spectrograms (from, say, Tacotron 2)?
Have you observed improvements after fine-tuning with GTA features? In my experiments, the quality of the generated speech doesn't improve compared to the model trained only on ground-truth mel spectrograms.
Hi fatchord, great work. May I ask, for the TTS samples that you posted, did you use the vocoder trained on the ground truths or on the output of a seq2seq text-to-features model?
Is it possible to adapt this code to train on multiple GPUs? Specifically the Alternative model?
Hello,
Thanks for the great work!
Any plans for subscaling WaveRNN implementation? Or is the current WaveRNN implementation already fast enough (compared to say WaveNet generation)?
Have you tried subscale generation?
"SRU is a recurrent unit that can run over 10 times faster than cuDNN LSTM, without loss of accuracy tested on many tasks." Sounds very exciting. I think we can try it. Here is a PyTorch implementation: https://github.com/taolei87/sru and the paper: https://arxiv.org/abs/1709.02755.
Hi,
The Alternative model is easy to train and gives good results.
I also trained the WaveRNN from your NB1/2/3, but the quality is not good even after 7 days; the loss stays above 5. Comparing it with your Alternative model, I wonder if adding NN upsampling or a ResNet would make it better. Would you share your thoughts? Thanks.
Thank you very much for sharing this! I'm trying to run your NB4a and NB4b code. NB4a imports the dsp library, which is not in this repo. Would you please include it or point me to it? Thanks!
hi,
I tried to train a new model using my own wav files.
I get an error at the preprocess.py step.
How is that supposed to work?
I tried to train the Tacotron model on top of your pretrained LJ checkpoint. I just ran train_tacotron.py, but when I run gen_tacotron.py I get the following:
Initialising WaveRNN Model...
Trainable Parameters: 4.481M
Loading Weights: "checkpoints/lj.wavernn/latest_weights.pyt"
Initialising Tacotron Model...
Trainable Parameters: 11.078M
Loading Weights: "checkpoints/lj.tacotron/latest_weights.pyt"
+---------+----------+---+-----------------+----------------+-----------------+
| WaveRNN | Tacotron | r | Generation Mode | Target Samples | Overlap Samples |
+---------+----------+---+-----------------+----------------+-----------------+
| 804k | 197k | 1 | Batched | 11000 | 550 |
+---------+----------+---+-----------------+----------------+-----------------+
| Generating 1/6
Floating point exception (core dumped)
Any ideas on how I can go on debugging this?
Can we use WaveRNN to build a TTS system? What is the input? Thank you.
Hi @fatchord , thank you so much for sharing your great work!
I'm trying to train your alternate model. What I've noticed is that, at each epoch, the loss of the first batch is very different (usually much smaller) than the loss of subsequent batches. Is this normal? Why is this the case? Does that mean the model after the first batch is significantly better since it has a small loss?
hi!
Can you tell me what the data type of x is when you load the data:
m = np.load(f'{self.path}mel/{file}.npy')
x = np.load(f'{self.path}quant/{file}.npy')
In my case, I normalized the wave data by 32768 and got float values like 0.001, 0.02, 0.34, ..., but in NB4b:
coarse = np.stack(coarse).astype(np.int64)
astype(np.int64) converts all the values (0.001, 0.02, 0.34, ...) to zero, so the model can't train, because the targets are all zero.
Can you tell me how to fix it? Thanks.
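One possible fix, sketched under the assumption that the waveform is already scaled to [-1, 1]: quantise to integer class indices first, then cast. The bit depth and function name here are illustrative, not taken from the repo:

```python
import numpy as np

bits = 9  # assumed bit depth; the repo's alternative model is described as 9-bit

def quantise(x, bits):
    """Map a float waveform in [-1, 1] to integer classes in [0, 2**bits - 1].

    Casting floats straight to int64 truncates them all to zero; quantise
    first, then cast, so each sample becomes a valid class index for the loss.
    """
    return ((x + 1.0) * (2 ** bits - 1) / 2).astype(np.int64)

wav = np.array([-1.0, 0.001, 0.02, 0.34, 1.0])
coarse = quantise(wav, bits)
print(coarse)  # integer class indices spanning 0 .. 511
```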
Here are some samples: https://fatchord.github.io/model_outputs/ but I wonder what the model takes as input. Is it a ground-truth mel spectrogram, or one predicted by something like Tacotron 2?
Hi,
Great work. Thanks for sharing.
I'm trying to run the repo with a 16 kHz sampling rate, but after a few epochs of GPU training it crashes every time. I feel I should adapt the network, but I couldn't find a solution yet.
Can you share your way of doing this?
In the picture, the last 2 dense layers have a skip connection, but your code does not.
Thanks a lot for this implementation. It would be great to provide pre-trained models for the generated samples.
Hi, your new model sounds very good. Any chance you will write it up in a paper/blog post? What's the new vocoder, is it more WaveRNN-like or WaveNet-like?
In your Fit a 30min Sample notebook, I see your GRU update is: hidden_coarse = u_coarse * hidden_coarse + (1. - u_coarse) * e_coarse
Shouldn't it be: hidden_coarse = u_coarse * e_coarse + (1. - u_coarse) * hidden_coarse
By the way, thank you for your dedication; your work is great!
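For reference, both update conventions appear in GRU write-ups (PyTorch's own GRUCell documentation uses h' = (1 - z) * n + z * h). Since the gate comes from a learned sigmoid, either form can represent the other by relabelling the gate. A quick numerical sketch of that equivalence:

```python
import numpy as np

# The two update rules differ only in which role the gate u plays:
#   form A: h_new = u * h_old + (1 - u) * e      (u acts as a "keep" gate)
#   form B: h_new = u * e     + (1 - u) * h_old  (u acts as an "update" gate)
# Form A with gate u is identical to form B with gate 1 - u, so a trained
# network can learn either convention.
rng = np.random.default_rng(0)
h_old, e, u = rng.random(4), rng.random(4), rng.random(4)

form_a = u * h_old + (1 - u) * e
form_b_relabelled = (1 - u) * e + (1 - (1 - u)) * h_old  # form B with gate 1-u

assert np.allclose(form_a, form_b_relabelled)
```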
Have you tried training your model with the sparsification method? How were the results?
Hi, thanks for such a good implementation of WaveRNN. I am working on integrating this WaveRNN implementation with Tacotron for TTS, and I got good results much faster than WaveNet, but still much slower than real time (10 s of audio in nearly 3 minutes or so).
Currently I see that this model gives around 1500 samples/sec on my GTX 1080 Ti. But in the WaveRNN paper they claim 96,000 samples/sec by optimising WaveRNN-896 for a P100 GPU, and they even show subscale running on a mobile CPU. Is it possible to optimise this WaveRNN to that level, so that we get real-time sampling at least on GPU?
I plan to merge WaveRNN with TTS - https://github.com/mozilla/TTS, if you don't mind.
hi,
how can I set the length of the generated files?
In my case they are always 10 s, the same as the samples the model was trained on.
I would rather have 2 s generated files.
Just for getting audio from text.
Hi @fatchord Great work on the code and architecture.
I was hoping if it's possible for you to share the pre-trained model so I can study the model on my end before training on my own dataset. Would be of great help, thanks!
# tail -f log.txt -n 5000
Initialising Model...
Trainable Parameters: 4.481M
Loading Model: "checkpoints/9bit_mulaw/latest_weights.pyt"
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [611,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [202,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [301,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [6,0,0], thread: [402,0,0] Assertion `t >= 0 && t < n_classes` failed.
| Epoch: 1/157 (345/3205) | Loss: 4.921 | 0.54 steps/s | Step: 0k | Traceback (most recent call last):
File "train.py", line 89, in <module>
train_loop(model, optimiser, train_set, test_set, lr)
File "train.py", line 43, in train_loop
loss.backward()
File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
It seems CUDA 10 is needed, but my current environment cannot support CUDA 10. Do you have any idea how to run this without CUDA 10?
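For what it's worth, the repeated `Assertion 't >= 0 && t < n_classes' failed` lines in that log usually indicate a target class index outside the model's output range (for instance a bit-depth mismatch between the quantised data and the output layer) rather than a CUDA-version problem; running on CPU typically surfaces a clearer error. A minimal reproduction, with assumed sizes:

```python
import torch
import torch.nn.functional as F

n_classes = 512                    # e.g. a 9-bit output layer (assumption)
logits = torch.randn(4, n_classes)
log_probs = F.log_softmax(logits, dim=1)

good = torch.tensor([0, 100, 300, 511])
F.nll_loss(log_probs, good)        # valid targets: no error

bad = torch.tensor([0, 100, 300, 512])  # 512 >= n_classes
try:
    F.nll_loss(log_probs, bad)     # on CPU this raises instead of asserting
except (IndexError, RuntimeError) as err:
    print("out-of-range target:", err)
```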
Hi, I managed to make the implementation of WaveRNN much faster by allowing it to use cuDNN's implementation of GRU:
https://github.com/mkotha/WaveRNN
The code is based on yours, although I have heavily modified it.
I'd like to make the above code publicly available, either dedicated to the public domain or released under an open source license. However, I realize that you didn't release your code under an open source license. Would it be possible for me to get permission to release the code?
Hi,
I get a strange error when I launch the preprocess.py file.
The folder with my wav files is found and no other files are contained there.
How can I get through this error?
/WaveRNN$ python preprocess.py
7690 wav files found in "/home/ubuntu/wav/"
Traceback (most recent call last):
File "preprocess.py", line 52, in
text_dict = ljspeech(path)
File "/home/ubuntu/WaveRNN/utils/text/recipes.py", line 9, in ljspeech
assert len(csv_file) == 1
AssertionError
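Judging from the traceback, the ljspeech recipe looks for a transcript CSV next to the audio. A sketch of what the failing check appears to do (the function name and error message are hypothetical reconstructions):

```python
from pathlib import Path

def find_metadata_csv(path):
    """Hypothetical reconstruction of the check that trips in recipes.py.

    The ljspeech recipe expects exactly one transcript .csv (LJSpeech ships
    a metadata.csv) alongside the wavs. A folder containing only wav files
    yields zero csv matches and fails the `len(csv_file) == 1` assert.
    """
    csv_file = list(Path(path).glob('*.csv'))
    assert len(csv_file) == 1, (
        f'expected exactly 1 metadata csv in {path}, found {len(csv_file)}')
    return csv_file[0]
```

So the error likely means the wav folder needs a single LJSpeech-style metadata CSV, or a different recipe for datasets without one.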
@fatchord, thanks for your work on this. The samples you have are fantastic, and the model converges really quickly.
How do I go about creating the mel as the input? Do I need to train another model that produces mels and pipe that as the input? Or should I be able to take any wav file, construct a mel, and pass that as the input?
Example result: https://soundcloud.com/user-565970875/ljspeech-logistic-wavernn
Here is the branch: https://github.com/erogol/WaveRNN/tree/mold if you'd like to try it. The model was trained with TTS spectrograms on the LJSpeech dataset. The models will be released soon.
@fatchord would you prefer to have the trained WaveRNN model here, or better to have a new repository for this?
Hi - I tried your alternate model and it worked well right away, so I am thankful for your work.
But I noticed the output of your melspectrogram() function often clips to 1.0 on LJSpeech data.
(Of course, it might be my bad implementation.)
It also seems the code is similar to keithito/tacotron. In Keith's version he later changed one line to
S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
in response to an issue reported by Rafael Valle. I wonder whether this difference was intentional or not (or maybe not relevant).
Thanks.
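To illustrate why subtracting ref_level_db matters, here is a sketch of the keithito-style normalisation; the ref_level_db and min_level_db values are that project's common defaults, assumed rather than taken from this repo:

```python
import numpy as np

ref_level_db, min_level_db = 20, -100  # assumed defaults

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def normalize(s):
    # Squash dB values into [0, 1]; anything above 0 dB clips at 1.0.
    return np.clip((s - min_level_db) / -min_level_db, 0, 1)

loud_frame = np.array([2.0])  # a linear-mel magnitude above 1

without_ref = normalize(amp_to_db(loud_frame))              # clips to 1.0
with_ref = normalize(amp_to_db(loud_frame) - ref_level_db)  # stays below 1.0
print(without_ref, with_ref)
```

Without the ref_level_db subtraction, loud frames exceed the normalisation range and get clipped, which matches the clipping-to-1.0 behaviour described above.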
Your samples sound great.
I think your samples were generated by your alternative model (9-bit), not by the original WaveRNN (16-bit).
Is that right?
Any hints on how to use NB2 to train a dataset (say a directory with multiple audio files) and then use the trained model to generate one of those samples?
Thank you in advance
I am companding my target waveform with mu-law before quantization. However, I am not sure whether I should expand the signal during the autoregressive generation, or leave it as is and expand it once the entire signal is generated.
I see that the restoration of the quantized signal happens here:
sample = torch.sign(sample) * (1 / (2 ** bits - 1)) * ((1 + (2 ** bits - 1)) ** torch.abs(sample) - 1)
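For context, that line has the form of mu-law expansion with mu = 2**bits - 1. A round-trip sketch of the companding pair in NumPy (the bit depth is an assumption):

```python
import numpy as np

mu = 2 ** 9 - 1  # assumed 9-bit companding

def mu_compress(x, mu):
    """Mu-law companding of a waveform in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu):
    """Inverse of mu_compress; same algebraic form as the line quoted above."""
    return np.sign(y) * (1.0 / mu) * ((1.0 + mu) ** np.abs(y) - 1.0)

x = np.linspace(-1, 1, 11)
assert np.allclose(mu_expand(mu_compress(x, mu), mu), x)  # exact round trip
```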
Thanks for this great implementation @fatchord! On a P100 I can generate only about 1600 samples/second i.e. much slower than real time. Is this expected or have I done something wrong?
It looks like this implementation is using 10 res blocks so maybe this is expected? Is there any way to make it 4x real time like the WaveRNN paper does?
Note I am talking only about the vocoder, not the tacotron part (i.e. mel spectrogram -> wav)
Could you please explain the pad variable and how it is helpful?
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
Why is randint used here?
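A sketch of what that line appears to be doing in the collate step, with illustrative sizes (seq_len, pad, and the mel lengths here are assumptions): randint picks a random start frame per item, so a different window of each mel/audio pair is seen every epoch.

```python
import numpy as np

seq_len, pad = 5, 2  # assumed: frames per training window, context padding
mels = [np.random.rand(80, 40), np.random.rand(80, 60)]  # (n_mels, n_frames)

# Latest start frame that still leaves room for a full window in each mel.
max_offsets = [m.shape[-1] - (seq_len + 2 * pad) for m in mels]

# The quoted line: a random start frame per item for crop diversity.
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]

crops = [m[:, o:o + seq_len + 2 * pad] for m, o in zip(mels, mel_offsets)]
print([c.shape for c in crops])  # every crop is (80, seq_len + 2*pad)
```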