
cookieppp / cookietts

44 stars · 7 watchers · repo size 223.83 MB

[Last Updated 2021] TTS from Cookie. Messy and experimental!

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 98.04%, Python 1.94%, HTML 0.02%, Batchfile 0.01%

Topics: tacotron2, waveglow, waveflow

cookietts's Introduction

This repo kinda works!

Check back in a week. Thanks.

Missing stuff:

  • Normalize transcripts before running Montreal Forced Aligner

Install/Setup:

Run git clone https://github.com/CookiePPP/cookietts.git

This will create a folder called cookietts in the directory where the command is run and clone the repo into it.

Run cd cookietts

This will move you into the cloned repo.

Run pip install -e .

This will 'install' the package in editable mode (without moving any files around).

Run pip install -r requirements.txt

This will install dependencies.

e.g. PyTorch, numpy, scipy, tensorboard, librosa.
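
To sanity-check the environment after these installs, a quick script like the one below (a minimal sketch, not part of the repo) confirms the core dependencies import and that CUDA is visible:

```python
# sanity_check.py -- hypothetical helper, not part of this repo.
# Confirms the core dependencies from requirements.txt import cleanly.
import torch
import numpy as np
import scipy
import librosa

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should be True on a GPU machine
print("numpy:", np.__version__)
print("scipy:", scipy.__version__)
print("librosa:", librosa.__version__)
```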

Install Nvidia/Apex.

Nvidia/Apex contains the LAMB optimizer and faster fused optimizers.

This also allows for fp16 (mixed-precision) training, which saves VRAM and runs faster on RTX cards.

(Please nag me if you cannot install Apex. PyTorch added native fp16 (mixed-precision) support some time ago, and I might be able to set that up as an alternative.)
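
For reference, the native alternative mentioned above looks roughly like this (a minimal sketch using torch.cuda.amp, available since PyTorch 1.6; model, criterion, optimizer, and dataloader are placeholders, and this is not wired into the repo's training scripts):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for batch, target in dataloader:   # dataloader, model, criterion, optimizer assumed to exist
    optimizer.zero_grad()
    with autocast():               # run the forward pass in mixed precision
        pred = model(batch)
        loss = criterion(pred, target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()                # adjusts the scale factor for the next iteration
```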

That should be the main stuff.

If something fails during preprocessing, check that you have ffmpeg and sox installed as well.

If something else fails, you can create an 'issue' at the top of the GitHub page.


Usage:

Head into the CookieTTS folder and read around.

If you want to train custom multispeaker models, then folders 0, 1, and 2 are of interest.

If you want to use an already-trained model to generate new speech, then folder 5 (_5_infer) is of interest.

cookietts's People

Contributors

cookieppp, datguy1, iamgoofball


cookietts's Issues

Warm start checkpoint for tacotron2_tm?

I tried warm-starting tacotron2_tm from the LibriTTS Mellotron checkpoint, and a lot of params were unexpected or the wrong size. Do you have a better checkpoint to warm-start from?

Guided attention usage while training

Hi,

I see that you have implemented a guided attention loss that forces the alignment toward a diagonal. Isn't this a rather lossy way of performing alignment? Instead, wouldn't it be better to pre-generate alignments from forced-alignment information and use those to compute the loss against the ground-truth alignment graph? For example, as in the paper https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8703406, "Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis".

Wouldn't this approach result in better alignment learning, since it is simpler and more robust than the diagonal-forcing approach? I have used pre-aligned attention before, and I could add a pre-alignment guided attention approach to this repo instead of diagonal forcing. What are your thoughts?
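
For context, the diagonal guided-attention loss under discussion boils down to something like the following (a generic PyTorch sketch in the spirit of Tachibana et al., 2017, not the exact implementation in this repo):

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """Penalize attention mass far from the diagonal.
    attn: [batch, mel_steps, text_steps] alignment matrix from the decoder.
    g controls how wide the tolerated band around the diagonal is."""
    B, T, N = attn.shape
    t = torch.arange(T, device=attn.device).float() / T  # decoder position in [0, 1)
    n = torch.arange(N, device=attn.device).float() / N  # encoder position in [0, 1)
    # W[t, n] is ~0 near the diagonal and approaches 1 far away from it
    W = 1.0 - torch.exp(-((n.unsqueeze(0) - t.unsqueeze(1)) ** 2) / (2 * g * g))
    return (attn * W.unsqueeze(0)).mean()
```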

Recalculate global mean (for DFR) upon warm start even if exists

Unless I'm mistaken, it would be correct to recalculate the global mean used for drop frame rate (DFR) when intending to train on a different dataset (such as when warm-starting).
Currently it is not recalculated, because the code detects that the file already exists and uses it. Perhaps it would just make sense to recalculate it every time.
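
A minimal sketch of the suggested behaviour (hypothetical names throughout; the repo's actual caching code differs):

```python
import os
import torch

def get_global_mean(dataloader, cache_path, force_recalc=False):
    """Per-bin global mel mean used for drop frame rate (DFR).
    force_recalc should be set when warm-starting on a new dataset,
    so a cache file from a previous run is not silently reused."""
    if os.path.exists(cache_path) and not force_recalc:
        return torch.load(cache_path)
    total, count = None, 0
    for batch in dataloader:            # batch["mel"]: [B, n_mel, T] (assumed layout)
        mel = batch["mel"]
        s = mel.sum(dim=(0, 2))         # sum over batch and time axes
        total = s if total is None else total + s
        count += mel.shape[0] * mel.shape[2]
    global_mean = total / count
    torch.save(global_mean, cache_path)
    return global_mean
```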

Teacher forcing?

Hey! I'm really curious how to get Hi-Fi GAN's first step working. I'd love any help whatsoever, but no worries if you aren't free:

Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.

I'm not sure how exactly to do this, since the specific steps aren't given. Is there some function I can feed audio inputs into and get .npy mel-spectrograms out of?

In searching for 'teacher' I found this function from Tacotron2: https://github.com/NVIDIA/tacotron2/blob/185cd24e046cc1304b4f8e564734d2498c6e2e6f/model.py#L291

It has the words 'teacher-forced' in it, but it's not clear whether it will generate the .npy files or not.
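
In broad strokes, the step looks like the sketch below. It mirrors the NVIDIA Tacotron2 API (parse_batch, and a forward() that teacher-forces by design); the loader yielding audio paths alongside the batch is an assumption, and padded frames would need trimming to the true output lengths in practice:

```python
import numpy as np
import torch

# Hypothetical sketch of HiFi-GAN's "teacher-forced mels" preprocessing step.
# `model` is a trained Tacotron2 and `dataloader` uses the same text/mel
# pipeline as training, so ground-truth frames are fed to the decoder.
model.eval()
with torch.no_grad():
    for batch, audio_paths in dataloader:          # loader assumed to also yield file paths
        x, y = model.parse_batch(batch)            # NVIDIA-style input packing
        mel_out, mel_out_postnet, _, _ = model(x)  # forward() is teacher-forced
        for mel, path in zip(mel_out_postnet, audio_paths):
            # file name must match the audio file, with a .npy extension
            np.save(path.replace(".wav", ".npy"), mel.cpu().numpy())
```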

pyloudnorm error on short files

In extremely short files, such as Combine_soldiers/vo/ion.wav (Duration: 00:00:00.28, bitrate: 706 kb/s), the default block_size of 0.400 in Meter isn't enough to pass pyloudnorm's valid_audio check.

  File "cookietts/CookieTTS/utils/dataset/data_utils.py", line 790,
    loudness = meter.integrated_loudness(audio.numpy()) # measure loudness (in d
  File "pyloudnorm/meter.py", li
    util.valid_audio(input_data, self.rate, self.block_size)
  File "pyloudnorm/util.py", lin
    raise ValueError("Audio must have length greater than the block size.")
ValueError: Audio must have length greater than the block size.
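
One possible guard (a sketch; shrinking block_size for short clips would change the measurement, so this version just skips them):

```python
import pyloudnorm as pyln

def safe_integrated_loudness(audio, sample_rate, block_size=0.400):
    """Measure integrated loudness, skipping clips shorter than one
    gating block. `audio` is a mono numpy array (assumed)."""
    if len(audio) <= int(block_size * sample_rate):
        return None  # too short for BS.1770 gating; caller decides a fallback
    meter = pyln.Meter(sample_rate, block_size=block_size)
    return meter.integrated_loudness(audio)
```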

TorchMoji Usage and Style Transfer

Thanks a lot for sharing such an awesome package. I have a query regarding the TorchMoji embedding being used. Using TorchMoji and EmotionNet together is an interesting combination.
The semi-supervised emotional VAE module learns a latent space over the emotions and allows projecting any text into any point of that latent space during inference.
For example, we can make an inherently sad sentence spoken in a very happy manner by choosing appropriate latent variables.
But if we also use the TorchMoji embedding, which is roughly a literal representation of the emotion in the text, then style transfer is affected adversely by a text embedding that points to a different emotion. So won't the two approaches actually work against each other?

Could you please elaborate on the interaction, or the combined effects on training and synthesis? Thanks a lot.

Using Bottleneck features

In the hparams file you have the line use_memory_bottleneck=True,# False baseline. You've indicated that for training a baseline we should set it to False, thereby repeating the auxiliary features onto every spectrogram frame. I just wanted to be sure: do I need to set it to False for baseline training? Thanks for the help.
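
For reference, flipping the quoted flag for a baseline run would look like this (only the use_memory_bottleneck name itself comes from the repo's hparams; the comment is my reading of it):

```python
# in the hparams file (sketch)
use_memory_bottleneck = False  # False = baseline: auxiliary features are
                               # repeated onto every spectrogram frame rather
                               # than passed through the bottleneck layer
```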

(Experimental Branch) error: identifier "aten_sigmoid_flat__1" is undefined

For the past week or two, I've been training in Google Colab using the experimental branch, and it's gone well. I do have to make a few changes to the code for it to function in Colab.

However, I tried to do some more training today, and I've run into an error that I can't figure out. It happened when I ran the training script on my own dataset, using the last checkpoint I had.

I'm training at a 44100 sampling rate, with hop size, window size, etc. adjusted accordingly. I had to adjust the n_speakers and decoder_rnn_dim, and turn off the second decoder, so that my old checkpoints would be compatible.

    train(args, args.rank, args.group_name, hparams)
  File "train.py", line 707, in train
    y_pred = force(model, valid_kwargs=model_args, **{**y, "teacher_force_till": teacher_force_till, "p_teacher_forcing": p_teacher_forcing, "drop_frame_rate": drop_frame_rate})
  File "/content/cookietts/CookieTTS/utils/_utils_.py", line 35, in force
    return func(*args, **{k:v for k,v in kwargs.items() if k in valid_kwargs})
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/content/cookietts/CookieTTS/_2_ttm/tacotron2_tm/model.py", line 1012, in forward
    return_hidden_state=return_hidden_state)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/cookietts/CookieTTS/_2_ttm/tacotron2_tm/model.py", line 835, in forward
    mel_output, gate_output, attention_weights, decoder_hidden_attention_context = self.decode(decoder_input, memory_lengths)
  File "/content/cookietts/CookieTTS/_2_ttm/tacotron2_tm/model.py", line 746, in decode
    decoderrnn_state = self.decoder_rnn(decoder_input, (decoder_hidden, decoder_cell))# lstmcell 12.789ms
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/cookietts/CookieTTS/utils/model/layers.py", line 386, in forward
    self.bias_ih, self.bias_hh,
RuntimeError: default_program(57): error: identifier "aten_sigmoid_flat__1" is undefined

default_program(58): error: no operator "=" matches these operands
            operand types are: half = float

default_program(64): error: identifier "aten_mul_flat__1" is undefined

default_program(65): error: no operator "=" matches these operands
            operand types are: half = float

4 errors detected in the compilation of "default_program".

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_3(half* t0, half* t1, half* aten_mul_flat, half* aten_sigmoid_flat, half* aten_mul_flat_1, half* aten_tanh_flat, half* aten_sigmoid_flat_1, half* prim_constantchunk_flat) {
{
  float v = __half2float(t1[((512 * blockIdx.x + threadIdx.x) % 1280 + 4 * (((512 * blockIdx.x + threadIdx.x) / 1280) * 1280)) + 3840]);
  prim_constantchunk_flat[512 * blockIdx.x + threadIdx.x] = __float2half(v);
  float t1_ = __half2float(t1[((512 * blockIdx.x + threadIdx.x) % 1280 + 4 * (((512 * blockIdx.x + threadIdx.x) / 1280) * 1280)) + 1280]);
  float aten_sigmoid_flat_ = __half2float(aten_sigmoid_flat[512 * blockIdx.x + threadIdx.x]);
  aten_sigmoid_flat__1 = __float2half(1.f / (1.f + (expf(0.f - t1_))));
  aten_sigmoid_flat[512 * blockIdx.x + threadIdx.x] = aten_sigmoid_flat_;
  float t1__1 = __half2float(t1[((512 * blockIdx.x + threadIdx.x) % 1280 + 4 * (((512 * blockIdx.x + threadIdx.x) / 1280) * 1280)) + 2560]);
  aten_tanh_flat[512 * blockIdx.x + threadIdx.x] = __float2half(tanhf(t1__1));
  float t1__2 = __half2float(t1[(512 * blockIdx.x + threadIdx.x) % 1280 + 4 * (((512 * blockIdx.x + threadIdx.x) / 1280) * 1280)]);
  aten_sigmoid_flat_1[512 * blockIdx.x + threadIdx.x] = __float2half(1.f / (1.f + (expf(0.f - t1__2))));
  float aten_mul_flat_ = __half2float(aten_mul_flat[512 * blockIdx.x + threadIdx.x]);
  aten_mul_flat__1 = __float2half((1.f / (1.f + (expf(0.f - t1_)))) * __half2float(t0[512 * blockIdx.x + threadIdx.x]));
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = aten_mul_flat_;
  aten_mul_flat_1[512 * blockIdx.x + threadIdx.x] = __float2half((1.f / (1.f + (expf(0.f - t1__2)))) * (tanhf(t1__1)));
}
}

Epoch::  46% 456/1000 [00:12<00:14, 37.47epoch/s]     
Iter:  :   0% 0/67 [00:11<?, ?iter/s]
/content/cookietts/CookieTTS/utils/torchmoji/model_def.py:193: UserWarning: This overload of nonzero is deprecated:
	nonzero()
Consider using one of the following signatures instead:
	nonzero(*, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  input_lengths = torch.LongTensor([torch.max(input_seqs[i, :].data.nonzero()) + 1 for i in range(input_seqs.size()[0])])
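
This looks like the PyTorch JIT CUDA fuser failing to compile the fused LSTMCell under fp16. One workaround worth trying (an assumption on my part, using private torch._C switches that exist in the 1.5-1.8 era builds Colab ships) is to disable the fuser before training starts, so the cell falls back to unfused eager kernels:

```python
import torch

# Workaround sketch: disable JIT fusion so the custom LSTMCell in
# CookieTTS/utils/model/layers.py runs with ordinary eager kernels.
# These are private, unstable APIs; they may not exist in other versions.
torch._C._jit_set_texpr_fuser_enabled(False)
torch._C._jit_override_can_fuse_on_gpu(False)
```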

Question about params in MTW configuration

I've read the paper and tried to understand the source code of the original repo, but I can't for the life of me understand what to set these lines to. I trained a multispeaker model at 22050 Hz with the rest of the hparams you'd expect for that sampling rate, but I set num_mels to 160, so I can't use the pretrained HiFi-GAN models.

Creating/training a new HiFi-GAN model seems to work well enough in the TensorBoard logs: for validation, the spectrograms look good and the audio is fairly close to ground truth. But whenever I try to infer speech, the result is indecipherable audio. What should I set these settings to in order to fix this?

Emotionet and Batchnormalization

Hi,

I was curious why you haven't used batch normalization between the convolution layers in EmotionNet. Is there a specific reason? I was just curious because it might bring more stability while training, right?

Thanks
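
For concreteness, the suggestion amounts to something like this (a generic PyTorch sketch, not the repo's actual EmotionNet code):

```python
import torch.nn as nn

def conv_bn_block(in_ch, out_ch, kernel_size=5):
    """Convolution followed by BatchNorm, as suggested above: the
    normalization keeps activations in a stable range between layers."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )
```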

About DiffSVC

DiffSVC's training code isn't here; do you have any idea where it can be found?

train.py now broken

If the first training iteration has a gradient overflow (and is skipped), then due to this change an UnboundLocalError: local variable 'average_loss' referenced before assignment is thrown.
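
A minimal sketch of the fix (training_step is a placeholder for the real loop body; the point is only that average_loss must be bound before any iteration can be skipped):

```python
average_loss = 0.0  # bind before the loop so a skipped first iteration
                    # cannot trigger UnboundLocalError when it is logged

for iteration, batch in enumerate(train_loader):   # train_loader assumed to exist
    loss, overflowed = training_step(batch)        # placeholder for the real step
    if overflowed:
        continue  # gradient overflow: skip; average_loss keeps its last value
    average_loss = 0.99 * average_loss + 0.01 * loss  # simple running average
```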
