
styletts's People

Contributors

artyom17, magicse, yl4579


styletts's Issues

monotonic_align import error

from monotonic_align import mask_from_lens

fails with: cannot import name "mask_from_lens" from "monotonic_align".
My monotonic_align version is 1.0.0.
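This may be a package mismatch rather than a version problem: the PyPI monotonic_align package does not ship a mask_from_lens helper, while the fork referenced in the author's related setup instructions does (this is an assumption about the intended dependency, not documented in this repo). A quick check:

# Sketch (assumption): install the fork that exposes mask_from_lens, e.g.
#   pip install git+https://github.com/resemble-ai/monotonic_align.git
import monotonic_align
print(hasattr(monotonic_align, "mask_from_lens"))  # True with the expected package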

Process for setting up other datasets

Hello there! I'm new to this speech synthesis stuff. What's the process for setting up a dataset like VCTK (downsampled to 24 kHz) and processing it into the proper format, like the LibriTTS example "filename.wav|transcription"?
"LibriTTS/train-clean-460/7169/89735/7169_89735_000071_000003.wav|wˈʌn mˌeɪd ˌʌp wˈʌnz mˈaɪnd ðə lˈoʊn wˈʊlf mˈʌst biː ɐ sˈɜːʔn̩ sˈɔːɹt ʌv mˈæn; ðə ɹˈɛst wʌz sˈɪmpli sˈɪftɪŋ fɹˈæns fɚðə mˈæn tə fˈɪt ðə θˈiəɹi, ænd ðˈɛn wˈɑːtʃɪŋ hˌɪm ʌntˈɪl hiː ɡˈeɪv hɪmsˈɛlf ɐwˈeɪ."
"
its not clear to me how i get the transciption from the text and set all of that up. If any one has set up a dataset like this one before and can help point me in the right direction that would be awesome! Thanks.
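For context, the IPA string in the example above looks like output from the espeak backend of phonemizer, the same backend loaded in the repo's notebooks as quoted in a later issue. A minimal sketch of producing such lines, assuming a simple list of (wav path, raw text) pairs; the paths, variable names, and the optional trailing speaker ID field are illustrative, not taken from the repo:

# Sketch: phonemize raw transcripts into "filename.wav|ipa text" lines.
# Assumes phonemizer and espeak-ng are installed; the metadata contents are made up.
import phonemizer

g2p = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True)

metadata = [("wavs/p225_001.wav", "Please call Stella.")]

with open("train_list.txt", "w", encoding="utf-8") as f:
    for wav_path, text in metadata:
        ipa = g2p.phonemize([text], strip=True)[0]
        f.write(f"{wav_path}|{ipa}\n")  # multi-speaker lists append "|speaker_id"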

Any-to-any and emotion examples

I'm trying to replicate your results. How did you create your any-to-any examples? Did you take the text of the "source" audio and synthesize it using the "reference" audio as a zero-shot style reference?

Similarly, in the emotion examples, were those also zero-shot where you just used the file from ESD as a reference to create the reference embedding?

Thanks!

finetune the vocoder

Hi @yl4579, thanks for your amazing work.
I want to finetune the HiFi-GAN model, but it seems the pretrained weights only include the generator. Could you also publish the discriminator? Thanks.

Why isn't "attention_weight" used in train_first.py?

Why isn't "attention_weight" used in train_first.py?
The value actually used is "alignment". In train_first.py line 153:
ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)

in layer.py:
attention_weights = F.softmax(alignment, dim=1)

attention_weights is the result after the softmax, but alignment is not.

Multi language support

Great work!
I found that when the text contains characters from another language, the synthesizer repeatedly reads out the language plus the letter. What should I pay attention to if I want to add other languages?

Finetuning on a small dataset?

Hi!
Thanks for the great work!
I trained a model on RyanSpeech, and it sounds great!
I have a small dataset of around 2 hours, so is it possible to finetune an existing model on such a small dataset?
Thanks!
And happy Chinese New Year!

Questions about the Evaluations

StyleTTS's ref mel requires a single audio clip as input, which may result in the style vector being similar only to that reference wav and somewhat different from other utterances of the same speaker. May I ask whether the ref mel you used for the evaluations was the ground truth, or randomly selected from the dataset during the evaluation stage?

Turns out your code doesn't join the wav paths read from train_list.txt with the dataset path (the location of train_list.txt)

Correct me if I'm wrong. The other issue I opened is actually a soundfile-related error (which I only realized when I downgraded the soundfile version).
So it can't just read "wavs/22.wav", because the wav_path in meldataset.py would have to include the directory containing the train_list.txt file. Do you see my point? The whole thing only works if the wavs folder is kept in the StyleTTS directory (with train_list.txt wherever the config file points to).

Why can't I use your mel to train the HiFi-GAN vocoder?

Hello, Yinghao Aaron Li

I used the mel from "torchaudio.transforms.MelSpectrogram" to test your pretrained HiFi-GAN model, and it works fine.
But when I use that mel to finetune your pretrained HiFi-GAN model, the synthesized wav is noise.

I also used your vocoder.py and the HiFi-GAN code from https://github.com/jik876/hifi-gan to train a model, and the synthesized wav is fine as well.

What's the difference in your training process? Did you change something?
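A common cause of this symptom is a mismatch in mel parameters or normalization between the features the TTS model produces and the features the vocoder is fine-tuned on. As a rough sanity check (the parameter values below are illustrative, not taken from this repo's config), one can compare the two mel front-ends on the same waveform before fine-tuning:

# Sketch: compare two mel front-ends on the same clip; parameter values are illustrative.
import torch
import torchaudio

wave, sr = torchaudio.load("sample.wav")  # hypothetical test clip

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=2048, win_length=1200, hop_length=300, n_mels=80)
mel_a = torch.log(1e-5 + to_mel(wave))  # one log-mel convention

# mel_b would come from the vocoder repo's own mel_spectrogram() on the same clip.
# If the log base, mean/std normalization, fmin/fmax, or hop length differ,
# the vocoder is fine-tuned on features the TTS decoder never outputs.
# print((mel_a - mel_b).abs().max())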

About training on a Vietnamese dataset

I'm really having trouble starting a project with a Vietnamese dataset. Can you describe in detail every task I need to do before starting with StyleTTS, and the format the training data should have so it looks like the files in your data folder (see the illustrative line below)?
Or like here?
[image: screenshot of an example training list]
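For reference, judging from the _load_tensor code quoted in a later issue (wave_path, text, speaker_id), each line of the training list appears to be pipe-separated: a wav path, the phonemized text, and a speaker ID. An illustrative line (the path, text placeholder, and ID are made up):

wavs/speaker01/utt_0001.wav|<phonemized text>|0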

I would really appreciate it if you could help me describe each step clearly

Thank you, and I hope you have a good day.

Using the single-GPU pretrained model for multi-GPU training

When I try to use the provided pretrained model and train with multiple GPUs, I get these key errors: "Missing key(s) in state_dict:", "Unexpected key(s) in state_dict:".
Is there any way to use the single-GPU pretrained model as a starting point for multi-GPU training?

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for MyDataParallel: Missing key(s) in state_dict: "module.text_encoder.lstms.0.weight_ih_l0", "module.text_encoder.lstms.0.weight_hh_l0", "module.text_encoder.lstms.0.bias_ih_l0", "module.text_encoder.lstms.0.bias_hh_l0", "module.text_encoder.lstms.0.weight_ih_l0_reverse", "module.text_encoder.lstms.0.weight_hh_l0_reverse", "module.text_encoder.lstms.0.bias_ih_l0_reverse", "module.text_encoder.lstms.0.bias_hh_l0_reverse", "module.text_encoder.lstms.1.fc.weight", "module.text_encoder.lstms.1.fc.bias", "module.text_encoder.lstms.2.weight_ih_l0", "module.text_encoder.lstms.2.weight_hh_l0", "module.text_encoder.lstms.2.bias_ih_l0", "module.text_encoder.lstms.2.bias_hh_l0", "module.text_encoder.lstms.2.weight_ih_l0_reverse", "module.text_encoder.lstms.2.weight_hh_l0_reverse", "module.text_encoder.lstms.2.bias_ih_l0_reverse", "module.text_encoder.lstms.2.bias_hh_l0_reverse", "module.text_encoder.lstms.3.fc.weight", "module.text_encoder.lstms.3.fc.bias", "module.text_encoder.lstms.4.weight_ih_l0", "module.text_encoder.lstms.4.weight_hh_l0", "module.text_encoder.lstms.4.bias_ih_l0", "module.text_encoder.lstms.4.bias_hh_l0", "module.text_encoder.lstms.4.weight_ih_l0_reverse", "module.text_encoder.lstms.4.weight_hh_l0_reverse", "module.text_encoder.lstms.4.bias_ih_l0_reverse", "module.text_encoder.lstms.4.bias_hh_l0_reverse", "module.text_encoder.lstms.5.fc.weight", "module.text_encoder.lstms.5.fc.bias", "module.lstm.weight_ih_l0", "module.lstm.weight_hh_l0", "module.lstm.bias_ih_l0", "module.lstm.bias_hh_l0", "module.lstm.weight_ih_l0_reverse", "module.lstm.weight_hh_l0_reverse", "module.lstm.bias_ih_l0_reverse", "module.lstm.bias_hh_l0_reverse", "module.duration_proj.linear_layer.weight", "module.duration_proj.linear_layer.bias", "module.shared.weight_ih_l0", "module.shared.weight_hh_l0", "module.shared.bias_ih_l0", "module.shared.bias_hh_l0", "module.shared.weight_ih_l0_reverse", "module.shared.weight_hh_l0_reverse", "module.shared.bias_ih_l0_reverse", "module.shared.bias_hh_l0_reverse", "module.F0.0.conv1.bias", "module.F0.0.conv1.weight_g", "module.F0.0.conv1.weight_v", "module.F0.0.conv2.bias", "module.F0.0.conv2.weight_g", "module.F0.0.conv2.weight_v", "module.F0.0.norm1.fc.weight", "module.F0.0.norm1.fc.bias", "module.F0.0.norm2.fc.weight", "module.F0.0.norm2.fc.bias", "module.F0.1.conv1.bias", "module.F0.1.conv1.weight_g", "module.F0.1.conv1.weight_v", "module.F0.1.conv2.bias", "module.F0.1.conv2.weight_g", "module.F0.1.conv2.weight_v", "module.F0.1.norm1.fc.weight", "module.F0.1.norm1.fc.bias", "module.F0.1.norm2.fc.weight", "module.F0.1.norm2.fc.bias", "module.F0.1.conv1x1.weight_g", "module.F0.1.conv1x1.weight_v", "module.F0.1.pool.bias", "module.F0.1.pool.weight_g", "module.F0.1.pool.weight_v", "module.F0.2.conv1.bias", "module.F0.2.conv1.weight_g", "module.F0.2.conv1.weight_v", "module.F0.2.conv2.bias", "module.F0.2.conv2.weight_g", "module.F0.2.conv2.weight_v", "module.F0.2.norm1.fc.weight", "module.F0.2.norm1.fc.bias", "module.F0.2.norm2.fc.weight", "module.F0.2.norm2.fc.bias", "module.N.0.conv1.bias", "module.N.0.conv1.weight_g", "module.N.0.conv1.weight_v", "module.N.0.conv2.bias", "module.N.0.conv2.weight_g", "module.N.0.conv2.weight_v", "module.N.0.norm1.fc.weight", "module.N.0.norm1.fc.bias", "module.N.0.norm2.fc.weight", "module.N.0.norm2.fc.bias", "module.N.1.conv1.bias", "module.N.1.conv1.weight_g", "module.N.1.conv1.weight_v", "module.N.1.conv2.bias", 
"module.N.1.conv2.weight_g", "module.N.1.conv2.weight_v", "module.N.1.norm1.fc.weight", "module.N.1.norm1.fc.bias", "module.N.1.norm2.fc.weight", "module.N.1.norm2.fc.bias", "module.N.1.conv1x1.weight_g", "module.N.1.conv1x1.weight_v", "module.N.1.pool.bias", "module.N.1.pool.weight_g", "module.N.1.pool.weight_v", "module.N.2.conv1.bias", "module.N.2.conv1.weight_g", "module.N.2.conv1.weight_v", "module.N.2.conv2.bias", "module.N.2.conv2.weight_g", "module.N.2.conv2.weight_v", "module.N.2.norm1.fc.weight", "module.N.2.norm1.fc.bias", "module.N.2.norm2.fc.weight", "module.N.2.norm2.fc.bias", "module.F0_proj.weight", "module.F0_proj.bias", "module.N_proj.weight", "module.N_proj.bias". Unexpected key(s) in state_dict: "text_encoder.lstms.0.weight_ih_l0", "text_encoder.lstms.0.weight_hh_l0", "text_encoder.lstms.0.bias_ih_l0", "text_encoder.lstms.0.bias_hh_l0", "text_encoder.lstms.0.weight_ih_l0_reverse", "text_encoder.lstms.0.weight_hh_l0_reverse", "text_encoder.lstms.0.bias_ih_l0_reverse", "text_encoder.lstms.0.bias_hh_l0_reverse", "text_encoder.lstms.1.fc.weight", "text_encoder.lstms.1.fc.bias", "text_encoder.lstms.2.weight_ih_l0", "text_encoder.lstms.2.weight_hh_l0", "text_encoder.lstms.2.bias_ih_l0", "text_encoder.lstms.2.bias_hh_l0", "text_encoder.lstms.2.weight_ih_l0_reverse", "text_encoder.lstms.2.weight_hh_l0_reverse", "text_encoder.lstms.2.bias_ih_l0_reverse", "text_encoder.lstms.2.bias_hh_l0_reverse", "text_encoder.lstms.3.fc.weight", "text_encoder.lstms.3.fc.bias", "text_encoder.lstms.4.weight_ih_l0", "text_encoder.lstms.4.weight_hh_l0", "text_encoder.lstms.4.bias_ih_l0", "text_encoder.lstms.4.bias_hh_l0", "text_encoder.lstms.4.weight_ih_l0_reverse", "text_encoder.lstms.4.weight_hh_l0_reverse", "text_encoder.lstms.4.bias_ih_l0_reverse", "text_encoder.lstms.4.bias_hh_l0_reverse", "text_encoder.lstms.5.fc.weight", "text_encoder.lstms.5.fc.bias", "lstm.weight_ih_l0", "lstm.weight_hh_l0", "lstm.bias_ih_l0", "lstm.bias_hh_l0", "lstm.weight_ih_l0_reverse", "lstm.weight_hh_l0_reverse", "lstm.bias_ih_l0_reverse", "lstm.bias_hh_l0_reverse", "duration_proj.linear_layer.weight", "duration_proj.linear_layer.bias", "shared.weight_ih_l0", "shared.weight_hh_l0", "shared.bias_ih_l0", "shared.bias_hh_l0", "shared.weight_ih_l0_reverse", "shared.weight_hh_l0_reverse", "shared.bias_ih_l0_reverse", "shared.bias_hh_l0_reverse", "F0.0.conv1.bias", "F0.0.conv1.weight_g", "F0.0.conv1.weight_v", "F0.0.conv2.bias", "F0.0.conv2.weight_g", "F0.0.conv2.weight_v", "F0.0.norm1.fc.weight", "F0.0.norm1.fc.bias", "F0.0.norm2.fc.weight", "F0.0.norm2.fc.bias", "F0.1.conv1.bias", "F0.1.conv1.weight_g", "F0.1.conv1.weight_v", "F0.1.conv2.bias", "F0.1.conv2.weight_g", "F0.1.conv2.weight_v", "F0.1.norm1.fc.weight", "F0.1.norm1.fc.bias", "F0.1.norm2.fc.weight", "F0.1.norm2.fc.bias", "F0.1.conv1x1.weight_g", "F0.1.conv1x1.weight_v", "F0.1.pool.bias", "F0.1.pool.weight_g", "F0.1.pool.weight_v", "F0.2.conv1.bias", "F0.2.conv1.weight_g", "F0.2.conv1.weight_v", "F0.2.conv2.bias", "F0.2.conv2.weight_g", "F0.2.conv2.weight_v", "F0.2.norm1.fc.weight", "F0.2.norm1.fc.bias", "F0.2.norm2.fc.weight", "F0.2.norm2.fc.bias", "N.0.conv1.bias", "N.0.conv1.weight_g", "N.0.conv1.weight_v", "N.0.conv2.bias", "N.0.conv2.weight_g", "N.0.conv2.weight_v", "N.0.norm1.fc.weight", "N.0.norm1.fc.bias", "N.0.norm2.fc.weight", "N.0.norm2.fc.bias", "N.1.conv1.bias", "N.1.conv1.weight_g", "N.1.conv1.weight_v", "N.1.conv2.bias", "N.1.conv2.weight_g", "N.1.conv2.weight_v", "N.1.norm1.fc.weight", "N.1.norm1.fc.bias", "N.1.norm2.fc.weight", 
"N.1.norm2.fc.bias", "N.1.conv1x1.weight_g", "N.1.conv1x1.weight_v", "N.1.pool.bias", "N.1.pool.weight_g", "N.1.pool.weight_v", "N.2.conv1.bias", "N.2.conv1.weight_g", "N.2.conv1.weight_v", "N.2.conv2.bias", "N.2.conv2.weight_g", "N.2.conv2.weight_v", "N.2.norm1.fc.weight", "N.2.norm1.fc.bias", "N.2.norm2.fc.weight", "N.2.norm2.fc.bias", "F0_proj.weight", "F0_proj.bias", "N_proj.weight", "N_proj.bias".

crashes during training

After starting training I am getting the following error, sometimes right away, sometimes after a few steps:

./aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                      
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                     
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                       
Traceback (most recent call last):                                                                                                                                                                                                                                    |
  File "/home/tts/StyleTTS/train_first.py", line 393, in <module>                       
    main()                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__                                                                                                                          
    return self.main(*args, **kwargs)                                                                                                                                                                           
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main                                                                                                                              
    rv = self.invoke(ctx)                                                                                                                                                                                       
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke                                                                                                                            
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                              
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke                                                                                                                             
    return __callback(*args, **kwargs)                                                                                                                                                                          
  File "/home/tts/StyleTTS/train_first.py", line 149, in main                                                                                                                                                   
    ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)                                                                                                                                       
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                                                        
  File "/home/tts/StyleTTS/Utils/ASR/models.py", line 45, in forward                                                                                                                                            
    _, s2s_logit, s2s_attn = self.asr_s2s(x, src_key_padding_mask, text_input)                                                                                                                                  
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                                                        
  File "/home/tts/StyleTTS/Utils/ASR/models.py", line 130, in forward                                                                                                                                           
    print(f"... {text_input} {decoder_inputs.size(1)}")                                                                                                                                                         
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__                                                                                                                      
    return object.__format__(self, format_spec)                                                                                                                                                                 
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__                                                                                                                        
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)                                                                                                                                        
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str                                                                                                                        
    return _str_intern(self, tensor_contents=tensor_contents)                                                                                                                                                   
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern                                                                                                                 
    tensor_str = _tensor_str(self, indent)                                                                                                                                                                      
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 327, in _tensor_str                                                                                                                 
    formatter = _Formatter(get_summarized_data(self) if summarize else self)                                                                                                                                    
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 111, in __init__                                                                                                                    
    value_str = "{}".format(value)                                                                                                                                                                  
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 872, in __format__                                                                                                                     
    return self.item().__format__(fo

Need help for training

I'm pretraining your model on the vivo dataset (Vietnamese), but the results are not what I expected.
Here is the original audio:
https://drive.google.com/file/d/12mZdg8yVhgQj35Vt3thoxIK44_jWWaCJ/view?usp=sharing
and here is the result:
https://drive.google.com/file/d/1UOuUHrxiR5DvF1MrpccMZ6bdwKyfO2AE/view?usp=sharing

This is the loss during training stage 1 and stage 2:
[images: stage 1 and stage 2 loss curves]

P.S.: I used the ASR from the original paper to train on Vietnamese. I wonder if that causes any problems, because during training stage 1 there were quite a lot of KeyError errors.
Thank you very much

Adding BigVGAN as Vocoder

Hey, I'm trying to add my BigVGAN vocoder model to the inference script, but when it generates audio it always has a lot of noise compared to the inference script of the original BigVGAN code base. Any ideas why that could be? It looks to be the same setup as HiFi-GAN. https://github.com/NVIDIA/BigVGAN. If you would like one of my trained models, let me know and I'll give you a download link so you can test with it, as there are currently no publicly available models.

Thanks in advance!

%cd /content/BigVGAN

from __future__ import absolute_import, division, print_function, unicode_literals
#import sys
#sys.path.append("./content/BigVGAN")
import glob
import os
import argparse
import json

import torch
import numpy as np
from scipy.io.wavfile import write
from env import AttrDict
from meldataset1 import mel_spectrogram, MAX_WAV_VALUE
from models1 import BigVGAN as Generator
import librosa

torch.backends.cudnn.benchmark = False

def load_checkpoint(filepath, device):
    assert os.path.isfile(filepath)
    print("Loading '{}'".format(filepath))
    checkpoint_dict = torch.load(filepath, map_location=device)
    print("Complete.")
    return checkpoint_dict


def scan_checkpoint(cp_dir, prefix):
    pattern = os.path.join(cp_dir, prefix + '*')
    cp_list = glob.glob(pattern)
    if len(cp_list) == 0:
        return ''
    return sorted(cp_list)[-1]
    
cp_g = scan_checkpoint("configs", 'g_001')

config_file = os.path.join(os.path.split(cp_g)[0], 'bigvgan_24khz_100band.json') #actually 80-band to work with the StyleTTS model
with open(config_file) as f:
    data = f.read()


json_config = json.loads(data)
h = AttrDict(json_config)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # 'device' was undefined in the snippet as posted
generator = Generator(h).to(device)
state_dict_g = load_checkpoint(cp_g, device)

generator.load_state_dict(state_dict_g['generator'])
generator.eval()
generator.remove_weight_norm()
%cd /content/StyleTTS
import time
converted_samples = {}
start_time = time.time()

input_length = torch.LongTensor([tokens.shape[-1]]).to(device)
mask = length_to_mask(input_length).to(device)
with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    m = length_to_mask(input_lengths).to(device)
    t_en = model.text_encoder(tokens, input_lengths, m)
    for key, (ref, _) in reference_embeddings.items():
        s = ref.squeeze(1).to(device)
        style = s
        d = model.predictor.text_encoder(t_en, style, input_lengths, m)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data), device=device)
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0))
        style = s.expand(en.shape[0], en.shape[1], -1)

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0)), F0_pred, N_pred, ref.squeeze().unsqueeze(0))

        audio_signal = out.cpu().numpy().squeeze()

        #Apply the Mel Spectrogram transformation
        mel_spectrogram = librosa.feature.melspectrogram(y=audio_signal, sr=24000, n_fft=1024, hop_length=256, n_mels=80, win_length=1024)

        #Convert the Mel Spectrogram to decibels
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        out1 = torch.FloatTensor(mel_spectrogram_db).to(device)
        y_g_hat = generator(out)  # note: this feeds the decoder output `out` directly; `out1` above is never used
        y_out = y_g_hat.squeeze()
        y_out1 = y_out * MAX_WAV_VALUE
        y_out2 = y_out1.cpu().numpy()

        converted_samples[key] = y_out2

end_time = time.time()
print("Time taken: ", end_time - start_time, "seconds")

I also tried the original approach below, with the same result: grainy/noisy audio.

y_g_hat = generator(out)
y_out = y_g_hat.squeeze().cpu().numpy()
   
converted_samples[key] = y_out

RuntimeError: espeak not installed on your system

I have installed phonemizer, but I am receiving an error about a missing espeak. I have also tried downloading espeak from its website and adding it to the environment variables, but I am still receiving this error. I am also receiving another error about a Unicode decode failure.

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18492\1790917207.py in
1 # load phonemizer
2 import phonemizer
----> 3 global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)

~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\espeak.py in __init__(self, language, punctuation_marks, preserve_punctuation, with_stress, tie, language_switch, words_mismatch, logger)
45 super().__init__(
46 language, punctuation_marks=punctuation_marks,
---> 47 preserve_punctuation=preserve_punctuation, logger=logger)
48
49 self._espeak.set_voice(language)

~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\base.py in __init__(self, language, punctuation_marks, preserve_punctuation, logger)
41 punctuation_marks=punctuation_marks,
42 preserve_punctuation=preserve_punctuation,
---> 43 logger=logger)
44
45 self._espeak = EspeakWrapper()

~\anaconda3.1\lib\site-packages\phonemizer\backend\base.py in __init__(self, language, punctuation_marks, preserve_punctuation, logger)
76 if not self.is_available():
77 raise RuntimeError( # pragma: nocover
---> 78 '{} not installed on your system'.format(self.name()))
79
80 self._logger = logger

RuntimeError: espeak not installed on your system

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_15548\494391268.py in
3 train_path = config.get('train_data', None)
4 val_path = config.get('val_data', None)
----> 5 train_list, val_list = get_data_path_list(train_path, val_path)
6
7 ref_dicts = {}

~\Desktop\github\StyleTTS\utils.py in get_data_path_list(train_path, val_path)
33
34 with open(train_path, 'r') as f:
---> 35 train_list = f.readlines()
36 with open(val_path, 'r') as f:
37 val_list = f.readlines()

~\anaconda3.1\lib\encodings\cp1250.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 39: character maps to <undefined>
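Two commonly suggested workarounds for these symptoms on Windows (both hedged; the DLL path is just the default espeak-ng install location and may differ): point phonemizer at the espeak-ng library explicitly, and open the file lists as UTF-8 so the Windows default cp1250 codec is not used:

# Sketch for Windows setups; the DLL path is an assumption about a default install.
import os
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"

import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True)

# For the UnicodeDecodeError, read the train/val lists as UTF-8 explicitly
# (utils.py opens them without an encoding, so Windows falls back to cp1250):
with open("Data/train_list.txt", "r", encoding="utf-8") as f:  # path illustrative
    train_list = f.readlines()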

Phoneme sequence padding

Hi, I'm trying to train StyleTTS with custom data, and I'm a little confused about how the phoneme sequence is padded.

When pre-training AuxiliaryASR, the _load_tensor function in meldataset.py pads the phoneme sequence with blank_index.

   def _load_tensor(self, data):
        wave_path, text, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)

        # phonemize the text
        ps = self.g2p(text.replace('-', ' '))
        if "'" in ps:
            ps.remove("'")
        text = self.text_cleaner(ps)
        blank_index = self.text_cleaner.word_index_dictionary[" "]
        text.insert(0, blank_index) # add a blank at the beginning (silence)
        text.append(blank_index) # add a blank at the end (silence)
        
        text = torch.LongTensor(text)

        return wave, text, speaker_id

But in StyleTTS's meldataset.py, the _load_tensor function just pads the phoneme sequence with 0.

    def _load_tensor(self, data):
        wave_path, text, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)
        if wave.shape[-1] == 2:
            wave = wave[:, 0].squeeze()
        if sr != 24000:
            wave = librosa.resample(wave, sr, 24000)
            print(wave_path, sr)
            
        wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)
        
        text = self.text_cleaner(text)
        
        text.insert(0, 0)
        text.append(0)
        
        text = torch.LongTensor(text)

        return wave, text, speaker_id

The questions are:

  1. If I pre-train ASR with blank_index, should I keep it that way in StyleTTS?
  2. If the input wavs are well-trimmed (with no silence at the beginning and end), should I still pad the phoneme sequence?

Thank you.
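Regarding question 1, whether the two conventions coincide depends on what index the space symbol gets in the symbol dictionary, which can be checked directly (a minimal sketch; text_cleaner is assumed to be an instance of the repo's TextCleaner, as in the quoted code):

# Sketch: check whether padding with 0 and padding with blank_index are the same thing
# for the symbol dictionary in use (names follow the quoted _load_tensor code).
blank_index = text_cleaner.word_index_dictionary[" "]
print(blank_index == 0)  # if False, the two loaders pad with different symbols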

Finetuning

Hi,

Is it possible to finetune this model?

Thanks

Sounds weird when the input is short

Hi, I find that the model produces really great outputs when synthesizing sentences of medium or long length. However, when synthesizing short sentences or words, the audio sounds weird. For example, I used the pretrained LibriTTS model to synthesize the word “people” (IPA: pˈiːpəl). No matter what reference audio I gave, long or short, the output was weird, especially at the end of the word. This phenomenon occurred when the input was short. I attached an example audio here people.wav.zip. Do you have any methods to solve or alleviate this problem? Thank you very much!

Multilingual Training [Question]

Hello, would it generally be possible to train StyleTTS on a multilingual dataset, e.g. by additionally conditioning the text encoder with a language embedding?

Running train_first.py raises an error

(demo) C:\Users\Administrator\StyleTTS>python train_first.py --config_path ./Configs/config.yml
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
pitch_extractor loaded
text_encoder loaded
style_encoder loaded
text_aligner loaded
discriminator loaded
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\Administrator\StyleTTS\train_first.py:393 in │
│ │
│ 390 │ torch.save(state, save_path) │
│ 391 │
│ 392 if __name__=="__main__": │
│ ❱ 393 │ main() │
│ 394 │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1157 in __call__
│ │
│ 1154 │ │
│ 1155 │ def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any: │
│ 1156 │ │ """Alias for :meth:`main`.""" │
│ ❱ 1157 │ │ return self.main(*args, **kwargs) │
│ 1158 │
│ 1159 │
│ 1160 class Command(BaseCommand): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1078 in main │
│ │
│ 1075 │ │ try: │
│ 1076 │ │ │ try: │
│ 1077 │ │ │ │ with self.make_context(prog_name, args, **extra) as ctx: │
│ ❱ 1078 │ │ │ │ │ rv = self.invoke(ctx) │
│ 1079 │ │ │ │ │ if not standalone_mode: │
│ 1080 │ │ │ │ │ │ return rv │
│ 1081 │ │ │ │ │ # it's not safe to ctx.exit(rv) here! │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1434 in invoke │
│ │
│ 1431 │ │ │ echo(style(message, fg="red"), err=True) │
│ 1432 │ │ │
│ 1433 │ │ if self.callback is not None: │
│ ❱ 1434 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ 1435 │ │
│ 1436 │ def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]: │
│ 1437 │ │ """Return a list of completions for the incomplete value. Looks │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:783 in invoke │
│ │
│ 780 │ │ │
│ 781 │ │ with augment_usage_errors(__self): │
│ 782 │ │ │ with ctx: │
│ ❱ 783 │ │ │ │ return __callback(*args, **kwargs) │
│ 784 │ │
│ 785 │ def forward( │
│ 786 │ │ __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any # noqa: B902 │
│ │
│ C:\Users\Administrator\StyleTTS\train_first.py:140 in main │
│ │
│ 137 │ │ │
│ 138 │ │ _ = [model[key].train() for key in model] │
│ 139 │ │ │
│ ❱ 140 │ │ for i, batch in enumerate(train_dataloader): │
│ 141 │ │ │ │
│ 142 │ │ │ batch = [b.to(device) for b in batch] │
│ 143 │ │ │ texts, input_lengths, mels, mel_input_length = batch │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:633 │
│ in __next__
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:134 │
│ 5 in _next_data │
│ │
│ 1342 │ │ │ │ self._task_info[idx] += (data,) │
│ 1343 │ │ │ else: │
│ 1344 │ │ │ │ del self._task_info[idx] │
│ ❱ 1345 │ │ │ │ return self._process_data(data) │
│ 1346 │ │
│ 1347 │ def _try_put_index(self): │
│ 1348 │ │ assert self._tasks_outstanding < self._prefetch_factor * self._num_workers │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:137 │
│ 1 in _process_data │
│ │
│ 1368 │ │ self._rcvd_idx += 1 │
│ 1369 │ │ self._try_put_index() │
│ 1370 │ │ if isinstance(data, ExceptionWrapper): │
│ ❱ 1371 │ │ │ data.reraise() │
│ 1372 │ │ return data │
│ 1373 │ │
│ 1374 │ def _mark_worker_as_unavailable(self, worker_id, shutdown=False): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\_utils.py:644 in reraise │
│ │
│ 641 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 642 │ │ │ # instantiate since we don't know how to │
│ 643 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 644 │ │ raise exception │
│ 645 │
│ 646 │
│ 647 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
LibsndfileError: <exception str() failed>

s2s loss becomes NaN

Thanks for your work. I tried to train stage 1 with a multi-speaker Chinese dataset, about 420+ hours. The training process is normal until, at some step, the s2s loss becomes NaN; after that the parameters of the text aligner are NaN too.
What can I do to avoid this problem?
I tried to add an epsilon to _s2s_pred and set the loss to 0, but the parameters are still NaN.
[image: training log screenshot showing the NaN loss]

[Q] Training all the components together

Hey, thanks for releasing the code. I came across this after reading the paper. I just wonder if you ever tried training the whole model together, and how it performed compared to the two-stage approach. Maybe it is just me missing it in the paper, but I don't see a clear comparison, although everything else is quite clear, especially with the Appendix. Thanks again.

error while running train_second.py (caused by size mismatch)

Traceback (most recent call last):
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 494, in
main()
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 220, in main
bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/transformers/models/albert/modeling_albert.py", line 724, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (584) must match the existing size (512) at non-singleton dimension 1. Target sizes: [12, 584]. Tensor sizes: [1, 512]
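The traceback indicates the phoneme sequence in this batch (584 tokens) exceeds the text encoder BERT's maximum position embeddings (512), so batches containing very long utterances overflow the encoder. A hedged sketch of a guard one could add before the bert call (names follow the traceback; the limit is read from the model config rather than hard-coded, and truncation is only a crude stopgap — filtering or splitting long utterances in the dataset is cleaner):

# Sketch: guard against texts longer than the BERT encoder's position limit.
max_pos = model.bert.config.max_position_embeddings  # 512 for this checkpoint
if texts.size(1) > max_pos:
    texts = texts[:, :max_pos]
    text_mask = text_mask[:, :max_pos]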

about ASR model

If we do not use the pretrained ASR checkpoint and just let the ASR model update its parameters together with the TTS model, can we still get a good TTS result?

Pause between sentences?

As the title suggests, how would I add a pause in between multiple sentences, for example after a full stop?

Problem with data processing

I found this line of code in the meldataset.py file and I was curious about what it does. Why does the wav need to be padded like this?
wave = torch.cat([torch.zeros([5000]), wave, torch.zeros([5000])], axis=0)
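For scale, at the repo's 24 kHz sampling rate this prepends and appends roughly 0.2 s of silence:

pad_seconds = 5000 / 24000  # ≈ 0.208 s of zeros added on each side of the waveform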

Duration predictor training is really slow.

I observe very slow progress with the duration loss in the second stage of training. Is this expected, or can you think of any issue that might be causing it?

For each epoch, the eval loss is going 2.21 -> 2.20 -> 2.18 ... whereas the F0 loss converged very quickly.

BTW I am using VCTK + LibriTTS.

I also tried reducing the dropout to 0.1 for the duration projection layer but didn't help.

Training the model end2end

This is just a heads-up about the discussion in #7.

I tried training the model end-to-end in different ways, but the F0 and energy predictors were always underfitting, even though the eval loss kept going down. They were never able to predict useful values for inference.

Here is roughly my forward pass. I can also share the branch if useful. Happy to see any feedback.

    @typechecked
    def forward_all(
        self,
        texts: TensorType["B", "T_text"],
        input_lengths: TensorType["B"],
        mels: TensorType["B", "C_mel", "T_mel"],
        mel_input_length: TensorType["B"],
        F0_real: TensorType["B", 1, "T_mel"],
    ):
        # TODO: use Pitch Extractor (maybe torchcrepe)
        # mask = length_to_mask(mel_input_length // (2 ** model.text_aligner.n_down)).to(self.device)
        text_mask = self.lengths_to_mask(input_lengths).to(self.device)
        mel_mask = self.lengths_to_mask(mel_input_length).to(self.device)

        ##### --> TEXT ENCODER
        t_en, t_emb = self.text_encoder(texts, input_lengths, length_to_mask(input_lengths))

        ##### --> ALIGNER
        _, aligner_soft, aligner_logprob, aligner_hard = self._forward_aligner(
            x=t_emb.detach().transpose(1, 2),
            y=mels,
            x_mask=text_mask,
            y_mask=mel_mask,
            attn_priors=None,
        )

        ##### --> EXPAND
        t_en_ex = t_en @ aligner_hard.squeeze(1)

        ##### --> PRUNE THE BATCH BY THE SHORTEST MEL LENGTH
        mel_len = int(mel_input_length.min().item())
        t_en_ex_clipped = []
        mels_clipped = []
        F0s = []
        idxs = []
        for bib in range(len(mel_input_length)):
            mel_length = int(mel_input_length[bib].item()) + 1

            random_start = np.random.randint(0, mel_length - mel_len)
            idxs.append(random_start)
            t_en_ex_clipped.append(t_en_ex[bib, :, random_start : random_start + mel_len])
            mels_clipped.append(mels[bib, :, random_start : random_start + mel_len])
            F0s.append(F0_real[bib, :, random_start : random_start + mel_len])

        t_en_ex_clipped = torch.stack(t_en_ex_clipped)
        mels_clipped = torch.stack(mels_clipped).detach()
        F0_real = torch.stack(F0s).detach()

        ##### --> CALCULATE REAL ENERGY
        N_real = log_norm(mels_clipped.unsqueeze(1)).squeeze(1).detach()
        # F0_real, _, _ = self.pitch_extractor(gt.unsqueeze(1))

        ##### --> STYLE ENCODER
        s = self.style_encoder(mels_clipped.unsqueeze(1))

        ##### --> DURATION PREDICTOR & PROSODY ENCODER
        dur_aligner = aligner_hard.sum(axis=-1).detach()
        dur_pred, prosody_en_ex = self.predictor(
            t_en.detach(), s.detach(), input_lengths, aligner_hard.squeeze(1), length_to_mask(input_lengths)
        )  # [B, T_en]
        d_align_mask = self.lengths_to_mask(input_lengths) * self.lengths_to_mask(mel_input_length).transpose(
            1, 2
        )  # [B, 1, T_enc] * [B, T_dec, 1]
        dur_alignment = generate_path(dur_pred, d_align_mask.squeeze(1).transpose(1, 2)).detach()  # [B, T_dec, T_enc]

        # Strip prosody tensor to match the mel length
        p_en = []
        for bib, start in enumerate(idxs):
            p_en.append(prosody_en_ex[bib, :, start : (start + mel_len)])
        p_en = torch.stack(p_en)

        ##### --> Pitch and Energy Predictor
        F0_fake, N_fake = self.predictor.F0Ntrain(p_en, s.detach())

        ##### --> DECODER
        mel_rec = self.decoder(t_en_ex_clipped, F0_real.squeeze(1), N_real, s)
        return {
            "mel_rec": mel_rec,
            "gt": mels_clipped,
            "aligner_logprob": aligner_logprob,
            "aligner_hard": aligner_hard.squeeze(1),
            "aligner_soft": aligner_soft,
            "F0_real": F0_real.squeeze(1),
            "F0_fake": F0_fake,
            "N_real": N_real,
            "N_fake": N_fake,
            "d": dur_pred,
            "d_gt": dur_aligner.squeeze(1),
            "d_alignment": dur_alignment,
        }
