
styletts's People

Contributors

artyom17, magicse, yl4579


styletts's Issues

monotonic_align import error

from monotonic_align import mask_from_lens

fails with: cannot import name "mask_from_lens" from "monotonic_align".
My monotonic_align version is 1.0.0.
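This may be a package mismatch rather than a version problem: the PyPI monotonic_align package does not ship a mask_from_lens helper, while the fork referenced in the author's related setup instructions does (this is an assumption about the intended dependency, not documented in this repo). A quick check:

# Sketch (assumption): install the fork that exposes mask_from_lens, e.g.
#   pip install git+https://github.com/resemble-ai/monotonic_align.git
import monotonic_align
print(hasattr(monotonic_align, "mask_from_lens"))  # True with the expected package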

Process for setting up other datasets

Hello there! I'm new to this speech synthesis stuff. What's the process for setting up a dataset like VCTK (downsampled to 24 kHz) and processing it into the proper format, like the LibriTTS example "filename.wav|transcription"?
"LibriTTS/train-clean-460/7169/89735/7169_89735_000071_000003.wav|wˈʌn mˌeɪd ˌʌp wˈʌnz mˈaɪnd ðə lˈoʊn wˈʊlf mˈʌst biː ɐ sˈɜːʔn̩ sˈɔːɹt ʌv mˈæn; ðə ɹˈɛst wʌz sˈɪmpli sˈɪftɪŋ fɹˈæns fɚðə mˈæn tə fˈɪt ðə θˈiəɹi, ænd ðˈɛn wˈɑːtʃɪŋ hˌɪm ʌntˈɪl hiː ɡˈeɪv hɪmsˈɛlf ɐwˈeɪ."
"
its not clear to me how i get the transciption from the text and set all of that up. If any one has set up a dataset like this one before and can help point me in the right direction that would be awesome! Thanks.
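For context, the IPA string in the example above looks like output from the espeak backend of phonemizer, the same backend loaded in the repo's notebooks as quoted in a later issue. A minimal sketch of producing such lines, assuming a simple list of (wav path, raw text) pairs; the paths, variable names, and the optional trailing speaker ID field are illustrative, not taken from the repo:

# Sketch: phonemize raw transcripts into "filename.wav|ipa text" lines.
# Assumes phonemizer and espeak-ng are installed; the metadata contents are made up.
import phonemizer

g2p = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True)

metadata = [("wavs/p225_001.wav", "Please call Stella.")]

with open("train_list.txt", "w", encoding="utf-8") as f:
    for wav_path, text in metadata:
        ipa = g2p.phonemize([text], strip=True)[0]
        f.write(f"{wav_path}|{ipa}\n")  # multi-speaker lists append "|speaker_id"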

Any-to-any and emotion examples

I'm trying to replicate your results. How did you create your any-to-any examples? Did you take the text of the "source" audio and synthesize it using the "reference" audio as a zero-shot style reference?

Similarly, in the emotion examples, were those also zero-shot where you just used the file from ESD as a reference to create the reference embedding?

Thanks!

finetune the vocoder

Hi @yl4579, thanks for your amazing work.
I want to finetune the HiFi-GAN model, but it seems the pretrained weights only include the generator. Could you also publish the discriminator? Thanks.

Why isn't "attention_weight" used in train_first.py?

Why isn't "attention_weight" used in train_first.py?
The value actually used is "alignment". In train_first.py line 153:
ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)

in layer.py:
attention_weights = F.softmax(alignment, dim=1)

attention_weights is the result after the softmax, but alignment is not.

Multi language support

Great work!
I found that when the text contains characters from another language, the synthesizer repeatedly reads out the language plus the letter. What should I pay attention to if I want to add other languages?

Finetuning on a small dataset?

Hi!
Thanks for the great work!
I trained a model on RyanSpeech, and it sounds great!
I have a small dataset of around 2 hours, so is it possible to finetune an existing model on such a small dataset?
Thanks!
And happy Chinese New Year!

Questions about the Evaluations

StyleTTS's ref mel requires a single audio clip as input, which may result in the style vector being similar only to that reference wav and somewhat different from other utterances of the same speaker. May I ask whether the ref mel you used for the evaluations was the ground truth, or randomly selected from the dataset during the evaluation stage?

Turns out your code doesn't join the wav paths read from train_list.txt with the dataset path (the location of train_list.txt)

Correct me if I'm wrong. The other issue I opened is actually a soundfile-related error (which I only realized when I downgraded the soundfile version).
So it can't just read "wavs/22.wav", because the wav_path in meldataset.py would have to include the directory containing the train_list.txt file. Do you see my point? The whole thing only works if the wavs folder is kept in the StyleTTS directory (with train_list.txt wherever the config file points to).

Why can't I use your mel to train the HiFi-GAN vocoder?

Hello, Yinghao Aaron Li

I used the mel from "torchaudio.transforms.MelSpectrogram" to test your pretrained HiFi-GAN model, and it works fine.
But when I use that mel to finetune your pretrained HiFi-GAN model, the synthesized wav is noise.

I also used your vocoder.py and the HiFi-GAN code from https://github.com/jik876/hifi-gan to train a model, and the synthesized wav is fine as well.

What's the difference in your training process? Did you change something?
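A common cause of this symptom is a mismatch in mel parameters or normalization between the features the TTS model produces and the features the vocoder is fine-tuned on. As a rough sanity check (the parameter values below are illustrative, not taken from this repo's config), one can compare the two mel front-ends on the same waveform before fine-tuning:

# Sketch: compare two mel front-ends on the same clip; parameter values are illustrative.
import torch
import torchaudio

wave, sr = torchaudio.load("sample.wav")  # hypothetical test clip

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=2048, win_length=1200, hop_length=300, n_mels=80)
mel_a = torch.log(1e-5 + to_mel(wave))  # one log-mel convention

# mel_b would come from the vocoder repo's own mel_spectrogram() on the same clip.
# If the log base, mean/std normalization, fmin/fmax, or hop length differ,
# the vocoder is fine-tuned on features the TTS decoder never outputs.
# print((mel_a - mel_b).abs().max())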

About training on a Vietnamese dataset

I'm really having trouble starting a project with a Vietnamese dataset. Can you describe in detail every task I need to do before starting with StyleTTS, and the format the training data should have so it looks like the files in your data folder (see the illustrative line below)?
Or like here?
[image: screenshot of an example training list]
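For reference, judging from the _load_tensor code quoted in a later issue (wave_path, text, speaker_id), each line of the training list appears to be pipe-separated: a wav path, the phonemized text, and a speaker ID. An illustrative line (the path, text placeholder, and ID are made up):

wavs/speaker01/utt_0001.wav|<phonemized text>|0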

I would really appreciate it if you could help me describe each step clearly

Thank you, and I hope you have a good day.

Using the single-GPU pretrained model for multi-GPU training

When I try to use the provided pretrained model and train with multiple GPUs, I get these key errors: "Missing key(s) in state_dict:", "Unexpected key(s) in state_dict:".
Is there any way to use the single-GPU pretrained model as a starting point for multi-GPU training?

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for MyDataParallel: Missing key(s) in state_dict: "module.text_encoder.lstms.0.weight_ih_l0", "module.text_encoder.lstms.0.weight_hh_l0", "module.text_encoder.lstms.0.bias_ih_l0", "module.text_encoder.lstms.0.bias_hh_l0", "module.text_encoder.lstms.0.weight_ih_l0_reverse", "module.text_encoder.lstms.0.weight_hh_l0_reverse", "module.text_encoder.lstms.0.bias_ih_l0_reverse", "module.text_encoder.lstms.0.bias_hh_l0_reverse", "module.text_encoder.lstms.1.fc.weight", "module.text_encoder.lstms.1.fc.bias", "module.text_encoder.lstms.2.weight_ih_l0", "module.text_encoder.lstms.2.weight_hh_l0", "module.text_encoder.lstms.2.bias_ih_l0", "module.text_encoder.lstms.2.bias_hh_l0", "module.text_encoder.lstms.2.weight_ih_l0_reverse", "module.text_encoder.lstms.2.weight_hh_l0_reverse", "module.text_encoder.lstms.2.bias_ih_l0_reverse", "module.text_encoder.lstms.2.bias_hh_l0_reverse", "module.text_encoder.lstms.3.fc.weight", "module.text_encoder.lstms.3.fc.bias", "module.text_encoder.lstms.4.weight_ih_l0", "module.text_encoder.lstms.4.weight_hh_l0", "module.text_encoder.lstms.4.bias_ih_l0", "module.text_encoder.lstms.4.bias_hh_l0", "module.text_encoder.lstms.4.weight_ih_l0_reverse", "module.text_encoder.lstms.4.weight_hh_l0_reverse", "module.text_encoder.lstms.4.bias_ih_l0_reverse", "module.text_encoder.lstms.4.bias_hh_l0_reverse", "module.text_encoder.lstms.5.fc.weight", "module.text_encoder.lstms.5.fc.bias", "module.lstm.weight_ih_l0", "module.lstm.weight_hh_l0", "module.lstm.bias_ih_l0", "module.lstm.bias_hh_l0", "module.lstm.weight_ih_l0_reverse", "module.lstm.weight_hh_l0_reverse", "module.lstm.bias_ih_l0_reverse", "module.lstm.bias_hh_l0_reverse", "module.duration_proj.linear_layer.weight", "module.duration_proj.linear_layer.bias", "module.shared.weight_ih_l0", "module.shared.weight_hh_l0", "module.shared.bias_ih_l0", "module.shared.bias_hh_l0", "module.shared.weight_ih_l0_reverse", "module.shared.weight_hh_l0_reverse", "module.shared.bias_ih_l0_reverse", "module.shared.bias_hh_l0_reverse", "module.F0.0.conv1.bias", "module.F0.0.conv1.weight_g", "module.F0.0.conv1.weight_v", "module.F0.0.conv2.bias", "module.F0.0.conv2.weight_g", "module.F0.0.conv2.weight_v", "module.F0.0.norm1.fc.weight", "module.F0.0.norm1.fc.bias", "module.F0.0.norm2.fc.weight", "module.F0.0.norm2.fc.bias", "module.F0.1.conv1.bias", "module.F0.1.conv1.weight_g", "module.F0.1.conv1.weight_v", "module.F0.1.conv2.bias", "module.F0.1.conv2.weight_g", "module.F0.1.conv2.weight_v", "module.F0.1.norm1.fc.weight", "module.F0.1.norm1.fc.bias", "module.F0.1.norm2.fc.weight", "module.F0.1.norm2.fc.bias", "module.F0.1.conv1x1.weight_g", "module.F0.1.conv1x1.weight_v", "module.F0.1.pool.bias", "module.F0.1.pool.weight_g", "module.F0.1.pool.weight_v", "module.F0.2.conv1.bias", "module.F0.2.conv1.weight_g", "module.F0.2.conv1.weight_v", "module.F0.2.conv2.bias", "module.F0.2.conv2.weight_g", "module.F0.2.conv2.weight_v", "module.F0.2.norm1.fc.weight", "module.F0.2.norm1.fc.bias", "module.F0.2.norm2.fc.weight", "module.F0.2.norm2.fc.bias", "module.N.0.conv1.bias", "module.N.0.conv1.weight_g", "module.N.0.conv1.weight_v", "module.N.0.conv2.bias", "module.N.0.conv2.weight_g", "module.N.0.conv2.weight_v", "module.N.0.norm1.fc.weight", "module.N.0.norm1.fc.bias", "module.N.0.norm2.fc.weight", "module.N.0.norm2.fc.bias", "module.N.1.conv1.bias", "module.N.1.conv1.weight_g", "module.N.1.conv1.weight_v", "module.N.1.conv2.bias", 
"module.N.1.conv2.weight_g", "module.N.1.conv2.weight_v", "module.N.1.norm1.fc.weight", "module.N.1.norm1.fc.bias", "module.N.1.norm2.fc.weight", "module.N.1.norm2.fc.bias", "module.N.1.conv1x1.weight_g", "module.N.1.conv1x1.weight_v", "module.N.1.pool.bias", "module.N.1.pool.weight_g", "module.N.1.pool.weight_v", "module.N.2.conv1.bias", "module.N.2.conv1.weight_g", "module.N.2.conv1.weight_v", "module.N.2.conv2.bias", "module.N.2.conv2.weight_g", "module.N.2.conv2.weight_v", "module.N.2.norm1.fc.weight", "module.N.2.norm1.fc.bias", "module.N.2.norm2.fc.weight", "module.N.2.norm2.fc.bias", "module.F0_proj.weight", "module.F0_proj.bias", "module.N_proj.weight", "module.N_proj.bias". Unexpected key(s) in state_dict: "text_encoder.lstms.0.weight_ih_l0", "text_encoder.lstms.0.weight_hh_l0", "text_encoder.lstms.0.bias_ih_l0", "text_encoder.lstms.0.bias_hh_l0", "text_encoder.lstms.0.weight_ih_l0_reverse", "text_encoder.lstms.0.weight_hh_l0_reverse", "text_encoder.lstms.0.bias_ih_l0_reverse", "text_encoder.lstms.0.bias_hh_l0_reverse", "text_encoder.lstms.1.fc.weight", "text_encoder.lstms.1.fc.bias", "text_encoder.lstms.2.weight_ih_l0", "text_encoder.lstms.2.weight_hh_l0", "text_encoder.lstms.2.bias_ih_l0", "text_encoder.lstms.2.bias_hh_l0", "text_encoder.lstms.2.weight_ih_l0_reverse", "text_encoder.lstms.2.weight_hh_l0_reverse", "text_encoder.lstms.2.bias_ih_l0_reverse", "text_encoder.lstms.2.bias_hh_l0_reverse", "text_encoder.lstms.3.fc.weight", "text_encoder.lstms.3.fc.bias", "text_encoder.lstms.4.weight_ih_l0", "text_encoder.lstms.4.weight_hh_l0", "text_encoder.lstms.4.bias_ih_l0", "text_encoder.lstms.4.bias_hh_l0", "text_encoder.lstms.4.weight_ih_l0_reverse", "text_encoder.lstms.4.weight_hh_l0_reverse", "text_encoder.lstms.4.bias_ih_l0_reverse", "text_encoder.lstms.4.bias_hh_l0_reverse", "text_encoder.lstms.5.fc.weight", "text_encoder.lstms.5.fc.bias", "lstm.weight_ih_l0", "lstm.weight_hh_l0", "lstm.bias_ih_l0", "lstm.bias_hh_l0", "lstm.weight_ih_l0_reverse", "lstm.weight_hh_l0_reverse", "lstm.bias_ih_l0_reverse", "lstm.bias_hh_l0_reverse", "duration_proj.linear_layer.weight", "duration_proj.linear_layer.bias", "shared.weight_ih_l0", "shared.weight_hh_l0", "shared.bias_ih_l0", "shared.bias_hh_l0", "shared.weight_ih_l0_reverse", "shared.weight_hh_l0_reverse", "shared.bias_ih_l0_reverse", "shared.bias_hh_l0_reverse", "F0.0.conv1.bias", "F0.0.conv1.weight_g", "F0.0.conv1.weight_v", "F0.0.conv2.bias", "F0.0.conv2.weight_g", "F0.0.conv2.weight_v", "F0.0.norm1.fc.weight", "F0.0.norm1.fc.bias", "F0.0.norm2.fc.weight", "F0.0.norm2.fc.bias", "F0.1.conv1.bias", "F0.1.conv1.weight_g", "F0.1.conv1.weight_v", "F0.1.conv2.bias", "F0.1.conv2.weight_g", "F0.1.conv2.weight_v", "F0.1.norm1.fc.weight", "F0.1.norm1.fc.bias", "F0.1.norm2.fc.weight", "F0.1.norm2.fc.bias", "F0.1.conv1x1.weight_g", "F0.1.conv1x1.weight_v", "F0.1.pool.bias", "F0.1.pool.weight_g", "F0.1.pool.weight_v", "F0.2.conv1.bias", "F0.2.conv1.weight_g", "F0.2.conv1.weight_v", "F0.2.conv2.bias", "F0.2.conv2.weight_g", "F0.2.conv2.weight_v", "F0.2.norm1.fc.weight", "F0.2.norm1.fc.bias", "F0.2.norm2.fc.weight", "F0.2.norm2.fc.bias", "N.0.conv1.bias", "N.0.conv1.weight_g", "N.0.conv1.weight_v", "N.0.conv2.bias", "N.0.conv2.weight_g", "N.0.conv2.weight_v", "N.0.norm1.fc.weight", "N.0.norm1.fc.bias", "N.0.norm2.fc.weight", "N.0.norm2.fc.bias", "N.1.conv1.bias", "N.1.conv1.weight_g", "N.1.conv1.weight_v", "N.1.conv2.bias", "N.1.conv2.weight_g", "N.1.conv2.weight_v", "N.1.norm1.fc.weight", "N.1.norm1.fc.bias", "N.1.norm2.fc.weight", 
"N.1.norm2.fc.bias", "N.1.conv1x1.weight_g", "N.1.conv1x1.weight_v", "N.1.pool.bias", "N.1.pool.weight_g", "N.1.pool.weight_v", "N.2.conv1.bias", "N.2.conv1.weight_g", "N.2.conv1.weight_v", "N.2.conv2.bias", "N.2.conv2.weight_g", "N.2.conv2.weight_v", "N.2.norm1.fc.weight", "N.2.norm1.fc.bias", "N.2.norm2.fc.weight", "N.2.norm2.fc.bias", "F0_proj.weight", "F0_proj.bias", "N_proj.weight", "N_proj.bias".

crashes during training

After starting training I am getting the following error, sometimes right away, sometimes after a few steps:

./aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                      
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                     
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                       
Traceback (most recent call last):                                                                                                                                                                                                                                    |
  File "/home/tts/StyleTTS/train_first.py", line 393, in <module>                       
    main()                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__                                                                                                                          
    return self.main(*args, **kwargs)                                                                                                                                                                           
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main                                                                                                                              
    rv = self.invoke(ctx)                                                                                                                                                                                       
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke                                                                                                                            
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                              
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke                                                                                                                             
    return __callback(*args, **kwargs)                                                                                                                                                                          
  File "/home/tts/StyleTTS/train_first.py", line 149, in main                                                                                                                                                   
    ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)                                                                                                                                       
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                                                        
  File "/home/tts/StyleTTS/Utils/ASR/models.py", line 45, in forward                                                                                                                                            
    _, s2s_logit, s2s_attn = self.asr_s2s(x, src_key_padding_mask, text_input)                                                                                                                                  
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                           
    return forward_call(*args, **kwargs)                                                                                                                                                                        
  File "/home/tts/StyleTTS/Utils/ASR/models.py", line 130, in forward                                                                                                                                           
    print(f"... {text_input} {decoder_inputs.size(1)}")                                                                                                                                                         
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__                                                                                                                      
    return object.__format__(self, format_spec)                                                                                                                                                                 
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__                                                                                                                        
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)                                                                                                                                        
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str                                                                                                                        
    return _str_intern(self, tensor_contents=tensor_contents)                                                                                                                                                   
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern                                                                                                                 
    tensor_str = _tensor_str(self, indent)                                                                                                                                                                      
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 327, in _tensor_str                                                                                                                 
    formatter = _Formatter(get_summarized_data(self) if summarize else self)                                                                                                                                    
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 111, in __init__                                                                                                                    
    value_str = "{}".format(value)                                                                                                                                                                  
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 872, in __format__                                                                                                                     
    return self.item().__format__(fo

Need help for training

I'm pretraining your model on the vivo dataset (Vietnamese), but the results are not what I expected.
Here is the original audio:
https://drive.google.com/file/d/12mZdg8yVhgQj35Vt3thoxIK44_jWWaCJ/view?usp=sharing
and here is the result:
https://drive.google.com/file/d/1UOuUHrxiR5DvF1MrpccMZ6bdwKyfO2AE/view?usp=sharing

This is the loss during training stage 1 and stage 2:
[images: stage 1 and stage 2 loss curves]

P.S.: I used the ASR from the original paper to train on Vietnamese. I wonder if that causes any problems, because during training stage 1 there were quite a lot of KeyError errors.
Thank you very much

Adding BigVGAN as Vocoder

Hey, I'm trying to add my BigVGAN vocoder model to the inference script, but when it generates audio it always has a lot of noise compared to the inference script of the original BigVGAN code base. Any ideas why that could be? It looks to be the same setup as HiFi-GAN. https://github.com/NVIDIA/BigVGAN. If you would like one of my trained models, let me know and I'll give you a download link so you can test with it, as there are currently no publicly available models.

Thanks in advance!

%cd /content/BigVGAN

from __future__ import absolute_import, division, print_function, unicode_literals
#import sys
#sys.path.append("./content/BigVGAN")
import glob
import os
import argparse
import json

import torch
import numpy as np
from scipy.io.wavfile import write
from env import AttrDict
from meldataset1 import mel_spectrogram, MAX_WAV_VALUE
from models1 import BigVGAN as Generator
import librosa

torch.backends.cudnn.benchmark = False

def load_checkpoint(filepath, device):
    assert os.path.isfile(filepath)
    print("Loading '{}'".format(filepath))
    checkpoint_dict = torch.load(filepath, map_location=device)
    print("Complete.")
    return checkpoint_dict


def scan_checkpoint(cp_dir, prefix):
    pattern = os.path.join(cp_dir, prefix + '*')
    cp_list = glob.glob(pattern)
    if len(cp_list) == 0:
        return ''
    return sorted(cp_list)[-1]
    
cp_g = scan_checkpoint("configs", 'g_001')

config_file = os.path.join(os.path.split(cp_g)[0], 'bigvgan_24khz_100band.json') #actually 80-band to work with the StyleTTS model
with open(config_file) as f:
    data = f.read()


json_config = json.loads(data)
h = AttrDict(json_config)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # 'device' was undefined in the snippet as posted
generator = Generator(h).to(device)
state_dict_g = load_checkpoint(cp_g, device)

generator.load_state_dict(state_dict_g['generator'])
generator.eval()
generator.remove_weight_norm()
%cd /content/StyleTTS
import time
converted_samples = {}
start_time = time.time()

input_length = torch.LongTensor([tokens.shape[-1]]).to(device)
mask = length_to_mask(input_length).to(device)
with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    m = length_to_mask(input_lengths).to(device)
    t_en = model.text_encoder(tokens, input_lengths, m)
    for key, (ref, _) in reference_embeddings.items():
        s = ref.squeeze(1).to(device)
        style = s
        d = model.predictor.text_encoder(t_en, style, input_lengths, m)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data), device=device)
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0))
        style = s.expand(en.shape[0], en.shape[1], -1)

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0)), F0_pred, N_pred, ref.squeeze().unsqueeze(0))

        audio_signal = out.cpu().numpy().squeeze()

        #Apply the Mel Spectrogram transformation
        mel_spectrogram = librosa.feature.melspectrogram(y=audio_signal, sr=24000, n_fft=1024, hop_length=256, n_mels=80, win_length=1024)

        #Convert the Mel Spectrogram to decibels
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        out1 = torch.FloatTensor(mel_spectrogram_db).to(device)
        y_g_hat = generator(out)  # note: this feeds the decoder output `out` directly; `out1` above is never used
        y_out = y_g_hat.squeeze()
        y_out1 = y_out * MAX_WAV_VALUE
        y_out2 = y_out1.cpu().numpy()

        converted_samples[key] = y_out2

end_time = time.time()
print("Time taken: ", end_time - start_time, "seconds")

I also tried the original approach below, with the same result: grainy/noisy audio.

y_g_hat = generator(out)
y_out = y_g_hat.squeeze().cpu().numpy()
   
converted_samples[key] = y_out

RuntimeError: espeak not installed on your system

I have installed phonemizer, but I am receiving an error about a missing espeak. I have also tried downloading espeak from its website and adding it to the environment variables, but I am still receiving this error. I am also receiving another error about a Unicode decode failure.

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18492\1790917207.py in
1 # load phonemizer
2 import phonemizer
----> 3 global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)

~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\espeak.py in __init__(self, language, punctuation_marks, preserve_punctuation, with_stress, tie, language_switch, words_mismatch, logger)
45 super().__init__(
46 language, punctuation_marks=punctuation_marks,
---> 47 preserve_punctuation=preserve_punctuation, logger=logger)
48
49 self._espeak.set_voice(language)

~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\base.py in __init__(self, language, punctuation_marks, preserve_punctuation, logger)
41 punctuation_marks=punctuation_marks,
42 preserve_punctuation=preserve_punctuation,
---> 43 logger=logger)
44
45 self._espeak = EspeakWrapper()

~\anaconda3.1\lib\site-packages\phonemizer\backend\base.py in __init__(self, language, punctuation_marks, preserve_punctuation, logger)
76 if not self.is_available():
77 raise RuntimeError( # pragma: nocover
---> 78 '{} not installed on your system'.format(self.name()))
79
80 self._logger = logger

RuntimeError: espeak not installed on your system

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_15548\494391268.py in
3 train_path = config.get('train_data', None)
4 val_path = config.get('val_data', None)
----> 5 train_list, val_list = get_data_path_list(train_path, val_path)
6
7 ref_dicts = {}

~\Desktop\github\StyleTTS\utils.py in get_data_path_list(train_path, val_path)
33
34 with open(train_path, 'r') as f:
---> 35 train_list = f.readlines()
36 with open(val_path, 'r') as f:
37 val_list = f.readlines()

~\anaconda3.1\lib\encodings\cp1250.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 39: character maps to <undefined>
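Two commonly suggested workarounds for these symptoms on Windows (both hedged; the DLL path is just the default espeak-ng install location and may differ): point phonemizer at the espeak-ng library explicitly, and open the file lists as UTF-8 so the Windows default cp1250 codec is not used:

# Sketch for Windows setups; the DLL path is an assumption about a default install.
import os
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"

import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(
    language='en-us', preserve_punctuation=True, with_stress=True)

# For the UnicodeDecodeError, read the train/val lists as UTF-8 explicitly
# (utils.py opens them without an encoding, so Windows falls back to cp1250):
with open("Data/train_list.txt", "r", encoding="utf-8") as f:  # path illustrative
    train_list = f.readlines()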

Phoneme sequence padding

Hi, I'm trying to train StyleTTS with custom data, and I'm a little confused about how the phoneme sequence is padded.

When pre-training AuxiliaryASR, the _load_tensor function in meldataset.py pads the phoneme sequence with blank_index.

   def _load_tensor(self, data):
        wave_path, text, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)

        # phonemize the text
        ps = self.g2p(text.replace('-', ' '))
        if "'" in ps:
            ps.remove("'")
        text = self.text_cleaner(ps)
        blank_index = self.text_cleaner.word_index_dictionary[" "]
        text.insert(0, blank_index) # add a blank at the beginning (silence)
        text.append(blank_index) # add a blank at the end (silence)
        
        text = torch.LongTensor(text)

        return wave, text, speaker_id

But in StyleTTS's meldataset.py, the _load_tensor function just pads the phoneme sequence with 0.

    def _load_tensor(self, data):
        wave_path, text, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)
        if wave.shape[-1] == 2:
            wave = wave[:, 0].squeeze()
        if sr != 24000:
            wave = librosa.resample(wave, sr, 24000)
            print(wave_path, sr)
            
        wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)
        
        text = self.text_cleaner(text)
        
        text.insert(0, 0)
        text.append(0)
        
        text = torch.LongTensor(text)

        return wave, text, speaker_id

The questions are:

  1. If I pre-train ASR with blank_index, should I keep it that way in StyleTTS?
  2. If the input wavs are well-trimmed (with no silence at the beginning and end), should I still pad the phoneme sequence?

Thank you.
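Regarding question 1, whether the two conventions coincide depends on what index the space symbol gets in the symbol dictionary, which can be checked directly (a minimal sketch; text_cleaner is assumed to be an instance of the repo's TextCleaner, as in the quoted code):

# Sketch: check whether padding with 0 and padding with blank_index are the same thing
# for the symbol dictionary in use (names follow the quoted _load_tensor code).
blank_index = text_cleaner.word_index_dictionary[" "]
print(blank_index == 0)  # if False, the two loaders pad with different symbols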

Finetuning

Hi,

Is it possible to finetune this model?

Thanks

Sounds weird when the input is short

Hi, I find that the model produces really great outputs when synthesizing sentences of medium or long length. However, when synthesizing short sentences or words, the audio sounds weird. For example, I used the pretrained LibriTTS model to synthesize the word “people” (IPA: pˈiːpəl). No matter what reference audio I gave, long or short, the output was weird, especially at the end of the word. This phenomenon occurred when the input was short. I attached an example audio here people.wav.zip. Do you have any methods to solve or alleviate this problem? Thank you very much!

Multilingual Training [Question]

Hello, would it generally be possible to train StyleTTS on a multilingual dataset, e.g. by additionally conditioning the text encoder with a language embedding?

Running train_first.py raises an error

(demo) C:\Users\Administrator\StyleTTS>python train_first.py --config_path ./Configs/config.yml
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
pitch_extractor loaded
text_encoder loaded
style_encoder loaded
text_aligner loaded
discriminator loaded
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\Administrator\StyleTTS\train_first.py:393 in │
│ │
│ 390 │ torch.save(state, save_path) │
│ 391 │
│ 392 if __name__=="__main__": │
│ ❱ 393 │ main() │
│ 394 │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1157 in __call__
│ │
│ 1154 │ │
│ 1155 │ def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any: │
│ 1156 │ │ """Alias for :meth:`main`.""" │
│ ❱ 1157 │ │ return self.main(*args, **kwargs) │
│ 1158 │
│ 1159 │
│ 1160 class Command(BaseCommand): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1078 in main │
│ │
│ 1075 │ │ try: │
│ 1076 │ │ │ try: │
│ 1077 │ │ │ │ with self.make_context(prog_name, args, **extra) as ctx: │
│ ❱ 1078 │ │ │ │ │ rv = self.invoke(ctx) │
│ 1079 │ │ │ │ │ if not standalone_mode: │
│ 1080 │ │ │ │ │ │ return rv │
│ 1081 │ │ │ │ │ # it's not safe to ctx.exit(rv) here! │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1434 in invoke │
│ │
│ 1431 │ │ │ echo(style(message, fg="red"), err=True) │
│ 1432 │ │ │
│ 1433 │ │ if self.callback is not None: │
│ ❱ 1434 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ 1435 │ │
│ 1436 │ def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]: │
│ 1437 │ │ """Return a list of completions for the incomplete value. Looks │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:783 in invoke │
│ │
│ 780 │ │ │
│ 781 │ │ with augment_usage_errors(__self): │
│ 782 │ │ │ with ctx: │
│ ❱ 783 │ │ │ │ return __callback(*args, **kwargs) │
│ 784 │ │
│ 785 │ def forward( │
│ 786 │ │ __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any # noqa: B902 │
│ │
│ C:\Users\Administrator\StyleTTS\train_first.py:140 in main │
│ │
│ 137 │ │ │
│ 138 │ │ _ = [model[key].train() for key in model] │
│ 139 │ │ │
│ ❱ 140 │ │ for i, batch in enumerate(train_dataloader): │
│ 141 │ │ │ │
│ 142 │ │ │ batch = [b.to(device) for b in batch] │
│ 143 │ │ │ texts, input_lengths, mels, mel_input_length = batch │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:633 │
│ in __next__
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:134 │
│ 5 in _next_data │
│ │
│ 1342 │ │ │ │ self._task_info[idx] += (data,) │
│ 1343 │ │ │ else: │
│ 1344 │ │ │ │ del self._task_info[idx] │
│ ❱ 1345 │ │ │ │ return self._process_data(data) │
│ 1346 │ │
│ 1347 │ def _try_put_index(self): │
│ 1348 │ │ assert self._tasks_outstanding < self._prefetch_factor * self._num_workers │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:137 │
│ 1 in _process_data │
│ │
│ 1368 │ │ self._rcvd_idx += 1 │
│ 1369 │ │ self._try_put_index() │
│ 1370 │ │ if isinstance(data, ExceptionWrapper): │
│ ❱ 1371 │ │ │ data.reraise() │
│ 1372 │ │ return data │
│ 1373 │ │
│ 1374 │ def _mark_worker_as_unavailable(self, worker_id, shutdown=False): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\_utils.py:644 in reraise │
│ │
│ 641 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 642 │ │ │ # instantiate since we don't know how to │
│ 643 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 644 │ │ raise exception │
│ 645 │
│ 646 │
│ 647 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
LibsndfileError: <exception str() failed>

s2s loss becomes NaN

Thanks for your work. I tried to train stage 1 with a multi-speaker Chinese dataset, about 420+ hours. The training process is normal until, at some step, the s2s loss becomes NaN; after that the parameters of the text aligner are NaN too.
What can I do to avoid this problem?
I tried to add an epsilon to _s2s_pred and set the loss to 0, but the parameters are still NaN.
[image: training log screenshot showing the NaN loss]

[Q] Training all the components together

Hey, thanks for releasing the code. I came across this after reading the paper. I just wonder if you ever tried training the whole model together, and how it performed compared to the two-stage approach. Maybe it is just me missing it in the paper, but I don't see a clear comparison, although everything else is quite clear, especially with the Appendix. Thanks again.

error while running train_second.py (caused by size mismatch)

Traceback (most recent call last):
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 494, in
main()
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 220, in main
bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/transformers/models/albert/modeling_albert.py", line 724, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (584) must match the existing size (512) at non-singleton dimension 1. Target sizes: [12, 584]. Tensor sizes: [1, 512]
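The traceback indicates the phoneme sequence in this batch (584 tokens) exceeds the text encoder BERT's maximum position embeddings (512), so batches containing very long utterances overflow the encoder. A hedged sketch of a guard one could add before the bert call (names follow the traceback; the limit is read from the model config rather than hard-coded, and truncation is only a crude stopgap — filtering or splitting long utterances in the dataset is cleaner):

# Sketch: guard against texts longer than the BERT encoder's position limit.
max_pos = model.bert.config.max_position_embeddings  # 512 for this checkpoint
if texts.size(1) > max_pos:
    texts = texts[:, :max_pos]
    text_mask = text_mask[:, :max_pos]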

about ASR model

If we do not use the pretrained ASR checkpoint and just let the ASR model update its parameters together with the TTS model, can we still get a good TTS result?

Pause between sentences?

As the title suggests, how would I add a pause in between multiple sentences, for example after a full stop?

Problem with data processing

I found this line of code in the meldataset.py file and I was curious about what it does. Why does the wav need to be padded like this?
wave = torch.cat([torch.zeros([5000]), wave, torch.zeros([5000])], axis=0)
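For scale, at the repo's 24 kHz sampling rate this prepends and appends roughly 0.2 s of silence:

pad_seconds = 5000 / 24000  # ≈ 0.208 s of zeros added on each side of the waveform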

Duration predictor training is really slow.

I observe very slow progress with the duration loss in the second stage of training. Is this expected, or can you think of any issue that might be causing it?

For each epoch, the eval loss is going 2.21 -> 2.20 -> 2.18 ... whereas the F0 loss converged very quickly.

BTW I am using VCTK + LibriTTS.

I also tried reducing the dropout to 0.1 for the duration projection layer but didn't help.

Training the model end2end

This is just a heads-up about the discussion in #7.

I tried training the model end-to-end in different ways, but the F0 and energy predictors were always underfitting, even though the eval loss kept going down. They were never able to predict useful values for inference.

Here is roughly my forward pass. I can also share the branch if useful. Happy to see any feedback.

    @typechecked
    def forward_all(
        self,
        texts: TensorType["B", "T_text"],
        input_lengths: TensorType["B"],
        mels: TensorType["B", "C_mel", "T_mel"],
        mel_input_length: TensorType["B"],
        F0_real: TensorType["B", 1, "T_mel"],
    ):
        # TODO: use Pitch Extractor (maybe torchcrepe)
        # mask = length_to_mask(mel_input_length // (2 ** model.text_aligner.n_down)).to(self.device)
        text_mask = self.lengths_to_mask(input_lengths).to(self.device)
        mel_mask = self.lengths_to_mask(mel_input_length).to(self.device)

        ##### --> TEXT ENCODER
        t_en, t_emb = self.text_encoder(texts, input_lengths, length_to_mask(input_lengths))

        ##### --> ALIGNER
        _, aligner_soft, aligner_logprob, aligner_hard = self._forward_aligner(
            x=t_emb.detach().transpose(1, 2),
            y=mels,
            x_mask=text_mask,
            y_mask=mel_mask,
            attn_priors=None,
        )

        ##### --> EXPAND
        t_en_ex = t_en @ aligner_hard.squeeze(1)

        ##### --> PRUNE THE BATCH BY THE SHORTEST MEL LENGTH
        mel_len = int(mel_input_length.min().item())
        t_en_ex_clipped = []
        mels_clipped = []
        F0s = []
        idxs = []
        for bib in range(len(mel_input_length)):
            mel_length = int(mel_input_length[bib].item()) + 1

            random_start = np.random.randint(0, mel_length - mel_len)
            idxs.append(random_start)
            t_en_ex_clipped.append(t_en_ex[bib, :, random_start : random_start + mel_len])
            mels_clipped.append(mels[bib, :, random_start : random_start + mel_len])
            F0s.append(F0_real[bib, :, random_start : random_start + mel_len])

        t_en_ex_clipped = torch.stack(t_en_ex_clipped)
        mels_clipped = torch.stack(mels_clipped).detach()
        F0_real = torch.stack(F0s).detach()

        ##### --> CALCULATE REAL ENERGY
        N_real = log_norm(mels_clipped.unsqueeze(1)).squeeze(1).detach()
        # F0_real, _, _ = self.pitch_extractor(gt.unsqueeze(1))

        ##### --> STYLE ENCODER
        s = self.style_encoder(mels_clipped.unsqueeze(1))

        ##### --> DURATION PREDICTOR & PROSODY ENCODER
        dur_aligner = aligner_hard.sum(axis=-1).detach()
        dur_pred, prosody_en_ex = self.predictor(
            t_en.detach(), s.detach(), input_lengths, aligner_hard.squeeze(1), length_to_mask(input_lengths)
        )  # [B, T_en]
        d_align_mask = self.lengths_to_mask(input_lengths) * self.lengths_to_mask(mel_input_length).transpose(
            1, 2
        )  # [B, 1, T_enc] * [B, T_dec, 1]
        dur_alignment = generate_path(dur_pred, d_align_mask.squeeze(1).transpose(1, 2)).detach()  # [B, T_dec, T_enc]

        # Strip prosody tensor to match the mel length
        p_en = []
        for bib, start in enumerate(idxs):
            p_en.append(prosody_en_ex[bib, :, start : (start + mel_len)])
        p_en = torch.stack(p_en)

        ##### --> Pitch and Energy Predictor
        F0_fake, N_fake = self.predictor.F0Ntrain(p_en, s.detach())

        ##### --> DECODER
        mel_rec = self.decoder(t_en_ex_clipped, F0_real.squeeze(1), N_real, s)
        return {
            "mel_rec": mel_rec,
            "gt": mels_clipped,
            "aligner_logprob": aligner_logprob,
            "aligner_hard": aligner_hard.squeeze(1),
            "aligner_soft": aligner_soft,
            "F0_real": F0_real.squeeze(1),
            "F0_fake": F0_fake,
            "N_real": N_real,
            "N_fake": N_fake,
            "d": dur_pred,
            "d_gt": dur_aligner.squeeze(1),
            "d_alignment": dur_alignment,
        }
