
hierspeechpp's Introduction

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis
The official implementation of HierSpeech++

Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee

Department of Artificial Intelligence, Korea University, Seoul, Korea

Abstract

Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, such models require large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer, given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level-quality zero-shot speech synthesis.

[Figure 1: Overall pipeline of HierSpeech++]

This repository contains:

  • A PyTorch implementation of HierSpeech++ (TTV, Hierarchical Speech Synthesizer, SpeechSR)
  • ⚡️ Pre-trained HierSpeech++ models trained on LibriTTS (train-460, train-960, and more datasets)
  • Hugging Face Spaces Gradio demo on HuggingFace. HuggingFace provided us with a community GPU grant. Thanks 😊

Our Previous Works

  • [NeurIPS2022] HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis
  • [Interspeech2023] HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

This paper is an extended version of the above papers.

Update

24.02.20

  • We brought back the reconstruction loss for TTV. Adding loss masking for zero-padding decreased TTS performance, producing random long pauses and repeated sounds in the generated speech (it may affect the loss balance). Sorry for the confusion. The code has been revised to match the paper version.

24.01.19

  • We have released the TTV_v1 training code. Regardless of the language, you can train TTV on your own dataset and perform speech synthesis using the pre-trained Hierarchical Speech Synthesizer model.

Todo

Hierarchical Speech Synthesizer

  • HierSpeechpp-Backbone (LibriTTS-train-460)
  • HierSpeechpp-Backbone (LibriTTS-train-960)
  • HierSpeechpp-Backbone-60epoch (LibriTTS-train-960, Libri-light (Medium), Expresso, MSSS(Kor), NIKL(Kor))
  • HierSpeechpp-Backbone-200epoch (LibriTTS-train-960, Libri-light (Medium), Expresso, MSSS(Kor), NIKL(Kor))

Text-to-Vec (TTV)

  • TTV-v1 (LibriTTS-train-960)
  • TTV-v2 (Multi-lingual TTV)

Speech Super-resolution (16k --> 24k or 48k)

  • SpeechSR-24k
  • SpeechSR-48k

Cleaning Up the Source Code

  • Clean Code

Training code (Will be released after paper acceptance)

  • TTV
  • Hierarchical Speech Synthesizer
  • SpeechSR

Getting Started

Pre-requisites

  1. PyTorch >= 1.13 and torchaudio >= 0.13
  2. Install requirements
pip install -r requirements.txt
  3. Install Phonemizer
pip install phonemizer
sudo apt-get install espeak-ng
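
If you want to confirm the environment before running inference, a quick sanity check might look like this (a minimal sketch; the file name check_env.py is just an example):

    # check_env.py -- sanity-check the inference environment (illustrative only)
    import torch
    import torchaudio

    print("torch:", torch.__version__)            # expect >= 1.13
    print("torchaudio:", torchaudio.__version__)  # expect >= 0.13
    print("CUDA available:", torch.cuda.is_available())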

Checkpoint [Download]

Hierarchical Speech Synthesizer

Model | Sampling Rate | Params | Dataset | Hours | Speakers | Checkpoint
HierSpeech2 | 16 kHz | 97M | LibriTTS (train-460) | 245 | 1,151 | [Download]
HierSpeech2 | 16 kHz | 97M | LibriTTS (train-960) | 555 | 2,311 | [Download]
HierSpeech2 | 16 kHz | 97M | LibriTTS (train-960), Libri-light (Small, Medium), Expresso, MSSS (Kor), NIKL (Kor) | 2,796 | 7,299 | [Download]

TTV

Model | Language | Params | Dataset | Hours | Speakers | Checkpoint
TTV | Eng | 107M | LibriTTS (train-960) | 555 | 2,311 | [Download]

SpeechSR

Model | Sampling Rate | Params | Dataset | Checkpoint
SpeechSR-24k | 16 kHz --> 24 kHz | 0.13M | LibriTTS (train-960), MSSS (Kor) | speechsr24k
SpeechSR-48k | 16 kHz --> 48 kHz | 0.13M | MSSS (Kor), Expresso (Eng), VCTK (Eng) | speechsr48k

Text-to-Speech

sh inference.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference.py \
                --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
                --ckpt_text2w2v "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth" \
                --output_dir "tts_results_eng_kor_v2" \
                --noise_scale_vc "0.333" \
                --noise_scale_ttv "0.333" \
                --denoise_ratio "0"

  • For better robustness, we recommend a noise_scale of 0.333
  • For better expressiveness, we recommend a noise_scale of 0.667
  • Find the best parameters for your style prompt (a small sweep helper is sketched below)
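
If you prefer to script that search, a minimal sweep over noise_scale values could look like the following (illustrative only; it simply calls inference.py with the flags shown above):

    # sweep_noise_scale.py -- try a few noise_scale settings for one style prompt (illustrative only)
    import subprocess

    for noise_scale in ["0.333", "0.5", "0.667"]:
        subprocess.run([
            "python3", "inference.py",
            "--ckpt", "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth",
            "--ckpt_text2w2v", "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth",
            "--output_dir", f"tts_results_noise_{noise_scale}",
            "--noise_scale_vc", noise_scale,
            "--noise_scale_ttv", noise_scale,
            "--denoise_ratio", "0",
        ], check=True)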

Noise Control

# without denoiser
--denoise_ratio "0"
# with denoiser
--denoise_ratio "1"
# Mixup (recommended: 0.6~0.8)
--denoise_ratio "0.8"
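
As a rough mental model of the mixup setting (our assumption, for intuition only; not necessarily how the released inference code implements it), the ratio can be thought of as blending the style information extracted from the original prompt and from its denoised version:

    import torch

    # Hypothetical speaker-style embeddings extracted from the original and the
    # denoised reference audio (random placeholders, for intuition only).
    style_orig = torch.randn(1, 256)
    style_denoised = torch.randn(1, 256)

    denoise_ratio = 0.8  # corresponds to --denoise_ratio
    style = (1.0 - denoise_ratio) * style_orig + denoise_ratio * style_denoised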

Voice Conversion

  • This method only utilizes the hierarchical speech synthesizer for voice conversion.
sh inference_vc.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
                --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
                --output_dir "vc_results_eng_kor_v2" \
                --noise_scale_vc "0.333" \
                --noise_scale_ttv "0.333" \
                --denoise_ratio "0"
  • For better robustness, we recommend a noise_scale of 0.333
  • For better expressiveness, we recommend a noise_scale of 0.667
  • Find the best parameters for your style prompt
  • Voice conversion is vulnerable to noisy target prompts, so we recommend using the denoiser with noisy prompts
  • For noisy source speech, a wrong F0 may be extracted by YAAPT, resulting in quality degradation.

Speech Super-resolution

  • SpeechSR-24k and SpeechSR-48k are provided in the TTS pipeline. If you want to use SpeechSR only, please refer to inference_speechsr.py.
  • To change the output resolution, add one of the following flags:
--output_sr "48000" # Default (48 kHz)
--output_sr "24000" # 24 kHz
--output_sr "16000" # Without super-resolution
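
For comparison, naive resampling (e.g., with torchaudio) also turns 16 kHz audio into a 48 kHz waveform, but unlike the learned SpeechSR it cannot restore the missing high-frequency band. A quick sketch, with placeholder file names:

    import torchaudio
    import torchaudio.functional as F

    wav, sr = torchaudio.load("sample_16k.wav")     # hypothetical 16 kHz input
    naive_48k = F.resample(wav, sr, 48000)          # band-limited: nothing above 8 kHz is recovered
    torchaudio.save("sample_naive_48k.wav", naive_48k, 48000)
    # SpeechSR (inference_speechsr.py) instead generates the missing high band.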

Speech Denoising for Noise-free Speech Synthesis (Only Used in the Speaker Encoder during Inference)

  • For a denoised style prompt, we utilize a denoiser (MP-SENet).
  • When using long reference audio, this model runs into out-of-memory issues, so we plan to train a memory-efficient speech denoiser in the future.
  • If you run into this problem, we recommend using clean reference audio, denoising the audio before the TTS pipeline, or running the denoiser on the CPU (but this will be slow 😥).
  • (21 Nov. 2023) Sliced-window denoising. This may reduce the memory burden of denoising speech.
    if denoise == 0:
        # No denoising: duplicate the prompt so the batch still contains two copies.
        audio = torch.cat([audio.cuda(), audio.cuda()], dim=0)
    else:
        with torch.no_grad():
            if ori_prompt_len > 80000:
                # Denoise long prompts in 80,000-sample windows (5 s at 16 kHz) to avoid OOM.
                denoised_audio = []
                for i in range(ori_prompt_len // 80000):
                    denoised_audio.append(denoise(audio.squeeze(0).cuda()[i * 80000:(i + 1) * 80000], denoiser, hps_denoiser))
                # Denoise the remaining tail.
                denoised_audio.append(denoise(audio.squeeze(0).cuda()[(i + 1) * 80000:], denoiser, hps_denoiser))
                denoised_audio = torch.cat(denoised_audio, dim=1)
            else:
                denoised_audio = denoise(audio.squeeze(0).cuda(), denoiser, hps_denoiser)

        # Stack the original and denoised prompts along the batch dimension.
        audio = torch.cat([audio.cuda(), denoised_audio[:, :audio.shape[-1]]], dim=0)

TTV-v2 (WIP)

  • TTV-v1 is a simple model, only slightly modified from VITS. Although this simple TTV can synthesize speech with high quality and high speaker similarity, we think there is room for improvement in expressiveness, such as prosody modeling.
  • For TTV-v2, we modify some components and the training process (model size: 107M --> 278M):
    1. Intermediate hidden size: 256 --> 384
    2. Loss masking for the wav2vec reconstruction loss (I previously left out masking the loss for zero-padding sequences 😥); see the sketch after this list
    3. For long-sentence generation, we fine-tune the model on the full LibriTTS train set without data filtering (decreasing the learning rate to 2e-5 with a batch size of 8 per GPU)
    4. Multi-lingual dataset (we are currently training the model with English, Indic, and Korean data)
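
For intuition, here is a minimal sketch of masking a wav2vec-style reconstruction loss over zero-padded frames (our illustration, not the repository's exact implementation):

    import torch

    def masked_l1_loss(pred: torch.Tensor, target: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        """L1 reconstruction loss that ignores zero-padded frames.

        pred, target: (batch, dim, time) wav2vec-style features (hypothetical shapes).
        lengths:      (batch,) number of valid frames per example.
        """
        max_len = pred.size(-1)
        # mask[b, 0, t] = 1 while t < lengths[b], else 0
        mask = (torch.arange(max_len, device=pred.device)[None, :] < lengths[:, None]).unsqueeze(1).float()
        loss = (pred - target).abs() * mask
        # normalize by the number of valid elements (valid frames * feature dim)
        return loss.sum() / (mask.sum() * pred.size(1))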

GAN VS Diffusion

We think it cannot be confirmed yet which approach is better. Each model has its own advantages, so you can utilize each one for your own purposes, and both lines of research should be actively pursued in parallel.

GAN (Specifically, GAN-based End-to-End Speech Synthesis Models)

  • (pros) Fast Inference Speed
  • (pros) High-quality Audio
  • (cons) Slow Training Speed (Over 7~20 Days)
  • (cons) Lower Voice Style Transfer Performance than Diffusion Models
  • (cons) Perceptually High-quality but Over-smoothed Audio, due to the Information Bottleneck of Sampling from a Low-dimensional Latent Variable

Diffusion (Diffusion-based Mel-spectrogram Generation Models)

  • (pros) Fast Training Speed (within 3 Days)
  • (pros) High-quality Voice Style Transfer
  • (cons) Slow Inference Speed
  • (cons) Lower Audio quality than End-to-End Speech Synthesis Models

(In this work) Our Approaches for GAN-based End-to-End Speech Synthesis Models

  • Improving Voice Style Transfer Performance in End-to-End Speech Synthesis Models for OOD (Zero-shot Voice Style Transfer for Novel Speakers)
  • Improving the Audio Quality beyond Perceptual Quality for Much Higher-fidelity Audio Generation

(Our other works) Diffusion-based Mel-spectrogram Generation Models

  • DDDM-VC: Disentangled Denoising Diffusion Models for High-quality and High-diversity Speech Synthesis Models
  • Diff-HierVC: Hierarchical Diffusion-based Speech Synthesis Model with Diffusion-based Pitch Modeling

Our Goals

  • Integrating each model for High-quality, High-diversity and High-fidelity Speech Synthesis Models

LLM-based Models

We hope to compare against LLM-based models as zero-shot TTS baselines. However, there is no publicly available official implementation of LLM-based TTS models. Unfortunately, unofficial models perform poorly in zero-shot TTS, so we hope the authors will release their models for fair comparison, reproducibility, and the benefit of our speech community. TBH, I could not stand the inference speed, which is almost 1,000 times slower than E2E models; it took 5 days to synthesize the full sentences of the LibriTTS test subsets, and even then the audio quality was quite poor. I hope they will release their official source code soon.

In my very personal opinion, VITS is still the best TTS model I have ever seen. However, I acknowledge that LLM-based models have far more powerful potential for creative generation from large-scale datasets; they are just not there yet.

Limitation of our work

  • Slow training speed and relatively large model size (compared with VITS) --> Future work: a lightweight, fast training pipeline and a much larger model...
  • Cannot generate realistic background sound --> Future work: adding an audio generation component by disentangling speech and sound.
  • Cannot generate speech from overly long sentences because of our training setting. We expect that increasing the maximum length would improve model performance. I hope to use GPUs with 80 GB 😢
     # Data filtering for limited computation resources
      wav_min = 32
      wav_max = 600 # 12 s 
      text_min = 1
      text_max = 200
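
For illustration, applying thresholds like these to a filelist might look as follows (a sketch that assumes lengths are counted in 20 ms frames, so 600 frames = 12 s; the repository's preprocessing may measure lengths differently):

    # Illustrative data filtering with the thresholds above.
    wav_min, wav_max = 32, 600
    text_min, text_max = 1, 200

    def keep(num_frames: int, text: str) -> bool:
        return wav_min <= num_frames <= wav_max and text_min <= len(text) <= text_max

    # hypothetical filelist entries: (audio_path, num_frames, normalized_text)
    filelist = [("a.wav", 450, "hello there"), ("b.wav", 900, "this one is too long")]
    filtered = [item for item in filelist if keep(item[1], item[2])]
    print(filtered)  # -> [('a.wav', 450, 'hello there')]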
    

TTV v2 may reduce this issue significantly...!

Results [Download]

We have attached all samples from LibriTTS test-clean and test-other.

Reference

Our repository is heavily based on VITS and BigVGAN.


Our Previous Works

Baseline Model

Waveform Generator for High-quality Audio Generation

Self-supervised Speech Model

Other Large Language Model based Speech Synthesis Model

  • VALL-E & VALL-E-X
  • SPEAR-TTS
  • Make-a-Voice
  • MEGA-TTS & MEGA-TTS 2
  • UniAudio

Diffusion-based Model

  • NaturalSpeech 2

AdaLN-zero

Thanks for all of these nice works.

hierspeechpp's People

Contributors

hayeong0, sh-lee-prml



hierspeechpp's Issues

hierspeechpp_v2_ckpt

Hi there, thanks for putting up the repo!

The default checkpoint for inference is hierspeechpp_v2_ckpt. Is that more up to date, and is it available for download as well?

 parser.add_argument('--ckpt', default='./logs/hierspeechpp_eng_kor/hierspeechpp_v2_ckpt.pth')

License

Hi, great project. Just curious why you decided to update the license from MIT to CC-BY-NC-SA?

Multi-lingual TTS

We have a plan to train the TTV with a multi-lingual dataset in Dec. 2023.

Currently, we have collected English, Korean, and Indic datasets.

We may add Chinese and Japanese speech datasets, but we are not familiar with these languages, so it would be appreciated if you could suggest some open-source datasets 😊

If you want specific languages, please tell us and recommend open-source datasets for them.

Missing keys in checkpoint file

Thank you for your excellent work! I'm currently working on fine-tuning the TTV model with a non-English dataset. I've downloaded the 'ttv_lt960_ckpt.pth' and attempted to load it using utils.load_checkpoint function in the 'train_ttv_v1.py' to initiate the model. However, the checkpoint doesn't include the required keys like 'optimizer', 'model', 'learning_rate', and 'iteration', making it impossible to load. Could you please guide me on what might be missing or if there are any additional steps I need to take?
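
For context, a generic PyTorch pattern that handles both checkpoint layouts (a raw state_dict vs. a training dict wrapping 'model', 'optimizer', 'iteration', and 'learning_rate') looks like this; it is a sketch, not the repository's utils.load_checkpoint:

    import torch

    def load_weights(model: torch.nn.Module, ckpt_path: str) -> None:
        state = torch.load(ckpt_path, map_location="cpu")
        # Training checkpoints wrap the weights in a dict with "model", "optimizer",
        # "iteration", and "learning_rate" keys; released inference checkpoints may
        # contain only the raw state_dict.
        weights = state["model"] if isinstance(state, dict) and "model" in state else state
        model.load_state_dict(weights)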

Multi Scale STFT Discriminator checkpoint

Hello. First of all, thank you very much for your great work!
Now I'm trying to fine-tune the hierarchical speech synthesizer on my own dataset. In my understanding, your adversarial training process adopts two different discriminators: the multi-period discriminator (MPD) and the multi-scale STFT discriminator. In #4, I saw you only shared the checkpoint of the MPD. Could you kindly share the checkpoint of the MS-STFT discriminator? If I've misunderstood, please correct me!
Once again, your work is wonderful !

Problem in english cleaners

As the comment says, the difference between the two cleaners is that cleaners2 keeps punctuation and stress. However, the default values of preserve_punctuation and with_stress in phonemizer.phonemize are False, even for cleaners2, which contradicts the comment.

discriminator checkpoint

Hi, could you share your discriminator checkpoint,
so that we can try to fine-tune the HAG model on other languages?
I tried using your HAG checkpoint to resynthesize Chinese Mandarin audio; the model can do it well most of the time when the input is the MMS wav2vec feature of lower-speech-rate audio.

Based on the design of your model, the HAG model is not strongly tied to the language it was trained on, if I'm not mistaken.

Before you release your training code, I want to try training the TTV model on Mandarin and English data from scratch,
and to fine-tune the HAG model on my data to get better pronunciation.
Most importantly, thanks for sharing your code and checkpoints!

tensor mismatch size in commons.rand_slice_segments(w2v, length, 60)

hey, thank you so much for your great work. While trying to train the model on the LibriTTS dataset following your tips in #20 (comment),
I encountered the following issue:

w2v_slice, ids_slice = commons.rand_slice_segments(w2v, length, 60)
File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 70, in rand_slice_segments
ret = slice_segments(x, ids_str, segment_size)
File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 53, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (60) must match the existing size (30) at non-singleton dimension 1. Target sizes: [1024, 60]. Tensor sizes: [1024, 30]
and
RuntimeError: Expected input_lengths to have value at most 512, but got value 520 (while checking arguments for ctc_loss_gpu)

It seems the reason for this is that the lengths tensor contains larger values ([408, 420, 376, 328, 332, 400, 340, 288, 520, 124, 180, 260, 256, 100,
216, 124, 600, 192, 384, 296, 544, 480, 436, 384, 440, 252, 324, 152,
372, 336, 128, 288], device='cuda:0') than the time dimension of the w2v tensor ([512, 32, 178]).

Any tips on how to solve this issue?

[bug] When running VC on my own audio, an error occurs

X:\tmp\HierSpeechpp\inference_vc.py:78: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
source_audio = torchaudio.functional.resample(source_audio, sample_rate, 16000, resampling_method="kaiser_window")
C:\Users\sekkitshi\AppData\Local\miniconda3\envs\hierspeech2\lib\site-packages\amfm_decompy\pYAAPT.py:970: RuntimeWarning: invalid value encountered in divide
phi[lag_min:lag_max] = formula_nume/np.sqrt(formula_denom)
X:\tmp\HierSpeechpp\inference_vc.py:103: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
target_audio = torchaudio.functional.resample(target_audio, sample_rate, 16000, resampling_method="kaiser_window")
Traceback (most recent call last):
File "X:\tmp\HierSpeechpp\inference_vc.py", line 254, in
main()
File "X:\tmp\HierSpeechpp\inference_vc.py", line 251, in main
inference(a)
File "X:\tmp\HierSpeechpp\inference_vc.py", line 220, in inference
VC(a, hierspeech)
File "X:\tmp\HierSpeechpp\inference_vc.py", line 168, in VC
write(output_file, 48000, converted_audio)
File "C:\Users\sekkitshi\AppData\Local\miniconda3\envs\hierspeech2\lib\site-packages\scipy\io\wavfile.py", line 797, in write
fmt_chunk_data = struct.pack('<HHIIHH', format_tag, channels, fs,
struct.error: ushort format requires 0 <= number <= 0xffff

Seemingly incorrect TTS result

Hi, I'm unfamiliar with Korean, but I got different outputs from your work and a TTS website for the input text "이전 곡".
I generated audio in 4 different ways:
1: using the inference code from this repo and the checkpoint logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth
2: using your Hugging Face demo page: https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS
3: using Coqui AI's Hugging Face page: https://huggingface.co/spaces/coqui/xtts
4: using 2 free TTS websites: https://www.text-to-speech.cn/ and https://ttstool.com/
Here's what I got: https://drive.google.com/drive/folders/1BfptrylJTmICm2JN49G2YQi-HjnOSCBB?usp=sharing

Timeline for Multilingual TTV model

Hi,

thanks for your great work!

Do you have a timeline in mind for when to release the multilingual ttv model?
Did you do anything different from the monolingual approach to train such a model?

Thanks

A question about perturbation and perturbed features

In Section 3.2 of the paper, it was mentioned that

we utilize a speech perturbation to remove speaker-related information from the self-supervised speech representation.

Which speech perturbation did you use to remove speaker-related information? Is it formant shifting?

And how do you obtain the perturbed features during training: preprocessed or on-the-fly?

That's great!

It's really good; I'm looking forward to the Chinese model. When will the training code be open-sourced so that we can train it ourselves?

Long text inference

Hey,
Thanks for releasing HierSpeech++ - it's awesome! Do you know if it's possible to smoothly synthesize long text? For example, StyleTTS 2 does this:

Our findings indicate that style diffusion creates significant variation in samples, a characteristic that
poses challenges for long-form synthesis. In this scenario, a long paragraph is usually divided into
smaller sentences for generation, sentence by sentence, in the same way as real-time applications.

Using an independent style for each sentence may generate speech that appears inconsistent due
to differences in speaking styles. Conversely, maintaining the same style from the first sentence
throughout the entire paragraph results in monotonic, unnatural, and robotic-sounding speech.
We empirically observe that the latent space underlying the style vectors generally forms a convex
space. Consequently, a convex combination of two style vectors yields another style vector, with
the speaking style somewhere between the original two. This allows us to condition the style of the
current sentence on the previous sentence through a simple convex combination. The pseudocode of
this algorithm, which uses interpolated style vectors, is provided in Algorithm 1.

Thanks!
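
A minimal sketch of the convex style combination described in the StyleTTS 2 quote above (purely illustrative; this is the StyleTTS 2 idea, not a HierSpeech++ feature):

    import torch

    def smooth_styles(sentence_styles, alpha=0.7):
        """Condition each sentence's style on the previous one via a convex combination."""
        smoothed = [sentence_styles[0]]
        for style in sentence_styles[1:]:
            smoothed.append(alpha * style + (1.0 - alpha) * smoothed[-1])
        return smoothed

    # hypothetical per-sentence style vectors
    styles = smooth_styles([torch.randn(128) for _ in range(5)])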

Can't replicate the training curve for LibriTTS 960

hey,
thank you for sharing the training curve. I am training the model on LibriTTS 960 using exactly your preprocessing and training files. However, my training curve doesn't show the same losses you got: while you reached a w2v loss of 2.74 at 100k (#4 (comment)), my training converges at an error of around 6, and the model also produces slurred speech.

What could be the reason for that?
thank you

Can this work on a CPU without CUDA?

This is the error I'm getting:

    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

inference error

I have trained on my own dataset (Chinese), using my own cleaners and symbols.
However, when I use my trained model for TTS inference (inference.py), loading the model produces errors like this:

File "/home/liuyiheng/HierSpeechpp/inference.py", line 157, in model_load
text2w2v.load_state_dict(torch.load(a.ckpt_text2w2v))
File "/home/liuyiheng/HierSpeechpp/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SynthesizerTrn:
Missing key(s) in state_dict: "enc_q.pre.weight", "enc_q.pre.bias", "enc_q.enc.in_layers.0.bias", "enc_q.enc.in_layers.0.weight_g",....
Unexpected key(s) in state_dict: "model", "iteration", "optimizer", "learning_rate".

Can you explain why this error occurred?
I only replaced the "ckpt_text2w2v" checkpoint with my own trained checkpoint; the other checkpoints are yours.

Is there anything else I should be aware of when running TTS inference?

Thank you very much for taking time out of your busy schedule to answer questions!

training TTV

I have some question about TTV training

  1. When I train TTV on Mandarin+English using the pre-trained checkpoint, the synthesized audio has a foreign accent.
    Example: Mandarin_0:
    Mandarin_0.zip
  2. When I train TTV on Mandarin+English without the pre-trained checkpoint, there are mispronunciations in the synthesized audio. I would like to know how to fix this.
    Example: Mandarin_1:
    Mandarin_1.zip
  3. I would also like to ask roughly how many steps it takes to train TTV.

Voice Cloning TTS

Is it possible to do TTS that reads text in the style of a target speech file, rather than VC between a source speech file and a target speech file?

Hierarchical speech synthesizer training code (or new checkpoints)

Hello, your team did a great job with this model! I'm yet to test it more thoroughly for TTS, but the VC component is the best pre-trained VC model I found so far!

I'd be interested in adding more languages/voices to the VC part, and noticed you were planning to release the speech synthesizer training code after the paper gets accepted. Any news on that? Or do you have any plans to release larger pre-trained checkpoints? I believe I saw somewhere that you were working on training with more languages.

Pitch detection

Hello! Thank you for a great model! I am trying to fine-tune it. You have quite an uncommon predicted pitch value range. Could you please share which library you used for pitch detection? Thank you!

NaN in the text encoder value

In the output of the text encoder on a custom dataset, here

x, m_p, logs_p, x_mask = self.enc_p(x_text, text_length, g=g)

the value of x comes out as NaN. Because of this, several other values become 0 or NaN as well.
Do you have any clues as to what could be causing the issue?

Are you planning to add support for the Ukrainian language?

Dear developers,

I am asking you to add support for the Ukrainian language to your neural network HierSpeech++, which clones voices

Ukrainian is the native language of more than 40 million people worldwide, and adding it to your neural network will have a significant impact on the Ukrainian community.

Here are some of the reasons why I believe that the addition of the Ukrainian language will be useful:

 This will empower people with visual impairments and disabilities.
 This will make your neural network more accessible to people from all over the world who have Ukrainian roots.

I understand that this can be a difficult task, but I believe that it is worth it.

I would also like to mention that there is a large amount of Ukrainian language data that you can use to train your neural network. This data is publicly available from various sources

I believe that adding the Ukrainian language to your HierSpeech++ neural network would mean a great deal to the Ukrainian community.
Thank you for your time and attention.

Possible inconsistency between SpeechSR implementation and paper description

Hi @sh-lee-prml,

Thanks for the great work that you are doing and for making this awesome repository open-source :).

I think I might have found an inconsistency between the code of SpeechSR and the paper description. In Section 3.4 of the paper it is said that "We further replace a transposed convolution with nearest neighbor (NN) upsampler. Previously, an NN upsampler was shown to alleviate tonal artifacts caused by transposed convolutions. We found that the NN upsampler also reduce the error in the high spectrum compared to the transposed convolution." However, in the code, linear interpolation is used instead of NN: https://github.com/sh-lee-prml/HierSpeechpp/blob/318c633cdc24cf3239d62c95ea8afae200ed2d13/speechsr48k/speechsr.py#L96C41-L96C70

I'm wondering whether it is a code or a description issue. I'm trying to apply the contributions you made in my own project, but I'm not sure whether I should use NN or linear interpolation.

Thanks very much :).
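
For anyone comparing the two options, the difference comes down to the interpolation mode of the upsampler; a generic PyTorch sketch (not the repository's exact code):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 32, 100)  # (batch, channels, time) features
    nn_up = nn.Upsample(scale_factor=3, mode="nearest")                       # nearest-neighbour
    lin_up = nn.Upsample(scale_factor=3, mode="linear", align_corners=False)  # linear interpolation

    print(nn_up(x).shape, lin_up(x).shape)  # both: torch.Size([1, 32, 300])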

Voice conversion training ?

Hello, thanks for posting TTS(TTV) training code!

Are you planning to release VC training instructions, too?

guide

Is there any installation guide? I am absolutely lost and really want to use this locally.

Training code

Hi,
Will the training code be released to test it on our own datasets?

Pavan

Gibberish audio on preliminary training

Thanks for the repo. I had around 30 hours of custom Hindi data which I wanted to train and test the model on. Training only the TTV part on 4x A6000 GPUs with a batch size of 64, I tried inference with the provided VC checkpoint, but I was getting unintelligible results. There was no distortion in the audio, just unintelligible speech.
What do you reckon could be the cause? Is it that I would also need a VC model for Hindi, or were my training steps too few? And after roughly how many steps do you think we can expect a checkpoint that gives a decent preliminary result?
Thanks again!

Upload model weights to Hugging Face Hub!

Hi team,

I'm VB, I lead the advocacy efforts for Open Source Audio at Hugging Face. Congratulations on such a brilliant model and for a highly permissive license. It'd be great if you could upload the weights to the Hugging Face hub as well. It'd allow the model weights to be discovered more readily, leading to even greater adoption.

These model weights can reside under the HierSpeech++ organisation (or your research group) on Hugging Face, too.

I'd be happy to help you with this, of course. Let me know.

VB

A question about the used F0 for SA and SR source filters

Hello,
First of all, thank you for your work. I am enjoying reading your paper and finding that we share a few ideas :)
I am curious about the F0s fed to the speaker-agnostic (SA) and speaker-related (SR) source filters during training: where do they come from? From the audio after perturbation, from the audio before perturbation, or does the SA F0 come from the perturbed audio and the SR F0 from the original audio?

Thank you.

DWT D?

Sorry, I can't find the DWT-D.........

Do you use BiT-Flow in TTV?

Thank you for your great previous reply. I have successfully fine-tuned your pretrained model. Do you use BiT-Flow in TTV?

local variable 'audiosr' referenced before assignment

When I run sh inference.sh, I get this error:

Initializing Inference Process..
/home/hanct/anaconda3/envs/hier/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
  File "/home/hanct/snap/snapd-desktop-integration/83/Documents/HierSpeechpp/inference.py", line 224, in <module>
    main()
  File "/home/hanct/snap/snapd-desktop-integration/83/Documents/HierSpeechpp/inference.py", line 221, in main
    inference(a)
  File "/home/hanct/snap/snapd-desktop-integration/83/Documents/HierSpeechpp/inference.py", line 183, in inference
    hierspeech = model_load(a)
  File "/home/hanct/snap/snapd-desktop-integration/83/Documents/HierSpeechpp/inference.py", line 162, in model_load
    utils.load_checkpoint(a.ckpt_sr48, audiosr, None)
UnboundLocalError: local variable 'audiosr' referenced before assignment

It looks like something is missing. What is this audiosr variable?

About text in TTV preprocessing

May I kindly inquire which file you utilized for extracting text tokens from the LibriTTS dataset during the TTV training preprocessing phase: was it the original.txt or the normalized.txt? Thank you for your assistance.

Which VC model shall I use for crosslingual VC?

As the tutorial wrote:

--ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460

--ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960

--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)

--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py
--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth"
--output_dir "vc_results_eng_kor_v2"
--noise_scale_vc "0.333"
--noise_scale_ttv "0.333"
--denoise_ratio "0"

Which VC model is best for cross-lingual VC?

Random pauses in between the sentences

Hi, thanks again for open-sourcing the models.
I have been training the model on a Hindi dataset, but I have noticed that there seem to be random abrupt pauses in the generated sentences. I have attached a sample (from the 120k checkpoint).
So, I just wanted to know whether you have observed this in your training as well, and what could be the cause.

audio_wav.mp4

[I understand that you may not know Hindi, but I think the pauses at 3 seconds and 8 seconds are pretty noticeable]
Thanks again!

A question about the TTV model

Hello.
Thank you for open-sourcing such a great project.
The concept of extracting w2v to combine VC and TTS is really interesting.

I got interested and was training the model when a question about the TTV transformer came up, so I am opening this issue.

I confirmed that posterior => decode works well,
but the KL divergence between the prior and posterior distributions is very large,
and after a certain point training stops progressing properly.

  1. Did you add any features to reduce the gap between the two distributions?
  2. How long did it take you to train the TTV?

Thank you.

Prosody Encoder

Hi, is the output of the prosody encoder in the hierarchical speech synthesizer only used to compute a loss against the first 20 bins of the target mel-spectrogram? What weight is assigned to this loss?
