- This code is an unofficial implementation of NaturalSpeech 2.
- The algorithm is based on the following paper:
Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., ... & Bian, J. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.
- The structure is derived from NaturalSpeech 2, but I made several modifications.
    - Linear attention is applied instead of dot-product-based multi-head attention.
        - This change was made to reduce memory usage and improve computational speed in resource-limited environments.
        - This may be a cause of performance degradation.
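For reference, the following is a minimal sketch of non-causal linear attention, assuming the `elu(x) + 1` feature map from Katharopoulos et al. (2020); the actual module in this repository may differ in details.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: [batch, heads, time, dim].
    # Feature map phi(x) = elu(x) + 1 keeps features positive (an assumption;
    # other kernels are possible).
    q = torch.nn.functional.elu(q) + 1.0
    k = torch.nn.functional.elu(k) + 1.0
    # Aggregating keys and values first makes the cost linear in time:
    # O(T * d^2) instead of the O(T^2 * d) of dot-product attention.
    kv = torch.einsum('bhtd,bhte->bhde', k, v)
    normalizer = 1.0 / (torch.einsum('bhtd,bhd->bht', q, k.sum(dim=2)) + eps)
    return torch.einsum('bhtd,bhde,bht->bhte', q, kv, normalizer)
```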
- About CE-RVQ
    - The CE-RVQ implementation in the current repository is incomplete.
    - I had doubts about the loss calculation formula mentioned in the paper, so the previous implementation has been commented out.
    - I think the current implementation aligns with the purpose of CE-RVQ, but it deviates from the paper.
    - It has not been verified how much this loss contributes positively to model training.
    - I would greatly appreciate any advice or suggestions regarding this matter.
    - The CE-RVQ loss is selectively applied to a random subset of RVQ layers at each step.
        - Since CE-RVQ consumes a significant amount of memory, I applied sampling to reduce memory usage.
        - If you want to apply it to all RVQ layers, please modify the hyperparameter `hp.Diffusion.CERVQ.Num_Sample`.
        - Based on the suggestion from @Autonomof, I added functionality to increase the weight of the initial layers when sampling the CE-RVQ layers. If you set `hp.Diffusion.CERVQ.Use_Weighted_Sample == true`, the weights are taken into account. A sketch of both points follows after this list.
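A rough sketch of how the sampled CE-RVQ loss could look; the tensor layouts, codebook access, and the inverse-index weighting are assumptions for illustration and do not necessarily match the code in this repository.

```python
import torch
import torch.nn.functional as F

def sampled_ce_rvq_loss(z0_pred, quantized_per_layer, codebooks, code_indices,
                        num_sample=4, use_weighted_sample=True):
    # z0_pred:             [B, D, T] latent predicted by the diffusion model.
    # quantized_per_layer: [L, B, D, T] per-layer quantized outputs of the RVQ.
    # codebooks:           [L, V, D] codebook embeddings of each layer.
    # code_indices:        [L, B, T] ground-truth code index of each layer (long).
    num_layers = codebooks.size(0)
    num_sample = min(num_sample, num_layers)
    if use_weighted_sample:
        # Earlier layers carry the coarser, more important information, so
        # they get a higher sampling probability (assumed 1 / (i + 1) here).
        weights = 1.0 / torch.arange(1, num_layers + 1, dtype=torch.float)
        layers = torch.multinomial(weights, num_sample, replacement=False)
    else:
        layers = torch.randperm(num_layers)[:num_sample]

    loss = 0.0
    for i in layers.tolist():
        # Residual that layer i must explain: the prediction minus the
        # quantized sum of all previous layers.
        previous = quantized_per_layer[:i].sum(dim=0) if i > 0 else 0.0
        residual = (z0_pred - previous).transpose(1, 2)          # [B, T, D]
        # Negative squared distances to the codebook entries act as logits.
        distances = torch.cdist(
            residual,
            codebooks[i].unsqueeze(0).expand(residual.size(0), -1, -1))
        logits = -distances.pow(2.0).transpose(1, 2)             # [B, V, T]
        loss = loss + F.cross_entropy(logits, code_indices[i])
    return loss / num_sample
```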
- The audio codec has been changed to Meta's `Encodec 24kHz`.
    - This is done to reduce the time spent training a separate audio codec.
    - The model uses 16kHz audio, but no audio resampling is applied.
    - The latent dimension of Encodec is 128, which is smaller than the 256 given in the paper's hyperparameters. This may be a cause of performance degradation.
    - To stay closer to the paper, it may be better to apply Google's `SoundStream` instead of Encodec, but I could not apply SoundStream to this repository because no official PyTorch source code or pretrained model is provided.
        - There is an unverified implementation of SoundStream in Codec.py, so please refer to it.
        - Although this repository does not use it, there are also C++ and TFLite versions of Lyra, which may allow applying SoundStream through them.
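Extracting the 128-dimensional latents with Meta's `encodec` package looks roughly like this; the exact preprocessing in this repository (e.g. how the 16kHz audio is handled) may differ.

```python
import torch
from encodec import EncodecModel

# Pretrained 24kHz Encodec; its encoder outputs 128-dimensional latents.
model = EncodecModel.encodec_model_24khz()
model.eval()

wav = torch.randn(1, 1, 24000)  # [batch, channels, samples]: 1 second dummy audio
with torch.no_grad():
    latents = model.encoder(wav)   # [1, 128, T'] continuous latents
    frames = model.encode(wav)     # list of (codes, scale) tuples from the RVQ
print(latents.shape)
```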
- Information on the segment length `σ` of the speech prompt during training was not found in the paper, so it was set arbitrarily.
    - The `σ` = 3, 5, and 10 seconds used in the paper's evaluation are too long to apply to both the variance predictor and diffusion during training.
    - To ensure stable pattern usage, half the length of the shortest pattern in each training step is used as `σ` (see the sketch below).
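A minimal sketch of that rule; the random offset and the tensor layout are assumptions for illustration.

```python
import torch

def sample_speech_prompt(latents, latent_lengths):
    # latents: [B, D, T]; latent_lengths: valid length of each pattern.
    # Half the shortest pattern length in the batch becomes sigma.
    sigma = min(latent_lengths) // 2
    prompts = []
    for index, length in enumerate(latent_lengths):
        offset = torch.randint(0, length - sigma + 1, (1,)).item()
        prompts.append(latents[index, :, offset:offset + sigma])
    return torch.stack(prompts)  # [B, D, sigma]
```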
- The target duration is obtained through the `Alignment learning framework (ALF)`, rather than being brought in externally.
    - Using external modules such as Montreal Forced Aligner (MFA) may have benefits in terms of training speed or stability, but I prioritized simplifying the training process.
    - A weight has been applied to compensate for the relatively large MLE loss used in MAS.
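In this kind of alignment learning, the duration targets come from a Glow-TTS-style monotonic alignment search (MAS) over token-frame log-likelihoods. Below is a plain NumPy sketch of the dynamic program; the repository's implementation (and the exact likelihood definition) may differ.

```python
import numpy as np

def monotonic_alignment_search(log_likelihood):
    # log_likelihood: [num_tokens, num_frames]; assumes num_tokens <= num_frames.
    # Returns a 0/1 alignment matrix; token durations are its row sums.
    num_tokens, num_frames = log_likelihood.shape
    Q = np.full((num_tokens, num_frames), -np.inf)
    Q[0, 0] = log_likelihood[0, 0]
    for f in range(1, num_frames):
        for t in range(min(f + 1, num_tokens)):
            stay = Q[t, f - 1]
            move = Q[t - 1, f - 1] if t > 0 else -np.inf
            Q[t, f] = log_likelihood[t, f] + max(stay, move)
    # Backtrack the best monotonic path.
    alignment = np.zeros_like(Q)
    t = num_tokens - 1
    for f in range(num_frames - 1, -1, -1):
        alignment[t, f] = 1.0
        if t > 0 and (t == f or Q[t - 1, f - 1] > Q[t, f - 1]):
            t -= 1
    return alignment  # durations: alignment.sum(axis=1)
```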
- Padding is applied between tokens, like `A <P> B <P> C ....` (see the sketch below).
    - I could not verify whether there is a performance difference depending on its usage.
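A trivial sketch of that interleaving; `<P>` follows the example above.

```python
def interleave_padding(tokens, pad='<P>'):
    # ['A', 'B', 'C'] -> ['A', '<P>', 'B', '<P>', 'C']
    result = []
    for index, token in enumerate(tokens):
        if index > 0:
            result.append(pad)
        result.append(token)
    return result

print(interleave_padding(['A', 'B', 'C']))  # ['A', '<P>', 'B', '<P>', 'C']
```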
- To achieve the zero-shot capability reported in the paper, I believe the training data needs as many speakers as possible, but I was unable to test Multilingual LibriSpeech due to my current environment.
- Tested
    - LJ Dataset
    - VCTK Dataset
        - This repository uses VCTK 0.92 from Torchaudio (https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip).
- Supported but not tested
    - LibriTTS Dataset
    - Multilingual LibriSpeech Dataset
        - Only English-language dataset generation has been tested.
Before proceeding, please set the pattern, inference, and checkpoint paths in `Hyper_Parameters.yaml` according to your environment.
- `Sound`
    - Setting basic sound parameters.
- `Tokens`
    - The number of tokens.
    - After pattern generation, you can see which tokens are included in the dataset at `Token_Path`.
- `Audio_Codec`
    - Setting the audio codec.
    - This repository uses Encodec, so only the size of the latents output by Encodec's encoder is set here, for reference by other modules.
- `Train`
    - Setting the parameters of training.
- `Inference_Batch_Size`
    - Setting the batch size for inference.
- `Inference_Path`
    - Setting the inference path.
- `Checkpoint_Path`
    - Setting the checkpoint path.
- `Log_Path`
    - Setting the tensorboard log path.
- `Use_Mixed_Precision`
    - Setting whether mixed precision is used.
- `Use_Multi_GPU`
    - Setting whether multiple GPUs are used.
    - Due to an nvcc problem, only Linux supports this option.
    - If this is `True`, the `Device` parameter must also list multiple devices, like `0,1,2,3`.
    - You also have to change the training command: please check multi_gpu.sh.
- `Device`
    - Setting which GPU devices are used in a multi-GPU environment.
    - If using only the CPU, please set `-1`. (But I don't recommend this for training.)
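For orientation, the settings above can be read with a plain YAML load; any namespace wrapper used by the repository and the exact nesting of keys are assumptions here.

```python
import yaml

with open('Hyper_Parameters.yaml', 'r', encoding='utf-8') as f:
    hp = yaml.safe_load(f)

# Key names follow the descriptions above.
print(hp['Sound'])
print(hp['Train'])
```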
```
python Pattern_Generate.py [parameters]
```
- `-lj`
    - The path of the LJSpeech dataset.
- `-vctk`
    - The path of the VCTK dataset.
- `-libri`
    - The path of the LibriTTS dataset.
- `-hp`
    - The path of the hyperparameter file.
- To generate phoneme strings, this repository uses the phonemizer library.
    - Please refer here to install phonemizer and its backend.
    - On Windows, additional settings are needed to use phonemizer.
        - Please refer here.
        - In a conda environment, the following commands are useful:
```
conda env config vars set PHONEMIZER_ESPEAK_PATH='C:\Program Files\eSpeak NG'
conda env config vars set PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'
```
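Once phonemizer and the espeak backend are installed, basic usage looks like this:

```python
from phonemizer import phonemize

# Requires the espeak-ng backend installed as described above.
phonemes = phonemize('Hello, world.', language='en-us',
                     backend='espeak', strip=True)
print(phonemes)
```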
```
python Train.py -hp <path> -s <int>
```
- `-hp <path>`
    - The hyperparameter file path.
    - This is required.
- `-s <int>`
    - The resume step parameter.
    - Default is `0`.
    - If the value is `0`, the model tries to find the latest checkpoint.
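One way such a search could work; the checkpoint file-naming scheme below is an assumption, not necessarily the repository's.

```python
import os
import re

def find_latest_step(checkpoint_path):
    # Assumes checkpoints are named like 'S_<step>.pt'; adjust the pattern
    # to match the actual naming scheme of this repository.
    steps = [
        int(match.group(1))
        for name in os.listdir(checkpoint_path)
        if (match := re.match(r'S_(\d+)\.pt$', name))
        ]
    return max(steps, default=0)
```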
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322
```
- I recommend checking multi_gpu.sh.
- Verification