
ns2vc's Introduction

NS2VC_v2

Unofficial implementation of NaturalSpeech2 for Voice Conversion

Unlike NS2, I use Vocos instead of EnCodec as the vocoder for better quality, and use ContentVec to replace the text embedding and the duration/span process. I also adopted the UNet1D conditional model from the diffusers library; thanks for their hard work.

About Zero shot generalization

I made many attempts to improve the generalization of the model, and found that it behaves much like Stable Diffusion: if a "tag" is not in your training set, you can't expect a promising result. A larger dataset with more speakers gives better generalization and better results. Speakers that are in the training set reliably get good results.

Demo

refer input output
refer0.webm gt0.webm gen0.webm
refer1.webm gt1.webm gen1.webm
refer2.webm gt2.webm gen2.webm
refer3.webm gt3.webm gen3.webm
refer4.webm gt4.webm gen4.webm

Data preprocessing

First of all, you need to download the ContentVec model and put it under the hubert folder. The model can be downloaded from here.
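The expected layout is roughly the sketch below; the checkpoint filename is the one referenced elsewhere in this repository (checkpoint_best_legacy_500.pt), but verify it against the file you actually downloaded.

hubert
└── checkpoint_best_legacy_500.pt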

The dataset structure can be like this:

dataset
├── spk1
│   ├── 1.wav
│   ├── 2.wav
│   ├── ...
│   └── spk11
│       ├── 11.wav
├── 3.wav
├── 4.wav

Overall, you can put the data in any way you like.

Put the data with .wav extension under the dataset folder, and then run the following command to preprocess the data.

python preprocess.py

The preprocessed data will be saved under the processed_dataset folder.

Requirements

You can install the requirements by running the following command.

pip install vocos accelerate matplotlib librosa unidecode inflect ema_pytorch tensorboard fairseq praat-parselmouth pyworld

Training

Install accelerate first, run accelerate config to configure the environment, and then run the following command to train the model.

accelerate launch train.py
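If you are unsure what accelerate config should produce, the following is a minimal sketch of a single-machine, two-GPU DDP setup assembled from keys that appear in this project's issues below; it is an assumption, not the author's recommended setting, so adjust num_processes and mixed_precision to your hardware.

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU    # plain DDP, no DeepSpeed/FSDP
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2               # number of GPUs on this machine
use_cpu: false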

Inference

Change the device, model_path, clean_names and refer_names in inference.py, and then run the following command to run inference.

python infer.py
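As a rough illustration, the variables to edit might look like this; only the variable names come from the README, and all values are placeholders.

device = 'cuda:0'                    # device used for inference
model_path = 'logs/model-100.pt'     # path to a trained checkpoint (placeholder)
clean_names = ['source_0.wav']       # source utterances to convert
refer_names = ['refer_0.wav']        # reference utterances providing the target voice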

Continue training

If you want to fine-tune or continue training a model, add

trainer.load('your_model_path')

to train.py.
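A minimal sketch of what this looks like in context, assuming train.py builds a Trainer and then calls its training entry point; the checkpoint path is a placeholder.

trainer = Trainer()
trainer.load('your_model_path')  # resume from an existing checkpoint
trainer.train()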

Pretrained model

Maybe coming soon, if I can gather enough data for a good model.

TTS

If you want to use the TTS model, please check the TTS branch.

Q&A

QQ group: 801645314. You can join the QQ group to discuss the project.

Thanks to sovits4, naturalspeech2, imagen and diffusers for their great work.


ns2vc's Issues

Data preparation

How should the data be prepared? Do we need wav files and lab files, or something else?

Four asserts in preprocessing

lmin = min(f0.size(-1), spec.size(-1), sum(duration))
# f0 and the mel-spectrogram must have (nearly) the same number of frames
assert abs(f0.size(-1) - spec.size(-1)) < 3, (spec.size(-1), f0.shape, filename)
# the summed phoneme durations must match the number of spectrogram frames
assert abs(spec.size(-1) - sum(duration)) < 3, (spec.size(-1), sum(duration), filename)
# the waveform length must agree with lmin frames times the hop length
assert abs(audio.shape[1] - lmin * self.hop_length) < 3 * self.hop_length, (audio.shape[1], lmin, self.hop_length, filename)
# there must be one duration value per phoneme
assert phone.shape[0] == duration.shape[0]
Could you help me understand what these four asserts mean?
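For context, a minimal sketch of how these quantities relate, assuming a hypothetical hop_length; this is illustrative only and not code from the repo.

# All frame-level features (f0, mel-spectrogram, summed durations) live on the same
# frame grid, and the waveform has roughly hop_length samples per frame.
hop_length = 256                      # hypothetical value
n_samples = 163840                    # raw audio length in samples
n_frames = n_samples // hop_length    # 640 frames; f0.size(-1), spec.size(-1) and
                                      # sum(duration) should each be within ~3 of this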

Error during installation

Hi,

Thank you for your work.

I got this error:
ERROR: Could not build wheels for fairseq, which is required to install pyproject.toml-based projects
when trying to run:
pip install audiolm_pytorch

Please let me know how to fix this.

ASK for the accelerate config setting

Can you provide the config.yaml for us?
I don't know whether settings other than the ones for my own machine are correct for this program.
I'm not sure about distributed_type and the other settings.
Here is my setting:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: ''
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Bug in load param

I found that if multiple GPUs are used for training, the checkpoint keys no longer match the model during loading (they carry a "module." prefix).
Maybe the load part needs code like this:
saved_state_dict = data['model']
model = self.accelerator.unwrap_model(self.model)
new_state_dict = {}
for k, v in saved_state_dict.items():
    name = k[7:]  # strip the "module." prefix added by distributed training
    new_state_dict[name] = v
if hasattr(model, 'module'):
    model.module.load_state_dict(new_state_dict)
else:
    model.load_state_dict(new_state_dict)
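A related sketch, assuming a hypothetical save helper rather than the repo's actual code: unwrapping the model with accelerate before saving avoids the "module." prefix in the first place.

# 'path' is a placeholder; the checkpoint layout mirrors the load code above.
unwrapped = self.accelerator.unwrap_model(self.model)
data = {'model': unwrapped.state_dict()}
torch.save(data, path)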

Organizing Audio Data for Speaker Recognition: Folders or Single Folder?

First, thank you very much for your work, it's amazing!

I have 10 speakers in my dataset. Should I organize each speaker's audio files into separate folders like this

├── spk1
│   ├── 1.wav
│   ├── 2.wav
│   └── ...
├── spk2
│   ├── 3.wav
│   ├── 4.wav
│   └── ...
└── spk3
    ├── 5.wav
    ├── 6.wav
    └── 

Or is it okay to put all the audio files in one folder like this:

> ├── 1.wav
> ├── 2.wav
> ├── 3.wav
> ├── 4.wav
> ├── 5.wav
> ├── 6.wav
> └── 

Does separating each speaker into a folder help with better recognition during training? Also, I only have about 1 hour of audio data for each of the 10 speakers. Is that enough for effective training?

Training procedure

We first train the audio codec using 8 NVIDIA TESLA V100 16GB GPUs with a batch size of 200 audios per GPU for 440K steps. We follow the implementation and experimental setting of SoundStream [19] and adopt Adam optimizer with 2e-4 learning rate. Then we use the trained codec to extract the quantized latent vectors for each audio to train the diffusion model in NaturalSpeech 2.

The diffusion model in NaturalSpeech 2 is trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6K frames of latent vectors per GPU for 300K steps (our model is still underfitting and longer training will result in better performance). We optimize the models with the AdamW optimizer with 5e-4 learning rate, 32k warmup steps following the inverse square root learning schedule.

According to the original paper, it seems the audio codec and the diffusion model are trained separately.
I would like to ask whether you have tried training the two parts separately. I noticed that in the NS2-ttsv2 training everything codec-related seems to be commented out; is that because the codec's results were unsatisfactory?

about the pretrain when training

'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /charactr/vocos-mel-24khz/resolve/main/config.yaml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001B8B793E310>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/charactr/vocos-mel-24khz/resolve/main/config.yaml
Traceback (most recent call last):
File "E:\pythoncode\NS2VC-vc-v2\train.py", line 3, in
trainer = Trainer()
File "E:\pythoncode\NS2VC-vc-v2\model.py", line 809, in init
self.vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
File "D:\anaconda\envs\pytorch\lib\site-packages\vocos\pretrained.py", line 65, in from_pretrained
config_path = hf_hub_download(repo_id=repo_id, filename="config.yaml")
File "D:\anaconda\envs\pytorch\lib\site-packages\huggingface_hub\utils_validators.py", line 124, in _inner_fn
return fn(*args, **kwargs)
File "D:\anaconda\envs\pytorch\lib\site-packages\huggingface_hub\file_download.py", line 1211, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

Sorry to bother you again. When I was training, the download of the pretrained Vocos model seems to fail, even though my network is fine. How can I fix it?

How to install "operations"?

from operations import OPERATIONS_ENCODER, MultiheadAttention, SinusoidalPositionalEmbedding

But this "operations" module can not be found on pip or conda or anywhere else. Where is that?

question about function 'encode' for rvq loss

NS2VC/model.py

Line 539 in b07e453

quantized_list.append(quantized_out)

At this line, a quantized_out whose values are all zero is always appended first.

Wouldn't this leave the quantized list with 7 real quantized_out tensors plus an all-zero one, instead of 8?

I'm not sure if this is correct, but I'd appreciate it if you could explain why you coded it this way.

thank you

question about rvq_cross_entropy_loss

hi, first of all I'm glad to see your amazing work. I am also implementing NaturalSpeech2 and refer to lucidrain's code.

In the process of implementing rvq_cross_entropy_loss, you use 'codes_padded' as it was when you implemented 'loss', whereas lucidrain uses an 8-dimensional vector called 'codes'. I'm wondering if there's any rationale for why it's implemented differently.

And when I implemented it the way you did, the following error pops up while computing the cross entropy.

File "/ZeroShotProject/NaturalSpeech2/vector_quantize_pytorch/vector_quantize_pytorch.py", line 592, in forward
return quantize, F.cross_entropy(distances, indices, ignore_index = -1)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float

When computing the RVQ cross-entropy loss, the targets apparently need to be of long (integer) type, but codes_padded, used the way you use it, is not of long type; is that why the error appears?

I'm curious how you solved it.

I will be waiting for your reply.

Thanks for reading.

Does training on the v4 branch work?

Hi! Thank you very much for your work and this amazing repo.

I tried training the v4 branch and something is very wrong: after about 3 hours of training nothing changes, and I get only noise at every step. These are the steps I used:

1. python preprocess.py
2. python model1.py

After 29000 steps on the v4 branch:
(image)

On v3 or the main branch, after some steps I get this instead:

After 5000 steps on v3 or the main branch:
(image)

As you can see, on v4 I only get noise, so I must be doing something wrong.

Can you please tell me whether training on the v4 branch works, or what I am doing wrong?

Thank you for your time.

Song cloning question

I'm wondering whether the code in this repository also performs well at cloning songs. Thanks.

48khz Samplerate

I have been training the v4 version and the results looked promising. Thanks for the awesome work! Now I would like to train it on 48 kHz audio. Could you point me to the parameters etc. that I would need to change to do so?
Thank you!

Error in accelerate launch train.py

Hi, thanks for the amazing work.

I put the .wav files in the dataset folder and processed them with:
python preprocess.py

It creates the dataset_processed folder with contents.

Then I ran 'accelerate config' to generate the config file.

But when I try to train the model, it throws the following error: 'ValueError: num_samples should be a positive integer value, but got num_samples=0'

Please help me solve it. I am an iOS developer and new to generative AI.

accelerate launch train.py

Traceback (most recent call last):
  File "E:\AI\NS2VC-vc-v2\train.py", line 3, in <module>
    trainer = Trainer()
              ^^^^^^^^^
  File "E:\AI\NS2VC-vc-v2\model.py", line 824, in __init__
    dl = DataLoader(ds, batch_size = self.cfg['train']['train_batch_size'], shuffle = True, pin_memory = True, num_workers = self.cfg['train']['num_workers'], collate_fn = collate_fn)
ValueError: num_samples should be a positive integer value, but got num_samples=0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\user\anaconda3\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\anaconda3\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\user\anaconda3\Lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "C:\Users\user\anaconda3\Lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\user\\anaconda3\\python.exe', 'train.py']' returned non-zero exit status 1.

About the dataset

Which public dataset do you recommend (something similar to VCTK), and how should the data be arranged in the dataset folder? Thanks!

f0_loss issue questions

During training, I found that f0_loss stays constantly at zero. Any help would be appreciated.

diff-vits vs NS2 tts-v2

I would like to ask you a few questions.

  1. What is the difference between the diff-vits project and NS2 tts-v2? From a rough look (and from what I saw earlier), it seems the main model was changed to VITS while keeping the NaturalSpeech architecture?
  2. I tested the tts-v2 model on a training set with 1500+ speakers and 600+ hours, and most out-of-set test data still does not sound very similar. Is it true, as the paper's experiments suggest, that a much larger dataset is needed for out-of-set generalization? Roughly how many hours and how many speakers do you think are needed for good results?
  3. What do you think is the difference between the ground-truth durations obtained with MFA and the durations predicted with MAS? You seem to prefer the MAS-based approach.

Error when running preprocess.py

Hi,
When running preprocess.py, I get this error:
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Could you guide me on solving this?

This is the stack trace:

Loading hubert for content...
load model(s) from /home/ubuntu/PS/NaturalSpeech2/NS2VC/checkpoint_best_legacy_500.pt
Loaded hubert.
0%| | 0/43950 [00:01<?, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/preprocess.py", line 73, in process_batch
process_one(filename, hmodel, codec)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/preprocess.py", line 34, in process_one
c = utils.get_hubert_content(hmodel, wav_16k_tensor=wav16k)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/utils.py", line 235, in get_hubert_content
logits = hmodel.extract_features(**inputs)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/fairseq/models/hubert/hubert.py", line 535, in extract_features
res = self.forward(
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/fairseq/models/hubert/hubert.py", line 437, in forward
features = self.forward_features(source)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/fairseq/models/hubert/hubert.py", line 392, in forward_features
features = self.feature_extractor(source)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/fairseq/models/wav2vec/wav2vec2.py", line 895, in forward
x = conv(x)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ubuntu/PS/NaturalSpeech2/NS2VC/venv38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

accelerate launch train.py: parallel training is very slow; it seems accelerate's parallelism is not working

How should the "accelerate config" command be used to generate /home/duser/.cache/huggingface/accelerate/default_config.yaml?

I have tried the two configurations below. Both let accelerate launch train.py run successfully, but the parallel training does not seem to work. How should accelerate be configured?

(1)

Which type of machine are you using? This machine
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:No
Do you want to use DeepSpeed? [yes/NO]: No
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:5
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:1,2,3,4,5
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
accelerate configuration saved at /home/duser/.cache/huggingface/accelerate/default_config.yaml

(2)
-In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: yes
What should be your sharding strategy?
FULL_SHARD
Do you want to offload parameters and gradients to CPU? [yes/NO]: yes
What should be your auto wrap policy?
NO_WRAP
-What should be your FSDP's backward prefetch policy?
BACKWARD_PRE
What should be your FSDP's state dict type?
FULL_STATE_DICT
How many GPU(s) should be used for distributed training? [1]:5
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at /home/duser/.cache/huggingface/accelerate/default_config.yaml

Issues with preserving the speaker identity

Okay, so I've been testing out the demo colab notebook and tried synthesizing a few characters, but it seems like it's having a hard time preserving the speaker identity. The result audio doesn't sound like my reference audio at all.

tts_infer and the preprocess

During preprocessing, you convert the blank regions of the TextGrid to sil, but at inference time you use sp instead, so it seems an untrained sp is used as the phoneme for silence.
In my tests, sp seems to cause a buzzing sound in the synthesized audio, apparently due to a training issue.
I would also like your advice: I have trained for 350k steps on 200 speakers / 200 hours of audio, but even for speakers inside the corpus the similarity at inference still seems low, although the audio quality is good. Do you have any suggestions for training? The loss curves are attached below, though I cannot tell what they imply about the training.
(image: loss curves)
