playvoice / vits_chinese

Best practice TTS based on BERT and VITS, with some Natural Speech features from Microsoft; ONNX streaming output supported!

Home Page: https://huggingface.co/spaces/maxmax20160403/vits_chinese

License: MIT License


vits_chinese's Introduction

Best practice TTS based on BERT and VITS, with some Natural Speech features from Microsoft


This is a project for studying TTS algorithms. If you are looking for a TTS that can go straight into production, this project is probably not for you!

(demo video: vits_bert.mp4, synthesizing the passages below)

天空呈现的透心的蓝,像极了当年。总在这样的时候,透过窗棂,心,在天空里无尽的游弋!柔柔的,浓浓的,痴痴的风,牵引起心底灵动的思潮;情愫悠悠,思情绵绵,风里默坐,红尘中的浅醉,诗词中的优柔,任那自在飞花轻似梦的情怀,裁一束霓衣,织就清浅淡薄的安寂。

风的影子翻阅过淡蓝色的信笺,柔和的文字浅浅地漫过我安静的眸,一如几朵悠闲的云儿,忽而氤氲成汽,忽而修饰成花,铅华洗尽后的透彻和靓丽,爽爽朗朗,轻轻盈盈

时光仿佛有穿越到了从前,在你诗情画意的眼波中,在你舒适浪漫的暇思里,我如风中的思绪徜徉广阔天际,仿佛一片沾染了快乐的羽毛,在云环影绕颤动里浸润着风的呼吸,风的诗韵,那清新的耳语,那婉约的甜蜜,那恬淡的温馨,将一腔情澜染得愈发的缠绵。

Features

1. Hidden prosody embedding from BERT: natural pauses at grammatical boundaries

2. Infer loss from NaturalSpeech: fewer pronunciation errors

3. Framework of VITS: high audio quality

4. Module-wise distillation: faster inference

💗 Tip: it is recommended to fine-tune with the infer loss after the base model has been trained, and to freeze the PosteriorEncoder during fine-tuning.

💗 In other words: train without loss_kl_r at first; once the base model is trained, add loss_kl_r and continue training, and only a little further training is needed. If the audio quality degrades, multiply loss_kl_r by a coefficient smaller than 1 to reduce its contribution to the model. When continuing training, you can also try freezing the audio encoder (Posterior Encoder). In short, there are many ways to play with this; you need to experiment!
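A minimal sketch of that recipe, assuming a VITS-style generator whose posterior encoder is net_g.enc_q; the attribute name and the loss names here are assumptions, not necessarily this repo's exact code:

def freeze_posterior_encoder(net_g):
    # Freeze the PosteriorEncoder so fine-tuning does not update it
    for p in net_g.enc_q.parameters():
        p.requires_grad = False

def generator_loss(loss_mel, loss_dur, loss_kl, loss_kl_r, kl_r_weight=0.5):
    # Weight loss_kl_r below 1.0 if audio quality degrades
    return loss_mel + loss_dur + loss_kl + kl_r_weight * loss_kl_r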

(figure: naturalspeech)

Why not upgrade to VITS2?

The most important improvement in VITS2 is replacing the WaveNet module in the Flow with a Transformer, but streaming TTS implementations usually require the Transformer to be replaced with a pure CNN.

Online demo

https://huggingface.co/spaces/maxmax20160403/vits_chinese

Install dependencies and build the MAS alignment extension

pip install -r requirements.txt

cd monotonic_align

python setup.py build_ext --inplace

Infer with the pretrained model

Get the models from the release page vits_chinese/releases/

Put prosody_model.pt at ./bert/prosody_model.pt

Put vits_bert_model.pth at ./vits_bert_model.pth

python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth

./vits_infer_out holds the inferred waves; listen!

Infer with chunked wave streaming output

The key parameter is hop_frame = ∑ decoder.ups.padding 💗

python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth
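The value above can be computed like this (a hedged sketch, assuming decoder.ups holds HiFi-GAN-style ConvTranspose1d upsampling layers as in the VITS Generator):

def compute_hop_frame(decoder):
    # Sum the padding of every upsampling layer, per the formula above
    return sum(up.padding[0] for up in decoder.ups)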

The ceiling applied to durations affects naturalness.

So change w_ceil = torch.ceil(w) to w_ceil = torch.ceil(w + 0.35)
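For context, a sketch of where this line sits in VITS-style inference (variable names follow the upstream VITS code; treat this as an illustration):

import torch

def expand_durations(logw, x_mask, length_scale=1.0):
    w = torch.exp(logw) * x_mask * length_scale
    return torch.ceil(w + 0.35)  # biased ceil for more natural pacing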

All Thanks To Our Contributors:

Train

Download the Baker dataset: https://aistudio.baidu.com/datasetdetail/36741; more info: https://www.data-baker.com/data/index/TNtts/

Change the sample rate of the waves to 16 kHz and put them in ./data/waves

python vits_resample.py -w [input path]:[./data/Wave/] -o ./data/waves -s 16000
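A sketch of what this resampling step does, using librosa and soundfile (vits_resample.py is the project's own implementation; this is only an equivalent illustration):

import librosa
import soundfile as sf

def resample_to_16k(src_wav, dst_wav):
    y, _ = librosa.load(src_wav, sr=16000)  # librosa resamples on load
    sf.write(dst_wav, y, 16000)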

Put 000001-010000.txt at ./data/000001-010000.txt

python vits_prepare.py -c ./configs/bert_vits.json
python train.py -c configs/bert_vits.json -m bert_vits

(figure: bert_lose)

Additional notes

The original annotations are:

000001	卡尔普#2陪外孙#1玩滑梯#4ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002	假语村言#2别再#1拥抱我#4jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3

After normalizing the annotations:

  • BERT needs the Chinese characters (punctuation included): 卡尔普陪外孙玩滑梯。
  • TTS needs the initials and finals: sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000001	卡尔普陪外孙玩滑梯ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
  sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000002	假语村言别再拥抱我jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
  sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil

The training annotations are:

./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
./data/wavs/000002.wav|./data/temps/000002.spec.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
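A hedged sketch of the first step of this normalization, stripping the #1..#4 prosody marks from a raw label line to obtain the characters BERT needs (an illustration, not the repo's exact preprocessing):

import re

raw = "000001\t卡尔普#2陪外孙#1玩滑梯#4"
utt_id, text = raw.split("\t")
chars = re.sub(r"#\d", "", text)  # -> 卡尔普陪外孙玩滑梯, the input for BERT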

This sentence triggers an error:

002365	这图#2难不成#2是#1P过的#4?
  zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5

Fixing pinyin errors

Write the correct word and its pinyin into the file ./text/pinyin-local.txt:

渐渐 jian4 jian4
浅浅 qian3 qian3
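A hedged sketch of loading such a lexicon as a {word: syllables} override (an illustration; the repo's actual loader may differ):

local_pinyin = {}
with open("./text/pinyin-local.txt", encoding="utf-8") as f:
    for lexline in f:
        parts = lexline.split()
        if len(parts) >= 2:
            local_pinyin[parts[0]] = parts[1:]
# local_pinyin["渐渐"] == ["jian4", "jian4"]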

Number reading support

Supported, based on WeTextProcessing from the WeNet open-source community; of course, it cannot be perfect.
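For reference, WeTextProcessing exposes a normalizer along these lines (the sample sentence is only illustrative):

from tn.chinese.normalizer import Normalizer

normalizer = Normalizer()
print(normalizer.normalize("共计998元"))  # digits are expanded into Chinese words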

Inference without BERT

python vits_infer_no_bert.py --config ./configs/bert_vits.json --model vits_bert_model.pth

Although BERT is used during training, inference can run entirely without it, sacrificing natural pauses to fit devices with little compute, such as phones.

Low-resource devices usually synthesize sentence by sentence, which makes the sacrificed natural pauses less noticeable.

ONNX, non-streaming

Export (there will be many warnings; just ignore them):

python model_onnx.py --config configs/bert_vits.json --model vits_bert_model.pth

Inference:

python vits_infer_onnx.py --model vits-chinese.onnx

ONNX, streaming

Concretely, VITS is split into two models, named Encoder and Decoder.

  • The Encoder contains the TextEncoder, the DurationPredictor, and related modules;

  • The Decoder contains the ResidualCouplingBlock, the Generator, and related modules;

  • The ResidualCouplingBlock, i.e. the Flow, can be placed in either the Encoder or the Decoder; placing it in the Decoder requires a larger hop_frame.

The inference logic is split accordingly; notably, sampling from the prior distribution happens in the Encoder:

z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
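A hedged sketch of the resulting chunked decoding loop with onnxruntime: each chunk is decoded with hop_frame frames of context on both sides, and the matching samples are trimmed from the audio. The decoder input name "z_p", the chunk size, and up_rate (output samples per frame) are assumptions, not this repo's exact values:

import numpy as np

def stream_decode(decoder_sess, z_p, hop_frame, chunk, up_rate):
    T = z_p.shape[2]
    pieces = []
    for start in range(0, T, chunk):
        end = min(T, start + chunk)
        s = max(0, start - hop_frame)   # left context, in frames
        e = min(T, end + hop_frame)     # right context, in frames
        audio = decoder_sess.run(None, {"z_p": z_p[:, :, s:e]})[0]
        left = (start - s) * up_rate    # trim the context samples back off
        right = audio.shape[-1] - (e - end) * up_rate
        pieces.append(audio[..., left:right])
    return np.concatenate(pieces, axis=-1)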

ONNX streaming model export

python model_onnx_stream.py --config configs/bert_vits.json --model vits_bert_model.pth

ONNX streaming model inference

python vits_infer_onnx_stream.py --encoder vits-chinese-encoder.onnx --decoder vits-chinese-decoder.onnx

In streaming inference, hop_frame is an important parameter; you need to experiment to find a suitable value.

Model compression based on knowledge distillation (or should it be called transfer learning?)

The student model is 53M in size and runs at 3× the speed of the teacher model.

To train:

python train.py -c configs/bert_vits_student.json -m bert_vits_student

To infer, get the student model from the release page:

python vits_infer.py --config ./configs/bert_vits_student.json --model vits_bert_student.pth

Code sources

Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

https://github.com/Executedone/Chinese-FastSpeech2 (BERT prosody)

https://github.com/wenet-e2e/WeTextProcessing

https://github.com/TensorSpeech/TensorFlowTTS (heavily relied on)

https://github.com/jaywalnut310/vits

https://github.com/wenet-e2e/wetts

https://github.com/csukuangfj (ONNX and Android)

BERT applied to TTS

2019 BERT+Tacotron2: Pre-trained text embeddings for enhanced text-to-speech synthesis.

2020 BERT+Tacotron2-MultiSpeaker: Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS.

2021 BERT+Tacotron2: Extracting and predicting word-level style variations for speech synthesis.

2022 https://github.com/Executedone/Chinese-FastSpeech2

2023 BERT+VISINGER: Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information.

AISHELL3 multi-speaker training; the resulting model can be used for voice cloning

Switch to the code branch bert_vits_aishell3; diffing the branches shows the changes made for multiple speakers.

Data download

http://www.openslr.org/93/

Sample-rate conversion

python prep_resample.py --wav aishell-3/train/wav/ --out vits_data/waves-16k

Label normalization (output file labels.txt; the name must not be changed)

python prep_format_label.py --txt aishell-3/train/content.txt --out vits_data/labels.txt
  • Original labels
SSB00050001.wav	广 guang3 州 zhou1 女 nv3 大 da4 学 xue2 生 sheng1 登 deng1 山 shan1 失 shi1 联 lian2 四 si4 天 tian1 警 jing3 方 fang1 找 zhao3 到 dao4 疑 yi2 似 si4 女 nv3 尸 shi1
SSB00050002.wav	尊 zhun1 重 zhong4 科 ke1 学 xue2 规 gui1 律 lv4 的 de5 要 yao1 求 qiu2
SSB00050003.wav	七 qi1 路 lu4 无 wu2 人 ren2 售 shou4 票 piao4
  • Normalized labels
SSB00050001.wav 广州女大学生登山失联四天警方找到疑似女尸
	guang3 zhou1 nv3 da4 xue2 sheng1 deng1 shan1 shi1 lian2 si4 tian1 jing3 fang1 zhao3 dao4 yi2 si4 nv3 shi1
SSB00050002.wav 尊重科学规律的要求
	zhun1 zhong4 ke1 xue2 gui1 lv4 de5 yao1 qiu2
SSB00050003.wav 七路无人售票
	qi1 lu4 wu2 ren2 shou4 piao4
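A hedged sketch of this separation: AISHELL-3's content.txt interleaves characters with pinyin, so splitting on whitespace recovers both streams (prep_format_label.py is the project's own implementation):

raw = "SSB00050003.wav\t七 qi1 路 lu4 无 wu2 人 ren2 售 shou4 票 piao4"
utt, rest = raw.split("\t")
tokens = rest.split()
chars = "".join(tokens[0::2])    # 七路无人售票
pinyin = " ".join(tokens[1::2])  # qi1 lu4 wu2 ren2 shou4 piao4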

Data preprocessing

python prep_bert.py --conf configs/bert_vits.json --data vits_data/

The printed messages show the erhua (r-colored) syllables this project does not support being filtered out.

Generates vits_data/speakers.txt:

{'SSB0005': 0, 'SSB0009': 1, 'SSB0011': 2..., 'SSB1956': 173}

Generates the filelists:

0|vits_data/waves-16k/SSB0005/SSB00050001.wav|vits_data/temps/SSB0005/SSB00050001.spec.pt|vits_data/berts/SSB0005/SSB00050001.npy|sil g uang3 zh ou1 n v3 d a4 x ve2 sh eng1 d eng1 sh an1 sh iii1 l ian2 s ii4 t ian1 j ing3 f ang1 zh ao3 d ao4 ^ i2 s ii4 n v3 sh iii1 sil
0|vits_data/waves-16k/SSB0005/SSB00050002.wav|vits_data/temps/SSB0005/SSB00050002.spec.pt|vits_data/berts/SSB0005/SSB00050002.npy|sil zh uen1 zh ong4 k e1 x ve2 g uei1 l v4 d e5 ^ iao1 q iou2 sil
0|vits_data/waves-16k/SSB0005/SSB00050004.wav|vits_data/temps/SSB0005/SSB00050004.spec.pt|vits_data/berts/SSB0005/SSB00050004.npy|sil h ei1 k e4 x van1 b u4 zh iii3 ^ iao4 b o1 d a2 m ou3 ^ i2 g e4 d ian4 h ua4 sil
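A hedged sketch of reading one of these lines back into its five pipe-separated fields (speaker id, wav, spectrogram, BERT features, phones):

def parse_filelist_line(fline):
    sid, wav, spec, bert, phones = fline.strip().split("|")
    return int(sid), wav, spec, bert, phones.split()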

Data debugging

python prep_debug.py

Start training

cd monotonic_align

python setup.py build_ext --inplace

cd -

python train.py -c configs/bert_vits.json -m bert_vits

Download weights

AISHELL3_G.pth: https://github.com/PlayVoice/vits_chinese/releases/v4.0

Inference test

python vits_infer.py -c configs/bert_vits.json -m AISHELL3_G.pth -i 0

-i is the speaker index, ranging from 0 to 173.

AISHELL3 training utterances are all single short sentences, so the inference text must not contain punctuation.

To save training time, the AISHELL3 model is initialized from the AISHELL3 model open-sourced by the Xiaomi K2 community.

K2 open-source model (download): https://huggingface.co/jackyqs/vits-aishell3-175-chinese/tree/main

K2 online demo: https://huggingface.co/spaces/k2-fsa/text-to-speech


vits_chinese's Issues

Can I use this to train with 32 kHz audio data?

Thanks for your great work.
Can this project be trained with 32 kHz audio data? I changed the sampling rate in the config file from 16 kHz to 32 kHz; my own data is 32 kHz, and preprocessing completed. But when I start training, it shows the following error: RuntimeError: the expanded size of the tensor (68) must match the existing size (70) at non-singleton dimension 0. Target sizes: [68, 256]. Tensor sizes: [70, 256]

Generated wav is silent

Thanks for sharing this repo. I've met a problem here:
I followed your instructions without any modification, and after training for 120k steps the wav I generated with vit_string.py is entirely silent.
Is this normal, or did I do something wrong?

About model compression

Thanks for open-sourcing such a great project. Regarding model compression: how exactly is the model distillation done? Could you briefly explain?

About add_blank and use_sdp

Thanks for sharing. In VITS, add_blank and use_sdp both default to true. What exactly do these two parameters do to the model? I did not change them, and the pauses in the synthesized speech are sometimes strange; I wonder whether changing these two parameters would help. Below is my synthesized result for 『且我们对这些语言的掌握程度都达到比较高的水平时』:
40414717-1863-41bd-836e-503fdfb22afd.wav.zip

Is streaming output supported?

Does inference support real-time streaming output? Some applications, such as chat programs, need real-time streaming output; otherwise the wait for inference is too long and the interactive experience suffers.

Multi-speaker quality

Have you tried multi-speaker training with the prosody features added? My multi-speaker training results are not as good as single-speaker training; the single-speaker results are extremely lifelike.

Some questions about multi-speaker training

I added a gin_channels=256 field to the config file; there are now 15 speakers, and training has reached epoch 330. The synthesized speech is very fast, and a single sentence contains several speakers' voices. My questions:

1. Roughly how many epochs are needed before training stabilizes?
2. Is gin_channels=256 too large for only 15 speakers? Would a smaller value such as 32 or 64 converge more easily?
3. At this stage, is it normal for one sentence to contain several speakers' voices?
4. Every sentence is spoken very fast; after 330 epochs, is that abnormal?

The phoneme structure of each sentence is as follows (the same phoneme design trained a model with normal prosody in Tacotron 2, so I am fairly confident the phoneme design is fine):
p iy1 #1 eh1 l #1 ah0 #1 eh1 s #1 t iy1 #1 iy1 #1 aa1 r #1 iy1 #1 aa1 r #1 eh1 s #1 t iao2 #0 l i4 #1 d uei4 #1 ^ u1 #0 r an3 #0 ^ van2 #1 p u3 #0 ch a2 #1 d e5 #1 z u3 #0 zh iii1 #1 sh iii2 #0 sh iii1 #1 z uo4 #1 l e5 #1 n ei3 #0 x ie1 #1 g uei1 #0 d ing4 #3 fin

Let's have a chat when you have time!

Baker loss stays around 18; the voice is not as clean as the pretrained model's

Thanks for the training code and the pretrained model.
I ran the pipeline on the Baker data with batch_size set to 32 and have trained to 200k steps:
the loss basically sits around 18 and barely decreases any further;

bert_vits INFO loss_disc=2.473, loss_gen=2.393, loss_fm=5.935
bert_vits INFO loss_mel=19.083, loss_dur=0.133, loss_kl=0.908
bert_vits INFO loss_kl_r=1.710
bert_vits INFO Train Epoch: 659 [55%]
bert_vits INFO [200200, 9.205765022545685e-05]
bert_vits INFO loss_disc=2.600, loss_gen=2.050, loss_fm=5.560
bert_vits INFO loss_mel=18.343, loss_dur=0.120, loss_kl=0.851
bert_vits INFO loss_kl_r=1.387

Result link: https://pan.baidu.com/s/11_qTi-ubfLoGOjZu565ymQ extraction code: 1sg8
Apart from occasional pronunciation problems, the audio quality feels good. But compared with the pretrained model the voice is not as clean, with what sounds like too much high-frequency content. What causes this gap?

Training on a custom dataset

Have you tried a dataset in another language? Did the model work well? How much data (in total duration) and how many epochs do I need to produce good results?
Thanks

Problem: inference errors and wrong intonation

孩子,别吃了,这里的肉脏,走跟我去太平间
你好呀,这是一个美丽的早晨,我希望你能知道我有多爱你
愿你的努力得到回报,心中充满喜悦和满足。祝你拥有美好的一天!
愿你在新的一天里充满能量和活力,享受每一个美好的瞬间,迎接新的挑战和机遇。

These sentences trigger the following error:

vits_chinese/bert/ProsodyModel.py", line 60, in expand_for_phone
    assert char_embeds.size(0) == len(length)
AssertionError

The intonation of the first sentence is badly wrong.

Can roles be added (role 1, role 2, ...)? How should I modify the code?

Thanks for providing a model with standard Mandarin; most of what's online has a "Japanese colonel" accent!

1. Should vits_bert_model.pth go in the root directory? Running it gives an error:
put prosody_model.pt To ./bert/prosody_model.pt
put vits_bert_model.pth To ./vits_bert_model.pth
python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth

D:\DATA\Downloads\vits_chinese\vits_chinese-master>python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth
nothing of except: 'gbk' codec can't decode byte 0xac in position 20: illegal multibyte sequence

2. Can roles be added (role 1, role 2, ...)? How should I modify the code?

3. The current model is a female voice. If I want to add male-voice training, do I train on top of the existing model?

I'm a beginner; I've searched many platforms without finding complete learning materials and have been struggling for over a week without figuring it out. Looking forward to an answer, thank you!

Question about fine-tuning

Hi, my dataset has only about 1000 samples (1 hour 20 minutes). How do I fine-tune the existing model on it? Thanks!

'monotonic_align.monotonic_align'

Hi, running python setup.py build_ext --inplace gives:

Compiling core.pyx because it changed.
copying build/lib.linux-x86_64-cpython-37/monotonic_align/core.cpython-37m-x86_64-linux-gnu.so -> monotonic_align
error: could not create 'monotonic_align/core.cpython-37m-x86_64-linux-gnu.so': No such file or directory

What is the problem?

How to build my own dataset

I have roughly 2 hours of voice recordings and want to use them as my own training dataset.

It seems the wav files need a matching 000001-010000.txt file. How is this file made? For example:

000001 卡尔普#2陪外孙#1玩滑梯#4
ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1

What do #2, #1, and #4 mean? Is there software that can generate this txt file directly?

Thanks

Question about the phone seqs

Hello, can you explain the phone sequences in the filelists? For example:
./baker_waves/000016.wav|sil c ii3 #0 ^ uai4 #0 g uang2 #0 b en3 #0 sp ^ ie3 #0 j iang1 #0 ^ iou2 #0 sh ao4 #0 zh uang4 #0 p ai4 #0 zh ang2 #0 g uan3 #0 sil eos

What do ^ and #0 mean? And why don't you add #1, #2, #3, etc.?

RuntimeError: The expanded size of the tensor (50) must match the existing size (0) at non-singleton dimension 1. Target sizes: [192, 50]. Tensor sizes: [192, 0]

Hi. With a custom dataset, after running vits_prepare_4_custom_speaker.py and then train.py, the error shows inconsistent dimensions. How do I fix this?

The error is as follows:
Traceback (most recent call last):
File "train.py", line 436, in
main()
File "train.py", line 49, in main
hps,
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/vits_chinese/train.py", line 168, in run
[writer, writer_eval],
File "/root/vits_chinese/train.py", line 216, in train_and_evaluate
(z, z_p, z_r, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, bert, spec, spec_lengths)
File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/root/vits_chinese/models.py", line 544, in forward
z, y_lengths, self.segment_size
File "/root/vits_chinese/commons.py", line 65, in rand_slice_segments
ret = slice_segments(x, ids_str, segment_size)
File "/root/vits_chinese/commons.py", line 55, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (50) must match the existing size (0) at non-singleton dimension 1. Target sizes: [192, 50]. Tensor sizes: [192, 0]

How to get the prosody pretrained model?

Hi, how was the prosody.pt pretrained model that extracts prosody features with BERT obtained? Is there a related paper to refer to?

NameError: name 'm_frame' is not defined

Running the latest code produces this error:
File "vits_chinese\models.py", line 583, in forward
z_frame = m_frame + torch.randn_like(m_frame) * torch.exp(logs_frame)
NameError: name 'm_frame' is not defined

Hello! Has anyone run into this situation?

WARNING:female_base:/qdell3data/qwork/txw94/qdell3/TTS/vits-main is not a git repository, therefore hash value comparison will be ignored.
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
/qwork4/twu/miniconda/envs/tts/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
/qwork4/twu/miniconda/envs/tts/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "train.py", line 297, in
main()
File "train.py", line 57, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/qdell3data/qwork/txw94/qdell3/TTS/vits-main/train.py", line 126, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, None], None, None)
File "/qdell3data/qwork/txw94/qdell3/TTS/vits-main/train.py", line 151, in train_and_evaluate
(z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/qdell3data/qwork/txw94/qdell3/TTS/vits-main/models.py", line 467, in forward
z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/qdell3data/qwork/txw94/qdell3/TTS/vits-main/models.py", line 237, in forward
x = self.enc(x, x_mask, g=g)
File "/qwork4/twu/miniconda/envs/tts/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/qdell3data/qwork/txw94/qdell3/TTS/vits-main/modules.py", line 166, in forward
n_channels_tensor)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: __nv_nvrtc_builtin_header.h(78048): error: function "operator delete(void *, size_t)" has already been defined

__nv_nvrtc_builtin_header.h(78049): error: function "operator delete[](void *, size_t)" has already been defined

2 errors detected in the compilation of "default_program".

Some questions about kl_loss, kl_loss_r, and model behavior

Hello MaxMax2016,

Thank you for sharing your code for the improved VITS. I'd like to check with you about the model's behavior when the bi-directional KL divergence is added. In this repo I found that you use 1.0 * kl_loss + 1.0 * kl_loss_r during training. When I train on my own dataset, this greatly increases the mel-spectrogram loss compared with the original VITS, and when I add kl_loss_r to fine-tune a well-trained VITS model, it produces bad voice quality.
Can you share your experiences or findings from adding this specific term? In the loss curve you shared, the mel-spec loss still seems low (below 20), which is very interesting.

Thanks

About long silences in the audio

A question: how does everyone handle relatively long silences (200 ms or more) in the audio? Do you need to insert a placeholder symbol at the corresponding position in the phoneme sequence?
