ranchlai / mandarin-tts

Chinese Mandarin text-to-speech (TTS) based on FastSpeech 2, implemented in PyTorch, using WaveGlow as the vocoder, with the biaobei and aishell3 datasets

Python 100.00%

tts tts-chinese tts-hanzi tacotron pytorch fastspeech2 aishell3 multi-speaker

mandarin-tts's Introduction

Chinese mandarin text to speech (MTTS)

This is a modular text-to-speech framework that aims to support fast research and product development. Main features include:

all modules are configurable via yaml,
speaker embedding, prosody embedding, and multi-stream text embedding are supported and configurable,
various vocoders (VocGAN, hifi-GAN, waveglow, melGAN) are supported through adapters, so that different vocoders can be compared easily,
duration, pitch, and energy variance predictors are supported, and other variances can be added easily,
and more on the road-map.

Contributions are welcome.

Audio samples

Check out the demo here

Interesting audio samples for aishell3 added here.
The github page also hosts some samples for biaobei and aishell3 datasets.

Quick start

Install

git clone https://github.com/ranchlai/mandarin-tts.git
cd mandarin-tts
git submodule update --force --recursive --init --remote
pip install -e .

Training

Two examples are provided here: biaobei and aishell3.

To train your own model, first copy one of the existing examples, then prepare the mel-spectrogram features with wav2mel.py:

cd examples
python wav2mel.py -c ./aishell3/config.yaml -w <aishell3_wav_folder> -m <mel_folder> -d cpu

Then prepare the scp files needed for training:

cd examples/aishell3
python prepare.py --wav_folder <aishell3_wav_folder>  --mel_folder <mel_folder> --dst_folder ./train/

This generates the scp files required by config.yaml (in the dataset/train section). You should also check that everything in the config file is correct. Usually you don't need to change the code.

Now you can start training:
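Before starting a long training run, it can help to confirm that every file referenced by the generated scp files actually exists. The sketch below is only an illustration: it assumes a Kaldi-style layout in which each line is `<key> <path>`, and the helper name `missing_scp_entries` is hypothetical. Adjust the parsing to whatever prepare.py really writes.

```python
import os


def missing_scp_entries(scp_path):
    """Return (key, path) pairs whose path does not exist on disk.

    Assumes each non-empty line looks like '<key> <path>'. This layout is an
    assumption for illustration, not something confirmed by the project.
    Relative paths are checked against the current working directory.
    """
    missing = []
    with open(scp_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first space only, so paths may contain spaces.
            key, _, path = line.partition(" ")
            path = path.strip()
            if not os.path.exists(path):
                missing.append((key, path))
    return missing
```

Running it from the same folder you train from (for example examples/aishell3) matters, because relative paths in the scp file are resolved against the working directory.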

cd examples/aishell3
python ../../mtts/train.py -c config.yaml -d cuda

For the biaobei dataset, the workflow is the same, except that there is no speaker embedding, although you can add a prosody embedding.

More examples will be added. Please stay tuned.

Synthesize

Pretrained mtts checkpoints

Currently two examples are provided, and the corresponding checkpoints/configs are summarized as follows.

dataset	checkpoint	config
aishell3	link	link
biaobei	link	link

Supported vocoders

Vocoders convert mel-spectrograms to waveforms. They are added as submodules and will not be trained in this project, so you should download their checkpoints before synthesizing. Vocoders are not needed for training, as you can monitor the training process from the generated mel-spectrograms and the loss curve. Currently we support the following vocoders:

Vocoder	checkpoint	github
Waveglow	link	link
hifi-gan	link	link
VocGAN	link link	link
MelGAN	link	link

All vocoders will be ready after running git submodule update --force --recursive --init --remote. However, you have to download the checkpoint manually and properly set the path in the config.yaml file.

Preparing your input text

The input.txt should be consistent with your setting of emb_type1 to emb_type_n in config file, i.e., same type, same order.

To facilitate transcription of hanzi to pinyin, you can try:

cd examples/aishell3/
python ../../mtts/text/gp2py.py -t "为适应新的网络传播方式和读者阅读习惯"
>> sil wei4 shi4 ying4 xin1 de5 wang3 luo4 chuan2 bo1 fang1 shi4 he2 du2 zhe3 yue4 du2 xi2 guan4 sil|sil 为 适 应 新 的 网 络 传 播 方 式 和 读 者 阅 读 习 惯 sil

Now you can copy the text to input.txt. Remember to add a self-defined name and the speaker id, separated by '|'.
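As a rough illustration of the line layout, the sketch below joins a name, the `pinyin|hanzi` text printed by gp2py.py, and a speaker id. The field order `name|pinyin|hanzi|speaker_id` is inferred from the aishell3 log lines quoted in the issues further down, and the helper name `make_input_line` is hypothetical. Your own config may use a different set or order of fields.

```python
def make_input_line(name, gp2py_output, speaker_id):
    """Join a name, 'pinyin|hanzi' text and a speaker id with '|'.

    The field order is an assumption based on the aishell3 example; it must
    match the emb_type settings in your config file.
    """
    pinyin, hanzi = gp2py_output.strip().split("|")
    return "|".join([name, pinyin.strip(), hanzi.strip(), str(speaker_id)])


line = make_input_line("text1", "sil ni3 hao3 sil|sil 你 好 sil", 0)
print(line)  # text1|sil ni3 hao3 sil|sil 你 好 sil|0
```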

Synthesizing your waves

With the checkpoints and text ready, you can finally run the synthesis:

python ../../mtts/synthesize.py -d cuda -c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt

Please check the config.yaml file for the vocoder settings.

If everything went well, the synthesized audio can be found in the output folder.

mandarin-tts's Issues

AttributeError: module 'numpy' has no attribute 'complex'.

Cannot work with librosa 0.8.0 as specified by setup.py.

    from . import datasets, models, text
  File "/home/mandarin/mtts/models/__init__.py", line 1, in <module>
    from . import decoder, encoder, postnet, vocoder
  File "/home/mandarin/mtts/models/vocoder/__init__.py", line 1, in <module>
    from .hifi_gan import HiFiGAN
  File "/home/mandarin/mtts/models/vocoder/hifi_gan/__init__.py", line 1, in <module>
    from .hifi_gan import HiFiGAN
  File "/home/mandarin/mtts/models/vocoder/hifi_gan/hifi_gan.py", line 11, in <module>
    from .meldataset import MAX_WAV_VALUE
  File "/home/mandarin/mtts/models/vocoder/hifi_gan/meldataset.py", line 7, in <module>
    from librosa.util import normalize
  File "/home/mandarin/venv/lib/python3.8/site-packages/librosa/__init__.py", line 211, in <module>
    from . import core
  File "/home/mandarin/venv/lib/python3.8/site-packages/librosa/core/__init__.py", line 9, in <module>
    from .constantq import *  # pylint: disable=wildcard-import
  File "/home/mandarin/venv/lib/python3.8/site-packages/librosa/core/constantq.py", line 1059, in <module>
    dtype=np.complex,
  File "/home/mandarin/venv/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'complex'.
`np.complex` was a deprecated alias for the builtin `complex`. To avoid this error in existing code, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
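One workaround sometimes used for this kind of error, short of upgrading librosa or pinning NumPy below 1.24 (the release that removed these aliases), is to put the alias back before librosa is imported. The sketch below is not a fix from the maintainers. The helper name `ensure_alias` is hypothetical, and a stand-in object is used in place of NumPy so the example runs without it.

```python
import types


def ensure_alias(module, name, value):
    """Set module.<name> = value if it is missing; return True if it was added."""
    try:
        getattr(module, name)
        return False
    except AttributeError:
        setattr(module, name, value)
        return True


# Stand-in for a NumPy build that no longer exposes `complex`.
fake_np = types.SimpleNamespace()
added = ensure_alias(fake_np, "complex", complex)
print(added)  # True

# Intended use, before librosa is imported (assumes NumPy is installed):
#   import numpy as np
#   ensure_alias(np, "complex", complex)
#   import librosa
```

Other removed aliases, such as `np.float`, may need the same treatment if the old librosa code uses them.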

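The project has no license file; could one be added?

As the title says: I would like to use part of the source code in my own project, but I noticed that this project has no license file. Could you please add one? Thanks!

How can I access the hanzi model or the pinyin model on Baidu Netdisk?

The pretrained models on Baidu Netdisk are password-protected. Could you share the password?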

[CONTRIBUTION] Speech Dataset Generator

Hi everyone!

My name is David Martin Rius and I have just published this project on GitHub: https://github.com/davidmartinrius/speech-dataset-generator/

Now you can create datasets automatically with any audio or lists of audios.

I hope you find it useful.

Here are the key functionalities of the project:

Dataset Generation: Creation of multilingual datasets with Mean Opinion Score (MOS).
Silence Removal: It includes a feature to remove silences from audio files, enhancing the overall quality.
Sound Quality Improvement: It improves the quality of the audio when needed.
Audio Segmentation: It can segment audio files within specified second ranges.
Transcription: The project transcribes the segmented audio, providing a textual representation.
Gender Identification: It identifies the gender of each speaker in the audio.
Pyannote Embeddings: Utilizes pyannote embeddings for speaker detection across multiple audio files.
Automatic Speaker Naming: Automatically assigns names to speakers detected in multiple audios.
Multiple Speaker Detection: Capable of detecting multiple speakers within each audio file.
Store speaker embeddings: The speakers are detected and stored in a Chroma database, so you do not need to assign a speaker name.
Syllabic and words-per-minute metrics

Feel free to explore the project at https://github.com/davidmartinrius/speech-dataset-generator

David Martin Rius

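About the trained model

If I train on the biaobei dataset, will the synthesized speech sound like the voice in the biaobei dataset?

The example command line is wrong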

$ python ../../mtts/synthesize.py -d cuda --c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt
usage: synthesize.py [-h] [-i INPUT] [--duration DURATION] [--output_dir OUTPUT_DIR] --checkpoint CHECKPOINT
[-c CONFIG] [-d {cuda,cpu}]
synthesize.py: error: ambiguous option: --c could match --checkpoint, --config
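The error comes from argparse's prefix matching of long options: `--c` is a prefix of both `--checkpoint` and `--config`, so it cannot be resolved, while the short flag `-c` can. The sketch below rebuilds a parser from the usage message above to show this. The long names `--input` and `--device` and all default values are guesses, since the usage line only shows the short forms.

```python
import argparse


def build_parser():
    """Rebuild a parser with the options listed in the usage message."""
    parser = argparse.ArgumentParser(prog="synthesize.py")
    # "--input" and "--device" are guessed long names (assumptions).
    parser.add_argument("-i", "--input")
    parser.add_argument("--duration", type=float, default=1.0)
    parser.add_argument("--output_dir", default="./outputs/")
    parser.add_argument("--checkpoint", required=True)
    parser.add_argument("-c", "--config")
    parser.add_argument("-d", "--device", choices=["cuda", "cpu"], default="cuda")
    return parser


parser = build_parser()

# The short flag is unambiguous.
args = parser.parse_args(["-c", "config.yaml", "--checkpoint", "ckpt.pth.tar"])
print(args.config)  # config.yaml

# "--c" is a prefix of both --checkpoint and --config, so argparse exits.
try:
    parser.parse_args(["--c", "config.yaml", "--checkpoint", "ckpt.pth.tar"])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)  # True
```

Using `-c config.yaml` or the full `--config config.yaml` avoids the ambiguity.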

What does the number at the end of each line in input.txt mean? Could someone answer?

ckpt = torch.load(ckpt_file) raises an error

Line 200 of hubconf.py under waveglow/, ckpt = torch.load(ckpt_file), raises:
RuntimeError: unexpected EOF, expected 459824 more bytes. The file might be corrupted.
Is the ckpt file corrupted?
I am on Ubuntu 2020, CUDA version 11.3.

Possible typo

Not you can copy the text to

Not you can copy the text to input.txt, and remember to put down the self-defined name and speaker id, separated by '|'.

unable to pronounce some pinyins

hello, I was generating wavs from your project with your pretrained models and was generating some audios.
However, I was faced with the following error.

2021-10-21 20:35:45,624 synthesize.py: INFO: processing 413|sil e2 cuo2 shan1 sil|sil 峨 痤 山 sil|10
Traceback (most recent call last):
  File "../../mtts/synthesize.py", line 98, in <module>
    name, tokens = text_processor(line)
  File "/mnt/data1/jungwonchang/projects/mandarin-tts/mtts/text/text_processor.py", line 44, in __call__
    return self._process(input)
  File "/mnt/data1/jungwonchang/projects/mandarin-tts/mtts/text/text_processor.py", line 38, in _process
    tokens = tokenizer.tokenize(seg)
  File "/mnt/data1/jungwonchang/projects/mandarin-tts/mtts/datasets/dataset.py", line 69, in tokenize
    tokens = [self.v2i[t] for t in text.split()]
  File "/mnt/data1/jungwonchang/projects/mandarin-tts/mtts/datasets/dataset.py", line 69, in <listcomp>
    tokens = [self.v2i[t] for t in text.split()]
KeyError: 'cuo2'

after debugging, I found out there was some pinyin sequences that you guys did not offer.

(Pdb) 'cuo2' in self.v2i.keys()
False

think you guys might update some pinyin sequences!

btw, thanks for the great project!
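A pre-flight check along these lines can list every token missing from the vocabulary before synthesis starts, instead of stopping at the first KeyError. This is a minimal sketch: the helper name `find_unknown_tokens` is hypothetical and the vocabulary is a toy one. In the project the vocabulary would come from the tokenizer's `v2i` mapping shown in the traceback.

```python
def find_unknown_tokens(text, vocab):
    """Return whitespace-separated tokens not in vocab, in order, without repeats."""
    seen = set()
    unknown = []
    for tok in text.split():
        if tok not in vocab and tok not in seen:
            seen.add(tok)
            unknown.append(tok)
    return unknown


vocab = {"sil", "e2", "shan1"}  # toy vocabulary for illustration only
print(find_unknown_tokens("sil e2 cuo2 shan1 sil", vocab))  # ['cuo2']
```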

can you share your training logs and models?

Have to say it's a really good project, and I really want to run this training progress. But my pc is not good enough, which means I cant run this training.
So if you can share your training logs and generated models, I'll really appreciate.

Good Job！Learn a lot, THX.

AttributeError: module 'distutils' has no attribute 'version'. What does this mean?

Traceback (most recent call last):
  File "../../mtts/train.py", line 10, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/usr/local/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'

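Speech rate suddenly speeds up

In the generated samples, I can hear the speech rate suddenly speed up in places. Have any improvements been made since?

About using it on Android

Can it be used on Android?

The dataset used for training can no longer be downloaded. Could you share a link?

The biaobei speech synthesis dataset used in the README can no longer be downloaded. Could you provide a netdisk link or send it to my email?
My email: [email protected]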

The audio folder is missing

Hi Author,

During testing, I noticed that the audio folder is missing, so I took it from the original FastSpeech2 repo to get the demo working.
Could you upload that package?

Thanks,
John

Beginner needs the data for aishell3_wav_folder and mel_folder

Sorry if I am asking the repeated question as I am just a beginner.

I followed the readme until the wav2mel.py requires two folders where I believe there are training data from some other websites. I would appreciate if you could share the link for place where I can download the data, or you could email me the information via [email protected]

Thank you very much.

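Question about downloading the vocoder models

Hello author, thank you for generously sharing this. While reproducing the work I ran into a small problem: wav2mel.py uses some functions that in turn need the vocoder GAN models, which are not included in the provided files. Should I download the "checkpoint" or the "link" entries from the README?
ImportError: cannot import name 'HiFiGAN' from 'mtts.models.vocoder.hifi_gan' (unknown location)

How can English letters or words inside Chinese sentences be supported?

As in the title.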

Runtime Error: Error in loading state_dict for FastSpeech2

@ranchlai Thanks for sharing! Following the README step by step produces an error:

loading model from ./checkpoint/checkpoint_500000.pth
Traceback (most recent call last):
  File "synthesize.py", line 123, in <module>
    model = build_model().to(device)
  File "synthesize.py", line 50, in build_model
    model.load_state_dict(sd)
  File "/usr/local/lib/....../module.py", line 1224, in load_state_dict

RuntimeError: Error(s) in loading state_dict for FastSpeech2:

Missing key(s) in state_dict: "decoder.speaker_fc.weight"
size mismatch for encoder.position_enc: copying a param with shape torch.Size([1, 1001, 256]) from checkpoint, the shape in current model is torch.Size([1, 2001, 256]).
size mismatch for encoder.src_word_emb.weight: copying a param with shape torch.Size([1612, 256]) from checkpoint, the shape in current model is torch.Size([1915, 256]).
size mismatch for encoder.cn_word_emb.weight: copying a param with shape torch.Size([4135, 256]) from checkpoint, the shape in current model is torch.Size([4502, 256]).
size mismatch for decoder.position_enc: copying a param with shape torch.Size([1, 1001, 256]) from checkpoint, the shape in current model is torch.Size([1, 2001, 256]).
/content/

Thanks!

Cannot get the source code with git clone

error: invalid path 'docs/novel2/hz_0.9_500000_在这儿做啥呢,不能啥玩意儿都带到这里来?.wav'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
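Git typically reports `invalid path` on Windows when a file name in the repository contains a character that Windows does not allow, such as `?`. The issue does not say which operating system was used, so treat that as an assumption. The sketch below checks a repository path for a subset of the characters Windows reserves in file names; the helper name `invalid_windows_chars` is hypothetical.

```python
# Characters Windows does not allow in file names (a subset: backslash,
# slash and control characters are left out because "/" is the separator here).
WINDOWS_RESERVED_CHARS = set('<>:"|?*')


def invalid_windows_chars(repo_path):
    """Return the sorted reserved characters found in a '/'-separated repo path."""
    found = set()
    for part in repo_path.split("/"):
        found.update(ch for ch in part if ch in WINDOWS_RESERVED_CHARS)
    return sorted(found)


path = "docs/novel2/hz_0.9_500000_在这儿做啥呢,不能啥玩意儿都带到这里来?.wav"
print(invalid_windows_chars(path))  # ['?']
```

If that is the cause, cloning on Linux, macOS, or WSL sidesteps the problem.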

What's the RTF of fs2+wavglow?

Hi @ranchlai ,
Thanks for sharing!
What's the RTF of fs2+wavglow on CPU?
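For anyone who wants to measure this themselves: the real-time factor is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below only shows the arithmetic. The helper name `real_time_factor` is hypothetical, the 22050 Hz sample rate is an assumption, and the numbers are made up for illustration rather than measured on this project.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio must have a positive duration")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 0.5 s of compute for 2 s of audio at 22050 Hz.
print(real_time_factor(0.5, 44100, 22050))  # 0.25
```

In practice `synthesis_seconds` would be taken with `time.perf_counter()` around the acoustic model and vocoder calls together.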

FileNotFoundError: [Errno 2] No such file or directory: '/home/xgj/checkpoints/vctk_pretrained_model_3180.pt'

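Where does the file vctk_pretrained_model_3180.p come from?

Is there a Java version?

I would like to deploy it on an Android phone. Is there a Java version? I found some Java implementations, but their voice quality is not as good as yours. Thanks!

Problem when the input contains VIP. How can English be recognized? python ../../mtts/text/gp2py.py -t "请谢国俊到VIP八诊室就诊"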

2022-02-27 19:08:54,545 synthesize.py: INFO: processing text1|sil qing3 xie4 guo2 jun4 dao4 VIP5 ba1 zhen3 shi4 jiu4 zhen3 sil|sil 请谢国俊到 V I P 八诊室就诊 sil|0
Traceback (most recent call last):
  File "../../mtts/synthesize.py", line 93, in <module>
    name, tokens = text_processor(line)
  File "/home/xgj/mandarin-tts/mtts/text/text_processor.py", line 41, in __call__
    return self._process(input)
  File "/home/xgj/mandarin-tts/mtts/text/text_processor.py", line 35, in _process
    tokens = tokenizer.tokenize(seg)
  File "/home/xgj/mandarin-tts/mtts/datasets/dataset.py", line 61, in tokenize
    tokens = [self.v2i[t] for t in text.split()]
  File "/home/xgj/mandarin-tts/mtts/datasets/dataset.py", line 61, in <listcomp>
    tokens = [self.v2i[t] for t in text.split()]
KeyError: 'VIP5'

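Poor training results on my own data

When training on the aishell3 dataset, the loss drops quickly and the model produces fairly clear speech after about 2000 epochs. When training on speech I collected myself, convergence is slow and the output is not good.
The spectrograms of my own data are clean and free of noise, so I don't understand why the results are so much worse than with aishell. Any advice would be appreciated.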

Download Waveglow error

Traceback (most recent call last):
  File "/home/project/TTS/mandarin-tts/synthesize.py", line 126, in <module>
    waveglow = download_waveglow(device)
  File "/home/project/TTS/mandarin-tts/download_utils.py", line 164, in download_waveglow
    waveglow = torch.hub.load('./waveglow/DeepLearningExamples-torchhub/', 'nvidia_waveglow',source='local')
  File "/home/anaconda3/envs/torch1.6/lib/python3.8/site-packages/torch/hub.py", line 345, in load
    repo_dir = _get_cache_or_reload(github, force_reload, verbose)
  File "/home/anaconda3/envs/torch1.6/lib/python3.8/site-packages/torch/hub.py", line 121, in _get_cache_or_reload
    repo_owner, repo_name, branch = _parse_repo_info(github)
  File "/homeg/anaconda3/envs/torch1.6/lib/python3.8/site-packages/torch/hub.py", line 111, in _parse_repo_info
    repo_owner, repo_name = repo_info.split('/')
ValueError: too many values to unpack (expected 2)

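Could you provide model parameters for inference?

Hello, could you provide a set of trained model parameters? Training on my own keeps producing poor results.

Controlling speech speed

Thanks for sharing. How can I control the speed of the generated speech?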

Key Error when feeding bilingual text

Hi, thanks for your great work! I met the following problem:

python synthesize.py --input="大家好，Transformer是Google提出的一种神经网络架构"

Failed to open text file 大家好，Transformer是Google提出的一种神经网络架构
Treating input as text
Namespace(channel=2, duration=1.0, input='大家好，Transformer是Google提出的一种神经网络架构', output_dir='./outputs/')
loading model from ./checkpoint/checkpoint_500000.pth
loading waveglow...
processing 大家好，Transformer是Google提出的一种神经网络架构
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.609 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "synthesize.py", line 152, in <module>
    cn_sentence_seq,cn_sentence_aug = utils.convert_cn(cn_sentence)
  File "/content/mandarin-tts/utils/hz_utils.py", line 100, in convert_cn
    cn_array = np.array([[cn2idx[c] for c in cn_sentence]])
  File "/content/mandarin-tts/utils/hz_utils.py", line 100, in <listcomp>
    cn_array = np.array([[cn2idx[c] for c in cn_sentence]])
KeyError: 'T'
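The KeyError is raised for the first character that has no entry in the hanzi table, here the Latin letter `T`. A quick way to see which characters of a sentence would be affected is to list everything outside the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). This is only a diagnostic sketch with a hypothetical helper name, `non_hanzi_chars`. It does not add English support, and it also flags punctuation that the project may handle separately.

```python
def non_hanzi_chars(text):
    """Return characters outside the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return [ch for ch in text if not ("\u4e00" <= ch <= "\u9fff")]


sentence = "大家好，Transformer是Google"
print("".join(non_hanzi_chars(sentence)))  # ，TransformerGoogle
```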

About the hparams package

Traceback (most recent call last):
  File "synthesize.py", line 7, in <module>
    from fastspeech2 import FastSpeech2
  File "F:\project\mandarin-tts-master\fastspeech2.py", line 5, in <module>
    from transformer.Models import Encoder, Decoder
  File "F:\project\mandarin-tts-master\transformer\__init__.py", line 5, in <module>
    import transformer.Models
  File "F:\project\mandarin-tts-master\transformer\Models.py", line 33, in <module>
    class Encoder(nn.Module):
  File "F:\project\mandarin-tts-master\transformer\Models.py", line 39, in Encoder
    len_max_seq=hp.max_seq_len,
AttributeError: module 'hparams' has no attribute 'max_seq_len'

How can I solve this problem? I have tried several versions without success.

With the biaobei dataset, training fails with this error at step 9980. Where should I start making changes?

RuntimeError: cannot reshape tensor of 0 elements into shape [0, 39, -1] because the unspecified dimension size -1 can be any value and is ambiguous

unable to train biaobei

mandarin-tts/mtts/models/layers.py", line 91, in forward
    output = output.permute(1, 2, 0, 3).contiguous().view(sz_b, len_q, -1)  # b x lq x (n*dv)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, 8, -1] because the unspecified dimension size -1 can be any value and is ambiguous

File not found running ../../mtts/train.py -c config.yaml -d cuda

Here is my colab file, I am stuck at the

%run ../../mtts/train.py -c config.yaml -d cuda

It says "No such file or directory: '../../mel_folder/SSB10560130.npy'".

Please find my Colab file below.

https://colab.research.google.com/drive/13gXl3NQSz97Fl__9wmCQJulEODD7Xe5-?usp=sharing

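How is the duration in name_py_hz_dur.txt generated? The original dataset's annotation files don't contain it, and I would like to try training on other datasets.

Pitch and energy in Variance

Thank you. Your ideas and approach are a very useful reference, and judging by the mel-spectrograms, their resolution is quite high.
However, pitch and energy do not seem to be enabled in the code. Do you also find them hard to fit, and that forcing pitch and energy in causes intonation problems?
Do you have a good solution for this?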

Pinyin doesn't seem to align with hanzi when coming with erhua

Firstly, thanks for your great work.
just like the title says, for example:
001464|sil wo3 dou1 hui4 shuo1 er2 hua4 yin1 le5 sp1 zher4 ne5 sp1 mingr2 jian4 sp1 hai2 bu2 cuo4 ba5 sil
001464|sil 我都会说儿化音了 sp1 这儿 sp1 呢明 sp1 儿见还不 sil

zip:
[('sil', 'sil'), ('我', 'wo3'), ('都', 'dou1'), ('会', 'hui4'), ('说', 'shuo1'), ('儿', 'er2'), ('化', 'hua4'), ('音', 'yin1'), ('了', 'le5'), ('sp1', 'sp1'), ('这', 'zher4'), ('儿', 'ne5'), ('sp1', 'sp1'), ('呢', 'mingr2'), ('明', 'jian4'), ('sp1', 'sp1'), ('儿', 'hai2'), ('见', 'bu2'), ('还', 'cuo4'), ('不', 'ba5'), ('sil', 'sil')]

'zher4' should correspond to '这儿', and some hanzi are missing at the tail.
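One possible way to make the pairing erhua-aware is to let a syllable ending in `r` plus a tone digit (other than plain `er`) consume two hanzi instead of one. The sketch below is just that idea written out, with a hypothetical helper `align_with_erhua`. Whether the project's data preparation expects this grouping is not confirmed by the source.

```python
import re

# Erhua syllables such as "zher4" or "mingr2"; a plain "er2" is excluded.
ERHUA = re.compile(r"^(?!er[1-5]$)[a-z]+r[1-5]$")


def align_with_erhua(pinyin, hanzi):
    """Pair pinyin syllables with hanzi tokens; an erhua syllable takes two hanzi.

    Both arguments are lists of tokens. Raises ValueError when the hanzi run
    out early or some are left unpaired.
    """
    pairs = []
    i = 0
    for syl in pinyin:
        width = 2 if ERHUA.match(syl) else 1
        chunk = hanzi[i:i + width]
        if len(chunk) < width:
            raise ValueError(f"ran out of hanzi at {syl!r}")
        pairs.append(("".join(chunk), syl))
        i += width
    if i != len(hanzi):
        raise ValueError(f"{len(hanzi) - i} hanzi left unpaired")
    return pairs


pinyin = "sil zher4 ne5 sil".split()
hanzi = "sil 这 儿 呢 sil".split()
print(align_with_erhua(pinyin, hanzi))
# [('sil', 'sil'), ('这儿', 'zher4'), ('呢', 'ne5'), ('sil', 'sil')]
```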
do you think it matters?

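Some AISHELL3 files fail to produce mel-spectrograms

In the aishell3 data, some wav files yield amplitudes greater than 1 when loaded with librosa.
For example, the maximum amplitude of SSB08870032.wav is 1.0116,
which causes wav2mel.py to abort with an error.

The specific question:
In /mtts/utils/stft.py, lines 248 and 249:
why is the wav amplitude vector restricted to [-1, 1]?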
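A possible workaround on the data side, independent of what stft.py does, is to peak-normalize any waveform whose amplitude exceeds 1 before computing the mel-spectrogram. The sketch below uses plain Python lists so it runs without extra packages, and the helper name `peak_normalize` is hypothetical. With NumPy arrays the same arithmetic applies.

```python
def peak_normalize(samples, limit=1.0):
    """Scale samples so the largest absolute value is at most `limit`.

    Waveforms already within the limit are returned unchanged (as a copy).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= limit:
        return list(samples)
    # Divide first so that the peak sample maps exactly onto +/- limit.
    return [s / peak * limit for s in samples]


wave = [0.2, -1.0116, 0.5]  # toy waveform using the peak value from the report
normalized = peak_normalize(wave)
print(max(abs(s) for s in normalized))  # 1.0
```

Scaling keeps the shape of the waveform, whereas simply clipping values to [-1, 1] would distort the loudest samples.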