kevinwang676 / bark-voice-cloning


Bark Voice Cloning and Voice Cloning for Chinese Speech

License: MIT License

Python 4.04% Dockerfile 0.03% Jupyter Notebook 95.93%

bark-voice-cloning's Introduction

Bark Voice Cloning 🐶 & Voice Cloning for Chinese Speech 🎶

简体中文 (Simplified Chinese)

1️⃣ Bark Voice Cloning

10/19/2023: Fixed ERROR: Exception in ASGI application by specifying gradio==3.33.0 and gradio_client==0.2.7 in requirements.txt.

11/08/2023: Integrated KNN-VC into OpenAI TTS and created an easy-to-use Gradio interface. Try it here.

02/27/2024: We are thrilled to launch our most powerful AI song cover generator ever with Shanghai Artificial Intelligence Laboratory! You only need to provide the name of a song, and our application, running on an A100 GPU, handles everything else. Check it out on our website (click "EN" in the first tab of the website to see the English version)! 💕

Based on bark-gui and bark. Thanks to C0untFloyd.

Quick start: Colab Notebook ⚡

HuggingFace Demo: Bark Voice Cloning 🤗 (Need a GPU)

Demo Video: YouTube Video

If you would like to run the code locally, remember to replace the original path /content/Bark-Voice-Cloning/bark/assets/prompts/file.npz with the path of file.npz on your own computer.
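One way to derive the local path is sketched below. It assumes the repository was cloned locally and keeps the same bark/assets/prompts layout as the Colab path; the helper name `local_prompt_path` is hypothetical and not part of the repository.

```python
from pathlib import Path, PurePosixPath

# Colab path quoted in the note above.
COLAB_PROMPT = "/content/Bark-Voice-Cloning/bark/assets/prompts/file.npz"
COLAB_ROOT = "/content/Bark-Voice-Cloning"


def local_prompt_path(repo_root: str) -> Path:
    """Map the Colab prompt path onto a local clone of the repository.

    Assumption: the local clone keeps the bark/assets/prompts layout.
    """
    # Keep only the part of the Colab path that lies inside the repository.
    relative = PurePosixPath(COLAB_PROMPT).relative_to(COLAB_ROOT)
    return Path(repo_root).joinpath(*relative.parts)


# Example: resolve the prompt file relative to the current directory.
print(local_prompt_path("."))
```

The returned path can then be substituted wherever the code refers to the Colab location.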

If you like the quick start, please star this repository. ⭐⭐⭐

Easy to use:

(1) First upload audio for voice cloning and click Create Voice.

(2) Choose the option called "file" in Voice if you'd like to use voice cloning.

(3) Click Generate. Done!

2️⃣ Voice Cloning for Chinese Speech

10/26/2023: Integrated labeling, training and inference into an easy-to-use user interface of SambertHifigan. Thanks to wujohns.

We want to point out that Bark is very good at generating English speech but relatively poor at generating Chinese speech. We therefore adopt another approach, SambertHifigan, to achieve voice cloning for Chinese speech. Please check out our Colab Notebook for the implementation.

Quick start: Colab Notebook ⚡

HuggingFace demo: Voice Cloning for Chinese Speech 🤗


bark-voice-cloning's Issues

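How do I train a mixed Chinese-English model?

Error when fine-tuning from PTTS-basemodel (SOS)

Fine-tuning from PTTS-basemodel raises an error during training; the code is the same as in the notebook.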

File ~/anaconda3/envs/modelscope_py38/lib/python3.8/site-packages/modelscope/trainers/audio/tts_trainer.py:242, in KanttsTrainer.train(self, *args, **kwargs)
232 dir_dict = {
233 'work_dir': self.work_dir,
234 'am_tmp_dir': self.am_tmp_dir,
235 'voc_tmp_dir': self.voc_tmp_dir,
236 'data_dir': self.data_dir
237 }
238 config_dict = {
239 'am_config': self.am_config_path,
240 'voc_config': self.voc_config_path
241 }
--> 242 self.model.train(self.speaker, dir_dict, self.train_type, config_dict,

size mismatch for mel_decoder.mel_dec.pnca.7.pnca_attn.w_h_kv.weight: copying a param with shape torch.Size([256, 640]) from checkpoint, the shape in current model is torch.Size([256, 320])
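The size mismatch above means that a tensor stored in the checkpoint has a different shape from the same-named tensor in the model being built. The following framework-free sketch lists such mismatches. It assumes you have already collected `{name: shape}` dictionaries from both sides (for example from the two state dicts); the function name `find_shape_mismatches` is hypothetical.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every shared
    parameter whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters missing from the model are a separate problem; skip them.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes copied from the error message above.
param = "mel_decoder.mel_dec.pnca.7.pnca_attn.w_h_kv.weight"
print(find_shape_mismatches({param: (256, 640)}, {param: (256, 320)}))
```

A mismatch like this often points to a model configuration that differs from the one the checkpoint was trained with, although the issue itself does not confirm the cause.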

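What should I do when tts-autolabel and ttsfrd are not supported on Windows?

Does the code currently not run on Windows?

Can the UI theme be stored locally?

Deploying on Windows 10

Has anyone deployed successfully on Windows 10?
My setup is Windows 10 + Python 3.10.
I am stuck after installing gradio: the app cannot be reached locally, and a thread raises an error.
Does gradio have a version requirement?

Local deployment problem

"If you would like to run the code locally, remember to replace the original path with the path on your own computer." /content/Bark-Voice-Cloning/bark/assets/prompts/file.npz
file.npz
However, I cannot find file.npz in that folder; there is only an announcer.npq. Where can I find file.npz?

How are the several ipynb files in the directory related?

For example, Sambert中文声音克隆v2 and Sambert_Voice_Cloning_in_One_Click.ipynb: which one is newer? Judging by upload time it seems to be the latter, but the name suggests that v2 is newer. And how does SambertHifigan.ipynb differ from the other two? Thank you.

Problems with Python 3.11

Python 3.11 produces all kinds of errors. fairseq does not support 3.11; someone on GitHub has updated fairseq to support Python 3.11, but I do not know how to update the package. hydra also raises errors. Only versions 3.9 to 3.10 seem to be well supported.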
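A small guard can make this kind of version problem visible before any imports fail. The sketch below assumes the 3.9 to 3.10 range reported in this issue, which is a user observation and not an official compatibility statement; the function name is hypothetical.

```python
import sys
from typing import Tuple

# Assumption: range taken from the report above, not from official docs.
REPORTED_WORKING = {(3, 9), (3, 10)}


def is_reported_working(version: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is in the range reported to work."""
    return tuple(version[:2]) in REPORTED_WORKING


if not is_reported_working(sys.version_info[:2]):
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is outside "
        "the range reported to work (3.9 to 3.10)."
    )
```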

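I hit a numpy version problem even though I have numpy==1.22.0. Please help.

Error in the code block where the uploaded audio filename is changed

I uploaded my own audio file and changed filename, but running the cell fails with the error below. Please take a look.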

Transcribing file 0: 'voicedata.WAV' to segments...

Error Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/whisper/audio.py in load_audio(file, sr)
45 out, _ = (
---> 46 ffmpeg.input(file, threads=0)
47 .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)

5 frames
Error: ffmpeg error (see stderr output for detail)

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/whisper/audio.py in load_audio(file, sr)
49 )
50 except ffmpeg.Error as e:
---> 51 raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
52
53 return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

RuntimeError: Failed to load audio: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 70.100 / 56. 70.100
libavcodec 58.134.100 / 58.134.100
libavformat 58. 76.100 / 58. 76.100
libavdevice 58. 13.100 / 58. 13.100
libavfilter 7.110.100 / 7.110.100
libswscale 5. 9.100 / 5. 9.100
libswresample 3. 9.100 / 3. 9.100
libpostproc 55. 9.100 / 55. 9.100
voicedata.WAV: No such file or directory
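The last line of the ffmpeg output says the file could not be found, so it is worth confirming the exact name and location of the upload before transcribing. On Linux (including Colab) file names are case-sensitive, so `voicedata.wav` and `voicedata.WAV` are different files. The sketch below is a hypothetical helper, not part of the notebook, that looks a name up while ignoring case.

```python
from pathlib import Path
from typing import Optional


def find_audio_file(directory: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` in `directory`, ignoring case.

    Returns None when no file with that name exists.
    """
    wanted = filename.lower()
    for entry in Path(directory).iterdir():
        if entry.is_file() and entry.name.lower() == wanted:
            return entry
    return None


# Example: check the working directory for the file named in the error.
match = find_audio_file(".", "voicedata.WAV")
print(match if match else "voicedata.WAV not found in the working directory")
```

If a match with different casing is returned, using that exact name as the filename should avoid the "No such file or directory" error.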

RecursionError: maximum recursion depth exceeded in comparison

Can you please tell me how to solve this problem?

jsonable_encoder(
File "/usr/local/lib/python3.10/dist-packages/fastapi/encoders.py", line 226, in jsonable_encoder
if isinstance(obj, classes_tuple):
File "/usr/lib/python3.10/abc.py", line 119, in __instancecheck__
return _abc_instancecheck(cls, instance)
RecursionError: maximum recursion depth exceeded in comparison

No module named 'fa'

/usr/local/lib/python3.10/dist-packages/tts_autolabel/fa_utils.py in
2 import sys
----> 3 import fa
4 import json

ModuleNotFoundError: No module named 'fa'

runtime error: HuggingFace demo: Voice Cloning for Chinese Speech

r9y9/pysptk#95

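[Help wanted] Error when running python app.py

Environment: Windows 10, Anaconda, python=3.10, cu117. I have already changed /content/Bark-Voice-Cloning/bark/assets/prompts/file in app.py to a local directory.
Running python app.py in this environment gives the following error: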
Preloading Models

Loading text model from ./models\text_2.pt to cuda
Loading coarse model from ./models\coarse_2.pt to cuda
Loading fine model from ./models\fine_2.pt to cuda
Launching Bark Voice Cloning UI Server
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Loading Hubert ./models/hubert/hubert.pt
Traceback (most recent call last):
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\gradio\routes.py", line 488, in run_predict
output = await app.get_blocks().process_api(
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\gradio\blocks.py", line 1431, in process_api
result = await self.call_function(
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\gradio\blocks.py", line 1103, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\anyio\_backends\_asyncio.py", line 2106, in run_sync_in_worker_thread
return await future
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\anyio\_backends\_asyncio.py", line 833, in run
result = context.run(func, *args)
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "E:\AI\Bark-Voice-Cloning\cloning\clonevoice.py", line 31, in clone_voice
hubert_model = CustomHubert(checkpoint_path='./models/hubert/hubert.pt').to(device)
File "E:\AI\Bark-Voice-Cloning\bark\hubert\pre_kmeans_hubert.py", line 61, in __init__
checkpoint = torch.load(checkpoint_path)
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\torch\serialization.py", line 797, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "C:\AIr\anaconda3\envs\Bark-Voice-Cloning\lib\site-packages\torch\serialization.py", line 283, in __init__
super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
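This error is commonly seen when a checkpoint file is truncated, for example after an interrupted download, although the issue does not confirm that cause here. Recent PyTorch versions save checkpoints as ZIP archives, so the standard library can give a quick sanity check. The helper name below is hypothetical, and the check returns False for older, non-ZIP checkpoint formats as well.

```python
import os
import zipfile


def looks_like_complete_zip(path: str) -> bool:
    """Return True if `path` has a readable ZIP end-of-central-directory record."""
    return zipfile.is_zipfile(path)


# Path copied from the log above; adjust it to your own layout.
checkpoint = "./models/hubert/hubert.pt"
if os.path.exists(checkpoint) and not looks_like_complete_zip(checkpoint):
    size = os.path.getsize(checkpoint)
    print(f"{checkpoint} ({size} bytes) is not a complete ZIP archive; "
          "consider deleting it and downloading it again.")
```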

Running in Jupyter Notebook with all dependencies installed, I get this error.

Preprocessing before training hangs; the last line of output reads text_preprocess: text file lines must be equals to 142

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.5")
This step hangs for a long time without finishing. If I force-stop it and start training, I get:
FileNotFoundError: [Errno 2] No such file or directory: '/content/output_training_data/prosody/prosody.txt'
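Training depends on files that the labeling step writes, so checking for them first gives a clearer message than the FileNotFoundError above. The sketch below is hypothetical: only prosody/prosody.txt is confirmed by the error message, and any other expected outputs would have to be added by hand.

```python
import os
from typing import List

# Assumption: only this file is confirmed by the error message above.
EXPECTED_OUTPUTS = ["prosody/prosody.txt"]


def missing_label_outputs(work_dir: str) -> List[str]:
    """Return the expected auto-label outputs that are absent from work_dir."""
    return [
        rel for rel in EXPECTED_OUTPUTS
        if not os.path.isfile(os.path.join(work_dir, rel))
    ]


# Work directory copied from the error message above.
missing = missing_label_outputs("/content/output_training_data")
if missing:
    print("Labeling did not finish; missing:", missing)
```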

Cannot get the program to run locally

System: Windows 10, Python 3.10.10, miniconda
bark-gui runs fine locally.
Running Create Voice throws an exception: FileNotFoundError: [Errno 2] No such file or directory: '/content/Bark-Voice-Cloning/bark/assets/prompts/file.npz'
Where can I download this file.npz?

Thanks

Model size

Why is the model file I generated over 1 GB, while the bundled example models are only a few dozen MB? How can I reduce the size?

kantts cannot be installed during training

RuntimeError: Failed to import modelscope.models.audio.tts.sambert_hifi because of the following error (look up to see its traceback):
No module named 'kantts'
Environment: colab.research.google.com

Colab crashes when running run_auto_label

pre-break recording in paragraph by vad.
Generate phone interval by fa align.
prosody_dir=/content/output_training_data/paragraph/prosody
FA processing...
--- New folder /content/output_training_data/raw_ali... ---
--- OK ---
--- New folder /content/output_training_data/raw_interval... ---

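Every time execution reaches the point above, the session crashes. There was no OOM before the crash. The last notebook log entry before the crash is: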
what(): ReadConfigFile /disk1/jiaqi.sjq/gongbo.gb/se/asr/decoder/src/core/util/parse-options.cpp 512 Input config stream is broken

How to increase quality?

Hey, @KevinWang676, great work on this repo.

If I want to increase the quality what's the best way to go about that?

I imagine the number of steps that are used during both training and inference must be stored in some variable somewhere. Can you point me to it?

Or maybe there's another obvious solution?

am configuration not found

2023-09-04 15:36:16,560 - modelscope - INFO - mvn_path=/home/gpuserver/python/Personal-TTS-v2/pretrain_work_dir/orig_model/mvn.npy
Traceback (most recent call last):
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/gradio/routes.py", line 523, in run_predict
output = await app.get_blocks().process_api(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/gradio/blocks.py", line 1437, in process_api
result = await self.call_function(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/gradio/blocks.py", line 1109, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/gradio/utils.py", line 865, in wrapper
response = f(*args, **kwargs)
File "/home/gpuserver/python/Personal-TTS-v2/main.py", line 145, in infer
model_id = SambertHifigan(os.path.join(model_dir, "orig_model"), **kwargs)
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/modelscope/models/audio/tts/sambert_hifi.py", line 53, in __init__
self.voices, self.voice_cfg, self.lang_type = self.load_voice(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/modelscope/models/audio/tts/sambert_hifi.py", line 107, in load_voice
return self.build_voice_from_custom(model_dir, custom_ckpt)
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/modelscope/models/audio/tts/sambert_hifi.py", line 90, in build_voice_from_custom
voice = Voice(
File "/home/gpuserver/.conda/envs/sbhfgan/lib/python3.9/site-packages/modelscope/models/audio/tts/voice.py", line 158, in __init__
raise TtsModelConfigurationException(
modelscope.utils.audio.tts_exceptions.TtsModelConfigurationException: modelscope error: am configuration not found

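Should I upload Chinese or English speech for voice cloning?

python app.py does not start; it shows the message in the screenshot below. Thanks for sharing this project.

The AI娜娜 and AI小杰 options on HuggingFace are not pure model voices, are they?

Could the HuggingFace interface add a pure-model-voice button for generating audio? The results from AI娜娜 and AI小杰 differ from the pure model audio I got earlier in Colab.
It would also be good to add extra options such as speed and pitch control.

auto label error

Trying to run Voice_Cloning_for_Chinese_Speech.ipynb locally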

input_wav = "./test_wavs/"
output_data = "./output_training_data/"

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.5")

2023-07-10 10:25:48,985 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-07-10:10:25:48, INFO [api.py:486] Use user-specified model revision: v1.0.5
2023-07-10:10:25:56, INFO [auto_label.py:91] ---  New folder /home/mai/Bark-Voice-Cloning/output_training_data/paragraph/prosody...  ---
2023-07-10:10:25:56, INFO [auto_label.py:92] ---  OK  ---
2023-07-10:10:25:56, INFO [auto_label.py:91] ---  New folder /home/mai/Bark-Voice-Cloning/output_training_data/sp_interval...  ---
2023-07-10:10:25:56, INFO [auto_label.py:92] ---  OK  ---
2023-07-10:10:25:56, INFO [auto_label.py:91] ---  New folder /home/mai/Bark-Voice-Cloning/output_training_data/wav...  ---
2023-07-10:10:25:56, INFO [auto_label.py:92] ---  OK  ---
2023-07-10:10:25:56, INFO [auto_label.py:91] ---  New folder /home/mai/Bark-Voice-Cloning/output_training_data/log...  ---
2023-07-10:10:25:56, INFO [auto_label.py:92] ---  OK  ---
2023-07-10:10:25:56, INFO [auto_label.py:301] 2023-07-10 10:25:56
2023-07-10:10:25:56, INFO [auto_label.py:355] wav_preprocess start...
  0%|          | 0/1 [00:00<?, ?it/s]sox WARN rate: rate clipped 1 samples; decrease volume?
sox WARN dither: dither clipped 1 samples; decrease volume?
100%|██████████| 1/1 [00:00<00:00, 118.89it/s]
2023-07-10:10:25:56, INFO [auto_label.py:367] wav cut by vad start...
100%|██████████| 1/1 [00:00<00:00,  5.07it/s]
100%|██████████| 1/1 [00:00<00:00, 17.83it/s]
2023-07-10:10:26:00, INFO [audio2prosody.py:52] Text to label start...
festival_initialize() called more than once
100%|██████████| 1/1 [00:00<00:00,  6.40it/s]
2023-07-10:10:26:01, INFO [auto_label.py:773] pre-break recording in paragraph by vad.
2023-07-10:10:26:01, INFO [auto_label.py:786] Generate phone interval by asr align.
2023-07-10:10:26:01, INFO [auto_label.py:91] ---  New folder /home/mai/Bark-Voice-Cloning/output_training_data/align...  ---
2023-07-10:10:26:01, INFO [auto_label.py:92] ---  OK  ---
2023-07-10:10:26:01, INFO [auto_label.py:460] prosody_dir=/home/mai/Bark-Voice-Cloning/output_training_data/paragraph/prosody
2023-07-10:10:26:01, INFO [asr_align.py:190] job_num=1 process_num=4 fbank_config=/home/mai/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf, data_dir=/home/mai/Bark-Voice-Cloning/output_training_data/align/gen/data, fbank_dir=/home/mai/Bark-Voice-Cloning/output_training_data/align/gen/fbank
2023-07-10:10:26:01, INFO [make_fbank.py:48] run make_fbank with num=1 config_path=/home/mai/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf
2023-07-10:10:26:01, INFO [make_fbank.py:49] data_path=/home/mai/Bark-Voice-Cloning/output_training_data/align/gen/data fbank_path=/home/mai/Bark-Voice-Cloning/output_training_data/align/gen/fbank
2023-07-10:10:26:01, INFO [make_fbank.py:62] [{'id': 'test_0_0', 'wav': '/home/mai/Bark-Voice-Cloning/output_training_data/wav_cut/test_0_0.wav'}]
run_asr_align step 2
speak_script=/home/mai/Bark-Voice-Cloning/output_training_data/align/script.txt
  0%|          | 0/1 [00:00<?, ?it/s]
2023-07-10:10:26:01, INFO [make_fbank.py:77] DONE compute fbank and copy feats

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[11], line 4
      1 input_wav = "./test_wavs/"
      2 output_data = "./output_training_data/"
----> 4 ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.5")

File ~/.local/lib/python3.9/site-packages/modelscope/tools/speech_tts_autolabel.py:78, in run_auto_label(input_wav, work_dir, para_ids, resource_model_id, resource_revision, gender, stage, process_num, develop_mode, has_para, enable_enh)
     64 model_resource = _download_and_unzip_resource(resource_model_id,
     65                                               resource_revision)
     66 auto_labeling = AutoLabeling(
     67     os.path.abspath(input_wav),
     68     model_resource,
   (...)
     76     process_num,
     77     enable_enh=enable_enh)
---> 78 ret_code, report = auto_labeling.run()
     79 return ret_code, report

File ~/.local/lib/python3.9/site-packages/tts_autolabel/auto_label.py:787, in AutoLabeling.run(self)
    785 # generate phone interval by asr align.
    786 logging.info("Generate phone interval by asr align.")
--> 787 self.asr_align()
    789 # align interval leading and trailing silence with wav.
    790 self.trim_sil_wav_interval()

File ~/.local/lib/python3.9/site-packages/tts_autolabel/auto_label.py:482, in AutoLabeling.asr_align(self)
    480     run_asr_align(self.resource_dir, align_output, script_file, self.out_wav_peh_dir, job_num=self.break_job_num, process_num=self.process_num)
    481 else:
--> 482     run_asr_align(self.resource_dir, align_output, script_file, self.cut_wav_dir, job_num = self.align_job_num, process_num=self.process_num)
    483 # fbank feats files could be used by vad.
    484 self.asr_align_gen_feats_file = os.path.join(align_output, "data/feats.scp")

File ~/.local/lib/python3.9/site-packages/tts_autolabel/asr_align.py:522, in run_asr_align(resource_root, working_dir, speak_script, wave_dir, step, job_num, process_num)
    520 lm_dir = resource_root + '/lang'
    521 am_dir = resource_root + '/fsmn_16k_2'
--> 522 process(job_num, process_num, lm_dir, am_dir, working_dir, speak_script, wave_dir, engine_test_dir, engine_data_dir, sy2ph_map, step)

File ~/.local/lib/python3.9/site-packages/tts_autolabel/asr_align.py:482, in process(job_num, process_num, lm_dir, am_dir, working_dir, speak_script, wave_dir, engine_test_dir, engine_data_dir, sy2ph_map, step)
    480 if not os.path.exists(fbank_dir):
    481     os.makedirs(fbank_dir)
--> 482 generate_fbank(job_num, process_num, data_dir, fbank_config, fbank_dir)
    484 if step >= ASR_ALIGN_STEP_ALIGN:
    485     #################### step6 ####################
    486     align_dir = os.path.join(working_dir, 'align')

File ~/.local/lib/python3.9/site-packages/tts_autolabel/asr_align.py:191, in generate_fbank(job_num, process_num, data_dir, fbank_config, fbank_dir)
    189 def generate_fbank(job_num, process_num, data_dir, fbank_config, fbank_dir):
    190     logging.info(f'job_num={job_num} process_num={process_num} fbank_config={fbank_config}, data_dir={data_dir}, fbank_dir={fbank_dir}')
--> 191     do_make_fbank(job_num, process_num, fbank_config, data_dir, fbank_dir)

File ~/.local/lib/python3.9/site-packages/tts_autolabel/make_fbank.py:82, in do_make_fbank(num, process_num, config_path, data_path, fbank_path)
     80         id = fbank_list[i]['id']
     81         output_scp = os.path.join(fbank_path, f'raw_fbank_data.{id}.scp')
---> 82         with open(output_scp, 'r') as f2:
     83             feats_scp_f.write(f2.readline())
     84 logging.info('DONE!')

FileNotFoundError: [Errno 2] No such file or directory: '/home/mai/Bark-Voice-Cloning/output_training_data/align/gen/fbank/raw_fbank_data.test_0_0.scp'

Alibaba Cloud notebook: the second training run fails with KanttsTrainer: [Errno 17] File exists: './pretrain_work_dir/tmp_am'
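Errno 17 means that a directory the trainer tries to create is already there, most likely left over from the first run. One possible workaround is to remove the leftover directory before training again. The issue does not say whether the trainer needs anything inside it, so this is an assumption; back up anything you want to keep first. The helper name and its default directory list are hypothetical; only tmp_am is named in the error.

```python
import os
import shutil
from typing import Iterable, List


def clear_stale_dirs(work_dir: str, names: Iterable[str] = ("tmp_am",)) -> List[str]:
    """Delete leftover sub-directories of work_dir and return those removed."""
    removed = []
    for name in names:
        path = os.path.join(work_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed


# Usage (destructive, so it is left commented out):
# clear_stale_dirs("./pretrain_work_dir")
```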

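Full of bugs

I ran autolabel on Colab.

First it could not find the temp.wav file. Digging through the source, I found no code that ever saves temp.wav (only a string with that name).

After fixing that, cut_wav finally completed and the prosody output appeared, but then it failed again with 'can not find the file of sox'. A real headache.

Because some of autolabel's dependencies (nls-fa) can only be installed on Linux, for now I can only dig through the source on Colab.

Do you have any suggestions?

Is there a duration limit for TTS or VC?

Bark itself limits generated speech to 13-14 s. Does Bark-Voice-Cloning have a duration limit?

Chinese results are not ideal

The English results are very good, but the Chinese pronunciation sounds like a foreigner speaking.

Local Install

Could you provide a requirements.txt for running Sambert中文声音克隆v2.ipynb on macOS? Installing by following the Colab steps always fails.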

module 'fa' has no attribute 'AlsFaPyImpl'

What is fa?

Load pinyin_en_mix_dict failed

When running the Personal-TTS-v2 code, clicking Start Training fails with Load pinyin_en_mix_dict failed, but the same steps work on Colab.

IndexError: index 0 is out of bounds for axis 0 with size 0

When I run the code you provided in Colab and reach

input_wav = "./test_wavs/"
output_data = "./output_training_data/"

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.7")

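I know this is an out-of-bounds array error, but it is in your code. I went to fix it and found nothing but formulas, so I do not know where to start or whether a change would affect the next step. I would appreciate some guidance. The error is at line 101 of /usr/local/lib/python3.10/dist-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py: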

     99         T_lfr = int(np.ceil(T / lfr_n))
    100         left_padding = np.tile(inputs[0], ((lfr_m - 1) // 2, 1))
--> 101         inputs = np.vstack((left_padding, inputs))
    102         T = T + (lfr_m - 1) // 2
    103         for i in range(T_lfr):

IndexError: index 0 is out of bounds for axis 0 with size 0

Load pinyin_en_mix_dict failed

2023-09-04 14:55:36,078 - modelscope - INFO - mvn_path=./pretrain_work_dir/orig_model/mvn.npy
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
festival_initialize() called more than once

Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I have a GPU and CUDA 12.1, but the code still raises this error. The same error also happens in the HuggingFace demo.
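The error message itself suggests passing `map_location` to `torch.load`. The sketch below shows that suggestion with hypothetical helper names; it does not explain why `torch.cuda.is_available()` returns False on a machine that has a GPU, which is a separate question about the installed PyTorch build and drivers.

```python
def choose_map_location(cuda_available: bool) -> str:
    """Pick a torch.load map_location string for the current machine."""
    return "cuda" if cuda_available else "cpu"


def load_checkpoint(path: str):
    """Load a checkpoint, remapping CUDA tensors to the CPU when needed."""
    import torch  # imported lazily so the helper above works without PyTorch

    location = choose_map_location(torch.cuda.is_available())
    return torch.load(path, map_location=location)
```

Printing `torch.cuda.is_available()` and `torch.version.cuda` in the same environment may also help tell a CPU-only PyTorch install apart from a driver problem.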

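Problem running on Colab

I ran Sambert_Voice_Cloning_in_One_Click.ipynb on Colab and opened the generated link. I first uploaded an m4a recording of myself reading the text, then entered the corresponding text in the text field. I clicked "1. Label data" and then "2. Start training", but got an error message. Did I do something wrong?

Does the voice support [clears throat], [laughs], and the like?

I tried it on HuggingFace and the voice is excellent, especially some of the trailing sounds; it sounds great.
The introduction says "Based on [bark-gui]". Does it then also have the same features as bark-gui, such as "uh —", [clears throat], and [laughs], to make the voice more believable?

I'm having this issue on Ubuntu with a 3090 GPU: [Errno 2] No such file or directory: '/content/Bark-Voice-Cloning/bark/assets/prompts/file.npz'

Can anyone help?

The demo service is down

Chinese text entered on the TTS page is spoken as English that has nothing to do with the input

I trained a voice on the CLONE VOICE page and an NPZ file was generated.
On the TTS page I typed some Chinese into INPUT TEXT, but the output speech was English: "today...".
If I change the voice type from file to none, it reads the Chinese text as entered, but the pronunciation sounds like a foreigner speaking Chinese, with an accent like 川普 (Trump).
Also, is there anywhere in the project to upload an NPZ file, so that I do not need to retrain every time and can just upload the NPZ and enter the new text to speak?
Thanks

There is no terminal in the bottom-right corner of Colab

What should I do about this?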

Please share your dataset, and make an entry on mylo's repo

Greetings,

Mylo, the author of the quantizer repo (https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/), is also collecting the datasets for the languages on which people have trained Bark's HuBERT quantizers, as you can see here:

https://github.com/gitmylo/Voice-cloning-quantizers/

Can you please make a Pull Request in this repo and add the Chinese model and dataset you trained on?

This is very important, as there are plans to merge all datasets in the future and train a single multi-language voice cloning model. With Chinese, for example, we could use the multi-language tokenizer to make an English voice speak Chinese and vice versa, or make a Spanish voice speak Chinese, and so on.

Create Voice reports: en_tokenizer.pth not found. Where can I download it?

IndexError: index 0 is out of bounds for axis 0 with size 0

While using my own training voice data, I got the above error at the line below:

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.7")

More details:

46%|████▌ | 139/305 [00:02<00:02, 60.32it/s]

IndexError Traceback (most recent call last)
Cell In[8], line 4
1 input_wav = "./test_wavs/"
2 output_data = "./output_training_data/"
----> 4 ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.7")

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/modelscope/tools/speech_tts_autolabel.py:78, in run_auto_label(input_wav, work_dir, para_ids, resource_model_id, resource_revision, gender, stage, process_num, develop_mode, has_para, enable_enh)
64 model_resource = _download_and_unzip_resource(resource_model_id,
65 resource_revision)
66 auto_labeling = AutoLabeling(
67 os.path.abspath(input_wav),
68 model_resource,
(...)
76 process_num,
77 enable_enh=enable_enh)
---> 78 ret_code, report = auto_labeling.run()
79 return ret_code, report

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/auto_label.py:853, in AutoLabeling.run(self)
851 if self.enable_vad:
852 logging.info("[VAD] chunk recordings for training.")
--> 853 self.wav_cut_by_vad(self.resample_wav_dir, self.cut_wav_dir)
854 else:
855 self.cut_wav_dir = self.resample_wav_dir

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/auto_label.py:437, in AutoLabeling.wav_cut_by_vad(self, input_wav_dir, output_wav_dir)
435 shutil.rmtree(output_wav_dir)
436 os.makedirs(output_wav_dir, exist_ok=True)
--> 437 vad_cut(input_wav_dir, output_wav_dir, self.resource_dir)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audiocut/vad.py:367, in vad_cut(input_wav_dir, output_wav_dir, resource_dir, superhigh_cut_threshold, high_cut_threshold, low_cut_threshold, max_dur_threshold, min_dur_threshold)
365 audio_files = glob.glob(os.path.join(input_wav_dir, "*.wav"))
366 for audio_file in tqdm(audio_files):
--> 367 vad_level_S(
368 vad_pipeline_superhigh,
369 audio_file,
370 output_wav_dir,
371 output_wav_dirs["S"],
372 max_samples_threshold,
373 min_samples_threshold,
374 )
376 audio_files = glob.glob(os.path.join(output_wav_dirs["S"], "*.wav"))
377 # if len(audio_files) <= 0:
378 # return

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audiocut/vad.py:45, in vad_level_S(vad_pipeline, audio_file, output_wav_dir, tmp_wav_dir, max_samples_threshold, min_samples_threshold)
42 scale_factor = sample_rate / 16000
44 wavid_origin = os.path.basename(audio_file)[:-4]
---> 45 segments_result_origin = vad_pipeline(audio_in=waveform_16k)
46 segments_text_origin = segments_result_origin[0]
47 if len(segments_text_origin) == 0:

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py:102, in Fsmn_vad.__call__(self, audio_in, **kwargs)
100 end_idx = min(waveform_nums, beg_idx + self.batch_size)
101 waveform = waveform_list[beg_idx:end_idx]
--> 102 feats, feats_len = self.extract_feat(waveform)
103 waveform = np.array(waveform)
104 param_dict = kwargs.get("param_dict", dict())

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py:176, in Fsmn_vad.extract_feat(self, waveform_list)
174 for waveform in waveform_list:
175 speech, _ = self.frontend.fbank(waveform)
--> 176 feat, feat_len = self.frontend.lfr_cmvn(speech)
177 feats.append(feat)
178 feats_len.append(feat_len)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py:87, in WavFrontend.lfr_cmvn(self, feat)
85 def lfr_cmvn(self, feat: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
86 if self.lfr_m != 1 or self.lfr_n != 1:
---> 87 feat = self.apply_lfr(feat, self.lfr_m, self.lfr_n)
89 if self.cmvn_file:
90 feat = self.apply_cmvn(feat)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py:101, in WavFrontend.apply_lfr(inputs, lfr_m, lfr_n)
99 T = inputs.shape[0]
100 T_lfr = int(np.ceil(T / lfr_n))
--> 101 left_padding = np.tile(inputs[0], ((lfr_m - 1) // 2, 1))
102 inputs = np.vstack((left_padding, inputs))
103 T = T + (lfr_m - 1) // 2

IndexError: index 0 is out of bounds for axis 0 with size 0
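
The `IndexError` above is raised when `apply_lfr` indexes `inputs[0]` on a feature array with zero rows. One plausible cause (an assumption, not something the traceback confirms) is an input recording that is empty or too short to yield a single feature frame. A minimal sketch for checking the input directory before running auto-labeling follows; the helper name `find_short_wavs` and the `min_seconds` threshold are hypothetical and not part of `tts_autolabel`. It uses only the standard-library `wave` module, which reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def find_short_wavs(wav_dir, min_seconds=1.0):
    """Return (filename, duration_seconds) for WAV files shorter than min_seconds.

    Hypothetical pre-check helper; the threshold is an illustrative value,
    not a limit documented by tts_autolabel.
    """
    short = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            frames = wf.getnframes()
            rate = wf.getframerate()
        # Guard against a zero sample rate in a malformed header.
        duration = frames / rate if rate else 0.0
        if duration < min_seconds:
            short.append((path.name, duration))
    return short
```

Calling `find_short_wavs("path/to/input_wav_dir")` and getting a non-empty list would point to files worth re-recording or removing before retrying. Note that the `*.wav` pattern is case-sensitive on Linux, so files ending in `.WAV` are not matched by this sketch.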

The auto-label step completed without any problems:

2023-07-20 02:11:33,456 - modelscope - INFO - Use user-specified model revision: v1.0.5
---  New folder /content/output_training_data/paragraph/prosody...  ---
---  OK  ---
---  New folder /content/output_training_data/sp_interval...  ---
---  OK  ---
---  New folder /content/output_training_data/wav...  ---
---  OK  ---
--- Remove /content/output_training_data/log folder!  ---
---  New folder /content/output_training_data/log...  ---
---  OK  ---
2023-07-20 02:12:03
wav_preprocess start...
---  new folder...  ---
---  OK  ---
100%|██████████| 1/1 [00:00<00:00, 23.94it/s]wav cut by vad start...

100%|██████████| 1/1 [00:00<00:00,  2.59it/s]
100%|██████████| 1/1 [00:00<00:00,  2.00it/s]
Text to label start...
100%|██████████| 1/1 [00:00<00:00,  1.91it/s]
pre-break recording in paragraph by vad.
Generate phone interval by asr align.
---  New folder /content/output_training_data/align...  ---
---  OK  ---
prosody_dir=/content/output_training_data/paragraph/prosody
run_asr_align step 2
speak_script=/content/output_training_data/align/script.txt
job_num=1 process_num=4 fbank_config=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf, data_dir=/content/output_training_data/align/gen/data, fbank_dir=/content/output_training_data/align/gen/fbank
run make_fbank with num=1 config_path=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2/fbank.conf
data_path=/content/output_training_data/align/gen/data fbank_path=/content/output_training_data/align/gen/fbank
[{'id': 'test_0_0', 'wav': '/content/output_training_data/wav_cut/test_0_0.wav'}]
100%|██████████| 1/1 [00:01<00:00,  1.90s/it]DONE compute fbank and copy feats
DONE!
job_num=1 process_num=4 data_dir=/content/output_training_data/align/gen/data lm_dir=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/lang am_dir=/root/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/fsmn_16k_2, fbank_dir=/content/output_training_data/align/gen/fbank, align_dir=/content/output_training_data/align/gen/align
[{'id': 'test_0_0', 'ark': '/content/output_training_data/align/gen/fbank/raw_fbank_data.test_0_0.ark', 'scp': '/content/output_training_data/align/gen/fbank/raw_fbank_data.test_0_0.scp'}]

Feature preprocessing start...
100%|██████████| 1/1 [00:05<00:00,  5.20s/it]Waveform aligning start...

100%|██████████| 1/1 [00:01<00:00,  1.62s/it]do_align done!
---  new folder...  ---
---  OK  ---
test_0_0.ali
Trim silence wav with align info and modify wav files....

100%|██████████| 1/1 [00:00<00:00, 80.69it/s]Convert align info to interval files....
---  There is this folder!  ---
test_0_0.ali
Modify sil to sp in interval....
modify interval er phone.
--- Remove /content/output_training_data/interval folder!  ---
---  New folder /content/output_training_data/interval...  ---
---  OK  ---
qualification review.
prosody sillence detect.
--- Remove /content/output_training_data/prosody folder!  ---
---  New folder /content/output_training_data/prosody...  ---
---  OK  ---

average silence duration: 0.3249999999999996
100%|██████████| 2/2 [00:00<00:00, 3506.94it/s]Write prosody file
0 "mismatch" sentences

Auto labeling info: stage 1 | develop mode 0 | gender:female | score 10.000000 | retcode 0
labeling report:
stage 1 | develop mode 0 | gender female | score 10.000000 | retcode 0
qulification report:
credit score: 10.000000
qualified score: 3.000000
normalized snr: 35.000000
abandon utt snr threshold: 10.000000
snr score ration: 0.500000
interval score ration: 0.500000
data qulificaion report:

Training, however, failed with an error:

2023-07-20 02:13:16,273 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:17,519 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:18,124 - modelscope - INFO - Set workdir to ./pretrain_work_dir/
2023-07-20 02:13:18,171 - modelscope - INFO - load ./output_training_data/
2023-07-20 02:13:18,561 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:37,195 - modelscope - INFO - am_config=./pretrain_work_dir/orig_model/basemodel_16k/sambert/config.yaml voc_config=./pretrain_work_dir/orig_model/basemodel_16k/hifigan/config.yaml
2023-07-20 02:13:37,197 - modelscope - INFO - audio_config=./pretrain_work_dir/orig_model/basemodel_16k/audio_config_se_16k.yaml
2023-07-20 02:13:37,198 - modelscope - INFO - am_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/sambert/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,200 - modelscope - INFO - voc_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,203 - modelscope - INFO - se_path=./pretrain_work_dir/orig_model/se.npy se_model_path=./pretrain_work_dir/orig_model/basemodel_16k/speaker_embedding/se.onnx
2023-07-20 02:13:37,204 - modelscope - INFO - mvn_path=./pretrain_work_dir/orig_model/mvn.npy
100%|██████████| 2/2 [00:00<00:00, 2823.50it/s]TextScriptConvertor.process:
Save script to: ./pretrain_work_dir/data/Script.xml
TextScriptConvertor.process:
Save metafile to: ./pretrain_work_dir/data/raw_metafile.txt
[AudioProcessor] Initialize AudioProcessor.
[AudioProcessor] config params:
[AudioProcessor] wav_normalize: True
[AudioProcessor] trim_silence: True
[AudioProcessor] trim_silence_threshold_db: 60
[AudioProcessor] preemphasize: False
[AudioProcessor] sampling_rate: 16000
[AudioProcessor] hop_length: 200
[AudioProcessor] win_length: 1000
[AudioProcessor] n_fft: 2048
[AudioProcessor] n_mels: 80
[AudioProcessor] fmin: 0.0
[AudioProcessor] fmax: 8000.0
[AudioProcessor] phone_level_feature: True
[AudioProcessor] se_feature: True
[AudioProcessor] norm_type: mean_std
[AudioProcessor] max_norm: 1.0
[AudioProcessor] symmetric: False
[AudioProcessor] min_level_db: -100.0
[AudioProcessor] ref_level_db: 20
[AudioProcessor] num_workers: 16
[AudioProcessor] Amplitude normalization started
Volume statistic proceeding...

100%|██████████| 1/1 [00:00<00:00,  1.70it/s]
Average amplitude RMS : 0.126146
Volume statistic done.
Volume normalization proceeding...
100%|██████████| 1/1 [00:00<00:00, 530.12it/s]Volume normalization done.
[AudioProcessor] Amplitude normalization finished
[AudioProcessor] Duration generation started

  0%|          | 0/1 [00:00<?, ?it/s][AudioProcessor] Duration align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00,  1.14s/it]
[AudioProcessor] Duration generate finished
[AudioProcessor] Trim silence with interval started
[AudioProcessor] Start to load pcm from ./pretrain_work_dir/data/wav
100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
  0%|          | 0/1 [00:01<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 815.70it/s][AudioProcessor] Trim silence finished
[AudioProcessor] Melspec extraction started

100%|██████████| 1/1 [00:01<00:00,  1.57s/it]
[AudioProcessor] Melspec extraction finished
Melspec statistic proceeding...
100%|██████████| 1/1 [00:00<00:00, 3236.35it/s]
100%|██████████| 1/1 [00:00<00:00, 363.39it/s]Melspec statistic done
[AudioProcessor] melspec mean and std saved to:
./pretrain_work_dir/data/mel/mel_mean.txt,
./pretrain_work_dir/data/mel/mel_std.txt
[AudioProcessor] Melspec mean std norm is proceeding...
[AudioProcessor] Melspec normalization finished
[AudioProcessor] Normed Melspec saved to ./pretrain_work_dir/data/mel
[AudioProcessor] Pitch extraction started

  0%|          | 0/1 [00:00<?, ?it/s][AudioProcessor] Pitch align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00,  1.69s/it]
[AudioProcessor] Pitch normalization is proceeding...
100%|██████████| 1/1 [00:00<00:00, 4128.25it/s]
100%|██████████| 1/1 [00:00<00:00, 3721.65it/s][AudioProcessor] f0 mean and std saved to:
./pretrain_work_dir/data/f0/f0_mean.txt,
./pretrain_work_dir/data/f0/f0_std.txt
[AudioProcessor] Pitch mean std norm is proceeding...
[AudioProcessor] Pitch turn to phone-level is proceeding...

100%|██████████| 1/1 [00:01<00:00,  1.55s/it]
[AudioProcessor] Pitch normalization finished
[AudioProcessor] Normed f0 saved to ./pretrain_work_dir/data/f0
[AudioProcessor] Pitch extraction finished
[AudioProcessor] Energy extraction started
100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
100%|██████████| 1/1 [00:00<00:00, 252.64it/s]
100%|██████████| 1/1 [00:00<00:00, 3682.44it/s][AudioProcessor] energy mean and std saved to:
./pretrain_work_dir/data/energy/energy_mean.txt,
./pretrain_work_dir/data/energy/energy_std.txt
[AudioProcessor] Energy mean std norm is proceeding...

100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
[AudioProcessor] Energy normalization finished
[AudioProcessor] Normed Energy saved to ./pretrain_work_dir/data/energy
[AudioProcessor] Energy extraction finished
[AudioProcessor] All features extracted successfully!
Processing audio done.
[SpeakerEmbeddingProcessor] Speaker embedding extractor started
[SpeakerEmbeddingProcessor] se model loading error!!!
[SpeakerEmbeddingProcessor] please update your se model to ensure that the version is greater than or equal to 1.0.5
[SpeakerEmbeddingProcessor] try load it as se.model
[SpeakerEmbeddingProcessor] Speaker embedding extracted successfully!
Processing speaker embedding done.
Processing done.
Voc metafile generated.
AM metafile generated.
2023-07-20 02:14:06,035 - modelscope - INFO - Start training....
2023-07-20 02:14:06,040 - modelscope - INFO - Start SAMBERT training...
2023-07-20 02:14:06,042 - modelscope - INFO - TRAIN SAMBERT....
2023-07-20 02:14:06,059 - modelscope - INFO - TRAINING steps: 2400202
2023-07-20 02:14:06,069 - modelscope - INFO - audio_config = {'fmax': 8000.0, 'fmin': 0.0, 'hop_length': 200, 'max_norm': 1.0, 'min_level_db': -100.0, 'n_fft': 2048, 'n_mels': 80, 'norm_type': 'mean_std', 'num_workers': 16, 'phone_level_feature': True, 'preemphasize': False, 'ref_level_db': 20, 'sampling_rate': 16000, 'symmetric': False, 'trim_silence': True, 'trim_silence_threshold_db': 60, 'wav_normalize': True, 'win_length': 1000}
2023-07-20 02:14:06,070 - modelscope - INFO - Loss = {'MelReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}, 'ProsodyReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}}
2023-07-20 02:14:06,072 - modelscope - INFO - Model = {'KanTtsSAMBERT': {'optimizer': {'params': {'betas': [0.9, 0.98], 'eps': 1e-09, 'lr': 0.001, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'MAS': False, 'NSF': True, 'SE': True, 'decoder_attention_dropout': 0.1, 'decoder_dropout': 0.1, 'decoder_ffn_inner_dim': 1024, 'decoder_num_heads': 8, 'decoder_num_layers': 12, 'decoder_num_units': 128, 'decoder_prenet_units': [256, 256], 'decoder_relu_dropout': 0.1, 'dur_pred_lstm_units': 128, 'dur_pred_prenet_units': [128, 128], 'embedding_dim': 512, 'emotion_units': 32, 'encoder_attention_dropout': 0.1, 'encoder_dropout': 0.1, 'encoder_ffn_inner_dim': 1024, 'encoder_num_heads': 8, 'encoder_num_layers': 8, 'encoder_num_units': 128, 'encoder_projection_units': 32, 'encoder_relu_dropout': 0.1, 'max_len': 800, 'nsf_f0_global_maximum': 730.0, 'nsf_f0_global_minimum': 30.0, 'nsf_norm_type': 'global', 'num_mels': 82, 'outputs_per_step': 3, 'postnet_dropout': 0.1, 'postnet_ffn_inner_dim': 512, 'postnet_filter_size': 41, 'postnet_fsmn_num_layers': 4, 'postnet_lstm_units': 128, 'postnet_num_memory_units': 256, 'postnet_shift': 17, 'predictor_dropout': 0.1, 'predictor_ffn_inner_dim': 256, 'predictor_filter_size': 41, 'predictor_fsmn_num_layers': 3, 'predictor_lstm_units': 128, 'predictor_num_memory_units': 128, 'predictor_shift': 0, 'speaker_units': 192}, 'scheduler': {'params': {'warmup_steps': 4000}, 'type': 'NoamLR'}}}
2023-07-20 02:14:06,074 - modelscope - INFO - allow_cache = False
2023-07-20 02:14:06,084 - modelscope - INFO - batch_size = 32
2023-07-20 02:14:06,085 - modelscope - INFO - create_time = 2023-07-20 02:14:06
2023-07-20 02:14:06,087 - modelscope - INFO - eval_interval_steps = 10000000000000000
2023-07-20 02:14:06,090 - modelscope - INFO - git_revision_hash = d16755444c9baf23348213211a5ed9035458ecf0
2023-07-20 02:14:06,093 - modelscope - INFO - grad_norm = 1.0
2023-07-20 02:14:06,096 - modelscope - INFO - linguistic_unit = {'cleaners': 'english_cleaners', 'lfeat_type_list': 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category', 'speaker_list': 'F7'}
2023-07-20 02:14:06,098 - modelscope - INFO - log_interval_steps = 50
2023-07-20 02:14:06,099 - modelscope - INFO - model_type = sambert
2023-07-20 02:14:06,100 - modelscope - INFO - num_save_intermediate_results = 4
2023-07-20 02:14:06,101 - modelscope - INFO - num_workers = 4
2023-07-20 02:14:06,102 - modelscope - INFO - pin_memory = False
2023-07-20 02:14:06,105 - modelscope - INFO - remove_short_samples = False
2023-07-20 02:14:06,111 - modelscope - INFO - save_interval_steps = 200
2023-07-20 02:14:06,113 - modelscope - INFO - train_max_steps = 2400202
2023-07-20 02:14:06,115 - modelscope - INFO - train_steps = 202
2023-07-20 02:14:06,119 - modelscope - INFO - log_interval = 10
2023-07-20 02:14:06,121 - modelscope - INFO - modelscope_version = 1.7.1
Loading metafile...
0it [00:00, ?it/s]Loading metafile...

100%|██████████| 1/1 [00:00<00:00, 9198.04it/s]
2023-07-20 02:14:06,139 - modelscope - INFO - The number of training files = 0.
2023-07-20 02:14:06,141 - modelscope - INFO - The number of validation files = 1.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-0089498a7012> in <cell line: 33>()
     31                         default_args=kwargs)
     32 
---> 33 trainer.train()
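
The log above reports `The number of training files = 0` and `The number of validation files = 1`, which suggests the single labeled utterance ended up in the validation set and left nothing to train on. A minimal sketch for checking the utterance count before calling `trainer.train()` follows. It assumes the metafile (for example `./pretrain_work_dir/data/raw_metafile.txt` from the log) holds one utterance per non-empty line, which is not confirmed here; the helper names and the `min_utterances` value are hypothetical.

```python
from pathlib import Path


def count_utterances(metafile_path):
    """Count non-empty lines in a metafile (assumed: one utterance per line)."""
    text = Path(metafile_path).read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if line.strip())


def check_enough_data(metafile_path, min_utterances=2):
    """Return (ok, n): whether the metafile has at least min_utterances entries.

    min_utterances is an illustrative threshold, not a documented requirement.
    """
    n = count_utterances(metafile_path)
    return n >= min_utterances, n
```

If the check returns `(False, 1)`, providing more or longer recordings so that auto-labeling produces several segments is a reasonable thing to try before retraining.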

bark-voice-cloning's Issues

Transcribing file 0: 'voicedata.WAV' to segments...

46%|████▌ | 139/305 [00:02<00:02, 60.32it/s]
