x-lance / unicats-ctx-txt2vec

Stars: 57 · Watchers: 7 · Forks: 8 · Size: 1008 KB

[AAAI 2024] CTX-txt2vec, the acoustic model in UniCATS

Home Page: https://cpdu.github.io/unicats

Python 94.44% Shell 3.02% Perl 2.53%

acoustic-model speech-synthesis text-to-speech tts unicats vq-diffusion ctx-txt2vec

unicats-ctx-txt2vec's People

Forkers

ishine andyweiqiu qoboty fl0under hongwen-sun saber5433 cyysky2 pan310

unicats-ctx-txt2vec's Issues

Which vq-wav2vec checkpoint was used for data preprocessing?

Hello @cantabile-kwok! Thanks for this amazing project and congratulations on the acceptance at AAAI.

I have a question. Which vq-wav2vec checkpoint was used for tokenizing the speech data?

I'm reproducing the data preprocessing and find that some of the resulting labels for the same LibriTTS files do not match.

Thank you again for this project!
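
A quick way to quantify the mismatch described above is to compare the two label sets frame by frame. The sketch below assumes both label sets have already been loaded into dictionaries that map utterance IDs to integer label sequences; the function name and the data layout are hypothetical and not part of this repository.

```python
from typing import Dict, List


def label_mismatch_report(
    ref: Dict[str, List[int]], hyp: Dict[str, List[int]]
) -> Dict[str, float]:
    """Return, per shared utterance, the fraction of frames whose labels differ.

    Frames that exist in only one of the two sequences (length mismatch)
    are counted as differing.
    """
    report = {}
    for utt_id in sorted(ref.keys() & hyp.keys()):
        a, b = ref[utt_id], hyp[utt_id]
        differing = sum(x != y for x, y in zip(a, b))
        differing += abs(len(a) - len(b))
        report[utt_id] = differing / max(len(a), len(b), 1)
    return report


if __name__ == "__main__":
    # Toy data standing in for released labels vs. re-extracted labels.
    released = {"utt1": [1, 2, 3, 4], "utt2": [5, 5]}
    reproduced = {"utt1": [1, 2, 9, 4], "utt2": [5, 5]}
    for utt_id, rate in label_mismatch_report(released, reproduced).items():
        print(f"{utt_id}: {rate:.2%} of frames differ")
```

A mismatch rate near zero across all files would point to small numerical differences, while a high rate would be more consistent with a different checkpoint.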

Hardware specs

Hi!

First of all, congrats on the impressive work and thank you for taking the time to make it available on GitHub.

I have read the paper and couldn't find any mention of the hardware setup used for training. Any information about the hardware and the expected training times would be very much appreciated.
I am currently trying a 4-GPU setup with 16 GB of VRAM per node, and training with the default configuration runs out of memory (OOM). So I was wondering what hardware configuration would be adequate.

Thanks!

About Training

Congratulations on the acceptance at AAAI.

I have some questions about your training method.

Other papers that use similar techniques have trained their models on huge datasets such as LibriLight (VALL-E, SpeechX). Have you tested with datasets bigger than LibriTTS, or do you think this would significantly affect the model's zero-shot performance on editing and continuation compared to training on LibriTTS alone?
Perhaps a pretrained model could be fine-tuned in a shorter time on a dataset that groups a new, unseen speaker's samples with those of a small number of seen speakers? Have you tried this?
In the paper you trained txt2vec for 50 epochs on LibriTTS. Did you still see improvements after 50 epochs, or were the changes negligible?

Thanks again for this wonderful project.

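Some questions about UniCATS for Chinese

Hello! I have recently been looking into zero-shot synthesis and came across your project. After studying it, I found the results to be very good; the structure of the vec2wav part in particular seems a valuable reference. I trained the txt2vec part on the data you provided and, combined with the pretrained vec2wav, also got very good results, so I would like to carry this over to Chinese and try Chinese synthesis.
I used your pretrained vec2wav model and checked it in a conversion-like setup (extract features from the source audio, use part of the target speaker's audio as the reference, then generate the source content in the target speaker's timbre). Chinese pronunciation seemed largely fine, with only occasional stress problems. So I trained only the txt2vec part on the Chinese AISHELL-2 data (about 2,000 speakers), but the audio synthesized from its predictions always has some tone problems, and the Chinese pronunciation is not very accurate.
Have you tried this model on Chinese synthesis? My guess is that the tone problems come from the English vq-wav2vec features, yet using the pretrained vec2wav for the conversion-like task does not seem to cause much trouble.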


Experimenting with number of diffusion steps

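How was the duration information obtained?

With only audio and no corresponding text, is it possible to do something like speech continuation based on the audio's timbre and prosodic style?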

How to convert multiple groups of tokens into a single token
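
For readers with the same question, one generic way to fold several group indices into a single token ID is a mixed-radix encoding, sketched below. This is an illustration under stated assumptions, not necessarily the mapping this repository uses: the helper names are hypothetical, and the 2 groups of 320 codes in the usage example are assumed values chosen for illustration.

```python
from typing import Sequence, Tuple


def merge_groups(indices: Sequence[int], codebook_size: int) -> int:
    """Fold per-group code indices into one integer (mixed-radix encoding)."""
    token = 0
    for idx in indices:
        if not 0 <= idx < codebook_size:
            raise ValueError(f"index {idx} outside [0, {codebook_size})")
        token = token * codebook_size + idx
    return token


def split_token(token: int, num_groups: int, codebook_size: int) -> Tuple[int, ...]:
    """Invert merge_groups: recover the per-group indices from one integer."""
    indices = []
    for _ in range(num_groups):
        token, idx = divmod(token, codebook_size)
        indices.append(idx)
    return tuple(reversed(indices))


if __name__ == "__main__":
    # Assumed setup for illustration: 2 groups, 320 codes per group.
    merged = merge_groups((3, 7), codebook_size=320)
    print(merged, split_token(merged, num_groups=2, codebook_size=320))
```

The encoding is lossless but yields a sparse ID space when only some index combinations occur in the data; remapping the observed combinations to consecutive IDs is a possible alternative.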

Differences in feats.scp

After training for 400k steps on the data from the README, the output is all noise. What might be the problem?

Example usage
