x-lance / unicats-ctx-txt2vec

[AAAI 2024] CTX-txt2vec, the acoustic model in UniCATS

Home Page: https://cpdu.github.io/unicats

Python 94.44% Shell 3.02% Perl 2.53%
acoustic-model speech-synthesis text-to-speech tts unicats vq-diffusion ctx-txt2vec

unicats-ctx-txt2vec's Issues

Which vq-wav2vec checkpoint was used for data preprocessing?

Hello @cantabile-kwok ! Thanks for this amazing project and congratulations on acceptance in AAAI.

I have a question. What vq-wav2vec checkpoint was used for tokenizing the speech data?

I'm reproducing the data preprocessing and find that some of the resulting labels for the same LibriTTS files do not match yours.

Thank you again for that project!

Hardware specs

Hi!

First of all, congrats on the impressive work and thank you for taking the time to make it available on GitHub.

I have read the paper and couldn't find any mention of the hardware setup used for training. Any information about that hardware and the expected training times would be very much appreciated.
I am currently trying a 4-GPU setup with 16 GB of VRAM per GPU, and training with the default configuration runs out of memory (OOM). So I was wondering what hardware configuration would be adequate.

Thanks!

About Training

Congratulations on acceptance in AAAI.

I have some questions about your training method.

  1. Other papers that use similar techniques (VALL-E, SpeechX) train on huge datasets such as LibriLight. Have you tested datasets larger than LibriTTS, or do you think this would make a significant difference to the model's zero-shot editing and continuation compared to training on LibriTTS alone?

  2. Could a pretrained model be fine-tuned in a shorter time on a dataset that groups a new, unseen speaker's samples with samples from a small number of seen speakers? Have you tried this?

  3. In the paper you trained txt2vec for 50 epochs on LibriTTS. Did you still see improvements after 50 epochs, or were the changes negligible?

Thanks again for this wonderful project.

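Some questions about UniCATS for Chinese

Hello! I have recently been looking into zero-shot synthesis and came across your project. After studying it I found the results to be very good; the structure of the vec2wav part in particular seems well worth learning from. I trained the txt2vec part on the data you provide and, combined with the pretrained vec2wav, also got very good results, so I would like to move to Chinese and try Chinese synthesis.
I took your pretrained vec2wav model and checked it in a conversion-like setup (extract features from the source audio, use part of the target speaker's audio as the reference, then generate the source content in the target speaker's timbre). It seems to handle Chinese pronunciation without major problems, apart from occasional stress issues. So I trained only the txt2vec model on the Chinese aishell2 data (about 2,000 speakers), but the audio synthesized from its predictions always has some tone problems, and the Chinese pronunciation is not very accurate.
Have you tried this model on Chinese synthesis? My guess is that the tone problems come from the English vq-wav2vec features, yet using the pretrained vec2wav for a conversion-like task does not seem to cause much trouble.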

Experimenting with number of diffusion steps

Hello, have you experimented with the number of diffusion steps?

I notice it is currently set to 100. If it could be lowered to something like 20, training and inference could be much faster. Has this been tested yet? Thanks!
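
For readers weighing the same question, here is a minimal sketch of what going from 100 steps to 20 can look like at sampling time: visiting an evenly strided subset of the timesteps. The helper name `strided_timesteps` is hypothetical, and nothing here is taken from this repository's sampler; whether the model tolerates skipped steps is exactly what the issue asks.

```python
def strided_timesteps(num_train_steps: int, num_infer_steps: int) -> list[int]:
    """Return num_infer_steps evenly spaced timesteps, highest first.

    Hypothetical helper for illustration only; it is not part of the
    unicats-ctx-txt2vec code base.
    """
    if not 1 <= num_infer_steps <= num_train_steps:
        raise ValueError("need 1 <= num_infer_steps <= num_train_steps")
    # Integer arithmetic keeps the indices distinct and avoids float rounding.
    ascending = [i * num_train_steps // num_infer_steps for i in range(num_infer_steps)]
    return ascending[::-1]


if __name__ == "__main__":
    steps = strided_timesteps(100, 20)
    print(len(steps), steps[:3], steps[-1])
```

With 100 training steps and 20 sampling steps the stride is 5, so the sampler would visit 95, 90, 85 and so on down to 0.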

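How to convert multiple groups of tokens into a single token

I read your other papers, which describe using tokens extracted by DAC to train UniCATS. But DAC produces 8 groups of tokens with 1024 entries each, whereas UniCATS takes a single group of tokens as input. How is this conversion done? Or is there a different, newer architecture?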
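
The difficulty behind this question can be made concrete with a little arithmetic. The sketch below merges per-group indices into one joint index with a mixed-radix encoding; it is an assumption made for illustration, not the authors' method, and `combine_groups` and `split_groups` are hypothetical names. It shows that 8 groups of 1024 entries give a joint vocabulary of 1024^8 = 2^80, far too large to treat as a single token set, while a small number of small groups stays manageable.

```python
def combine_groups(indices: list[int], codebook_size: int) -> int:
    """Merge one index per group into a single joint index (mixed radix)."""
    combined = 0
    for idx in indices:
        if not 0 <= idx < codebook_size:
            raise ValueError(f"index {idx} outside codebook of size {codebook_size}")
        combined = combined * codebook_size + idx
    return combined


def split_groups(combined: int, codebook_size: int, num_groups: int) -> list[int]:
    """Inverse of combine_groups: recover the per-group indices."""
    indices = []
    for _ in range(num_groups):
        combined, idx = divmod(combined, codebook_size)
        indices.append(idx)
    return indices[::-1]


if __name__ == "__main__":
    # Illustrative values: two groups of 320 entries each.
    joint = combine_groups([3, 7], codebook_size=320)
    print(joint, split_groups(joint, codebook_size=320, num_groups=2))
    # Joint vocabulary size for 8 groups of 1024 entries.
    print(1024 ** 8)
```

The round trip is lossless, so the encoding itself is not the obstacle; the size of the resulting vocabulary is.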

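Difference between feats.scp files

What is the difference between the features in

feats/normed_fbank/eval_all/feats.scp

and those in

$syn_dir/feats.scp?

How can I obtain the features corresponding to normed_fbank?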
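
As a first step toward comparing the two files, the sketch below parses Kaldi-style `.scp` text (one `utterance-id location` pair per line) and reports which utterance IDs each file covers. The sample contents and the helper names `parse_scp` and `compare_keys` are made up for illustration. Loading the feature matrices themselves would need a Kaldi-compatible reader, which is not shown.

```python
def parse_scp(text: str) -> dict[str, str]:
    """Map each utterance ID to its location string (e.g. 'file.ark:offset')."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        utt_id, location = line.split(maxsplit=1)
        entries[utt_id] = location
    return entries


def compare_keys(a: dict[str, str], b: dict[str, str]) -> dict[str, list[str]]:
    """Report utterance IDs that are shared or present in only one file."""
    return {
        "shared": sorted(a.keys() & b.keys()),
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
    }


if __name__ == "__main__":
    # Made-up sample contents; real files would be read from disk.
    fbank_scp = "utt1 /data/fbank.ark:12\nutt2 /data/fbank.ark:3456\n"
    syn_scp = "utt2 /exp/syn.ark:9\nutt3 /exp/syn.ark:777\n"
    print(compare_keys(parse_scp(fbank_scp), parse_scp(syn_scp)))
```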

Example usage

Thank you very much for the repository - do you have any usage examples for the different tasks such as continuation & editing? :-)
