x-lance / unicats-ctx-vec2wav



[AAAI 2024] Code for CTX-vec2wav in UniCATS

Home Page: https://cpdu.github.io/unicats/

Shell 4.29% Python 93.34% Perl 2.37%

speech-synthesis vocoder unicats vocoding semantic-token self-supervised-speech


unicats-ctx-vec2wav's Issues

Possible collaboration on CTXtxt2vec

Hi @cantabile-kwok, I’ve been chipping away at an unofficial implementation of the UniCATS paper. Since the second part is out and it sounds like you’re working on the txt2vec portion, would it be possible to collaborate on this? My unofficial repo contains some very basic dataset pre-processing and the different configs for establishing the contexts for each utterance.

Please do let me know. Thank you!

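On normalizing the prompt mel-spectrogram

Thanks for open-sourcing this! Could you share the cmvn.ark file? 🙏🏻🙏🏻🙏🏻
At the moment I use the un-normalized mel-spectrogram directly as the prompt. The pronunciation is very clear, but the timbre is not very similar to the prompt speaker, so I would like to see the result after normalization. 🙏🏻🙏🏻🙏🏻
I would also like to confirm the mel-spectrogram parameters: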

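import librosa
import torch
# Assumption: logmelspectrogram is ESPnet's helper; the issue does not say where it comes from.
from espnet.transform.spectrogram import logmelspectrogram

# Load the prompt audio, resampled to 16 kHz.
prompt_wav, sr = librosa.load(prompt_src_wav_file, sr=16000)
# 80-bin log-mel features with a 160-sample (10 ms) hop.
prompt = logmelspectrogram(
    x=prompt_wav.T,
    fs=16000,
    n_mels=80,
    n_fft=1024,
    n_shift=160,
    win_length=465,
    window="hann",
    fmin=80,
    fmax=7600).squeeze()[None, :, :]  # add a batch dimension: (1, frames, 80)
prompt = torch.FloatTensor(prompt)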

If I load the audio like this and then normalize the features, will they be compatible with the model?
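The normalization step asked about above can be sketched as per-bin mean and variance normalization (CMVN). This is only an illustration under stated assumptions: the statistics here are computed from synthetic stand-in features, and the helper names `compute_cmvn_stats` and `apply_cmvn` are hypothetical. It does not reproduce the repository's cmvn.ark; to match a trained model, a prompt would need the same statistics that were used during training.

```python
import numpy as np

def compute_cmvn_stats(feats_list):
    """Per-bin mean and std over a list of (frames, n_mels) arrays."""
    stacked = np.concatenate(feats_list, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    # Guard against division by zero for constant bins.
    return mean, np.maximum(std, 1e-8)

def apply_cmvn(feats, mean, std):
    """Normalize a (frames, n_mels) array with precomputed statistics."""
    return (feats - mean) / std

# Synthetic stand-ins for training log-mel features (hypothetical data).
rng = np.random.default_rng(0)
train_feats = [rng.normal(-4.0, 2.0, size=(n, 80)) for n in (120, 95, 200)]
mean, std = compute_cmvn_stats(train_feats)

# Stand-in for a prompt mel-spectrogram of shape (frames, 80).
prompt_mel = rng.normal(-4.0, 2.0, size=(150, 80))
normalized = apply_cmvn(prompt_mel, mean, std)
```

The statistics are computed once over the training set and reused for every prompt, which is why a prompt normalized with different statistics (or not at all) can drift from what the model saw in training.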

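About the sampling rate

The README says a sampling rate of 16000 Hz is used, but the audio on the demo page https://cpdu.github.io/unicats/ has a sampling rate of 24000 Hz. What is the reason for this?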

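Extracting PPE

The results I get from running the PPE extraction script in this project do not match the PPE features provided with the project. Also, have you run ablation experiments on these three features?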

Inference Speed

Hi @cantabile-kwok,
I have also implemented UniCATS's vec2wav, but my model is too slow, so I am curious about the inference speed of this one. I am interested in integrating CTX-vec2wav with a GPT-based AR txt2vec to create a fast prompt-based TTS model.

Also, do you have any plans to release the CTX-txt2vec model soon?
Thanks
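Inference speed for a vocoder is commonly reported as a real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced. A minimal sketch of that measurement follows. The `dummy_vocoder` function is a hypothetical stand-in, not the CTX-vec2wav API, and no timing result for the actual model is implied.

```python
import time

def real_time_factor(synthesize, n_runs=3):
    """Average wall-clock seconds spent per second of audio produced."""
    total_time = 0.0
    total_audio = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        samples, sample_rate = synthesize()
        total_time += time.perf_counter() - start
        total_audio += len(samples) / sample_rate
    return total_time / total_audio

def dummy_vocoder():
    # Hypothetical stand-in: pretends to take 10 ms to produce 1 s of 16 kHz audio.
    time.sleep(0.01)
    return [0.0] * 16000, 16000

rtf = real_time_factor(dummy_vocoder)
print(f"RTF: {rtf:.4f}")
```

An RTF below 1 means synthesis runs faster than real time. When timing a GPU model, the device would need to be synchronized before each clock reading, otherwise asynchronous kernels make the measurement too optimistic.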

Use vec2wav for Speech to Speech Voice conversion

Source voice speech -> semantic tokens -> vec2wav (with target voice prompt) -> Target voice speech

Semantic tokens can easily be computed from a pretrained HuBERT, vq-wav2vec, or similar model.
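The semantic-token step in the pipeline above can be illustrated with nearest-centroid quantization, which is how discrete units are commonly derived from HuBERT features after k-means clustering. This is a sketch under assumptions: the codebook and features below are random stand-ins, `quantize_to_tokens` is a hypothetical helper, and vq-wav2vec has its own built-in quantizer that this does not reproduce.

```python
import numpy as np

def quantize_to_tokens(features, codebook):
    """Map each frame to the index of its nearest codebook vector (L2 distance).

    features: (frames, dim) array; codebook: (n_codes, dim) array.
    """
    # Squared distances via ||f||^2 - 2 f.c + ||c||^2, shape (frames, n_codes).
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)

# Random stand-in codebook: 8 codes of dimension 4 (hypothetical sizes).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))

# Build frames that sit very close to known codes, then recover their indices.
frame_ids = np.array([3, 3, 0, 7, 1])
features = codebook[frame_ids] + 0.001 * rng.normal(size=(5, 4))
tokens = quantize_to_tokens(features, codebook)
```

In a voice conversion setup, the token sequence from the source utterance would carry the content, while the target speaker's prompt supplies the timbre to the vocoder.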


Have you tried discrete representations from HuBERT or wav2vec in place of vq-wav2vec?

Recommended text or phoneme tokenizer to use

About vq_codebook

Use vec2wav for Speech to Speech Voice conversion
