x-lance / unicats-ctx-vec2wav
[AAAI 2024] Code for CTX-vec2wav in UniCATS
Home Page: https://cpdu.github.io/unicats/
Hi @cantabile-kwok, I’ve been chipping away at an unofficial implementation of the UniCATS paper here. Since the second part is out and it sounds like you’re working on the txt2vec portion, would it be possible to collaborate on this? My unofficial repo contains some very basic dataset pre-processing and the different configs for establishing the context for each utterance.
Please do let me know. Thank you!
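Thanks a lot for open-sourcing this! Could you share the cmvn.ark file? 🙏🏻🙏🏻🙏🏻
Right now I use the unnormalized mel spectrogram directly as the prompt. The pronunciation is very clear, but the timbre is not very close to the prompt speaker, so I would like to see how it sounds after normalization 🙏🏻🙏🏻🙏🏻
I would also like to confirm the mel-spectrogram parameters: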
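import librosa
import torch

# logmelspectrogram is the log-mel helper used by the poster; its import was not shown.
prompt_wav, sr = librosa.load(prompt_src_wav_file, sr=16000)
prompt = logmelspectrogram(
    x=prompt_wav.T,
    fs=16000,
    n_mels=80,
    n_fft=1024,
    n_shift=160,
    win_length=465,
    window="hann",
    fmin=80,
    fmax=7600,
).squeeze()[None, :, :]  # add a leading batch axis
prompt = torch.FloatTensor(prompt)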
Is it correct that loading the prompt this way and then normalizing it is enough to match what the model expects?
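For reference, a minimal sketch of global CMVN, assuming cmvn.ark holds Kaldi-style global statistics in the usual (2, D + 1) layout (row 0: per-dimension sums with the frame count in the last column; row 1: per-dimension sums of squares). Whether this repo normalizes variances as well as means is an assumption, exposed here as the hypothetical `norm_vars` flag, and reading the stats out of the .ark file with a Kaldi I/O library is not shown. `apply_global_cmvn` and `make_cmvn_stats` are illustrative names, not functions from the repo.

```python
import numpy as np


def apply_global_cmvn(feats: np.ndarray, stats: np.ndarray, norm_vars: bool = True) -> np.ndarray:
    """Normalize (T, D) features with Kaldi-style global CMVN stats.

    Assumed stats layout (Kaldi convention), shape (2, D + 1):
      stats[0, :D] = sum of features, stats[0, D] = frame count
      stats[1, :D] = sum of squared features
    """
    dim = feats.shape[-1]
    count = stats[0, dim]
    mean = stats[0, :dim] / count
    normalized = feats - mean
    if norm_vars:
        # Population variance: E[x^2] - E[x]^2.
        var = stats[1, :dim] / count - mean ** 2
        # Floor the variance to avoid dividing by zero on constant dimensions.
        normalized = normalized / np.sqrt(np.maximum(var, 1e-20))
    return normalized


def make_cmvn_stats(feats: np.ndarray) -> np.ndarray:
    """Accumulate (T, D) features into the same (2, D + 1) layout."""
    dim = feats.shape[-1]
    stats = np.zeros((2, dim + 1))
    stats[0, :dim] = feats.sum(axis=0)
    stats[0, dim] = feats.shape[0]
    stats[1, :dim] = (feats ** 2).sum(axis=0)
    return stats
```

With the mel snippet above, this would be applied to the (frames, 80) array before the batch axis is added and the result is converted with torch.FloatTensor.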
The README says a 16000 Hz sampling rate is used, but the audio on the demo page https://cpdu.github.io/unicats/ is sampled at 24000 Hz. Why is that?
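The features produced by the project’s PPE extraction script don’t match the PPE features provided with the project. Also, have you run any ablation experiments on these three features?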
Hi @cantabile-kwok ,
I have also implemented UniCATS's vec2wav, but that implementation is too slow, so I am curious about the inference speed of this model. I am interested in integrating CTX-vec2wav with a GPT-based autoregressive (AR) txt2vec to create a fast prompt-based TTS model.
Also, do you plan to release the CTX-txt2vec model any time soon?
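One common way to report inference speed is the real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced. A minimal sketch, where `synthesize` is a hypothetical callable wrapping a single vec2wav forward pass and returning a 1-D waveform:

```python
import time
from typing import Callable

import numpy as np


def real_time_factor(synthesize: Callable[[], np.ndarray], sample_rate: int) -> float:
    """Return wall-clock synthesis time divided by generated audio duration.

    `synthesize` is a hypothetical zero-argument callable that runs one
    forward pass and returns a 1-D waveform. RTF < 1 means faster than real time.
    """
    start = time.perf_counter()
    wav = synthesize()
    elapsed = time.perf_counter() - start
    duration = len(wav) / sample_rate  # seconds of generated audio
    return elapsed / duration
```

On a GPU, calling torch.cuda.synchronize() inside the callable before it returns keeps asynchronous kernels from being left out of the measurement. The first call is often slower because of warm-up, so averaging over several runs gives a steadier number.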
Thanks
Hi @cantabile-kwok, the paper does not recommend a particular text or phoneme tokenizer. Do you have any recommendations?
Thank you.
Hello, does vq_codebook also come from vq-wav2vec-kmeans? I noticed that the code loads an array of shape (2, 320, 256), but the quantizer embedding in vq-wav2vec-kmeans has shape (320, 1, 256). Are the two codebooks the same?
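A possible explanation for the two shapes, assuming the checkpoint uses fairseq's KmeansVectorQuantizer with combine_groups=True: in that case a single codebook of shape (num_vars, 1, var_dim) = (320, 1, 256) is stored and broadcast to every group (2 groups for vq-wav2vec), so a (2, 320, 256) array would simply contain the same codebook twice. The sketch below reproduces that layout with random values; it does not load the real checkpoint.

```python
import numpy as np

num_vars, num_groups, var_dim = 320, 2, 256

# Assumed: one shared codebook stored with a singleton group axis, (320, 1, 256).
shared = np.random.default_rng(0).standard_normal((num_vars, 1, var_dim))

# Broadcast the singleton group axis to 2 groups: (320, 2, 256) ...
expanded = np.broadcast_to(shared, (num_vars, num_groups, var_dim))
# ... then move the group axis to the front: (2, 320, 256).
per_group = np.transpose(expanded, (1, 0, 2))

print(per_group.shape)                             # (2, 320, 256)
print(np.array_equal(per_group[0], per_group[1]))  # True
```

If that is how the repo's array was produced, np.array_equal(codebook[0], codebook[1]) on the loaded (2, 320, 256) array should also return True.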
Hi @cantabile-kwok ,
I am curious whether you have tried this model for zero-shot voice conversion.
The idea is very simple:
Source voice speech -> semantic token -> vec2wav (with target voice prompt) -> Target voice speech
Semantic tokens can easily be computed with a pretrained model such as HuBERT or vq-wav2vec.
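Turning continuous self-supervised features into discrete tokens is commonly done by assigning each frame to its nearest k-means centroid. A minimal sketch, assuming the frame features (for example from a HuBERT layer) and the centroids were computed beforehand; `quantize_to_tokens` is a hypothetical helper, not part of this repo. vec2wav presumably only understands tokens from the quantizer it was trained on, so HuBERT units would likely not be a drop-in replacement without retraining.

```python
import numpy as np


def quantize_to_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map (T, D) frame features to (T,) indices of the nearest centroid.

    features:  continuous frame features, assumed to be extracted beforehand.
    centroids: (K, D) k-means cluster centres, assumed to be trained beforehand.
    """
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2, shape (T, K).
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)
```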