Comments (92)
This looks like the best candidate now: https://github.com/suno-ai/bark
their voice creation got reverse engineered.
https://github.com/serp-ai/bark-with-voice-clone
from ggml.
I personally will look into TTS after finishing the SAM implementation. Maybe someone else is already working on TTS inference
from ggml.
For anyone interested, this is the current state of neural text-to-speech for C/C++. I'm currently searching for a decent neural TTS C or C++ library to integrate into my llama.cpp front end (MAID).
rhasspy/piper
Piper is currently the most stable C++ implementation of neural TTS. It utilizes VITS models and depends on ONNX (an ML library similar in role to GGML). It also requires a custom fork of espeak-ng for phonemization. I can get Piper to compile and run on Linux and Windows (though I used my own fork of Piper to get it to compile), but actual inference only seems to work on Linux. It must have worked on Windows at some point, though, because there's a Windows front end that uses it (Piper_UI).
Sherpa Onnx
Sherpa Onnx is very similar to Piper. As its name suggests, it utilizes the ONNX runtime for inference, and like Piper it uses espeak-ng for phonemization. It has various implementations and APIs available and is actively maintained, but its monumental size makes it difficult to integrate into Dart, which is a requirement for what I need to do.
I'd prefer to avoid Piper and Sherpa Onnx for my own project, as I don't want to depend on espeak-ng or on a separate ML library other than GGML, which I'm already using for the llama.cpp integration.
Vits.cpp
Vits.cpp is a GGML implementation of VITS models (the same models Piper and Sherpa Onnx use). It does not require espeak-ng and, as stated, uses GGML rather than ONNX. VITS models are good if you have a lot of data because they produce very small model files (one model file I've tested is 60 MB). For what I need to do, Vits.cpp would be ideal; however, though I can get it to compile, it immediately segfaults when launched. It also hasn't been updated in 3 months, so the maintainer has likely abandoned it.
Bark.cpp
Bark.cpp is another GGML implementation, but unlike the others it uses the Bark series of models released by suno-ai. I haven't tested whether it works, but from what I've seen there's little control over which voice is used and a limited variety of voice presets. There's also no ability to clone voices, and because Bark is a GPT model, the words spoken in the output can differ from the input. It's also worth stating that, as of writing, Bark.cpp hasn't exactly been receiving frequent updates, so it may also be abandoned.
That's it for the C++ implementations I'm aware of. If anyone knows of any others, let me know.
Purely for voice-cloning models, there are a few available now, but they all require Python, which means they can't be integrated into Dart and C++ software well. Off the top of my head there's OpenVoice, WhisperSpeech, StyleTTS, VALL-E, MetaVoice, and the already-mentioned Bark and VITS.
from ggml.
Coqui released their cross-lingual TTS model: XTTS:
It supports 13 languages (Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, and Turkish).
It also offers voice cloning and cross-lingual voice cloning.
from ggml.
https://github.com/balisujohn/tortoise.cpp
Repository is public, feel free to make an issue on this repo if you want to contribute. I will release the ggml export script and modified tortoise that prints out intermediate values that I'm using for reverse engineering also if people want it.
from ggml.
What about https://github.com/snakers4/silero-models ?
from ggml.
it's in roadmap now ggerganov/llama.cpp#1729
from ggml.
there is now a tracking issue for bark #388
which links https://github.com/PABannier/bark.cpp (wip) 🥳
from ggml.
Is there any update on text to speech?
from ggml.
WhisperSpeech works really well with 10-20 minutes of audio for many voices for zero shot voice cloning. It does a good job on JFK for example if you use this file as input: https://upload.wikimedia.org/wikipedia/commons/5/50/Jfk_rice_university_we_choose_to_go_to_the_moon.ogg The overall quality is less than tortoise though.
If I have time I'm interested in making whisperspeech.cpp, but I'm busy with tortoise.cpp for now :^)
I'd also add that WhisperSpeech already seems fast enough that ggml might not improve much on it in terms of speed, but I'm not really sure.
from ggml.
This looks like the best candidate now: https://github.com/suno-ai/bark
By far, since they provide the models (a bit over 12 GB).
from ggml.
A new paper came out called Tango; it looks pretty good, also using LLMs.
from ggml.
Been playing around with StyleTTS 2 and it's pretty fast. IMHO it would be better to add Tortoise first, since it's slower and GGML could have a more significant impact in speeding it up, but StyleTTS 2 is pretty impressive too.
from ggml.
There is a new contender: WhisperSpeech:
An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch.
We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable.
We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications.
from ggml.
Seems like the recent update of StyleTTS2 is really good; see TTS Arena.
from ggml.
I'm interested in implementing a TTS using ggml, but don't have capacity atm - there are other priorities.
Also, I don't think it is worth implementing a model from 3-4 years ago. It should be SOTA.
What is SOTA atm?
VALL-E looks like a good candidate - but no weights.
from ggml.
The original Bark did sound artificial to me.
The voice cloning repos (both) sound amazing already!
Combine that with LLM text generation and the inference speed we already see, and we have realtime generative speech output.
from ggml.
While I'm not an expert by any means, VITS in Coqui TTS is almost realtime on CPU (I tested on a mid-range laptop CPU). With ggml and a good quant, if possible, it could almost certainly be realtime, maybe even playing to speakers in realtime too. Just a thought.
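For context, "almost realtime" is usually quantified as the real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced, where RTF < 1 means faster than realtime. A minimal sketch of the measurement; the synthesize() below is a stand-in stub, not the actual Coqui TTS API:

```python
# Sketch of measuring real-time factor (RTF). The synthesize() function is a
# placeholder stub, NOT the real Coqui TTS API; swap in a real model call.
import time

SAMPLE_RATE = 22050  # samples per second, typical for VITS models

def synthesize(text):
    # Placeholder: pretend we produced 2 seconds of audio instantly.
    return [0.0] * (2 * SAMPLE_RATE)

t0 = time.perf_counter()
audio = synthesize("hello world")
elapsed = time.perf_counter() - t0

# RTF = synthesis time / audio duration; < 1.0 means faster than realtime.
rtf = elapsed / (len(audio) / SAMPLE_RATE)
print(f"RTF = {rtf:.3f}")
```

With a real model plugged in, an RTF comfortably below 1.0 leaves headroom to stream the audio to speakers while the next chunk is being synthesized.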
from ggml.
coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2
from ggml.
StyleTTS looks like a very clean approach, also very good english.
But... it's not multi-lingual at this point. The README reads like adding more languages is a small thing, but it doesn't appear that easy when you look closer.
If StyleTTS2 supported languages similarly to the others, I'd focus on it fully.
from ggml.
StyleTTS 2 is truly state-of-the-art. Just look at quality and speed https://huggingface.co/spaces/styletts2/styletts2
This is a great candidate for cpp implementation.
from ggml.
Is anyone working on StyleTTS2.cpp?
Support for other languages is important, but even in English it would be an incredibly useful thing, especially the speed at which it works.
from ggml.
Yeah, some people on the STTS Slack channel are trying to reimplement Phonemizer.
from ggml.
Tortoise.cpp will get its own repo, but the goal will be to keep its ggml version consistent with upstream GGML in the long run.
from ggml.
by far. since they provide the models. (a bit over 12gig)
even with the best "4-bit" quantization imaginable, you still end up with ~3 GB of model files
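The ~3 GB figure follows from a back-of-envelope calculation, assuming the ~12 GB checkpoint is all fp16 weights and ignoring the per-block scale/zero-point overhead that real 4-bit schemes add:

```python
# Back-of-envelope: why ~12 GB of fp16 weights still leaves ~3 GB at 4-bit.
# Assumes the whole checkpoint is fp16 parameters (2 bytes each); real 4-bit
# quantization adds a little per-block scale/zero-point overhead on top.
fp16_bytes = 12 * 1024**3        # ~12 GiB of checkpoint data
n_params = fp16_bytes // 2       # fp16 stores 2 bytes per weight
q4_bytes = n_params // 2         # 4-bit packs 2 weights per byte
print(f"{q4_bytes / 1024**3:.1f} GiB")  # 3.0 GiB
```

In other words, quantizing fp16 to 4-bit is a fixed 4x reduction, so a 12 GB model can't shrink below roughly a quarter of its size this way.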
from ggml.
It seems that the unlocked Bark with voice cloning is here now:
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
This is a necessary step for a complete system.
It is worth having a look into the open and closed issues there to get an overview of the multiple models needed.
from ggml.
Also, in llama.cpp May 2023 roadmap, a recent comment suggests "drop-in replacement for EnCodec", which may be (or not) easier to implement.
from ggml.
it's in roadmap now ggerganov/llama.cpp#1729
That's great news! My only complaint with bark is its speed... your magic touch would be ✨✨✨
from ggml.
@ggerganov This is a good TTS with C++ code (ONNX Runtime). https://github.com/rhasspy/piper.
You can try some generated sample at: https://rhasspy.github.io/piper-samples/.
from ggml.
https://github.com/Plachtaa/VALL-E-X/blob/master/README.md
how about this?
from ggml.
XTTS is based on tortoise-tts
https://github.com/neonbjb/tortoise-tts
which uses a modified GPT-2. The embedding is different, and it has two heads: one for mel and one for text.
I think we can convert the tortoise-tts autoregressive model to ggml.
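The two-head layout is the main structural difference from a stock GPT-2: a shared transformer trunk whose final hidden states feed two separate linear projections, one over a text vocab and one over mel codes. A shape-level sketch (the dimensions are made up, and the trunk is elided; the real model is a modified GPT-2):

```python
# Shape-level sketch of a shared trunk with two output heads, one for text
# tokens and one for mel codes. Sizes are illustrative placeholders, not the
# real tortoise-tts dimensions, and the transformer trunk itself is elided.
import numpy as np

d_model, n_text, n_mel = 16, 256, 8194   # made-up sizes
rng = np.random.default_rng(0)
W_text = rng.standard_normal((d_model, n_text))  # text head projection
W_mel = rng.standard_normal((d_model, n_mel))    # mel head projection

def forward(hidden):
    """hidden: (seq, d_model) output of the shared transformer trunk."""
    return hidden @ W_text, hidden @ W_mel

h = rng.standard_normal((5, d_model))    # stand-in for trunk output
text_logits, mel_logits = forward(h)
print(text_logits.shape, mel_logits.shape)  # (5, 256) (5, 8194)
```

For a ggml port this suggests the existing gpt-2 example covers the trunk, and conversion work concentrates on the embeddings and the two head tensors.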
from ggml.
If ggml supports tortoise-tts, it should be able to handle XTTS too; XTTS is a multilingual tortoise-tts model. We might need different model-conversion scripts for XTTS, that's it.
from ggml.
coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2
Thanks for that info, I'll test it now. The audio samples in their paper are superior to the others, though the latest XTTS v2 is also almost flawless (but for XTTS v2 we'd need an equivalent open-source model to be useful).
If we go the Tortoise/XTTS route it would be best to make sure to implement the advantages of xtts as well, namely the instant voice cloning and the language independent models.
Evaluating StyleTTS2:
- The dataset to train is completely open and the steps to train appear very simple
- The code is MIT
- They supply models which are completely open, with the exception that you need to inform people about StyleTTS2 UNLESS you have permission from the voice originator, which is awesome... AND it only applies if you don't just train your own.
- Now testing it
from ggml.
Time for me to learn C++, I think!
from ggml.
Yep, it's the best out there; as soon as multi-language is supported, nothing can stop it.
The quality is better and the computation requirements are a fraction of what they were before.
from ggml.
Does anyone know of any other high-quality models suitable for realtime use?
There is ggml bark implementation https://github.com/PABannier/bark.cpp but it's quite slow
from ggml.
bark.cpp is good because it does not require a phoneme library; it does everything automatically.
StyleTTS 2 has better sounding voices but it requires third party libraries like espeak and some nltk stuff.
XTTS is not permissively licensed and breaks the idea of building these ggml libraries under MIT.
So best solution for StyleTTS 2 is to do one of the following:
- find a more permissively licensed phonemizer
- dynamically link to espeak and build it under GPL (so it cannot be in this repository)
- build a phonemizer from scratch in C/C++ specifically for this project
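To make the third option concrete, the core of a from-scratch phonemizer is dictionary lookup with a letter-to-sound fallback. A toy sketch; the tiny lexicon and fallback table below are illustrative only, and a usable English G2P would need a full pronouncing dictionary (e.g. CMUdict-style data) plus trained letter-to-sound rules for out-of-vocabulary words:

```python
# Toy sketch of a from-scratch phonemizer: dictionary lookup with a crude
# per-letter fallback. The lexicon and fallback entries are illustrative
# placeholders, nowhere near espeak's coverage or accuracy.
LEXICON = {
    "hello": "həloʊ",
    "world": "wɝld",
}
FALLBACK = {"a": "æ", "b": "b", "c": "k"}  # stub letter-to-sound rules

def phonemize(word):
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Out-of-vocabulary: fall back to per-letter rules (very lossy).
    return "".join(FALLBACK.get(ch, ch) for ch in word)

print(phonemize("Hello"))  # həloʊ
print(phonemize("cab"))    # kæb
```

The hard part is exactly what espeak solves: large-scale lexicon coverage, stress assignment, and context-dependent letter-to-sound rules, which is why a permissively licensed replacement is non-trivial.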
from ggml.
@manmay-nakhashi Sounds good regarding collaboration! I posted here precisely to avoid duplicating effort; better to work together than to unknowingly duplicate work.
from ggml.
The text-to-speech part of SeamlessM4T is what's relevant here, but so far they have implemented speech-to-text translation (S2TT), automatic speech recognition (ASR), and text-to-text translation (T2TT).
They also have speech-to-speech translation (S2ST) and text-to-speech translation (T2ST) models in the same family, but the ggml implementation for those is still missing. Maybe there is hope they implement those too. :)
https://github.com/facebookresearch/seamless_communication/tree/main/ggml
from ggml.
Hmm, would be nice to see the WIP project, but Friday works. Thank you for creating this project! Really looking forward to a faster way to run Tortoise.
from ggml.
If people are interested in contributing to tortoise.cpp, a great first task would be getting the tokenizer to always match the tokenization tortoise-tts uses. The tokenizer I'm using in tortoise.cpp seems able to load the tokenizer vocab, but the regex had issues with some of the special characters, which I band-aided at least for spaces. More perplexingly, the tortoise-tts tokenizer isn't greedy with respect to always choosing the longest possible next token, while the default tokenizer I copied from the ggml gpt-2 example seems to be greedy. So the task would be studying the tokenizer tortoise-tts uses and modifying the tokenizer in tortoise.cpp to exactly match its behavior. I can also come up with some other tasks to work on.
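To illustrate the greedy-vs-non-greedy mismatch (with a made-up vocab, not tortoise-tts's real one; its actual tokenizer is BPE-based and follows learned merge ranks rather than string length):

```python
# Toy illustration of the greedy-vs-non-greedy tokenizer mismatch. The vocab
# is made up; tortoise-tts's real tokenizer is BPE-based, where merge rank,
# not token length, decides the split.
vocab = {"th", "the", "er", "e", "r", "t", "h"}

def tokenize_greedy(text, vocab):
    """Always take the longest vocab entry that prefixes the remaining text."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest candidate first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return out

# Greedy splits "ther" as "the" + "r"; a BPE tokenizer applying merges by
# rank could instead produce "th" + "er" even though "the" is longer.
print(tokenize_greedy("ther", vocab))  # ['the', 'r']
```

Matching tortoise-tts exactly therefore likely means replicating its merge-rank logic rather than patching the greedy longest-match loop.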
from ggml.
Looks like it also uses Meta's EnCodec library. I think it will be crucial to get a good implementation of this in C++, as I have a feeling that more and more SOTA audio models are going to use this library (Bark also comes to mind).
An initial implementation of encodec is found here: https://github.com/PABannier/encodec.cpp
from ggml.
Piper uses a VITS model (run through some conversion to ONNX) which runs quite quickly on chunky CPUs. It's not as shiny as some others mentioned here but is quite capable and known to compress well. (StyleTTS considers VITS a close competitor, in their examples.)
from ggml.
How about using vall-e?
from ggml.
How about using vall-e?
AFAIK Microsoft has not released the weights of VALL-E. They just uploaded the paper to arxiv and set up a demo website with some generation samples.
from ggml.
@ggerganov I hope you make a text to speech example from cpp
from ggml.
Here, there is a TTS pytorch model, which has available weights:
https://github.com/r9y9/deepvoice3_pytorch
I would be particularly interested in the implemented "nyanko" model (described in https://aclanthology.org/2020.lrec-1.789.pdf).
There are several stages of pre-processing in python, but if the model can be ported, porting those to c/c++ could be done afterwards.
@ggerganov, what's your assessment of the level of difficulty?
from ggml.
UP
from ggml.
VALL-E looks like a good candidate - but no weights.
It seems quite demanding in terms of training data required (60k hours). Aiming for VALL-E X (multilingual) would be the natural choice (this requires 70k hours), but apparently (per the paper) it has been tested on only 2 languages so far. I think it is very unlikely that they release the model, and difficult to have a community-based one (at least for a breadth of languages).
Also, it might be quite demanding for inference (I know ggml is reaching unbelievable achievements by quantizing, etc., but still...).
On the contrary, the one I proposed (nyanko) gets acceptable quality with only ~20 hours (yes, hours!) of training data, and it can be trained for each language in just 3 days on a single GPU (single speaker). I trained models for 3 speakers (1 non-English language). Let me know if you would like to listen to the samples or test the Python implementation. Besides, Python inference on CPU is already real-time on modern systems. It would have truly outstanding performance in C.
I believe a desirable TTS would be "universal language" direct unicode text to wav converter, but I have not been able to spot such model.
from ggml.
With respect to VALL-E, there are 2 unofficial PyTorch implementations; neither implements VALL-E X (multilanguage), and neither has released weights (due to ethical concerns?).
https://github.com/enhuiz/vall-e
https://github.com/lifeiteng/vall-e
I do not have details on the weights size or training/inference requirements.
Compare that to a multilingual TTS with lots of available languages: larynx
https://github.com/rhasspy/larynx
The quality seems a bit lower. But the training work is done.
One may wonder what would be the real advantage of using ggml in this case.
from ggml.
they don't provide any code, but
https://speechresearch.github.io/naturalspeech2/
https://arxiv.org/abs/2304.09116
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
still more diffusion models ...
from ggml.
While Tango sounds cool, it's a text-to-audio model, not a text-to-speech model.
from ggml.
@ggerganov is there any possibility that Bark, ported to cpp, would be feasible to run on constrained devices like iPhones? Ex. a device with 4GB RAM and a tolerable limit of model size in the low 100s of MB.
from ggml.
even with the best "4bit" quantization imaginable, you still end up with ~3gigs for model files
Is the quantization referring to the "smaller model" released 05-01?
from ggml.
@dennislysenko no i was talking about ggml, not sure what changes they made in 1.5
from ggml.
@Green-Sky
Seems like their README now mentions GPU cards with as little as 2 GB:
The full version of Bark requires around 12Gb of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2Gb work with some additional settings.
05-01 release notes mention:
We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.
In theory, could this mean that with 4x quantization it's possible to target ~500 MB of VRAM?
from ggml.
it's possible to target ~500MB VRAM?
@dennislysenko ggml using VRAM is very optional. By default, ggml only uses RAM and the CPU. :)
In theory, could this mean with 4x quantization,
their description is very vague and I don't have the time to look at the code, so maybe
from ggml.
XTTS is based on tortoise-tts https://github.com/neonbjb/tortoise-tts which uses a modified GPT-2. The embedding is different and it has two heads, one for mel and one for text. I think we can convert the tortoise-tts autoregressive model to ggml.
@manmay-nakhashi do you mean that ggml should be able to handle XTTS out of the box or would it need some adaptations?
from ggml.
the problem I had with tortoise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.
from ggml.
@wassname GPT models can be unpredictable sometimes; fine-tuning on better speaker-segmented data can resolve this problem.
from ggml.
I think converting tortoise-tts to ggml makes sense. Anyone willing to collaborate on converting tortoise-tts to ggml?
from ggml.
the problem I had with tortoise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.
That's just an implementation issue; look at that Python package, it's a nightmare IMHO.
Tortoise, especially the latest v2 of XTTS, is producing very good results, in many cases flawless speech to my ears.
It's transformer-based; I didn't look into the source, but it is likely a good candidate. The only one as of today, IMHO.
XTTS v2 just clones a voice in seconds (more or less closely) and then uses it for any language.
from ggml.
I've soured on bark altogether. Even on their own Discord, the thing regularly produces wildly unpredictable output, frequently with only a thin resemblance to what you asked for.
something went wrong with your reply
from ggml.
Tortoise please!!!
from ggml.
Please implement Tortoise instead of XTTS. XTTS is licensed under the ultra-restrictive CPML, which completely prohibits ALL commercial use. Please help promote open source by supporting Tortoise instead.
from ggml.
Yes. Perhaps someone could create a "merge" of XTTS and Tortoise, similar to the Tortoise Fast API, for example using an autoregressive model + HiFi-GAN?
from ggml.
StyleTTS looks great. Really hope this gets implemented. Would love to have something similar to llama.cpp that supports many models (tts.cpp?)
from ggml.
Another upvote here for text-to-speech cpp.
Since I am a complete noob with this stuff, could someone give me the high level steps needed for this happen?
Does it require GPU time to re-train or quantize models, or is it mostly just writing code to port encoders, etc.?
Thanks, and appreciate all the work the community have put in to making this stuff work for the GPU-poor!
from ggml.
@eolasd It shouldn't require GPU time. For example, w/ llama.cpp, you don't need to retrain the models. Probably mostly just porting the Python inference code to C++ and getting the models to work with GGML, right?
from ggml.
What I read out of the discussion so far: the training material for multi-language is assembled, someone promised to sponsor 8xA100 GPUs for the ~3000 hours of training time, and that last step is currently open.
from ggml.
The main issue with StyleTTS is that it uses IPA phonemes. Right now espeak is the only lib that works with STTS, and it's GPL-licensed.
from ggml.
I think XTTS is the best candidate currently; it's a multimodal GPT-2, so it should be relatively easy to port from the GPT-2 code that's already implemented.
from ggml.
But not permissively licensed
from ggml.
I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on Github in a week or so. I will release it with an MIT License.
from ggml.
I have a model conversion and some changes. You can create a repo; I'll contribute over there.
from ggml.
I will make a cleaner version of my repo public Friday (it's currently just a messy fork of ggml with a new tortoise folder in examples), if that sounds good (after the submission deadline for my class). I have a partially developed ggml file format for the model that my code uses; I'm building it tensor by tensor, since I'm pretty new to ggml reverse engineering. I'm still working on the autoregressive forward pass, though it looks like I might be able to reuse a lot of code from the existing ggml gpt-2 implementation. I have numbers matching the PyTorch forward pass for the text embeddings, which isn't much, but it shows that I can load tensors from the ggml file, construct a cgraph, and get the ggml ops to work. I also added a CUDA implementation of ggml_concat, since I'm using a fork of ggml.
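The "tensor by tensor" file approach is simple enough to sketch: each record carries a name, a shape, and raw float32 data, and the loader reads records until EOF. This is NOT the real ggml/GGUF layout (which has magic numbers, alignment, and type tags), just an illustration of the idea:

```python
# Sketch of a tensor-by-tensor model file: each record is name length, name,
# rank, shape, then raw float32 data. This is NOT the real ggml/GGUF format,
# which adds magic numbers, data-type tags, and alignment; it only shows the
# write/read round trip the approach relies on.
import io
import struct

def write_tensor(buf, name, shape, data):
    nb = name.encode("utf-8")
    buf.write(struct.pack("<I", len(nb)))            # name length
    buf.write(nb)                                    # name bytes
    buf.write(struct.pack("<I", len(shape)))         # rank
    buf.write(struct.pack(f"<{len(shape)}I", *shape))
    buf.write(struct.pack(f"<{len(data)}f", *data))  # raw float32 payload

def read_tensor(buf):
    (nlen,) = struct.unpack("<I", buf.read(4))
    name = buf.read(nlen).decode("utf-8")
    (rank,) = struct.unpack("<I", buf.read(4))
    shape = struct.unpack(f"<{rank}I", buf.read(4 * rank))
    count = 1
    for d in shape:
        count *= d
    data = list(struct.unpack(f"<{count}f", buf.read(4 * count)))
    return name, shape, data

buf = io.BytesIO()
write_tensor(buf, "text_emb.weight", (2, 3), [1, 2, 3, 4, 5, 6])
buf.seek(0)
print(read_tensor(buf))  # ('text_emb.weight', (2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Building the format incrementally like this means each new tensor can be verified against the PyTorch values before the next op in the cgraph is wired up.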
from ggml.
Looks interesting too; it works quite well but does not produce the same quality as StyleTTS 2 (or XTTS v2). Seamless is a much bigger project.
If they integrate ggml, that would certainly be a good thing; however, their models are not permissive (non-commercial).
from ggml.
I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on Github in a week or so. I will release it with an MIT License.
Thank you! Will it support quantization and Metal?
from ggml.
Thank you! Will it support quantization and Metal?
Initially it's CUDA-only, but I'm open to merging in whatever people are willing to contribute; the goal is to get an open-source project going. We can also think about adding training and voice cloning, etc., but the first goal (and the code I am most interested in writing myself as a starting point) is just getting inference to work for arbitrary text from hardcoded voice latents for the mol voice.
from ggml.
Hmm, makes sense. Will you make a PR to merge your fork of GGML to the main repo?
from ggml.
The goal would be to upstream any changes I make to ggml so as not to use a weird version of it. I just forked ggml because the ggml gpt-2 implementation was a really nice template to start from. So far the only upstream change I made was adding a CUDA concatenation kernel, because for some reason it was CPU-only previously.
from ggml.
Hi,
is the tortoise.cpp repo public?
from ggml.
I technically can't make it public before 9pm today, but I was thinking Friday so I have some time to do some cleanup work. I don't see why I couldn't release it sooner. Would you prefer if I released it sooner than Friday?
from ggml.
Hopefully it ends up being useful! Just to level expectations, I want to emphasize the project is nowhere near done. It should be public and ready for contributors by Friday, but it will definitely not be anywhere close to a complete forward pass by then.
from ggml.
The tiny number of examples in WhisperSpeech is concerning
Compare it with that: https://styletts.github.io/
from ggml.
If you try it on Colab, it's actually quite good (not as good as XTTS, but not bad), though definitely not as fast as StyleTTS.
from ggml.
I think the main thing to consider here is that it does multilingual very well (StyleTTS only does English) and its architecture is very similar to Whisper, so I assume we could borrow from whisper.cpp.
from ggml.