
Text to speech models in GGML? · ggml · 92 comments · open

ggerganov avatar ggerganov commented on July 19, 2024 34
Text to speech models in GGML?

from ggml.

Comments (92)

ggerganov avatar ggerganov commented on July 19, 2024 22

This looks like the best candidate now: https://github.com/suno-ai/bark

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024 13

This looks like the best candidate now: https://github.com/suno-ai/bark

their voice creation got reverse engineered.
https://github.com/serp-ai/bark-with-voice-clone

from ggml.

ggerganov avatar ggerganov commented on July 19, 2024 8

I personally will look into TTS after finishing the SAM implementation. Maybe someone else is already working on TTS inference

from ggml.

danemadsen avatar danemadsen commented on July 19, 2024 8

For anyone interested, this is the current state of neural text-to-speech for C/C++. I'm currently searching for a decent neural TTS C or C++ library to integrate into my llama.cpp front end (MAID).

rhasspy/piper

Piper is currently the most stable C++ implementation of neural TTS. It utilizes VITS models and depends on ONNX (an ML library similar to GGML). It also requires a custom fork of espeak-ng for phonemization. I can get Piper to compile and run on Linux and Windows (though I used my own fork of Piper to get it to compile), but actual inference only seems to work on Linux. It must have worked at some point, though, because there's a Windows front end that uses it (Piper_UI).

Sherpa Onnx

Sherpa Onnx is very similar to Piper. As its name suggests, it utilizes the ONNX runtime for inference and, like Piper, it also uses espeak-ng for phonemization. It has various implementations and APIs available and is actively maintained, but its monumental size makes it difficult to integrate into Dart, which is a requirement for what I need to do.

For my own project I'd prefer to avoid Piper and Sherpa Onnx, as I would rather not depend on espeak-ng or on a separate ML library other than GGML, which I'm already using for the llama.cpp integration.

Vits.cpp

Vits.cpp is a GGML implementation of VITS models (the same models Piper and Sherpa Onnx use). It does not require espeak-ng and, as stated, uses GGML rather than ONNX. VITS models are good if you have a lot of data because they produce very small model files (one model file I've tested is 60 MB). For what I need to do Vits.cpp would be ideal; however, though I can get it to compile, it immediately segfaults when launched. It also hasn't been updated for 3 months, so the maintainer has likely abandoned it.

Bark.cpp

Bark.cpp is another GGML implementation, but unlike the others it uses the Bark series of models released by suno-ai. I haven't tested whether it works, but from what I've seen there's little control over which voice is used and a limited variety of voice presets. There's also no ability to clone voices, and because Bark is a GPT model the words spoken in the output can differ from the input. It's also worth stating that, as of writing, Bark.cpp hasn't exactly been receiving frequent updates, so it may also be abandoned.

That's it for the C++ implementations I'm aware of; if anyone knows of any more, let me know.

Purely for voice cloning there are a few models available now, but they all require Python, which means they can't be integrated well into Dart and C++ software. Off the top of my head there's OpenVoice, WhisperSpeech, StyleTTS, VALL-E, MetaVoice, and the already mentioned Bark and VITS.

from ggml.

noe avatar noe commented on July 19, 2024 7

Coqui released their cross-lingual TTS model: XTTS:

It supports 13 languages (Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, and Turkish).

It also offers voice cloning and cross-lingual voice cloning.

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024 7

https://github.com/balisujohn/tortoise.cpp

Repository is public, feel free to make an issue on this repo if you want to contribute. I will also release the ggml export script and the modified tortoise-tts that prints out intermediate values, which I'm using for reverse engineering, if people want it.

from ggml.

x066it avatar x066it commented on July 19, 2024 6

What about https://github.com/snakers4/silero-models ?

from ggml.

gut4 avatar gut4 commented on July 19, 2024 5

It's on the roadmap now: ggerganov/llama.cpp#1729

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024 5

there is now a tracking issue for bark #388
which links https://github.com/PABannier/bark.cpp (wip) 🥳

from ggml.

afyacnkep avatar afyacnkep commented on July 19, 2024 4

Is there any further update about text to speech?

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024 4

WhisperSpeech works really well for zero-shot voice cloning with 10-20 minutes of audio, for many voices. It does a good job on JFK, for example, if you use this file as input: https://upload.wikimedia.org/wikipedia/commons/5/50/Jfk_rice_university_we_choose_to_go_to_the_moon.ogg The overall quality is less than Tortoise's, though.

If I have time I'm interested in making whisperspeech.cpp, but I'm busy with tortoise.cpp for now :^)

I'd also add that WhisperSpeech already seems like it could be fast enough that ggml might not improve much over it in terms of speed, but I'm not really sure.

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024 3

This looks like the best candidate now: https://github.com/suno-ai/bark

By far, since they provide the models (a bit over 12 GB).

from ggml.

mattkanwisher avatar mattkanwisher commented on July 19, 2024 3

A new paper came out called Tango; it looks pretty good and also uses LLMs.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024 3

Been playing around with StyleTTS 2 and it's pretty fast. IMHO it would be better to add Tortoise first, since it's slower and GGML could have a more significant impact in speeding it up, but StyleTTS 2 is pretty impressive too.

from ggml.

noe avatar noe commented on July 19, 2024 3

There is a new contender: WhisperSpeech:

An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch.

We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable.

We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications.

from ggml.

lin72h avatar lin72h commented on July 19, 2024 3

Seems like the recent update of StyleTTS2 is really good: TTS Arena

from ggml.

ggerganov avatar ggerganov commented on July 19, 2024 2

I'm interested in implementing a TTS using ggml, but don't have capacity atm - there are other priorities.
Also, I don't think it is worth implementing a model from 3-4 years ago. It should be SOTA.
What is SOTA atm?

VALL-E looks like a good candidate - but no weights.

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024 2

The original Bark did sound artificial to me.
The voice cloning repos (both) sound amazing already!

Combine that with LLM text and the inference speed we see already... then we have realtime generative speech output.

from ggml.

TechnotechGit avatar TechnotechGit commented on July 19, 2024 2

While I'm not an expert by any means, VITS in CoquiTTS is almost realtime on CPU (I tested on a mid-range laptop CPU). With ggml and a good quant if possible, it could almost certainly be realtime, maybe even playing to speakers in realtime too. Just a thought.

from ggml.

khimaros avatar khimaros commented on July 19, 2024 2

coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024 2

StyleTTS looks like a very clean approach, with very good English.
But it's not multilingual at this point. The readme reads like adding more languages is a small thing, but it doesn't appear that easy when looking closer.
If StyleTTS2 supported languages similarly to the others, I'd focus on it fully.

from ggml.

MichaelWengren avatar MichaelWengren commented on July 19, 2024 2

StyleTTS 2 is truly state-of-the-art. Just look at quality and speed https://huggingface.co/spaces/styletts2/styletts2
This is a great candidate for cpp implementation.

from ggml.

MichaelWengren avatar MichaelWengren commented on July 19, 2024 2

Is anyone working on StyleTTS2.cpp?
Support for other languages is important, but even in English it would be an incredibly useful thing, especially the speed at which it works.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024 2

Yeah some ppl on the STTS Slack channel are trying to reimplement Phonemizer

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024 2

Tortoise.cpp will get its own repo, but the goal will be to keep its ggml version consistent with normal GGML in the long run.

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024 1

@dennislysenko

By far, since they provide the models (a bit over 12 GB).

even with the best "4-bit" quantization imaginable, you still end up with ~3 GB for the model files
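The back-of-the-envelope arithmetic behind that estimate is just file size scaled by bits per weight; a minimal sketch (illustrative only, assuming the ~12 GB figure is roughly fp16 weights and ignoring per-block scale overhead):

```cpp
// Quantized size ≈ current size × (new bits per weight / current bits per weight),
// ignoring tensors kept in higher precision and per-block quantization scales.
double quantized_gib(double current_gib, double current_bits, double new_bits) {
    return current_gib * new_bits / current_bits;
}
// quantized_gib(12.0, 16.0, 4.0) == 3.0  -> the "~3 GB" estimate above
```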

from ggml.

Martin-Laclaustra avatar Martin-Laclaustra commented on July 19, 2024 1

It seems that the unlocked Bark with voice cloning is here now:
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
This is a necessary step for a complete system.
It is worth having a look at the open and closed issues there to get an overview of the multiple models needed.

from ggml.

Martin-Laclaustra avatar Martin-Laclaustra commented on July 19, 2024 1

Also, in the llama.cpp May 2023 roadmap, a recent comment suggests a "drop-in replacement for EnCodec", which may (or may not) be easier to implement.

from ggml.

kskelm avatar kskelm commented on July 19, 2024 1

It's on the roadmap now: ggerganov/llama.cpp#1729

That's great news! My only complaint with bark is its speed... your magic touch would be ✨✨✨

from ggml.

vietanhdev avatar vietanhdev commented on July 19, 2024 1

@ggerganov This is a good TTS with C++ code (ONNX Runtime). https://github.com/rhasspy/piper.
You can try some generated sample at: https://rhasspy.github.io/piper-samples/.

from ggml.

yorkzero831 avatar yorkzero831 commented on July 19, 2024 1

https://github.com/Plachtaa/VALL-E-X/blob/master/README.md

how about this?

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024 1

XTTS is based on tortoise-tts
https://github.com/neonbjb/tortoise-tts
which uses a modified GPT-2. The embedding is different and it has two heads, one for mel and one for text.
I think we can convert the tortoise-tts autoregressive model to ggml.
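To make the "two heads" concrete: both heads are just separate projections applied to the same final hidden state. Below is a minimal sketch in the style of the ggml gpt-2 example (toy sizes and constant weights, not tortoise-tts's actual tensors; the graph-API calls vary a bit between ggml versions):

```cpp
#include <cstdio>
#include "ggml.h"

// Hypothetical sketch: two output heads on a shared GPT-2-style backbone,
// one projecting to text-token logits and one to mel-token logits.
int main() {
    struct ggml_init_params params = { /*mem_size*/ 16 * 1024 * 1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(params);

    const int n_embd = 4, n_tok = 2, n_text_vocab = 6, n_mel_vocab = 10; // toy sizes

    struct ggml_tensor * hidden      = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tok);        // backbone output
    struct ggml_tensor * w_text_head = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_text_vocab); // text projection
    struct ggml_tensor * w_mel_head  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_mel_vocab);  // mel projection
    ggml_set_f32(hidden, 0.1f);
    ggml_set_f32(w_text_head, 0.2f);
    ggml_set_f32(w_mel_head, 0.3f);

    // The same hidden state feeds both heads.
    struct ggml_tensor * text_logits = ggml_mul_mat(ctx, w_text_head, hidden); // [n_text_vocab, n_tok]
    struct ggml_tensor * mel_logits  = ggml_mul_mat(ctx, w_mel_head,  hidden); // [n_mel_vocab,  n_tok]

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, text_logits);
    ggml_build_forward_expand(gf, mel_logits);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 1);

    printf("text_logits[0] = %f, mel_logits[0] = %f\n",
           ggml_get_f32_1d(text_logits, 0), ggml_get_f32_1d(mel_logits, 0));
    ggml_free(ctx);
    return 0;
}
```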

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024 1

It should be able to handle XTTS if ggml supports tortoise-tts; XTTS is a multilingual tortoise-tts model. We might need different model conversion scripts for XTTS, that's it.

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024 1

coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2

Thanks for that info, I'll test that now. The audio samples in their paper are superior to the others, though the latest XTTS v2 is also almost flawless (but for XTTS v2 we'd need an equivalent open-source model to be useful).
If we go the Tortoise/XTTS route it would be best to make sure to implement the advantages of XTTS as well, namely the instant voice cloning and the language-independent models.

Evaluating StyleTTS2:

  1. The dataset to train is completely open and the steps to train appear very simple
  2. The code is MIT
  3. They supply models which are completely open, with the exception that you need to inform people about StyleTTS2 unless you have permission from the voice originator, which is awesome, and it only applies if you don't just train your own.
  4. Now testing it

from ggml.

eolasd avatar eolasd commented on July 19, 2024 1

Time for me to learn C++, I think!

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024 1

Yep it's the best out there, as soon as multilanguage is supported nothing can stop it.
The quality is better and the computation requirements are a fraction of before.

from ggml.

MichaelWengren avatar MichaelWengren commented on July 19, 2024 1

Does anyone know of any other high-quality models suitable for realtime use?
There is a ggml Bark implementation, https://github.com/PABannier/bark.cpp, but it's quite slow.

from ggml.

bachittle avatar bachittle commented on July 19, 2024 1

bark.cpp is good because it does not require the use of a phoneme library; it does everything automatically.
StyleTTS 2 has better-sounding voices, but it requires third-party libraries like espeak and some nltk stuff.
XTTS is not permissively licensed and breaks the idea of building these ggml libraries under MIT.

So the best solution for StyleTTS 2 is to do one of the following:

  • find a more permissively licensed phonemizer
  • dynamically link to espeak and build it under GPL (so it cannot be in this repository)
  • build a phonemizer from scratch in C/C++ specifically for this project

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024 1

@manmay-nakhashi And sounds good regarding collaboration! I posted here precisely to avoid duplicating effort; better to work together than unknowingly duplicate effort.

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024 1

The text-to-speech part of SeamlessM4T is not covered yet, but they implemented Speech-to-text translation (S2TT), Automatic speech recognition (ASR), and Text-to-text translation (T2TT).
They also have Speech-to-speech translation (S2ST) and Text-to-speech translation (T2ST) models in the same family, but the ggml implementation for them is still missing. Maybe there is hope they implement those too. :)
https://github.com/facebookresearch/seamless_communication/tree/main/ggml

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024 1

Hmm, would be nice to see the WIP project, but Fri works. Thank you for creating this project! Really looking forward to a faster way to run Tortoise

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024 1

If people are interested in contributing to tortoise.cpp, a great first task would be getting the tokenizer to always match the tokenization tortoise-tts uses. The tokenizer I'm using in tortoise.cpp seems to be able to load the tokenizer vocab, but the regex had issues with some of the special chars, which I band-aided at least for spaces. More perplexingly, the tortoise-tts tokenizer isn't greedy with respect to always choosing the longest possible next token, while the default tokenizer I copied from the ggml gpt-2 example seems to be greedy. So the task would be studying the tokenizer tortoise-tts uses and modifying the tokenizer in tortoise.cpp to exactly match its behavior. I can also come up with some other tasks to work on.
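To make the "greedy" distinction concrete, here is a minimal sketch of the longest-match behaviour a greedy tokenizer has against a plain vocab map (illustrative only, not the actual tortoise.cpp or tortoise-tts code):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Greedy longest-match tokenization: at each position, take the longest vocab
// entry that matches the upcoming text. A non-greedy tokenizer (as tortoise-tts
// apparently behaves) may pick a shorter token even when a longer one exists,
// so the two can disagree on the same input string.
std::vector<int> tokenize_greedy(const std::string & text,
                                 const std::map<std::string, int> & vocab,
                                 size_t max_token_len) {
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best_len = 0;
        int    best_id  = -1;
        const size_t limit = std::min(max_token_len, text.size() - pos);
        for (size_t len = limit; len >= 1; --len) {
            auto it = vocab.find(text.substr(pos, len));
            if (it != vocab.end()) { best_len = len; best_id = it->second; break; }
        }
        if (best_id < 0) { ++pos; continue; } // unknown character: real code would emit an <unk> token
        ids.push_back(best_id);
        pos += best_len;
    }
    return ids;
}
```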

from ggml.

bachittle avatar bachittle commented on July 19, 2024 1

Looks like it also uses the Meta encodec library. I think it will be crucial to get a good implementation of this in C++ as I have a feeling that more and more SOTA audio models are going to use this library (bark also comes to mind).

An initial implementation of encodec is found here: https://github.com/PABannier/encodec.cpp

from ggml.

IcedQuinn avatar IcedQuinn commented on July 19, 2024 1

Piper uses a VITS model (run through some conversion to ONNX) which runs quite quickly on chunky CPUs. It's not as shiny as some others mentioned here but is quite capable and known to compress well. (StyleTTS considers VITS a close competitor, in their examples.)

from ggml.

simplejackcoder avatar simplejackcoder commented on July 19, 2024

How about using vall-e?

from ggml.

noe avatar noe commented on July 19, 2024

How about using vall-e?

AFAIK Microsoft has not released the weights of VALL-E. They just uploaded the paper to arxiv and set up a demo website with some generation samples.

from ggml.

gavsidua avatar gavsidua commented on July 19, 2024

@ggerganov I hope you make a text-to-speech example in cpp

from ggml.

Martin-Laclaustra avatar Martin-Laclaustra commented on July 19, 2024

Here, there is a TTS pytorch model, which has available weights:
https://github.com/r9y9/deepvoice3_pytorch
I would be particularly interested in the implemented "nyanko" model (described in https://aclanthology.org/2020.lrec-1.789.pdf).
There are several stages of pre-processing in python, but if the model can be ported, porting those to c/c++ could be done afterwards.
@ggerganov, what's your assessment of the level of difficulty?

from ggml.

flosserblossom avatar flosserblossom commented on July 19, 2024

UP

from ggml.

Martin-Laclaustra avatar Martin-Laclaustra commented on July 19, 2024

VALL-E looks like a good candidate - but no weights.

It seems quite demanding in terms of the training data required (60k hours). Aiming at VALL-E X (multilingual) would be the natural choice (this requires 70k hours), but apparently (per the paper) it has been tested on only 2 languages so far. I think it is very unlikely that they release the model, and difficult to have a community-based one (at least for a breadth of languages).
Also, it might be quite demanding for inference (I know ggml is reaching unbelievable achievements by quantizing, etc., but still...).

On the contrary, the one I proposed (nyanko) gets acceptable quality with only ~20h (yes, hours!) of training data, and it can be trained for each language in just 3 days on a single GPU (single speaker). I trained models for 3 speakers (1 non-English language). Let me know if you would like to listen to the samples or test the Python implementation. Besides, Python inference on CPU is already real-time on modern systems. It would really have outstanding performance based on C.

I believe a desirable TTS would be "universal language" direct unicode text to wav converter, but I have not been able to spot such model.

from ggml.

Martin-Laclaustra avatar Martin-Laclaustra commented on July 19, 2024

With respect to VALL-E, there are 2 unofficial PyTorch implementations; neither of them implements VALL-E X (multilingual), and neither has released the weights (due to ethical concerns?).
https://github.com/enhuiz/vall-e
https://github.com/lifeiteng/vall-e
I do not have details on the weights size or training/inference requirements.

Compare that to a multilingual TTS with lots of available languages: larynx
https://github.com/rhasspy/larynx
The quality seems a bit lower. But the training work is done.
One may wonder what would be the real advantage of using ggml in this case.

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024

they don't provide any code, but

https://speechresearch.github.io/naturalspeech2/
https://arxiv.org/abs/2304.09116

We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.

still more diffusion models ...

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024

While Tango sounds cool, it's a text-to-audio model, not a text-to-speech model.

from ggml.

dennislysenko avatar dennislysenko commented on July 19, 2024

@ggerganov is there any possibility that Bark, ported to cpp, would be feasible to run on constrained devices like iPhones? Ex. a device with 4GB RAM and a tolerable limit of model size in the low 100s of MB.

from ggml.

dennislysenko avatar dennislysenko commented on July 19, 2024

@Green-Sky

even with the best "4-bit" quantization imaginable, you still end up with ~3 GB for the model files

Is the quantization referring to the "smaller model" released 05-01?

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024

@dennislysenko No, I was talking about ggml; not sure what changes they made in 1.5.

from ggml.

dennislysenko avatar dennislysenko commented on July 19, 2024

@Green-Sky
Seems like they now refer to smaller cards, as low as 2 GB, in their README:

The full version of Bark requires around 12Gb of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2Gb work with some additional settings.

05-01 release notes mention:

We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.

In theory, could this mean with 4x quantization, it's possible to target ~500MB VRAM?

from ggml.

Green-Sky avatar Green-Sky commented on July 19, 2024

it's possible to target ~500MB VRAM?

@dennislysenko ggml using VRAM is very optional; by default ggml only uses RAM and the CPU. :)

In theory, could this mean with 4x quantization,

Their description is very obscure and I don't have the time to look at the code, so maybe.

from ggml.

noe avatar noe commented on July 19, 2024

XTTS is based on tortoise-tts https://github.com/neonbjb/tortoise-tts, which uses a modified GPT-2. The embedding is different and it has two heads, one for mel and one for text. I think we can convert the tortoise-tts autoregressive model to ggml.

@manmay-nakhashi do you mean that ggml should be able to handle XTTS out of the box or would it need some adaptations?

from ggml.

wassname avatar wassname commented on July 19, 2024

The problem I had with Tortoise (and Bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024

@wassname GPT models can be unpredictable sometimes; fine-tuning on better speaker-segmented data can resolve this problem.

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024

I think converting tortoise-tts to ggml makes sense. Anyone willing to collab on converting tortoise-tts to ggml?

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024

The problem I had with Tortoise (and Bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.

That's just an implementation issue; look at that Python package, it's a nightmare imho.

Tortoise, especially the latest v2 of XTTS, is producing very good results. In many cases flawless speech to my ears.
It's transformers-based; I didn't look into the source, but it is likely a good candidate. The only one of today imho.
XTTS v2 just clones a voice in seconds (more or less closely) and then uses it for any language.

from ggml.

kskelm avatar kskelm commented on July 19, 2024

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024

I've soured on bark altogether. Even on their own discord, the thing regularly produces wildly unpredictable output, frequently with a thin resemblance to what you asked for.

something went wrong with your reply

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Tortoise please!!!

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Please, implement Tortoise instead of XTTS. XTTS is licensed under the ultra-restrictive CPML which completely prohibits ALL commercial use. Please help promote open-source by supporting Tortoise instead.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Yes. Perhaps someone could create a "merge" of XTTS and Tortoise, similar to the Tortoise Fast API. For example, using an autoregressive model + hifigan?

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

StyleTTS looks great. Really hope this gets implemented. Would love to have something similar to llama.cpp that supports many models (tts.cpp?)

from ggml.

eolasd avatar eolasd commented on July 19, 2024

Another upvote here for text-to-speech cpp.
Since I am a complete noob with this stuff, could someone give me the high-level steps needed for this to happen?
Does it require GPU time to re-train/quantize models, or is it mostly just writing code to port encoders, etc.?

Thanks, and appreciate all the work the community have put in to making this stuff work for the GPU-poor!

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

@eolasd It shouldn't require GPU time. For example, w/ llama.cpp, you don't need to retrain the models. Probably mostly just porting the Python inference code to C++ and getting the models to work with GGML, right?

from ggml.

kskelm avatar kskelm commented on July 19, 2024

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024

What I read out of the discussions so far is that training material for multi-language is being assembled, someone promised to sponsor 8×A100 for the ~3000 hours of training time, and that last step is currently open.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Main issue with StyleTTS is it uses IPA phonemes. Right now espeak is the only lib that works with STTS and it's GPL licensed
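For context, the espeak dependency in question boils down to one call. A rough sketch using espeak-ng's C API as I understand it (header path, voice name, and the IPA flag value should be double-checked against the installed speak_lib.h; linking against libespeak-ng is what pulls in the GPL):

```cpp
#include <cstdio>
#include <espeak-ng/speak_lib.h>

int main() {
    // Initialize without audio playback; we only want phonemization.
    if (espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0) < 0) return 1;
    espeak_SetVoiceByName("en-us");

    const char * text = "Hello world";
    const void * ptr  = text;
    // Bit 1 of phonememode selects IPA output (per the espeak-ng docs, as I recall).
    const char * phonemes = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, 0x02);
    printf("%s -> %s\n", text, phonemes ? phonemes : "(null)");

    espeak_Terminate();
    return 0;
}
```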

from ggml.

kskelm avatar kskelm commented on July 19, 2024

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024

I think XTTS is the best candidate currently; it's a multimodal GPT-2, so it should be relatively easy to port from the GPT-2 code that is already implemented.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

But not permissively licensed

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on Github in a week or so. I will release it with an MIT License.

from ggml.

manmay-nakhashi avatar manmay-nakhashi commented on July 19, 2024

I have a model conversion and some changes; you can create a repo and I'll contribute over there.

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

I will make a cleaner version of my repo public Friday (it's currently just a messy fork of ggml with a new folder, tortoise, in examples) if that sounds good (after the submission deadline for my class). I have a partially developed ggml file format for the model that my code uses; I'm building it tensor by tensor since I'm pretty new to ggml reverse engineering. I'm still working on the autoregressive forward pass, though it looks like I might be able to use a lot of ggml code from the existing ggml gpt-2 implementation. I have numbers matching the PyTorch forward pass for the text embeddings, which isn't much, but it shows that I can load tensors from the ggml file, construct a cgraph, and get the ggml ops to work. I also added a CUDA implementation for ggml_concat since I'm using a fork of ggml.
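For anyone who wants to help but hasn't touched ggml yet, the "load tensors, build a cgraph, run an op" loop described above looks roughly like this. A minimal, self-contained sketch of a token-embedding lookup (toy sizes and made-up weights instead of a real checkpoint; exact graph-API calls differ between ggml versions):

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    // Small scratch context; a real model would size this from the checkpoint.
    struct ggml_init_params params = { /*mem_size*/ 16 * 1024 * 1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(params);

    const int n_vocab = 8;  // toy sizes, for illustration only
    const int n_embd  = 4;
    const int n_tok   = 3;

    // Embedding table [n_embd x n_vocab] and a few input token ids.
    struct ggml_tensor * wte    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);
    struct ggml_tensor * tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_tok);
    for (int i = 0; i < n_embd * n_vocab; ++i) ggml_set_f32_1d(wte, i, 0.01f * i);
    ggml_set_i32_1d(tokens, 0, 1);
    ggml_set_i32_1d(tokens, 1, 5);
    ggml_set_i32_1d(tokens, 2, 2);

    // Text embeddings = row gather from the embedding table.
    struct ggml_tensor * emb = ggml_get_rows(ctx, wte, tokens); // [n_embd x n_tok]

    // Build and run the graph.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, emb);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads*/ 1);

    printf("emb[0,0] = %f\n", ggml_get_f32_1d(emb, 0));
    ggml_free(ctx);
    return 0;
}
```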

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024

Looks interesting too; it works quite well but does not produce the same quality as StyleTTS 2 (or XTTS v2). Seamless is a much bigger project.
If they integrate ggml that would certainly be a good thing; however, their models are not permissively licensed (non-commercial).

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on Github in a week or so. I will release it with an MIT License.

Thank you! Will it support quantization and Metal?

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

Thank you! Will it support quantization and Metal?

Initially it's CUDA-only, but I'm open to merging in whatever people are willing to contribute; the goal is to get an open source project going. Also, we can think about adding training and voice cloning etc., but the first goal (and the code I am more interested in writing myself as a starting point) is just getting inference to work for arbitrary text from hardcoded voice latents for the mol voice.

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Hmm, makes sense. Will you make a PR to merge your fork of GGML to the main repo?

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

The goal would be to upstream any changes I make to ggml so as not to use a weird version of it. I just forked ggml because the ggml gpt-2 implementation was a really nice template to start from. So far the only upstream change I made was adding a CUDA concatenation kernel because for some reason it was CPU only previously.
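For context on what that op does: ggml_concat joins two tensors along one axis (the third dimension, if I recall the version at the time correctly), so the kernel is essentially an index-shifted copy. A plain CPU sketch of the idea (hypothetical standalone code, not the actual ggml or CUDA implementation; the CUDA kernel would assign one thread per output element instead of looping):

```cpp
#include <vector>

// Concatenate two contiguous [d0 x d1 x d2a] and [d0 x d1 x d2b] float blocks
// along the third dimension, producing a [d0 x d1 x (d2a + d2b)] block.
std::vector<float> concat_dim2(const std::vector<float> & a, const std::vector<float> & b,
                               int d0, int d1, int d2a, int d2b) {
    std::vector<float> out(static_cast<size_t>(d0) * d1 * (d2a + d2b));
    for (int i2 = 0; i2 < d2a + d2b; ++i2) {
        for (int i1 = 0; i1 < d1; ++i1) {
            for (int i0 = 0; i0 < d0; ++i0) {
                const size_t dst = (static_cast<size_t>(i2) * d1 + i1) * d0 + i0;
                out[dst] = (i2 < d2a)
                    ? a[(static_cast<size_t>(i2)       * d1 + i1) * d0 + i0]
                    : b[(static_cast<size_t>(i2 - d2a) * d1 + i1) * d0 + i0];
            }
        }
    }
    return out;
}
```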

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

Hi,
is the tortoise.cpp repo public?

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

I technically can't make it public before 9pm today, but I was thinking Friday so I have some time to do some cleanup work. I don't see why I couldn't release it sooner. Would you prefer if I released it sooner than Friday?

from ggml.

balisujohn avatar balisujohn commented on July 19, 2024

Hopefully it ends up being useful! Just to level expectations, I want to emphasize the project is nowhere near done. It should be public and ready for contributors by Friday, but it will definitely not be anywhere close to a complete forward pass by then.

from ggml.

kskelm avatar kskelm commented on July 19, 2024

from ggml.

cmp-nct avatar cmp-nct commented on July 19, 2024

The tiny number of examples in WhisperSpeech is concerning.
Compare it with this: https://styletts.github.io/

from ggml.

fakerybakery avatar fakerybakery commented on July 19, 2024

If you try it on Colab it's actually quite good (not as good as XTTS but not bad) but definitely not as fast as StyleTTS

from ggml.

bachittle avatar bachittle commented on July 19, 2024

I think the main thing to consider here is that it does multilingual very well (StyleTTS only does English) and its architecture is very similar to Whisper, so I assume we could borrow from whisper.cpp.

from ggml.
