I am trying to train my model on a Marathi dataset. It is strange espeak doesnt seem t

Problem training for new language. about forwardtacotron HOT 17 CLOSED

as-ideas commented on August 15, 2024

Problem training for new language.

from forwardtacotron.

Comments (17)

cschaefer26 commented on August 15, 2024

On a side note: if you need a better phonemizer, you could check out my new repo https://github.com/as-ideas/DeepPhonemizer. You will have to train your own phonemizer on an Indian phoneme dataset. You could then use ForwardTacotron with 'no_cleaners' and 'use_phonemes=False'.

cschaefer26 commented on August 15, 2024

Hi, it seems that it's loading an empty wav file and breaks when trying to perform an np.max() operation, which is unrelated to the text. The workaround for the phonemes seems fine to me.
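
A minimal sketch for locating such files before training, assuming the dataset consists of PCM WAV files that Python's standard `wave` module can read. The function name `find_short_wavs` and the `min_seconds` threshold are illustrative and not part of ForwardTacotron:

```python
import wave
from pathlib import Path
from typing import List, Tuple


def find_short_wavs(wav_dir: str, min_seconds: float = 0.1) -> List[Tuple[str, float]]:
    """Return (file name, duration in seconds) for WAVs shorter than min_seconds.

    Files with zero frames get a duration of 0.0. These are the ones on which
    np.max() would fail, because NumPy cannot reduce an empty array.
    """
    flagged = []
    for path in sorted(Path(wav_dir).glob('*.wav')):
        with wave.open(str(path), 'rb') as wav:
            duration = wav.getnframes() / wav.getframerate()
        if duration < min_seconds:
            flagged.append((path.name, duration))
    return flagged
```

Running this on the output directory of a silence-trimming step should show whether the trimming reduced any clips to nothing.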

prajwaljpj commented on August 15, 2024

My trim_trailing_leading_silence script was the culprit for this. I will check the durations for the original audio and keep you updated.

Would there be a performance decrease with this workaround?

Also, does ForwardTacotron support multi-language TTS, assuming I use espeak-ng to generate phones? (Marathi + Hindi), (English + Hindi)

cschaefer26 commented on August 15, 2024

I found that the pronunciation is more stable with phonemes, but it should work pretty OK with chars too. Multi-language would work, I guess, but it requires you to preprocess the datasets separately with the corresponding languages (or char-wise encodings) and then merge them. I will try something similar in the context of multispeaker. Is the same speaker speaking both languages?
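
A rough sketch of the "preprocess separately, then merge" idea, assuming each language's preprocessing produces a dict that maps utterance ids to phoneme (or character) strings. The function `merge_datasets`, the language-tag prefix, and the sample strings are hypothetical placeholders, not ForwardTacotron code or real espeak output:

```python
from typing import Dict, Set, Tuple


def merge_datasets(per_language: Dict[str, Dict[str, str]]) -> Tuple[Dict[str, str], Set[str]]:
    """Merge per-language {utterance_id: text} dicts into one dataset.

    Ids are prefixed with the language tag so that identical ids from
    different corpora do not overwrite each other. Also returns the union
    of all symbols, which the model's symbol set would have to cover.
    """
    merged: Dict[str, str] = {}
    symbols: Set[str] = set()
    for lang, items in per_language.items():
        for utt_id, text in items.items():
            merged[f'{lang}_{utt_id}'] = text
            symbols.update(text)
    return merged, symbols


# Placeholder data: both corpora happen to use the id '0001'.
hindi = {'0001': 'nəməsteː'}
english = {'0001': 'həloʊ'}
merged, symbols = merge_datasets({'hi': hindi, 'en': english})
```

The returned symbol union is useful as a sanity check against the symbol set the model is configured with.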

prajwaljpj commented on August 15, 2024

Unfortunately, the same speaker is not speaking both languages, i.e., Hindi and English. But we have the resources to get annotations.
How many hours are required for both languages (assuming Hindi is the dominant language)?

cschaefer26 commented on August 15, 2024

Hi, then you would probably need to inject a speaker id. I have a branch called 'multispeaker' that's able to do this, but you would have to mess around with the data prep etc.

prajwaljpj commented on August 15, 2024

@cschaefer26 Is there a reason you are taking only 500 speakers in the multispeaker branch? here
I trained the LibriTTS recipe, which uses x-vector features, with 500 speakers (ranging from 15 minutes to 1.5 hours per speaker), and the results were really bad.
samples.zip

Do you have any benchmarks for multispeaker? Also, I feel we would need at least one hour of data per speaker to train this model. Any thoughts?

P.S. I am extremely grateful for your help regarding all of my queries.

cschaefer26 commented on August 15, 2024

No, it just needs to be a number larger than the actual number of speakers. LibriTTS is pretty low quality; I got pretty decent results with VCTK. You're going to have to massage the dataset quite a bit, cutting silences etc., to make good use of a multispeaker dataset. VCTK has about 400 samples per speaker.
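
To illustrate why the exact value of that number does not matter, here is a small sketch that maps speaker names to integer ids and checks them against an upper bound. It assumes the bound acts as the capacity for speaker ids (for example, the size of an embedding table), which is an interpretation rather than something confirmed in the branch. `build_speaker_ids` and `n_speakers` are hypothetical names:

```python
from typing import Dict, Iterable


def build_speaker_ids(speakers: Iterable[str], n_speakers: int) -> Dict[str, int]:
    """Map each distinct speaker name to an integer id in [0, n_speakers)."""
    unique = sorted(set(speakers))
    if len(unique) > n_speakers:
        raise ValueError(
            f'{len(unique)} speakers found, but only {n_speakers} ids are available')
    return {name: idx for idx, name in enumerate(unique)}


# Speaker names in the style of VCTK; a bound of 500 leaves plenty of headroom.
speaker_ids = build_speaker_ids(['p225', 'p226', 'p225'], n_speakers=500)
```

Any bound that is at least the number of distinct speakers passes the check; unused ids simply stay unused.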

prajwaljpj commented on August 15, 2024

Is there a phonemizer for Indian English? I trained on my Indian dataset using en and en-us, and it gave me pretty bad results. A lot of words are being swallowed. Are there any alternatives for g2p for Indian English?

prajwaljpj commented on August 15, 2024

> Is there a phonemizer for Indian English? I trained on my Indian dataset using en and en-us, and it gave me pretty bad results. A lot of words are being swallowed. Are there any alternatives for g2p for Indian English?

Indian English is finally working after the FastPitch implementation, even while using the British English phonemizer. Thanks.

rasenganai commented on August 15, 2024

@prajwaljpj @cschaefer26 I'm also working on a Hindi dataset, and I found that in the current master implementation the phoneme_set is not complete. For example, the phoneme conversion of the word 'भालू' is "bʰaːluː", but since 'ʰ' is not present in phoneme_set, it is removed, giving "baːluː", which sounds wrong.
One way could be to update the phoneme set with all the IPA symbols.
I wanted to know: if I update the symbols, can I fine-tune the pre-trained model on my dataset? And is there any better workaround?
Thanks.

cschaefer26 commented on August 15, 2024

Hi, yeah, that's unfortunate. The only way for you to deal with this currently would be to fork the repo and update the phoneme symbols. I am thinking of making the phoneme set flexible (to be provided by the config), but that would be a major change. After changing the phoneme set, you will for sure have to retrain everything (tacotron and forward-tacotron).
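
Before forking, it may help to find out which symbols would be dropped silently. The sketch below counts symbols in phonemized text that are missing from a given phoneme set and mimics the lossy filtering described above. `missing_symbols`, `filter_to_set`, and `toy_set` are illustrative names; the toy set is deliberately tiny and is not the repo's actual phoneme set:

```python
from collections import Counter
from typing import Iterable, Set


def missing_symbols(phonemized: Iterable[str], phoneme_set: Set[str]) -> Counter:
    """Count symbols in the phonemized texts that are not in phoneme_set."""
    missing = Counter()
    for text in phonemized:
        missing.update(ch for ch in text if ch not in phoneme_set)
    return missing


def filter_to_set(text: str, phoneme_set: Set[str]) -> str:
    """Mimic the lossy behavior described above: drop unknown symbols."""
    return ''.join(ch for ch in text if ch in phoneme_set)


# Toy phoneme set that lacks the aspiration mark 'ʰ' (illustrative only).
toy_set = set('abluː')
print(missing_symbols(['bʰaːluː'], toy_set))  # Counter({'ʰ': 1})
print(filter_to_set('bʰaːluː', toy_set))      # baːluː
```

Running `missing_symbols` over the whole phonemized corpus gives the list of symbols that a fork would need to add.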

prajwaljpj commented on August 15, 2024

@cschaefer26 Thank you for your suggestion. I had a look into DeepPhonemizer. Looks really promising for situations where there is a mix of languages because the espeak phonemizer does not have good language switching capability. Is there a benchmark on how much latency DeepPhonemizer will introduce?

cschaefer26 commented on August 15, 2024

Yeah, I built the repo mainly because we found that the espeak phonemizer makes too many mistakes (for German). If you use the forward transformer for the DeepPhonemizer, the latency is pretty low: for German I could phonemize around 60 sentences per second on my Mac without any additional parallelization (although, to be fair, the German dictionary is quite large). It's for sure faster than the espeak phonemizer.

m-toman commented on August 15, 2024

@cschaefer26

> Yeah, I built the repo mainly because we found that the espeak phonemizer makes too many mistakes (for German). If you use the forward transformer for the DeepPhonemizer, the latency is pretty low: for German I could phonemize around 60 sentences per second on my Mac without any additional parallelization (although, to be fair, the German dictionary is quite large). It's for sure faster than the espeak phonemizer.

Hmm, that's interesting... for German (75 characters sentence) I get

time for i in {1..60}; do ../tacorn/tools/espeak-ng -v de -x --ipa -q -f test-de.txt ; done

real	0m0,659s
user	0m0,364s
sys	0m0,191s

Perhaps the issue is calling it from Python via subprocess.run, which forks and causes us some headaches with OOMs.

Or perhaps you run into this issue?
mozilla/TTS#417 (comment)

Btw, what do you plan to use for normalization? espeak isn't great, but there's not too much out there that works well enough.
I recently found https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tools/text_normalization.html
but on a first try it mixed up dates, numbers etc. just as much.

cschaefer26 commented on August 15, 2024

I think there is a lot of overhead in re-initializing espeak. If you provide a list to the Python phonemizer wrapper, it's probably a lot faster. As for normalization, that's another story, and I probably won't target it in the phonemizer repo as it's quite language dependent.
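
A minimal sketch of the batching idea: call the phonemizer backend once per batch of sentences instead of once per sentence, so that any startup cost is paid far less often. The helper `phonemize_in_batches` and its `phonemize_fn` parameter are hypothetical. The backend is passed in as a callable so that the sketch does not depend on a particular library:

```python
from typing import Callable, List, Sequence


def phonemize_in_batches(sentences: Sequence[str],
                         phonemize_fn: Callable[[List[str]], List[str]],
                         batch_size: int = 64) -> List[str]:
    """Call phonemize_fn once per batch instead of once per sentence."""
    result: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = list(sentences[start:start + batch_size])
        out = phonemize_fn(batch)
        if len(out) != len(batch):
            raise ValueError('backend returned a different number of outputs')
        result.extend(out)
    return result
```

With the `phonemizer` package and espeak installed, `phonemize_fn` could for instance be `functools.partial(phonemize, language='de', backend='espeak')`, since `phonemize` accepts a list of strings. Whether this closes the gap to the timings above would have to be measured.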

rasenganai commented on August 15, 2024

Thanks for the clarification. I'll try to do the same.
