Hi, thank you for the great repo. I am trying to implement a version for multi speaker si

confuse about duration extract about forwardtacotron HOT 10 CLOSED

as-ideas commented on September 18, 2024

confuse about duration extract

from forwardtacotron.

Comments (10)

cschaefer26 commented on September 18, 2024

Hi, first of all, good luck with the multispeaker implementation. I have actually played around with it (branch multispeaker, very old). Regarding your question: the restriction to mel_len is usually there in case one wants to do batched inference (to remove the paddings). For batch size=1 the lengths should match.
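
To illustrate the point about paddings, here is a minimal sketch of counting per-character durations from a padded attention matrix. It is not the repository's actual extraction code: the function name `durations_from_attention`, the plain-list representation of the matrix, and the simple argmax-per-frame rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def durations_from_attention(
    att: Sequence[Sequence[float]], mel_len: int, text_len: int
) -> List[int]:
    """Count how many mel frames attend most strongly to each character.

    att is a padded attention matrix of shape (max_mel_len, max_text_len).
    Only the first mel_len frames and first text_len characters are real;
    everything beyond that is batch padding and must be ignored.
    """
    durations = [0] * text_len
    for frame in att[:mel_len]:  # drop padded mel frames
        weights = frame[:text_len]  # drop padded characters
        best = max(range(text_len), key=lambda i: weights[i])
        durations[best] += 1
    return durations


# Toy batch item: 5 real mel frames, 3 real characters, padded to 6 x 4.
att = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],  # padding frame
]
durs = durations_from_attention(att, mel_len=5, text_len=3)

# Every real frame is assigned to exactly one character,
# so the durations sum to the mel length.
assert sum(durs) == 5
```

Without the trimming, the padding frame would also be counted, and the durations would no longer sum to the real mel length.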

thanhlong1997 commented on September 18, 2024

Oh, I see. Your code extracts the alignment from Tacotron with a forward pass, not with inference. That is why we need to take the alignment matrix only up to mel_len, to remove the padding values. When I replaced your Tacotron model, I thought the alignment matrix had to be extracted by running inference, and I got the warning "Sum of durations did not match mel length". Now I know why that happens.
Thank you for the explanation.

thanhlong1997 commented on September 18, 2024

Can I leave the issue open until I finish the implementation?

Thank you, sir.

cschaefer26 commented on September 18, 2024

Sure, keep me updated!

thanhlong1997 commented on September 18, 2024

Hi, I have successfully implemented a multispeaker version of ForwardTacotron. The results sound good. We are still using a pretrained Tacotron (in my version, Mellotron) to extract the alignment between characters and the mel spectrogram, but I see that FastPitch and FastSpeech now use the Montreal Forced Aligner (MFA) to extract alignments. Have you tried it before? Is it better or worse than using Tacotron for the extraction? I have tried it with FastPitch, but the results are still poor.

cschaefer26 commented on September 18, 2024

Hi, I tried using the MFA for duration extraction before and found it to be slightly worse. There was also quite some fiddling involved with mapping the phonemes. It wasn't totally bad though.
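
For readers wondering what duration extraction from a forced aligner involves, the following is a rough sketch of turning phoneme time intervals into mel-frame durations. It does not use any MFA API: the intervals are assumed to have already been read from the aligner output as contiguous `(start, end)` pairs in seconds, and the function name, sample rate and hop length are illustrative assumptions.

```python
from typing import List, Tuple


def intervals_to_durations(
    intervals: List[Tuple[float, float]], sample_rate: int, hop_length: int
) -> List[int]:
    """Convert (start, end) times in seconds into durations in mel frames.

    Boundaries are rounded to frame indices first and then differenced,
    so for contiguous intervals the rounding errors do not accumulate.
    """
    frames_per_second = sample_rate / hop_length
    durations = []
    for start, end in intervals:
        start_frame = round(start * frames_per_second)
        end_frame = round(end * frames_per_second)
        durations.append(end_frame - start_frame)
    return durations


# Hypothetical alignment for three phonemes, 22050 Hz audio, hop of 256 samples.
intervals = [(0.0, 0.10), (0.10, 0.25), (0.25, 0.40)]
durations = intervals_to_durations(intervals, sample_rate=22050, hop_length=256)
```

The total can still differ from the actual mel length by a frame or so, depending on how the spectrogram is padded, so a final check against the mel length is usually needed.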

thanhlong1997 commented on September 18, 2024

Thank you, sir.
One more question: I am currently using graphemes to encode the text, since I have to work with multilingual data. Would phonemes be better? In FastPitch I also saw that they use both graphemes and phonemes.

cschaefer26 commented on September 18, 2024

In my experience phonemes trump graphemes big time because of the bijective nature of phonemes, which is easier for the TTS model to learn. The result is much more stable pronunciation and prosody. For multilingual data, simply phonemize your metafile upfront and set use_phonemes=False in the config. You could check out https://github.com/as-ideas/DeepPhonemizer for a stable phonemizer (you might need to train your own phonemizer model for your needs, but it's probably worth it).
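
As a sketch of what phonemizing the metafile upfront could look like: the snippet below assumes an LJSpeech-style metafile with one `id|text` entry per line, and takes the phonemizer as a plain callable. The names `phonemize_metafile_lines` and `toy_phonemize`, and the tiny lexicon, are hypothetical stand-ins. In practice the callable would wrap a real phonemizer (for example a trained DeepPhonemizer model), whose API is not shown here.

```python
from typing import Callable, Iterable, List


def phonemize_metafile_lines(
    lines: Iterable[str], phonemize: Callable[[str], str]
) -> List[str]:
    """Replace the text column of 'id|text' lines with its phonemized form."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        file_id, text = line.split("|", maxsplit=1)
        out.append(f"{file_id}|{phonemize(text)}")
    return out


# Toy stand-in for a real phonemizer: a lookup in a tiny hand-written lexicon.
toy_lexicon = {"hello": "həloʊ", "world": "wɜːld"}


def toy_phonemize(text: str) -> str:
    return " ".join(toy_lexicon.get(word, word) for word in text.lower().split())


result = phonemize_metafile_lines(
    ["utt_001|Hello world\n", "utt_002|world\n"], toy_phonemize
)
```

With the text already phonemized, the training pipeline treats each phoneme symbol as an ordinary input symbol, which is why use_phonemes=False is set in the config.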

thanhlong1997 commented on September 18, 2024

Thank you, sir, for your advice. Since I switched from graphemes to phonemes, the results are promising. Now I would like to deploy this model to a device such as a mobile phone or a camera, but the model is large, so I have to compress it, and the result after compression is not good. Have you done this before? What can I do to deploy to a device and still keep the quality?
Thank you, sir.

thanhlong1997 commented on September 18, 2024

Do you think non-autoregressive models like ForwardTacotron or FastSpeech 2 are not as good as autoregressive models like Tacotron 2? When I try both ForwardTacotron and FastSpeech on the same dataset, I find that the generated audio is not as good as Tacotron 2's, even though they outperform Tacotron 2 in speed. I am trying to improve your model and FastSpeech 2 to be comparable to Tacotron 2, but it seems too hard.
