Comments (47)
It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.
Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model.
https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)
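A minimal sketch of that resizing idea in PyTorch, assuming the pretrained text embedding is a plain `nn.Embedding` with 219 rows of width 192 (the shape printout later in this thread suggests so). `resize_embedding` is a hypothetical helper name, not a function from MeloTTS or transformers, and the initialization of the extra rows is an assumption.

```python
import torch
from torch import nn


def resize_embedding(old: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Return a new embedding whose first rows are copied from `old`."""
    old_num_tokens, dim = old.weight.shape
    new = nn.Embedding(new_num_tokens, dim)
    # Initialization for the extra rows is an assumption; adjust as needed.
    nn.init.normal_(new.weight, mean=0.0, std=dim ** -0.5)
    rows = min(old_num_tokens, new_num_tokens)
    with torch.no_grad():
        # Keep the pretrained rows so existing symbols keep their weights.
        new.weight[:rows].copy_(old.weight[:rows])
    return new


pretrained = nn.Embedding(219, 192)  # stands in for the base model's table
resized = resize_embedding(pretrained, 360)
```

The copy is only meaningful if every existing symbol keeps its original index in the symbol list.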
from melotts.
@jeremy110 Don't worry, this is not your fault at all!
We are here to discuss and find a solution.
I will keep you updated if I find something. @jadechip @jeremy110
from melotts.
This is my implementation (not optimized 😞) of @jeremy110's tone handling. I am also trying to change @jadechip's phonemizer, which sometimes returns characters instead of phonemes.
https://gist.github.com/BankNatchapol/1276e34dcb51c521536978859dd948cd
But the problem now is that the phonemizer is not that good: it often returns repeated phonemes for some reason.
from melotts.
Hello @jadechip,
Let me know if it's working.
I'm training for Indonesian and Malay,
and I changed the phonemizer and BERT as well.
After 10 epochs the model doesn't produce any recognizable words, only noise and some random vowels.
My data:
~200 hours of audio
~500 speakers
from melotts.
I see, thank you for the heads up @jeremy110 🙏
I've updated my code to reflect your suggestion. Now I have:
# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + th_symbols
sil_phonemes_ids = [symbols.index(i) for i in pu_symbols]
# combine all tones
num_tones = num_zh_tones + num_ja_tones + num_en_tones + num_kr_tones + num_es_tones + num_fr_tones + num_de_tones + num_ru_tones + num_th_tones
# language maps
language_id_map = {"ZH": 0, "JP": 1, "EN": 2, "ZH_MIX_EN": 3, 'KR': 4, 'ES': 5, 'SP': 5, 'FR': 6, 'TH': 7}
num_languages = len(language_id_map.keys())
I'll try running a new training job to evaluate performance with these changes.
from melotts.
Thanks @jadechip and @jeremy110.
I'll try it in my environment too and see if it works.
from melotts.
btw I am currently training on a subset of Thai commonvoice 13, converted to .wav with a sample rate of 48 kHz.
Edit: Happy weekend everyone 🎉
from melotts.
hello~ @jadechip
My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code.
grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)
Finally, I'm attaching my tensorboard for reference. (https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing)
Additionally, if the silence before and after your audio files is shorter, your g/dur will converge to a smaller value, which will also affect the length of the silence before and after the inference.
I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48kHz. I remember that the code will resample it. I think you can start by testing whether it can be trained with 10 hours of data from one person.
I hope this is helpful for you.
from melotts.
hello~ @acul3 @jadechip
Sorry, I spent some time looking at that, but since I can't read Thai, I did some online research. I wanted to ask about the symbols from line 266 to 339 in the th_symbols . Are those symbols not IPA?
Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So, you need to convert these symbols into tones and then add the corresponding number of tones to the 'tones' list based on the number of phones in your phone list.
But I'm confused about lines 5908 to 5910. Which one is correct?
from melotts.
Great job! I plan to work on training a Thai TTS model using MeloTTS too.
from melotts.
Hi - Thanks for the contribution. We suggest you first train on the Thai dataset to see if the code works. We have not yet attempted to train on Thai ourselves.
from melotts.
@Zengyi-Qin Sounds good, will report back once I have proper training results.
from melotts.
Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it!
from melotts.
@jadechip
Sure! There are several datasets such as TSync2, Lotus, etc. You can check several of them here: https://github.com/korakot/corpus/releases/tag/v1.0 with documentation at https://lexitron.nectec.or.th/KM_HL5001/file_HL5001/Document/krrn_14518.pdf.
There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus.
However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way, you can generate a lot of samples, but you may need to fine-tune it for the Thai language 😄.
I am reviewing your commits too. They mostly look great 🎆 , but I found some points that need to be clarified.
I will clarify and let you know if there is a point that may need to be adjusted in terms of Thai linguistic knowledge.
from melotts.
@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏
I was also looking at this other nectec dataset: https://github.com/vistec-AI/dataset-releases/releases/tag/v1
I'll work on creating transcriptions next and report back.
from melotts.
@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error:
output
⚡ add-thai ~/MeloTTS/melo torchrun --nproc_per_node=1 --master_port=10902 train.py --c data/thai/config.json --model thai
2024-05-07 15:24:58.152 | INFO | data_utils:_filter:64 - Init dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141910/141910 [00:04<00:00, 32864.77it/s]
2024-05-07 15:25:02.475 | INFO | data_utils:_filter:84 - min: 65; max: 987
2024-05-07 15:25:02.475 | INFO | data_utils:_filter:85 - skipped: 327, total: 141910
buckets: [92994, 31326, 11604, 4350, 1068, 156, 84, 24]
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
2024-05-07 15:25:02.699 | INFO | data_utils:_filter:64 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32832.13it/s]
2024-05-07 15:25:02.700 | INFO | data_utils:_filter:84 - min: 164; max: 625
2024-05-07 15:25:02.700 | INFO | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
0%| | 0/23601 [00:01<?, ?it/s]
Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 194, in __getitem__
return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 98, in get_audio_text_speaker_pair
bert, ja_bert, phones, tone, language = self.get_text(
File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 180, in get_text
raise
RuntimeError: No active exception to reraise
...it seems to happen around line 200 in train.py
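The `RuntimeError: No active exception to reraise` at the bottom of the traceback is what Python reports when a bare `raise` runs outside an `except` block, so the traceback says nothing about what actually failed inside `get_text`. The sketch below reproduces that pattern with hypothetical function names; it is not the code from `data_utils.py`.

```python
def lookup_masked(items, index):
    """Prints the real error, then fails with an unrelated RuntimeError."""
    try:
        value = items[index]
    except IndexError as err:
        print(err)  # e.g. "list index out of range"
        value = None
    if value is None:
        raise  # no exception is active here any more
    return value


def lookup_chained(items, index):
    """Keeps the original error attached to the one that is raised."""
    try:
        return items[index]
    except IndexError as err:
        raise ValueError(f"unknown index {index!r}") from err
```

With the second form, the traceback shows both the `ValueError` and the `IndexError` that caused it, which makes the failing sample much easier to find.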
config.json
{
"train": {
"log_interval": 200,
"eval_interval": 1000,
"seed": 52,
"epochs": 10000,
"learning_rate": 0.0003,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 6,
"fp16_run": false,
"lr_decay": 0.999875,
"segment_size": 16384,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"skip_optimizer": true
},
"data": {
"training_files": "data/thai/train.list",
"validation_files": "data/thai/val.list",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 128,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 1,
"cleaned_text": true,
"spk2id": {
"TH-default": 0
}
},
"model": {
"use_spk_conditioned_encoder": true,
"use_noise_scaled_mas": true,
"use_mel_posterior_encoder": false,
"use_duration_discriminator": true,
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"n_layers_trans_flow": 3,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
2,
2,
2
],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
8,
2,
2
],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 256
},
"num_languages": 9,
"num_tones": 17,
"symbols": [
"_",
"\"",
"(",
")",
"*",
"/",
":",
"AA",
"E",
"EE",
"En",
"N",
"OO",
"Q",
"V",
"[",
"\\",
"]",
"^",
"a",
"a:",
"aa",
"ae",
"ah",
"ai",
"an",
"ang",
"ao",
"aw",
"ay",
"b",
"by",
"c",
"ch",
"d",
"dh",
"dy",
"e",
"e:",
"eh",
"ei",
"en",
"eng",
"er",
"ey",
"f",
"g",
"gy",
"h",
"hh",
"hy",
"i",
"i0",
"i:",
"ia",
"ian",
"iang",
"iao",
"ie",
"ih",
"in",
"ing",
"iong",
"ir",
"iu",
"iy",
"j",
"jh",
"k",
"ky",
"l",
"m",
"my",
"n",
"ng",
"ny",
"o",
"o:",
"ong",
"ou",
"ow",
"oy",
"p",
"py",
"q",
"r",
"ry",
"s",
"sh",
"t",
"th",
"ts",
"ty",
"u",
"u:",
"ua",
"uai",
"uan",
"uang",
"uh",
"ui",
"un",
"uo",
"uw",
"v",
"van",
"ve",
"vn",
"w",
"x",
"y",
"z",
"zh",
"zy",
"~",
"æ",
"ç",
"ð",
"ø",
"ŋ",
"œ",
"ɐ",
"ɑ",
"ɒ",
"ɔ",
"ɕ",
"ə",
"ɛ",
"ɜ",
"ɡ",
"ɣ",
"ɥ",
"ɦ",
"ɪ",
"ɫ",
"ɬ",
"ɭ",
"ɯ",
"ɲ",
"ɵ",
"ɸ",
"ɹ",
"ɾ",
"ʁ",
"ʃ",
"ʊ",
"ʌ",
"ʎ",
"ʏ",
"ʑ",
"ʒ",
"ʝ",
"ʲ",
"ˈ",
"ˌ",
"ː",
"̃",
"̩",
"β",
"θ",
"ก",
"ข",
"ฃ",
"ค",
"ฅ",
"ฆ",
"ง",
"จ",
"ฉ",
"ช",
"ซ",
"ฌ",
"ญ",
"ฎ",
"ฏ",
"ฐ",
"ฑ",
"ฒ",
"ณ",
"ด",
"ต",
"ถ",
"ท",
"ธ",
"น",
"บ",
"ป",
"ผ",
"ฝ",
"พ",
"ฟ",
"ภ",
"ม",
"ย",
"ร",
"ล",
"ว",
"ศ",
"ษ",
"ส",
"ห",
"ฬ",
"อ",
"ฮ",
"ะ",
"ั",
"า",
"ำ",
"ิ",
"ี",
"ึ",
"ื",
"ุ",
"ู",
"เ",
"แ",
"โ",
"ใ",
"ไ",
"ๅ",
"็",
"่",
"้",
"์",
"๐",
"๑",
"๒",
"๓",
"๔",
"๕",
"๖",
"๗",
"๘",
"๙",
"ᄀ",
"ᄁ",
"ᄂ",
"ᄃ",
"ᄄ",
"ᄅ",
"ᄆ",
"ᄇ",
"ᄈ",
"ᄉ",
"ᄊ",
"ᄋ",
"ᄌ",
"ᄍ",
"ᄎ",
"ᄏ",
"ᄐ",
"ᄑ",
"ᄒ",
"ᅡ",
"ᅢ",
"ᅣ",
"ᅤ",
"ᅥ",
"ᅦ",
"ᅧ",
"ᅨ",
"ᅩ",
"ᅪ",
"ᅫ",
"ᅬ",
"ᅭ",
"ᅮ",
"ᅯ",
"ᅰ",
"ᅱ",
"ᅲ",
"ᅳ",
"ᅴ",
"ᅵ",
"ᆨ",
"ᆫ",
"ᆮ",
"ᆯ",
"ᆷ",
"ᆸ",
"ᆼ",
"ㄸ",
"!",
"?",
"…",
",",
".",
"'",
"-",
"¿",
"¡",
"SP",
"UNK"
]
}
from melotts.
Nevermind, I was able to pinpoint the issue, I didn't realize you needed to add the language code here as well:
I've updated my PR with the missing code.
It seems like it is training correctly now, although I am still getting some warnings/exceptions:
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
0%| | 0/23601 [00:00<?, ?it/s][W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [99168, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Evaluating ...
Evauate done
0%|▍ | 74/23601 [03:24<11:00:36, 1.68s/it]min value is tensor(-1.1265)
Will try to run the complete training loop on some H100s 🤞
from melotts.
Thank you @jeremy110.
If I understand correctly, in melo/models.py we should first initialize the TextEncoder with the original size of 219 so that the pretrained weights can be loaded, like this:
# models.py
self.enc_p = TextEncoder(
    219,  # Initialize with the original symbol size
    inter_channels,
    hidden_channels,
    filter_channels,
    n_heads,
    n_layers,
    kernel_size,
    p_dropout,
    gin_channels=self.enc_gin_channels,
    num_languages=num_languages,
    num_tones=num_tones,
)
...then right after, add a check for whether n_vocab (len(symbols)) has a different size, and if so replace self.enc_p.emb with the resized embeddings?
if n_vocab != 219:
    old_embeddings = self.enc_p.emb
    new_num_tokens = n_vocab
    self.enc_p.emb = self.get_resized_embeddings(old_embeddings, new_num_tokens)
Does that look correct to you?
Note: I've updated my PR to reflect this.
from melotts.
hello~ @jadechip
Yes, it looks fine as it is.
However, in symbols.py, you'll need to make some modifications. If you place your new symbol inside the sorted list and then use the method above, it may result in some symbols having weights that don't match up with the original model. So, I suggest you do it like this.
# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols # add new symbols here
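Why the position matters: a sorted list re-orders itself when new entries are added, so existing symbols can shift to new indices and no longer line up with the pretrained embedding rows. A small sketch with made-up symbol lists (the real lists live in `symbols.py`; every list below is a stand-in):

```python
pad = "_"
base_symbols = ["a", "b", "k", "ŋ"]    # stand-in for the original symbols
pu_symbols = ["!", "?", "SP", "UNK"]   # stand-in for punctuation symbols
new_symbols = ["d", "ก"]               # stand-in for the added symbols

original = [pad] + sorted(set(base_symbols)) + pu_symbols

# Option A: new symbols mixed into the sorted part.
mixed = [pad] + sorted(set(base_symbols + new_symbols)) + pu_symbols

# Option B: new symbols appended at the very end.
appended = [pad] + sorted(set(base_symbols)) + pu_symbols + new_symbols


def moved(old, new):
    """Symbols from `old` whose index differs in `new`."""
    return [s for i, s in enumerate(old) if new.index(s) != i]


print(moved(original, mixed))     # existing symbols that changed index
print(moved(original, appended))  # []
```

Appending keeps every original index intact, so only the appended rows need to be learned from scratch.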
from melotts.
Ok, I was able to run a training job for around 9k steps yesterday. I tried running inference using the new checkpoint, but it seems to produce unintelligible sounds. I think the learning rate looks ok though? ...so I will try ramping up the batch size and training for longer on multiple GPUs and report back with my results 🤞
For reference here is my current config and Tensorboard metrics.
{
"train": {
"log_interval": 200,
"eval_interval": 1000,
"seed": 52,
"epochs": 10000,
"learning_rate": 0.0003,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 16,
"fp16_run": false,
"lr_decay": 0.999875,
"segment_size": 16384,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"skip_optimizer": true
},
"data": {
"training_files": "../Data/locutor/train.list",
"validation_files": "../Data/locutor/val.list",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 128,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 1,
"cleaned_text": true,
"spk2id": {
"locutor": 0
}
},
"model": {
"use_spk_conditioned_encoder": true,
"use_noise_scaled_mas": true,
"use_mel_posterior_encoder": false,
"use_duration_discriminator": true,
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"n_layers_trans_flow": 3,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
2,
2,
2
],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
8,
2,
2
],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 256
},
"num_languages": 1,
"num_tones": 16,
"symbols": [
...
from melotts.
Thank you for sharing! Your advice has been super helpful @jeremy110 🙏
from melotts.
Hmm trained for longer with different hyperparameters but so far the results are not much better, something might be wrong with my code.
from melotts.
Yeah, me too.
With longer training the voice gets clearer and more similar to the speaker, but it can't pronounce a single word.
Maybe it's a phonemizer problem, I don't know.
from melotts.
hello @jadechip @acul3
I'd like to confirm something. Are all your tones set to 0?
Because I made a similar mistake before where I treated tones like ˧ ˦ as phones, but they should correspond to tones. Here's an example of what I did before.
#error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]
#correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]
from melotts.
@jeremy110 yes, all my tones are set to 0.
Now I'm wondering how I can fix this.
from melotts.
@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list.
I've pushed some changes to the g2p function which hopefully addresses this:
def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []
    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)
        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())
    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))
    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))
    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]
    assert len(word2ph) == len(tokenized) + 2
    return phones, tones, word2ph

def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # High tone
        "˦": 4,  # Rising tone
        "˧": 3,  # Mid tone
        "˨": 2,  # Falling tone
        "˩": 1,  # Low tone
    }
    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)
    return tones
TL;DR:
- It now removes the tone markers from the phonemes in phs using a regular expression and stores the result in the phones list, adding start and end markers ("_").
- It then extracts the tones from the original phs list using the extract_tones function and stores them in the tones list.
- It constructs the final word2ph list by adding start and end markers (1) to the distributed_word2ph list, and finally returns the phones, tones, and word2ph lists.
...I've also updated the following test case:
def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]
I think this output makes sense as the output is now similar to yours.
The phones list contains the phonemes corresponding to the input text, excluding the tone markers.
The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).
The word2ph list represents the number of phonemes for each word in the tokenized input. The values correspond to the number of phonemes for each word:
1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token
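One quick sanity check, going by the "correct" example earlier in the thread (28 phones, 28 tones, and a `word2ph` that sums to 28): the three lists should line up. The helper below is a hypothetical check, not part of MeloTTS. The lists asserted in `test_g2p` above appear to contain 22 phones but only 20 tones, which this kind of check would flag.

```python
def check_alignment(phones, tones, word2ph):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if len(phones) != len(tones):
        problems.append(f"{len(phones)} phones but {len(tones)} tones")
    if sum(word2ph) != len(phones):
        problems.append(
            f"word2ph sums to {sum(word2ph)}, expected {len(phones)}"
        )
    return problems


# Values taken from the 禮數 example in this thread.
ok = check_alignment(
    ["_", "l", "e", "s", "ɔ", "_"], [0, 2, 2, 3, 3, 0], [1, 2, 2, 1]
)
# Same phones, but with the two padding tones missing.
bad = check_alignment(
    ["_", "l", "e", "s", "ɔ", "_"], [2, 2, 3, 3], [1, 2, 2, 1]
)
```

Running a check like this over the whole training list before starting a job is cheaper than discovering the mismatch inside a DataLoader worker.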
from melotts.
About the Thai symbols, the characters from line 266 to 339 are the characters of the Thai alphabet, including numbers.
The remaining lines (340 - 406) were characters that I copied from the Wiktionary file (which I got from here https://github.com/PyThaiNLP/thai-g2p-wiktionary-corpus/tree/main), I am not sure if I should include them in this file (symbols.py) but if I remember correctly I was getting an error if I didn't include them.
from melotts.
About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔
Maybe I should try looking for a different Grapheme to Phoneme dictionary...
from melotts.
I appreciate your hard work. 🥇
One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., haas, IPA, etc.):
- https://github.com/wannaphong/thai-grapheme-to-phoneme
- https://github.com/nozomiyamada/thaig2p
- https://github.com/wannaphong/thai-g2p
- https://www.thaicorpus.net/g2p
In case you missed some of them. 😄
While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO.
Of course, these factors can reduce the smoothness in TTS.
I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.
from melotts.
text: 禮 數
ipa: l e ˥ ˧ s ɔ ˨˩
phones: ['_', 'l', 'e', 's', 'ɔ', '_']
tones: [0, 2, 2, 3, 3, 0]
word2ph: [1, 2, 2, 1]
Perhaps I misled you a bit. Let me clarify using an example.
For '˥ ˧' in my case, it corresponds to tone 2. The syllable has two phones, 'l' and 'e', so the tones list gets two 2s.
For '˨˩' in my case, it corresponds to tone 3. The syllable has two phones, 's' and 'ɔ', so the tones list gets two 3s.
from melotts.
Because I don't know Thai at all, I can't help with the g2p part.
sorry
from melotts.
@jeremy110 if I understand correctly, I am extracting the tones from the original phs list and assigning them to the phones in a one-to-one manner, but you are saying a single tone marker should be assigned to multiple phones based on the number of phones associated with that tone marker?
I've pushed some code that tries to solve this issue and I've added new test cases. However, I am still experiencing some strange behavior and I'm afraid I am reaching the limits of my knowledge of the Thai language as well 😔
from melotts.
@jadechip
I spent some time finding a few words from the Wiktionary file to use as examples because I noticed that the processing of Thai is different from mine, so let me give another example.
กง k o ŋ ˧ -> 3 phones + 1 tone
ล้อ l ɔː ˦˥ -> 2 phones + 1 tone
กงล้อ k o ŋ ˧ . l ɔː ˦˥
suppose tone map: ˧ -> 2, ˦˥ -> 3
text: กงล้อ -> กง ล้อ
phones: [ '_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
tones: [ 1, 2, 2, 2, 3, 3, 1] # 2 copied three times (3 phones), 3 copied twice (2 phones)
Here, I'll ignore the "." for now because I can't figure out what it represents.
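The example above can be sketched as a small function. This is a hedged illustration, not the code from the PR: `expand_tones` and its parameters are hypothetical names, the input is assumed to be a space-separated Wiktionary-style IPA string, and the tone map is the "suppose" mapping from the example.

```python
TONE_LETTERS = set("˥˦˧˨˩")


def expand_tones(ipa, tone_map, pad_tone=0):
    """Split a space-separated IPA string into phones and per-phone tones.

    A token made only of tone letters closes the current syllable and its
    tone id is copied once per phone; "." is skipped, as in the example.
    """
    phones, tones = ["_"], [pad_tone]
    pending = []  # phones of the syllable currently being read
    for token in ipa.split():
        if token == ".":
            continue
        if set(token) <= TONE_LETTERS:
            phones.extend(pending)
            tones.extend([tone_map[token]] * len(pending))
            pending = []
        else:
            pending.append(token)
    # Phones with no tone marker after them get tone 0 (an assumption).
    phones.extend(pending)
    tones.extend([0] * len(pending))
    phones.append("_")
    tones.append(pad_tone)
    return phones, tones


phones, tones = expand_tones(
    "k o ŋ ˧ . l ɔː ˦˥", {"˧": 2, "˦˥": 3}, pad_tone=1
)
```

Looking up the whole contour (for example "˦˥") rather than its individual letters keeps the five Thai tones distinct, which a per-letter lookup cannot do. `pad_tone=1` only mirrors the example; other examples in this thread use 0 for the "_" positions.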
from melotts.