Comments (47)

jeremy110 avatar jeremy110 commented on June 1, 2024 4

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model.
https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

from melotts.

tchayintr avatar tchayintr commented on June 1, 2024 2

@jeremy110 Don't worry, this is not your fault at all!

We are here to discuss and find a solution.

I will keep you updated if I find anything. @jadechip @jeremy110

BankNatchapol avatar BankNatchapol commented on June 1, 2024 2

This is my implementation (not optimized 😞) of @jeremy110's tones. I am also trying to change @jadechip's phonemizer, which sometimes returns characters rather than phonemes.
https://gist.github.com/BankNatchapol/1276e34dcb51c521536978859dd948cd
The problem now is that the phonemizer is not very good: it often returns repeated phonemes for some reason.

acul3 avatar acul3 commented on June 1, 2024 1

hello @jadechip
let me know if it's working

I'm training for Indonesian and Malay

I also changed the phonemes and the BERT model

after 10 epochs the model doesn't produce any recognizable words, only noise and some random vowels

my data:
~200 hours of audio
~500 speakers

jadechip avatar jadechip commented on June 1, 2024 1

I see, thank you for the heads up @jeremy110 🙏
I've updated my code to reflect your suggestion; now I have:

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + th_symbols
sil_phonemes_ids = [symbols.index(i) for i in pu_symbols]

# combine all tones
num_tones = num_zh_tones + num_ja_tones + num_en_tones + num_kr_tones + num_es_tones + num_fr_tones + num_de_tones + num_ru_tones + num_th_tones

# language maps
language_id_map = {"ZH": 0, "JP": 1, "EN": 2, "ZH_MIX_EN": 3, 'KR': 4, 'ES': 5, 'SP': 5, 'FR': 6, 'TH': 7}
num_languages = len(language_id_map.keys())

I'll try running a new training job to evaluate performance with these changes.

acul3 avatar acul3 commented on June 1, 2024 1

thanks @jadechip and @jeremy110

I'll try it in my environment too and see if it works

jadechip avatar jadechip commented on June 1, 2024 1

btw I am currently training on a subset of Thai commonvoice 13, converted to .wav with a sample rate of 48 kHz.
Edit: Happy weekend everyone 🎉

jeremy110 avatar jeremy110 commented on June 1, 2024 1

hello~ @jadechip

My config is basically the same as yours, except my batch size is 6. Perhaps you can increase your learning rate to 9e-4 and see how it performs. Also, I've added a constraint to the clip_grad_value in the code.

grad_norm_d = commons.clip_grad_value_(net_d.parameters(), 200)
grad_norm_g = commons.clip_grad_value_(net_g.parameters(), 500)
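
For readers unfamiliar with value clipping, here is a framework-free sketch of what a helper like `commons.clip_grad_value_` is assumed to do, based on the VITS-style helper of the same name: it measures the gradient norm before clipping and clamps every gradient entry to `[-clip_value, clip_value]`. The function name `clip_grad_value` and the plain-list gradients are illustrative stand-ins, not MeloTTS code.

```python
import math


def clip_grad_value(grads, clip_value, norm_type=2.0):
    """Clamp every gradient entry in place and return the pre-clip norm.

    `grads` is a list of lists of floats standing in for parameter gradients.
    """
    total = 0.0
    for g in grads:
        # Accumulate the norm before clamping, so the returned value
        # reflects the raw gradient magnitude.
        total += sum(abs(x) ** norm_type for x in g)
        if clip_value is not None:
            for i, x in enumerate(g):
                g[i] = max(-clip_value, min(clip_value, x))
    return total ** (1.0 / norm_type)


grads = [[3.0, -4.0], [1200.0]]
norm = clip_grad_value(grads, clip_value=500)
print(norm, grads)
```

Only the oversized entry is changed; the returned norm still reflects the unclipped gradients, which is what makes it useful to log.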

Finally, I'm attaching my tensorboard for reference. (https://drive.google.com/drive/folders/1xPNURmWsmJqwEDHVM8ZsK6CAbuv65ipI?usp=sharing)

Additionally, if the silence before and after your audio files is shorter, your g/dur will converge to a smaller value, which will also affect the length of the silence before and after the inference.

I'm not sure if the Thai CommonVoice 13 dataset is suitable for training. Also, there's no need to specifically convert it to 48kHz. I remember that the code will resample it. I think you can start by testing whether it can be trained with 10 hours of data from one person.

I hope this is helpful for you.

jeremy110 avatar jeremy110 commented on June 1, 2024 1

hello~ @acul3 @jadechip
Sorry, I spent some time looking at that, but since I can't read Thai, I did some online research. I wanted to ask about the symbols from line 266 to 339 in th_symbols. Are those symbols not IPA?

Also, I looked at the Wiktionary file and found several symbols that seem to represent tones: ˧, ˨˩, ˦˥, ˩˦, and ˥˩. It looks like there are five tones. So, you need to convert these symbols into tones and then add the corresponding number of tones to the 'tones' list based on the number of phones in your phone list.
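
A minimal sketch of that conversion, assuming each syllable arrives as an IPA string whose tone contour is a trailing run of Chao tone letters. The numbering in `TONE_IDS`, the helper name `split_tone`, and the two sample syllables are hypothetical; only the five contours are taken from the comment above.

```python
import re

# Hypothetical numbering; 0 is reserved for "no tone" (pads, punctuation).
TONE_IDS = {"˧": 1, "˨˩": 2, "˦˥": 3, "˩˦": 4, "˥˩": 5}

# A run of one or more Chao tone letters at the end of a syllable.
TONE_RUN = re.compile(r"[˥˦˧˨˩]+$")


def split_tone(syllable):
    """Return (segments_without_tone, tone_id) for one IPA syllable."""
    match = TONE_RUN.search(syllable)
    if match is None:
        return syllable, 0
    contour = match.group(0)
    # Look up the whole contour; an unknown contour raises KeyError
    # instead of being silently mapped to some tone.
    return syllable[: match.start()], TONE_IDS[contour]


print(split_tone("maː˧"))
print(split_tone("kʰaːw˥˩"))
```

Matching the whole contour matters: handling the letters one at a time would treat a contour such as ˨˩ as two separate tones.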

But I'm confused about lines 5908 to 5910. Which one is correct?

tchayintr avatar tchayintr commented on June 1, 2024

Great job! I planned to work on training a Thai TTS model using MeloTTS too.

Zengyi-Qin avatar Zengyi-Qin commented on June 1, 2024

Hi - Thanks for the contribution. We would suggest you first train on the Thai dataset to see if the code works. We haven't attempted to train on Thai.

jadechip avatar jadechip commented on June 1, 2024

@Zengyi-Qin Sounds good, will report back once I have proper training results.

jadechip avatar jadechip commented on June 1, 2024

Thank you @tchayintr, if you have any recommendations for Thai audio datasets, I would greatly appreciate it!

tchayintr avatar tchayintr commented on June 1, 2024

@jadechip
Sure! There are several datasets such as TSync2, Lotus, etc. You can check several of them here: https://github.com/korakot/corpus/releases/tag/v1.0 with documentation at https://lexitron.nectec.or.th/KM_HL5001/file_HL5001/Document/krrn_14518.pdf.

There are also Thai dialects available at https://github.com/SLSCU/thai-dialect-corpus.

However, I recommend collecting clear voice clips and crafting their transcriptions with ASR tools like WhisperX. This way, you can generate a lot of samples, but you may need to fine-tune it for the Thai language 😄.

I am reviewing your commits too. They mostly look great 🎆 , but I found some points that need to be clarified.
I will clarify and let you know if there is a point that may need to be adjusted in terms of Thai linguistic knowledge.

jadechip avatar jadechip commented on June 1, 2024

@tchayintr this is super helpful and any feedback you have for my code will be greatly appreciated 🙏
I was also looking at this other nectec dataset: https://github.com/vistec-AI/dataset-releases/releases/tag/v1
I'll work on creating transcriptions next and report back.

jadechip avatar jadechip commented on June 1, 2024

@Zengyi-Qin are there any additional steps or files needed before training? I am getting the following error:

output

⚡ add-thai ~/MeloTTS/melo torchrun --nproc_per_node=1 --master_port=10902 train.py --c data/thai/config.json --model thai
2024-05-07 15:24:58.152 | INFO     | data_utils:_filter:64 - Init dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141910/141910 [00:04<00:00, 32864.77it/s]
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:84 - min: 65; max: 987
2024-05-07 15:25:02.475 | INFO     | data_utils:_filter:85 - skipped: 327, total: 141910
buckets: [92994, 31326, 11604, 4350, 1068, 156, 84, 24]
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-05-07 15:25:02.699 | INFO     | data_utils:_filter:64 - Init dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 32832.13it/s]
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:84 - min: 164; max: 625
2024-05-07 15:25:02.700 | INFO     | data_utils:_filter:85 - skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:01<?, ?it/s]
Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 194, in __getitem__
    return self.get_audio_text_speaker_pair(self.audiopaths_sid_text[index])
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 98, in get_audio_text_speaker_pair
    bert, ja_bert, phones, tone, language = self.get_text(
  File "/teamspace/studios/this_studio/MeloTTS/melo/data_utils.py", line 180, in get_text
    raise
RuntimeError: No active exception to reraise

...it seems to happen around line 200 in train.py

config.json

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 6,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "data/thai/train.list",
    "validation_files": "data/thai/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "TH-default": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 9,
  "num_tones": 17,
  "symbols": [
    "_",
    "\"",
    "(",
    ")",
    "*",
    "/",
    ":",
    "AA",
    "E",
    "EE",
    "En",
    "N",
    "OO",
    "Q",
    "V",
    "[",
    "\\",
    "]",
    "^",
    "a",
    "a:",
    "aa",
    "ae",
    "ah",
    "ai",
    "an",
    "ang",
    "ao",
    "aw",
    "ay",
    "b",
    "by",
    "c",
    "ch",
    "d",
    "dh",
    "dy",
    "e",
    "e:",
    "eh",
    "ei",
    "en",
    "eng",
    "er",
    "ey",
    "f",
    "g",
    "gy",
    "h",
    "hh",
    "hy",
    "i",
    "i0",
    "i:",
    "ia",
    "ian",
    "iang",
    "iao",
    "ie",
    "ih",
    "in",
    "ing",
    "iong",
    "ir",
    "iu",
    "iy",
    "j",
    "jh",
    "k",
    "ky",
    "l",
    "m",
    "my",
    "n",
    "ng",
    "ny",
    "o",
    "o:",
    "ong",
    "ou",
    "ow",
    "oy",
    "p",
    "py",
    "q",
    "r",
    "ry",
    "s",
    "sh",
    "t",
    "th",
    "ts",
    "ty",
    "u",
    "u:",
    "ua",
    "uai",
    "uan",
    "uang",
    "uh",
    "ui",
    "un",
    "uo",
    "uw",
    "v",
    "van",
    "ve",
    "vn",
    "w",
    "x",
    "y",
    "z",
    "zh",
    "zy",
    "~",
    "æ",
    "ç",
    "ð",
    "ø",
    "ŋ",
    "œ",
    "ɐ",
    "ɑ",
    "ɒ",
    "ɔ",
    "ɕ",
    "ə",
    "ɛ",
    "ɜ",
    "ɡ",
    "ɣ",
    "ɥ",
    "ɦ",
    "ɪ",
    "ɫ",
    "ɬ",
    "ɭ",
    "ɯ",
    "ɲ",
    "ɵ",
    "ɸ",
    "ɹ",
    "ɾ",
    "ʁ",
    "ʃ",
    "ʊ",
    "ʌ",
    "ʎ",
    "ʏ",
    "ʑ",
    "ʒ",
    "ʝ",
    "ʲ",
    "ˈ",
    "ˌ",
    "ː",
    "̃",
    "̩",
    "β",
    "θ",
    "ก",
    "ข",
    "ฃ",
    "ค",
    "ฅ",
    "ฆ",
    "ง",
    "จ",
    "ฉ",
    "ช",
    "ซ",
    "ฌ",
    "ญ",
    "ฎ",
    "ฏ",
    "ฐ",
    "ฑ",
    "ฒ",
    "ณ",
    "ด",
    "ต",
    "ถ",
    "ท",
    "ธ",
    "น",
    "บ",
    "ป",
    "ผ",
    "ฝ",
    "พ",
    "ฟ",
    "ภ",
    "ม",
    "ย",
    "ร",
    "ล",
    "ว",
    "ศ",
    "ษ",
    "ส",
    "ห",
    "ฬ",
    "อ",
    "ฮ",
    "ะ",
    "ั",
    "า",
    "ำ",
    "ิ",
    "ี",
    "ึ",
    "ื",
    "ุ",
    "ู",
    "เ",
    "แ",
    "โ",
    "ใ",
    "ไ",
    "ๅ",
    "็",
    "่",
    "้",
    "์",
    "๐",
    "๑",
    "๒",
    "๓",
    "๔",
    "๕",
    "๖",
    "๗",
    "๘",
    "๙",
    "ᄀ",
    "ᄁ",
    "ᄂ",
    "ᄃ",
    "ᄄ",
    "ᄅ",
    "ᄆ",
    "ᄇ",
    "ᄈ",
    "ᄉ",
    "ᄊ",
    "ᄋ",
    "ᄌ",
    "ᄍ",
    "ᄎ",
    "ᄏ",
    "ᄐ",
    "ᄑ",
    "ᄒ",
    "ᅡ",
    "ᅢ",
    "ᅣ",
    "ᅤ",
    "ᅥ",
    "ᅦ",
    "ᅧ",
    "ᅨ",
    "ᅩ",
    "ᅪ",
    "ᅫ",
    "ᅬ",
    "ᅭ",
    "ᅮ",
    "ᅯ",
    "ᅰ",
    "ᅱ",
    "ᅲ",
    "ᅳ",
    "ᅴ",
    "ᅵ",
    "ᆨ",
    "ᆫ",
    "ᆮ",
    "ᆯ",
    "ᆷ",
    "ᆸ",
    "ᆼ",
    "ㄸ",
    "!",
    "?",
    "…",
    ",",
    ".",
    "'",
    "-",
    "¿",
    "¡",
    "SP",
    "UNK"
  ]
}
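
One sanity check that can be run on a config like this: in VITS-style models the decoder is expected to upsample one spectrogram frame into `hop_length` audio samples, so the product of `upsample_rates` should equal `hop_length`, and `segment_size` should be a whole number of frames. The sketch below inlines only the relevant values from the config above; the helper name `check_frame_math` is made up for illustration.

```python
import math

config = {
    "train": {"segment_size": 16384},
    "data": {"sampling_rate": 44100, "hop_length": 512},
    "model": {"upsample_rates": [8, 8, 2, 2, 2]},
}


def check_frame_math(cfg):
    """Report whether the frame-related settings agree with each other."""
    hop = cfg["data"]["hop_length"]
    total_upsample = math.prod(cfg["model"]["upsample_rates"])
    frames_per_segment, remainder = divmod(cfg["train"]["segment_size"], hop)
    return {
        "upsample_matches_hop": total_upsample == hop,
        "frames_per_segment": frames_per_segment,
        "segment_is_whole_frames": remainder == 0,
        "frame_ms": 1000.0 * hop / cfg["data"]["sampling_rate"],
    }


report = check_frame_math(config)
print(report)
```

For these values the settings are consistent, so the crash reported above is unlikely to come from this part of the config.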

jadechip avatar jadechip commented on June 1, 2024

Never mind, I was able to pinpoint the issue: I didn't realize the language code needed to be added in one more place as well.

I've updated my PR with the missing code.
It seems to be training correctly now, although I am still getting some warnings/exceptions:

Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
(torch.Size([219, 192]), torch.Size([360, 192]))
(torch.Size([16, 192]), torch.Size([17, 192]))
(torch.Size([10, 192]), torch.Size([9, 192]))
(torch.Size([256, 256]), torch.Size([1, 256]))
list index out of range
  0%|                                                                                                                                                        | 0/23601 [00:00<?, ?it/s][W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [99168, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Evaluating ...
Evauate done
  0%|▍                                                                                                                                           | 74/23601 [03:24<11:00:36,  1.68s/it]min value is  tensor(-1.1265)

Will try to run the complete training loop on some H100s 🤞

jadechip avatar jadechip commented on June 1, 2024

hello @jadechip @acul3

It seems there might be an issue with the training process. According to the current code, if your symbol size is not equal to the original 219, a new size will be used to initialize the TextEncoder. This means that you are not utilizing the base model, but rather retraining it. Based on my previous tests, this could lead to strange results where the model fails to properly generate text.

Solution: Similar to adding a new vocabulary to BERT, you should modify the loading process of the model. https://huggingface.co/transformers/v2.11.0/_modules/transformers/modeling_utils.html (_get_resized_embeddings function)

Thank you @jeremy110.
If I understand correctly, in melo/models.py we should first initialize the TextEncoder with the original 219 symbols in order to use the pretrained weights, like this:

# models.py
        self.enc_p = TextEncoder(
            219,  # Initialize with the original symbol size
            inter_channels,
            hidden_channels,
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            gin_channels=self.enc_gin_channels,
            num_languages=num_languages,
            num_tones=num_tones,
        )

...then right after that, add a check for whether n_vocab (len(symbols)) has a different size, and if so replace self.enc_p.emb with the resized embeddings?

if n_vocab != 219:
    old_embeddings = self.enc_p.emb
    new_num_tokens = n_vocab
    self.enc_p.emb = self.get_resized_embeddings(old_embeddings, new_num_tokens)

Does that look correct to you?
Note: I've updated my PR to reflect this.
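
The idea behind the resize step can be sketched without any framework. The example below treats an embedding table as a list of rows, copies the pretrained rows into a larger table, and gives only the new rows fresh values. `resize_embedding_rows`, the Gaussian initialisation, and the tiny sizes are assumptions for illustration; the `get_resized_embeddings` method referenced in the snippet above is the author's own and is not reproduced here.

```python
import random


def resize_embedding_rows(old_rows, new_num_tokens, init_std=0.02, seed=0):
    """Return a table with `new_num_tokens` rows that keeps the old rows.

    Rows 0..len(old_rows)-1 are copied unchanged, so symbols that keep
    their index keep their pretrained vector. Extra rows are randomly
    initialised; a smaller target simply truncates the table.
    """
    rng = random.Random(seed)
    dim = len(old_rows[0])
    kept = [list(row) for row in old_rows[:new_num_tokens]]
    extra = [
        [rng.gauss(0.0, init_std) for _ in range(dim)]
        for _ in range(new_num_tokens - len(kept))
    ]
    return kept + extra


pretrained = [[float(i)] * 4 for i in range(3)]  # stand-in for 219 x 192
resized = resize_embedding_rows(pretrained, 5)   # stand-in for 360 x 192
print(len(resized), resized[:3] == pretrained)
```

Note the ordering this implies: the pretrained weights have to be loaded while the table still has its original 219 rows, and the resize applied afterwards, otherwise the checkpoint and model shapes differ.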

jeremy110 avatar jeremy110 commented on June 1, 2024

hello~ @jadechip

Yes, it looks fine as it is.

However, in symbols.py, you'll need to make some modifications. If you place your new symbol inside the sorted list and then use the method above, it may result in some symbols having weights that don't match up with the original model. So, I suggest you do it like this.

# combine all symbols
normal_symbols = sorted(set(zh_symbols + ja_symbols + en_symbols + kr_symbols + es_symbols + fr_symbols + de_symbols + ru_symbols))
symbols = [pad] + normal_symbols + pu_symbols + new_symbols # add new symbols here 
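
To see why the position of the new symbols matters, here is a toy comparison of the two orderings. The symbol lists are tiny made-up stand-ins for the real tables, and `moved_symbols` is a hypothetical helper; it lists every original symbol whose index, and therefore whose embedding row, would change.

```python
old_symbols = ["_", "a", "b", "k", "!", "?"]  # toy stand-in for the 219 originals
new_symbols = ["c", "ก"]  # toy additions


def moved_symbols(old, new):
    """List the old symbols whose index changes in the new table."""
    new_index = {s: i for i, s in enumerate(new)}
    return [s for i, s in enumerate(old) if new_index.get(s) != i]


# Option 1: append the additions after everything else.
appended = old_symbols + new_symbols

# Option 2: mix the additions into the sorted block.
merged = ["_"] + sorted(set(old_symbols[1:4] + new_symbols)) + old_symbols[4:]

print(moved_symbols(old_symbols, appended))
print(moved_symbols(old_symbols, merged))
```

With the appended layout no original symbol moves, so every pretrained row still belongs to the symbol it was trained for. With the merged layout, everything sorted after the first inserted symbol is shifted onto another symbol's row.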

jadechip avatar jadechip commented on June 1, 2024

Ok, I was able to run a training job for around 9k steps yesterday. I tried running inference using the new checkpoint, but it seems to produce unintelligible sounds. I think the learning rate looks ok though? ...so I will try ramping up the batch size and training for longer on multiple GPUs and report back with my results 🤞
For reference here is my current config and Tensorboard metrics.

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 52,
    "epochs": 10000,
    "learning_rate": 0.0003,
    "betas": [
      0.8,
      0.99
    ],
    "eps": 1e-09,
    "batch_size": 16,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 16384,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "skip_optimizer": true
  },
  "data": {
    "training_files": "../Data/locutor/train.list",
    "validation_files": "../Data/locutor/val.list",
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "spk2id": {
      "locutor": 0
    }
  },
  "model": {
    "use_spk_conditioned_encoder": true,
    "use_noise_scaled_mas": true,
    "use_mel_posterior_encoder": false,
    "use_duration_discriminator": true,
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "n_layers_trans_flow": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [
      16,
      16,
      8,
      2,
      2
    ],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  },
  "num_languages": 1,
  "num_tones": 16,
  "symbols": [
...
[Screenshots of TensorBoard metrics]

jadechip avatar jadechip commented on June 1, 2024

Thank you for sharing! Your advice has been super helpful @jeremy110 🙏

jadechip avatar jadechip commented on June 1, 2024

Hmm, I trained for longer with different hyperparameters, but so far the results are not much better. Something might be wrong with my code.

acul3 avatar acul3 commented on June 1, 2024

yeah, me too

with longer training the voice is clearer and sounds similar, but it can't pronounce a single word

maybe a phonemizer problem, idk

jeremy110 avatar jeremy110 commented on June 1, 2024

hello @jadechip @acul3
I'd like to confirm something. Are all your tones set to 0?
Because I made a similar mistake before where I treated tones like ˧ ˦ as phones, but they should correspond to tones. Here's an example of what I did before.

#error
phones: ['_', 'k', 'e', 'ʔ', '˧', 'p', 'i', 'a', 'ʔ', '˧', 'ʦ', 'ʰ', 'i', 'n', '˦', '˦', 'k', 'e', '˦', '˦', ',', 'l', 'e', '˥', '˧', 's', 'ɔ', '˨', '˩', 'g', 'u', 'a', 'n', '˩', '˧', 'ʦ', 'a', 'i', '˧', '˧', '.', '_']
tones: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
word2ph: [1, 4, 5, 6, 4, 1, 4, 4, 6, 5, 1, 1]
#correct
phones: 28 ['_', 'k', 'e', 'ʔ', 'p', 'i', 'a', 'ʔ', 'ʦ', 'ʰ', 'i', 'n', 'k', 'e', ',', 'l', 'e', 's', 'ɔ', 'g', 'u', 'a', 'n', 'ʦ', 'a', 'i', '.', '_']
tones: [0, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 0, 2, 2, 3, 3, 5, 5, 5, 5, 7, 7, 7, 0, 0]
word2ph: [1, 3, 4, 4, 2, 1, 2, 2, 4, 3, 1, 1]
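
The difference between the two outputs can be checked mechanically. The sketch below is a hypothetical validator, not part of MeloTTS: it checks that there is one tone per phone, that `word2ph` accounts for every phone, that no tone letters remain in the phone list, and that the tones are not all zero. The two inputs are shortened versions of the example above (first syllable only).

```python
TONE_LETTERS = set("˥˦˧˨˩")


def check_alignment(phones, tones, word2ph):
    """Return a list of problems found in one g2p result (empty if none)."""
    problems = []
    if len(phones) != len(tones):
        problems.append(f"{len(phones)} phones but {len(tones)} tones")
    if sum(word2ph) != len(phones):
        problems.append(f"word2ph sums to {sum(word2ph)}, expected {len(phones)}")
    leaked = [p for p in phones if TONE_LETTERS & set(p)]
    if leaked:
        problems.append(f"tone letters left in phones: {leaked}")
    if not any(tones):
        problems.append("all tones are 0")
    return problems


bad = check_alignment(["_", "k", "e", "ʔ", "˧", "_"], [0] * 6, [1, 4, 1])
good = check_alignment(["_", "k", "e", "ʔ", "_"], [0, 4, 4, 4, 0], [1, 3, 1])
print(bad)
print(good)
```

Running a check like this over a few lines of the training list before starting a job is cheaper than discovering the problem after several thousand steps.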

acul3 avatar acul3 commented on June 1, 2024

@jeremy110 yes, all my tones are set to 0

now I'm wondering how I can fix this

jadechip avatar jadechip commented on June 1, 2024

@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list.
I've pushed some changes to the g2p function which hopefully addresses this:

def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []

    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)

        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())

    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))

    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))

    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]

    assert len(word2ph) == len(tokenized) + 2

    return phones, tones, word2ph


def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # Chao tone letter, pitch level 5 (highest)
        "˦": 4,  # pitch level 4
        "˧": 3,  # pitch level 3 (mid)
        "˨": 2,  # pitch level 2
        "˩": 1,  # pitch level 1 (lowest)
    }

    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)

    return tones

TL;DR:

  • It now removes the tone markers from the phonemes in phs using a regular expression and stores the result in the phones list, adding start and end markers ("_").
  • It then extracts the tones from the original phs list using the extract_tones function and stores them in the tones list.
  • It constructs the final word2ph list by adding start and end markers (1) to the distributed_word2ph list, and finally returns the phones, tones, and word2ph lists.

...I've also updated the following test case:

def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]

I think this output makes sense as the output is now similar to yours.

The phones list contains the phonemes corresponding to the input text, excluding the tone markers.
The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).

The word2ph list represents the number of phonemes for each word in the tokenized input. The values correspond to the number of phonemes for each word:

1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token
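
A quick length comparison of the three lists asserted in the test above, assuming the downstream code expects exactly one tone per phone (as the example from @jeremy110 suggests). The lists are copied from the test; the padding shown at the end is a hypothetical fix, not something from the PR.

```python
phones = ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '',
          'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
tones = [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
word2ph = [1, 0, 8, 12, 1]

print(len(phones), len(tones), sum(word2ph))

# phones is wrapped in "_" markers but tones is not, so tones is two
# entries short. Hypothetical fix: pad tones in the same way.
padded_tones = [0] + tones + [0]
print(len(padded_tones) == len(phones))
```

The empty strings in `phones` mark the places where a tone-only token was stripped, so each tone value currently sits on an empty phone rather than on the phones of its syllable.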

jadechip avatar jadechip commented on June 1, 2024

About the Thai symbols, the characters from line 266 to 339 are the characters of the Thai alphabet, including numbers.
The remaining lines (340 - 406) were characters that I copied from the Wiktionary file (which I got from here https://github.com/PyThaiNLP/thai-g2p-wiktionary-corpus/tree/main), I am not sure if I should include them in this file (symbols.py) but if I remember correctly I was getting an error if I didn't include them.

jadechip avatar jadechip commented on June 1, 2024

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔
Maybe I should try looking for a different Grapheme to Phoneme dictionary...

tchayintr avatar tchayintr commented on June 1, 2024

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., haas, IPA, etc.):

In case you missed some of them. 😄

While rule-based tools offer more precise conversions, they may not always provide results for some graphemes. Seq2seq tools, on the other hand, offer more flexible conversions, but their CER or PER is still considered high, IMO.
Of course, these factors can reduce the smoothness in TTS.

I am concerned about the current state of Thai G2P and am trying to survey how we can address the challenges with Thai G2P.

jeremy110 avatar jeremy110 commented on June 1, 2024

@jeremy110 you are absolutely right. My code was outputting zeroes for the tones list. I've pushed some changes to the g2p function which hopefully addresses this:

def g2p(norm_text):
    tokenized = tokenizer.tokenize(norm_text)
    phs = []
    word2ph = []
    current_word = []
    current_phonemes = []

    for token in tokenized:
        if token.startswith("▁"):  # Start of a new word
            if current_word:
                word_phonemes = " ".join(current_phonemes)
                phs.extend(word_phonemes.split())
                word2ph.append(len(current_phonemes))
                current_word = []
                current_phonemes = []
            current_word.append(token.replace("▁", ""))
        else:
            current_word.append(token)

        if token in punctuation or token in pu_symbols:
            phs.append(token)
            word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(token.replace("▁", ""))
            current_phonemes.extend(phonemes.split())

    if current_word:
        word_phonemes = " ".join(current_phonemes)
        phs.extend(word_phonemes.split())
        word2ph.append(len(current_phonemes))

    # Distribute phonemes to match the number of tokens
    distributed_word2ph = []
    for i, group in enumerate(tokenized):
        if group.startswith("▁"):
            group = group.replace("▁", "")
        if group in punctuation or group in pu_symbols:
            distributed_word2ph.append(1)
        else:
            phonemes = thai_text_to_phonemes(group)
            distributed_word2ph.append(len(phonemes.split()))

    tone_markers = ['˥', '˦', '˧', '˨', '˩']
    phones = ["_"] + [re.sub(f'[{"".join(tone_markers)}]', '', p) for p in phs] + ["_"]  # Remove tone markers from phones
    tones = extract_tones(phs)  # Extract tones from the original phs list
    word2ph = [1] + distributed_word2ph + [1]

    assert len(word2ph) == len(tokenized) + 2

    return phones, tones, word2ph


def extract_tones(phones):
    tones = []
    tone_map = {
        "˥": 5,  # High tone
        "˦": 4,  # Rising tone
        "˧": 3,  # Mid tone
        "˨": 2,  # Falling tone
        "˩": 1,  # Low tone
    }

    for phone in phones:
        tone = 0
        for marker, value in tone_map.items():
            if marker in phone:
                tone = value
                break
        tones.append(tone)

    return tones

TLDR;

  • it now removes the tone markers from the phonemes in phs using a regular expression and stores the result in the phones list, adding start and end markers ("_").
  • It then extracts the tones from the original phs list using the extract_tones function and stores them in the tones list.
  • It constructs the final word2ph list by adding start and end markers (1) to the distributed_word2ph list and finally, it returns the phones, tones, and word2ph lists.

...I've also updated the test following test case:

def test_g2p():
    text = "ฉันรักเมืองไทย"
    normalized_text = text_normalize(text)
    phones, tones, word2ph = g2p(normalized_text)
    assert phones == ['_', 't͡ɕʰ', 'a', 'n', '', 'r', 'a', 'k̚', '', 'm', 'ɯa̯', 'ŋ', '', 'tʰ', 'aj', '', '.', 'j', 'a', '', '.', '_']
    assert tones == [0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 5, 0]
    assert word2ph == [1, 0, 8, 12, 1]

I think this output makes sense, since it is now similar to yours.

The phones list contains the phonemes corresponding to the input text, excluding the tone markers. The mapping of tone markers to numeric values seems accurate (4 for ˩˩˦, 5 for ˦˥, 3 for ˧).

The word2ph list gives the number of phonemes for each token in the tokenized input:

1: Start-of-sequence token
0: No phonemes for the first token (likely punctuation or special symbol)
8: Number of phonemes for the second token ("ฉันรัก")
12: Number of phonemes for the third token ("เมืองไทย")
1: End-of-sequence token
text:           
ipa: l e ˥ ˧      s ɔ ˨˩
phones: ['_', 'l', 'e',     's', 'ɔ', '_']
tones: [0, 2, 2,       3, 3, 0]
word2ph: [1, 2,      2, 1]

Perhaps I misled you a bit, so let me clarify with an example.
In my case, '˥ ˧' maps to tone 2. That syllable has two phones, 'l' and 'e', so the tones list gets two 2s.
Likewise, '˩' maps to tone 3 in my case. That syllable has two phones, 's' and 'ɔ', so the tones list gets two 3s.
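The example above happens to satisfy two alignment conditions: there is one tone per phone, and the word2ph counts add up to the number of phones. Here is a small sketch of a check for both. It assumes the downstream model needs exactly these two conditions, which is my reading and not something confirmed in this thread; `is_aligned` is a hypothetical helper name.

```python
def is_aligned(phones, tones, word2ph):
    """True if every phone has a tone and word2ph accounts for every phone."""
    return len(phones) == len(tones) and sum(word2ph) == len(phones)


# Values copied from the example above.
phones = ["_", "l", "e", "s", "ɔ", "_"]
tones = [0, 2, 2, 3, 3, 0]
word2ph = [1, 2, 2, 1]
print(is_aligned(phones, tones, word2ph))  # True

# Dropping the tones for the "_" boundary markers breaks the first condition.
print(is_aligned(phones, tones[1:-1], word2ph))  # False
```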

from melotts.

jeremy110 avatar jeremy110 commented on June 1, 2024

About lines 5908 to 5910 in the Wiktionary file, that is a good question. I am not sure which one is correct to be honest 🤔 Maybe I should try looking for a different Grapheme to Phoneme dictionary...

I appreciate your hard work. 🥇

One of my concerns is that most Thai G2P tools are either rule-based or seq2seq, and their phoneme formats vary (e.g., Haas, IPA, etc.):

In case you missed some of them. 😄

Rule-based tools give more precise conversions, but they may not return a result for every grapheme. Seq2seq tools are more flexible, but their character error rate (CER) or phoneme error rate (PER) is still high, IMO. Either weakness can make the resulting TTS output less smooth.

I am concerned about the current state of Thai G2P and am surveying ways to address its challenges.
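For anyone comparing G2P tools, PER is usually computed as the edit distance between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch follows; the function names and the sample sequences are made up for illustration and do not come from any of the tools discussed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and prediction that differ in one vowel.
ref = ["k", "o", "ŋ"]
hyp = ["k", "ɔ", "ŋ"]
per = phoneme_error_rate(ref, hyp)
print(round(per, 3))  # 0.333
```

CER is the same calculation over characters instead of phoneme tokens.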

Because I don't know Thai at all, I can't help with the G2P part. Sorry.

from melotts.

jadechip avatar jadechip commented on June 1, 2024

@jeremy110 If I understand correctly, I am extracting the tones from the original phs list and assigning them to the phones one-to-one, whereas you are saying that a single tone marker should be copied to every phone that belongs to it?

I've pushed some code that tries to solve this issue and added new test cases. However, I am still seeing some strange behavior, and I'm afraid I am reaching the limits of my knowledge of the Thai language as well 😔

from melotts.

jeremy110 avatar jeremy110 commented on June 1, 2024

@jadechip
I spent some time picking a few words from the Wiktionary file to use as examples, because I noticed that the processing for Thai is different from mine. Let me give another example.

กง	k o ŋ ˧    -> 3 phones + 1 tone
ล้อ	l ɔː ˦˥        -> 2 phones + 1 tone
กงล้อ	k o ŋ ˧ . l ɔː ˦˥ 

suppose tone map: ˧ -> 2,  ˦˥ -> 3
text: กงล้อ -> กง   ล้อ
phones: [ '_', 'k', 'o', 'ŋ',    'l', 'ɔː', '_']
tones: [ 1, 2, 2, 2,      3, 3, 1]  # 2 copied three times (3 phones), 3 copied twice (2 phones)

Here, I'll ignore the "." for now because I can't figure out what it represents.
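A rough sketch of that copying rule, assuming each syllable's tone arrives as a separate token after its phones, as in the Wiktionary lines above. The function name and the `neutral_tone` parameter are hypothetical; `neutral_tone=1` is only there to reproduce the boundary values used in the example.

```python
TONE_LETTERS = set("˥˦˧˨˩")


def split_phones_and_tones(ipa_tokens, tone_map, neutral_tone=0):
    """Return (phones, tones), copying each syllable's tone id to all its phones."""
    phones, tones = ["_"], [neutral_tone]
    syllable = []  # phones collected since the last tone token
    for tok in ipa_tokens:
        if tok == ".":
            continue  # ignored, as in the example above
        if tok and set(tok) <= TONE_LETTERS:
            # A tone token closes the syllable: one tone id per collected phone.
            tone_id = tone_map[tok]  # KeyError for a contour missing from the map
            phones.extend(syllable)
            tones.extend([tone_id] * len(syllable))
            syllable = []
        else:
            syllable.append(tok)
    # Phones with no tone token after them get the neutral tone (an assumption).
    phones.extend(syllable)
    tones.extend([neutral_tone] * len(syllable))
    phones.append("_")
    tones.append(neutral_tone)
    return phones, tones


tone_map = {"˧": 2, "˦˥": 3}
tokens = "k o ŋ ˧ . l ɔː ˦˥".split()
phones, tones = split_phones_and_tones(tokens, tone_map, neutral_tone=1)
print(phones)  # ['_', 'k', 'o', 'ŋ', 'l', 'ɔː', '_']
print(tones)   # [1, 2, 2, 2, 3, 3, 1]
```

Looking contours up as whole strings, rather than by their first matching letter, keeps different contours that share a letter from getting the same id.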

from melotts.
