
pl-bert's Introduction

Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.

Paper: https://arxiv.org/abs/2301.08810

Audio samples: https://pl-bert.github.io/

Pre-requisites

  1. Python >= 3.7
  2. Clone this repository:
git clone https://github.com/yl4579/PL-BERT.git
cd PL-BERT
  3. Create a new environment (recommended):
conda create --name BERT python=3.8
conda activate BERT
python -m ipykernel install --user --name BERT --display-name "BERT"
  4. Install python requirements:
pip install pandas singleton-decorator datasets "transformers<4.33.3" accelerate nltk phonemizer sacremoses pebble

Preprocessing

Please refer to the notebook preprocess.ipynb for more details. The preprocessing is for the English Wikipedia dataset only. I will make a new branch for Japanese if I have extra time to demonstrate training on other languages. You may also refer to #6 for preprocessing in other languages like Japanese.
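
For orientation, below is a minimal sketch of the core step the notebook performs: converting each word of the raw text into a phoneme string with an espeak-backed phonemizer and the Transformer-XL tokenizer used for the grapheme labels. The function name phonemize_words is illustrative only; the actual notebook additionally handles text normalization, dataset sharding, and building the token maps.

# Hedged sketch of the per-word grapheme-to-phoneme step (illustrative, not the notebook's exact code).
from phonemizer.backend import EspeakBackend
from transformers import TransfoXLTokenizer

global_phonemizer = EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")

def phonemize_words(text):
    # split into word-level tokens, then phonemize each word independently
    words = tokenizer.tokenize(text)
    phonemes = [global_phonemizer.phonemize([w], strip=True)[0] for w in words]
    return words, phonemes

words, phonemes = phonemize_words("Phoneme-level BERT improves prosody.")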

Training

Please run each cell in the notebook train.ipynb. You will need to change the line config_path = "Configs/config.yml" in cell 2 if you wish to use a different config file. The training code is in a Jupyter notebook primarily because the initial experiment was conducted in Jupyter, but you can easily turn it into a Python script if you want to.
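
If you prefer a script over the notebook, one possible approach (a sketch, assuming you move the training-loop cell into a train() function) is to reuse accelerate's notebook_launcher, which the notebook itself relies on:

# Minimal sketch of driving the notebook's training loop from a plain Python script.
import yaml
from accelerate import notebook_launcher

config_path = "Configs/config.yml"  # change this line to use a different config file
config = yaml.safe_load(open(config_path))

def train():
    ...  # body of the training cell from train.ipynb goes here

if __name__ == "__main__":
    # launch one process per GPU, matching num_process in the config
    notebook_launcher(train, args=(), num_processes=config['num_process'])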

Finetuning

Here is an example of how to use it for StyleTTS finetuning. You can use it for other TTS models by replacing the text encoder with the pre-trained PL-BERT.

  1. Modify line 683 in models.py with the following code to load the BERT model into StyleTTS:
from transformers import AlbertConfig, AlbertModel
from collections import OrderedDict

log_dir = "YOUR PL-BERT CHECKPOINT PATH"
config_path = os.path.join(log_dir, "config.yml")
plbert_config = yaml.safe_load(open(config_path))

albert_base_configuration = AlbertConfig(**plbert_config['model_params'])
bert = AlbertModel(albert_base_configuration)

# find the latest PL-BERT checkpoint (step_*.t7) in the log directory
ckpts = [f for f in os.listdir(log_dir)
         if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
iters = sorted(int(f.split('_')[-1].split('.')[0]) for f in ckpts)[-1]

checkpoint = torch.load(log_dir + "/step_" + str(iters) + ".t7", map_location='cpu')
state_dict = checkpoint['net']

# strip the `module.` (DataParallel) and `encoder.` prefixes before loading
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] # remove `module.`
    if name.startswith('encoder.'):
        name = name[8:] # remove `encoder.`
        new_state_dict[name] = v
bert.load_state_dict(new_state_dict)

nets = Munch(bert=bert,
             # linear projection to match the hidden size (BERT 768, StyleTTS 512)
             bert_encoder=nn.Linear(plbert_config['model_params']['hidden_size'], args.hidden_dim),
             predictor=predictor,
             decoder=decoder,
             pitch_extractor=pitch_extractor,
             text_encoder=text_encoder,
             style_encoder=style_encoder,
             text_aligner=text_aligner,
             discriminator=discriminator)
  2. Modify line 126 in train_second.py with the following code to adjust the learning rate of the BERT model:
# for stability
for g in optimizer.optimizers['bert'].param_groups:
    g['betas'] = (0.9, 0.99)
    g['lr'] = 1e-5
    g['initial_lr'] = 1e-5
    g['min_lr'] = 0
    g['weight_decay'] = 0.01
  3. Modify line 211 in train_second.py with the following code to replace the text encoder with the BERT encoder:
            bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
            d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
            d, _ = model.predictor(d_en, s, input_lengths, s2s_attn_mono, m)

line 257:

            _, p = model.predictor(d_en, s, input_lengths, s2s_attn_mono, m)

and line 415:

                bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
                d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
                d, p = model.predictor(d_en, s, input_lengths, s2s_attn_mono, m)
  4. Modify line 347 in train_second.py with the following code to make sure the parameters of the BERT model are updated:
            optimizer.step('bert_encoder')
            optimizer.step('bert')

The PL-BERT model pre-trained on Wikipedia for 1M steps can be downloaded at: PL-BERT link.

The demo on the LJSpeech dataset, along with the pre-modified StyleTTS repo and pre-trained models, can be downloaded here: StyleTTS Link. This zip file contains the code modifications above, the pre-trained PL-BERT model listed above, pre-trained StyleTTS with PL-BERT, pre-trained StyleTTS without PL-BERT, and the pre-trained HifiGAN on LJSpeech from the StyleTTS repo.

pl-bert's Issues

Used in Chinese

For Chinese datasets, do the English phonemes in the paper correspond to Chinese pinyin? Are there any other changes needed?

Use with other TTS?

Hi @yl4579 !
Thanks for the great work!
I'm working on a TTS application based on VITS, and I'd like to improve the naturalness of the speech, which is how I came across this repo.
You mentioned here that you had tested PL-BERT with VITS. Could you please explain a bit more about how to use it with VITS (and maybe FastSpeech 2 as well)?
Thank you.

Slavic languages

Hello!

Thank you, you have done incredible and very useful work for the community.

I would like to train PL-BERT for Slavic languages: Polish, Russian, and Ukrainian.

I don't really understand how much data will be required for training.
Could you please elaborate on this?

Also, I see that you use the espeak phonemizer in your work.
It has a significant drawback: we can't manually control the stress, since it places stress using its own dictionary.
So the final PL-BERT model can't react to stress if we denote it with + or ' symbols before vowel letters?
Is the stress problem solved in some way, or did you not focus on it for English?

Looking forward to your answer, thank you.

Uf-bert

Hello, I have seen this new UltraFastBERT (https://github.com/pbelcak/UltraFastBERT) where they train a BERT in one day on 1 GPU!

I was wondering if UltraFastBERT can be incorporated into PL-BERT to train really fast, and if so, how to do it?

Thanks in advance!

Irregular Loss Pattern; getting "Loss: NaN"

TL;DR:

  • Encountering frequent NaN values, mainly for the Loss, during training with a large JPN dataset (10.5 million rows).
  • No such issues with another, albeit smaller dataset (800,000 rows).
  • Should I ignore the NaN values or revert to the other dataset, considering how much smaller it is?
  • Attempted to disable mixed precision during training, but the issue remains unresolved.

Hi. I'm trying to train PL-BERT on Japanese. I used the entirety of this dataset for that purpose.

Somehow I'm getting a lot of NaN for the Loss, while the Vocab Loss (for the most part; on some rare occasions I also get NaN for this as well) and Token Loss seem to be doing fine. I've also tried using the whole vocab size of the tokenizer in case something was wrong with the way I pruned it, but no, I'm still getting the same result.

If I decrease the log steps (to 10, for instance), I see the Loss hovering around 2 to 3, but then it goes to NaN, and back and forth.

Step [5100/1000000], Loss: nan, Vocab Loss: 1.12363, Token Loss: 2.01707
Step [5200/1000000], Loss: nan, Vocab Loss: 1.15805, Token Loss: 1.97737
Step [5300/1000000], Loss: nan, Vocab Loss: 1.24844, Token Loss: 1.88506
Step [5400/1000000], Loss: nan, Vocab Loss: 1.18666, Token Loss: 1.90820
Step [5500/1000000], Loss: nan, Vocab Loss: 1.33804, Token Loss: 2.04283
Step [5600/1000000], Loss: nan, Vocab Loss: 1.18824, Token Loss: 1.99786
Step [5700/1000000], Loss: nan, Vocab Loss: 0.98660, Token Loss: 1.84933
Step [5800/1000000], Loss: nan, Vocab Loss: 1.19794, Token Loss: 2.06009
Step [5900/1000000], Loss: nan, Vocab Loss: 1.12529, Token Loss: 2.08546
Step [6000/1000000], Loss: nan, Vocab Loss: 1.10970, Token Loss: 1.98083
Step [6100/1000000], Loss: nan, Vocab Loss: nan, Token Loss: 1.96394
Step [6200/1000000], Loss: nan, Vocab Loss: 1.10657, Token Loss: 1.97735

I should say that I'm seeing this pattern only on this particular dataset; I ran a short test session on this one, while keeping everything else constant and unchanged, and it seems to be working fine. Should I simply ignore the NaN or should I switch back to the other dataset? (The problematic dataset is roughly 10.5M rows; if a good model can be trained with 800k rows (the dataset that works fine), then I guess I should do that?)

I have also tried disabling the mixed_precision, but it still did not help.


Here's my config:

log_dir: "Checkpoint"
mixed_precision: "fp16"
data_folder: "/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki"
batch_size: 72
save_interval: 5000
log_interval: 100
num_process: 1 # number of GPUs
num_steps: 1000000

dataset_params:
    tokenizer: "cl-tohoku/bert-base-japanese-v2"
    token_separator: " " # token used for phoneme separator (space)
    token_mask: "M" # token used for phoneme mask (M)
    word_separator: 14 # token used for word separator (<unused9>)
    token_maps: "token_maps.pkl" # token map path
    
    max_mel_length: 512 # max phoneme length
    
    word_mask_prob: 0.15 # probability to mask the entire word
    phoneme_mask_prob: 0.1 # probability to mask each phoneme
    replace_prob: 0.2 # probablity to replace phonemes
    
model_params:
    vocab_size: 178
    hidden_size: 768
    num_attention_heads: 12
    intermediate_size: 2048
    max_position_embeddings: 512
    num_hidden_layers: 12
    dropout: 0.1

I'm training on 2x V100s (32gb each)
Thank you very much.
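
For anyone hitting the same thing, one common stop-gap (not part of the original train.ipynb, and it does not address whatever is producing the NaN in the first place) is to skip optimizer updates whose loss is not finite. The helper below is only a sketch; the argument names mirror the objects used in the training loop (accelerator, optimizer).

# One possible guard: skip the update when the combined loss is NaN/Inf.
import torch

def safe_step(loss, accelerator, optimizer, step):
    # returns True if the update was applied, False if the batch was skipped
    if torch.isfinite(loss):
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        return True
    print(f"Skipping non-finite loss at step {step}")
    return False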

How to preprocess a large text dataset (approximately 80 GB)

I am trying to preprocess a huge text dataset (non-English, approximately 80 GB) using the code in preprocess.ipynb as provided in the repo. To do so, I have split the large dataset into small chunks of approximately 1.26 GB each and tried to preprocess them. However, I am getting errors (segmentation faults, etc.) and am unable to complete the preprocessing for all the chunks. Can anyone suggest anything regarding this?

Reproduce demo samples quality

hello,

Thanks for your contribution!

I want to ask how I can reproduce the audio quality of your out-of-distribution demo samples.

After playing around with it, I figured out that it matters a lot which input audio is used as the reference embedding.

Do you have one that works best in general, or do we need a different one for different scenarios?

Deviation on using some other BERT model

Thank you for your work, @yl4579 !
I had a question: if we use some other kind of BERT with better context modeling, what are your thoughts on the level of improvement we could achieve?
Also, if possible, it would be great if you could share some insight regarding the release of the code.
This looks very promising, though.

Regards,
Pranjalya

tokenizer.decode throwing an error

Kindly see the code snippet below. I was using the flow in preprocess.ipynb, but ran into an error in

# get each token's lower case

lower_tokens = []
for t in tqdm(unique_index):
    word = tokenizer.decode([t])
    if word.lower() != word:
        t = tokenizer.encode([word.lower()])[0]
        lower_tokens.append(t)
    else:
        lower_tokens.append(t)

I could see tokenizer.sym2idx defined, but tokenizer.idx2sym is an empty list.

from transformers import TransfoXLTokenizer
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")

enc = tokenizer.encode("Hello, my dog is cute")
enc
[14049, 2, 617, 3225, 23, 16072]

tokenizer.decode
<bound method PreTrainedTokenizerBase.decode of TransfoXLTokenizer(name_or_path='transfo-xl-wt103', vocab_size=0, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '', 'unk_token': '', 'additional_special_tokens': ['']}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
24: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
3039: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}>
tokenizer.decode(enc)
Traceback (most recent call last):
File "", line 1, in
File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3738, in decode
return self._decode(
File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 1001, in _decode
filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 982, in convert_ids_to_tokens
tokens.append(self._convert_id_to_token(index))
File "/home/home/.local/lib/python3.8/site-packages/transformers/models/transfo_xl/tokenization_transfo_xl.py", line 451, in _convert_id_to_token
return self.idx2sym[idx]
IndexError: list index out of range

Pretrained model for English

Hi @yl4579 ,

Thanks for your great work! Can you please share the pretrained model (checkpoint) for English? I'd like to use it in StyleTTS and VITS.

Is this better than PnG BERT?

@yl4579 Hello author,
1. I just wanted to quickly ask whether PL-BERT is actually better than PnG BERT?
2. Also, PnG BERT pushed NAT to achieve the ground-truth MOS of 4.47 out of 4.47,
but PL-BERT didn't push StyleTTS to that level. Is it because StyleTTS is already inferior to NAT?
Thanks in advance.

About word alignment information

Hi, @yl4579 !

Thanks for the exciting work!
I have a question regarding inputs for your model.

In PnG BERT, the authors used the word position embedding as word alignment information. Also, they found that the information helped to improve the model's performance on MLM, G2P, and P2G tasks.

Predicting graphemes without considering word position could be challenging for the model (of course, you already proved that the model is stable and effective in your paper), but, in my understanding, your model only uses position embeddings for phoneme tokens.

So, my questions are below:

  • Why has word position embedding not been used in your model?
  • Did you run any experiments to see whether using word position embedding makes a difference?

Thanks!

No shards being saved

I'm having trouble running the preprocessing Jupyter notebook you provided. I was trying to create PL-BERT for the Slovak language, but even when I run the code as provided, it only works up to the shard processing. Once there, all I can see in the output is "Processing shard XY ..." and then a lot of progress bars like this: Map: 0% | 0/64587 [00:02<?, ? examples/s]

No error is ever raised and all shards "complete" (I tried 100 as a test). But there is nothing in the ./wiki_phoneme folder at all. I have no idea what is wrong. The Wiki dataset loads without any problems.

Have you tested PL-BERT with FastSpeech 2?

Hi @yl4579
The MP-BERT paper publishes results with FS2, so to make that comparison, have you integrated PL-BERT with FastSpeech 2?
If not, then I will test that from my side, as I am interested in how PL-BERT behaves compared to MP-BERT with FastSpeech 2.
Also, let me know when the code is available if you are planning to open source it. Otherwise, I will try to implement the paper from my side.

Thanks

What is the effect of punctuation in the text on PL-BERT?

Hi, yl
I did not see any description in the paper about punctuation in the text. How much did it contribute to the success of PL-BERT? Can PL-BERT help synthesize long sentences with nested subordinate clauses and no internal punctuation well?

I have another question that may be out of scope. Why did this paper choose StyleTTS as the backbone of the research, given that it needs a style reference audio, which seems unusual for common TTS products?

Proper format for dataset

I want to train my own PL-BERT model, but I am unsure what format the dataset needs to be in. Could you please shed some light on this? Thanks!

RuntimeError: CUDA error: device-side assert triggered

Hi, the training of PL-BERT has terminated a few times with the following error. What might be the underlying cause? Is something wrong inside the dataset? It is difficult to debug because the error only happens every several hours or once a day.

By the way, has anyone tried a regular pretrained language model (like BERT/ALBERT/RoBERTa) as an alternative encoder for StyleTTS? They only mask/replace tokens, whereas PL-BERT masks/replaces individual phonemes as well.

Step [649200/1000000], Loss: 2.91462, Vocab Loss: 2.15879, Token Loss: 1.49617
Step [649400/1000000], Loss: 2.95243, Vocab Loss: 2.51928, Token Loss: 1.62138
Step [649600/1000000], Loss: 2.91280, Vocab Loss: 0.92345, Token Loss: 0.68771
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
[W CUDAGuardImpl.h:115] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/launchers.py", line 175, in notebook_launcher
start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/launch.py", line 554, in call
self.launcher(*args)
File "/home/tts/speechlab/repo/PL-BERT-main/train.py", line 119, in train
for _, batch in enumerate(train_loader):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/data_loader.py", line 460, in iter
current_batch = send_to_device(current_batch, self.device)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 152, in
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 167, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
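
That particular assertion (t >= 0 && t < n_classes) usually fires when a target index falls outside a classifier's output dimension, so one hedged way to narrow it down is to scan the targets on CPU before training. The helper below is a sketch; the commented usage names (words, labels, num_vocab, num_tokens) are assumptions mirroring the batch layout used in train.ipynb.

# Check that every target index fits its classification head before it reaches nll_loss.
import torch

def check_targets(targets, num_classes, name="targets"):
    bad = (targets < 0) | (targets >= num_classes)
    if bad.any():
        print(f"{name}: {int(bad.sum())} indices outside [0, {num_classes})")
        return False
    return True

# e.g. inside the data loop, before the losses are computed (names assumed from train.ipynb):
# check_targets(words, num_vocab, "word labels")
# check_targets(labels, num_tokens, "phoneme labels")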

UnicodeDecodeError while processing shards

Hello
I am trying to reproduce your preprocessing and training with the English Wikipedia dataset for a better understanding.

But I get a decoding error while phonemizing the shards:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 6: invalid start byte

Have you encountered the same problem?

I've tried this, but it doesn't help:
phonemize(t['text'].encode('utf-8', errors='ignore').decode('utf-8'), global_phonemizer, tokenizer)

masked_index is empty in train_loader

Thank you for your great work!

I am trying to train a PL-BERT for a new language, but I'm not sure if this is a bug: masked_index is always empty when I enumerate the train_loader. I've seen that masked_index is populated before it is re-initialized by the following code:

masked_index = []

Is masked_index only meant to be used when mel_length > self.max_mel_length? To me, masked_index = [] should be placed under the if statement at line 90.
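
To make the suspected ordering issue concrete, here is a toy illustration (hypothetical names only, not the repo's dataloader code):

import random

def build_item(phonemes, max_len, mask_prob=0.15):
    # Toy example: an unconditional reset after populating the list discards it.
    masked_index = [i for i in range(len(phonemes)) if random.random() < mask_prob]
    masked_index = []          # <-- this reset throws away the indices collected above
    if len(phonemes) > max_len:
        masked_index = []      # the reset arguably belongs only inside this branch
        phonemes = phonemes[:max_len]
        # ... re-select masked_index for the cropped sequence here
    return phonemes, masked_index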

RuntimeError: CUDA error: device-side assert triggered on criterion

I saw issues about this error. #28
But I don't know how to solve this error.

I don't know how to write code that skips the error.
Can you tell me the solution?

The error occurred in this code:
`

accelerator.print('Start training...')

running_loss = 0

for _, batch in enumerate(train_loader):        
    curr_steps += 1
    
    words, labels, phonemes, input_lengths, masked_indices = batch
    text_mask = length_to_mask(torch.Tensor(input_lengths))# .to(device)
    
    tokens_pred, words_pred = bert(phonemes, attention_mask=(~text_mask).int())
    
    loss_vocab = 0
    for _s2s_pred, _text_input, _text_length, _masked_indices in zip(words_pred, words, input_lengths, masked_indices):
        loss_vocab += criterion(_s2s_pred[:_text_length], _text_input[:_text_length]) # Here!!
    loss_vocab /= words.size(0)

`

C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed.
C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "C:\Users\user_\Desktop\PL-BERT-KO\train_infer.py", line 198, in
notebook_launcher(train, args=(), num_processes=1)
File "C:\Users\user_\anaconda3\envs\PL-BERT-KO\lib\site-packages\accelerate\launchers.py", line 207, in notebook_launcher
function(*args)
File "C:\Users\user_\Desktop\PL-BERT-KO\train_infer.py", line 147, in train
loss_vocab += criterion(_s2s_pred[:_text_length], _text_input[:_text_length])
File "C:\Users\user_\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\user_\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\modules\loss.py", line 1179, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "C:\Users\user\anaconda3\envs\PL-BERT-KO\lib\site-packages\torch\nn\functional.py", line 3053, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
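
As a stop-gap only (the proper fix is making sure every label index is smaller than the corresponding classifier size), the quoted loss loop could skip out-of-range samples instead of crashing. This is a hedged modification of the snippet above, reusing its variable names:

# Possible modification of the loss loop above: guard each sample before the criterion call.
loss_vocab = 0
valid = 0
for _s2s_pred, _text_input, _text_length, _masked_indices in zip(words_pred, words, input_lengths, masked_indices):
    target = _text_input[:_text_length]
    if len(target) > 0 and target.min() >= 0 and target.max() < _s2s_pred.size(-1):
        loss_vocab += criterion(_s2s_pred[:_text_length], target)
        valid += 1
if valid > 0:
    loss_vocab /= valid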

What is a reasonable phoneme prediction accuracy?

Thanks for your contribution!

I am currently toying with my own implementation of phoneme-level BERT for Chinese, and I have noticed a gap between
the phoneme prediction accuracy over all tokens (top-1 correct predictions / n_tokens, ~88%) and over tokens whose inputs were randomly masked (top-1 correct predictions of masked tokens / n_masks, ~25%).

Since the published checkpoint did not include a language modeling head for phonemes, I trained a simple linear prediction layer on top of your published model using 10,000 lines from Wikipedia and used another 10,000 lines for testing (probably all part of your training data, though). By simply masking 15% of the words, I calculated an overall accuracy of 87.42% and a masked accuracy of 24.86%, which is basically the same as my Chinese model, albeit mine used a different phone set.

  1. I am curious if these numbers are roughly consistent with your findings.
  2. Which accuracy does the grapheme prediction accuracy reported in Table 3 of the paper refer to?
  3. What is your opinion on the relation of these prediction accuracies to the actual downstream task performance (TTS)?

I'm very much looking forward to your kind reply!

When will the code be publicly available?

Thanks for your great work. Just wondering, is there any plan to make the code public?
Also, before the code is publicly available, are there any suggestions for pre-training BERT for TTS tasks?

Question regarding train.ipynb - num_tokens

see: https://github.com/yl4579/PL-BERT/blob/main/train.ipynb

bert = AlbertModel(albert_base_configuration)
bert = MultiTaskModel(bert, 
                      num_vocab=max([m['token'] for m in token_maps.values()]), 
                      num_tokens=config['model_params']['vocab_size'],
                      hidden_size=config['model_params']['hidden_size'])

num_tokens=config['model_params']['vocab_size'] is 178. However, if you check, len(dicts) = 177:

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)

letters = list(_letters) + list(_letters_ipa)

dicts = {}
for i in range(len((symbols))):
    dicts[symbols[i]] = i

So we have only 177 unique phoneme characters, based on the length of the dict. We would get a length of 178 if we counted the duplicated sign in _letters_ipa twice. Maybe there is a special phoneme that is not stored inside the dicts that I am missing here? So the question is, why do we have num_tokens = 178 and not 177?

Thank you for your effort.
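
One quick way to see where the extra count comes from is to compare the raw list length with the number of unique symbols, building on the definitions quoted above:

# Check whether any character appears more than once in the assembled symbol list.
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)
print(len(symbols), len(set(symbols)))                 # raw length vs. unique count
print({s for s in symbols if symbols.count(s) > 1})    # which symbols are duplicated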

Token indices sequence length is longer than the specified maximum sequence length

When trying to do the preprocessing on a Malayalam dataset, I saw the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (4377 > 512). Running this sequence through the model will result in indexing errors

This is quite likely happening because the tokenizer 'bert-base-multilingual-cased' mostly produces 2-character or so tokens:

>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
>>> text = 'ദേശീയോദ്യാനങ്ങൾ സംരക്ഷിതപ്രദേശങ്ങളാണ്.'
>>> words = tokenizer.tokenize(text)
>>> print(words)
['ദ', '##േശീയ', '##ോ', '##ദ്', '##യ', '##ാന', '##ങ്ങൾ', 'സ', '##ം', '##ര', '##ക്', '##ഷ', '##ി', '##ത', '##പ്', '##ര', '##ദ', '##േ', '##ശ', '##ങ്', '##ങ', '##ളാണ്', '.']

Which parameter in the config should be changed to handle this?

max_mel_length: 512 # max phoneme length
or

max_position_embeddings: 512
?

The key error

May I ask why this error is raised? I used the default code and dataset.
[screenshot of the error]

Possible bug in train.ipynb - num_vocab

see: https://github.com/yl4579/PL-BERT/blob/main/train.ipynb

bert = AlbertModel(albert_base_configuration)
bert = MultiTaskModel(bert, 
                      num_vocab=max([m['token'] for m in token_maps.values()]), 
                      num_tokens=config['model_params']['vocab_size'],
                      hidden_size=config['model_params']['hidden_size'])

When you get the maximum embedding scalar value from the list [m['token'] for m in token_maps.values()], you might need to add 1 to it.

Let me explain why I think so. Imagine we had 3 embedding scalar values or embedding ids, [0, 1, 2]: the length of this array is 3, but the max of the list is 2. From what I can tell, num_vocab should be chosen as 3 and not 2, so len(list_of_embedding_ids), or equivalently max(list_of_embedding_ids) + 1, should be preferred over max(list_of_embedding_ids).
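
The off-by-one in miniature (the one-line change at the end is a hypothetical suggestion, not code from the repo):

token_ids = [0, 1, 2]                   # three distinct embedding ids
num_vocab_too_small = max(token_ids)    # 2 -- cannot cover a classifier over ids 0..2
num_vocab_correct = max(token_ids) + 1  # 3 -- covers all ids

# hypothetical fix in train.ipynb:
# num_vocab = max([m['token'] for m in token_maps.values()]) + 1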

ZeroDivisionError: division by zero

@yl4579 first of all, thank you for your great work!
I'm unable to train the model from scratch using the provided notebook.
Kindly provide some hints on this issue.

Start training...
Traceback (most recent call last):
  File "train.py", line 172, in <module>
    notebook_launcher(train, args=(), num_processes=1, use_port=33389)
  File "/lib/python3.8/site-packages/accelerate/launchers.py", line 156, in notebook_launcher
    function(*args)
  File "train.py", line 139, in train
    loss_token /= sizes
ZeroDivisionError: division by zero
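
A minimal guard for the failing line is sketched below (a stop-gap only; the more important question is why sizes ends up as 0, which may mean no masked phonemes were produced for that batch, compare the masked_index issue above):

# Skip the normalization when there is nothing to normalize over.
if sizes > 0:
    loss_token /= sizes
else:
    loss_token = 0  # nothing was masked in this batch, so it contributes no token loss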

PL-BERT on chinese

Hello, I would like to ask if the pre-trained PL-BERT model here can be used directly for Chinese TTS.

Vocab Loss = 0.0

I'm trying to train PL-BERT for Vietnamese (using a multilingual-BERT-based model with the wiki-vi dataset), but the Vocab Loss has been 0.0 from the very first step. Is that okay? @yl4579
Step [19920/1000000], Loss: 0.33009, Vocab Loss: 0.00000, Token Loss: 0.27358
Step [19930/1000000], Loss: 0.34967, Vocab Loss: 0.00000, Token Loss: 0.21050
Step [19940/1000000], Loss: 0.31899, Vocab Loss: 0.00000, Token Loss: 0.27819
Step [19950/1000000], Loss: 0.31953, Vocab Loss: 0.00000, Token Loss: 0.27369
Step [19960/1000000], Loss: 0.32448, Vocab Loss: 0.00000, Token Loss: 0.37215

Preprocessing code for Chinese

Do you have any suggestions for Chinese data preprocessing?
For example, text normalization, g2p, etc.
From your experience, will the accuracy of the g2p model have a great impact on the model's performance?

Requesting quick help with a few queries at hand

Went through a few of your answers in the issues; I would like to know:

  1. whether your suggested modifications for the VITS and FastSpeech 2 models apply only to inference or to finetuning too?
  2. In the readme you mention some changes to be made in train_second.py of StyleTTS if one wishes to include PL-BERT. Does that mean train_first.py can be skipped altogether, or does stage one have to be done without PL-BERT? I am only interested in finetuning the pre-trained model.
  3. The first modification, https://github.com/yl4579/StyleTTS/blob/main/models.py#L683, points to the line where the discriminator is instantiated, which becomes an argument for Munch(), but the replacement code doesn't instantiate the discriminator anywhere.
  • I suppose either the line pointed to here has to be kept as is, or the replacement Munch instantiation shouldn't include the discriminator. What's the correct implementation, if you'd like to tell?
  4. Any final comments on comparing StyleTTS directly with Tortoise in terms of inference quality and speed?

Modify the dataloader for Chinese PL-BERT

I encountered a problem when modifying the dataloader. What should the token_map be for Chinese data in the dataloader? I cannot understand what the "token" in the token_map used to process English data means, or what the corresponding "token" for Chinese would be. If possible, could you please share the code of the dataloader you use to process Chinese data?

Possible Bug in Code

Hi, I was trying to recreate some of the results and got the following error:

......../1_Code/speech_Sytn/Repos/Pl_BERT/phonemize.py", line 57, in phonemize
    if next_phoneme[0] in 'ɪiʊuɔɛeəɜoæʌɑaɐ':
IndexError: string index out of range

I modified it like this:

if word == "the": # change the pronunciations before voewls
    if i_Pos_Sent < len(words):
        next_phoneme = phonemes_bad[i_Pos_Sent + 1].replace('ˈ', '').replace('ˌ', '')
        
        # need to make sure that next_phoneme does exist and then run the actual command
        if next_phoneme and next_phoneme[0] in 'ɪiʊuɔɛeəɜoæʌɑaɐ':
            phoneme = "ðɪ"

Basically, I just added "next_phoneme and" to the if condition.

number ranges not converted correctly

For a text which contains a number range,

text = 'hello (1200 - 1230)'
out = normalize_text(text)

it gets converted to
hello (twelve million one thousand two hundred thirty)

A fix for this case is at #27.

the training restarts every 500 steps

Hello
thank you so much for your great work!
Not sure if it is a bug: I converted train.ipynb to train.py and launched training on Spanish. However, it restarts the training every 500 steps.

Implement details about mask strategy

Hi, I'm curious about the MLM task's masking strategy in your paper (section 2.2.2)

"When a grapheme is selected, its phonemes tokens are replaced with a MASK token 80% of the time, are replaced with random phonemes token 10% of the time, and stay unchanged 10% of the time."

When you replace with random phonemes, will they be related to a real word? For example, would a nonsensical phoneme sequence such as 'abcde' be possible?
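
For reference, the quoted 80/10/10 scheme in a standard BERT-style implementation looks roughly like the sketch below (illustrative only, not the repo's dataloader). Note that in such an implementation the random branch draws each phoneme independently, so the result need not spell a real word.

# Illustrative sketch of BERT-style corruption for the phonemes of one selected grapheme.
import random

def corrupt_word_phonemes(phoneme_ids, mask_id, vocab_size):
    r = random.random()
    if r < 0.8:
        return [mask_id] * len(phoneme_ids)                          # replace with MASK
    elif r < 0.9:
        return [random.randrange(vocab_size) for _ in phoneme_ids]   # random phoneme ids
    else:
        return list(phoneme_ids)                                     # keep unchanged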
