auxiliaryasr's Introduction

AuxiliaryASR

This repo contains the training code for Phoneme-level ASR for Voice Conversion (VC) and TTS (Text-Mel Alignment) used in StarGANv2-VC and StyleTTS.

Pre-requisites

  1. Python >= 3.7
  2. Clone this repository:
git clone https://github.com/yl4579/AuxiliaryASR.git
cd AuxiliaryASR
  3. Install Python requirements:
pip install SoundFile torchaudio torch jiwer pyyaml click matplotlib g2p_en librosa
  4. Prepare your own dataset and put train_list.txt and val_list.txt in the Data folder (see the Training section for more details).

Training

python train.py --config_path ./Configs/config.yml

Please specify the training and validation data in the config.yml file. The data list format needs to be filename.wav|label|speaker_number; see train_list.txt as an example (a subset of LJSpeech). Note that speaker_number can just be 0 for ASR, but it is useful to set a meaningful number for TTS training (if you need to use this repo for StyleTTS).
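
For illustration, a couple of made-up lines in this format (the paths and transcripts below are hypothetical, not taken from the bundled train_list.txt):

```
Data/wavs/sample_0001.wav|this is an example transcript|0
Data/wavs/sample_0002.wav|another example transcript|0
```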

Checkpoints and Tensorboard logs will be saved at log_dir. To speed up training, you may want to make batch_size as large as your GPU RAM can take. However, please note that batch_size = 64 will take around 10 GB of GPU RAM.
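
For reference, a hypothetical excerpt of Configs/config.yml; only log_dir, batch_size, and n_token are named in this README, so the remaining key names and all values are assumptions:

```yaml
log_dir: "logs/asr"                  # checkpoints and Tensorboard logs land here
batch_size: 64                       # ~10 GB of GPU RAM at this size
n_token: 80                          # must match the vocabulary in word_index_dict.txt
train_data: "Data/train_list.txt"    # assumed key name
val_data: "Data/val_list.txt"        # assumed key name
```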

Languages

This repo is set up for English with the g2p_en package, but you can train it on other languages. If you would like to train on datasets in other languages, you will need to modify the meldataset.py file (L86-93) with your own phonemizer. You will also need to change the vocabulary file (word_index_dict.txt) and set n_token in config.yml to the new number of tokens. A recommended phonemizer for other languages is phonemizer.
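
As an illustration, here is a minimal sketch of what a phonemizer-based replacement might look like; the helper name, the language code, and how it plugs into meldataset.py are assumptions, not the repo's actual code (the espeak backend also requires espeak-ng to be installed):

```python
# Hypothetical text-to-phoneme helper built on the phonemizer package.
from phonemizer import phonemize

def text_to_phonemes(text, language="de"):
    # Returns a phoneme string; each symbol would then be mapped to an id
    # via your updated word_index_dict.txt (with a matching n_token).
    return phonemize(text, language=language, backend="espeak", strip=True)

print(text_to_phonemes("Guten Morgen"))
```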

Acknowledgement

The author would like to thank @tosaka-m for his great repository and valuable discussions.

auxiliaryasr's People

Contributors

woters, yl4579

auxiliaryasr's Issues

Why is " " used as the blank in the CTCLoss?

Hey @yl4579, thank you for your great work on this (and StyleTTS).

I was wondering if there was a reason for using " " as the blank token in the CTCLoss instead of something distinct from what can be returned by G2p, as is suggested here? I was thinking of using something like id 80 if appending onto the vocab defined here.

I was also wondering whether this would affect the downstream training of StyleTTS much, or if the aligner just has to be a "good enough" starting point?

Thanks!
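
For context, a minimal sketch of what a dedicated blank would look like with PyTorch's CTCLoss; the id 80 and the extra output class are assumptions taken from the question above, not the repo's actual configuration:

```python
import torch.nn as nn

BLANK_ID = 80  # assumed: appended one past the existing vocabulary
ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)
# The model's output layer would then need n_token + 1 classes so that
# index 80 exists in the log-probabilities passed to ctc_loss.
```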

Error Message: RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input [1, 65621, 2]

@yl4579
I added an extra line to the train_list.txt file and got this error message:

python train.py --config_path ./Configs/config.yml
{'max_lr': 0.0005, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 72}
ctc_linear.2.linear_layer.weight does not have same shape
torch.Size([178, 256]) torch.Size([80, 256])
ctc_linear.2.linear_layer.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.embedding.weight does not have same shape
torch.Size([178, 512]) torch.Size([80, 256])
asr_s2s.project_to_n_symbols.weight does not have same shape
torch.Size([178, 128]) torch.Size([80, 128])
asr_s2s.project_to_n_symbols.bias does not have same shape
torch.Size([178]) torch.Size([80])
asr_s2s.decoder_rnn.weight_ih does not have same shape
torch.Size([512, 640]) torch.Size([512, 384])

Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/train.py", line 116, in
main()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/bud/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/bud/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in iter
for obj in iterable:
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/bud/AuxiliaryASR/meldataset.py", line 65, in getitem
mel_tensor = self.to_melspec(wave_tensor)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 619, in forward
specgram = self.spectrogram(waveform)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/transforms/_transforms.py", line 110, in forward
return F.spectrogram(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 126, in spectrogram
spec_f = torch.stft(
File "/home/bud/AuxiliaryASR/venv/lib/python3.10/site-packages/torch/functional.py", line 648, in stft
input = F.pad(input.view(extended_shape), [pad, pad], pad_mode)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (1024, 1024) at dimension 2 of input [1, 65621, 2]
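
A hedged reading of the error: the input shape [1, 65621, 2] ends in 2 channels, i.e. the added wav appears to be stereo, while torch.stft pads along the last dimension and expects it to be time. A minimal sketch of downmixing such a file to mono before training (the paths are hypothetical):

```python
import numpy as np
import soundfile as sf

wave, sr = sf.read("added_file.wav")   # hypothetical path
if wave.ndim > 1:                      # (frames, channels) means multichannel
    wave = wave.mean(axis=1)           # average the channels down to mono
sf.write("added_file_mono.wav", wave.astype(np.float32), sr)
```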

How much data did you use to train the model?

Hi. Thank you for your great work!
I was wondering how much data you used to train the model, and whether you augmented the data?
I notice that you put the LJSpeech dataset here as an example, but the sample rate of LJSpeech is 22050 Hz, so I think it is not the data you actually used when training the model...?

Multiple GPU training and changing to librosa mel spec?

Hello again. Is there multi-GPU training for this repo? Also, do you have any training logs I can compare with? Thanks!

I'll also post this question here as it's for this repo...

Should I convert from torchaudio to librosa in AuxiliaryASR and PitchExtractor, or just leave it with torchaudio?
Something like this?
ChatGPT converted:

```python
import os.path as osp

import librosa
import numpy as np
import torch

DEFAULT_DICT_PATH = osp.join(osp.dirname(__file__), 'word_index_dict.txt')
SPECT_PARAMS = {
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}
MEL_PARAMS = {
    "n_mels": 80,
    "n_fft": 1024,
    "win_length": 1024,
    "hop_length": 256
}

class MelDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 dict_path=DEFAULT_DICT_PATH,
                 sr=24000
                 ):
        _data_list = [l.strip().split('|') for l in data_list]
        self.data_list = [data if len(data) == 3 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner(dict_path)
        self.sr = sr

        # librosa.feature.melspectrogram is a plain function, not a transform
        # object, so wrap it in a callable instead of invoking it here
        self.to_melspec = lambda wave: librosa.feature.melspectrogram(
            y=wave, sr=self.sr, **MEL_PARAMS)
        self.mean, self.std = -4, 4
```

About the loss

Hi. Could you kindly share the training loss of the model (maybe a TensorBoard screenshot)? Thank you very much.

Getting an error during training

[train]: 24%|██▍ | 16/66 [00:04<00:15, 3.20it/s]
Traceback (most recent call last):
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 116, in
main()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/mike/PycharmProjects/AuxiliaryASR/train.py", line 98, in main
train_results = trainer._train_epoch()
File "/home/mike/PycharmProjects/AuxiliaryASR/trainer.py", line 186, in _train_epoch
for train_steps_per_epoch, batch in enumerate(tqdm(self.train_dataloader, desc="[train]"), 1):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/tqdm/std.py", line 1195, in iter
for obj in iterable:
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
return self._process_data(data)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/anaconda3/envs/asr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 60, in getitem
wave, text_tensor, speaker_id = self._load_tensor(data)
File "/home/mike/PycharmProjects/AuxiliaryASR/meldataset.py", line 78, in _load_tensor
speaker_id = int(speaker_id)
ValueError: invalid literal for int() with base 10: ''

my train_list:
/media/mike/yys/data_asr/SSB00800056.wav|wo men can jia guo xu duo zhong da huo dong de biao yan|0
/media/mike/yys/data_asr/SSB00050001.wav|guang zhou nv da xue sheng deng shan shi lian si tian jing fang zhao dao yi si nv shi|0
/media/mike/yys/data_asr/SSB00050002.wav|zhun zhong ke xue gui lv de yao qiu|0
/media/mike/yys/data_asr/SSB00050003.wav|qi lu wu ren shou piao|0
..
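
A hedged guess at the cause: int('') failing suggests a blank or malformed line in the list (for example a trailing empty line) is being split on '|'. A minimal sketch of filtering such lines before building the dataset; the path and variable names are assumptions:

```python
# Drop blank/malformed lines so the speaker field is never empty.
with open("Data/train_list.txt", encoding="utf-8") as f:  # hypothetical path
    lines = [l.strip() for l in f]
data_list = [l for l in lines
             if l.count("|") == 2 and l.rsplit("|", 1)[1].strip().isdigit()]
```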

How to train for Mandarin ASR?

If I want to train Mandarin ASR, my dict is like this: dict.txt, and I use g2pM as the phonemizer, and my train.txt is like this:
SSB06930002.wav | 武 wu3 术 shu4 始 shi3 终 zhong1 被 bei4 看 kan4 作 zuo4 我 wo3 国 guo2 的 de5 国 guo2 粹 cui4 | 0
I don't know how to change my format (dict.txt / train.txt) to suit this project, or what to change in meldataset.py. Can you help me? Thank you very much.
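
For illustration, a minimal sketch of generating such train list lines with g2pM; the paths, the helper name, and the choice to join pinyin tokens with spaces are assumptions, not this repo's required format:

```python
# Hypothetical train_list.txt line builder for Mandarin using g2pM.
from g2pM import G2pM

g2p = G2pM()

def make_line(wav_path, sentence, speaker_id=0):
    # g2pM returns one pinyin token per character, e.g. ['wu3', 'shu4', ...]
    pinyin = g2p(sentence, tone=True, char_split=False)
    return f"{wav_path}|{' '.join(pinyin)}|{speaker_id}"

print(make_line("SSB06930002.wav", "武术始终被看作我国的国粹"))
```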
