This repository modifies the original StyleTTS-VC model for Japanese.
- Python >= 3.7
- Clone this repository:
```bash
git clone https://github.com/QuyAnh2005/StyleTTS-VC-Japanese.git
cd StyleTTS-VC-Japanese
```
- Install python requirements:
```bash
pip install -r requirements.txt
```
- Dataset: download the dataset and place it in the `dataset` folder.
The pretrained text aligner and pitch extractor models are provided in the `Utils` folder. Both models were trained on melspectrograms preprocessed with `meldataset.py`.
You can edit `meldataset.py` to use your own melspectrogram preprocessing, but the provided pretrained models will then no longer work; you will need to train your own text aligner and pitch extractor with the new preprocessing.
The code for training a new text aligner model is available here, and the code for training new pitch extractor models is available here.
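To illustrate what melspectrogram preprocessing involves, here is a minimal sketch of a triangular mel filterbank in plain Python. The parameters shown (24 kHz sample rate, 80 mel bands, 2048-point FFT) are common defaults and only an assumption; they are not necessarily what `meldataset.py` uses.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=24000, n_fft=2048, n_mels=80, fmin=0.0, fmax=None):
    """Triangular mel filterbank: each row is one filter over FFT bins 0..n_fft//2."""
    fmax = fmax if fmax is not None else sr / 2.0
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    # n_mels + 2 equally spaced points on the mel scale, mapped back to FFT bins
    mels = [lo + (hi - lo) * i / (n_mels + 1) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sr) for m in mels]
    n_bins = n_fft // 2 + 1
    fb = [[0.0] * n_bins for _ in range(n_mels)]
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope
            fb[i - 1][k] = (k - l) / max(c - l, 1)
        for k in range(c, min(r, n_bins)):         # falling slope
            fb[i - 1][k] = (r - k) / max(r - c, 1)
    return fb
```

The key point for this repo is that the pretrained aligner and pitch extractor are tied to one specific set of such parameters: changing the filterbank, FFT size, or hop length changes the input distribution these models were trained on.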
The data list format needs to be `filename.wav|transcription|speaker`; see val_list.txt as an example. The speaker information is needed in order to perform speaker-dependent adversarial training.
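A small validator for this data list format can catch malformed lines before training starts. This is a sketch, not part of the repo; `parse_data_list` is a hypothetical helper operating on lines of text in the `filename.wav|transcription|speaker` format described above.

```python
def parse_data_list(lines):
    """Parse 'filename.wav|transcription|speaker' lines into dicts.

    Blank lines are skipped; any line without exactly three
    pipe-separated fields raises a ValueError.
    """
    entries = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != 3:
            raise ValueError(
                f"line {lineno}: expected wav|transcription|speaker, "
                f"got {len(parts)} field(s)"
            )
        entries.append(dict(zip(("wav", "text", "speaker"), parts)))
    return entries
```

Running this over your train and validation lists before kicking off training is a cheap way to avoid a crash many steps in.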
To convert the data into phonemes before training, run:
```bash
python preprocess.py
```
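Conceptually, this step rewrites each data-list line so the transcription field holds phonemes instead of raw text. The sketch below shows that transformation; `g2p` here is a hypothetical placeholder for a real Japanese grapheme-to-phoneme converter (e.g. pyopenjtalk's `g2p`), and the actual `preprocess.py` may work differently.

```python
def g2p(text):
    # Hypothetical stand-in for a Japanese grapheme-to-phoneme converter;
    # here it merely space-separates characters so the sketch is runnable.
    return " ".join(text)

def phonemize_lines(lines):
    """Turn 'wav|transcription|speaker' lines into 'wav|phonemes|speaker'."""
    out = []
    for line in lines:
        wav, text, speaker = line.rstrip("\n").split("|")
        out.append(f"{wav}|{g2p(text)}|{speaker}")
    return out
```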
First stage training:
```bash
python train_first.py --config_path ./Configs/config.yml
```
Second stage training:
```bash
python train_second.py --config_path ./Configs/config.yml
```
Pretrained models are available here.
Please refer to inference.ipynb for details.
The pretrained StyleTTS-VC on the Japanese dataset and Hifi-GAN on the LibriTTS corpus at 24 kHz can be downloaded at the StyleTTS-VC Link and Hifi-GAN Link.
Please unzip them to `Models` and `Vocoder` respectively and run each cell in the notebook.
Run `app.py` to see a demo built with Gradio.