gallilmaimon / dissc

Official repository for "Speaking Style Conversion With Discrete Self-Supervised Units" (EMNLP 2023). https://arxiv.org/abs/2212.09730

License: MIT License

Python 100.00%
acoustics audio-processing self-supervised-learning voice-conversion

dissc's Introduction

Speaking Style Conversion With Discrete Self-Supervised Units

Official implementation of "Speaking Style Conversion With Discrete Self-Supervised Units", accepted at EMNLP 2023 (Findings).

DISSC architecture overview

Abstract: Voice conversion is the task of making a spoken utterance by one speaker sound as if uttered by a different speaker, while keeping other aspects like content the same. Existing methods focus primarily on spectral features like timbre, but ignore the unique speaking style of people which often impacts prosody. In this study we introduce a method for converting not only the timbre, but also the rhythm and pitch changes to those of the target speaker. In addition, we do so in the many-to-many setting with no paired data. We use pretrained, self-supervised, discrete units which make our approach extremely light-weight. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and show that our approach outperforms existing methods.

Quick Links

  • Setup
  • Infer
  • Evaluation
  • Train
  • Reference

Setup

We present the setup requirements for running all parts of the code, including the evaluation metrics and all datasets. This adds constraints which might not be mandatory otherwise, such as requiring a Conda installation because the Montreal Forced Aligner depends on Kaldi. You can also install only the requirements and download only the datasets you need.
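
For reference, the Montreal Forced Aligner (needed only for the alignment-based evaluation metrics) is typically installed from conda-forge; the exact command below may differ between MFA versions, so consult the MFA documentation if it fails:

# install the Montreal Forced Aligner from conda-forge (only needed for the alignment-based metrics)
conda install -c conda-forge montreal-forced-aligner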

Environment

Create a conda environment with Python 3.8 and install all the dependencies:

conda create -n dissc python=3.8
conda activate dissc

# download code
git clone https://github.com/gallilmaimon/DISSC.git

# install textlesslib, based on https://github.com/facebookresearch/textlesslib#installation
git clone https://github.com/facebookresearch/textlesslib.git
cd textlesslib
pip install -e .
pip install git+https://github.com/facebookresearch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8

# install requirements
cd ../DISSC
conda config --append channels conda-forge  # add needed channels
conda install --file requirements.txt

Other versions may be compatible as well, but we only tested this exact setup.
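
As an optional sanity check, verify that the core dependencies import correctly and that a GPU is visible:

# check that torch, fairseq and textlesslib are importable and that CUDA is available
python3 -c "import torch, fairseq, textless; print(torch.__version__, torch.cuda.is_available())"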

Data

We describe here how to download, preprocess and parse VCTK, Emotional Speech Dataset (ESD) and our synthetic dataset - Syn_VCTK.

For VCTK

  1. Download the data from here and extract the audio to the data/VCTK/wav_orig folder and the text to the data/VCTK/txt folder.
  2. Preprocess the audio (downsample it from 48 kHz to 16 kHz and pad). One could also trim silences to potentially improve results, but we do not do so. A quick sanity check of the output is shown after the command.
python3 data/preprocess.py --srcdir data/VCTK/wav_orig --outdir data/VCTK/wav --pad --postfix mic2.flac
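
To spot-check the preprocessing, you can print the sample rate of one of the converted files. This sketch assumes the soundfile package is available in your environment; replace the placeholder path with an actual file under data/VCTK/wav:

# confirm the resampled audio is 16 kHz (the path below is a placeholder)
python3 -c "import soundfile as sf; print(sf.info('data/VCTK/wav/<speaker>/<utterance>').samplerate)"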

For ESD

  1. Download the preprocessed data from here to the data/ESD folder.
  2. If you want to preprocess this dataset from scratch, for instance to select different emotions for each speaker, download the entire dataset from here.

For Syn_VCTK

  1. Download the preprocessed data from here to the data/Syn_VCTK folder.

Infer

This section discusses how to perform speaking style conversion on a given sample with a trained model (in this case syn_vctk). We show how to convert a sample from an unseen speaker (the any-to-many setup), using a sample we recorded ourselves. For converting a subset of data from known speakers (such as the validation set), see the evaluation section.

Any-to-Many

  1. Preprocess the recording, resampling to 16 kHz if needed and padding as needed:
python3 data/preprocess.py --srcdir data/unseen/wav_orig --outdir data/unseen/wav --pad --postfix .wav
  2. Encode the sample with HuBERT:
python3 data/encode.py --base_dir data/unseen/wav --out_file data/unseen/hubert100/encoded.txt --device cuda:0
  3. Download the pretrained models from here to checkpoints/syn_vctk in the current file structure, and all files from here to sr/checkpoints/vctk_hubert.

  4. Convert the prosody - rhythm (--pred_len option) and pitch contour (--pred_pitch option) using DISSC:

python3 infer.py --input_path data/unseen/hubert100/encoded.txt --out_path data/unseen/pred_hubert/ --pred_len --pred_pitch --len_model checkpoints/syn_vctk/len/ --f0_model checkpoints/syn_vctk/pitch/ --f0_path data/Syn_VCTK/hubert100/f0_stats.pkl --vc --target_speakers p231 p239 p245 p270 --wild_sample --id_to_spkr data/Syn_VCTK/hubert100/id_to_spkr.pkl
  5. Convert the audio with Speech-Resynthesis in the new speaker's voice and style; here we demonstrate with p231 from Syn_VCTK (a loop over all four target speakers is sketched after the command). Results are saved to dissc_p231:
python3 sr/inference.py --input_code_file data/unseen/pred_hubert/p231_encoded.txt --data_path data/unseen/wav --output_dir dissc_p231 --checkpoint_file sr/checkpoints/vctk_hubert --unseen_speaker --id_to_spkr data/Syn_VCTK/hubert100/id_to_spkr.pkl --vc
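
To generate converted audio for all four target speakers rather than just p231, you can loop over the predicted unit files. This is only a convenience sketch: it assumes the predictions follow the same <speaker>_encoded.txt naming as above and reuses exactly the flags from the previous command.

# run the vocoder once per target speaker, writing each result to its own dissc_<speaker> folder
for spk in p231 p239 p245 p270; do
  python3 sr/inference.py --input_code_file data/unseen/pred_hubert/${spk}_encoded.txt --data_path data/unseen/wav --output_dir dissc_${spk} --checkpoint_file sr/checkpoints/vctk_hubert --unseen_speaker --id_to_spkr data/Syn_VCTK/hubert100/id_to_spkr.pkl --vc
done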

Evaluation

This section discusses how to evaluate the pretrained models on each of the datasets: first performing the SSC, then calculating all metrics. If you wish to manually inspect the different conversions or alter the models, we suggest looking at the scripts section and running the commands from there manually (or see the infer section); these scripts are mainly meant as an "all-in-one" wrapper for the key results.

VCTK

  1. Download the pretrained DISSC model from here to checkpoints/vctk, and the pretrained vocoder from here to sr/checkpoints/vctk_hubert (if you haven't done so yet).

  2. Encode the dataset using HuBERT, and perform train-val split:

python3 data/encode.py --base_dir data/VCTK/wav --out_file data/VCTK/hubert100/encoded.txt --device cuda:0
python3 data/prep_dataset.py --encoded_path data/VCTK/hubert100/encoded.txt --stats_path data/VCTK/hubert100/f0_stats.pkl --split_method paired_val
  3. We provide a single script which runs the conversion (predicts prosody + generates with SR), then restructures the file format for evaluation. It then runs MFA to align the text to the audio (used by the metrics) and computes all metrics other than speaker verification. For more details, see the script. Results are printed and also saved as a pickle file.
python3 scripts/convert_eval.py --dissc_type dissc_l --data vctk --sort_gt  # Rhythm only
python3 scripts/convert_eval.py --dissc_type dissc_b --data vctk            # Convert Rhythm and Pitch
python3 scripts/convert_eval.py --dissc_type dissc_p --data vctk            # Pitch only - not in original paper
  4. Download the speaker verification table, which describes the pairs, from here to data/VCTK/.

  5. Evaluate speaker verification. Here too we provide a single script which runs the conversion (predicts prosody + generates with SR), then restructures the file format for evaluation. For more details, see the script. The EER results are printed.

python3 scripts/convert_eval_sv.py --dissc_type dissc_l --data vctk  # Rhythm only
python3 scripts/convert_eval_sv.py --dissc_type dissc_b --data vctk  # Convert Rhythm and Pitch

ESD

  1. Download the pretrained DISSC model from here to checkpoints/esd, and the pretrained vocoder from here to sr/checkpoints/esd_hubert.

  2. Encode the dataset using HuBERT, and perform train-val split:

python3 data/encode.py --base_dir data/ESD/wav/train --out_file data/ESD/hubert100/train.txt --device cuda:0
python3 data/encode.py --base_dir data/ESD/wav/evaluation --out_file data/ESD/hubert100/val.txt --device cuda:0
python3 data/encode.py --base_dir data/ESD/wav/paired_test --out_file data/ESD/hubert100/test.txt --device cuda:0
python3 data/prep_dataset.py --encoded_path data/ESD/hubert100/train.txt --stats_path data/ESD/hubert100/f0_stats.pkl
  3. We provide a single script which runs the conversion (predicts prosody + generates with SR), then restructures the file format for evaluation. It then runs MFA to align the text to the audio (used by the metrics) and computes all metrics other than speaker verification. For more details, see the script. Results are printed and also saved as a pickle file.
python3 scripts/convert_eval.py --dissc_type dissc_l --data esd --sort_gt  # Rhythm only
python3 scripts/convert_eval.py --dissc_type dissc_b --data esd            # Convert Rhythm and Pitch - not in original paper
python3 scripts/convert_eval.py --dissc_type dissc_p --data esd            # Pitch only - not in original paper
  4. Download the speaker verification table, which describes the pairs, from here to data/ESD/.

  5. Evaluate speaker verification. Here too we provide a single script which runs the conversion (predicts prosody + generates with SR), then restructures the file format for evaluation. For more details, see the script. The EER results are printed.

python3 scripts/convert_eval_sv.py --dissc_type dissc_l --data esd  # Rhythm only

Syn_VCTK

  1. Download the pretrained DISSC model from here to checkpoints/syn_vctk, and the pretrained vocoder from here to sr/checkpoints/vctk_hubert (if you haven't done so yet, this is the same as the VCTK vocoder).

  2. Encode the dataset using HuBERT, and perform train-val split:

python3 data/encode.py --base_dir data/Syn_VCTK/wav --out_file data/Syn_VCTK/hubert100/encoded.txt --device cuda:0
python3 data/prep_dataset.py --encoded_path data/Syn_VCTK/hubert100/encoded.txt --stats_path data/Syn_VCTK/hubert100/f0_stats.pkl --split_method paired_val
  3. We provide a single script which runs the conversion (predicts prosody + generates with SR), then restructures the file format for evaluation. It then runs MFA to align the text to the audio (used by the metrics) and computes all metrics other than speaker verification. For more details, see the script. Results are printed and also saved as a pickle file.
python3 scripts/convert_eval.py --dissc_type dissc_b --data syn_vctk --sort_gt  # Convert Rhythm and Pitch
python3 scripts/convert_eval.py --dissc_type dissc_l --data syn_vctk            # Rhythm only
python3 scripts/convert_eval.py --dissc_type dissc_p --data syn_vctk            # Pitch only

Train

This section discusses how to train the models from scratch, as in the paper. We encourage you to test out other configurations as well. This assumes you have downloaded and prepared the datasets (including train-test split) as described in the previous sections.

Pitch Predictor

These models should take around 30 minutes to train on a single GPU.

  • VCTK (this version uses no positional encoding, to match the paper):
python3 train_f0_predictor.py --out_path checkpoints/vctk --data_path data/VCTK/hubert100/ --f0_path data/VCTK/hubert100/f0_stats.pkl --model_type base --n_epochs 20
  • ESD (this version uses no positional encoding, to match the paper):
python3 train_f0_predictor.py --out_path checkpoints/esd --data_path data/ESD/hubert100/ --f0_path data/ESD/hubert100/f0_stats.pkl --model_type base --n_epochs 20
  • Syn_VCTK:
python3 train_f0_predictor.py --out_path checkpoints/syn_vctk --data_path data/Syn_VCTK/hubert100/ --f0_path data/Syn_VCTK/hubert100/f0_stats.pkl --model_type new --n_epochs 20

Rhythm Predictor

These models should take around 30 minutes to train on a single GPU.

  • VCTK:
python3 train_len_predictor.py --out_path checkpoints/vctk --data_path data/VCTK/hubert100/ --n_epochs 30
  • ESD (this version uses no positional encoding, to match the paper):
python3 train_len_predictor.py --out_path checkpoints/esd --data_path data/ESD/hubert100/ --n_epochs 30
  • Syn_VCTK. Note that the paper version uses the VCTK rhythm predictor for Syn_VCTK, as VCTK is a larger dataset with identical rhythm across speakers. If you nevertheless wish to train one:
python3 train_len_predictor.py --out_path checkpoints/syn_vctk --data_path data/Syn_VCTK/hubert100/ --n_epochs 30

Vocoder

Training this model is based on Speech-Resynthesis, with minor adjustments. Training it will take a couple of days on 2 GPUs. Update the number of available GPUs in the config files under sr/configs and in the run command, and make sure the data paths in the config files match yours. After training, go to the config in the checkpoint path and change the normalisation to false.
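
Before launching training, it can help to confirm how many GPUs are visible and to inspect the config you are about to edit. The exact field names depend on the Speech-Resynthesis config format, so the sketch below only prints them:

# number of GPUs torch can see - use this value for <NUM_GPUS> and the GPU count in the config
python3 -c "import torch; print(torch.cuda.device_count())"
# list the top-level config keys before editing the data paths / normalisation setting
python3 -c "import json; print(list(json.load(open('sr/configs/VCTK/hubert100_lut.json')).keys()))"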

  • VCTK:
python3 -m torch.distributed.launch --nproc_per_node <NUM_GPUS> sr/train.py --checkpoint_path sr/checkpoints/vctk_hubert --config sr/configs/VCTK/hubert100_lut.json
  • ESD:
python3 -m torch.distributed.launch --nproc_per_node <NUM_GPUS> sr/train.py --checkpoint_path sr/checkpoints/esd_hubert --config sr/configs/ESD/hubert100_lut.json
  • Syn_VCTK uses the same vocoder as VCTK.

Reference

If you found this code useful, we would appreciate you citing the related paper:

@inproceedings{maimon-adi-2023-speaking,
    title = "Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units",
    author = "Maimon, Gallil  and
      Adi, Yossi",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.541",
    pages = "8048--8061",
    abstract = "We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and ignore people{'}s unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are only trained on reconstruction like tasks, thus suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.",
}

dissc's People

Contributors

gallilmaimon

dissc's Issues

README inference does not work?

Hi there,

Firstly, thanks for your work and for open sourcing this code. It's truly appreciated!
I am trying to recreate the instructions in the README to infer with an "in the wild speaker". Specifically I am running:

python3 data/preprocess.py --srcdir data/unseen/wav_orig --outdir data/unseen/wav --pad --postfix .wav

python3 data/encode.py --base_dir data/unseen/wav --out_file data/unseen/hubert100/encoded.txt --device cuda:1

python3 infer.py --input_path data/unseen/hubert100/encoded.txt --out_path data/unseen/pred_hubert/ --pred_len \
--pred_pitch --len_model checkpoints/syn_vctk/len/ --f0_model checkpoints/syn_vctk/pitch/ \ 
--f0_path data/Syn_VCTK/hubert100/f0_stats.pkl --vc --target_speakers p231 p239 p245 p270 --wild_sample \ 
--id_to_spkr data/Syn_VCTK/hubert100/id_to_spkr.pkl

python3 sr/inference.py --input_code_file data/unseen/hubert100/p231_encoded.txt --data_path data/unseen/wav \ 
--output_dir dissc_p231 --checkpoint_file sr/checkpoints/vctk_hubert --unseen_speaker \ 
--id_to_spkr data/Syn_VCTK/hubert100/id_to_spkr.pkl

but the last command fails:

Traceback (most recent call last):
  File "sr/inference.py", line 363, in <module>
    main()
  File "sr/inference.py", line 320, in main
    file_list = parse_manifest(a.input_code_file, base_path)
  File "/home/raph/repos/DISSC/sr/dataset.py", line 111, in parse_manifest
    with open(manifest) as info:
FileNotFoundError: [Errno 2] No such file or directory: 'data/unseen/hubert100/p231_encoded.txt'

I figured perhaps you had a typo, where --input_code_file is actually meant to be data/unseen/pred_hubert/p231_encoded.txt, so when I switch that argument, the code runs successfully, but the dissc_p231 folder only contains s1_1_gt.wav and s1_1_gt.wav, which are simply the ground truths. Inspecting the code a little bit, all the if statements are returning False, so the only thing the code does is dump the ground truths to the folder.

Any tips on what needs to change?

Thanks a lot!

I got this error when running inference.py, can you help me fix this?

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Training loss of LenPredictor.

Hey, I opened this new issue as you suggested. For anyone interested, I raised the question below:

Hey. Sorry to reopen this issue; I tried to train a LenPredictor but got bad results. Even after 200 epochs, the LenSumLoss was still more than 1800 and the predicted durations were unreasonable. (1) I wonder what the normal loss value is, and also what it is for the PitchPredictor. (I used just part of your code, so it may be caused by my own implementation.) (2) BTW, do the mean and var of the dataset matter? I set them to 0 and 1 for convenience.

and the author gave the following answer:

Hey again, generally speaking I prefer if you open a new issue for each separate thing so that people in the future can find it more easily :)

About your questions: (1) A normal loss value depends on the distribution of the dataset (specifically the duration distribution); a general rule of thumb is that it should be noticeably better than simply guessing the average length or the most common value. Are you using one of the described datasets or a custom one? I would look at the distribution of ground-truth lengths to see whether there is a bug, or whether the HuBERT model doesn't output good tokens for your dataset. Either way, your loss sounds like some sort of bug. (2) The mean and var of which dataset? Of the durations? This can impact the absolute loss scale, which affects the learning rate and also the relative weight given to big errors (because of the MSE). This won't necessarily cause a big difference, but it's hard to guarantee.
