descriptinc / cargan

Official repository for the paper "Chunked Autoregressive GAN for Conditional Waveform Synthesis"

Home Page: https://maxrmorrison.com/sites/cargan

License: MIT License

cargan's Introduction

Chunked Autoregressive GAN (CARGAN)

Official implementation of the paper Chunked Autoregressive GAN for Conditional Waveform Synthesis [paper] [companion website]

Installation

pip install cargan

Configuration

All configuration is performed in cargan/constants.py. The default configuration is CARGAN. Additional configuration files for experiments described in our paper can be found in config/.

Inference

CLI

Infer from audio files on disk. audio_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --audio_files <audio_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>

Infer from files of features on disk. feature_files and output_files can be lists of files to perform batch inference.

python -m cargan \
    --feature_files <feature_files> \
    --output_files <output_files> \
    --checkpoint <checkpoint> \
    --gpu <gpu>
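When batch inference is driven from a script, the two invocations above can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper name build_inference_command is hypothetical, and it assumes that list-valued flags accept space-separated values and that --gpu may be omitted.

```python
import shlex
import sys


def build_inference_command(input_files, output_files, checkpoint,
                            gpu=None, from_features=False):
    """Build the argument list for one batch `python -m cargan` call."""
    if len(input_files) != len(output_files):
        raise ValueError('Need exactly one output file per input file')

    # Assumption: list-valued flags take space-separated values
    input_flag = '--feature_files' if from_features else '--audio_files'
    command = [sys.executable, '-m', 'cargan', input_flag]
    command += [str(file) for file in input_files]
    command += ['--output_files'] + [str(file) for file in output_files]
    command += ['--checkpoint', str(checkpoint)]

    # Assumption: the --gpu flag is optional
    if gpu is not None:
        command += ['--gpu', str(gpu)]
    return command


command = build_inference_command(
    ['a.wav', 'b.wav'],
    ['a-vocoded.wav', 'b-vocoded.wav'],
    'generator.pt',
    gpu=0)

# Print the command instead of running it
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would launch inference, provided cargan is installed in the same interpreter.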

API

cargan.from_audio

"""Perform vocoding from audio

Arguments
    audio : torch.Tensor(shape=(1, samples))
        The audio to vocode
    sample_rate : int
        The audio sample rate
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, samples))
        The vocoded audio
"""

cargan.from_audio_file_to_file

"""Perform vocoding from audio file and save to file

Arguments
    audio_file : Path
        The audio file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_audio_files_to_files

"""Perform vocoding from audio files and save to files

Arguments
    audio_files : list(Path)
        The audio files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_features

"""Perform vocoding from features

Arguments
    features : torch.Tensor(shape=(1, cargan.NUM_FEATURES, frames))
        The features to vocode
    gpu : int or None
        The index of the gpu to use

Returns
    vocoded : torch.Tensor(shape=(1, cargan.HOPSIZE * frames))
        The vocoded audio
"""
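The output shape documented for cargan.from_features means the vocoded length is fully determined by the number of feature frames. A small worked example of that arithmetic follows; the HOPSIZE and SAMPLE_RATE values are illustrative assumptions, so read the actual values from cargan/constants.py.

```python
def vocoded_num_samples(frames, hopsize):
    """Number of audio samples produced from `frames` feature frames."""
    return hopsize * frames


def vocoded_duration_seconds(frames, hopsize, sample_rate):
    """Duration in seconds of the audio produced from `frames` frames."""
    return vocoded_num_samples(frames, hopsize) / sample_rate


# Illustrative values only (assumptions, not read from the package)
HOPSIZE = 256
SAMPLE_RATE = 22050

frames = 100
print(vocoded_num_samples(frames, HOPSIZE))  # 25600
print(round(vocoded_duration_seconds(frames, HOPSIZE, SAMPLE_RATE), 3))  # 1.161
```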

cargan.from_feature_file_to_file

"""Perform vocoding from feature file and save to disk

Arguments
    feature_file : Path
        The feature file to vocode
    output_file : Path
        The location to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

cargan.from_feature_files_to_files

"""Perform vocoding from feature files and save to disk

Arguments
    feature_files : list(Path)
        The feature files to vocode
    output_files : list(Path)
        The locations to save the vocoded audio
    checkpoint : Path
        The generator checkpoint
    gpu : int or None
        The index of the gpu to use
"""

Reproducing results

For the following subsections, the arguments are as follows:

  • checkpoint - Path to an existing checkpoint on disk
  • datasets - A list of datasets to use. Supported datasets are vctk, daps, cumsum, and musdb.
  • gpu - The index of the gpu to use
  • gpus - A list of indices of gpus to use for distributed data parallelism (DDP)
  • name - The name to give to an experiment or evaluation
  • num - The number of samples to evaluate

Download

Downloads, unzips, and formats datasets. Stores datasets in data/datasets/. Stores formatted datasets in data/cache/.

python -m cargan.data.download --datasets <datasets>

vctk must be downloaded before cumsum.

Preprocess

Prepares features for training. Features are stored in data/cache/.

python -m cargan.preprocess --datasets <datasets> --gpu <gpu>

Running this step is not required for the cumsum experiment.

Partition

Partitions a dataset into training, validation, and testing partitions. You should not need to run this, as the partitions used in our work are provided for each dataset in cargan/assets/partitions/.

python -m cargan.partition --datasets <datasets>

The optional --overwrite flag forces the existing partition to be overwritten.

Train

Trains a model. Checkpoints and logs are stored in runs/.

python -m cargan.train \
    --name <name> \
    --datasets <datasets> \
    --gpus <gpus>

You can optionally specify a --checkpoint option pointing to the directory of a previous run. The most recent checkpoint will automatically be loaded and training will resume from that checkpoint. You can overwrite a previous training by passing the --overwrite flag.
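Resuming from "the most recent checkpoint" amounts to picking the file with the highest step count in the run directory. The sketch below illustrates one way to do that; the file naming scheme (a trailing integer step before .pt) and the function name latest_checkpoint are hypothetical and may differ from what cargan.train actually does.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(directory):
    """Return the checkpoint file with the largest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(directory).glob('*.pt'):
        # Hypothetical naming scheme: the step count is the trailing integer
        match = re.search(r'(\d+)\.pt$', path.name)
        if match is None:
            continue
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, path
    return best_path


# Demonstrate on a throwaway directory with two empty checkpoint files
with tempfile.TemporaryDirectory() as directory:
    for name in ('generator-900.pt', 'generator-1000.pt'):
        (Path(directory) / name).touch()
    print(latest_checkpoint(directory).name)  # generator-1000.pt
```

Comparing parsed integers rather than file names avoids the lexicographic trap where 900 sorts after 1000.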

You can monitor training via tensorboard as follows.

tensorboard --logdir runs/ --port <port>

Evaluate

Objective

Reports the pitch RMSE (in cents), periodicity RMSE, and voiced/unvoiced F1 score. Results are both printed and stored in eval/objective/.

python -m cargan.evaluate.objective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>
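For readers unfamiliar with the reported metrics, the following sketch shows how a pitch RMSE in cents and a voiced/unvoiced F1 score are commonly computed. It is a plain-Python illustration under stated assumptions (pitch error is measured only on frames voiced in both signals), not the implementation in cargan.evaluate.objective.

```python
import math


def cents(hertz, reference_hertz):
    """Signed pitch difference in cents (1200 cents per octave)."""
    return 1200. * math.log2(hertz / reference_hertz)


def pitch_rmse_cents(true_pitch, pred_pitch, true_voiced, pred_voiced):
    """RMSE in cents over frames that are voiced in both signals."""
    # Assumption: frames unvoiced in either signal are excluded
    errors = [
        cents(pred, true)
        for true, pred, tv, pv
        in zip(true_pitch, pred_pitch, true_voiced, pred_voiced)
        if tv and pv]
    if not errors:
        return float('nan')
    return math.sqrt(sum(error ** 2 for error in errors) / len(errors))


def voiced_unvoiced_f1(true_voiced, pred_voiced):
    """F1 score of the voiced/unvoiced decision, with voiced as positive."""
    pairs = list(zip(true_voiced, pred_voiced))
    true_pos = sum(1 for t, p in pairs if t and p)
    false_pos = sum(1 for t, p in pairs if not t and p)
    false_neg = sum(1 for t, p in pairs if t and not p)
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else float('nan')


# One frame is an octave sharp (1200 cents), the other is exact
print(pitch_rmse_cents([440., 220.], [880., 220.], [True, True], [True, True]))
print(voiced_unvoiced_f1([True, True, False, False], [True, False, True, False]))  # 0.5
```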

Subjective

Generates samples for subjective evaluation. Also performs benchmarking of inference speed. Results are stored in eval/subjective/.

python -m cargan.evaluate.subjective \
    --name <name> \
    --datasets <datasets> \
    --checkpoint <checkpoint> \
    --num <num> \
    --gpu <gpu>

Receptive field

Get the size of the (non-causal) receptive field of the generator. cargan.AUTOREGRESSIVE must be False to use this.

python -m cargan.evaluate.receptive_field

Running tests

pip install pytest
pytest

Citation

IEEE

M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, "Chunked Autoregressive GAN for Conditional Waveform Synthesis," Submitted to ICLR 2022, April 2022.

BibTeX

@inproceedings{morrison2022chunked,
    title={Chunked Autoregressive GAN for Conditional Waveform Synthesis},
    author={Morrison, Max and Kumar, Rithesh and Kumar, Kundan and Seetharaman, Prem and Courville, Aaron and Bengio, Yoshua},
    booktitle={Submitted to ICLR 2022},
    month={April},
    year={2022}
}

cargan's Issues

Versions of torch and torchaudio to use on Colab?

UPDATE: !pip install torch==1.10.2 torchaudio==0.10.2 did the trick. Still not sure about how to use TensorBoard but closing this issue as my goal was to at least run training on Colab.


Hi,

This may be pretty Google-Colab-specific but I would appreciate guidance.

On Colab, I was trying to train CARGAN on VCTK. I ran into an exception on line 70 of train.py (writer = SummaryWriter(str(directory))). Exception pasted below:

[libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'

Along the lines of this error message, I tried installing libprotobuf 3.9, but then got some sort of low-level C error (I'm forgetting details but can reproduce if helpful). Rather than investigate I commented out all the references to the writer object as I wanted to just get training to work as a first step (even w/o TensorBoard monitoring).

That allowed me to get further, line 523 of train.py (metrics.update(x_t, x_pred_t)), but this resulted in AttributeError: module 'torchaudio.functional' has no attribute 'magphase' on line 115 of metrics.py.

I assume this is a torchaudio version issue, so I did !pip uninstall torchaudio and then ran !pip install -e . from the repo root to reinstall it via setup.py, but got the same exception. I believe the old (already installed) torchaudio version was 0.12.1+cu113 and the reinstalled version was then 0.12.1+cu102. Colab appears to have CUDA 11.1 installed.

Anyways, I suppose I'm asking, does anyone have a recommendation of versions of torchaudio (and perhaps torch) to install to have the least chance of issues along these lines? Appreciate any and all help greatly.

Discriminator weights

I saw someone ask for these weights a few months ago and was curious whether they will be released, or if there are any updates. I appreciate it, and great work on significantly speeding up training.

Training models with 24000 Hz audio data

Thank you for your nice work!
If I would like to train CARGAN with 24000 Hz audio data, besides SAMPLE_RATE in cargan/constants.py, what other parts of the code do I need to modify?

Poor results on Mandarin singing voice data

Thank you for your work. I used this repository to experiment on a Mandarin singing voice dataset. The result after 500k training steps is not satisfactory. The main problem is that the spectrogram looks like chunks stitched together one by one, with very obvious vertical streaks that can be clearly heard.

[Two images attached in the original issue]

I am using the default hyperparameter configuration. How can I avoid this problem?

Will discriminator weights be released?

It would be helpful for finetuning. If not, maybe HiFi-GAN's Universal V1 discriminator could be used, though I'm not sure how much the changed feature matching/mel-spectrogram loss weighting will impact things.

Pip package is missing submodules

I tried to import cargan after running pip install cargan. But, from . import model failed because the model module could not be found. Indeed, on PyPI, the 0.0.2 wheel and tar.gz only have the following source files:

cargan/__init__.py
cargan/__main__.py
cargan/constants.py
cargan/core.py
cargan/load.py
cargan/partition.py
cargan/train.py

This seems like a setup.py issue. Maybe find_packages() should be used. Or, submodules should be listed out explicitly (since find_packages() might include tests).

How to generate speech at 16kHz?

Hi,
Thanks for sharing this nice work. I would like to train CARGAN on a 16 kHz training dataset. Could you elaborate on what changes I need to make to the generator architecture to achieve this?

Many thanks in advance!

inference speed

Hello @maxrmorrison, thanks for your contribution of CARGAN.
In my experiments with CARGAN, inference is slow even though AUTOREGRESSIVE is set to False. How can I speed up CARGAN to get close to the speed reported in the paper?

colab

Please add a Google Colab notebook for inference.

How do I train my own model?

Hello, please tell me how to train my own model. If I have a new dataset, what should I do first? What does the data format look like?
Thank you!

About the AR loop

From the code:

signals[start:start + cargan.CHUNK_SIZE] += signal.squeeze()

The output samples of each chunk are added to signals. But the for loop

for i in range(0, features.shape[2] - feat_chunk + 1, feat_hop):

steps by feat_hop. As I understand it, overlapping chunks are therefore summed in signals, but we only need the first feat_hop * hop_size samples of each chunk, right?
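The concern in this question can be made concrete with a toy sketch that counts how many chunks are added into each output sample. It does not reproduce the repository's loop: the relation start = i * hopsize, the chunk length feat_chunk * hopsize, and all numeric values are assumptions chosen for illustration.

```python
def count_writes(num_frames, feat_chunk, feat_hop, hopsize):
    """Count how many chunks are added into each output sample."""
    chunk_size = feat_chunk * hopsize
    writes = [0] * (num_frames * hopsize)
    for i in range(0, num_frames - feat_chunk + 1, feat_hop):
        # Assumption: the chunk starting at frame i begins at sample i * hopsize
        start = i * hopsize
        for j in range(start, start + chunk_size):
            writes[j] += 1
    return writes


# Hop equal to the chunk length: every sample is written exactly once
print(count_writes(num_frames=6, feat_chunk=2, feat_hop=2, hopsize=2))

# Hop smaller than the chunk length: interior samples are summed twice
print(count_writes(num_frames=6, feat_chunk=2, feat_hop=1, hopsize=2))
```

Under these assumptions, accumulation with += only sums overlapping audio when feat_hop is smaller than feat_chunk. When the two are equal, += behaves like plain assignment into a zero-initialized buffer.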

TypeError: can't convert np.ndarray of type numpy.uint16.

When I ran the code with my own dataset
python -m cargan.preprocess --dataset ljspeech
an error occurred:

Traceback (most recent call last):
  File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "XX/anaconda3/envs/cargan/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "XX/models/cargan/cargan/preprocess/__main__.py", line 26, in <module>
    cargan.preprocess.datasets(**vars(parse_args()))
  File "XX/models/cargan/cargan/preprocess/core.py", line 37, in datasets
    mels, pitch, periodicity = from_audio(audio, gpu=gpu)
  File "XX/models/cargan/cargan/preprocess/core.py", line 62, in from_audio
    pitch, periodicity = cargan.preprocess.pitch.from_audio(
  File "XX/models/cargan/cargan/preprocess/pitch.py", line 38, in from_audio
    pitch, periodicity = torchcrepe.predict(
  File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 127, in predict
    result = postprocess(probabilities,
  File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/core.py", line 605, in postprocess
    bins, pitch = decoder(probabilities)
  File "XX/anaconda3/envs/cargan/lib/python3.8/site-packages/torchcrepe-0.0.15-py3.8.egg/torchcrepe/decode.py", line 76, in viterbi
    bins = torch.tensor(bins, device=probs.device)
TypeError: can't convert np.ndarray of type numpy.uint16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

I guess it is caused by the following code in torchcrepe\decode.py:

    # Perform viterbi decoding
    bins = [librosa.sequence.viterbi(sequence, viterbi.transition)
            for sequence in sequences]
    # Convert to pytorch
    bins = torch.tensor(bins, device=probs.device)

The datatype of bins is numpy.uint16.
Do I need to modify the code in torchcrepe?

Pitch Losses

Hello, first of all, thanks for sharing your results and the full implementation.

I noticed that the code implements a PitchLoss term, but it is not used in any of the configs and is not mentioned in the paper.
I also saw that you implemented a PitchDiscriminator, but I did not find any results from using it.

Would you mind commenting on the results of using pitch as part of the vocoder loss?
