APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra

Yang Ai, Zhen-Hua Ling

In our paper, we propose APNet, an all-frame-level neural vocoder that reconstructs speech waveforms from acoustic features by directly predicting amplitude and phase spectra.
We provide our implementation as open source in this repository.

Abstract: This paper presents a novel neural vocoder named APNet which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra directly. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network which predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network using acoustic features as input, then passes the output of this network through two parallel linear convolution layers, and finally feeds the two results into a phase calculation formula to estimate frame-level phase spectra. Finally, the outputs of ASP and PSP are combined to reconstruct speech waveforms by inverse short-time Fourier transform (ISTFT). All operations of the ASP and PSP are performed at the frame level. We train the ASP and PSP jointly and define multi-level loss functions based on amplitude mean square error, phase anti-wrapping error, short-time spectral inconsistency error and time domain reconstruction error. Experimental results show that our proposed APNet vocoder achieves about 8x faster inference speed than HiFi-GAN v1 on a CPU due to the all-frame-level operations while its synthesized speech quality is comparable to HiFi-GAN v1. The synthesized speech quality of the APNet vocoder is also better than several equally efficient models. Ablation experiments also confirm that the proposed parallel phase estimation architecture is essential to phase modeling and the proposed loss functions are helpful for improving the synthesized speech quality.
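The way the two predictor outputs meet before the ISTFT can be illustrated for a single time-frequency bin. The following is a minimal sketch, not the code in this repository: it assumes the phase calculation formula behaves like a two-argument arctangent applied to the outputs of the two parallel layers, and that the anti-wrapping error folds a phase difference onto [0, pi]. The function names and the arguments pseudo_real and pseudo_imag are hypothetical.

```python
import cmath
import math


def estimate_phase(pseudo_real, pseudo_imag):
    """Map the two parallel outputs to a phase in [-pi, pi] (assumed atan2 form)."""
    return math.atan2(pseudo_imag, pseudo_real)


def combine_bin(log_amplitude, phase):
    """Build one complex STFT bin from a log amplitude and a phase."""
    return cmath.rect(math.exp(log_amplitude), phase)


def anti_wrap(phase_error):
    """Fold a phase difference onto [0, pi] so that offsets of 2*pi*k count as zero."""
    two_pi = 2.0 * math.pi
    return abs(phase_error - two_pi * round(phase_error / two_pi))


# One bin: log amplitude from the ASP, phase from the PSP's two parallel outputs.
phase = estimate_phase(pseudo_real=0.0, pseudo_imag=1.0)
stft_bin = combine_bin(log_amplitude=math.log(2.0), phase=phase)
```

In the actual vocoder these operations would run over whole frames of spectra, and the resulting complex spectrogram would be passed to an ISTFT to obtain the waveform.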

Visit our demo website for audio samples.

Requirements

torch==1.8.1+cu111
numpy==1.21.6
librosa==0.9.1
tensorboard==2.8.0
soundfile==0.10.3
matplotlib==3.1.3

Data Preparation

For training, write the paths of the training-set and validation-set list files to input_training_wav_list and input_validation_wav_list in config.json, respectively.
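The list files and the two config.json entries could be generated with a small helper. This is a hedged sketch: the function write_wav_lists, the 10% validation split, and the one-path-per-line list format are assumptions, not something this repository specifies, so check the data loader before relying on the format.

```python
import json
import random
from pathlib import Path


def write_wav_lists(wav_dir, out_dir, config_path, val_fraction=0.1, seed=0):
    """Split the .wav files under wav_dir and record the list paths in config.json."""
    wavs = sorted(str(p) for p in Path(wav_dir).rglob("*.wav"))
    random.Random(seed).shuffle(wavs)
    n_val = max(1, int(len(wavs) * val_fraction))
    splits = {
        "input_training_wav_list": wavs[n_val:],
        "input_validation_wav_list": wavs[:n_val],
    }
    config_path = Path(config_path)
    config = json.loads(config_path.read_text())
    for key, paths in splits.items():
        list_path = Path(out_dir) / f"{key}.txt"
        # Assumed format: one waveform path per line.
        list_path.write_text("\n".join(paths) + "\n")
        config[key] = str(list_path)
    config_path.write_text(json.dumps(config, indent=4))
    return splits
```

Calling write_wav_lists("wavs", ".", "config.json") would shuffle the files reproducibly, write two list files next to config.json, and leave every other key in config.json unchanged.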

For inference, we provide two ways to read data:

(1) Set test_mel_load to 0 in config.json and write the path of the test-set waveform directory to test_input_wavs_dir in config.json. The inference process will first load each waveform, then extract its mel spectrogram, and finally synthesize the waveform through the vocoder.

(2) Set test_mel_load to 1 in config.json and write the path of the test-set mel spectrogram directory to test_input_mels_dir in config.json. Each mel spectrogram must have size 80*frames. The inference process will directly load the mel spectrogram and then synthesize the waveform through the vocoder.
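The two modes differ only in which config.json keys are set, which can be sketched as follows. The helper names set_inference_input and check_mel_shape are hypothetical, and the shape check assumes the mel spectrogram is stored with the 80 mel bins on the first axis, as the 80*frames size suggests.

```python
def set_inference_input(config, path, from_mel):
    """Return a copy of config that reads test data from mels or from waveforms."""
    updated = dict(config)
    updated["test_mel_load"] = 1 if from_mel else 0
    key = "test_input_mels_dir" if from_mel else "test_input_wavs_dir"
    updated[key] = path
    return updated


def check_mel_shape(shape, n_mels=80):
    """Accept only 2-D mel spectrograms laid out as (n_mels, frames)."""
    return len(shape) == 2 and shape[0] == n_mels and shape[1] > 0
```

For example, set_inference_input(config, "test_mels", from_mel=True) selects mode (2), and check_mel_shape can be applied to the shape of each loaded array before inference to catch transposed inputs early.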

Note: The sampling rate of speech waveforms must be 16 kHz in this version of the code.
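A quick check of the sampling rate before training or inference can catch mismatched files. This sketch uses Python's standard wave module, which reads PCM WAV headers only; the function name has_expected_rate is hypothetical and is not part of this repository.

```python
import wave


def has_expected_rate(wav_path, expected_hz=16000):
    """Return True if the PCM WAV file's header reports the expected sampling rate."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate() == expected_hz
```

Files that fail the check would need to be resampled to 16 kHz with an audio tool of your choice before being listed in config.json.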

Training

Run using GPU:

CUDA_VISIBLE_DEVICES=0 python train.py

Use TensorBoard to monitor the training process:

tensorboard --logdir=cp_APNet/logs

Inference

Write the checkpoint path to checkpoint_file_load in config.json.

Run using GPU:

CUDA_VISIBLE_DEVICES=0 python inference.py

Run using CPU:

CUDA_VISIBLE_DEVICES=CPU python inference.py

Acknowledgements

We referred to HiFi-GAN to implement this.

Citation

@article{ai2023apnet,
  title={A{PN}et: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra},
  author={Ai, Yang and Ling, Zhen-Hua},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2023}
}
