
GMVAE Tacotron-2:

An unofficial TensorFlow implementation of Hierarchical Generative Modeling for Controllable Speech Synthesis.

Repository Structure:

Tacotron-2
├── datasets
├── LJSpeech-1.1	(0)
│   └── wavs
├── logs-Tacotron	(2)
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output	(3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
└── training_data	(1)
    ├── audio
    └── mels

The previous tree shows the current state of the repository.

  • Step (0): Get your dataset; LJSpeech is used as the example here.
  • Step (1): Preprocess your data. This will give you the training_data folder.
  • Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.
  • Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.

Requirements

First, you need Python 3.5 installed along with TensorFlow v1.6.

Next, you can install the requirements:

pip install -r requirements.txt

or:

pip3 install -r requirements.txt
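
As a sanity check, here is a small hedged snippet (not part of this repo) to confirm the interpreter and TensorFlow versions match what the repo targets:

import sys

import tensorflow as tf

# Confirm the environment matches the versions named above.
assert sys.version_info[:2] == (3, 5), "this repo targets Python 3.5"
assert tf.__version__.startswith("1.6"), "this repo targets TensorFlow v1.6"
print("Python %s, TensorFlow %s" % (sys.version.split()[0], tf.__version__))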

Dataset:

This repo has been tested on the LJSpeech dataset, which contains almost 24 hours of labeled recordings of a single female speaker.
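
For a quick look at what the dataset provides, here is a minimal sketch assuming the LJSpeech-1.1 folder from the tree above (metadata.csv ships with the dataset):

# Peek at the LJSpeech transcripts: each metadata.csv line is
# "id|raw text|normalized text".
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    for line in list(f)[:3]:
        wav_id, raw_text, norm_text = line.strip().split("|")
        print(wav_id, "->", norm_text)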

Preprocessing

Before running the following steps, please make sure you are inside the Tacotron-2 folder:

cd Tacotron-2

Preprocessing can then be started using:

python preprocess.py

or

python3 preprocess.py

The dataset can be chosen using the --dataset argument. The default is LJSpeech.
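
Once preprocessing finishes, here is a hedged way to verify the outputs (the training_data/mels folder comes from the tree above; the .npy format is an assumption about this repo's preprocessing):

import glob

import numpy as np

# Count the generated mel targets and inspect the first one's shape.
mel_paths = sorted(glob.glob("training_data/mels/*.npy"))
print("%d mel files found" % len(mel_paths))
if mel_paths:
    print("first mel shape:", np.load(mel_paths[0]).shape)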

Training:

The feature prediction model can be trained using:

python train.py --model='Tacotron'

or

python3 train.py --model='Tacotron'
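
Checkpoints are written under logs-Tacotron (see the tree above). A minimal sketch, assuming the pretrained subfolder is the checkpoint directory, to locate the newest checkpoint:

import tensorflow as tf

# Returns the newest checkpoint prefix, or None if none has been saved yet.
ckpt = tf.train.latest_checkpoint("logs-Tacotron/pretrained")
print("latest checkpoint:", ckpt)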

Synthesis

There are three types of mel-spectrogram synthesis for the spectrogram prediction network (Tacotron), matching the tacotron_output subfolders above (eval, GTA, natural). The one used here is:

  • Evaluation (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

python synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'

or

python3 synthesize.py --model='Tacotron' --mode='eval' --reference_audio='ref_1.wav'
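
The --reference_audio wav is what conditions the style latent. Below is a minimal sketch of turning such a wav into a log-mel spectrogram with librosa; the exact STFT/mel parameters live in this repo's hparams, and the values used here are assumptions:

import librosa
import numpy as np

# Load the reference wav and compute an 80-band log-mel spectrogram.
wav, sr = librosa.load("ref_1.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))
print("reference mel shape (n_mels, frames):", log_mel.shape)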

Note:

  • This implementation has not been completely tested in all scenarios, but training and synthesis with a reference audio both work.
  • Synthesis has only been tested without GTA and in eval mode.
  • After training for 250k steps with a batch size of 32 on LJSpeech, the KL error settled close to zero (around 0.001), yet style transfer and control are still not good. This may be because the model was trained on LJSpeech, which is not a very expressive dataset and has only 24 hours of data; results might be better on an expressive dataset such as Blizzard Challenge 2013 (the paper's authors used 105 hours of it). See the KL sketch after this list.
  • In my testing I haven't gotten good results on the style-transfer side so far; some more tweaking may be required. This implementation can also be integrated with WaveNet as well as WaveRNN.
  • Feel free to suggest changes, or better yet, raise a PR.
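
For reference, here is a generic sketch (not this repo's exact loss) of the diagonal-Gaussian KL term whose value the note above reports settling near zero, together with a simple linear annealing weight:

import tensorflow as tf

def gaussian_kl(mu, log_var):
    # KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior.
    return -0.5 * tf.reduce_sum(
        1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)

def kl_weight(global_step, warmup_steps=50000):
    # Linearly anneal the KL weight from 0 to 1 over warmup_steps.
    return tf.minimum(1.0, tf.cast(global_step, tf.float32) / float(warmup_steps))

A KL term that collapses to zero early in training often signals posterior collapse, which would be consistent with the weak style control reported above.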

Pretrained model and Samples:

TODO

References and Resources:

Work in progress


Issues

This repo is a VAE, but not a GMVAE?

Hi, I've read your code, and it seems that your implementation is a VAE-Tacotron rather than a GMVAE-Tacotron. The loss here seems no different from the one in this repo?

Missing yl, yo and latent models

Hi, I couldn't find the implementation of the latent posteriors q(yl | X) and q(zl | X) discussed in the referenced paper. The MC estimate of the variational lower bound discussed in Appendix A also seems to be missing.
I wanted to confirm whether I'm missing something, or whether the latent is just modeled as a plain VAE with a single variable zl as the learned latent.
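
For orientation, a common form of the GMVAE variational lower bound from the general literature (a hedged sketch using the question's yl/zl notation, not taken from this repo's code):

\mathcal{L}(X) = \mathbb{E}_{q(z_l \mid X)}\!\left[\log p(X \mid z_l)\right]
  - \mathbb{E}_{q(y_l \mid X)}\!\left[\mathrm{KL}\!\left(q(z_l \mid X) \,\|\, p(z_l \mid y_l)\right)\right]
  - \mathrm{KL}\!\left(q(y_l \mid X) \,\|\, p(y_l)\right)

A plain VAE corresponds to dropping y_l and using a single standard-normal prior p(z_l), which is what the two issues above suggest this code currently does.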
