
loop's Introduction

VoiceLoop

PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop.

VoiceLoop is a neural text-to-speech (TTS) model that is able to transform text to speech in voices that are sampled in the wild. Some demo samples can be found here.

Quick Start

Follow the instructions in Setup and then simply execute:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

Results will be placed in models/vctk/results. It will generate 2 samples.

You can also generate the same text but with a different speaker, specifically:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth

This will generate the same text in the other speaker's voice.

Here is the corresponding attention plot:

Legend: the x-axis is output time (acoustic samples); the y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14.

Finally, free text is also supported:

python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth

Setup

Requirements: Linux/OSX, Python 2.7, and PyTorch 0.1.12. Generation requires installing phonemizer; follow the setup instructions there. The current version of the code requires CUDA support for training. Generation can be done on the CPU.

git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt
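
A quick smoke test that phonemizer is usable (a hedged sketch, not part of the repo; the phonemize helper and its default backend are assumptions about the phonemizer package and may vary by version):

import phonemizer

# hypothetical check: prints a phoneme string if phonemizer and its
# backend were installed per the phonemizer setup instructions
print(phonemizer.phonemize('hello world'))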

Data

The data used to train the models in the paper can be downloaded via:

bash scripts/download_data.sh

The script downloads and preprocesses a subset of VCTK. This subset contains speakers with an American accent.

The dataset was preprocessed using Merlin: from each audio clip, vocoder features were extracted using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:

loop
├── data
    └── vctk
        ├── norm_info
        │   ├── norm.dat
        ├── numpy_features
        │   ├── p294_001.npz
        │   ├── p294_002.npz
        │   └── ...
        └── numpy_features_valid

The preprocess pipeline can be executed using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
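
The .npz feature files can be inspected directly with NumPy. The key names in the comments below are taken from the issues further down (phonemes, durations, audio_features); treat them as illustrative rather than a spec:

import numpy as np

feat = np.load('data/vctk/numpy_features_valid/p318_212.npz')
print(feat.files)                    # stored arrays, e.g. phonemes, durations, audio_features
print(feat['phonemes'])              # integer phoneme codes
print(feat['audio_features'].shape)  # per-frame WORLD vocoder features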

Pretrained Models

Pretrained models can be downloaded via:

bash scripts/download_models.sh

After downloading, the models will be located under the models subfolder as follows:

loop
├── data
├── models
    ├── blizzard
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt

Update 10/25/2017: Single speaker model available in models/blizzard/

SPTK and WORLD

Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as done in Merlin. To download the executables:

bash scripts/download_tools.sh

This results in the following subdirectories:

loop
├── data
├── models
├── tools
    ├── SPTK-3.9
    └── WORLD

Training

Single-Speaker

The single-speaker model is trained on Blizzard 2011. Data should be downloaded and prepared as described above. Once the data is ready, run:

python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10

Then, continue training the model with:

python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90

Multi-Speaker

To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:

python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90

Then, continue training the model using a noise level of 2 on full sequences:

python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90

Citation

If you find this code useful in your research then please cite:

@article{taigman2017voice,
  title           = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop},
  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal         = {ArXiv e-prints},
  archivePrefix   = {arXiv},
  eprinttype      = {arxiv},
  eprint          = {1707.06588},
  primaryClass    = {cs.CL},
  year            = {2017},
  month           = {October},
}

License

Loop has a CC-BY-NC license.

loop's People

Contributors

adampolyak, ytaigman


loop's Issues

Missing norm_info_mgc_lf0_vuv_bap_63_MVN.dat while preprocessing the LJ Speech dataset. Any clue?

I am getting the error below while preprocessing the LJ Speech dataset.

Traceback (most recent call last):
  File "extract_feats.py", line 1406, in <module>
    save_numpy_features()
  File "extract_feats.py", line 853, in save_numpy_features
    shutil.copy2(audio_norm_source, audio_norm_dest)
  File "/usr/lib/python2.7/shutil.py", line 130, in copy2
    copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: '/home/jax/latest_features/final_acoustic_data/norm_info_mgc_lf0_vuv_bap_63_MVN.dat'

How long an utterance can I synthesize from text?

I've tried it, and it works well. Thanks for sharing the great project!
I wonder how long an utterance it can generate from text. Output seems limited to under 3 seconds when I try a slightly longer sentence. Where does this limit originate, and how can I make it longer? Is it related to the --seq-len option in training?
Thank you!

Problem setting?

Hi,

I want to confirm whether the problem setting for this research is as follows:

  1. Some known speakers are trained from VCTK.
  2. A speaker in the wild / a text tries to mimic a known speaker from 1 through generate.py?

Reproducing the results

Hi, thanks for open sourcing the code!

I am trying to reproduce your results but am running into problems. I have been training with:

  • sequence length: 100
  • epoch: 90
  • only American accent VCTK speaker samples
  • noise level 4

So the problem is that only some speakers actually produce a speech signal based on the input. The majority of speakers only produce noise. However, which speakers produce speech depends on the actual phoneme input. The problem seems to be that the attention does not work correctly for these samples. The attention basically stays at the beginning of the sequence and does not advance.

Did you have a similar issue when training the model? Or do you might have an idea what the problem could be?

good attention with speech output:
p226_009_11.pdf
p225_005_4.pdf

somewhat working:
p226_009_2.pdf

Most examples:
p226_009_9.pdf
p226_009_13.pdf
p226_009_1.pdf

Thanks!

Issue for training on new Dataset.

Hi,

Thanks for sharing the project; I am doing some experiments with the tools. I have 2 questions.

  1. The npz files downloaded with download_data.sh differ from the ones generated by extract_feats.py for the same sample wave/text file, say p294_001. Why does this happen? The other arrays also have some differences.
    downloaded one:
    phonemes
    [28 22 19 41 21 3 22 31 34 11 22 5]
    durations
    [29 4 25 18 21 27 11 32 7 12 39 3]
    extracted one:
    phonemes
    [28 22 19 40 21 3 22 31 33 11 22 5]
    durations
    [ 9 6 23 33 6 17 24 32 3 14 28 32]

  2. If I want to retrain the model using this data, I need to extract features to prepare the npz files. Do I need to put the training set and validation set together when running extract_feats.py to get norm.dat, or do I only need to process the training data to get norm.dat and then kick off training?

Thank you for your guidance in advance. :)

self.training?

Hi, I am reading your code, and it is really clean.
I noticed that in the classes 'Loop' and 'Decoder' in 'model.py', 'self.training' is not defined but is used as a condition. The inherited class 'torch.nn.Module' doesn't seem to have an attribute named 'training' either.
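
For reference, torch.nn.Module does define this flag: its constructor sets self.training = True, and train()/eval() toggle it on the module and its children, which is what the conditionals in Loop and Decoder rely on. A minimal demonstration:

import torch.nn as nn

m = nn.Linear(2, 2)
print(m.training)  # True: nn.Module.__init__ sets self.training = True
m.eval()
print(m.training)  # False: eval() switches the module (and children) to inference mode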

Getting errors in generate.py

python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

Traceback (most recent call last):
  File "generate.py", line 153, in <module>
    main()
  File "generate.py", line 142, in main
    norm_path)
  File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 257, in generate_merlin_wav
    weight=os.path.join(gen_dir, 'weight')), shell=True)
  File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 121, in pe
    for line in execute(cmd, shell=shell):
  File "/mnt/sdb1/Learning/pytorch/loop/utils.py", line 114, in execute
    raise subprocess.CalledProcessError(return_code, cmd)
subprocess.CalledProcessError: Command 'echo 1 1 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 | /mnt/sdb1/Learning/pytorch/loop/tools/SPTK-3.9/x2x +af > /mnt/sdb1/Learning/pytorch/loop/models/vctk/results/weight' returned non-zero exit status 127

Confusion of the Phoneme Generation

In the paper, it says the phoneme transcription of the text is generated with the CMU lexicon. However, this code uses phonemizer, a toolkit that uses a US phoneme set. There are slight differences in the phoneme set and phoneme count between them. Besides, the paper also mentions that two phonemes were added for two pauses of different lengths, but I do not know where this is done in the code.

Thanks!

New Language

First of all, thank you for releasing the code.
I would like to know how difficult it would be to train on speaker data in a new language, such as Turkish. As far as I saw, the generation step needs some kind of pronunciation dictionary. But what about the preprocessing steps: are Merlin and the other tools language-agnostic? Thank you in advance.

shape mismatch of "audio_features" between downloaded and generated npz files

There is a shape mismatch in the audio_features array of npz files between the data you uploaded and npz files generated using Kyle's extract_feats script.

For example, in p299_405.npz:
the shape of audio_features is (393, 60) for the uploaded npz file;
the shape is (829, 60) for the npz created by the extract_feats script.

This issue could possibly stem from silences not being removed by the extract_feats script, while they have been removed from the uploaded data.

Can you please recommend a solution for this?

Facing 'Not a finite gradient or too big, ignoring.' when training on other data

Hello,

I ran into some issues when training a loop model with my own data; please help.
I prepared a dataset with 12 speakers and 5000 sentences in total. I am using the parameters from the readme guide for training:
python train.py --expName myexp --data data/mydata --noise 4 --seq-len 100 --epochs 90 --nspk 12
python train.py --expName myexp_final --data data/mydata --checkpoint checkpoints/myexp/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90 --nspk 12

The first training run completed with no apparent issues; the last log lines were:

INFO - 11/16/17 08:03:41 - 21:26:30 - ====> Train set loss: 31.4378
INFO - 11/16/17 08:04:01 - 21:26:50 - ====> Test set loss: 32.6544
INFO - 11/16/17 08:18:16 - 21:41:05 - ====> Train set loss: 31.4457
INFO - 11/16/17 08:18:37 - 21:41:26 - ====> Test set loss: 32.5302

But when I start training with the second command, it starts showing 'Not a finite gradient or too big, ignoring.' frequently in the first epoch. I have printed befgad in utils.py at the line below:
befgad = torch.nn.utils.clip_grad_norm(params, clip_th)

It has some values larger than 10000, like these:
42648.5450444
1599437.41826
167695.944851

I have tried another experiment with 12 speakers and 10000 sentences; the same issue happened when training the second model.

My questions are:

  1. Why is the training separated into 2 steps?
  2. Do I need to adjust some parameters for training, or what is the problem?

Thanks.
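
One hedged experiment (my suggestion, not a maintainer-confirmed fix): the Namespace dumps in other issues here show defaults of clip_grad=0.5 and ignore_grad=10000.0, so assuming the corresponding flag is spelled --clip-grad, tightening the clipping threshold for the second stage might tame the spikes:

python train.py --expName myexp_final --data data/mydata --checkpoint checkpoints/myexp/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90 --nspk 12 --clip-grad 0.25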

invalid combination of arguments error in training or generating

After installing, when I try to train or generate as described in the readme, I get an invalid combination of arguments error, as follows:

$ python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
INFO - 09/06/17 09:30:25 - 0:00:00 - Namespace(K=10, attention_alignment=0.05, batch_size=64, checkpoint='', clip_grad=0.5, data='data/vctk', epochs=90, expName='checkpoints/vctk', gpu=0, hidden_size=256, ignore_grad=10000.0, lr=0.0001, max_seq_len=1000, mem_size=20, noise=4, nspk=22, output_size=63, seed=1, seq_len=100, vocabulary_size=44)
INFO - 09/06/17 09:30:25 - 0:00:00 - Building dataset.
INFO - 09/06/17 09:30:25 - 0:00:00 - Dataset ready!
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    main()
  File "train.py", line 175, in main
    model = Loop(args)
  File "/d2/jbaik/loop/model.py", line 217, in __init__
    self.decoder = Decoder(opt)
  File "/d2/jbaik/loop/model.py", line 137, in __init__
    opt.attention_alignment)
  File "/d2/jbaik/loop/model.py", line 87, in __init__
    self.N_a = getLinear(mem_elem, 3*K)
  File "/d2/jbaik/loop/model.py", line 15, in getLinear
    return nn.Sequential(nn.Linear(dim_in, dim_in/10),
  File "/home/jbaik/.pyenv/versions/3.6.2/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 41, in __init__
    self.weight = Parameter(torch.Tensor(out_features, in_features))
TypeError: torch.FloatTensor constructor received an invalid combination of arguments - got (float, int), but expected one of:
 * no arguments
 * (int ...)
      didn't match because some of the arguments have invalid types: (float, int)
 * (torch.FloatTensor viewed_tensor)
 * (torch.Size size)
 * (torch.FloatStorage data)
 * (Sequence data)

Could you give me a hint on how to handle this? Thanks!
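
The traceback shows Python 3.6 even though Setup pins Python 2.7: under Python 3, / is true division, so dim_in/10 at model.py line 15 produces a float, which torch.Tensor(out_features, in_features) rejects. A minimal sketch of a Python 3 tolerant getLinear (only the first Linear is visible in the traceback; the remaining layers are assumptions for illustration):

def getLinear(dim_in, dim_out):
    # // floors to an int, matching Python 2's integer division behavior
    return nn.Sequential(nn.Linear(dim_in, dim_in // 10),
                         nn.ReLU(),
                         nn.Linear(dim_in // 10, dim_out))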

Error during training

Hi,

I ran training 4 months ago and there was no issue. But now, when I add a new dataset and train with multiple speakers, I get this error.

cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

Can you please help?

What does a good loss curve look like?

After the first 90 epochs, I attained a loss of 36. Starting on the next 90 epochs with reduced noise and increased sequence length, it has dropped to 26.

How low should it get for a quality model?

Why scale the outputs by 30 in Decoder.update_buffer?

The 30 here seems to be a magic number, unless I missed something in the paper?

    def update_buffer(self, S_tm1, c_t, o_tm1, ident):
        # concat previous output & context
        idt = torch.tanh(self.F_u(ident))
        o_tm1 = o_tm1.squeeze(0)
        z_t = torch.cat([c_t + idt, o_tm1/30], 1)
        z_t = z_t.unsqueeze(2)
        Sp = torch.cat([z_t, S_tm1[:, :, :-1]], 2)

        # update S
        u = self.N_u(Sp.view(Sp.size(0), -1))
        u[:, :idt.size(1)] = u[:, :idt.size(1)] + idt
        u = u.unsqueeze(2)
        S = torch.cat([u, S_tm1[:, :, :-1]], 2)

        return S

Thanks.

Issue on generating with --text param

Hi, when I try to run
sudo python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
I always get this error.

Traceback (most recent call last):
  File "generate.py", line 153, in <module>
    main()
  File "generate.py", line 112, in main
    txt = text2phone(args.text, char2code)
  File "generate.py", line 43, in text2phone
    cmudict = nltk.corpus.cmudict.dict()
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/util.py", line 116, in __getattr__
    self.__load()
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/util.py", line 81, in __load
    except LookupError: raise e
LookupError: 
**********************************************************************
  Resource cmudict not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('cmudict')
  
  Searched in:
    - '/home/jax/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
**********************************************************************

Thanks in advance.
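
As the error message itself suggests, downloading the cmudict resource for NLTK once resolves this:

import nltk
nltk.download('cmudict')  # fetches the CMU pronouncing dictionary used by generate.py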

Training Error. Epoch 5.

This is the stack trace:

Traceback (most recent call last):
  File "train.py", line 211, in <module>
    main()
  File "train.py", line 199, in main
    train(model, criterion, optimizer, epoch, train_losses)
  File "train.py", line 119, in train
    loss = criterion(output, target[0], target[1])
  File "/home/michael/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael/Desktop/loop/model.py", line 42, in forward
    mask_ = mask.expand_as(input)
  File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/variable.py", line 655, in expand_as
    return Expand(tensor.size())(self)
  File "/home/michael/.local/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py", line 115, in forward
    result = i.expand(*self.sizes)
RuntimeError: The expanded size of the tensor (21) must match the existing size (5) at non-singleton dimension 0. at /b/wheel/pytorch-src/torch/lib/TH/THStorage.c:99

Any clue what is going on?

Training Epochs Number

How many epochs of training are needed in order to start hearing anything meaningful in the output of generate.py? Thanks!

Error when synthesizing waves with the downloaded model

Hi,

When I try to generate some waves using the downloaded model, it produces errors on some sentences. The error type is the same each time, and I am not sure why. Could you please help?

Sentence 1:
The boy's grandmother is his legal guardian.
Cmd:
python generate.py --text "The boy's grandmother is his legal guardian." --spkr 1 --checkpoint models/vctk/bestmodel.pth
Error:
Traceback (most recent call last):
  File "generate.py", line 151, in <module>
    main()
  File "generate.py", line 140, in main
    norm_path)
  File "/kaldi/loop/utils.py", line 266, in generate_merlin_wav
    base_r0=files['mgc'] + '_r0'), shell=True)
  File "/kaldi/loop/utils.py", line 121, in pe
    for line in execute(cmd, shell=shell):
  File "/kaldi/loop/utils.py", line 114, in execute
    raise subprocess.CalledProcessError(return_code, cmd)
subprocess.CalledProcessError: Command '/kaldi/loop/tools/bin/SPTK-3.9/freqt -m 59 -a 0.58 -M 511 -A 0 < The_boy's_grandmother_is_his_legal_guardian..gen_1.mgc | /kaldi/loop/tools/bin/SPTK-3.9/c2acr -m 511 -M 0 -l 1024 > The_boy's_grandmother_is_his_legal_guardian..gen_1.mgc_r0' returned non-zero exit status 2

Sentence 2:
When he's able to return to campaigning, Santorum will have to decide whether he wants to.
Cmd:
python generate.py --text "When he's able to return to campaigning, Santorum will have to decide whether he wants to." --spkr 1 --checkpoint models/vctk/bestmodel.pth
Error:
Traceback (most recent call last):
  File "generate.py", line 151, in <module>
    main()
  File "generate.py", line 140, in main
    norm_path)
  File "/kaldi/loop/utils.py", line 266, in generate_merlin_wav
    base_r0=files['mgc'] + '_r0'), shell=True)
  File "/kaldi/loop/utils.py", line 121, in pe
    for line in execute(cmd, shell=shell):
  File "/kaldi/loop/utils.py", line 114, in execute
    raise subprocess.CalledProcessError(return_code, cmd)
subprocess.CalledProcessError: Command '/kaldi/loop/tools/bin/SPTK-3.9/freqt -m 59 -a 0.58 -M 511 -A 0 < When_he's_able_to_return_to_campaigning,_Santorum_will_have_to_decide_whether_he_wants_to..gen_1.mgc | /kaldi/loop/tools/bin/SPTK-3.9/c2acr -m 511 -M 0 -l 1024 > When_he's_able_to_return_to_campaigning,_Santorum_will_have_to_decide_whether_he_wants_to..gen_1.mgc_r0' returned non-zero exit status 2

No numpy_features_valid

Hello,

After running extract_feats.py, everything went through, except I don't see any numpy_features_valid folder. Is it still needed? Do I create it manually?

Blizzard Model

Can't use the Blizzard model without the original training data:

Traceback (most recent call last):
  File "generate.py", line 156, in <module>
    main()
  File "generate.py", line 83, in main
    train_dataset = NpzFolder(train_args.data + '/numpy_features')
  File "/home/michael/Desktop/loop/data.py", line 84, in __init__
    self.NPZ_EXTENSION))
RuntimeError: Found 0 npz in subfolders of: data/blizzard/numpy_features
Supported image extensions are: npz

It looks like generate.py uses parameters from the training data to generate.

Full VCTK dataset

Hi There!

Did you try training on the full VCTK dataset? Does the quality get better?
How long does it take to train on the 22-speaker VCTK dataset?

ConnectionError: HTTPConnectionPool(host='localhost', port=8097)

Anyone else getting this error?

python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
INFO - 09/06/17 08:52:38 - 0:00:00 - Namespace(K=10, attention_alignment=0.05, batch_size=64, checkpoint='', clip_grad=0.5, data='data/vctk', epochs=90, expName='checkpoints/vctk', gpu=0, hidden_size=256, ignore_grad=10000.0, lr=0.0001, max_seq_len=1000, mem_size=20, noise=4, nspk=22, output_size=63, seed=1, seq_len=100, vocabulary_size=44)
INFO - 09/06/17 08:52:38 - 0:00:00 - Building dataset.
INFO - 09/06/17 08:52:38 - 0:00:00 - Dataset ready!
Train (loss 50.63) epoch 1: 100%|█████████████| 126/126 [11:06<00:00,  4.17s/it]
Exception in user code:
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/visdom/__init__.py", line 240, in _send
    data=json.dumps(msg),
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc867a836d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
INFO - 09/06/17 09:03:46 - 0:11:08 - ====> Train set loss: 55.6526
Valid (loss 51.16) epoch 1: 100%|███████████████| 11/11 [00:17<00:00,  1.73s/it]
[the same visdom ConnectionError traceback repeats after every train/valid epoch]
INFO - 09/06/17 09:04:04 - 0:11:26 - ====> Test set loss: 51.7753
Train (loss 42.89) epoch 2: 100%|█████████████| 126/126 [11:18<00:00,  5.10s/it]
INFO - 09/06/17 09:15:23 - 0:22:44 - ====> Train set loss: 49.4345
Valid (loss 47.57) epoch 2: 100%|███████████████| 11/11 [00:17<00:00,  1.73s/it]
INFO - 09/06/17 09:15:40 - 0:23:02 - ====> Test set loss: 48.0748
Train (loss 45.03) epoch 3: 100%|█████████████| 126/126 [10:59<00:00,  4.99s/it]
INFO - 09/06/17 09:26:41 - 0:34:02 - ====> Train set loss: 46.5043
Valid (loss 44.79) epoch 3: 100%|███████████████| 11/11 [00:17<00:00,  1.73s/it]
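
For reference (an inference from the traceback, not stated in the thread): train.py sends plots to a visdom instance on the default port 8097, and the ConnectionError just means no visdom server is listening; the loss logging continues regardless. Starting a server beforehand should silence it:

python -m visdom.server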

Creating norm_info_mgc_lf0_vuv_bap_63_MVN.dat for the Full VCTK dataset

Hi There!

For large datasets, where extract_feats.py uses its multifolder feature (like the full VCTK dataset), it's unclear what the norm_info/norm.dat file is. The norm_info_mgc_lf0_vuv_bap_63_MVN.dat file is regenerated for each tmp split of the dataset. How do you create norm_info/norm.dat for datasets with more than 5000 files?

I believe you had to deal with the same problem with the 22-speaker dataset, because it contains around 8000 files.

Thanks for your time, Michael. Happy to contribute back the findings.

P.S. I've been commenting in https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300 about changes needed to make the extract_feats.py script work. I can't submit a pull request. I know many people are struggling to get it running.

'Out of memory' occurred when training with a single speaker

Hi,

The error below occurred when training with a single speaker:

Train (loss 63.31) epoch 1: 3%|████▍ | 1/29 [00:23<11:10, 23.93s/it]THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 211, in <module>
    main()
  File "train.py", line 199, in main
    train(model, criterion, optimizer, epoch, train_losses)
  File "train.py", line 122, in train
    loss.backward()
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/_functions/linear.py", line 24, in backward
    grad_weight = torch.mm(grad_output.t(), input)
RuntimeError: cuda runtime error (2) : out of memory at /b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu:66

Could you please help?
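
A hedged workaround (my suggestion, not from the maintainers): the Namespace dumps elsewhere in these issues show batch_size=64, so assuming the flag is spelled --batch-size, halving it (or shortening --seq-len) should cut peak GPU memory, e.g.:

python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10 --batch-size 32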

error on generate.py execution

I get an error upon executing:
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

(gpu_13) abhinav@ubuntu11:~/.../loop$ python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
Traceback (most recent call last):
  File "generate.py", line 153, in <module>
    main()
  File "generate.py", line 132, in main
    out, attn = model([txt, spkr], feat)
  File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 247, in forward
    context, ident = self.encoder(src[0], src[1])
  File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/software/LM_stash/abhinav/projects/tts/loop/model.py", line 66, in forward
    outputs = self.lut_p(input)
  File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/modules/sparse.py", line 94, in forward
    self.scale_grad_by_freq, self.sparse
  File "/home/abhinav/tensorflow/gpu_13/local/lib/python2.7/site-packages/torch/nn/_functions/thnn/sparse.py", line 48, in forward
    cls._renorm(indices, weight, max_norm, norm_type)
TypeError: _renorm() takes exactly 5 arguments (4 given)

I have followed all the steps in the Setup section.
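
The _renorm() signature mismatch usually points at a PyTorch release other than the 0.1.12 pinned in Setup (an inference on my part; the Embedding internals in torch.nn._functions.thnn.sparse changed between releases). A quick check:

import torch

# Setup requires PyTorch 0.1.12; other versions may hit this Embedding path
print(torch.__version__)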

Decoder input

In the paper, the decoder input seems to mix the previous decoder output and the ground-truth input (+noise).
But it seems the decoder in the code only uses the ground-truth input with noise.
Am I missing something?

Can loop support parallel training in multiple GPU?

Hi Loop experts,

Currently, I have reproduced the original loop model with the 8K VCTK data; it took around 3 days on my Ubuntu GPU server, which has 2 GPUs.
So, can loop support parallel training on multiple GPUs to accelerate training?

Thanks.
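
The readme doesn't mention multi-GPU support. As a hedged sketch only (untested against this codebase; whether the model's attention buffers replicate cleanly is unknown), PyTorch's stock data-parallel wrapper would look like:

import torch.nn as nn

# hypothetical: wrap an already-constructed Loop model across two GPUs
model = nn.DataParallel(model, device_ids=[0, 1])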

Memory usage

Hi, and thank you for releasing your code. Currently, I'm trying to reproduce the VCTK results on an EC2 instance with a Kepler GPU, and more than an issue, I have a question:

The tqdm report shows iterations taking around 9 seconds:

Train (loss 50.62) epoch 2: 28%|##7 | 35/126 [05:01<13:02, 8.60s/it]
Train (loss 48.54) epoch 2: 40%|#### | 51/126 [07:29<12:21, 9.89s/it]

And nvidia-smi shows very low memory usage:

1208MiB / 11439MiB

So, I'm not sure if I'm missing something or if this is the expected performance.

Thanks.

Thank you for releasing the code

I have tried to rebuild this model based on the details mentioned in the paper, but the result is bad. I used the Adam optimizer with a 0.0002 learning rate and 0.5/0.9 momentum, as well as gradient clipping at 1, but the gradient still explodes in the first few epochs. Now I can check what is wrong with my implementation.

Training on a big dataset

Hello loop experts,

If I have a big dataset, say 12 speakers with around 50K utterances in total, and I want to train a loop model, do any parameters need adjusting?

Feature extraction

I trained loop with a subset of the VCTK data (American speakers). I found that the audio from those speakers when I run generate.py with my trained model is pretty bad. I hear only a couple of words in a sentence; the rest is silence or noise.

My guess is that something went wrong during feature extraction. When I compare the same feature-extracted file, i.e. p294_001.npz, from the given s3 bucket and the one I extracted by running extract_feats.py, I see that vuv_idx from s3 has larger numbers (range: -5 to 5) compared to mine (range: -10e-02 to 5).

I also noticed that text_features and audio_features have different shapes:
(226, 420) - s3
(540, 420) - me

Other features like durations and code2phone also look different.

May I know what changes I have to make to extract_feats.py to get features similar to the ones in s3?

Training is slow.

I'm at the training stage. It took 10 hours for the first 33 epochs out of 90. Is that normal, or did I miss something? I'm new to this, so I'm not an expert in this field.

Thanks.

New Dataset

Hi, everything worked perfectly with your preprocessed VCTK. Now I want to test with the Nancy dataset. I'm using the script you suggested, but I have 2 questions:

  1. When I run the script, I get 2 files in the norm_info folder: label_norm_HTS_420.dat and norm_info_mgc_lf0_vuv_bap_63_MVN.dat. Based on the shape, the correct file is norm_info_mgc_lf0_vuv_bap_63_MVN.dat, but I want to be sure.

  2. In order to combine both datasets, do I have to run the script for each speaker and then somehow combine the norm files, or should I put all the data in one folder and process it?

Thanks.

Cannot run the demo

I tested the demo but it failed:

python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
Traceback (most recent call last):
  File "generate.py", line 153, in <module>
    main()
  File "generate.py", line 132, in main
    out, attn = model([txt, spkr], feat)
  File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xinwang/voiceloop/loop/model.py", line 247, in forward
    context, ident = self.encoder(src[0], src[1])
  File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xinwang/voiceloop/loop/model.py", line 66, in forward
    outputs = self.lut_p(input)
  File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/sparse.py", line 94, in forward
    self.scale_grad_by_freq, self.sparse
  File "/usr/local/lib/python2.7/site-packages/torch/nn/_functions/thnn/sparse.py", line 48, in forward
    cls._renorm(indices, weight, max_norm, norm_type)
TypeError: _renorm() takes exactly 5 arguments (4 given)
