k2kobayashi / crank
A toolkit for non-parallel voice conversion based on vector-quantized variational autoencoder
License: MIT License
# python -m crank.bin.extract_statistics --n_jobs 2 --phase train --conf conf/mlfb_vqvae.yml --scpdir data/scp --featdir data/feature
# Started at Mon Dec 7 14:55:35 UTC 2020
#
INFO:root:# of samples for mlfb: 208496
INFO:root:# of samples for mcep: 208496
INFO:root:# of samples for lcf0: 208496
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/crank/crank/bin/extract_statistics.py", line 90, in <module>
main()
File "/content/crank/crank/bin/extract_statistics.py", line 79, in main
"lcf0", spkr, s.ss.n_samples_seen_
AttributeError: 'StandardScaler' object has no attribute 'n_samples_seen_'
# Accounting: time=3 threads=1
# Ended (code 1) at Mon Dec 7 14:55:38 UTC 2020, elapsed time 3 seconds
I can't start the training at stage 3 anymore:
# python -m crank.bin.train --flag train --n_jobs 10 --conf conf/mlfb_vqvae.yml --checkpoint None --scpdir data/scp --featdir data/feature --expdir exp
# Started at Wed Dec 22 19:20:31 UTC 2021
#
2021-12-22 19:20:40,711 (train:164) INFO: feature: {'label': 'mlfb', 'fs': 22050, 'fftl': 1024, 'win_length': 1024, 'hop_size': 128, 'fmin': 80, 'fmax': 7600, 'mlfb_dim': 80, 'n_iteration': 100, 'framems': 20, 'shiftms': 5.80499, 'mcep_dim': 34, 'mcep_alpha': 0.466, 'window_types': ['hann']}
2021-12-22 19:20:40,713 (train:164) INFO: input_feat_type: mlfb
2021-12-22 19:20:40,713 (train:164) INFO: output_feat_type: mlfb
2021-12-22 19:20:40,713 (train:164) INFO: trainer_type: vqvae
2021-12-22 19:20:40,714 (train:164) INFO: input_size: 80
2021-12-22 19:20:40,714 (train:164) INFO: output_size: 80
2021-12-22 19:20:40,714 (train:164) INFO: n_steps: 200000
2021-12-22 19:20:40,714 (train:164) INFO: dev_steps: 2000
2021-12-22 19:20:40,715 (train:164) INFO: n_steps_save_model: 350
2021-12-22 19:20:40,715 (train:164) INFO: n_steps_print_loss: 50
2021-12-22 19:20:40,715 (train:164) INFO: batch_size: 50
2021-12-22 19:20:40,715 (train:164) INFO: batch_len: 500
2021-12-22 19:20:40,715 (train:164) INFO: cache_dataset: True
2021-12-22 19:20:40,716 (train:164) INFO: spec_augment: False
2021-12-22 19:20:40,716 (train:164) INFO: n_spec_augment: 0
2021-12-22 19:20:40,716 (train:164) INFO: use_mcep_0th: False
2021-12-22 19:20:40,716 (train:164) INFO: ignore_scaler: []
2021-12-22 19:20:40,716 (train:164) INFO: alpha: {'l1': 2, 'mse': 0, 'stft': 1, 'commit': 0.25, 'dict': 0.5, 'cycle': 0.1, 'ce': 1, 'adv': 1, 'real': 0.5, 'fake': 0.5, 'acgan': 1}
2021-12-22 19:20:40,717 (train:164) INFO: stft_params: {'fft_sizes': [64, 128], 'win_sizes': [64, 128], 'hop_sizes': [16, 32], 'logratio': 0}
2021-12-22 19:20:40,717 (train:164) INFO: optim: {'G': {'type': 'adam', 'lr': 0.0002, 'decay_size': 0.5, 'decay_step_size': 200000, 'clip_grad_norm': 0.0}, 'D': {'type': 'adam', 'lr': 5e-05, 'decay_size': 0.5, 'decay_step_size': 200000, 'clip_grad_norm': 0.0}, 'C': {'type': 'adam', 'lr': 0.0001, 'decay_size': 0.5, 'decay_step_size': 200000, 'clip_grad_norm': 0.0}, 'SPKRADV': {'type': 'adam', 'lr': 0.0001, 'decay_size': 0.5, 'decay_step_size': 200000, 'clip_grad_norm': 0.0}}
2021-12-22 19:20:40,717 (train:164) INFO: encoder_f0: False
2021-12-22 19:20:40,717 (train:164) INFO: decoder_f0: True
2021-12-22 19:20:40,718 (train:164) INFO: encoder_energy: False
2021-12-22 19:20:40,718 (train:164) INFO: decoder_energy: False
2021-12-22 19:20:40,718 (train:164) INFO: causal: False
2021-12-22 19:20:40,718 (train:164) INFO: causal_size: 0
2021-12-22 19:20:40,719 (train:164) INFO: use_spkr_embedding: True
2021-12-22 19:20:40,719 (train:164) INFO: spkr_embedding_size: 32
2021-12-22 19:20:40,719 (train:164) INFO: ema_flag: True
2021-12-22 19:20:40,719 (train:164) INFO: n_vq_stacks: 2
2021-12-22 19:20:40,719 (train:164) INFO: n_layers_stacks: [4, 3, 2]
2021-12-22 19:20:40,720 (train:164) INFO: n_layers: [2, 2, 2]
2021-12-22 19:20:40,720 (train:164) INFO: kernel_size: [5, 3, 3]
2021-12-22 19:20:40,720 (train:164) INFO: emb_dim: [64, 64, 64]
2021-12-22 19:20:40,720 (train:164) INFO: emb_size: [512, 512, 512]
2021-12-22 19:20:40,720 (train:164) INFO: use_spkradv_training: True
2021-12-22 19:20:40,721 (train:164) INFO: n_spkradv_layers: 3
2021-12-22 19:20:40,721 (train:164) INFO: spkradv_kernel_size: 3
2021-12-22 19:20:40,721 (train:164) INFO: spkradv_lambda: 0.1
2021-12-22 19:20:40,721 (train:164) INFO: use_spkr_classifier: True
2021-12-22 19:20:40,722 (train:164) INFO: n_spkr_classifier_layers: 8
2021-12-22 19:20:40,722 (train:164) INFO: spkr_classifier_kernel_size: 5
2021-12-22 19:20:40,722 (train:164) INFO: use_cyclic_training: False
2021-12-22 19:20:40,722 (train:164) INFO: n_steps_cycle_start: 50000
2021-12-22 19:20:40,722 (train:164) INFO: n_cycles: 1
2021-12-22 19:20:40,723 (train:164) INFO: n_steps_gan_start: 100000
2021-12-22 19:20:40,723 (train:164) INFO: gan_type: lsgan
2021-12-22 19:20:40,723 (train:164) INFO: use_residual_network: True
2021-12-22 19:20:40,723 (train:164) INFO: n_discriminator_layers: 2
2021-12-22 19:20:40,724 (train:164) INFO: n_discriminator_stacks: 4
2021-12-22 19:20:40,724 (train:164) INFO: discriminator_kernel_size: 5
2021-12-22 19:20:40,724 (train:164) INFO: discriminator_dropout: 0.25
2021-12-22 19:20:40,724 (train:164) INFO: train_first: D
2021-12-22 19:20:40,724 (train:164) INFO: switch_update: False
2021-12-22 19:20:40,725 (train:164) INFO: cvadv_flag: False
2021-12-22 19:20:40,725 (train:164) INFO: acgan_flag: False
2021-12-22 19:20:40,725 (train:164) INFO: encoder_detach: False
2021-12-22 19:20:40,726 (train:164) INFO: use_real_only_acgan: False
2021-12-22 19:20:40,726 (train:164) INFO: use_D_uv: True
2021-12-22 19:20:40,726 (train:164) INFO: use_D_spkrcode: True
2021-12-22 19:20:40,726 (train:164) INFO: use_vqvae_loss: True
2021-12-22 19:20:40,726 (train:164) INFO: n_steps_stop_generator: 0
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/crank/crank/bin/train.py", line 231, in <module>
main()
File "/content/drive/MyDrive/crank/crank/bin/train.py", line 182, in main
model = get_model(conf, spkr_size, device, scaler=scaler)
File "/content/drive/MyDrive/crank/crank/bin/train.py", line 57, in get_model
models = {"G": VQVAE2(conf, spkr_size=spkr_size, scaler=scaler).to(device)}
File "/content/drive/MyDrive/crank/crank/net/module/vqvae2.py", line 52, in __init__
if self.conf["use_raw"]:
KeyError: 'use_raw'
# Accounting: time=12 threads=1
# Ended (code 1) at Wed Dec 22 19:20:43 UTC 2021, elapsed time 12 seconds
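The `KeyError: 'use_raw'` above is the kind of failure that shows up when the code reads a key that an older config file does not define. A minimal sketch of one way to guard against it, assuming the config is a plain dict: fill in defaults for missing keys before the model is built and report which ones were filled. The helper name `fill_missing_keys` and the default value `False` for `use_raw` are illustrative assumptions, not crank's actual behaviour.

```python
def fill_missing_keys(conf, defaults):
    """Return a copy of conf with any missing keys taken from defaults.

    Also returns the sorted list of keys that had to be filled in, so they can
    be logged instead of silently assumed.
    """
    missing = sorted(k for k in defaults if k not in conf)
    merged = dict(defaults)
    merged.update(conf)  # values from the user's config always win
    return merged, missing


# Hypothetical defaults; the real default for "use_raw" is an assumption here.
DEFAULTS = {"use_raw": False}

old_conf = {"trainer_type": "vqvae", "input_size": 80}
conf, missing = fill_missing_keys(old_conf, DEFAULTS)
print(missing)          # keys the old config file did not define
print(conf["use_raw"])  # readable now, no KeyError
```

Updating the YAML file itself to the current template is the more durable fix; the sketch only shows how a stale config could be detected early.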
Hi, first of all, thanks for this great repo!
I have a question:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/crank/crank/bin/extract_feature.py", line 18, in <module>
from crank.feature import Feature
File "/data/crank/crank/feature/__init__.py", line 1, in <module>
from .feature import Feature # noqa
File "/data/crank/crank/feature/feature.py", line 20, in <module>
from sprocket.speech import FeatureExtractor, Synthesizer
File "/data/crank/tools/venv/lib/python3.7/site-packages/sprocket/speech/__init__.py", line 1, in <module>
from .feature_extractor import FeatureExtractor
File "/data/crank/tools/venv/lib/python3.7/site-packages/sprocket/speech/feature_extractor.py", line 3, in <module>
import pysptk
File "/data/crank/tools/venv/lib/python3.7/site-packages/pysptk/__init__.py", line 41, in <module>
from .sptk import * # pylint: disable=wildcard-import
File "/data/crank/tools/venv/lib/python3.7/site-packages/pysptk/sptk.py", line 147, in <module>
from . import _sptk
File "__init__.pxd", line 242, in init pysptk._sptk
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
I trained on a custom dataset and stopped at around 50000 steps.
Stages 4 and 5 completed.
Stage 6 failed.
The download_pretrained_vocoder log says the following:
# local/download_pretrained_vocoder.sh --downloaddir downloads/PWG --voc PWG
#
Permission denied: https://drive.google.com/uc?id=set id of google drive
Maybe you need to change permission over 'Anyone with the link'?
tar (child): downloads/PWG/PWG.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
# Accounting: time=0 threads=1
# Ended (code 2) at, elapsed time 0 seconds
Edit: Maybe the ID of the Google Drive folder for the vocoder is missing?
Could you please upload a copy of your paper? I can't find it. Thanks.
I have a bunch of real voice files. Some of them are of really bad quality. At stage 2, in utils.py, convert_continuos_f0(...), the line start_f0 = f0[f0 != 0][0] raises the exception "index 0 is out of bounds for axis 0 with size 0".
It would be good not to abort the whole process, and/or to check the quality and drop such files automatically.
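The expression `f0[f0 != 0][0]` fails exactly when a file has no voiced frame at all, because the filtered array is then empty. A minimal sketch of a pre-check that could drop such files before feature extraction, written in plain Python so it works on lists as well as arrays. The helper names `first_voiced_f0` and `split_by_voicing` and the example tracks are hypothetical and not part of crank.

```python
def first_voiced_f0(f0):
    """Return the first non-zero F0 value, or None if every frame is unvoiced."""
    for value in f0:
        if value != 0:
            return value
    return None


def split_by_voicing(f0_by_file):
    """Partition {filename: F0 sequence} into usable and skipped file names."""
    usable, skipped = [], []
    for name, f0 in f0_by_file.items():
        if first_voiced_f0(f0) is not None:
            usable.append(name)
        else:
            skipped.append(name)
    return usable, skipped


# Hypothetical F0 tracks; "bad.wav" has no voiced frame at all.
tracks = {"good.wav": [0.0, 0.0, 120.5, 118.0], "bad.wav": [0.0, 0.0, 0.0]}
usable, skipped = split_by_voicing(tracks)
print("usable:", usable)
print("skipped:", skipped)
```

Logging the skipped names rather than raising would let the remaining files pass through stage 2.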
@k2kobayashi Thank you for your work!
Could you please make it more Windows-friendly, as you did with https://github.com/k2kobayashi/sprocket?
The file path handling in crank on Windows is a disaster... I think you already know about it :D
This is the generated one; the speech is twice as slow as the original... and my PWG vocoder was only trained up to 200000.ckpt.
This is the duration of the same waveform for the original and your sample:
I used the default configuration and setup. Could you help me fix this kind of problem? Will it influence the MCD and MOSNet results?
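One possible cause of a factor-of-two duration change (an assumption, not something confirmed in this thread) is a hop-size mismatch between the extracted features and the vocoder. The arithmetic can be sketched as follows, using the `fs: 22050` and `hop_size: 128` values from the training log quoted earlier on this page; the frame count and the vocoder hop size of 256 are hypothetical.

```python
def shift_ms(hop_size, fs):
    """Frame shift in milliseconds for a hop size (samples) and sampling rate (Hz)."""
    return 1000.0 * hop_size / fs


def duration_seconds(n_frames, hop_size, fs):
    """Duration of a waveform synthesized from n_frames feature frames."""
    return n_frames * hop_size / fs


fs = 22050
n_frames = 1000  # hypothetical utterance length in frames

# Frame shift implied by the feature config (compare with 'shiftms' in the log).
print(round(shift_ms(128, fs), 5))

# Same frames rendered with the feature hop size and with a doubled hop size.
print(duration_seconds(n_frames, 128, fs))
print(duration_seconds(n_frames, 256, fs))
```

If the vocoder's own config lists a different hop size or sampling rate than the feature config, the synthesized waveform would be stretched or compressed by their ratio.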
Let's discuss neural vocoder support.
pip install -U parallel_wavegan
voc_expdir
to load the pretrained model. In addition, with PWG, kan-bayashi has also packaged the training code, so we can provide recipes for users to train their own vocoders if they want. One example design could be like egs/pwg/vcc2018.
Stages 1 - 6 worked successfully. Stage 7 failed.
The MCD log says the following:
'''
# python -m crank.bin.evaluate_mcd --conf conf/mlfb_vqvae.yml --n_jobs 10 --spkr_conf conf/spkr.yml --outwavdir exp/mlfb_vqvae/eval_PWG_wav/100000/wav --featdir data/feature
# Started at Sun Mar 7 19:34:38 UTC 2021
#
2021-03-07 19:34:41,647 (evaluate_mcd:117) INFO: number of utterances = 20
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/usr/local/lib/python3.7/dist-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "/content/drive/MyDrive/crank/crank/bin/evaluate_mcd.py", line 48, in calculate
number, orgspk, tarspk = basename.split("_")
ValueError: too many values to unpack (expected 3)
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/MyDrive/crank/crank/bin/evaluate_mcd.py", line 151, in <module>
main()
File "/content/drive/MyDrive/crank/crank/bin/evaluate_mcd.py", line 131, in main
for cv_path in converted_files
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/local/lib/python3.7/dist-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
ValueError: too many values to unpack (expected 3)
# Accounting: time=11 threads=1
# Ended (code 1) at Sun Mar 7 19:34:49 UTC 2021, elapsed time 11 seconds
What is happening here?
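The traceback shows `basename.split("_")` returning more than three parts, which happens when a speaker name itself contains an underscore. A minimal sketch of a more tolerant parser, assuming the basename has the form `<number>_<orgspk>_<tarspk>`, that the number contains no underscore, and that the set of speaker names is known (for example from conf/spkr.yml). The helper name `parse_converted_basename` and the speaker names are hypothetical.

```python
def parse_converted_basename(basename, speakers):
    """Split '<number>_<orgspk>_<tarspk>' when speaker names may contain '_'."""
    number, rest = basename.split("_", 1)
    for org in speakers:
        prefix = org + "_"
        # Accept the split only if the remainder is also a known speaker.
        if rest.startswith(prefix) and rest[len(prefix):] in speakers:
            return number, org, rest[len(prefix):]
    raise ValueError(f"cannot parse speakers from {basename!r}")


# Hypothetical speaker names that contain underscores.
speakers = {"spk_a", "spk_b"}
print(parse_converted_basename("30001_spk_a_spk_b", speakers))
```

Renaming the speakers so that they contain no underscores avoids the problem without touching the code.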
Yesterday, I tried to use the toolkit with my own dataset. However, I got the following error:
# python -m crank.bin.extract_feature --n_jobs 10 --phase train --conf conf/mlfb_vqvae.yml --spkr_yml conf/spkr.yml --scpdir data/scp --featdir data/feature
# Started at Thu Mar 25 00:09:23 UTC 2021
#
INFO:root:extract feature for Dysarthria_Taiwan_Moderate_CP_WuSH_VAD
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
r = call_item()
File "/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
return self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.6/dist-packages/joblib/_parallel_backends.py", line 600, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 256, in __call__
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.6/dist-packages/joblib/parallel.py", line 256, in <listcomp>
for func, args, kwargs in self.items]
File "/home/alim/Crank/crank/crank/feature/feature.py", line 51, in analyze
self._synthesize_world_features(flbl)
File "/home/alim/Crank/crank/crank/feature/feature.py", line 115, in _synthesize_world_features
self.feats["mcep"],
KeyError: 'mcep'
Any solution?
I trained the model and want to test it by converting new audio files with the newly trained model. How should I approach this?
Will there be support for DirectML on Ubuntu? Right now it only works with CUDA, not on AMD GPUs. With DirectML it could work on both NVIDIA and AMD GPUs.
stage 1: initialization
run.pl: job failed, log is in data/log/generate_histogram.log
The log file is attached.
generate_histogram.log
What am I missing here?
EDIT: I have 7 speakers. I noticed that 6 of them work fine at this stage and everything gets generated. Once I add the 7th speaker, this issue occurs. I checked the WAVs of the 7th speaker but could not find any problems. If I remove the dataset of speaker 7, it works fine, so there has to be a problem with the WAV files. All the WAVs of the speakers have a bit depth of 16 and a sampling rate of 22050 Hz.
Edit 2: I fixed the problem. I just re-encoded the WAVs of speaker 7 to WAV with ffmpeg, restarted stage 1, and the stage passed successfully.
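Since re-encoding fixed the files, the original WAVs may have had headers that looked fine in an audio editor but were not plain PCM. A minimal sketch of a header check using only Python's standard `wave` module; the helper name `check_wav` is hypothetical, and the default values mirror the 22050 Hz, 16-bit setup described above. It is a rough screening tool under these assumptions, not the check crank itself performs.

```python
import wave


def check_wav(path, expected_fs=22050, expected_sampwidth=2):
    """Return a list of problems found in a WAV file (empty if it looks fine)."""
    problems = []
    try:
        with wave.open(str(path), "rb") as wf:
            if wf.getframerate() != expected_fs:
                problems.append(f"sampling rate is {wf.getframerate()} Hz")
            if wf.getsampwidth() != expected_sampwidth:
                problems.append(f"sample width is {wf.getsampwidth()} bytes")
            if wf.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError) as err:
        # Raised for headers the stdlib reader cannot parse, e.g. non-PCM data.
        problems.append(f"not readable as PCM WAV: {err}")
    return problems
```

Calling `check_wav` on every file of a speaker before stage 1 and printing the non-empty results would point at the files worth re-encoding.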
These are some models I wish to implement.
During stage 2, I got the following error in data/log/extract_feature_train.log:
Traceback (most recent call last):
File "/home/huang18/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/huang18/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/nas01/internal/wenchin-h/VC/Experiments/crank/crank/bin/extract_feature.py", line 75, in <module>
main()
File "/nas01/internal/wenchin-h/VC/Experiments/crank/crank/bin/extract_feature.py", line 42, in main
(featsscp).unlink(missing_ok=True)
TypeError: unlink() got an unexpected keyword argument 'missing_ok'
According to https://docs.python.org/3/library/pathlib.html#pathlib.Path.unlink, this parameter was added in Python 3.8. Is it possible to make it compatible with lower Python versions?
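A minimal sketch of a replacement that behaves like `unlink(missing_ok=True)` on Python versions before 3.8: call `unlink()` without the keyword and ignore the `FileNotFoundError` it raises for a missing file. The helper name `unlink_if_exists` and the example path are hypothetical.

```python
from pathlib import Path


def unlink_if_exists(path):
    """Delete path if it exists; do nothing if it is already gone."""
    try:
        Path(path).unlink()
    except FileNotFoundError:
        pass


# Safe to call even when the file was never created (hypothetical path).
unlink_if_exists("no_such_dir_for_example/feats.scp")
```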
I managed to set everything up with Ubuntu on WSL on Windows 10. But I have now discovered that Ubuntu on WSL does not support GPUs, which is bad, so I can't run stages 3 to 5 on my local GPU.
I have a large dataset, but the space on my Google Drive is limited. The extracted features took 12 GB on my drive, so the further stages can't continue because there is no space available.
Can't you at least let stages 4 and 5 run on CPU? I would cut down the dataset and train on a smaller part over Google Colab, and do the other stages locally on my CPU.
Hi, after your last update I get this error when training:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/crank/crank/bin/train.py", line 179, in <module>
main()
File "/content/crank/crank/bin/train.py", line 134, in main
model = get_model(conf, spkr_size, device)
File "/content/crank/crank/bin/train.py", line 82, in get_model
if conf["speaker_adversarial"]:
KeyError: 'speaker_adversarial'
Is there any detailed information on all the parameters in the config files and how they affect the audio?
conf/mlfb_vqvae.yml
I left everything at the defaults and trained 2 speakers with a dataset of 90 files each.
These were the results after stage 7:
mel-cepstrum distortion:
spkr1 spkr1 7.458
spkr1 spkr2 11.707
spkr2 spkr1 11.223
spkr2 spkr2 7.358
mosnet score prediction (0=bad, 5=good):
spkr1 spkr1 2.900
spkr1 spkr2 2.577
spkr2 spkr1 2.814
spkr2 spkr2 2.502
After around 30000 steps, there was no significant improvement up to 200000 steps in the training stage.
I used a pretrained vocoder from the vcc_2020 example, since other pretrained PWG vocoders resulted in female voices, but both my speakers were male. The generated WAVs had an awful distorted growl in the lower frequencies, and the pitch mostly sounded very flat.
So now I am thinking about how to improve the model.
Do I need more data for the speakers? Which parameters do I need to change in the config files before starting stage 2 again?
I found that in PR #3 you changed from using wavfile to using soundfile. If we use wavfile to load the waveform, the values will be in [-32768, 32767], and for soundfile the range is [-1, 1]. But for WORLD features, since we still use the Synthesizer class from sprocket, according to https://github.com/k2kobayashi/sprocket/blob/master/example/src/convert.py#L160-L161, I think Synthesizer will synthesize a waveform ranging over [-32768, 32767].
I wonder if there was a specific reason for PR #3?
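The two ranges mentioned above differ only by a constant scale for 16-bit PCM. A minimal sketch of the conversion in both directions, assuming a scale of 32768 and clipping on the way back to integers; the helper names are hypothetical, and neither wavfile nor soundfile is imported, so this only illustrates the arithmetic.

```python
INT16_SCALE = 32768.0  # assumed scale between 16-bit PCM and normalized floats


def int16_to_float(samples):
    """Map 16-bit integer samples to floats in [-1, 1)."""
    return [s / INT16_SCALE for s in samples]


def float_to_int16(samples):
    """Map normalized floats back to 16-bit integers, clipping out-of-range values."""
    out = []
    for s in samples:
        value = int(round(s * INT16_SCALE))
        out.append(max(-32768, min(32767, value)))
    return out


pcm = [-32768, 0, 16384]
normalized = int16_to_float(pcm)
print(normalized)
print(float_to_int16(normalized))
```

Whichever range is chosen, the analysis and synthesis sides need to agree on it, otherwise the output is either near-silent or heavily clipped.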
Since the VCC data is not commonly available could you either:
Hello,
Each input file in my dataset is 6 seconds long, but the output samples are 3 seconds long. What should I change to make the output duration the same as the input?
Also, the results are bad. Are there any settings I should change?
I added a debug print(...) in a few places, for example in feature.py, _open_wavf(...), to print the WAV file name.
Sometimes I see the output, sometimes not. Obviously this is a multithreading-related issue.
P.S. I can always see exceptions.
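One possible explanation (an assumption, not verified against crank) is that text printed inside parallel worker processes sits in an output buffer and is lost or delayed, while exceptions are passed back to the parent process and therefore always appear. A minimal sketch of a debug helper that writes to stderr, flushes immediately, and tags each line with the process ID; the helper name `debug_print` is hypothetical.

```python
import os
import sys


def debug_print(*args, stream=None):
    """Print a debug message with the current process ID and flush right away."""
    stream = sys.stderr if stream is None else stream
    print(f"[pid {os.getpid()}]", *args, file=stream, flush=True)


debug_print("opening", "example.wav")  # hypothetical file name
```

Running the stage with a single job (`--n_jobs 1`, a flag that appears in the commands quoted on this page) is another way to make the output order predictable while debugging.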
I noticed the two lines in the run.sh file for resuming the training:
resume_checkpoint="exp/mlfb_vqvae/checkpoint_95000steps.pkl" # checkpoint path to resume
decode_checkpoint="None" # checkpoint path to resume
I only changed the path to resume the training. What is the decode_checkpoint option for? I have left it blank all the time.