hujinsen / stargan-voice-conversion

Full TensorFlow implementation of the paper "StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks" (https://arxiv.org/abs/1806.02169)

License: MIT License

Python 100.00%

voice-conversion voice-converter stargan-vc stargan cyclegan-vc tensorflow

stargan-voice-conversion's Issues

why train the classifier in parallel with the generator and the discriminator?

I want to know why you do not train the classifier in advance.
To me, it seems that the classifier can be trained separately from the other sub-models.

Issues in training the model

python train.py
Loading Data...
found stat file: ./etc/children-stats.npz
found stat file: ./etc/adult-stats.npz
Loading Data Done.
d1: [None, 36, 512, 32]
d2: [None, 18, 256, 64]
d3: [None, 9, 128, 128]
d4: [None, 9, 128, 64]
[None, 1, 128, 4]
u1.shape :[None, 9, 128, 64]
c1 shape: (?, 9, 128, 4)
u1_concat.shape :[None, 9, 128, 68]
u2.shape :[None, 9, 128, 128]
u3.shape :[None, 18, 256, 64]
u4.shape :[None, 36, 512, 32]
u4_concat.shape :[None, 36, 512, 36]
u5.shape :[None, 36, 512, 1]
d1: [None, 36, 512, 32]
d2: [None, 18, 256, 64]
d3: [None, 9, 128, 128]
d4: [None, 9, 128, 64]
[None, 1, 128, 4]
u1.shape :[None, 9, 128, 64]
c1 shape: (?, 9, 128, 4)
u1_concat.shape :[None, 9, 128, 68]
u2.shape :[None, 9, 128, 128]
u3.shape :[None, 18, 256, 64]
u4.shape :[None, 36, 512, 32]
u4_concat.shape :[None, 36, 512, 36]
u5.shape :[None, 36, 512, 1]
d1: [None, 36, 512, 32]
d2: [None, 18, 256, 64]
d3: [None, 9, 128, 128]
d4: [None, 9, 128, 64]
[None, 1, 128, 4]
u1.shape :[None, 9, 128, 64]
c1 shape: (?, 9, 128, 4)
u1_concat.shape :[None, 9, 128, 68]
u2.shape :[None, 9, 128, 128]
u3.shape :[None, 18, 256, 64]
u4.shape :[None, 36, 512, 32]
u4_concat.shape :[None, 36, 512, 36]
u5.shape :[None, 36, 512, 1]
domain_classifier_d1: (?, 8, 512, 8)
domain_classifier_d1_p: (?, 4, 256, 8)
domain_classifier_d12: (?, 4, 256, 16)
domain_classifier_d2_p: (?, 2, 128, 16)
domain_classifier_d3: (?, 2, 128, 32)
domain_classifier_d3_p: (?, 1, 64, 32)
domain_classifier_d4: (?, 1, 64, 16)
domain_classifier_d4_p: (?, 1, 32, 16)
domain_classifier_d5: (?, 1, 32, 4)
domain_classifier_d5_p: (?, 1, 16, 4)
classifier_output: (?, 1, 1, 4)
domain_classifier_d1: (?, 8, 512, 8)
domain_classifier_d1_p: (?, 4, 256, 8)
domain_classifier_d12: (?, 4, 256, 16)
domain_classifier_d2_p: (?, 2, 128, 16)
domain_classifier_d3: (?, 2, 128, 32)
domain_classifier_d3_p: (?, 1, 64, 32)
domain_classifier_d4: (?, 1, 64, 16)
domain_classifier_d4_p: (?, 1, 32, 16)
domain_classifier_d5: (?, 1, 32, 4)
domain_classifier_d5_p: (?, 1, 16, 4)
classifier_output: (?, 1, 1, 4)
d1: [None, 36, 512, 32]
d2: [None, 18, 256, 64]
d3: [None, 9, 128, 128]
d4: [None, 9, 128, 64]
[None, 1, 128, 4]
u1.shape :[None, 9, 128, 64]
c1 shape: (?, 9, 128, 4)
u1_concat.shape :[None, 9, 128, 68]
u2.shape :[None, 9, 128, 128]
u3.shape :[None, 18, 256, 64]
u4.shape :[None, 36, 512, 32]
u4_concat.shape :[None, 36, 512, 36]
u5.shape :[None, 36, 512, 1]
2019-03-12 12:55:15.378518: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
File "train.py", line 282, in <module>
train(processed_dir, test_wav_dir)
File "train.py", line 158, in train
lambda_classifier=lambda_classifier
File "/home/StarGan/model.py", line 148, in train
self.generator_learning_rate: generator_learning_rate})
File "/homeStarGan/myvenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/StarGan/myvenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1111, in _run
str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (8, 2) for Tensor 'source_label:0', which has shape '(?, 4)'

Python: 3.6
TensorFlow: 1.8

Help appreciated.
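
The error says the label batch is 2 columns wide while the 'source_label' placeholder was built for 4. A minimal sketch of that mismatch, in plain Python rather than the repository's own code: the one-hot width follows the number of speakers in the data, so it has to agree with whatever constant sets the placeholder width (possibly SPEAKERS_NUM, the name mentioned in another issue on this page; that link is an assumption). The helper names `one_hot` and `check_label_width` are hypothetical.

```python
# Hypothetical illustration; not the repository's code.
PLACEHOLDER_WIDTH = 4  # from the traceback: 'source_label:0' has shape (?, 4)


def one_hot(labels, speakers):
    """Encode each label as a one-hot row with one column per speaker."""
    index = {name: i for i, name in enumerate(speakers)}
    return [
        [1.0 if index[label] == col else 0.0 for col in range(len(speakers))]
        for label in labels
    ]


def check_label_width(batch, expected_width):
    """Raise early, with a readable message, if the label width is wrong."""
    width = len(batch[0])
    if width != expected_width:
        raise ValueError(
            f"labels have {width} columns but the model expects {expected_width}"
        )
    return width


# Two speakers, as in the log above (children, adult).
speakers = ["children", "adult"]
batch = one_hot(["children", "adult", "adult"], speakers)

try:
    check_label_width(batch, PLACEHOLDER_WIDTH)
    mismatch = False
except ValueError:
    mismatch = True  # 2 columns fed into a 4-column placeholder

# Building the model for len(speakers) columns removes the mismatch.
width_ok = check_label_width(batch, len(speakers))
```

If this reading is right, the number of speakers configured in the model has to be changed to match the dataset before the graph is built.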

Exception: ====no match files!====

platform: ubuntu 18.04

process:
python3.6 train.py --processed_dir ./data/processed --test_wav_dir ./data/fourspeakers_test

issue log:
t_wav_dir ./data/fourspeakers_test
Loading Data...
found stat file: ./etc/TM1-stats.npz
Traceback (most recent call last):
File "StarGAN-Voice-Conversion/utility.py", line 57, in normalizer_dict
stat_filepath = [fn for fn in glob.glob(p) if one_speaker in fn][0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 282, in <module>
train(processed_dir, test_wav_dir)
File "train.py", line 48, in train
normlizer = Normalizer()
File "StarGAN-Voice-Conversion/utility.py", line 29, in __init__
self.norm_dict = self.normalizer_dict()
File "StarGAN-Voice-Conversion/utility.py", line 59, in normalizer_dict
raise Exception('====no match files!====')
Exception: ====no match files!====

Steps: download dataset -> preprocess dataset -> train
Preprocessing completed without errors, but training fails with the error above. Can you provide guidance?

Thanks.
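
The traceback shows that a stats file was found for TM1 and that the lookup then failed for some other speaker, because filtering the glob results by speaker name returned an empty list. A small sketch for finding which speakers lack a stats file, assuming the '<speaker>-stats.npz' naming seen in the log; `missing_stat_files` is a hypothetical helper, not part of the repository.

```python
import os
import tempfile


def missing_stat_files(speakers, etc_dir):
    """Return the speakers that have no '<speaker>-stats.npz' file in etc_dir."""
    return [
        speaker
        for speaker in speakers
        if not os.path.isfile(os.path.join(etc_dir, f"{speaker}-stats.npz"))
    ]


# Demo in a temporary directory: only TM1 has a stats file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "TM1-stats.npz"), "wb").close()

missing = missing_stat_files(["TM1", "SF1"], demo_dir)
print(missing)
```

Running the same check against ./etc with the speaker names used for training would show whether preprocessing produced statistics for every speaker.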

Can train.py be run on a GPU?

I use tensorflow-gpu. Will the training phase run on the GPU instead of the CPU?

Is it possible to use any source and target audio (wav file) as input at inference time?

Training time on a Google Colab 12GB GPU

How long does training take on a Google Colab 12GB GPU, and what are the outputs?

Do not combine os.path.join with str.split('/')

On Windows, os.path.join inserts '\' as the separator, so splitting the result on '/' does not work properly.
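
A short sketch of the problem and one possible fix. It uses `ntpath` (the module behind `os.path` on Windows) so the behaviour can be reproduced on any OS; the file name is taken from a preprocessing log elsewhere on this page, and the variable names are illustrative.

```python
import ntpath  # the implementation behind os.path on Windows

# What os.path.join produces on Windows for a file from the preprocessing log.
win_path = ntpath.join("data", "processed", "TM1-100131_0.npy")

# Splitting on '/' finds no separator, so the "file name" is the whole path.
naive_name = win_path.split("/")[-1]

# basename understands the platform's separators.
file_name = ntpath.basename(win_path)
speaker = file_name.split("-")[0]
```

In real code, `os.path.basename(path)` or `pathlib.Path(path).name` gives the same result without hard-coding a separator.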

What is the advantage of this method for audio conversion?

Cannot open the generated .wav file

Hello, I was playing around with your code, but the files generated by the command below cannot be opened.
python convert.py --model_dir ./out/100_2018-11-11-20-46/model --source_speaker SF1 --target_speaker SM1
I manually placed SF1, SF2, SM1 and SM2 in both the train and test directories, and I was able to preprocess the training data and train the model successfully. Do you have any idea what is wrong?

AttributeError: module 'posixpath' has no attribute 'normpaths'

This is a bug in utility.py. Change "os.path.normpaths" on line 16 to "os.path.normpath".

wavs generated by convert.py are just noise

Does anybody have the same problem?
I use the latest code, with no parameters changed.

@hujinsen the wavs in "converted_speech" cannot be played. Could you share some result wavs that can be played?

Why are the generator and discriminator losses 'nan' during training? If this is a gradient explosion, changing the learning rate made no difference.

KeyError: 'coded_sps_mean'

OS: Windows
I only changed some code so that the folders and files work on Windows; the rest of the code is untouched.

When I start train.py:
python train.py --processed_dir .\data\processed --test_wav_dir .\data\fourspeakers_test

I get this error:

Traceback (most recent call last):
File "train.py", line 266, in <module>
train(processed_dir, test_wav_dir)
File "train.py", line 115, in train
one_file = normlizer.forward_process(one_file, speaker_name)
File "C:\work\StarGAN-Voice-Conversion\utility.py", line 33, in forward_process
mean = self.norm_dict[speakername]['coded_sps_mean']
KeyError: 'coded_sps_mean'

The dict stored under 'speakername' has these keys:
dict_keys(['f0', 'ap', 'sp', 'coded_sp'])

Any idea how I could fix it?

Audio files generated after conversion cannot be opened

What should I do if I want to use data from more speakers?

I changed SPEAKERS_NUM but got a shape problem.

network implementation differences (InstanceNorm, probability mean, max-pooling)

Thanks for your great implementation.
I found some implementation differences compared to the original article.

Difference list

IN instead of BN
mean instead of product in the last layer of D/C
(in C only) max-pooling instead of strided Conv

IN instead of BN

In the Generator, Discriminator and Classifier, Instance Normalization (IN) is used instead of Batch Normalization (BN).
Batch Normalization is still present but commented out, so is there a problem with BN?

mean instead of product in the last layer of D/C

In the original article, the probabilities (one per patch) are multiplied together (a product).

the final output D(y, c) is given by the product of all these probabilities.
...
“Product” denote ... product pooling layers,

But in this implementation, the last layer takes the average of the probabilities.

c1_red = tf.reduce_mean(c1, keepdims=True)

Is this an intentional choice based on your experiments, or is there some other reason?
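
To make the difference concrete, here is a small numeric sketch of the two pooling choices in plain Python. The patch probabilities are made up for illustration and the variable names are not from the repository.

```python
import math

# Made-up per-patch probabilities, standing in for the last-layer outputs.
patch_probs = [0.9, 0.8, 0.5]

# Product pooling, as quoted from the article.
product_pooled = math.prod(patch_probs)

# Mean pooling, as in the tf.reduce_mean line quoted above.
mean_pooled = sum(patch_probs) / len(patch_probs)

# The product is usually computed as a sum of logs to avoid underflow
# when there are many patches.
log_product = sum(math.log(p) for p in patch_probs)
```

With these values the product is 0.36 and the mean is about 0.733. The product keeps shrinking as more patches are added and is pulled down sharply by a single low patch, whereas the mean stays on the same scale regardless of the number of patches. Whether that is the reason for the choice in this implementation is not stated.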

(in C only) max-pooling instead of strided Conv

In D, down-sampling is done with strided Conv, as in the original article.
But in C, down-sampling is done with max-pooling.
Why are the two handled differently?

How to solve KeyError: 'coded_sps_mean'?

I looked into the code in utility.py. The mean is computed in the function generate_stats, but where is generate_stats called? Please help me with it.

ValueError: Cannot feed value of shape (8, 10) for Tensor 'source_label:0', which has shape '(?, 4)'

Now I have this error:
Traceback (most recent call last):
File "train.py", line 283, in <module>
train(processed_dir, test_wav_dir)
File "train.py", line 158, in train
lambda_classifier=lambda_classifier
File "/home/wuli/work/deeplearning/tf-star-gan-voice-conversion/StarGAN-Voice-Conversion/model.py", line 148, in train
self.generator_learning_rate: generator_learning_rate})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1111, in _run
str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (8, 10) for Tensor 'source_label:0', which has shape '(?, 4)'

Have you met this error before?

ValueError: Object arrays cannot be loaded when allow_pickle=False

[485:486]svaing file: ./data/processed/TM1-100131_0.npy
[485:486]svaing file: ./data/processed/TM1-100131_512.npy
save file: ./data/processed/TM1-100132
audio mcep shape (36, 512)
[486:486]svaing file: ./data/processed/TM1-100132_0.npy
Traceback (most recent call last):
File "preprocess.py", line 200, in <module>
generator.generate_stats()
File "/home/user/sources/StarGAN-Voice-Conversion/utility.py", line 158, in generate_stats
d = t.f.arr_0.item()
File "/home/user/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 94, in __getattribute__
return object.__getattribute__(self, '_obj')[key]
File "/home/user/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 262, in __getitem__
pickle_kwargs=self.pickle_kwargs)
File "/home/user/.local/lib/python3.6/site-packages/numpy/lib/format.py", line 722, in read_array
raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False
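
This error is typical of NumPy 1.16.3 and later, where `np.load` defaults to `allow_pickle=False` and refuses to load object arrays such as a pickled dict. A minimal sketch that reproduces the error and shows the usual workaround; the file name and dict contents are made up, and this is not the repository's code.

```python
import os
import tempfile

import numpy as np

# Made-up statistics dict; np.savez stores it as a 0-d object array 'arr_0'.
stats = {"coded_sps_mean": [0.0, 1.0]}
path = os.path.join(tempfile.mkdtemp(), "demo-stats.npz")
np.savez(path, stats)

# With the default allow_pickle=False, reading the object array fails.
try:
    with np.load(path) as archive:
        archive["arr_0"]
    default_load_failed = False
except ValueError:
    default_load_failed = True

# Opting in to pickle loading restores the old behaviour.
with np.load(path, allow_pickle=True) as archive:
    loaded = archive["arr_0"].item()
```

Passing `allow_pickle=True` is only safe for files you created yourself, since unpickling can execute arbitrary code. Installing a NumPy version older than 1.16.3 is the other workaround commonly used for this error.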

How many epochs did you train?

The default number of epochs is 101, and the results do not sound very good. Did you increase the number of epochs and get better results?

Can this model be trained on the VCTK dataset?

Version Issues

Hello,

I can't find the model file that should be generated after running train.py. The first argument of convert.py is causing an issue.

Getting Normalizer issues while running train.py.

Not able to reproduce the results in the original paper

I cannot reproduce the results in the original paper. I used TensorFlow 1.8.0.

About the final result

I cannot open the final result unless I first use sox on Linux to re-encode it as 32-bit or 16-bit. By the way, have you obtained good results? I tried some non-parallel data and the result is bad: although I can tell which voice it is after conversion, the speech is not clear. Do you have any advice on improving the voice quality?
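
One reading of this report is that the generated files use a sample format that many players do not support, which is why re-encoding with sox helps. A sketch of writing a waveform directly as 16-bit PCM with the standard library, assuming the samples are floats in [-1.0, 1.0]; `write_pcm16` is a hypothetical helper and the 16 kHz rate is only a demo value.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16(path, samples, sample_rate):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # avoid integer overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Demo: 0.1 s of a 440 Hz tone at an assumed 16 kHz sampling rate.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
out_path = os.path.join(tempfile.mkdtemp(), "demo_pcm16.wav")
write_pcm16(out_path, tone, sample_rate)

# Read the header back to confirm the format.
with wave.open(out_path, "rb") as wav_file:
    info = (wav_file.getnchannels(), wav_file.getsampwidth(), wav_file.getnframes())
```

A file written this way has a 2-byte sample width, which common audio players accept. This addresses only whether the file opens, not the conversion quality.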