auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Home Page: https://arxiv.org/abs/1905.05879

License: MIT License

Jupyter Notebook 11.47% Python 88.53%

voice-conversion speech-synthesis generative-models tacotron-pytorch wavenet-vocoder unsupervised-learning

autovc's Issues

Can you share a copy of the models in pan.baidu.com

Thanks for your great job. I tried to download the models from google drive through vpn, but it was so slow and might take hours. Can you share a copy of the models in pan.baidu.com ? Thanks a lot.

Tools to split VCTK audio

From your demo, it seems that some tools are used to split origin VCTK audio. Can you please share the tools?

Bad conversion quality after retraining

Hi,
first of all thanks for the great work on the AutoVC system.
I have tried to replicate the system but could not achieve nearly the same quality as the pre-trained system.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4
and have trained the system with the same 20 VCTK speakers as in the paper's experiment (plus 8 speakers from the VCC data set; results were similar when they were omitted).
I additionally used one-hot encodings instead of speaker embeddings from an encoder.

I trained for about 300,000 steps using Adam with default parameters and a LR of 0.0001; the train loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I've also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
In comparison, the converted mel-spectrograms of the supplied autovc model are much sharper and produce more natural voice, even when used with Griffin-Lim.
Here are the mel-spectrograms of my retrained model and the model of the repo

[Images: mel-spectrogram of the retrained model; mel-spectrogram of the supplied model]

Here is a minimal example of the loss and training loop I use.
I can also provide more of my code if wanted.

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()

    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1) # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns content codes with self.encoder without using the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss

    total_loss.backward()
    optimizer.step()


# Train loop..
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over Mel-Spec Slices and the index of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET], # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS]) # == 1.0
        # The rest is computing the validation loss, resynthesizing utterances, saving the model every n epochs, etc

Does anyone have an idea what is wrong with my re-implementation, or has anyone managed to reimplement the system with good quality?

Thanks a lot in advance.

About the batch size

The paper says the batch size is 2 for 100k steps. Wouldn't that be too small for batch normalization? Are you sure the batch size was 2 in your training?

Is voice activity detection necessary for wav preprocessing?

Question concerning loss function in the paper

Hello,
I'd like to ask about the loss function defined in Section 3.2, under "Training", Equation (6). I have two questions about the two equations:

1. Do the \mathbb{E} brackets denote expected values? If so, how are they calculated? (There is only one Xhat1 and X1 per iteration.)
2. What do the indices outside of the absolute values mean? I take the absolute values to mean the L2 norm of the pointwise differences of the tensors, and the 2 on top in the definition of L_recon to mean a power (that is, norm squared).

Thank you for your time!

The output encoder

It seems that the output encoder should be an extra module with the same structure and the same weights as the input encoder. But it is very difficult to get convergence in my training. Correct me if I am wrong.

Training is too slow

I used a training set of 400 speakers, each with 350 sentences. Training is very slow.
Is that too big?
What training set did you use?
Can you tell me?

Does anyone reproduce the sound quality in the demo page?

The complete training code may be sent through email upon special request for non-commercial purposes.

For all code requests, please send an email to [email protected] with name, affiliation and a description of how the code will be used for your research. Thanks!

Compatible loss function

Hello

To reproduce the results of AutoVC, we used the following loss function implementation in PyTorch:

num_steps = 100000
G = Generator(32,256,512,32).to(device)
...
criterion_recon = nn.MSELoss().to(device)
criterion_recon0 = nn.MSELoss().to(device)
criterion_content = nn.L1Loss().to(device)
optimizer = optim.Adam(G.parameters(), lr=1e-4)
G.train()
...
for i in range(1, num_steps + 1):
    ...
    optimizer.zero_grad()
    X1rt, X1r, C1 = G(X1, S1, S1)
    #with torch.no_grad(): # Should we calculate C1r with grad?
    #    C1r = G(X1r[:, 0, :, :], S1, None)
    C1r = G(X1r[:, 0, :, :], S1, None)
    L_recon = criterion_recon(X1r[:, 0, :, :], X1)
    L_recon0 = criterion_recon0(X1rt[:, 0, :, :], X1)
    L_content = criterion_content(C1r, C1)
    loss = L_recon + 1.0*L_recon0 + 1.0*L_content
    loss.backward()
    optimizer.step()

But we did not achieve comparable voice quality or a loss value around 1e-3, as reported in the issue "whats your final loss and final learning rate?".
One supposed reason is that we used an inappropriate loss function implementation.
Is our loss function implementation compatible with the one you used?

Downsampling process is different from that described in the paper

Thanks for sharing the code; you did a great job.
I noticed that in the paper the downsampling process on the temporal axis is different for the forward sequence and the backward sequence. But the downsampling operation for the forward sequence in the code seems to follow exactly the process described in the paper for the backward sequence. I'm quite confused, because these two processes (the one in the code and the one in the paper) seem to behave differently, in that they encode different contextual information.
Since the code is more up to date, is the downsampling process in the code the better one?

autovc/model_vc.py

Line 79 in 2d8a6c8

    
           codes.append(torch.cat((out_forward[:,i+self.freq-1,:],out_backward[:,i,:]), dim=-1))
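
The indexing in the quoted line can be tried out on toy arrays. What follows is a minimal NumPy sketch, not the repository's code: it assumes the line sits inside a loop `for i in range(0, T, freq)` (the loop is not part of the quote) and that `T` is a multiple of `freq`. The helper name `downsample_codes` is hypothetical.

```python
import numpy as np

def downsample_codes(out_forward, out_backward, freq):
    """Pick one code per block of `freq` frames, as in the quoted line.

    out_forward, out_backward: arrays of shape (batch, T, dim).
    Assumption: the enclosing loop is `for i in range(0, T, freq)`
    and T is divisible by freq.
    """
    T = out_forward.shape[1]
    codes = []
    for i in range(0, T, freq):
        fwd = out_forward[:, i + freq - 1, :]  # last frame of the block
        bwd = out_backward[:, i, :]            # first frame of the block
        codes.append(np.concatenate((fwd, bwd), axis=-1))
    return codes

# Toy input whose values are the frame indices, so the picks are visible.
T, freq = 8, 4
frames = np.arange(T, dtype=float).reshape(1, T, 1)
codes = downsample_codes(frames, frames, freq)
print([c.ravel().tolist() for c in codes])  # [[3.0, 0.0], [7.0, 4.0]]
```

Under these assumptions the forward code is taken at the end of each block and the backward code at its start, so each recurrent direction has already passed over the whole block when its output is sampled.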

Issues with conversion of VCTK speakers using pre-trained model

I am training a model on my own data to make conversions between two speakers with the training code you mailed me. I have tried various approaches to produce good conversions and, although I have gotten some average sounding output (with robotic crackles), the output is not as good as the samples in the audio demo website.
As a part of debugging my training, I am investigating the pre-trained model (autovc.ckpt) to produce conversions between the speakers in the VCTK dataset (The dataset the model was trained on). Although I manage to get relatively good outputs using samples from the speakers in the metadata.pkl file as the target and source speaker/speech, when I add a sample from another speaker (p240 or p260 for example) in the VCTK dataset to achieve a conversion, the quality of the output is poor. Can you give some pointers as to why this could be happening? Is the model trained on the entire VCTK dataset, or only a portion of it?
Here is some more information on how I generate the metadata.pkl file to do the conversions. To generate the speaker embedding of one recording, I use the pre-trained model that comes with the training code ("3000000-BL.ckpt") with len_crop = 128. To generate the spectrogram, I use the code in the make_spect.py file that also comes in the training code and I leave the addition of some random noise (it gave better outputs this way).
Thank you beforehand!

AttributeError: 'numpy.ndarray' object has no attribute 'numpy'

AttributeError Traceback (most recent call last)
in ()
15 c = spect[1]
16 print(name)
---> 17 waveform = wavegen(model, c=c)
18 librosa.output.write_wav(name+'.wav', waveform, sr=16000)

/content/autovc/synthesis.py in wavegen(model, c, tqdm)
46
47 """
---> 48 c = c.numpy()
49
50 model.eval()

AttributeError: 'numpy.ndarray' object has no attribute 'numpy'
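
The traceback indicates that `c` is already a NumPy array when `wavegen` calls `c.numpy()`. A minimal sketch of a conversion helper that accepts either type follows; `to_numpy` is a hypothetical name and not part of the repository, and the tensor check is duck-typed so the snippet runs without PyTorch installed.

```python
import numpy as np

def to_numpy(c):
    """Return `c` as a NumPy array, whether it is an ndarray or a tensor."""
    if isinstance(c, np.ndarray):
        return c  # already an array, nothing to convert
    if hasattr(c, "detach"):
        # Duck-typed torch.Tensor: detach from the graph, move to CPU, convert.
        return c.detach().cpu().numpy()
    return np.asarray(c)

spect = np.zeros((90, 80), dtype=np.float32)  # stand-in for a mel-spectrogram
print(to_numpy(spect).shape)
```

Calling such a helper in place of `c.numpy()`, or simply passing a tensor to `wavegen`, would be two ways to avoid the error.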

Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count

Running conversion.ipynb for the first time, I got the runtime error in the title.
I fixed it by following computationalmedia/semstyle#3,
adding map_location='cuda:0' to the torch.load call.
Not sure why it works, though.
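
A likely explanation: by default `torch.load` tries to restore each tensor to the device it was saved from, so a checkpoint saved on `cuda:1` fails on a machine that exposes only one GPU, and `map_location` remaps the storages at load time. A minimal sketch, with `choose_map_location` and `load_checkpoint` as hypothetical helper names:

```python
def choose_map_location(num_visible_gpus: int) -> str:
    """Pick a device string that is guaranteed to exist on this machine."""
    return "cuda:0" if num_visible_gpus > 0 else "cpu"

def load_checkpoint(path):
    """Load a checkpoint onto the first GPU, or the CPU if none is visible."""
    import torch  # imported lazily so the helper above has no dependencies
    location = choose_map_location(torch.cuda.device_count())
    return torch.load(path, map_location=location)

print(choose_map_location(0))  # cpu
print(choose_map_location(1))  # cuda:0
```

Falling back to `"cpu"` also lets the notebook run on a machine without any GPU.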

For those of you who need pre-trained speaker embedding models, Here it is.

https://github.com/resemble-ai/Resemblyzer

How to use this project on another dataset?

Hi @auspicious3000,
thank you for sharing this work.
I have some questions; could you please help me with answers?

I have my own dataset and want to use it with this work. How can I process the data and train on it? After that, how can I build my own model to get my own wav audio results? Are the pre-trained models provided via the links general enough to be used with any data?
Please guide me through the steps to get my own results using your work.
Thanks.

segment size

Hi,

What number of time steps (frames) of mels did you use for training?

thanks.

about training

Hi! Thanks for sharing this awesome code!

I listened to your sample sounds and ran the .ipynb files,

and I want to train autovc with my own dataset, but there is no training code...

Is there any plan to share the training code?

kernel died

Each time I try conversion, it states:
The kernel appears to have died. It will restart automatically.
on Ubuntu 19.10, Firefox
Any idea?

What is the format of the metadata?

What is the format of the metadata?
I want to try another audio file.
I checked the data inside, but I don't know what the second item is.
The first one is the name, and the third is the mel-spectrogram.

Also, does this apply to Chinese audio, or do I need to retrain the model on Chinese data?
Thanks!
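
One way to find out what each field holds is to print its type and shape. The sketch below does this on a stand-in object rather than the real metadata.pkl. The layout `[name, second item, mel-spectrogram]` follows the description above; another issue on this page refers to "the speaker embedding in your metadata.pkl", so the second item is presumably that embedding. The sizes 256 and 80 are illustrative assumptions, and `describe_entry` is a hypothetical helper.

```python
import io
import pickle
import numpy as np

def describe_entry(entry):
    """Summarise each field of one entry as (type name, shape or value)."""
    summary = []
    for field in entry:
        if isinstance(field, np.ndarray):
            summary.append(("ndarray", field.shape))
        else:
            summary.append((type(field).__name__, field))
    return summary

# Stand-in for metadata.pkl: one entry with a name, a vector and a 2-D array.
fake_metadata = [["p225",
                  np.zeros(256, dtype=np.float32),
                  np.zeros((90, 80), dtype=np.float32)]]
buffer = io.BytesIO(pickle.dumps(fake_metadata))

metadata = pickle.load(buffer)
for entry in metadata:
    print(describe_entry(entry))
```

For the real file, `buffer` would be replaced by `open('metadata.pkl', 'rb')`.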

confusion with speaker encoder and loss func

Thanks for this code.
I didn't find any implementation of the speaker encoder in the demo.
Is it not included in this demo?

Also, in the loss function,

I can't figure out the difference between L_recon and L_recon0.

Thanks a lot for any guidance.

2000 epochs needed to train?

I see you set epoches=2000 in hparams.py. I want to know how many epochs are needed for training and how long it will take, thanks.

AutoVC on a large scale data?

Hi @auspicious3000, thanks for sharing your research code.
I've spent a lot of time making the training code work (mostly due to input hyperparameter issues that others are also struggling with).
I'm currently working on the VoxCeleb2 dataset (nearly 6000 speakers, with 1M utterances).
However, I cannot get it to train with MSE loss; with L1 loss, I manage to get the following auto-encoding reconstruction.

[Original]

[Voice converted with another speaker embedding]

The problem is that while the network learns auto-encoding, at test time it does not generalize to voice conversion. It just does auto-encoding and nothing else.
The above pair of examples are voice conversion examples, where the fundamental frequencies in both mel-spectrograms look very similar.

Could you share your experience or any comments? I'd appreciate it.

If the speaker embedding is not added to the encoder input, will it affect model performance?

If the speaker embedding is not added to the encoder input, will it affect model performance? Do you have any relevant experiments? Thanks!

Is the speech in each batch cropped to a fixed length of time during training?

For example, do you crop the speech to 2 seconds, or do you keep the original speech length during training? Does this affect the performance of the model? Because I find it very much affects the speed of training, so I'd like to know the answer. Thank you :D
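
For reference, fixed-length cropping of a mel-spectrogram can be sketched as below. This is not the authors' data loader: the default of 128 frames is taken from the `len_crop = 128` mentioned in another issue on this page, zero-padding of short utterances is an assumption, and `random_crop` is a hypothetical helper.

```python
import numpy as np

def random_crop(mel, len_crop=128, rng=None):
    """Return a (len_crop, n_mels) slice of `mel`, zero-padding short inputs."""
    if rng is None:
        rng = np.random.default_rng()
    num_frames = mel.shape[0]
    if num_frames < len_crop:
        # Assumption: short utterances are padded with zeros at the end.
        pad = len_crop - num_frames
        return np.pad(mel, ((0, pad), (0, 0)), mode="constant")
    # Random start so that the crop fits entirely inside the utterance.
    start = rng.integers(0, num_frames - len_crop + 1)
    return mel[start:start + len_crop]

long_mel = np.ones((300, 80), dtype=np.float32)
short_mel = np.ones((50, 80), dtype=np.float32)
print(random_crop(long_mel).shape, random_crop(short_mel).shape)
```

If the hop length is 256 samples at 16 kHz, as in the hyperparameters posted in another issue here, 128 frames correspond to 128 * 256 / 16000 = 2.048 seconds.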

Follow-up work available for viewing

We have further improved AutoVC in 2 subsequent works.

The 1st work improves the audio quality by removing any pitch artifacts.

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
https://arxiv.org/abs/2004.07370

The 2nd work can convert rhythm, pitch, and/or timbre at the same time.

Unsupervised Speech Decomposition via Triple Information Bottleneck
https://arxiv.org/abs/2004.11284

About decoder mismatch

Hi, I have a question about the decoder mismatch between code and the paper.

In the paper, 3 LSTM layers follow 3 convnorm layers.
However, in the code there is 1 LSTM layer, followed by 3 convnorm layers, followed by 2 more LSTM layers.

Is this mismatch intended? I am wondering which implementation is correct.
Thanks.

whats your final loss and final learning rate?

does it work? didn't even try testing

How to generate mel spectrogram

With the same wavenet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel-spectrogram in the provided metadata.pkl is much better than the one generated from my own mel-spectrogram. Is there any trick to generating a proper mel-spectrogram?

Hyperparameters for generating mel spectrogram from training .wav files

Could you please tell us how you generated mel spectrograms for training from .wav files? What were the parameters used?

Is it necessary to pretrain the speaker encoder using GE2E loss on VCTK?

@auspicious3000 Can I train the model directly on VCTK without pretraining the speaker encoder using GE2E loss?

Why does the quality on the demo page differ from that of your new paper?

I listened to all the demo wavs on these pages.

this is the new paper: https://auspicious3000.github.io/icassp_2020_demo/
this paper: https://auspicious3000.github.io/autovc-demo/

Why do the original speaker embeddings need to be concatenated with the original speaker's spectrogram?

Theoretically, the original speaker's embedding information is already contained in the spectrogram. The network will automatically squeeze the original speaker information out after convergence. Why is the original speaker embedding still needed?

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing normalized quantized log-F0 seems a bit confusing, as there is more than one way to compute the mean µ and std σ.

A sample's pitch vector is a 1d array whose size is n, where n is the number of frames (which seems to be fixed at 128 according to #6 (comment)). So there are three ways of computing µ and σ:

Suppose f0 is extracted from a sample audio of speaker A.

1. Compute µ and σ of each individual sample on the fly (f0_norm = (f0 - f0.mean()) / f0.std() / 4).
2. Compute µ and σ for each speaker (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4 where f0s is an A x 128 array with A being the total number of samples of speaker A).
3. Compute universal µ and σ of every sample (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4 where f0s is an N x 128 array with N being the total number of all samples - that is, A < N).

And assuming the answer is 2 or 3, for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere safe so I can reuse those values for inference? (I guess option 2 doesn't really make sense, since you can't compute those for unseen speakers.)
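
The three options differ only in which array the statistics are taken over. The sketch below spells them out on synthetic data using the formula quoted in the post. It does not indicate which option the paper uses, it ignores unvoiced frames, and the speaker names and array sizes are made up for illustration.

```python
import numpy as np

def normalize_f0(f0, mu, sigma):
    """Apply (f0 - mu) / sigma / 4, the formula quoted in the post."""
    return (f0 - mu) / sigma / 4.0

rng = np.random.default_rng(0)
# Hypothetical log-F0 tracks: speaker id -> array of shape (num_samples, 128).
f0_by_speaker = {
    "A": rng.normal(5.0, 0.2, size=(10, 128)),
    "B": rng.normal(4.5, 0.3, size=(12, 128)),
}
f0 = f0_by_speaker["A"][0]

# Option 1: statistics of the single sample.
per_sample = normalize_f0(f0, f0.mean(), f0.std())

# Option 2: statistics over all samples of the same speaker.
f0s_a = f0_by_speaker["A"]
per_speaker = normalize_f0(f0, f0s_a.mean(), f0s_a.std())

# Option 3: statistics over every sample of every speaker.
f0s_all = np.concatenate(list(f0_by_speaker.values()), axis=0)
universal = normalize_f0(f0, f0s_all.mean(), f0s_all.std())

print(per_sample.std(), per_speaker.mean(), universal.mean())
```

With option 1 every sample ends up with zero mean and a standard deviation of 0.25 regardless of the speaker, whereas options 2 and 3 require µ and σ to be stored for reuse at inference time.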

Can you please release the code + weights for the speaker embedding network?

What are the full preprocessing steps?

I'm retraining the model using my own data but my output is all noise. I'm suspecting that I'm having an issue with the way I'm generating the mel-spectrograms. I'm generating them using librosa and inverting the output of the model back to raw audio using librosa too.

Here are the functions I'm using to generate mel-spectrogram from raw audio:

def normalize(S):
    return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)

def denormalize(S):
    return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    return np.power(10.0, x * 0.05)

def melspectrogram(y):
    S = librosa.feature.melspectrogram(y=y, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, n_mels=hp.n_mels, fmin=hp.fmin, fmax=hp.fmax, power = hp.power)
    S = amp_to_db(S)
    S = normalize(S)
    return S

def inverse_melspectrogram(M):
    M = denormalize(M)
    M = db_to_amp(M)
    y = librosa.feature.inverse.mel_to_audio(M=M, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, power =hp.power)
    return y

Here are the hyperparameters I'm using:

sr=16000  
n_mels=80   
fmin=90  
fmax=7600  
fft_size=1024  
hop_length =256  
min_level_db=-100  
ref_level_db=20  
PAD_VALUE = -100000  
BATCH_SIZE = 32  
MAX_FRAMES = 1024  
power = 1.0

Could you tell me if there is an issue with my preprocessing steps? If you need any more info, please ask.

Thanks
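
One thing that can be checked in isolation is whether the normalization round-trips. The sketch below copies the four helper functions from the post, with `hp.min_level_db` replaced by the listed value of -100, and pushes a few amplitudes through them. Whether this explains the noisy output is not established here; it only shows how the helpers behave as written.

```python
import numpy as np

MIN_LEVEL_DB = -100  # value listed in the post's hyperparameters

def normalize(S):
    return np.clip((S - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)

def denormalize(S):
    return (np.clip(S, 0, 1) * -MIN_LEVEL_DB) + MIN_LEVEL_DB

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    return np.power(10.0, x * 0.05)

amps = np.array([1e-5, 0.1, 1.0, 10.0])
round_trip = db_to_amp(denormalize(normalize(amp_to_db(amps))))
print(round_trip)  # the last value comes back as 1.0, not 10.0
```

Amplitudes above 1.0 map to positive dB values, which `normalize` clips to 1, so they are not recovered by `denormalize`. The listed `ref_level_db` is not used by any of the posted functions.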

If the speaker embedding is a one-hot code, what's the difference between your work and previous VAE-based voice conversion?

The complete training code may be sent through email upon special request for non-commercial purposes.

Please send an email to [email protected]

Dataset Size for Training

I have seen in #27 that 300k steps is necessary for training. Do the original results use the entirety of the VCTK dataset with 9/10 train, 1/10 test as in the paper, or do they use only a subset of the VCTK dataset?

Also, just to clarify, a step or iteration is one batch being passed through the network, correct? i.e. If the dataset is of size 10000, then you get 5000 steps (or iterations) out of it on a batch size of 2 in one epoch.
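
The arithmetic in the question can be written out as follows, under the usual convention that one step is one batch and that a final partial batch is kept. The dataset size of 10000 is the post's own hypothetical figure, and both function names are illustrative.

```python
import math

def steps_per_epoch(dataset_size, batch_size):
    """Number of batches needed to see every item once."""
    return math.ceil(dataset_size / batch_size)

def epochs_for_steps(total_steps, dataset_size, batch_size):
    """Number of epochs needed to reach `total_steps` optimizer steps."""
    return math.ceil(total_steps / steps_per_epoch(dataset_size, batch_size))

print(steps_per_epoch(10000, 2))            # 5000
print(epochs_for_steps(300_000, 10000, 2))  # 60
```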

Making zero-shot model

Thank you for sharing your work.

When making a zero-shot model, should I train the speaker embedder as well as the conversion model on a large dataset (VCTK)?
Or is it OK to train only the conversion model on VCTK?

Demos don't work

Quite a few of the demo samples on this page don't seem to work.

Using Firefox 60.0

Lots of console errors saying Media resource https://auspicious3000.github.io/autovc-demo/audios/89_01/p261_p225_1000000.wav could not be decoded.

How do you generate the speaker embedding?

I am wondering how you extract the speaker embedding with the pre-trained verification model.

The speaker embedding I get from https://github.com/resemble-ai/Resemblyzer is a vector with only non-negative values, mostly zeros, due to the ReLU at the end of the model. However, the speaker embedding in your metadata.pkl has positive and negative values and looks normally distributed.

Could you give me some advice on how you extracted the exact embedding in your work? I tried skipping the ReLU layer and the L2 normalization of the vector, but the result is still not similar to yours. Hope to receive your response! Thanks!

By the way, your paper says that you used a speaker encoder with a stack of "2" LSTM layers with cell size "768". But the model in https://github.com/resemble-ai/Resemblyzer uses "3" LSTM layers with cell size "256". I am confused about whether you use the same speaker encoder model.

What is the difference between L_recon and L_recon0 ?

About L = L_recon + μ * L_recon0 + λ * L_content:
The paper is not very clear. How can I get the loss L_recon0?
Are L_recon and L_recon0 related to the loss before and after the residual module?
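
Judging only from the re-implementations posted in other issues on this page, L_recon is computed on the output after the postnet and L_recon0 on the output before it, both as mean squared error, while L_content is an L1 distance between content codes. A minimal NumPy sketch of the combined loss under that reading follows; the function and argument names are hypothetical and this is not confirmed as the authors' implementation.

```python
import numpy as np

def autovc_style_loss(mel_target, mel_before_postnet, mel_after_postnet,
                      codes_input, codes_reconstructed, mu=1.0, lam=1.0):
    """Return (total, (L_recon, L_recon0, L_content)) for one batch."""
    l_recon = np.mean((mel_after_postnet - mel_target) ** 2)    # after postnet
    l_recon0 = np.mean((mel_before_postnet - mel_target) ** 2)  # before postnet
    l_content = np.mean(np.abs(codes_reconstructed - codes_input))
    total = l_recon + mu * l_recon0 + lam * l_content
    return total, (l_recon, l_recon0, l_content)

# Toy tensors with known errors: 2.0 after the postnet, 1.0 before it.
target = np.zeros((2, 128, 80))
total, parts = autovc_style_loss(
    mel_target=target,
    mel_before_postnet=target + 1.0,
    mel_after_postnet=target + 2.0,
    codes_input=np.zeros((2, 64)),
    codes_reconstructed=np.full((2, 64), 0.5),
)
print(total, parts)  # 5.5 (4.0, 1.0, 0.5)
```

The weights `mu` and `lam` default to 1.0 here because both posted re-implementations use 1.0 for them.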

Re-training result is not good enough; can you share some advice about hyperparameters?

This is the result of re-training autovc: the first row is the original mel, the second is the reconstruction result, and the third is the transfer result. I used a one-hot condition during re-training and the result is not good. Can you share some advice? I use the three losses from the paper and default Adam. Thanks!

How to ensure that the output of the encoder is independent of the speaker?

How do you ensure that the output of the encoder is independent of the speaker?
I can't see any confusion network or generative adversarial training in this paper.
I don't understand how it works.

Why the batch size is set to 2?

Is it because of memory limitations, or is it better than the result of a larger batch size (such as 16, 32, 64)?

pickle error?

Hi, I'm using python 3.7.3 with numpy 1.16.4 and am experiencing the following error when loading the metadata:

metadata = pickle.load(open('metadata.pkl', "rb"))

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_pickle.UnpicklingError: invalid load key, '\x0a'.

I guess this might be an encoding mismatch ('ASCII' by default) or a numpy version mismatch.
Would it be possible for you to share a npy/npz version of the metadata.pkl? Thanks!
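
The load key `'\x0a'` is a newline character, which suggests the file begins with text rather than pickle data, for example a saved web page or pointer file instead of the actual download. That is a guess, but it is cheap to check: pickles written with protocol 2 or later start with the byte `0x80`. A minimal sketch, with `looks_like_binary_pickle` as a hypothetical helper (older protocol 0/1 pickles would not pass this check):

```python
import os
import pickle
import tempfile

def looks_like_binary_pickle(path):
    """True if the file starts with 0x80, the PROTO opcode of protocol 2+."""
    with open(path, "rb") as f:
        return f.read(1) == b"\x80"

def write_temp(data):
    """Write bytes to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile(delete=False)
    handle.write(data)
    handle.close()
    return handle.name

good_path = write_temp(pickle.dumps(["p225", [0.1, 0.2]]))
bad_path = write_temp(b"\n<!DOCTYPE html>")  # starts with '\x0a'

print(looks_like_binary_pickle(good_path), looks_like_binary_pickle(bad_path))

os.remove(good_path)
os.remove(bad_path)
```

If the check fails on metadata.pkl, re-downloading the file is probably a better first step than changing the numpy version.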

Re-training steps?

Hello @auspicious3000,
I want to re-train this model using an Urdu dataset. Can you guide me through the steps I should follow?
Please list the file names I need to execute, in sequence, so that I can reproduce the results on my own dataset.