
auspicious3000 / autovc

966 stars · 29 watchers · 205 forks · 8.18 MB

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

Home Page: https://arxiv.org/abs/1905.05879

License: MIT License

Languages: Jupyter Notebook 11.47%, Python 88.53%
Topics: voice-conversion, speech-synthesis, generative-models, tacotron-pytorch, wavenet-vocoder, unsupervised-learning

autovc's People

Contributors

auspicious3000, barbany, lisabecker


autovc's Issues

Tools to split VCTK audio

From your demo, it seems that some tools were used to split the original VCTK audio. Could you please share those tools?

Bad conversion quality after retraining

Hi,
first of all thanks for the great work on the AutoVC system.
I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained system.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4
and have trained the system with the same 20 VCTK speakers used in the paper's experiment (additionally with 8 speakers from the VCC dataset, though results were similar when they were omitted).
I also used one-hot encodings instead of speaker embeddings from an encoder.

I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the train loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I've also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim.
Here are the mel-spectrograms from my retrained model and from the model in the repo:

[Mel-spectrograms: retrained model (p270-p228-own) vs. supplied model (p270-p228-paper)]

Here is a minimal example of the loss and training loop I use.
I can also provide more of my code if needed.

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()

    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1) # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns the content codes via self.encoder without running the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss

    total_loss.backward()
    optimizer.step()


# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spectrogram slices and the index of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optimizer,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET], # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS]) # == 1.0
        # The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.

Does anyone have an idea what is wrong with my re-implementation, or has anyone managed to reimplement the system with good quality?

Thanks a lot in advance.

About the batch size

The paper says the batch size is 2 for 100k steps. Would that be too small for batch normalization? Are you sure the batch size is 2 in your training?

Question concerning loss function in the paper

Hello,
I'd like to ask about the loss function mentioned in Section 3.2, under "Training", Equation (6), where the loss function is defined. I have two questions about the two equations (my current reading is sketched below):

  1. Do the \mathbb{E} brackets denote expected values? If so, how are they calculated? (There is only one Xhat1 and X1 per iteration.)
  2. What do the indices outside of the absolute value bars mean? I take the bars to mean the L2 norm of the pointwise differences of the tensors, and the 2 on top of the definition of L_recon to mean a power (that is, the norm squared).
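
For context, here is my current reading of Equation (6) (my own reconstruction from the paper's definitions, which may be wrong):

L = L_recon + \mu L_recon0 + \lambda L_content
L_recon   = \mathbb{E}[ \| \hat{X}_{1->1} - X_1 \|_2^2 ]    (postnet output)
L_recon0  = \mathbb{E}[ \| \tilde{X}_{1->1} - X_1 \|_2^2 ]   (output before the postnet)
L_content = \mathbb{E}[ \| E_C(\hat{X}_{1->1}) - C_{1->} \|_1 ]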

Thank you for your time!

The output encoder

It seems that the output encoder should be an extra module with the same structure and the same weights as the input encoder. But it is very difficult to get convergence in my training. Correct me if I am wrong.

Training is too slow

I used a training set of 400 speakers, each with 350 sentences. Training is very slow.
Is that too big?
What training set did you use?
Can you tell me?

Compatible loss function

Hello

To reproduce the results of AutoVC, we used the following loss function implementation in PyTorch:

num_steps = 100000
G = Generator(32,256,512,32).to(device)
...
criterion_recon = nn.MSELoss().to(device)
criterion_recon0 = nn.MSELoss().to(device)
criterion_content = nn.L1Loss().to(device)
optimizer = optim.Adam(G.parameters(), lr=1e-4)
G.train()
...
for i in range(1, num_steps + 1):
    ...
    optimizer.zero_grad()
    X1rt, X1r, C1 = G(X1, S1, S1)
    #with torch.no_grad(): # Should we calculate C1r with grad?
    #    C1r = G(X1r[:, 0, :, :], S1, None)
    C1r = G(X1r[:, 0, :, :], S1, None)
    L_recon = criterion_recon(X1r[:, 0, :, :], X1)
    L_recon0 = criterion_recon0(X1rt[:, 0, :, :], X1)
    L_content = criterion_content(C1r, C1)
    loss = L_recon + 1.0*L_recon0 + 1.0*L_content
    loss.backward()
    optimizer.step()

But we did not achieve comparable voice quality or a loss value around 1e-3, as reported in the issue "whats your final loss and final learning rate?".
One possible reason is that we used an inappropriate loss function implementation.
Is our loss function implementation compatible with the one you used?

BR

Downsampling process is different from that described in the paper

Thanks for sharing the code; you did a great job.
I noticed that in the paper the downsampling process on the temporal axis is different for the forward sequence and the backward sequence. But it seems that the downsampling operation for the forward sequence in the code follows exactly the process described in the paper for the backward sequence. I'm quite confused, because these two processes (the one in the code and the one in the paper) seem to behave differently, in that they encode different contextual information.
Since the code is more up to date, is the downsampling process in the code better?
[attached screenshot]

codes.append(torch.cat((out_forward[:,i+self.freq-1,:],out_backward[:,i,:]), dim=-1))
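
For reference, a minimal sketch (not the repository's code) contrasting the quoted indexing with its mirror image; the shapes and the value of freq are assumptions for illustration only:

import torch

freq, T, dim = 32, 128, 32
out_forward = torch.randn(1, T, dim)   # forward-direction outputs of the BLSTM
out_backward = torch.randn(1, T, dim)  # backward-direction outputs of the BLSTM

# Indexing as in the quoted line: the last frame of each block from the
# forward pass, the first frame of each block from the backward pass.
codes_code = [torch.cat((out_forward[:, i + freq - 1, :], out_backward[:, i, :]), dim=-1)
              for i in range(0, T, freq)]

# The mirrored scheme: the first frame of each block from the forward pass,
# the last frame of each block from the backward pass.
codes_mirrored = [torch.cat((out_forward[:, i, :], out_backward[:, i + freq - 1, :]), dim=-1)
                  for i in range(0, T, freq)]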

Issues with conversion of VCTK speakers using pre-trained model

I am training a model on my own data to make conversions between two speakers with the training code you mailed me. I have tried various approaches to produce good conversions and, although I have gotten some average-sounding output (with robotic crackles), the output is not as good as the samples on the audio demo website.
As part of debugging my training, I am investigating the pre-trained model (autovc.ckpt) to produce conversions between the speakers in the VCTK dataset (the dataset the model was trained on). Although I manage to get relatively good outputs using samples from the speakers in the metadata.pkl file as the target and source speaker/speech, when I add a sample from another speaker in the VCTK dataset (p240 or p260, for example) to perform a conversion, the quality of the output is poor. Can you give some pointers as to why this could be happening? Is the model trained on the entire VCTK dataset, or only a portion of it?
Here is some more information on how I generate the metadata.pkl file to do the conversions. To generate the speaker embedding of one recording, I use the pre-trained model that comes with the training code ("3000000-BL.ckpt") with len_crop = 128. To generate the spectrogram, I use the code in the make_spect.py file that also comes with the training code, and I keep the addition of some random noise (it gave better outputs this way).
Thank you beforehand!

AttributeError: 'numpy.ndarray' object has no attribute 'numpy'

AttributeError Traceback (most recent call last)
in ()
15 c = spect[1]
16 print(name)
---> 17 waveform = wavegen(model, c=c)
18 librosa.output.write_wav(name+'.wav', waveform, sr=16000)

/content/autovc/synthesis.py in wavegen(model, c, tqdm)
46
47 """
---> 48 c = c.numpy()
49
50 model.eval()

AttributeError: 'numpy.ndarray' object has no attribute 'numpy'
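
A hedged workaround sketch (not an official fix): wavegen() in synthesis.py calls c.numpy(), which assumes a torch tensor, so wrapping the NumPy spectrogram in a tensor before the call avoids the error. Alternatively, synthesis.py could be edited to skip the conversion when c is already an ndarray.

import torch

# Workaround sketch: wrap the NumPy spectrogram in a tensor so the
# c.numpy() call inside wavegen() succeeds.
c = spect[1]
if not torch.is_tensor(c):
    c = torch.from_numpy(c)
waveform = wavegen(model, c=c)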

How to use this project on another dataset?

Hi @auspicious3000,
thank you for sharing this work.
I have some questions; could you please help me with answers?

  1. I have my own dataset and I want to use it with this work. How can I process the data and train on it? After that, how can I build my own model and get my wav audio results? Are the pre-trained models provided in the links general enough to use with any data?
    Please guide me through the steps to get my own results using your work.
    Thanks.

segment size

Hi,

What number of time steps (frames) of mels did you use for training?

thanks.

about training

Hi! Thanks for sharing this awesome code!

I listened to your sample sounds and ran the .ipynb files,

and I want to train AutoVC with my own dataset, but there is no training code...

Is there any plan to share the training code?

kernel died

Each time I try a conversion, it states:
The kernel appears to have died. It will restart automatically.
on Ubuntu 19.10, Firefox
Any idea?

What is the format of the metadata?

What is the format of the metadata?
I want to try other audio files.
I checked the data inside, but I don't know what the second element is.
The first element is the name, and the third is the mel-spectrogram.
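
For reference, a quick inspection sketch (assuming, but not having confirmed, that the second element is the speaker embedding):

import pickle
import numpy as np

# Inspection sketch: print each entry's name and the shapes of the other
# elements. Assumption (unconfirmed): the second element is the speaker embedding.
metadata = pickle.load(open('metadata.pkl', 'rb'))
for entry in metadata:
    name, second, mel = entry[0], entry[1], entry[2]
    print(name, np.asarray(second).shape, np.asarray(mel).shape)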

And does this apply to Chinese audio? Or do I need to retrain the model using Chinese data?
thanks!

confusion with speaker encoder and loss func

Thanks for this code.
I didn't find any implementation of the speaker encoder in the demo.
Is it not included in this demo?

And in the loss function [attached screenshot],
I can't figure out the difference between L_recon and L_recon0.

Thanks a lot for any guidance.

2000 epochs needed to train?

I see you set epochs=2000 in hparams.py. I want to know how many epochs are needed to train and how long it will take. Thanks.

AutoVC on a large scale data?

Hi @auspicious3000, thanks for sharing your research code.
I've spent a lot of time getting the training code to work (mostly due to input hyperparameter issues, which others are also struggling with).
I'm currently working on the VoxCeleb2 dataset (nearly 6,000 speakers, with 1M utterances).
However, I cannot make it train with the MSE loss; with an L1 loss, I manage to get the following auto-encoding reconstruction.

[Mel-spectrogram: original]
[Mel-spectrogram: voice converted with another speaker embedding]

The problem is that while the network learns auto-encoding, at test time it does not generalize to voice conversion. It just does auto-encoding, nothing else.
The pair of examples above is a voice conversion example, where the fundamental frequency contours of both mel-spectrograms look very similar.

Could you share your experience or any comments? I'd appreciate it.

About decoder mismatch

Hi, I have a question about a decoder mismatch between the code and the paper.

In the paper, the decoder uses 3 LSTM layers after 3 ConvNorm layers.
However, in the code, it uses 1 LSTM layer first, followed by 3 ConvNorm layers, and then 2 LSTM layers.
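
For reference, a minimal sketch of the ordering I see in the code (1 LSTM, then 3 convolution blocks, then 2 LSTM layers); the dimensions below are assumptions for illustration, not the repository's exact hyperparameters:

import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, dim_in=320, dim_pre=512, dim_out=80):
        super().__init__()
        self.lstm1 = nn.LSTM(dim_in, dim_pre, num_layers=1, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim_pre, dim_pre, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim_pre),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        self.lstm2 = nn.LSTM(dim_pre, 1024, num_layers=2, batch_first=True)
        self.proj = nn.Linear(1024, dim_out)

    def forward(self, x):            # x: (batch, frames, dim_in)
        x, _ = self.lstm1(x)
        x = x.transpose(1, 2)        # (batch, dim_pre, frames) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)
        x, _ = self.lstm2(x)
        return self.proj(x)          # (batch, frames, dim_out)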

Is this mismatch intended? I am wondering which implementation is correct.
Thanks.

How to generate mel spectrogram

With the same WaveNet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel-spectrogram in the provided metadata.pkl is much better than the one generated from my own mel-spectrogram. Are there any tricks for generating a proper mel-spectrogram?

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing the normalized quantized log-F0 seems a bit confusing, as there is more than one way you could compute the mean µ and std σ.

A sample's pitch vector is a 1d array whose size is n, where n is the number of frames (which seems to be fixed at 128 according to #6 (comment)). So there are three ways of computing µ and σ:

Suppose f0 is extracted from a sample audio of speaker A.

  1. Compute µ and σ of each individual sample on the fly (f0_norm = (f0 - f0.mean()) / f0.std() / 4).
  2. Compute µ and σ for each speaker (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4 where f0s is an A x 128 array with A being the total number of samples of speaker A).
  3. Compute universal µ and σ of every sample (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4 where f0s is an N x 128 array with N being the total number of all samples - that is, A < N).

And assuming the answer is 2 or 3, for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere safe so I can reuse those values at inference time? (I guess option 2 doesn't really make sense, since you can't compute those for unseen speakers.)
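
A minimal sketch of the three options (mirroring the formulas above; the names and the /4 scaling come from the issue text, not a verified reference implementation), including the stored statistics needed for options 2 and 3:

import numpy as np

# Option 1: per-utterance statistics, computed on the fly.
def normalize_per_sample(f0):
    return (f0 - f0.mean()) / f0.std() / 4

# Options 2 and 3: statistics computed once over a collection of pitch vectors
# (all samples of one speaker, or of all speakers) and stored so the same
# values can be reused at inference time.
def compute_stats(f0_collection):
    f0s = np.concatenate([np.ravel(f0) for f0 in f0_collection])
    return f0s.mean(), f0s.std()

def normalize_with_stats(f0, mu, sigma):
    return (f0 - mu) / sigma / 4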

What are the full preprocessing steps?

I'm retraining the model using my own data, but my output is all noise. I suspect there is an issue with the way I'm generating the mel-spectrograms. I generate them using librosa and invert the model's output back to raw audio using librosa as well.

Here are the functions I'm using to generate mel-spectrograms from raw audio (hp holds the hyperparameters listed below):

import librosa
import numpy as np

def normalize(S):
    return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)

def denormalize(S):
    return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    return np.power(10.0, x * 0.05)

def melspectrogram(y):
    S = librosa.feature.melspectrogram(y=y, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, n_mels=hp.n_mels, fmin=hp.fmin, fmax=hp.fmax, power = hp.power)
    S = amp_to_db(S)
    S = normalize(S)
    return S

def inverse_melspectrogram(M):
    M = denormalize(M)
    M = db_to_amp(M)
    y = librosa.feature.inverse.mel_to_audio(M=M, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, power =hp.power)
    return y

Here are the hyperparameters I'm using:

sr=16000  
n_mels=80   
fmin=90  
fmax=7600  
fft_size=1024  
hop_length =256  
min_level_db=-100  
ref_level_db=20  
PAD_VALUE = -100000  
BATCH_SIZE = 32  
MAX_FRAMES = 1024  
power = 1.0  
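
For completeness, this is how I apply the functions end to end (a sketch; 'sample.wav' is a hypothetical path):

# End-to-end usage sketch of the functions above ('sample.wav' is hypothetical).
y, _ = librosa.load('sample.wav', sr=hp.sr)
S = melspectrogram(y)              # normalized mel-spectrogram, shape (n_mels, num_frames)
y_rec = inverse_melspectrogram(S)  # approximate reconstruction via librosa's mel_to_audio (Griffin-Lim)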

Could you tell me if there is an issue with my preprocessing steps? If you need any more info, please ask.

Thanks

Dataset Size for Training

I have seen in #27 that 300k steps are necessary for training. Do the original results use the entirety of the VCTK dataset with a 9/10 train, 1/10 test split as in the paper, or do they use only a subset of the VCTK dataset?

Also, just to clarify, a step or iteration is one batch being passed through the network, correct? I.e., if the dataset is of size 10000, then you get 5000 steps (or iterations) out of it with a batch size of 2 in one epoch.

Making zero-shot model

Thank you for sharing your work.

When making a zero-shot model, should I train the speaker embedder as well as the conversion model on a large dataset (VCTK)?
Or is it OK to only train the conversion model on VCTK?

Demos don't work

Quite a few of the demo samples on this page don't seem to work.

Using Firefox 60.0

Lots of console errors saying Media resource https://auspicious3000.github.io/autovc-demo/audios/89_01/p261_p225_1000000.wav could not be decoded.

How do you generate the speaker embedding?

I am wondering how you extracted the speaker embedding with the pre-trained verification model.

The speaker embedding I get from https://github.com/resemble-ai/Resemblyzer is a vector with all positive values and mostly zeros, due to the ReLU at the end of the model. However, the speaker embeddings in your metadata.pkl have both positive and negative values and look normally distributed.

Could you give me some advice on how you extracted the embeddings in your work? I tried skipping the ReLU layer and the L2-normalization of the vector, but it is still not similar to your results. Hope to receive your response! Thanks!
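
For reference, this is how I obtain the Resemblyzer embedding I am comparing against (a sketch; 'sample.wav' is a hypothetical path):

from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

# Sketch of the embedding extraction being compared ('sample.wav' is hypothetical).
wav = preprocess_wav(Path('sample.wav'))
embed = VoiceEncoder().embed_utterance(wav)
# The final ReLU makes this embedding non-negative and sparse:
print(embed.shape, (embed == 0).mean(), embed.min(), embed.max())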

By the way, your paper says that you used a speaker encoder with a stack of 2 LSTM layers with cell size 768, but the model in https://github.com/resemble-ai/Resemblyzer uses 3 LSTM layers with cell size 256. I was confused about whether you used the same speaker encoder model.

pickle error?

Hi, I'm using Python 3.7.3 with numpy 1.16.4 and am experiencing the following error when loading the metadata:

metadata = pickle.load(open('metadata.pkl', "rb"))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.UnpicklingError: invalid load key, '\x0a'.
I guess this might be an encoding mismatch ('ASCII' by default) or a numpy version mismatch.
Would it be possible for you to share an npy/npz version of the metadata.pkl? Thanks!
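
A quick way to check (a sketch): an "invalid load key" error often means the file is not actually a pickle, for example because a Git LFS pointer or an HTML error page was downloaded instead of the real file. The first bytes make this easy to spot.

# Diagnostic sketch: a real pickle usually starts with b'\x80' (protocol 2+),
# while a Git LFS pointer starts with b'version https://git-lfs'.
with open('metadata.pkl', 'rb') as f:
    print(f.read(64))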

Re-training steps?

Hello @auspicious3000,
I want to re-train this model using an Urdu dataset. Can you guide me through the steps that I should follow?
Please list the file names that I need to execute, in order, so that I can reproduce the results on my own dataset.
