auspicious3000 / autovc
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Home Page: https://arxiv.org/abs/1905.05879
License: MIT License
Thanks for your great job. I tried to download the models from google drive through vpn, but it was so slow and might take hours. Can you share a copy of the models in pan.baidu.com ? Thanks a lot.
From your demo, it seems that some tools were used to split the original VCTK audio. Can you please share the tools?
Hi,
first of all thanks for the great work on the AutoVC system.
I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained system.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4
and have trained the system with the same 20 VCTK speakers as in the experiment of the paper (additionally with 8 speakers from the VCC data set; however, results were similar when they were omitted).
I additionally used one-hot encodings instead of speaker embeddings of an encoder.
I trained for about 300.000 steps using Adam with default parameters and a LR of 0.0001, the train loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I've also tried out other learning rates (0.001, 0.0005) with no improvement of quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
In comparison, the converted mel-spectrograms of the supplied autovc model are much sharper and produce more natural voice, even when used with Griffin-Lim.
Here are the mel-spectrograms of my retrained model and the model of the repo
Retrained model | Supplied model |
---|---|
Here is a minimal example of the loss and training loop I use.
I can also provide more of my code if wanted.
def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()
    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1)  # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns content codes with self.encoder without using the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)
    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss
    total_loss.backward()
    optimizer.step()

# Train loop..
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over Mel-Spec Slices and the index of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])  # == 1.0
# The rest is computing the validation loss, resynthesizing utterances, saving the model every n epochs, etc
Does anyone have an idea what is wrong with my re-implementation or could anyone reimplement the system with good quality?
Thanks a lot in advance.
The paper says the batch size is 2 for 100k steps. Would that be too small for batch normalization? Are you sure the batch size is 2 in your training?
Hello,
I'd like to enquire on the loss function mentioned in section 3.2, under the section "Training", Equation (6), where it mentions the definition of the loss function. I'd like to ask two questions concerning the two equations:
Thank you for your time!
It seems that the output encoder should be an extra module with the same structure and the same weights as the input encoder. But it is very difficult to get convergence in my training. Correct me if I am wrong.
I used a training set of 400 speakers, each with 350 sentences. Training is very slow.
Is that too big?
What is the training set you use?
Can you tell me?
For all code requests, please send an email to [email protected] with name, affiliation and a description of how the code will be used for your research. Thanks!
Hello
To reproduce the results of AutoVC, we used the following loss function implementation in PyTorch:
num_steps = 100000
G = Generator(32,256,512,32).to(device)
...
criterion_recon = nn.MSELoss().to(device)
criterion_recon0 = nn.MSELoss().to(device)
criterion_content = nn.L1Loss().to(device)
optimizer = optim.Adam(G.parameters(), lr=1e-4)
G.train()
...
for i in range(1, num_steps + 1):
    ...
    optimizer.zero_grad()
    X1rt, X1r, C1 = G(X1, S1, S1)
    #with torch.no_grad(): # Should we calculate C1r with grad?
    #    C1r = G(X1r[:, 0, :, :], S1, None)
    C1r = G(X1r[:, 0, :, :], S1, None)
    L_recon = criterion_recon(X1r[:, 0, :, :], X1)
    L_recon0 = criterion_recon0(X1rt[:, 0, :, :], X1)
    L_content = criterion_content(C1r, C1)
    loss = L_recon + 1.0*L_recon0 + 1.0*L_content
    loss.backward()
    optimizer.step()
However, we did not achieve comparable voice quality, nor a loss value around 1e-3 as reported in the issue "whats your final loss and final learning rate?".
One possible reason is that we used an inappropriate loss function implementation.
Is our loss function implementation compatible with the one you used?
BR
Thanks for sharing the code and you did a great job.
I noticed that in the paper the downsampling process on the temporal axis is different for the forward sequence and the backward sequence. But it seems that the downsampling operation for the forward sequence in the code follows exactly the process described in the paper for the backward sequence. I'm quite confused because these two processes (the one in the code and the one in the paper) seem to behave differently, in that they encode different contextual information.
Since the code is more up to date, is the downsampling process in the code the better one?
Line 79 in 2d8a6c8
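To make the question above concrete, here is a minimal sketch of the two sampling conventions being contrasted: keeping the first frame of every window versus keeping the last frame. The function names and the values 128 and 32 are illustrative assumptions, and the sketch does not claim which convention the paper or the repository applies to which direction.

```python
def window_start_indices(num_frames, freq):
    """Frame indices that keep the FIRST frame of each window of length freq."""
    return list(range(0, num_frames, freq))


def window_end_indices(num_frames, freq):
    """Frame indices that keep the LAST frame of each window of length freq."""
    return [min(i + freq - 1, num_frames - 1) for i in range(0, num_frames, freq)]


if __name__ == "__main__":
    # Illustrative values only: 128 frames, downsampling factor 32.
    print(window_start_indices(128, 32))
    print(window_end_indices(128, 32))
```

For a recurrent layer that reads left to right, the last frame of a window has seen the whole window, while the first frame has seen none of it, which is why swapping the two conventions changes what context each code carries.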
I am training a model on my own data to make conversions between two speakers with the training code you mailed me. I have tried various approaches to produce good conversions and, although I have gotten some average sounding output (with robotic crackles), the output is not as good as the samples in the audio demo website.
As a part of debugging my training, I am investigating the pre-trained model (autovc.ckpt) to produce conversions between the speakers in the VCTK dataset (The dataset the model was trained on). Although I manage to get relatively good outputs using samples from the speakers in the metadata.pkl file as the target and source speaker/speech, when I add a sample from another speaker (p240 or p260 for example) in the VCTK dataset to achieve a conversion, the quality of the output is poor. Can you give some pointers as to why this could be happening? Is the model trained on the entire VCTK dataset, or only a portion of it?
Here is some more information on how I generate the metadata.pkl file to do the conversions. To generate the speaker embedding of one recording, I use the pre-trained model that comes with the training code ("3000000-BL.ckpt") with len_crop = 128. To generate the spectrogram, I use the code in the make_spect.py file that also comes with the training code, and I keep the addition of some random noise (it gave better outputs this way).
Thank you beforehand!
AttributeError Traceback (most recent call last)
in ()
15 c = spect[1]
16 print(name)
---> 17 waveform = wavegen(model, c=c)
18 librosa.output.write_wav(name+'.wav', waveform, sr=16000)
/content/autovc/synthesis.py in wavegen(model, c, tqdm)
46
47 """
---> 48 c = c.numpy()
49
50 model.eval()
AttributeError: 'numpy.ndarray' object has no attribute 'numpy'
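The traceback suggests that `c` is already a NumPy array by the time `wavegen` calls `c.numpy()`. One possible workaround, sketched here under that assumption, is to convert only when the object actually exposes a `.numpy()` method. The helper name `to_numpy_if_tensor` is hypothetical and not part of the repository.

```python
def to_numpy_if_tensor(c):
    """Return c.numpy() for tensor-like inputs; pass anything else through unchanged.

    Assumption: tensor-like objects expose a callable .numpy() method,
    while plain NumPy arrays do not.
    """
    to_numpy = getattr(c, "numpy", None)
    if callable(to_numpy):
        return to_numpy()
    return c
```

With such a helper, the failing line could become `c = to_numpy_if_tensor(c)`, or the caller could pass a tensor instead of an array. A tensor that lives on a GPU or tracks gradients would likely need extra handling before conversion, which this sketch does not cover.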
Running conversion.ipynb for the first time, I got the runtime error in the title.
I fixed it by following computationalmedia/semstyle#3, that is, by adding map_location='cuda:0' to the torch.load call.
I am not sure why this works, though.
Hi @auspicious3000 .
Thank you for sharing this work.
I have some questions. Could you please help me with the answers?
Hi,
What number of time steps (frames) of mels did you use for training?
thanks.
Hi! thanks for sharing this awesome code!
I listened to your sample sounds and ran the .ipynb files,
and I want to train AutoVC with my own dataset, but there is no training code...
Is there any plan to share the training code?
Each time I try a conversion, it states:
The kernel appears to have died. It will restart automatically.
on Ubuntu 19.10, Firefox
Any idea?
What is the format of the metadata?
I want to try another audio.
I checked the data inside. But I don't know what the second one is.
The first one is the name, and the third is the mel-spectrogram.
And does this apply to Chinese audio? Or I need to retrain the model and use Chinese data.
thanks!
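One way to answer the format question empirically is to print the type and shape of every field in each entry. The sketch below assumes only that metadata.pkl unpickles to a list of sequences, as the description above implies; the helper names `describe` and `describe_metadata` are hypothetical.

```python
import pickle


def describe(obj):
    """Summarize one field: type name plus shape, value, or length when available."""
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return f"{type(obj).__name__} shape={tuple(shape)}"
    if isinstance(obj, (str, bytes)):
        return f"{type(obj).__name__} {obj!r}"
    try:
        return f"{type(obj).__name__} len={len(obj)}"
    except TypeError:
        return type(obj).__name__


def describe_metadata(path):
    """Load a pickle (trusted files only) and describe every field of every entry."""
    with open(path, "rb") as f:
        metadata = pickle.load(f)
    return [[describe(field) for field in entry] for entry in metadata]


if __name__ == "__main__":
    for fields in describe_metadata("metadata.pkl"):
        print(fields)
```

The printed shape of the second field should make it easier to judge what it holds, for example whether it is a fixed-size vector per speaker.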
I found that you set epoches=2000 in hparams.py. I would like to know how many epochs are needed for training and how long it will take. Thanks.
Hi @auspicious3000, thanks for sharing your research code.
I've spent a lot of time making the training code work (mostly due to input hyperparameter issues that others are also struggling with).
I'm currently working on the VoxCeleb2 dataset (nearly 6000 speakers, with 1M utterances).
However, I cannot make it trainable with MSE loss, but with L1 loss, I can manage to get the following auto-encoding reconstruction.
[Original]
[Voice converted with another speaker embedding]
The problem is that while the network learns auto-encoding, at test time it does not generalize to voice conversion. It only performs auto-encoding and nothing else.
The above pair of examples are voice conversion examples, where the fundamental frequencies of both mel-spectrograms look very similar.
Could you share your experience or any comments? I'd appreciate it.
If the speaker embedding is not added to the encoder input, will it affect the model's performance? Do you have any relevant experiments? Thx!
For example, do you crop the speech to 2 seconds, or do you keep the original speech length during training? Does this affect the performance of the model? Because I find it very much affects the speed of training, so I'd like to know the answer. Thank you :D
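As an illustration of the fixed-length option asked about above, here is a minimal sketch of cropping or zero-padding a mel-spectrogram to a fixed number of frames. The default of 128 frames mirrors the `len_crop = 128` mentioned elsewhere on this page; the function name, the zero padding, and the random crop position are assumptions, not the author's confirmed procedure.

```python
import numpy as np


def crop_or_pad(mel, len_crop=128, rng=None):
    """Return a (len_crop, num_mels) slice of mel, which has shape (num_frames, num_mels).

    Longer inputs get a random crop; shorter inputs are zero-padded at the end.
    """
    if rng is None:
        rng = np.random.default_rng()
    num_frames = mel.shape[0]
    if num_frames >= len_crop:
        # High bound is exclusive, so the last valid start is num_frames - len_crop.
        start = int(rng.integers(0, num_frames - len_crop + 1))
        return mel[start:start + len_crop]
    pad = len_crop - num_frames
    return np.pad(mel, ((0, pad), (0, 0)), mode="constant")
```

Fixed-length crops let every batch share one shape, which is one reason cropping tends to speed up training compared with full-length utterances.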
We have further improved AutoVC in 2 subsequent works.
The 1st work improves the audio quality by removing any pitch artifacts.
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
https://arxiv.org/abs/2004.07370
The 2nd work can convert rhythm, pitch, and/or timbre at the same time.
Unsupervised Speech Decomposition via Triple Information Bottleneck
https://arxiv.org/abs/2004.11284
Hi, I have a question about the decoder mismatch between code and the paper.
In the paper, it used 3 LSTM layers after 3 convnorm layers.
However in the code, it used 1 LSTM, 3 convnorm layers are following after, and 2 LSTM layers are used next.
Is this mismatch intended? I am wondering which implementation is correct.
Thanks.
With the same WaveNet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel-spectrogram in the provided metadata.pkl is much better than the one I generated myself. Is there any trick to generating a proper mel-spectrogram?
Could you please tell us how you generated mel spectrograms for training from .wav files? What were the parameters used?
@auspicious3000 Can I train the model directly on VCTK without pretraining the speaker encoder using the GE2E loss?
I listened to all the demo wavs on these pages.
this is the new paper: https://auspicious3000.github.io/icassp_2020_demo/
this paper: https://auspicious3000.github.io/autovc-demo/
Theoretically, the original speaker embedding information is already contained in the spectrogram. The network will automatically squeeze the original speaker embedding information out after convergence. Why is the original speaker embedding still needed?
I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing normalized quantized log-F0 seems a bit confusing, as there is more than one way you could compute the mean µ and std σ.
A sample's pitch vector is a 1-D array of size n, where n is the number of frames (which seems to be fixed at 128 according to #6 (comment)). So there are three ways of computing µ and σ.
Suppose f0 is extracted from a sample audio of speaker A.
1. Per sample: f0_norm = (f0 - f0.mean()) / f0.std() / 4
2. Per speaker: f0_norm = (f0 - f0s.mean()) / f0s.std() / 4, where f0s is an A x 128 array with A being the total number of samples of speaker A.
3. Over all speakers: f0_norm = (f0 - f0s.mean()) / f0s.std() / 4, where f0s is an N x 128 array with N being the total number of all samples (that is, A < N).
And assuming the answer is 2 or 3, for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere safe so I can reuse those values for inference? (I guess option 2 doesn't really make sense since you can't compute those for unseen speakers.)
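The three candidates differ only in which values feed the mean and standard deviation, which a small pure-Python sketch can make explicit. The helper names are hypothetical, the population standard deviation is used because that is what NumPy's `std()` computes by default, and the handling of unvoiced frames is deliberately left out. Nothing here indicates which option the paper's authors used.

```python
from statistics import fmean, pstdev


def stats(values):
    """Mean and population standard deviation of a flat list of F0 values."""
    values = list(values)
    return fmean(values), pstdev(values)


def flatten(tracks):
    """Flatten a list of per-sample F0 tracks into one list of values."""
    return [v for track in tracks for v in track]


def normalize_f0(f0, mean, std):
    """Apply (f0 - mean) / std / 4, the formula quoted in the question."""
    return [(v - mean) / std / 4 for v in f0]


if __name__ == "__main__":
    # Toy numbers, not real pitch tracks.
    f0 = [100.0, 200.0]
    speaker_a_tracks = [[100.0, 200.0], [200.0, 300.0]]
    all_tracks = speaker_a_tracks + [[150.0, 250.0]]

    per_sample = normalize_f0(f0, *stats(f0))                          # option 1
    per_speaker = normalize_f0(f0, *stats(flatten(speaker_a_tracks)))  # option 2
    global_norm = normalize_f0(f0, *stats(flatten(all_tracks)))        # option 3
    print(per_sample, per_speaker, global_norm)
```

Under options 2 and 3 the `(mean, std)` pair is computed once and could be saved next to the model, so the same values are available at inference time.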
I'm retraining the model using my own data but my output is all noise. I'm suspecting that I'm having an issue with the way I'm generating the mel-spectrograms. I'm generating them using librosa and inverting the output of the model back to raw audio using librosa too.
Here are the functions I'm using to generate mel-spectrogram from raw audio:
def normalize(S):
    return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)

def denormalize(S):
    return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    return np.power(10.0, x * 0.05)

def melspectrogram(y):
    S = librosa.feature.melspectrogram(y=y, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, n_mels=hp.n_mels, fmin=hp.fmin, fmax=hp.fmax, power=hp.power)
    S = amp_to_db(S)
    S = normalize(S)
    return S

def inverse_melspectrogram(M):
    M = denormalize(M)
    M = db_to_amp(M)
    y = librosa.feature.inverse.mel_to_audio(M=M, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length, power=hp.power)
    return y
Here are the hyperparameters I'm using:
sr=16000
n_mels=80
fmin=90
fmax=7600
fft_size=1024
hop_length =256
min_level_db=-100
ref_level_db=20
PAD_VALUE = -100000
BATCH_SIZE = 32
MAX_FRAMES = 1024
power = 1.0
Could you tell me if there is an issue with my preprocessing steps? If you need any more info, please ask.
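One quick sanity check for a pipeline like this is to push single amplitude values through the normalize and denormalize steps and see which ones survive the round trip. The sketch below re-implements the four scalar transforms from the post with the standard library only, using the posted `min_level_db = -100`; it is a diagnostic aid under those assumptions, not a diagnosis of the noise problem.

```python
import math

MIN_LEVEL_DB = -100  # value taken from the hyperparameters listed in the post


def amp_to_db(x):
    return 20 * math.log10(max(1e-5, x))


def db_to_amp(x):
    return 10.0 ** (x * 0.05)


def normalize(s):
    return min(max((s - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0), 1.0)


def denormalize(s):
    return min(max(s, 0.0), 1.0) * -MIN_LEVEL_DB + MIN_LEVEL_DB


def round_trip(amp):
    """Amplitude -> dB -> [0, 1] -> dB -> amplitude."""
    return db_to_amp(denormalize(normalize(amp_to_db(amp))))


if __name__ == "__main__":
    for amp in (1e-7, 0.5, 2.0):
        print(amp, round_trip(amp))
```

Amplitudes between 1e-5 and 1 come back unchanged, while anything above 1 (above 0 dB) is clipped to 1 and anything below 1e-5 is raised to 1e-5. Whether the mel magnitudes in a given dataset exceed 1 is worth checking before training.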
Thanks
Please send an email to [email protected]
I have seen in #27 that 300k steps is necessary for training. Do the original results use the entirety of the VCTK dataset with 9/10 train, 1/10 test as in the paper, or do they use only a subset of the VCTK dataset?
Also, just to clarify, a step or iteration is one batch being passed through the network, correct? i.e. If the dataset is of size 10000, then you get 5000 steps (or iterations) out of it on a batch size of 2 in one epoch.
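Under the definition proposed in the question (one step is one batch passed through the network), the arithmetic can be sketched as follows. The function names are hypothetical, and the 300k figure is the step count quoted from #27 above.

```python
import math


def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of batches (steps) needed to see the whole dataset once."""
    if drop_last:
        return dataset_size // batch_size
    return math.ceil(dataset_size / batch_size)


def epochs_for_steps(total_steps, dataset_size, batch_size):
    """How many passes over the dataset a given number of steps corresponds to."""
    return total_steps / steps_per_epoch(dataset_size, batch_size)


if __name__ == "__main__":
    print(steps_per_epoch(10000, 2))           # the example from the question
    print(epochs_for_steps(300000, 10000, 2))  # 300k steps on that same dataset
```

With 10000 items and a batch size of 2 this gives 5000 steps per epoch, so 300k steps would correspond to 60 passes over such a dataset.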
Thank you for sharing your work.
To build a zero-shot model, should I train the speaker embedder as well as the conversion model on a large dataset (VCTK)?
Or is it OK to train only the conversion model on VCTK?
Quite a few of the demo samples on this page don't seem to work.
Using Firefox 60.0
Lots of console errors saying Media resource https://auspicious3000.github.io/autovc-demo/audios/89_01/p261_p225_1000000.wav could not be decoded.
I am wondering how you extract the speaker embedding with the pre-trained verification model.
The speaker embedding I get from https://github.com/resemble-ai/Resemblyzer is a vector with all non-negative values, mostly zeros, due to the ReLU at the end of the model. However, the speaker embedding in your metadata.pkl has both positive and negative values and looks normally distributed.
Could you give me some advice on how you extracted the exact embedding in your work? I tried skipping the ReLU layer and the L2 normalization of the vector, but the result is still not similar to yours. I hope to receive your response! Thanks!
By the way, your paper says that you used a speaker encoder with a stack of "2" LSTM layers with cell size "768". But the model in https://github.com/resemble-ai/Resemblyzer uses "3" LSTM layers with cell size "256". I am not sure whether you used the same speaker encoder model.
About L=Lrecon +μ * Lrecon0 + λ * Lcontent
The paper is not very clear, how can I get loss Lrecon0 ?
Lrecon and Lrecon0 Are they related to the loss before and after the residual module?
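Judging from the training snippets posted elsewhere on this page, the three terms appear to be computed as written below. This is a reading of those snippets rather than a statement from the authors. Here $\tilde{X}_{1\to 1}$ denotes the decoder output before the postnet, $\hat{X}_{1\to 1}$ the output after the postnet, $E_c$ the content encoder, and $C_1 = E_c(X_1)$.

```latex
\begin{aligned}
L &= L_{\text{recon}} + \mu \, L_{\text{recon0}} + \lambda \, L_{\text{content}} \\
L_{\text{recon}} &= \mathbb{E}\!\left[ \lVert \hat{X}_{1\to 1} - X_1 \rVert_2^2 \right] \\
L_{\text{recon0}} &= \mathbb{E}\!\left[ \lVert \tilde{X}_{1\to 1} - X_1 \rVert_2^2 \right] \\
L_{\text{content}} &= \mathbb{E}\!\left[ \lVert E_c(\hat{X}_{1\to 1}) - C_1 \rVert_1 \right]
\end{aligned}
```

In those snippets both $\mu$ and $\lambda$ are set to 1.0, the two reconstruction terms use a mean squared error, and the content term uses an L1 loss.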
How to ensure that the output of the encoder is independent of the speaker?
I can't see the concept of confusing networks or generative adversarial training in this paper.
I don't understand how it works.
Is it because of memory limitations, or is it better than the result of a larger batch size (such as 16, 32, 64)?
Hi, I'm using Python 3.7.3 with numpy 1.16.4 and am experiencing the following error when loading the metadata:
``
metadata = pickle.load(open('metadata.pkl', "rb"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_pickle.UnpicklingError: invalid load key, '\x0a'.
``
I guess this might be an encoding mismatch ('ASCII' by default) or a numpy version mismatch.
Would it be possible for you to share a npy/npz version of the metadata.pkl? Thanks!
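The error reports that the first byte of the file is '\x0a' (a newline), whereas a binary pickle written with protocol 2 or later begins with the byte 0x80 followed by the protocol number. A quick check of the first bytes, sketched below with a hypothetical helper, can help tell a damaged or non-pickle download apart from a real encoding or version mismatch. The upper protocol bound of 5 reflects the protocols known at the time of writing.

```python
def looks_like_binary_pickle(path):
    """True if the file starts with the PROTO opcode (0x80) and a protocol number 2..5.

    Pickles written with protocol 0 or 1 do not carry this header,
    so a False result is a hint, not proof, that the file is damaged.
    """
    with open(path, "rb") as f:
        head = f.read(2)
    return len(head) == 2 and head[0] == 0x80 and 2 <= head[1] <= 5


if __name__ == "__main__":
    print(looks_like_binary_pickle("metadata.pkl"))
```

If the check fails, opening the file in a text editor may show what was actually saved, for example an HTML page or a placeholder instead of the binary data.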
Hello @auspicious3000,
I want to retrain this model using an Urdu dataset. Can you guide me through the steps that I should follow?
Please list the files that I need to execute, in sequence, so that I can reproduce the results for my own dataset.