Comments (77)
Yes, you can change the segment size, unless the size is too short. 5,000 might be too short as it is computationally inefficient when the segment size is too short.
I made a mistake: I meant to ask about 50000, not 5000. I thought that the larger the number, the more of X it would cover, but I found that increasing it doubled the training time. So I'll pick a suitable integer close to the 32270 you set, neither too big nor too small, somewhere around 30000-35000.
We have 102 embedding vectors along the time. Each embedding vector is quantized using 8 codebooks. Which produces 102 * 8 codes, i.e., tensor(32, 102, 8)
Thanks, I get it. My idea is to reduce the step size while increasing segment_length and keeping num_codebooks the same (because I think changing these three parameters may help with lower bit rate compression), and then look at the PESQ score.
from soundstream-pytorch.
torch.nn.init.normal_(model.quantizer.weight, mean=mean, std=std)
This initializes the codebook. The weight of the codebook is 8 x 1024 x 512: there are 8 codebooks, and each codebook has 1024 codes. At the beginning of training, we want this to be close to the distribution of the encoder's outputs. On page 5 of the paper, it says,
initialization for the codebook vectors, we run the k-means
algorithm on the first training batch and use the learned
centroids as initialization
I skipped this and just initialized with a Gaussian fitted to the first training batch, as I didn't want to run k-means in the training code. The code calls the model here, but this is required only once.
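The Gaussian initialization described above can be sketched in plain Python. This is only an illustration, not the repository's code: `init_codebook_gaussian`, `batch`, and the toy sizes are hypothetical names and values, and it assumes a single scalar mean and standard deviation taken over the whole batch, as the `mean=` and `std=` arguments of `torch.nn.init.normal_` suggest.

```python
import random
import statistics

def init_codebook_gaussian(encoder_outputs, num_codes, seed=0):
    """Draw codebook vectors from a Gaussian fitted to one batch of encoder outputs.

    encoder_outputs is a list of equal-length vectors (a hypothetical stand-in
    for the encoder outputs of the first training batch).
    """
    rng = random.Random(seed)
    values = [v for vec in encoder_outputs for v in vec]
    mean = statistics.fmean(values)    # scalar mean over the whole batch
    std = statistics.pstdev(values)    # scalar standard deviation
    dim = len(encoder_outputs[0])
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(num_codes)]

# Toy batch: 200 vectors of dimension 4, centred around 3.0.
data_rng = random.Random(1)
batch = [[data_rng.gauss(3.0, 0.5) for _ in range(4)] for _ in range(200)]
codebook = init_codebook_gaussian(batch, num_codes=16)
print(len(codebook), len(codebook[0]))
```

The k-means initialization from the paper would instead place each code at a cluster centroid of the batch; the Gaussian version only matches the overall mean and spread.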
from soundstream-pytorch.
Yes, you're right. STFT is used to get mel-spectrum. SoundStream paper uses mel-spectrum and https://arxiv.org/pdf/2008.01160.pdf uses STFT.
from soundstream-pytorch.
Do you mean you want to train the SoundStream model with new training data, or train another model which uses the output of SoundStream as features? In the first case, you can run python soundstream.py, which should download LIBRISPEECH under ./data and start training.
from soundstream-pytorch.
Thank you for your reply.
First, I found that your SoundStream model needs to download data, including YESNO and LIBRISPEECH, which is very time-consuming, so I downloaded other new data in advance.
Second, I mean the first case. I want to use your SoundStream model to train on a new dataset with a sample rate of 8 kHz, which I have already downloaded, but I don't know how to load it into your model.
from soundstream-pytorch.
My ultimate goal is to achieve low bit rate compression. I would like to train your model on data with a sample rate of 8 kHz, then change num_embeddings from 1024 to 256 and num_quantizers from 8 to 6, and see what the end result is.
from soundstream-pytorch.
ds is not a directory path string, but a torch.utils.data.Dataset. If you want to train an 8 kHz model with LIBRISPEECH, you can change sample_rate. If you want to use your custom dataset, you can implement your own Dataset, which should not be too difficult.
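A custom dataset could look roughly like the sketch below. Everything here is an assumption for illustration: the class name `WavSegmentDataset`, the fixed `segment_length`, and returning plain float lists are hypothetical choices. The training script may expect a different item format (for example a tensor), so check how ds is consumed before adapting it. The sketch only relies on the map-style protocol, that is, an object with `__len__` and `__getitem__`.

```python
import os
import random
import struct
import tempfile
import wave

class WavSegmentDataset:
    """Map-style dataset over 16-bit mono WAV files (hypothetical helper)."""

    def __init__(self, paths, segment_length, sample_rate=8000, seed=0):
        self.paths = list(paths)
        self.segment_length = segment_length
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        with wave.open(self.paths[index], "rb") as f:
            if f.getframerate() != self.sample_rate:
                raise ValueError("unexpected sample rate")
            if f.getnchannels() != 1 or f.getsampwidth() != 2:
                raise ValueError("expected 16-bit mono audio")
            raw = f.readframes(f.getnframes())
        pcm = struct.unpack("<%dh" % (len(raw) // 2), raw)
        samples = [s / 32768.0 for s in pcm]  # scale to [-1, 1)
        # Zero-pad short files, then take a random crop of segment_length samples.
        if len(samples) < self.segment_length:
            samples += [0.0] * (self.segment_length - len(samples))
        start = self.rng.randrange(len(samples) - self.segment_length + 1)
        return samples[start:start + self.segment_length]

# Demo: write a short 8 kHz test file and read one segment back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
demo_pcm = [0, 1000, -1000, 0] * 500  # 2000 samples
with wave.open(demo_path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(8000)
    f.writeframes(struct.pack("<%dh" % len(demo_pcm), *demo_pcm))

demo_ds = WavSegmentDataset([demo_path], segment_length=1600)
segment = demo_ds[0]
print(len(demo_ds), len(segment))
```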
from soundstream-pytorch.
Excuse me again, I have successfully started training, and the training data is the same as yours. The difference is that my data was downloaded in advance. During the training process, when the epoch was 98, an inexplicable error occurred, which seemed to be a problem with the data. However, the data was the same as yours, so I don't understand why this error occurred. Have you encountered it before?
I tried resuming training from epoch 98 and saw no errors, so I consider this issue avoided for now.
Second question: I could not find a testing procedure for the final model. Can you provide guidance?
from soundstream-pytorch.
If you just want to hear the output yourself, you can encode the audio file by calling the forward() method.
soundstream-pytorch/soundstream.py, line 500 in 9c6086e
from soundstream-pytorch.
ViSQOL
If you just want to hear the output yourself, you can encode the audio file by calling the forward() method (soundstream-pytorch/soundstream.py, line 500 in 9c6086e). If you want to compute ViSQOL, sorry, there is no implementation for that.
First, how do you determine whether your model is useful? What are the evaluation metrics?
Second, how can I use an output such as "epoch=84-step=150000.ckpt" to check whether the model works?
from soundstream-pytorch.
You can listen reconstructed audio here https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your audio files to judge if it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint. You can load the model https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html .
model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")
First, I think I have completed 150 training epochs, as shown in the picture.
Second, regarding the 'model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")' you mentioned: can I pass the .ckpt file labeled in the second image to reconstruct my speech signal, in order to verify the usefulness of the model?
Third, is the .ckpt file labeled in the second image the final trained model?
I'm sorry, I'm a novice, so I may be bothering you with many naive questions.
from soundstream-pytorch.
First: I think I have managed to reconstruct the voice signal. I followed your method and completed the main function.
Second: What do you think the PESQ score of the output file is? The input and output files are located below.
from soundstream-pytorch.
https://drive.google.com/drive/folders/1mvyg_CRxI6LGlVXbYu0OHKmhiAV_E-bK?usp=drive_link
First, I'm sorry, please forgive me for being a novice. I had not opened access to the audio files earlier; you can access them now.
Second, a few days ago an unknown error occurred during training at epoch 84. The model saved at that point and the model saved after 150 epochs of training produce noticeably different output files. Is the number of epochs too large?
from soundstream-pytorch.
SoundStream has a couple of loss functions. You can use TensorBoard to look at these losses. If some of them have strange behavior you may adjust parameters.
I used the same strides for 16 kHz as the original paper, 2 * 4 * 5 * 8 = 320 samples per window. This gives 50 Hz embeddings.
The original SoundStream is for 24 kHz, which gives 75 Hz embeddings. So an 8 kHz model has to compress a longer audio window into an embedding.
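The relation between strides, window size, and embedding rate can be checked with a few lines of Python. The helper name `frame_rate` is hypothetical; it only restates the arithmetic above.

```python
from math import prod

def frame_rate(sample_rate, strides):
    """Return (window size in samples, embeddings per second)."""
    hop = prod(strides)  # total downsampling factor of the encoder
    return hop, sample_rate / hop

print(frame_rate(16000, (2, 4, 5, 8)))  # (320, 50.0)
print(frame_rate(24000, (2, 4, 5, 8)))  # (320, 75.0)
print(frame_rate(8000, (2, 4, 5, 8)))   # (320, 25.0)
print(frame_rate(8000, (2, 2, 5, 8)))   # (160, 50.0)
```

At 8 kHz, a 320-sample window spans 40 ms instead of the 20 ms it spans at 16 kHz, which is why a smaller stride product is worth considering for the 8 kHz model.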
from soundstream-pytorch.
First, I studied TensorBoard for several hours today, but I haven't made any progress yet. My understanding is that TensorBoard is normally used when training with model.fit(), but I am currently training with pl.Trainer.fit(). Do I need to change the training function to use TensorBoard? How should I use it?
Second, you mentioned "So 8kHz model has to compress a longer audo window into an embedding". I want to change the window size to 2 * 2 * 5 * 8 = 160. Am I understanding this correctly?
from soundstream-pytorch.
It seems that TensorBoard is not enabled by default. If you enable it, you'll find lightning_logs/version_X/event.xxxx.yyyy in your output. You can launch it with tensorboard --logdir lightning_logs/version_X/
Or you can find the CSV file lightning_logs/version_X/metrics.csv.
2 * 2 * 5 * 8 = 160 makes the window size 160. I think it is a good number.
from soundstream-pytorch.
First, after seeing your reply, I spent several hours trying to open TensorBoard and found that just setting logger=True is enough. I'm so happy.
Second, suppose I want a low bitrate such as 1.2 kbps at an 8 kHz sampling rate. With a window size of 320, each frame gets 1200 * 320 / 8000 = 48 bits, which I can quantize with six 8-bit codebooks. To reach 0.6 kbps with the same six 8-bit codebooks, I need 600 * 640 / 8000 = 48 bits, which means the window size changes from 320 to 640. So it seems I need to increase the window size. Do you agree?
from soundstream-pytorch.
If you just want to achieve low bitrate compression, you can simply reduce the number of quantizers, without retraining.
soundstream-pytorch/soundstream.py, line 241 in 9c6086e
Many of the latest neural vocoders (SoundStream and Meta's EnCodec, https://github.com/facebookresearch/encodec) adopt a hierarchical quantized autoencoder so that they can achieve adjustable bit rates. However, note that dropping quantizers doesn't reduce computational cost.
I think a longer window size is difficult to learn, as an audio signal is stationary over a short time, but not over a longer time.
from soundstream-pytorch.
First, I studied traditional speech compression before, where codebooks are used for quantization. For example, 48 bits are used to quantize one feature vector, distributed like this: the number of codebooks is 6, and each codebook has 8 bits, that is, 2^8 = 256 representative vectors.
But I may have misunderstood SoundStream, because I always thought num_quantizers corresponded to num_codebook in the traditional compression algorithm and num_embeddings corresponded to dim_codebook. I should treat embedding_dim in SoundStream as dim_codebook, right?
Second, in your model, num_quantizers=6, num_embeddings=1024, embedding_dim=512. How do I calculate the compression bitrate in kbps? I want to know what the bit rate is. Can you show me?
Third, I've been following EnCodec, but I haven't started learning it yet. You say SoundStream also adopts a hierarchical quantized autoencoder. Is that in your code? If you have finished that part, can you show me?
Fourth, your reminder is right: the window size cannot be greater than 320, as the short-term stationarity of voice is about 20-30 ms. But I would like to ask whether the multi-frame joint quantization idea from traditional compression algorithms can be used in SoundStream.
from soundstream-pytorch.
[self.register_buffer("code_count", torch.empty(num_quantizers, num_embeddings))](https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L225)
For the second question I asked yesterday, I have a new idea, but I do not know if it is correct. The number of bits per frame is num_quantizers * log2(num_embeddings) = 80 bits, and the bit rate is 80 bits * (16000 Hz / 320 = 50 frames per second) = 4 kbps. Am I right?
from soundstream-pytorch.
First, I studied traditional speech compression before, using codebooks to quantify, such as 48bit to quantify a feature
Yes, you are right, num_quantizers is the number of codebooks. SoundStream has 8 codebooks and each codebook has 1024 codes. Then one frame is encoded with 8 * log2(1024) = 80 bits. In the original paper, the frame rate is 75 Hz for a 24 kHz sampling rate. This produces 75 * 80 = 6 kbps, or 4 kbps in the case of 16 kHz.
Third, I've followed Encodec, but I haven't started learning it yet. You say soundstream also adopt hierarchical quantized autoencoder. Is that mentioned in your code? If you have finished that part , can you show me?
The algorithm is explained in Algorithm 1: Residual Vector Quantization of https://arxiv.org/pdf/2107.03312.pdf. This produces 8 x 10-bit codes, in which the first code is the most important and the last is the least important. You can reproduce the original vector using only some of the codes, for example the first 5, and then you achieve 5 * 10 * 50 = 2.5 kbps.
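The bit rate arithmetic above can be written as a small helper. The function name `bitrate_bps` and the parameter `hop_length` (the stride product, 320 here) are hypothetical names used only for this sketch.

```python
from math import log2

def bitrate_bps(num_quantizers, num_embeddings, sample_rate, hop_length):
    """Bits per second = bits per frame * frames per second."""
    bits_per_frame = num_quantizers * log2(num_embeddings)
    frames_per_second = sample_rate / hop_length
    return bits_per_frame * frames_per_second

print(bitrate_bps(8, 1024, 24000, 320))  # 6000.0, the 24 kHz setting of the paper
print(bitrate_bps(8, 1024, 16000, 320))  # 4000.0, the 16 kHz model
print(bitrate_bps(5, 1024, 16000, 320))  # 2500.0, keeping only the first 5 codes
```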
Here, you can pass n codes at inference time, where n is between 1 and 8.
soundstream-pytorch/soundstream.py, line 264 in 9c6086e
At training time, it randomly drops the less important codes so that it can reproduce audio with only the important codes.
soundstream-pytorch/soundstream.py, line 241 in 9c6086e
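To make the idea concrete, here is a toy residual vector quantizer in plain Python. It is a sketch of the general technique, not the code in soundstream.py: the function names, the hand-made codebooks, and the 2-dimensional vectors are all hypothetical, whereas the real model learns 8 codebooks of 1024 codes each.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Quantize stage by stage; each stage codes the residual left so far."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes, n=None):
    """Sum the chosen entries of the first n stages (all stages if n is None)."""
    n = len(codes) if n is None else n
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks[:n], codes[:n]):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup: 3 stages of 4 two-dimensional codes, each stage finer than the last.
corners = ((1, 1), (1, -1), (-1, 1), (-1, -1))
codebooks = [[[s * x, s * y] for x, y in corners] for s in (1.0, 0.5, 0.25)]

x = [0.8, -1.6]
codes = rvq_encode(codebooks, x)
errors = []
for n in (1, 2, 3):
    approx = rvq_decode(codebooks, codes, n)
    errors.append(sum((a - b) ** 2 for a, b in zip(approx, x)) ** 0.5)
print(codes, [round(e, 3) for e in errors])
```

Decoding with fewer stages gives a coarser reconstruction at a lower bit rate, which is why the first code matters most and the later codes can be dropped.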
from soundstream-pytorch.
Thank you for your reply. Your reply is my motivation to continue studying. My understanding is this: I just load your pre-trained model (soundstream_16khz-20230425.ckpt) and then change the value of n to achieve a variety of bit rates, with no need to retrain. For example, n = 4 gives 4 * 10 * 50 = 2 kbps, and n = 2 gives 2 * 10 * 50 = 1 kbps.
Here, you can pass n codes, where n is between 1 and 8 in the inference time.
I want to achieve a lower speech compression bit rate by changing the sampling rate to 8 kHz (which should be the lowest) and changing the step size to 2 * 4 * 5 * 6 = 240 to match the 8 kHz sample rate. 'num_quantizers=8' and 'num_embeddings=1024' remain unchanged, and epoch=200. Then I will compare the results with your 16 kHz model while changing 'n' in the same way for both.
In the training time, it drops less important codes randomly so that it can reproduce audio with only important codes.
Can n be 1, keeping only the most important code?
from soundstream-pytorch.
When I change the step size to 2 * 4 * 5 * 6 = 240, I need to change segment_length from 32270 to 30430, which I calculated so that it divides evenly by the step size. So I would like to ask whether the 32270 you set was also chosen to divide evenly by the step size. Can I make it bigger or smaller? I wonder if it could be bigger, because it could then include more X content. Am I right?
from soundstream-pytorch.
I am a PhD student and I want to publish a paper based on SoundStream, but I haven't found anything novel yet. Can you give me some guidance? For example, where could SoundStream still be improved?
At first I wanted to use SoundStream to achieve lower bitrates, but I found that SoundStream already supports this by changing 'n', or by retraining with new 'num_quantizers' and 'num_embeddings', so I couldn't find a new idea. Can you suggest something? Thanks.
from soundstream-pytorch.
In the paper, the authors mention that, as long as the bit rate is kept the same, different step sizes do not affect the final score.
So my idea of retraining a new model to achieve a lower bit rate by changing the step size to 2 * 4 * 5 * 6 = 240 might not work.
from soundstream-pytorch.
Your reply is my motivation to continue studying.
Thank you! I'm glad to hear that.
I need to change the segment_length from 32270 to 30430,
32270 is a nice number because it makes the output length of the decoder the same as the input length of the encoder. They are sometimes different because of rounding. I think 30430 is a good number for 2 * 4 * 5 * 6.
For example, where can I continue to improve soundstream?
I'm not sure, but you may try variable rate. SoundStream is fixed rate in time. It might be enough when the audio signal is not so complicated.
BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.
from soundstream-pytorch.
If you just want to hear the output yourself, you can encode the audio file by calling the forward() method (soundstream-pytorch/soundstream.py, line 500 in 9c6086e). If you want to compute ViSQOL, sorry, there is no implementation for that.
https://github.com/aliutkus/speechmetrics
I tested it with PESQ today and found that PESQ didn't work very well. Did you not use ViSQOL or PESQ test tools at that time?
from soundstream-pytorch.
For example, where can I continue to improve soundstream?
I'm not sure, but you may try variable rate. SoundStream is fixed rate in time. It might be enough when the audio signal is not so complicated.
Can you be more specific? Because I am a beginner in audio compression and my research direction is very low bit rate compression, I feel that you are an expert in this field, so I would like to hear your specific opinion.
from soundstream-pytorch.
BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.
I have already been following two models, SoundStream and EnCodec, which are very close to my research direction. My plan is this: for my first paper, I want to do some research based on SoundStream, but I haven't found a suitable research topic yet. For the second paper, I want to do some research based on EnCodec. So I have been studying SoundStream recently and will start studying EnCodec after the New Year.
from soundstream-pytorch.
https://github.com/aliutkus/speechmetrics
I tested it with PESQ today and found that PESQ didn't work very well. Did you not use ViSQOL or PESQ test tools at that time?
Thank you for the input. I have used neither of them. I'll look into them.
Can you be more specific?
This is just a random idea. The codec keeps a constant rate, which is chosen by the user. For example, if you decide to use the first 5 codebooks, then the data rate is a constant 6 kbps * 5 / 8 = 3.75 kbps, even when there is no audio signal. Maybe the codec could decide by itself how many codebooks to use for the given audio, using more bits for important frames and fewer bits for less important frames.
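As a rough illustration of what a variable rate would buy, the sketch below computes the average bit rate when each frame uses its own number of codebooks. The function name and the per-frame counts are hypothetical, and how a codec would choose those counts is exactly the open question. A real scheme would also have to transmit the chosen count for each frame, which this sketch ignores.

```python
from math import log2

def average_bitrate_bps(codebooks_per_frame, num_embeddings, frame_rate):
    """Average bits per second when frame i uses codebooks_per_frame[i] codebooks."""
    bits = sum(n * log2(num_embeddings) for n in codebooks_per_frame)
    seconds = len(codebooks_per_frame) / frame_rate
    return bits / seconds

# Fixed rate: 5 codebooks on every frame, 75 frames per second.
fixed = average_bitrate_bps([5] * 75, 1024, 75)
# Variable rate: 1 codebook on 60 quiet frames, all 8 on 15 busy frames.
variable = average_bitrate_bps([1] * 60 + [8] * 15, 1024, 75)
print(fixed, variable)  # 3750.0 1800.0
```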
from soundstream-pytorch.
You set the step size 2 * 4 * 5 * 8 = 320 and segment_length=32270; then output = Encoder(input), and the output is a tensor(32,102,512).
When I set the step size 2 * 4 * 5 * 6 = 240 and segment_length=30430, the output is a tensor(32,127,512).
So I want to ask two questions:
First, can I increase segment_length? Like 5000? Because I noticed that this value is related to the valid content of the input.
Second, if I increase segment_length and decrease the step size, the second dimension of the output tensor will also increase, for example from your 102 to my 127 or more. I understand that the second dimension is related to the number of codebooks, which is 1024. What is the effect of increasing the second value?
from soundstream-pytorch.
The first one, Can I increase the segment_length? Like 5000? Because I noticed that this value is related to the valid content of input.
Yes, you can change the segment size, unless the size is too short. 5,000 might be too short as it is computationally inefficient when the segment size is too short.
I know that the second parameter is related to the number of codebooks
The second dimension of an encoded tensor is time-axis. If the output is a tensor(32,102,512), it means it encodes 102 * 320 = 32640 samples ignoring paddings. The third dimension is the hidden dimension. We have 102 embedding vectors along the time. Each embedding vector is quantized using 8 codebooks. Which produces 102 * 8 codes, i.e., tensor(32, 102, 8)
from soundstream-pytorch.
soundstream-pytorch/soundstream.py, line 647 in 9c6086e
Is the weight here computed from only one batch? Is a for loop missing here, like "for batch in iterator" or something?
My understanding is this: the weight is the initial codebook, and all the data needs to be clustered, for example into 1024 categories, which then form the 1024 codebook entries.
Did I get it wrong?
from soundstream-pytorch.
First, do you know what the second part of the reconstruction loss formula in the paper is? Why take the log of the mel spectrum?
soundstream-pytorch/soundstream.py, line 385 in 9c6086e
Second, your code uses STFT instead of the mel spectrum. Is the STFT you use an intermediate parameter in the process of computing the mel spectrum?
soundstream-pytorch/soundstream.py, line 383 in 9c6086e
soundstream-pytorch/soundstream.py, line 347 in 9c6086e
Third, do the loss of the STFT discriminator and the reconstruction loss with the STFT count as duplicates?
from soundstream-pytorch.
Why log the mel spectrum?
Log mel-spectrum is believed to be close to human perception. S(x) above is linear in the power, but human perception is roughly linear in the log of the power. Probably the second part (5) is what we want to minimize.
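A tiny numeric example of why the log is useful: on a log scale, doubling the power adds the same amount whether the signal is quiet or loud. The helper `to_db` and its `eps` floor are hypothetical and not taken from the repository; the floor only keeps the log finite when the power is zero.

```python
import math

def to_db(power, eps=1e-10):
    """Power in decibels, with a small floor so that log(0) never occurs."""
    return 10.0 * math.log10(max(power, eps))

for p in (0.001, 1.0, 1000.0):
    # The linear difference grows with p, the log difference stays near 3.01 dB.
    print(p, 2 * p - p, round(to_db(2 * p) - to_db(p), 2))
```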
Second, your code uses STFT instead of mel spectrum. Is the STFT you use an intermediate parameter in the process of solving mel spectrum?
I didn't understand what 'intermediate parameter' is. But originally the loss formula comes from https://arxiv.org/pdf/2008.01160.pdf where they use STFT not Mel-spectrum like SoundStream paper.
Third,does the loss of the STFT discriminator and the loss of reconstruction with the STFT count as duplicates?
The STFT discriminator is based on the GAN technique. STFT reconstruction is auto-regression, which is different. GANs are used for audio generation (and image generation) because there are many possible audio outputs that sound good to humans but differ greatly under an auto-regressive loss, for example because of a shifted phase (or a shifted image). However, a GAN alone doesn't generate audio output that is close to the original, so they use the weaker auto-regressive task to help audio generation.
from soundstream-pytorch.
I didn't understand what 'intermediate parameter' is.
The MFCC computation I have learned is as follows: window the voice signal, get the energy distribution over the spectrum through the FFT (just like your STFT?), get the power spectrum by squaring the magnitude, get the mel spectrum through the mel filter bank, get the Fbank features by taking the log (just like the second part of the reconstruction loss formula?), and finally get the MFCC through the DCT.
So I guess the STFT your code uses is the 'intermediate parameter', that is, the energy distribution over the spectrum, and the log of the STFT, which is the second part of the reconstruction loss formula, is the Fbank, which is indeed a feature close to human perception.
from soundstream-pytorch.
Sorry to bother you again. I found that if I change batch_size from 32 to 16, training becomes 10 times faster. I haven't finished the training yet, so I don't know whether training at such a speed means that it is incomplete or ineffective. Besides, why did you set batch_size to 32 in the first place?
from soundstream-pytorch.
Usually you want to use the largest batch size your GPU allows, for efficiency. Doubling the batch size shouldn't increase the step time by more than a factor of two. Also, increasing the batch size generally gives more stable results, as larger batches have less variance.
I don't know why batch_size=16 is 10 times faster, but it is great if it is still stable.
from soundstream-pytorch.
Excuse me, have you ever tried to reconstruct the voice signal through the MFCC? Or have you ever seen someone else do it?
from soundstream-pytorch.
I haven't tried MFCC myself. I think MFCC is not as popular as melspec for voice features because deep-learning based models are strong enough; HiFi-GAN and MelGAN, for example, use melspec. But MFCC might be good (or no good) when calculating the reconstruction loss of vocoders.
from soundstream-pytorch.
soundstream-pytorch/soundstream.py, line 408 in 9c6086e
soundstream-pytorch/soundstream.py, line 409 in 9c6086e
Excuse me for bothering you again. May I ask why there are two losses here? I noticed that rec_loss is very large and g_loss is also large, which is how I found these two losses.
from soundstream-pytorch.
In my case g_rec_loss is around 10. Do you see other abnormalities?
g_stft_loss | g_wave_loss | g_feat_loss | g_rec_loss | q_loss | g_loss | codes_entropy | d_stft_loss | d_wave_loss | d_loss | num_replaced | epoch | step |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8.765625 | 2.03125 | 0.035614 | 13.462036 | 0.385002 | 20.735474 | 6.826962 | 0.0 | 1.387695 | 1.041016 | 0.0 | 24 | 21487 |
from soundstream-pytorch.
Did you change the mel-spectrum to STFT at the beginning because there are many negative numbers in the mel-spectrum? If the log is taken according to the formula in the figure above, the loss becomes NaN.
from soundstream-pytorch.
In my case g_rec_loss is around 10. Do you see other abnormalities?
I am trying to replace the discriminator in your code with the MSD and MPD modules of HiFi-GAN, but it has not been successful. The output speech after training is white noise. I have been looking for the reason and suspect that the loss does not converge. That is why the loss values you and I report look different.
In addition, I heard that HiFi-GAN's discriminator is the most useful discriminator at present, and I want to add it to your code. I have finished adding it and the code now runs, but the output speech after training is always white noise. I can't find the problem. Can you help me?
from soundstream-pytorch.
Previous test results (at first the loss was over 100, but it dropped to 20 as the step count increased)
Current test results (at first the loss was over 100, but it did not decrease as the step count increased)
The g_rec_loss in g_loss does not converge.
from soundstream-pytorch.
I'm not sure about the reason.
The HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf) uses a big lambda for generation. You could try tweaking hyperparameters, or try loading a known-good state_dict into the model to see if the loss is reasonable.
from soundstream-pytorch.
HiFiGAN paper uses big lambda for generation. The large lambda you mention acts on the third term of the generator loss, mel_loss. Does that mean the first term of the generator loss, adv_loss, changes less?
My test results are as follows:
- d_adv_loss (d_loss in the figure below) ≈ g_adv_loss. Both change very little, only in the fourth decimal place, as shown in the figure below.
- feat_loss with lambda_fm=2: I can't see anything unusual.
- mel_loss: following your approach, I replaced it with stft_loss, as mentioned in your code ("uses STFT instead of mel-spectrogram").
The balance of the three losses looks reasonable, but the training results are poor and the intelligibility is very low. I don't know what went wrong. Can you see the problem from the picture?
from soundstream-pytorch.
The first picture is codebook_loss obtained by running your code. The second picture is the code I ran last night (see the changes in the last two replies).
soundstream-pytorch/soundstream.py
Line 503 in 9c6086e
Do you know why this codebook_loss is like this? I just changed the batch_size
from soundstream-pytorch.
soundstream-pytorch/soundstream.py
Line 661 in 9c6086e
soundstream-pytorch/soundstream.py
Line 289 in 9c6086e
soundstream-pytorch/soundstream.py
Line 236 in 9c6086e
I am very sorry to bother you again. @kaiidams
First, if you set precision='16-mixed' in Trainer(), is mixed precision then applied automatically to all tensors?
Second, if you set '@torch.cuda.amp.autocast(enabled=False)' on a certain region, are the tensors in that region only 16-bit half precision?
Third, do other regions without any such decorator still use precision='16-mixed'?
Look forward to your answer, thank you
from soundstream-pytorch.
@torch.cuda.amp.autocast(enabled=False) should compute tensors in 32-bit floats. Using precision='16-mixed' should be okay, but it can be unstable in general. You may try disabling 16-mixed. The codebook_loss looks very bad to me. The quantizer replaces a code when it is not used much, so if batch_size is small, it may replace more codes.
soundstream-pytorch/soundstream.py
Line 297 in 9c6086e
How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.
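To make the three precision questions concrete, here is a minimal sketch of how an autocast region and a disabled sub-region interact. It uses CPU autocast with bfloat16 only so that it runs without a GPU; that choice and the helper name `sum_of_squares_fp32` are assumptions for illustration. With precision='16-mixed', Lightning wraps the step in an autocast context in a similar way.

```python
import torch

def sum_of_squares_fp32(x):
    # Inside this region autocast is switched off, and the input is cast
    # back to float32 explicitly before the numerically sensitive math.
    with torch.autocast(device_type=x.device.type, enabled=False):
        return torch.sum(x.float() ** 2)

x = torch.randn(4, 8)
w = torch.randn(8, 8)

# CPU autocast with bfloat16 stands in for the GPU mixed-precision setting.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    h = torch.mm(x, w)             # eligible op, runs in reduced precision
    s = sum_of_squares_fp32(h)     # computed in float32

print(h.dtype, s.dtype)
```

So a region marked with enabled=False is the full-precision part, and everything else inside the autocast context may run in reduced precision for eligible ops.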
from soundstream-pytorch.
How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.
I corrected the code today, replacing the wave and STFT discriminators with the MSD and MPD discriminators. These diagrams show the state at epoch=0, but training takes too long and I don't know why.
from soundstream-pytorch.
When epoch=12, d_loss and g_loss gradually decrease. Although the decrease is not large, when epoch=14, d_loss suddenly decreases by 50%, and g_loss suddenly increases to 150%. Can you point out the problem? I'm a newbie and don't know the direction of the problem, please give me some advice.
from soundstream-pytorch.
In general you could try several things.
- Pick the checkpoint before the sudden change and see if the model can produce reasonable audio output.
- If the above is okay, try looking at the norm of weights, e.g. torch.sum(torch.square(model.xxx.yyy.weight)). If it is too large, you can apply stronger weight decay. Often, weights closer to the output tend to explode (not sure if this is the case this time).
- Or you could try decreasing learning_rate. A learning rate that is too large may cause the optimizer to jump past an optimal minimum point.
from soundstream-pytorch.
- Pick the checkpoint before the sudden change and see if the model can produce reasonable audio output.
I think the output audio is very bad, even at epoch=56. You can listen to it; the difference is large. Even though my modified code was trained for more than 10 times as many epochs as your code, the output voice quality is much worse.
https://drive.google.com/drive/folders/1gdBJtyc7IKReAWi-V2lVFf1fqkIAuMVK?usp=drive_link
- If the above is okay, try looking at norm of weights like torch.sum(torch.square(model.xxx.yyy.weight)). If it is too large, then you can apply stronger weight decay. Often, weights closer to the output tend to explode (Not sure if this is the case this time.)
I didn't find the weight you said. Can you point it out to me? thank you.
- Or you could try decreasing learning_rate. This may cause jump beyond an optimal minimum point.
Yes, I changed the discriminator's learning rate from 0.0001 to 0.00001 (a 10x drop, as you suggested) and the discriminator optimizer's beta coefficients from 0.5 to 0.8 (b1) and from 0.9 to 0.99 (b2) (these two values come from the HiFiGAN paper).
This modification does allow the model to train stably, without the problem discussed three days ago. Thank you for your advice. However, a new problem has emerged: even at epoch=56, the output speech is still not intelligible, as described in my first point above.
Can you give me some more guidance on what to do next?
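For reference, the optimizer change described above could look roughly like this. It is a sketch under assumptions: the tiny `discriminator` module is a hypothetical stand-in for the real MSD/MPD discriminators, and Adam is assumed as the optimizer; only the learning rate and beta values come from this thread.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real discriminator, for illustration only.
discriminator = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=15, padding=7),
    nn.LeakyReLU(0.1),
)

d_optimizer = torch.optim.Adam(
    discriminator.parameters(),
    lr=1e-5,             # reduced from 1e-4
    betas=(0.8, 0.99),   # changed from (0.5, 0.9)
)

print(d_optimizer.param_groups[0]["lr"], d_optimizer.param_groups[0]["betas"])
```

Lowering only the discriminator's learning rate changes the balance between generator and discriminator, so the generator settings may need revisiting as well.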
from soundstream-pytorch.
How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.
from soundstream-pytorch.
I didn't find the weight you said. Can you point it out to me? thank you.
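If you try something like this, you can see that they are not too large. A much larger norm would mean the weights have exploded.
# Print the mean squared value of every tensor in the checkpoint.
for n, p in model.state_dict().items():
    w = torch.mean(p.float() ** 2).item()
    print(n, w)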
encoder.layers.0.weight 0.08002594858407974
encoder.layers.0.bias 0.032715506851673126
encoder.layers.2.layers.0.conv0.weight 0.0024174670688807964
encoder.layers.2.layers.0.conv0.bias 0.0023794351145625114
encoder.layers.2.layers.0.conv1.weight 0.01180010475218296
encoder.layers.2.layers.0.conv1.bias 0.006960079539567232
encoder.layers.2.layers.1.conv0.weight 0.00261313421651721
encoder.layers.2.layers.1.conv0.bias 0.0012120403116568923
encoder.layers.2.layers.1.conv1.weight 0.011738475412130356
encoder.layers.2.layers.1.conv1.bias 0.0076542804017663
encoder.layers.2.layers.2.conv0.weight 0.002758147194981575
Codebook entropy and num_replaced look good to me.
Can you give me some more guidance on how to do the next?
I think training an audio codec is unstable. They added many losses, which I think is what they arrived at after many failures.
Meta's Encodec introduced a loss balancer so that you don't have to tune loss weights. I think it is something that can be tried.
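As a rough idea of what a loss balancer does, here is a simplified sketch: the gradient of each loss with respect to the model output is normalized to unit norm and then mixed according to fixed weights, so no single loss dominates because of its scale. This is not Encodec's implementation (which, as far as I know, also averages the gradient norms over time); `balanced_backward`, `total_norm` and the toy losses are hypothetical.

```python
import torch

def balanced_backward(losses, weights, output, total_norm=1.0, eps=1e-12):
    # losses: dict name -> scalar loss computed from `output`
    # weights: dict name -> relative importance of each loss
    total_weight = sum(weights.values())
    combined = torch.zeros_like(output)
    for name, loss in losses.items():
        # Gradient of this loss with respect to the model output only.
        (grad,) = torch.autograd.grad(loss, [output], retain_graph=True)
        share = total_norm * weights[name] / total_weight
        combined = combined + share * grad / (grad.norm() + eps)
    # Push the balanced gradient back through the rest of the model.
    output.backward(combined)

# Toy usage: two losses whose raw scales differ by orders of magnitude.
p = torch.ones(3, requires_grad=True)
out = p * 2.0
losses = {"rec": 1000.0 * out.sum(), "adv": (out ** 2).sum()}
balanced_backward(losses, {"rec": 1.0, "adv": 1.0}, out)
print(p.grad)
```

With this scheme the weights express the fraction of the gradient each loss should contribute, independent of the raw loss magnitude.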
from soundstream-pytorch.
I'm sorry to bother you again, but I'd like to take your pretrained model (16kHz pretrained model) as the baseline.
I will list the bitrate for each number of codebooks; please judge whether it is correct.
number of codebooks | bitrate (bps) |
---|---|
1 | 500 |
2 | 1000 |
3 | 1500 |
4 | 2000 |
5 | 2500 |
6 | 3000 |
7 | 3500 |
8 | 4000 |
I want to experiment with 2400 bps and 1200 bps, which do not appear in the table. Can they only be achieved by changing step [2 4 5 8] and segment_length [35710]? If so, it seems impossible to use your pretrained model, and that would require retraining, right?
from soundstream-pytorch.
Your bps table looks good to me; it is 4 kbps at most. Segment size is not related to bps, but step [2 4 5 8] is. You could probably interpolate results if you don't want to retrain the model. The original SoundStream paper also does not compare at equal bps.
from soundstream-pytorch.
Sorry to bother you again. For the bit rate, I calculated it like this:
fs = 16000 samples/s and [2,4,5,8] gives 2*4*5*8 = 320 samples/frame; dividing the two gives 50 frames/s.
Each frame is quantized with 8 codebooks and each codebook has 1024 entries, so num_codebooks=8 and codebook size=1024 (10 bits), giving 8*10 = 80 bits/frame. Then 80 bits/frame * 50 frames/s = 4000 bits/s.
But I found that each codebook is actually 1024 * 512, so should the codebook size be 10 * 9 = 90 bits rather than 10 bits?
from soundstream-pytorch.
The size of codebook is here
num_quantizers is the number of codebooks.
num_embeddings is the codebook size, i.e. the number of embeddings in a codebook.
embedding_dim is the dimension of an embedding vector.
num_quantizers (8) and num_embeddings (1024) determine the bitrate, but embedding_dim does not. 80 bits/frame should be right.
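The bitrate arithmetic discussed above can be written down as a small function. This is just a worked example of the numbers in this thread; `bitrate_bps` is a hypothetical helper, not part of the repository. Note that embedding_dim does not appear anywhere in it.

```python
import math

def bitrate_bps(sample_rate, strides, num_quantizers, num_embeddings):
    hop = math.prod(strides)                    # samples per frame (320 here)
    frames_per_second = sample_rate / hop       # 50 frames/s at 16 kHz
    bits_per_code = math.log2(num_embeddings)   # 10 bits for 1024 entries
    return frames_per_second * num_quantizers * bits_per_code

# Reproduce the table: 1 to 8 codebooks at 16 kHz with strides [2, 4, 5, 8].
for n in range(1, 9):
    print(n, bitrate_bps(16000, [2, 4, 5, 8], n, 1024))
```

With this hop size, each additional codebook adds 500 bps, so rates such as 1200 or 2400 bps are not reachable by dropping codebooks alone.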
from soundstream-pytorch.
yes, you are right. I get it.
These days, I downloaded your trained model (16kHz), used the test-clean set of LibriSpeech as the test set, ran the test, and obtained the ViSQOL score.
I calculated that the bit rate should be 4 kbps, but ViSQOL is only 2.16. I guess that is because it was trained for too few epochs?
testfile_score.csv
from soundstream-pytorch.
Companies often try several configurations and hyperparameters and pick the best one. I did not try as many as they normally do. However, I guess one of the reasons is that my model was trained with too little training data compared with "A. Datasets" in the SoundStream paper. I tried several Japanese audio clips from https://github.com/kaiidams/Kokoro-Speech-Dataset with my model. The results were not very good compared with English audio from LJSpeech.
from soundstream-pytorch.
Thank you for your reply
I am also thinking about how to adjust the parameters to improve ViSQOL. Do you have any suggestions? I can try to improve the model.
- Changing the dataset now doesn't seem to be a good idea, because my research direction is low bit rate: the 16kHz dataset (from your code) is suitable for low bit rates, while the 24kHz dataset (from SoundStream) may not be. In addition, I have already taken your existing model as the baseline.
- What would you suggest to improve ViSQOL?
from soundstream-pytorch.
The size of codebook is here
num_quantizers is the number of codebooks.
num_embeddings is the codebook size, i.e. the number of embeddings in a codebook.
embedding_dim is the dimension of a embedding vector.
The num_quantizers (8) and num_embeddings (1024) determines the bitrate, but not embedding_dim. 80 bit /frame should be right.
The input [32,1,32270] is encoded to [32,102,512]. I found that someone calculated the bit rate as 8x2(log102)=16bit rather than 8x10(log1024)=80bit, claiming that although 10 bits (log1024) are used to quantize, in fact only 2 bits (log102) are used. What do you think about that?
from soundstream-pytorch.
[32,1,32270] through encode to [32,102,512], I found that someone calculated the bit rate by 8x2(log102)=16bit, not 8x10(log1024)=80bit, and claimed that although 10bit(log1024) was used to quantize, but in fact only 2bit(log102) was used. What do you think about that?
If batch_size=32 and the number of time steps is 102, each time step is encoded with 80 bits (= 8 * log2(1024)). In total, each segment is encoded into 102 * 80 = 8160 bits. 32270/16000 is 2.016875 sec. 8160/2.016875/1024 = 3.95 kbps (not exactly 4 kbps because of rounding).
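The per-segment calculation can be checked in a few lines. The values are the ones quoted in this thread; the variable names are only for illustration.

```python
import math

frames = 102              # time steps produced for one segment
num_quantizers = 8
num_embeddings = 1024
segment_length = 32270    # samples per segment
sample_rate = 16000

bits_per_frame = num_quantizers * math.log2(num_embeddings)
total_bits = frames * bits_per_frame
duration = segment_length / sample_rate     # seconds
bps = total_bits / duration

print(bits_per_frame, total_bits, duration, bps, bps / 1024)
```

The number of time steps (102) only affects how many frames are sent, not how many bits each code needs, so log(102) does not enter the per-frame cost.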
from soundstream-pytorch.
import torchaudio
import torch
model = torch.hub.load("kaiidams/soundstream-pytorch", "soundstream_16khz")
x, sr = torchaudio.load('input.wav')
x, sr = torchaudio.functional.resample(x, sr, 16000), 16000
with torch.no_grad():
    y = model.encode(x)
    # y = y[:, :, :4]  # if you want to reduce code size.
    z = model.decode(y)
torchaudio.save('output.wav', z, sr)
Hi, sorry to bother you again.
- In this validation code, y = model.encode(x) should be y = model.encoder(x), and z = model.decode(y) should be z = model.decoder(y), right?
soundstream-pytorch/soundstream.py
Line 606 in 9c6086e
At the beginning of training, you normalized the data. Does denormalization need to be added to the validation code above?
from soundstream-pytorch.
Right, as it was normalized in training time, it should be normalized in prediction time. Also, if you want the original audio strength, you need to denormalize it.
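A minimal sketch of normalizing before the codec and restoring the original level afterwards is shown below. The exact normalization used in training is not shown in this thread, so peak normalization here is an assumption, and `codec` is a hypothetical stand-in for model.decode(model.encode(x)).

```python
import torch

def peak_normalize(x, eps=1e-8):
    # Scale the waveform so that its largest absolute sample is 1.
    scale = x.abs().max().clamp(min=eps)
    return x / scale, scale

def codec(x):
    # Hypothetical stand-in for model.decode(model.encode(x)).
    return x.clone()

x = 0.3 * torch.randn(1, 16000)        # one second of toy audio at 16 kHz
x_norm, scale = peak_normalize(x)      # normalize as at training time
with torch.no_grad():
    z = codec(x_norm) * scale          # denormalize to the original level

print(x_norm.abs().max(), scale)
```

Whatever normalization was used in training should be mirrored at prediction time; the scale factor only needs to be kept if the original loudness matters.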
from soundstream-pytorch.
https://drive.google.com/drive/folders/1p2K09Am-Paz4I-H39uXYj1pOzFiTBjeL?usp=drive_link
I'm sorry to bother you. Do you know why the SoundStream output audio has an electrical-current-like noise, as in the files above?
Have you ever met a problem like this before?
How do I get over it? Can you give me some advice?
from soundstream-pytorch.
Your output audio clips sound slower than the originals. Probably the sampling rate is wrong somewhere. For example, the model could have been trained at 22.5kHz but run at 16kHz.
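A quick way to reason about such a mismatch is to compare durations: the same samples played at a lower rate last longer and sound slower. The example rates below are the ones mentioned above; the clip length and helper names are hypothetical.

```python
def duration_seconds(num_samples, rate):
    return num_samples / rate

def playback_speed(playback_rate, true_rate):
    # Below 1.0 means the audio sounds slower (and lower-pitched) than intended.
    return playback_rate / true_rate

num_samples = 45000                            # hypothetical clip: 2 s at 22.5 kHz
true = duration_seconds(num_samples, 22500)    # intended duration
wrong = duration_seconds(num_samples, 16000)   # duration when saved as 16 kHz

print(true, wrong, playback_speed(16000, 22500))
```

Comparing the duration of the output file with the input file is therefore a cheap first check before looking at the model itself.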
from soundstream-pytorch.
I also want to try training a new set of data, so I'll run your code first. I don't know the reason for the following problem that occurred. Will deleting it have an impact?
from soundstream-pytorch.
StreamableModel is derived from LightningModule, which implements save_hyperparameters().
from soundstream-pytorch.