Thanks for your code, but I want to learn how to use your model to train a new set of


How to train a new set of data? (kaiidams/soundstream-pytorch)

a897456 commented on June 12, 2024 2

You can listen to the reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint. You can load the model as described at https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html .
 model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")
First: I think I have completed the reconstruction of the voice signal. I followed your method and completed the main function. Second: What do you think the PESQ score of the output file is? The input and output files are linked below.

https://drive.google.com/drive/folders/1mvyg_CRxI6LGlVXbYu0OHKmhiAV_E-bK?usp=drive_link

from soundstream-pytorch.

kaiidams commented on June 12, 2024 1

You can listen to the reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint. You can load the model as described at https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html .

 model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")

a897456 commented on June 12, 2024 1

Yes, you can change the segment size, unless the size is too short. 5,000 might be too short as it is computationally inefficient when the segment size is too short.

I made a mistake: I meant to ask about 50000, not 5000, because I thought that the larger the number, the more of X it would cover. In fact, I found that increasing the number doubled the training time. So I'll pick an appropriate integer around the 32270 you set, not too big or too small, such as 30000-35000.

We have 102 embedding vectors along the time axis. Each embedding vector is quantized using 8 codebooks, which produces 102 * 8 codes, i.e., tensor(32, 102, 8).

Thanks, I get it. My idea is to reduce the step size while increasing the segment_length and keeping num_codebooks the same (because I think these three parameters may matter for lower bit rate compression), and then look at the PESQ score.

kaiidams commented on June 12, 2024 1

torch.nn.init.normal_(model.quantizer.weight, mean=mean, std=std)

This initializes the codebook. The weight of the codebook is 8 x 1024 x 512: the number of codebooks is 8 and the number of codes in one codebook is 1024. At the beginning of training, we want this to be close to the distribution of the encoder's outputs. On page 5 of the paper, it says,

initialization for the codebook vectors, we run the k-means
algorithm on the first training batch and use the learned
centroids as initialization

I skipped this and just initialized the codebook from a Gaussian fitted to the first training batch, as I didn't want to run k-means in the training code. The code calls the model for this, but it is required only once.
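As a rough illustration of the Gaussian initialization described above, here is a minimal standard-library sketch. It assumes a single scalar mean and standard deviation taken over all values of one batch of encoder outputs; how soundstream.py actually computes `mean` and `std` is not shown in this thread, and the function name `gaussian_codebook_init` is hypothetical.

```python
import random
import statistics


def gaussian_codebook_init(batch_values, num_codes, dim, seed=0):
    """Draw codebook entries from a Gaussian fitted to one batch of values.

    batch_values: flat list of encoder output values from the first batch
    (assumption: one scalar mean/std over all values, not per dimension).
    """
    rng = random.Random(seed)
    mean = statistics.fmean(batch_values)
    std = statistics.pstdev(batch_values)
    # One row per code, one column per embedding dimension.
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(num_codes)]


codebook = gaussian_codebook_init([0.0, 1.0, 2.0, 3.0], num_codes=4, dim=3)
print(len(codebook), len(codebook[0]))  # 4 3
```

Unlike k-means, this does not place codes at cluster centroids; it only matches the overall scale of the encoder outputs.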

kaiidams commented on June 12, 2024 1

Yes, you're right. The STFT is used to compute the mel-spectrum. The SoundStream paper uses the mel-spectrum, while https://arxiv.org/pdf/2008.01160.pdf uses the STFT.

kaiidams commented on June 12, 2024

Do you mean you want to train the soundstream model with new training data, or to train another model that uses the output of soundstream as features? In the first case, you can run python soundstream.py, which should download LIBRISPEECH under ./data and start training.

a897456 commented on June 12, 2024

Thank you for your reply.
First, I found that your soundstream model needs to download data, including YESNO, LIBRISPEECH, or librispeech, which is very time-consuming, so I downloaded other data in advance.
Second, I mean the first case: I want to use your soundstream model to train on a new dataset with a sample rate of 8 kHz, which I have already downloaded, but I don't know how to load it into your model.

a897456 commented on June 12, 2024

My ultimate goal is to achieve low bit rate compression. I would like to train your model on a dataset with a sample rate of 8 kHz, then change num_embeddings from 1024 to 256 and num_quantizers from 8 to 6, and see what the end result is.

a897456 commented on June 12, 2024

First

Then

kaiidams commented on June 12, 2024

ds is not a string holding a directory path, but a torch.utils.data.Dataset. If you want to train an 8 kHz model with LIBRISPEECH, you can change sample_rate. If you want to use your custom dataset, you can implement your own Dataset, which should not be too difficult.
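A map-style dataset only needs `__len__` and `__getitem__`, so a custom one can be quite small. The sketch below uses only the standard library and in-memory waveforms so that it runs anywhere; the class name `WaveformSegmentDataset` is hypothetical, and the item format that soundstream.py expects (tensor shape, dtype, sample rate) is not stated in this thread. In real use you would presumably subclass `torch.utils.data.Dataset`, load audio files from disk, and return tensors.

```python
import random


class WaveformSegmentDataset:
    """Map-style dataset returning one fixed-length random crop per waveform."""

    def __init__(self, waveforms, segment_length, seed=0):
        self.waveforms = waveforms            # list of sequences of float samples
        self.segment_length = segment_length  # e.g. 32270 in this thread
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, index):
        wave = list(self.waveforms[index])
        if len(wave) < self.segment_length:
            # Zero-pad clips that are shorter than one segment.
            wave = wave + [0.0] * (self.segment_length - len(wave))
        # Pick a random start so that the crop fits inside the clip.
        start = self.rng.randrange(len(wave) - self.segment_length + 1)
        return wave[start:start + self.segment_length]


ds = WaveformSegmentDataset([[0.1] * 100, [0.2] * 10], segment_length=32)
print(len(ds), len(ds[0]), len(ds[1]))  # 2 32 32
```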

a897456 commented on June 12, 2024

Excuse me again, I have successfully started training, and the training data is the same as yours. The difference is that my data was downloaded in advance. During the training process, when the epoch was 98, an inexplicable error occurred, which seemed to be a problem with the data. However, the data was the same as yours, so I don't understand why this error occurred. Have you encountered it before?

a897456 commented on June 12, 2024

I tried resuming training from epoch 98 and found no errors, so I consider this issue worked around for now.
Second question: I could not find a testing procedure for the final model. Can you provide guidance?

kaiidams commented on June 12, 2024

If you just want to hear the output yourself, you can encode the audio file by calling the forward() method.

soundstream-pytorch/soundstream.py

Line 500 in 9c6086e

def forward(self, input):

If you want to compute ViSQOL, sorry, there is no implementation for that.

a897456 commented on June 12, 2024

First, how do you determine whether your model is useful? What are the evaluation metrics?
Second, how can I use an output such as "epoch=84-step=150000.ckpt" to check the usability of the model?

a897456 commented on June 12, 2024

You can listen to the reconstructed audio here: https://github.com/kaiidams/soundstream-pytorch#sample-audio . You may reconstruct some of your own audio files to judge whether it is good enough for your purpose. The checkpoint is a PyTorch Lightning checkpoint. You can load the model as described at https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html .
 model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")

First, I think I have completed 150 training epochs, as shown in the picture.
Second, regarding the "model = StreamableModel.load_from_checkpoint("/path/to/checkpoint.ckpt")" call you mentioned: can I substitute the .ckpt file marked in the second image to reconstruct my speech signal, in order to verify the usefulness of the model?
Third, is the .ckpt file marked in the second image the final trained model?
I'm sorry, I'm a novice, so I may be bothering you with many ignorant questions.

a897456 commented on June 12, 2024

First: I think I have completed the reconstruction of the voice signal. I followed your method and completed the main function.

Second: What do you think the PESQ score of the output file is? The input and output files are located below.

a897456 commented on June 12, 2024

https://drive.google.com/drive/folders/1mvyg_CRxI6LGlVXbYu0OHKmhiAV_E-bK?usp=drive_link

Firstly, I'm sorry, please forgive me for being a novice. I had not enabled access to the audio files earlier; you can access them now.
Secondly, the model saved a few days ago at epoch 84, when an unknown error occurred during training, and the model saved after 150 epochs of training produce significantly different output files. Is the number of epochs too large?

kaiidams commented on June 12, 2024

SoundStream has a couple of loss functions. You can use TensorBoard to look at these losses. If some of them have strange behavior you may adjust parameters.

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L175C1-L180C1

For 16 kHz I used the same strides as the original paper, 2 * 4 * 5 * 8 = 320 samples per window, which gives 50 Hz embeddings.
The original SoundStream is for 24 kHz, which gives 75 Hz embeddings. So an 8 kHz model has to compress a longer audio window into one embedding.
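The relation between strides, sample rate, and embedding rate above is plain arithmetic and can be checked with a short sketch. The helper names `frame_rate_hz` and `window_ms` are made up for illustration.

```python
from math import prod


def frame_rate_hz(sample_rate, strides):
    """Embeddings per second: the sample rate divided by the total stride."""
    return sample_rate / prod(strides)


def window_ms(sample_rate, strides):
    """Duration in milliseconds of the audio covered by one embedding hop."""
    return 1000 * prod(strides) / sample_rate


strides = (2, 4, 5, 8)                  # total stride of 320 samples
print(frame_rate_hz(16000, strides))    # 50.0
print(frame_rate_hz(24000, strides))    # 75.0
print(window_ms(16000, strides))        # 20.0
print(window_ms(8000, strides))         # 40.0
```

With the same strides, an 8 kHz input gives 25 embeddings per second, each covering 40 ms of audio instead of 20 ms.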

a897456 commented on June 12, 2024

First, I studied TensorBoard for several hours today, but I haven't made any progress yet. My understanding is this: TensorBoard is used when training with the model.fit() function, but for now I train with the pl.Trainer.fit() function. Do I need to change the training function if I want to use TensorBoard? How should I use TensorBoard?
Second, you mentioned that "So 8kHz model has to compress a longer audio window into an embedding". I want to change the window size to 2 * 2 * 5 * 8 = 160. Am I understanding this correctly?

kaiidams commented on June 12, 2024

It seems that TensorBoard is not enabled by default. If you enable it, you'll find lightning_logs/version_X/event.xxxx.yyyy in your output. You can launch it with tensorboard --logdir lightning_logs/version_X/

https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L663C26-L663C26

Alternatively, you can find the CSV file lightning_logs/version_X/metrics.csv.

2 * 2 * 5 * 8 = 160 makes the window size 160. I think it is a good number.
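If TensorBoard is inconvenient, the metrics.csv file mentioned above can be inspected with the standard library. This is a minimal sketch assuming a comma-separated file whose header contains `step` plus one column per loss. The sample row reuses a few values posted later in this thread; the exact set of columns in your own file may differ, and `read_metric` is a hypothetical helper.

```python
import csv
import io

# Stand-in for open("lightning_logs/version_X/metrics.csv").
SAMPLE = """g_rec_loss,g_loss,d_loss,epoch,step
13.462036,20.735474,1.041016,24,21487
"""


def read_metric(csv_file, name):
    """Return (step, value) pairs for one metric, skipping blank cells."""
    rows = csv.DictReader(csv_file)
    return [(int(r["step"]), float(r[name])) for r in rows if r.get(name)]


print(read_metric(io.StringIO(SAMPLE), "g_rec_loss"))  # [(21487, 13.462036)]
```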

a897456 commented on June 12, 2024

First, after seeing your reply, I spent several hours trying to open TensorBoard and found that just setting logger=True would suffice. I'm so happy.

Second, suppose I want a low-bitrate codec, such as 1.2 kbps at an 8 kHz sampling rate. With a window size of 320, each frame gets 1200 * 320 / 8000 = 48 bits, which I can quantize with six 8-bit codebooks. If we still use six 8-bit codebooks to reach 0.6 kbps, we need 600 * 640 / 8000 = 48 bits, which means the window size changes from 320 to 640. So it seems I need to increase the window size. Do you agree?
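The window-size reasoning in this question can be restated as a formula: for a fixed number of bits per frame, the window size is bits_per_frame * sample_rate / bitrate. A small sketch follows; the function name `window_for_bitrate` is hypothetical.

```python
def window_for_bitrate(bits_per_frame, sample_rate, bitrate_bps):
    """Samples per frame needed to hit a target bitrate at a fixed frame size in bits."""
    return bits_per_frame * sample_rate / bitrate_bps


bits_per_frame = 6 * 8  # six 8-bit codebooks
print(window_for_bitrate(bits_per_frame, 8000, 1200))  # 320.0
print(window_for_bitrate(bits_per_frame, 8000, 600))   # 640.0
```

This only checks the arithmetic. Whether a 640-sample window can be learned well is a separate question.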

kaiidams commented on June 12, 2024

If you just want low-bitrate compression, you can simply reduce the number of quantizers, without retraining.

soundstream-pytorch/soundstream.py

Line 241 in 9c6086e

n = random.randrange(1, self.num_quantizers)

Many of the latest neural vocoders (SoundStream and Meta's EnCodec, https://github.com/facebookresearch/encodec) adopt a hierarchical quantized autoencoder so that they can achieve adjustable bit rates. However, note that dropping quantizers doesn't reduce the computational cost.

I think a longer window size is difficult to learn, as an audio signal is stationary over a short time, but not over a longer time.

a897456 commented on June 12, 2024

First, I studied traditional speech compression before, which uses codebooks for quantization. For example, with 48 bits to quantize a feature value, the 48 bits are distributed like this: the number of codebooks is 6, and each codebook has 8 bits, that is, 2^8 = 256 representative vectors.
But I may have misunderstood soundstream, because I always thought num_quantizers corresponded to num_codebook in the traditional compression algorithm and num_embeddings corresponded to dim_codebook. I should treat embedding_dim in soundstream as dim_codebook, right?

Second, in your model, num_quantizers=6, num_embeddings=1024, and embedding_dim=512. How do I calculate the compression bitrate in kbps? Can you show me?

Third, I've been following EnCodec, but I haven't started studying it yet. You say soundstream also adopts a hierarchical quantized autoencoder. Is that implemented in your code? If so, can you show me where?

Fourth, you are right that the window size cannot be greater than 320, since the short-term stationarity of speech lasts about 20-30 ms. But I would like to ask: can the multi-frame joint quantization idea from traditional compression algorithms be used in soundstream?

a897456 commented on June 12, 2024

    [self.register_buffer("code_count", torch.empty(num_quantizers, num_embeddings))](https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L225)

Regarding the second question I asked yesterday, I have a new idea, and I do not know if it is correct. The number of bits per frame is num_quantizers * log2(num_embeddings) = 80 bits, and the bit rate is 80 bits * (16000 Hz / 320 = 50 frames per second) = 4 kbps. Am I right?

kaiidams commented on June 12, 2024

First, I studied traditional speech compression before, using codebooks to quantify, such as 48bit to quantify a feature

Yes, you are right, num_quantizers is the number of codebooks. SoundStream has 8 codebooks and each codebook has 1024 codes, so one frame is encoded with 8 * log2(1024) = 80 bits. In the original paper, the frame rate is 75 Hz for a 24 kHz sampling rate. This produces 75 * 80 = 6 kbps, or 4 kbps in the case of 16 kHz.
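The bitrate calculation above can be written as a small function. This is only a restatement of the arithmetic in this reply; the function name `bitrate_bps` is hypothetical.

```python
from math import log2, prod


def bitrate_bps(num_quantizers, num_embeddings, sample_rate, strides):
    """Bits per second = bits per frame * frames per second."""
    bits_per_frame = num_quantizers * log2(num_embeddings)
    frames_per_second = sample_rate / prod(strides)
    return bits_per_frame * frames_per_second


strides = (2, 4, 5, 8)
print(bitrate_bps(8, 1024, 24000, strides))  # 6000.0
print(bitrate_bps(8, 1024, 16000, strides))  # 4000.0
```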

Third, I've followed Encodec, but I haven't started learning it yet. You say soundstream also adopt hierarchical quantized autoencoder. Is that mentioned in your code? If you have finished that part , can you show me?

The algorithm is explained in Algorithm 1: Residual Vector Quantization of https://arxiv.org/pdf/2107.03312.pdf. This produces 8 x 10-bit codes, in which the first code is the most important and the last is the least important. You can approximately reproduce the original vector using only some of the codes, for example the first 5, and then you achieve 5 * 10 * 50 = 2.5 kbps.

Here, you can pass n codes, where n is between 1 and 8, at inference time.

soundstream-pytorch/soundstream.py

Line 264 in 9c6086e

assert 0 < n <= self.num_quantizers

At training time, it drops the less important codes randomly so that it can reproduce audio with only the important codes.

soundstream-pytorch/soundstream.py

Line 241 in 9c6086e

n = random.randrange(1, self.num_quantizers)
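To make the idea of keeping only the first n codes concrete, here is a toy residual vector quantizer in plain Python. It is a sketch of the general technique with tiny hand-made codebooks, not the code from soundstream.py; all names and codebook values are made up for illustration.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(codebooks, vector):
    """Each stage quantizes the residual left over by the previous stages."""
    residual, codes = list(vector), []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codebooks, codes, n):
    """Reconstruct using only the first n codes (coarse to fine)."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, index in zip(codebooks[:n], codes[:n]):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],                   # coarse stage
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]  # refinement stage
]
codes = rvq_encode(codebooks, [1.2, 0.8])
print(codes)                            # [1, 1]
print(rvq_decode(codebooks, codes, 1))  # [1.0, 1.0]
print(rvq_decode(codebooks, codes, 2))  # [1.25, 0.75]
```

Decoding with fewer codes gives a coarser reconstruction at a lower bitrate. As a side note on the quoted line, Python's `random.randrange(1, k)` returns integers from 1 to k - 1 inclusive.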

a897456 commented on June 12, 2024

Thank you for your reply. Your reply is my motivation to continue studying. My understanding is this: I just load your pre-trained model (soundstream_16khz-20230425.ckpt) and then change the value of n to achieve a variety of bit rates with no need to retrain, for example n = 4 gives 4 * 10 * 50 = 2 kbps and n = 2 gives 2 * 10 * 50 = 1 kbps.

Here, you can pass n codes, where n is between 1 and 8, at inference time.

I want to achieve a lower speech compression bit rate by changing the sampling rate to 8 kHz (which should be the lowest) and changing the step size to 2 * 4 * 5 * 6 = 240, which corresponds to the 8 kHz sample rate. 'num_quantizers=8' and 'num_embeddings=1024' remain unchanged, with epoch=200. Then I will compare the results with your 16 kHz model while changing 'n' in step.

soundstream-pytorch/soundstream.py

Line 264 in 9c6086e

assert 0 < n <= self.num_quantizers

At training time, it drops the less important codes randomly so that it can reproduce audio with only the important codes.

Can n equal 1, which keeps only the most important code?

a897456 commented on June 12, 2024

When I change the step size to 2 * 4 * 5 * 6 = 240, I need to change the segment_length from 32270 to 30430, which I calculated so that the steps divide it exactly. So I would like to ask whether the 32270 you set was also chosen to be divisible by the step size. Can I make it bigger or smaller? I wonder if it could be bigger, because then it would include more of the content X. Am I right?

a897456 commented on June 12, 2024

I am a PhD student and I want to publish a paper based on soundstream, but I haven't found any innovation yet. Can you give me some guidance on soundstream? For example, where could I continue to improve it?

At first I wanted to use soundstream to achieve lower bitrates, but I found that soundstream already supports this by changing 'n', or by retraining with new 'num_quantizers' and 'num_embeddings', so I couldn't find a new idea. Can you give me a hint? Thanks.

a897456 commented on June 12, 2024

In the paper, the authors mentioned that the coding rate is guaranteed to remain the same, and different step sizes will not affect the final score.

So my idea of retraining a new model to achieve a lower bit rate by changing the step size, 2 * 4 * 5 * 6 = 240, might not work.

kaiidams commented on June 12, 2024

Your reply is my motivation to continue studying.

Thank you! I'm glad to hear that.

I need to change the segment_length from 32270 to 30430,

32270 is a nice number because it makes the output length of the decoder the same as the input length of the encoder. They are sometimes different because of rounding. I think 30430 is a good number for 2 * 4 * 5 * 6.

For example, where can I continue to improve soundstream?

I'm not sure, but you may try a variable rate. SoundStream is fixed-rate in time. Fewer bits might be enough when the audio signal is not so complicated.

BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.

a897456 commented on June 12, 2024

If you just want to hear the output yourself, you can encode the audio file by calling the forward() method.

soundstream-pytorch/soundstream.py

Line 500 in 9c6086e

def forward(self, input):

If you want to compute ViSQOL, sorry, there is no implementation for that.

https://github.com/aliutkus/speechmetrics
I tested it with PESQ today and found that PESQ didn't work very well. Did you not use ViSQOL or PESQ test tools at that time?

a897456 commented on June 12, 2024

For example, where can I continue to improve soundstream?

I'm not sure, but you may try a variable rate. SoundStream is fixed-rate in time. Fewer bits might be enough when the audio signal is not so complicated.

Can you be more specific? Because I am a beginner in audio compression and my research direction is very low bit rate compression, I feel that you are an expert in this field, so I would like to hear your specific opinion.

a897456 commented on June 12, 2024

BTW, Meta's EnCodec https://github.com/facebookresearch/encodec is almost the same as SoundStream. They claim that using a balancer stabilizes training. SoundStream's loss weights are manually tuned.

I have already been following two models, soundstream and EnCodec, which are very close to my research direction. My plan is this: for my first paper, I want to do some research based on soundstream, but I haven't found a suitable research topic yet. For the second paper, I want to do some research based on EnCodec. So I have been studying soundstream recently and will start studying EnCodec after the New Year.

kaiidams commented on June 12, 2024

https://github.com/aliutkus/speechmetrics
I tested it with PESQ today and found that PESQ didn't work very well. Did you not use ViSQOL or PESQ test tools at that time?

Thank you for the input. I have used neither of them. I'll look into them.

Can you be more specific?

This is just a random idea. The codec keeps a constant rate, which is chosen by the user. For example, if you decide to use the first 5 codebooks, then the data rate is constantly 6 kbps * 5 / 8 = 3.75 kbps, even when there is no audio signal. Maybe the codec could decide by itself how many codebooks to use for the given audio, using more bits for important frames and fewer bits for less important frames.
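As a back-of-the-envelope illustration of the variable-rate idea, the average bitrate can be computed when each frame uses a different number of codebooks. This sketch assumes 10-bit codes and the 75 Hz frame rate of the 24 kHz model, and it ignores the side information needed to signal the per-frame count. The per-frame counts below are invented, and how a codec would choose them is left open.

```python
def average_bitrate_bps(codebooks_per_frame, bits_per_code, frame_rate_hz):
    """Mean bitrate when frame i uses codebooks_per_frame[i] codes."""
    mean_codes = sum(codebooks_per_frame) / len(codebooks_per_frame)
    return mean_codes * bits_per_code * frame_rate_hz


# Fixed rate: always the first 5 codebooks.
print(average_bitrate_bps([5, 5, 5, 5], 10, 75))  # 3750.0
# Hypothetical variable rate: 8 codes on busy frames, 1 on quiet ones.
print(average_bitrate_bps([8, 8, 1, 1], 10, 75))  # 3375.0
```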

a897456 commented on June 12, 2024

You set the step size 2 * 4 * 5 * 8 = 320 and segment_length=32270; then output = Encoder(input), and the output is a tensor(32,102,512).

When I set the step size 2 * 4 * 5 * 6 = 240 and segment_length=30430, the output is a tensor(32,127,512).

So I want to ask two questions:

First, can I increase the segment_length? Like 5000? Because I noticed that this value is related to the valid content of input.

Second, if I increase the segment_length and decrease the step size, the second parameter of the output tensor will also increase, for example from your 102 to my 127 or more. I know that the second parameter is related to the number of codebooks, which is 1024. What is the effect of increasing the second value?

kaiidams commented on June 12, 2024

First, can I increase the segment_length? Like 5000? Because I noticed that this value is related to the valid content of input.

Yes, you can change the segment size, unless the size is too short. 5,000 might be too short as it is computationally inefficient when the segment size is too short.

I know that the second parameter is related to the number of codebooks

The second dimension of an encoded tensor is the time axis. If the output is a tensor(32,102,512), it encodes 102 * 320 = 32640 samples, ignoring padding. The third dimension is the hidden dimension. We have 102 embedding vectors along the time axis. Each embedding vector is quantized using 8 codebooks, which produces 102 * 8 codes, i.e., tensor(32, 102, 8).
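The shape bookkeeping in this reply can be checked with a few lines. The sketch assumes 1024 codes per codebook (10 bits per code), as stated elsewhere in the thread, and ignores padding; `segment_payload` is a hypothetical helper name.

```python
from math import log2


def segment_payload(num_frames, stride, num_quantizers, num_embeddings):
    """Return (samples covered, codes produced, bits) for one encoded segment."""
    samples = num_frames * stride
    codes = num_frames * num_quantizers
    bits = codes * log2(num_embeddings)
    return samples, codes, bits


print(segment_payload(102, 320, 8, 1024))  # (32640, 816, 8160.0)
```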

a897456 commented on June 12, 2024

soundstream-pytorch/soundstream.py

Line 647 in 9c6086e

torch.nn.init.normal_(model.quantizer.weight, mean=mean, std=std)

Is the weight here computed from only one batch? Is there a missing for loop here,
like "for batch in iterator" or something?
My understanding is this: the weight is the initial codebook, and all the data needs to be clustered, for example into 1024 classes, which then form the 1024 codebook entries.
Did I get it wrong?

a897456 commented on June 12, 2024

First, do you know what the second part of the reconstruction loss formula in the paper is? Why log the mel spectrum?

soundstream-pytorch/soundstream.py

Line 385 in 9c6086e

but uses STFT instead of mel-spectrogram

Second, your code uses STFT instead of mel spectrum. Is the STFT you use an intermediate parameter in the process of solving mel spectrum?

soundstream-pytorch/soundstream.py

Line 383 in 9c6086e

class ReconstructionLoss(nn.Module):

soundstream-pytorch/soundstream.py

Line 347 in 9c6086e

class STFTDiscriminator(nn.Module):

Third, does the loss of the STFT discriminator and the loss of reconstruction with the STFT count as duplicates?

kaiidams commented on June 12, 2024

Why log the mel spectrum?

The log Mel-spectrum is believed to be close to human perception. S(x) above is linear in the power, but human perception is linear in the log of the power. Probably the second part (5) is what we want to minimize.

Second, your code uses STFT instead of mel spectrum. Is the STFT you use an intermediate parameter in the process of solving mel spectrum?

I didn't understand what 'intermediate parameter' means. But the loss formula originally comes from https://arxiv.org/pdf/2008.01160.pdf, where they use the STFT, not the Mel-spectrum as in the SoundStream paper.

Third, does the loss of the STFT discriminator and the loss of reconstruction with the STFT count as duplicates?

The STFT discriminator is based on the GAN technique, while STFT reconstruction is auto-regression, so they are different. GANs are used for audio generation (and image generation) because there are many possible audio outputs that sound good to humans but differ greatly in auto-regressive loss, for example because of a shifted phase (or a shifted image). However, a GAN alone doesn't generate audio that is close to the original, so they use the weaker auto-regressive task to help audio generation.

a897456 commented on June 12, 2024

I didn't understand what 'intermediate parameter' is.

The MFCC computation I have learned is as follows: the voice signal is windowed, the energy distribution over the spectrum is obtained through the FFT (like your STFT?), the power spectrum is obtained by squaring the modulus, the mel spectrum is obtained through the mel filter bank, the Fbank features are obtained by taking the log (like the second part of the reconstruction loss formula?), and finally the MFCC is obtained through the DCT.

So I guess the STFT your code uses is the 'intermediate parameter', namely the energy distribution over the spectrum, and the log of the STFT, which is the second part of the reconstruction loss formula, is the Fbank, which is indeed a feature close to human perception.

a897456 commented on June 12, 2024

Sorry to bother you again. I found that if I change batch_size from 32 to 16, the training speed will be increased by 10 times, but I haven't finished the training yet, so I don't know if the training at such a fast speed means that the training is incomplete or ineffective. Besides, why did you set batch_size to 32 in the first place?

kaiidams commented on June 12, 2024

Usually you want to use the largest batch size that fits on your GPU for better efficiency. Doubling the batch size shouldn't more than double the step time. A larger batch size also generally gives more stable results, because larger batches have lower variance.
I don't know why batch_size=16 is 10 times faster, but it is great if it is still stable.

a897456 commented on June 12, 2024

Excuse me, have you ever tried to reconstruct the voice signal through the MFCC? Or have you ever seen someone else do it?

kaiidams commented on June 12, 2024

I haven't tried MFCC myself. I think MFCC is less popular than the mel spectrogram as a voice feature because deep-learning-based models are strong enough; HiFi-GAN and MelGAN, for example, use mel spectrograms. But MFCC might be good (or no good) for calculating the reconstruction loss of vocoders and codecs.

a897456 commented on June 12, 2024

soundstream-pytorch/soundstream.py

Line 408 in 9c6086e

loss += torch.mean(torch.abs(x - y))

soundstream-pytorch/soundstream.py

Line 409 in 9c6086e

    
loss += alpha * torch.mean(torch.square(torch.log(x + self.eps) - torch.log(y + self.eps)))

Excuse me for bothering you again. May I ask why there are two loss terms here? I noticed that rec_loss is very large, and g_loss is also large, which is how I came across these two terms.
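
A rough way to see what each of the two quoted terms reacts to is to evaluate them on a toy example. The sketch below is plain Python rather than the repository's torch code; the function name, the `eps` value and `alpha = 1.0` are assumptions made for the illustration.

```python
import math


def spectral_reconstruction_terms(x, y, eps=1e-5):
    """x, y: equal-length lists of non-negative spectrogram magnitudes.

    Returns the mean absolute difference and the mean squared
    difference of the logs, i.e. the two terms quoted above.
    """
    n = len(x)
    l1_term = sum(abs(a - b) for a, b in zip(x, y)) / n
    log_term = sum((math.log(a + eps) - math.log(b + eps)) ** 2 for a, b in zip(x, y)) / n
    return l1_term, log_term


# One loud bin that matches, one quiet bin that is off by a factor of 10.
target = [1.0, 0.001]
output = [1.0, 0.0001]

l1_term, log_term = spectral_reconstruction_terms(target, output)
alpha = 1.0  # weight of the log term; treated as a free parameter in this sketch
loss = l1_term + alpha * log_term
```

Here the absolute-difference term barely notices the error in the quiet bin, while the log term is dominated by it. The two terms are therefore complementary: one tracks errors in high-energy bins, the other tracks relative errors in low-energy bins.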

kaiidams commented on June 12, 2024

In my case g_rec_loss is around 10. Do you see any other abnormalities?

g_stft_loss	g_wave_loss	g_feat_loss	g_rec_loss	q_loss	g_loss	codes_entropy	d_stft_loss	d_wave_loss	d_loss	num_replaced	epoch	step
8.765625	2.03125	0.035614	13.462036	0.385002	20.735474	6.826962	0.0	1.387695	1.041016	0.0	24	21487

a897456 commented on June 12, 2024

Did you change the mel spectrum to the STFT in the first place because there are many negative numbers in the mel spectrum? If the log is applied as in the formula in the figure above, the loss would become NaN.

a897456 commented on June 12, 2024

In my case g_rec_loss is around 10. Do you see any other abnormalities?

g_stft_loss g_wave_loss g_feat_loss g_rec_loss q_loss g_loss codes_entropy d_stft_loss d_wave_loss d_loss num_replaced epoch step
8.765625 2.03125 0.035614 13.462036 0.385002 20.735474 6.826962 0.0 1.387695 1.041016 0.0 24 21487

I am trying to replace the discriminator in your code with the MSD and MPD modules of HIFIGAN, but it has not been successful. The output speech after training is white noise, and I have been looking for the reason, thinking that loss cannot converge. So the loss parameter that you and I expressed looks different.

In addition, I heard that HIFIGAN's discriminator is the most useful discriminator at present, and I want to add it to your code. I have finished adding, now the code can run through, but the output speech after training is always white noise, I can't find the problem, can you help me to achieve it?

a897456 commented on June 12, 2024

Previous test results (at first it was over 100, but it dropped to 20 as the steps increased)

Current test results (at the beginning it was over 100, but it did not decrease as the steps increased)

The g_rec_loss in g_loss does not converge.

kaiidams commented on June 12, 2024

I'm not sure about the reason why.

HiFiGAN paper uses big lambda for generation. https://arxiv.org/pdf/2010.05646.pdf Probably you could try tweaking hyper parameters, or try to replace with good known state_dict of the model to see if the loss is reasonable.

a897456 commented on June 12, 2024

HiFiGAN paper uses big lambda for generation. https://arxiv.org/pdf/2010.05646.pdf Probably you could try tweaking hyper parameters, or try to replace with good known state_dict of the model to see if the loss is reasonable.

The HiFiGAN paper uses a big lambda for generation. The big lambda you mentioned acts on the third term in the generator formula, mel_loss. Does that mean that the first term in the generator formula, adv_loss, changes less?

My test results are as follows:

d_adv_loss (shown as d_loss in the figure below) ≈ g_adv_loss; both change very little, only in the fourth decimal place, as shown in the figure below.
The feat_loss, with lambda_fm=2: I can't see anything unusual.
The mel_loss: following your approach, I replaced it with stft_loss, as mentioned in your code ("uses STFT instead of mel-spectrogram").

The arrangement of the three losses looks very reasonable, but the training results are poor and the intelligibility is very low. I don't know what went wrong. Can you see the problem from the picture?

a897456 commented on June 12, 2024

The first picture is codebook_loss obtained by running your code. The second picture is the code I ran last night (see the changes in the last two replies).

soundstream-pytorch/soundstream.py

Line 503 in 9c6086e

x, codes, codebook_loss = self.quantizer(x)

Do you know why this codebook_loss is like this? I just changed the batch_size

a897456 commented on June 12, 2024

soundstream-pytorch/soundstream.py

Line 661 in 9c6086e

precision='16-mixed',

soundstream-pytorch/soundstream.py

Line 289 in 9c6086e

@torch.cuda.amp.autocast(enabled=False)

soundstream-pytorch/soundstream.py

Line 236 in 9c6086e

@torch.cuda.amp.autocast(enabled=False)

I am very sorry to bother you again. @kaiidams
First, if you set precision='16-mixed' in Trainer(), are all tensors globally already under automatic mixed precision?
Second, if you set '@torch.cuda.amp.autocast(enabled=False)' on a certain region, are the tensors in that region only 16-bit half precision?
Third, do other regions without any such decorator still use precision='16-mixed'?
Looking forward to your answer, thank you.

kaiidams commented on June 12, 2024

Code under @torch.cuda.amp.autocast(enabled=False) should compute tensors in 32-bit floats. Using precision='16-mixed' should be okay, but it can be unstable in general, so you may try disabling 16-mixed. codebook_loss looks very bad to me. The quantizer replaces a code when it is not used much; if batch_size is small, it may replace more codes.

soundstream-pytorch/soundstream.py

Line 297 in 9c6086e

num_replaced = torch.sum(self.code_count < self.code_replace_threshold).item()

How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.
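
To make these two diagnostics concrete, here is a small sketch in plain Python. It assumes that the logged codes_entropy is the Shannon entropy of code usage measured in nats, which is a guess based on 6.8 being just under ln(1024) ≈ 6.93. The helper names and the usage counts are made up for the example, and the second helper only mirrors the counting in the quoted num_replaced line.

```python
import math


def codebook_entropy(code_counts):
    """Shannon entropy (in nats) of the empirical code-usage distribution."""
    total = sum(code_counts)
    return -sum((c / total) * math.log(c / total) for c in code_counts if c > 0)


def count_below_threshold(code_counts, threshold):
    """Number of codes used less often than `threshold`."""
    return sum(1 for c in code_counts if c < threshold)


uniform = [10] * 1024                # every code used equally often
collapsed = [1000] * 8 + [0] * 1016  # only 8 codes are ever used

max_entropy = codebook_entropy(uniform)    # ln(1024)
low_entropy = codebook_entropy(collapsed)  # ln(8)
dead_codes = count_below_threshold(collapsed, threshold=1)
```

Under that assumption, an entropy close to the maximum means the codes are used almost evenly, and a low value means the codebook has collapsed onto a few entries.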

a897456 commented on June 12, 2024

How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.

I corrected the code today, replacing the wav and STFT discriminators with MSD and MPD discriminators. These diagrams show the state at epoch=0, but the training takes too long and I don't know why.

a897456 commented on June 12, 2024

When epoch=12, d_loss and g_loss gradually decrease. Although the decrease is not large, when epoch=14, d_loss suddenly decreases by 50%, and g_loss suddenly increases to 150%. Can you point out the problem? I'm a newbie and don't know the direction of the problem, please give me some advice.

kaiidams commented on June 12, 2024

In general you could try several things.

Pick the checkpoint before the sudden change and see if the model can produce reasonable audio output.
If the above is okay, try looking at norm of weights like torch.sum(torch.square(model.xxx.yyy.weight)). If it is too large, then you can apply stronger weight decay. Often, weights closer to the output tend to explode (Not sure if this is the case this time.)
Or you could try decreasing learning_rate. A learning rate that is too large may make the optimizer jump past a good minimum.
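
For the first suggestion, the epoch of the sudden change can be located from the logged losses. The sketch below is plain Python with a hypothetical helper and made-up loss values shaped like the report (a slow decrease, then d_loss halving and g_loss rising to 150% at epoch 14). The 1.3 ratio is an arbitrary threshold.

```python
def first_sudden_change(values, max_ratio=1.3):
    """Index of the first entry that changes by more than `max_ratio`
    (up or down) relative to the previous entry, or None if there is none."""
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev > 0 and (cur / prev > max_ratio or cur / prev < 1 / max_ratio):
            return i
    return None


# Hypothetical per-epoch losses, NOT real training logs.
d_loss = [1.40 - 0.01 * e for e in range(14)] + [0.63]
g_loss = [20.0 - 0.10 * e for e in range(14)] + [28.05]

jumps = [first_sudden_change(d_loss), first_sudden_change(g_loss)]
jump_epoch = min(e for e in jumps if e is not None)
last_good_epoch = jump_epoch - 1  # checkpoint to load and listen to
```

The checkpoint saved at `last_good_epoch` is then the one to reconstruct audio with before looking at weight norms or the learning rate.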

a897456 commented on June 12, 2024

Pick the checkpoint before the sudden change and see if the model can produce reasonable audio output.

I think the output audio is very bad, even at epoch=56. You can listen to it; the difference is large. Even though my modified code was trained for more than 10 times as many epochs as your code, the output voice quality is much worse.
https://drive.google.com/drive/folders/1gdBJtyc7IKReAWi-V2lVFf1fqkIAuMVK?usp=drive_link

If the above is okay, try looking at norm of weights like torch.sum(torch.square(model.xxx.yyy.weight)). If it is too large, then you can apply stronger weight decay. Often, weights closer to the output tend to explode (Not sure if this is the case this time.)

I didn't find the weight you said. Can you point it out to me? thank you.

Or you could try decreasing learning_rate. A learning rate that is too large may make the optimizer jump past a good minimum.

Yes, I changed the discriminator's learning rate from 0.0001 to 0.00001 (a 10x drop, as you suggested) and the discriminator optimizer's betas from 0.5 to 0.8 (b1) and from 0.9 to 0.99 (b2) (these two values come from the HiFi-GAN paper).

This modification does allow the model to train stably, without the problem discussed three days ago; thank you for your advice. But a new problem has emerged: even at epoch=56, the output speech is still unintelligible, as described in the first point above.

Can you give me some more guidance on what to do next?

a897456 commented on June 12, 2024

How's your codebook entropy and num_replaced? They should be around 6.8 and 0.3 each.

kaiidams commented on June 12, 2024

I didn't find the weight you said. Can you point it out to me? thank you.

If you try something like this, you can see that they are not too large. A much larger norm would mean exploded weights.

for n, p in model.state_dict().items():
    w = torch.mean(p**2).item()
    print(n, w)

encoder.layers.0.weight 0.08002594858407974
encoder.layers.0.bias 0.032715506851673126
encoder.layers.2.layers.0.conv0.weight 0.0024174670688807964
encoder.layers.2.layers.0.conv0.bias 0.0023794351145625114
encoder.layers.2.layers.0.conv1.weight 0.01180010475218296
encoder.layers.2.layers.0.conv1.bias 0.006960079539567232
encoder.layers.2.layers.1.conv0.weight 0.00261313421651721
encoder.layers.2.layers.1.conv0.bias 0.0012120403116568923
encoder.layers.2.layers.1.conv1.weight 0.011738475412130356
encoder.layers.2.layers.1.conv1.bias 0.0076542804017663
encoder.layers.2.layers.2.conv0.weight 0.002758147194981575

codebook entropy and num_replaced look good to me.

Can you give me some more guidance on what to do next?

I think training an audio codec is unstable. They added many losses; I think that is what they arrived at after many failures.
Meta's EnCodec introduced a loss balancer so that you don't have to tune loss weights. I think it is something worth trying.
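
As a rough idea of what such a balancer does, here is a simplified sketch in plain Python, based on my reading of the EnCodec paper. Each loss's gradient with respect to the model output is rescaled so that its norm equals a fixed share of a total budget, which makes the weights independent of the raw scale of each loss. The paper also averages the gradient norms over recent batches; that part is omitted here, and all names and numbers are hypothetical.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients after normalising their magnitudes.

    grads:   {name: gradient of that loss w.r.t. the model output (list of floats)}
    weights: {name: relative share of the combined gradient}
    Each gradient is rescaled to norm total_norm * weight / sum(weights).
    """
    weight_sum = sum(weights.values())
    combined = [0.0] * len(next(iter(grads.values())))
    for name, g in grads.items():
        norm = math.sqrt(sum(v * v for v in g))
        scale = total_norm * weights[name] / weight_sum / (norm + eps)
        combined = [c + scale * v for c, v in zip(combined, g)]
    return combined


# Two losses whose raw gradients differ by several orders of magnitude.
grads = {"rec": [300.0, 400.0], "adv": [0.0, 0.002]}
weights = {"rec": 1.0, "adv": 1.0}
combined = balance_gradients(grads, weights)
```

With equal weights, both losses contribute a gradient of the same norm, even though the raw reconstruction gradient is far larger than the adversarial one.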

a897456 commented on June 12, 2024

I'm sorry to bother you again, but I'd like to take your pre-training results (16kHz pretrained model) as the baseline.
Below I list the bit rate for each number of codebooks; please check whether it is correct.

number of codebooks	bit rate (bps)
1	500
2	1000
3	1500
4	2000
5	2500
6	3000
7	3500
8	4000

I want to experiment with 2400 bps and 1200 bps, which do not appear in the table. Can they only be achieved by changing step[2 4 5 8] and segment_length[35710]? If so, it seems impossible to use your pre-trained model. That would require retraining, right?

kaiidams commented on June 12, 2024

Your BPS table looks good to me; it is 4 kbps at most. Segment size is not related to the bit rate, but step[2 4 5 8] is. You can probably interpolate the results if you don't want to retrain the model. The original SoundStream paper also does not compare at equal bit rates.
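
The table, and the effect of the strides, can be reproduced with a few lines of plain Python. The function name is made up for this sketch. The two alternative configurations below are hypothetical examples that land exactly on 2400 bps and 1200 bps; they are not settings from the repository.

```python
import math


def bitrate_bps(sample_rate, strides, num_quantizers, num_embeddings):
    """Nominal bit rate: frames per second times bits per frame."""
    frames_per_second = sample_rate / math.prod(strides)
    bits_per_frame = num_quantizers * math.log2(num_embeddings)
    return frames_per_second * bits_per_frame


# Setup discussed in this thread: 16 kHz, strides [2, 4, 5, 8], 1024 entries per codebook.
table = {n: bitrate_bps(16000, [2, 4, 5, 8], n, 1024) for n in range(1, 9)}

# Hypothetical option A: larger total stride (400 samples, 40 frames/s).
slower_frames_2400 = bitrate_bps(16000, [2, 4, 5, 10], 6, 1024)
slower_frames_1200 = bitrate_bps(16000, [2, 4, 5, 10], 3, 1024)

# Hypothetical option B: smaller codebooks (256 entries, 8 bits per code).
smaller_books_2400 = bitrate_bps(16000, [2, 4, 5, 8], 6, 256)
smaller_books_1200 = bitrate_bps(16000, [2, 4, 5, 8], 3, 256)
```

Both options change layer or codebook shapes, so I would expect them to need retraining rather than reuse of the pretrained checkpoint.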

a897456 commented on June 12, 2024

Sorry to bother you again. For the bit rate, I calculated it like this:
fs = 16000 points/s and [2,4,5,8] gives 320 points/frame; dividing the two gives 50 frames/s.
Each frame is quantized with 8 codebooks, and each codebook has 1024 entries, so num_codebook=8 and codebooksize=1024 (10 bit). That gives 8*10 = 80 bit/frame, and 80 bit/frame * 50 frames/s = 4000 bit/s.
But I found out that each codebook is actually 1024 * 512, so the codebook size should be 10 * 9 = 90 bit, not 10 bit, right?

kaiidams commented on June 12, 2024

The size of codebook is here
num_quantizers is the number of codebooks.
num_embeddings is the codebook size, i.e. the number of embeddings in a codebook.
embedding_dim is the dimension of an embedding vector.

The num_quantizers (8) and num_embeddings (1024) determine the bitrate, but embedding_dim does not. 80 bit/frame should be right.
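
One way to see why embedding_dim does not matter: only the code indices need to be transmitted, and the receiver looks the vectors up in its own copy of the codebooks. The sketch below packs one frame of 8 indices, 10 bits each, into a single 80-bit integer and unpacks it again. The helper names and the index values are made up for the illustration.

```python
def pack_codes(codes, bits_per_code):
    """Concatenate code indices into one integer, first code in the top bits."""
    packed = 0
    for code in codes:
        if not 0 <= code < (1 << bits_per_code):
            raise ValueError(f"code {code} does not fit in {bits_per_code} bits")
        packed = (packed << bits_per_code) | code
    return packed


def unpack_codes(packed, num_codes, bits_per_code):
    """Inverse of pack_codes."""
    mask = (1 << bits_per_code) - 1
    return [(packed >> (bits_per_code * (num_codes - 1 - i))) & mask for i in range(num_codes)]


# One index per quantizer for a single frame (made-up values, each below 1024).
frame_codes = [3, 1023, 0, 512, 7, 99, 1000, 1]
payload = pack_codes(frame_codes, bits_per_code=10)
restored = unpack_codes(payload, num_codes=8, bits_per_code=10)
```

The payload never exceeds 80 bits per frame, whatever the length of the embedding vectors on either side.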

a897456 commented on June 12, 2024

Yes, you are right. I get it.
These days I downloaded your trained model (16kHz), took the test-clean set of LibriSpeech as the test set, ran the test, and got the ViSQOL score.
I calculated that the bit rate should be 4kbps, but ViSQOL is only 2.16. I guess the number of training epochs was too small?
testfile_score.csv

kaiidams commented on June 12, 2024

In companies, they often try several configurations and hyper-parameters and pick the best one. I tried no more than what they normally do. However, I guess one reason is that my model was trained with too little training data compared to "A. Datasets" of the SoundStream paper. I tried several Japanese audio clips from https://github.com/kaiidams/Kokoro-Speech-Dataset with my model. The results were not very good compared with English audio from LJSpeech.

a897456 commented on June 12, 2024

Thank you for your reply
I am also thinking about how to adjust the parameters to improve ViSQOL, do you have any suggestions? I can try to improve the model.

Changing the dataset now doesn't seem to be a good idea, because my research direction is low bit rate: the 16kHz dataset (from your code) is suitable for low bit rates, while the 24kHz dataset (from SoundStream) may not be. In addition, I have already taken your existing model as the baseline.
Your suggestion to improve ViSQOL???

a897456 commented on June 12, 2024

The size of codebook is here num_quantizers is the number of codebooks. num_embeddings is the codebook size, i.e. the number of embeddings in a codebook. embedding_dim is the dimension of a embedding vector.

The num_quantizers (8) and num_embeddings (1024) determines the bitrate, but not embedding_dim . 80 bit /frame should be right.

[32,1,32270] through encode to [32,102,512], I found that someone calculated the bit rate by 8x2(log102)=16bit, not 8x10(log1024)=80bit, and claimed that although 10bit(log1024) was used to quantize, but in fact only 2bit(log102) was used. What do you think about that?

kaiidams commented on June 12, 2024

[32,1,32270] through encode to [32,102,512], I found that someone calculated the bit rate by 8x2(log102)=16bit, not 8x10(log1024)=80bit, and claimed that although 10bit(log1024) was used to quantize, but in fact only 2bit(log102) was used. What do you think about that?

If batch_size=32 and the number of time steps is 102, each time step is encoded with 80 bits (= 8 * log2(1024)). In total, each segment is encoded into 102 * 80 = 8160 bits. 32270/16000 is 2.016875 sec, so 8160/2.016875/1024 = 3.95 kbps (not exactly 4 kbps because of rounding).
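
The same arithmetic, written out as a short Python sketch using the shapes quoted above. The variable names are mine, and 320 samples per frame is the product of the strides [2, 4, 5, 8] mentioned earlier in the thread.

```python
import math

sample_rate = 16000
segment_samples = 32270
num_frames = 102        # time steps reported for one segment
num_quantizers = 8
num_embeddings = 1024

bits_per_frame = num_quantizers * math.log2(num_embeddings)
total_bits = num_frames * bits_per_frame
duration_s = segment_samples / sample_rate

effective_bps = total_bits / duration_s           # bits actually spent on this segment
nominal_bps = bits_per_frame * sample_rate / 320  # 320 = 2 * 4 * 5 * 8 samples per frame
```

The effective rate differs slightly from the nominal one because the number of frames per segment is a whole number: 32270 / 320 is about 100.8, while the encoder output has 102 time steps.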

a897456 commented on June 12, 2024

import torchaudio
import torch

model = torch.hub.load("kaiidams/soundstream-pytorch", "soundstream_16khz")
x, sr = torchaudio.load('input.wav')
x, sr = torchaudio.functional.resample(x, sr, 16000), 16000
with torch.no_grad():
    y = model.encode(x)
    # y = y[:, :, :4]  # if you want to reduce code size.
    z = model.decode(y)
torchaudio.save('output.wav', z, sr)

Hi, Sorry to bother you again,

In this validation code, y = model.encode(x) should be y = model.encoder(x), and z = model.decode(y) should be z = model.decoder(y), right?
soundstream-pytorch/soundstream.py

Line 606 in 9c6086e

x *= 0.95 / torch.max(x)

At the beginning of training you normalized the data, so I was wondering whether denormalization needs to be added to the validation code above?

kaiidams commented on June 12, 2024

Right, since the audio was normalized at training time, it should also be normalized at prediction time. And if you want the original audio level, you need to denormalize the output.
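
A minimal sketch of that normalize, run, denormalize sequence in plain Python. It mirrors the quoted training line `x *= 0.95 / torch.max(x)` by scaling with the largest sample value. The helper names are hypothetical and `codec_roundtrip` is only a stand-in for encoding and decoding with the real model.

```python
def peak_normalize(samples, target=0.95):
    """Scale so that the largest sample equals `target`.

    Returns the scaled samples and the scale factor that was applied.
    """
    peak = max(samples)
    if peak <= 0:
        raise ValueError("expected at least one positive sample")
    scale = target / peak
    return [s * scale for s in samples], scale


def denormalize(samples, scale):
    """Undo peak_normalize to restore the original level."""
    return [s / scale for s in samples]


def codec_roundtrip(samples):
    """Placeholder for encode + decode; a real codec returns an approximation."""
    return list(samples)


audio = [0.10, -0.20, 0.40, -0.05]
normalized, scale = peak_normalize(audio)
restored = denormalize(codec_roundtrip(normalized), scale)
```

Keeping the scale factor from the normalization step is what makes it possible to restore the original level afterwards.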

a897456 commented on June 12, 2024

https://drive.google.com/drive/folders/1p2K09Am-Paz4I-H39uXYj1pOzFiTBjeL?usp=drive_link
I'm sorry to bother you. Do you know why the SoundStream output audio has an electrical buzzing noise, as in the files above?
Have you ever run into a problem like this before?
How can I get rid of it? Can you give me some advice?

kaiidams commented on June 12, 2024

Your output audio clips sound slower than the originals. Probably a sampling rate is wrong somewhere. For example, the model could have been trained with 22.05kHz audio but run with 16kHz audio at prediction time.
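
A quick check for this kind of mismatch is to read the rate stored in the file header and compare it with the rate the model expects. The sketch below uses only the standard library `wave` module. The helper names are made up, and 22050 Hz is just an example of a rate that differs from 16000 Hz.

```python
import io
import wave


def wav_sample_rate(wav_bytes):
    """Read the sample rate stored in a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as f:
        return f.getframerate()


def playback_stretch(true_rate, header_rate):
    """Factor by which audio plays longer when samples produced at
    `true_rate` are stored with `header_rate` in the file header."""
    return true_rate / header_rate


def silent_wav(sample_rate, num_samples):
    """Build a small 16-bit mono WAV in memory (used here as a fixture)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(b"\x00\x00" * num_samples)
    return buf.getvalue()


model_rate = 16000
file_rate = wav_sample_rate(silent_wav(22050, 100))
needs_resampling = file_rate != model_rate
stretch = playback_stretch(22050, 16000)
```

A stretch factor above 1 means the clip plays slower and lower-pitched than the original, which matches the symptom described.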

iam-Yue commented on June 12, 2024

I also want to try training a new set of data, so I'll run your code first. I don't know the reason for the following problem that occurred. Will deleting it have an impact?

kaiidams commented on June 12, 2024

StreamableModel is derived from LightningModule, which implements save_hyperparameters().
