projector(e)？ about audiodec HOT 7 CLOSED

facebookresearch commented on August 17, 2024

projector(e)？

from audiodec.

Comments (7)

bigpon commented on August 17, 2024

No. The projector just reduces the dimensionality of the latent before the VQ (RVQ).
32 is the temporal resolution, i.e., how many frames (or how many codes) each item in the batch has.
The input channel is the number of channels of the input audio. The encode_channel is the initial number of feature-map channels, and it increases according to enc_ratios, which is the same mechanism as in SoundStream. You can check it in the SoundStream paper.

a897456 commented on August 17, 2024

input_tensor[16,1,9600]--encode(stride300)--[16,512,32]--projector--[16,64,32]--transpose--[16,32,64]--VQ--[16,32,64].
So 32 is the temporal resolution, but when the frames are quantized they are matched against a codebook of size 1024. Isn't the gap between the two too large?
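
The shape chain above can be traced with a few lines of Python. This is only a sketch of the arithmetic: `trace_shapes` and its parameter names are hypothetical, not part of AudioDec, and it assumes the batch length is an exact multiple of the stride.

```python
def trace_shapes(batch=16, length=9600, stride=300, enc_dim=512, code_dim=64):
    """Return (stage, shape) pairs for the pipeline quoted in this comment."""
    frames = length // stride  # temporal resolution after the strided encoder
    return [
        ("input", (batch, 1, length)),
        ("encoder", (batch, enc_dim, frames)),
        ("projector", (batch, code_dim, frames)),   # channel reduction only
        ("transpose", (batch, frames, code_dim)),   # one code_dim vector per frame
        ("VQ", (batch, frames, code_dim)),          # shape is unchanged by quantization
    ]


for stage, shape in trace_shapes():
    print(f"{stage:<10} {shape}")
```

The two numbers in the question count different things: 32 is how many vectors per item get quantized, while 1024 is how many codebook entries each of those vectors can choose from.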

bigpon commented on August 17, 2024

9600 samples is only 0.2 sec of 48 kHz speech, and we use it only in the training stage as the batch length.

The codebook(s) should be comprehensive enough to cover most audio (or at least most audio in the training data). Therefore, the codebook size should be large enough to cover most cases in the training audio, but not so large that codebook usage drops.
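
To make "codebook usage" concrete, here is a minimal sketch that assigns each frame to its nearest codebook entry and measures the fraction of entries that were picked at least once. The helper names and the toy 2-D codebook are made up for illustration; this is not AudioDec's quantizer.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


# Toy codebook with 4 entries of dimension 2 (the thread discusses 1024 entries of dimension 64).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2), (0.8, -0.1), (0.2, 0.05)]

indices = [nearest_code(f, codebook) for f in frames]
print(indices)                                   # [0, 1, 1, 0]
print(codebook_usage(indices, len(codebook)))    # 0.5
```

Note that a single batch of 16 x 32 = 512 frames can touch at most 512 of 1024 entries per codebook, so usage is more informative when accumulated over many batches.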

a897456 commented on August 17, 2024

I set fs=8kHz, stride=200, and batch_length=6400, so I keep batch_length/stride=32, just like your 9600/300=32.
But 6400/8000=0.8 sec, so it may take longer to train. I think that's OK, don't you?
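
The two settings can be compared side by side with a small calculation. `batch_stats` is a hypothetical helper written for this comparison; it only restates the arithmetic above and adds the resulting code frame rate.

```python
def batch_stats(fs, stride, batch_length):
    """Return (frames per batch item, duration in seconds, code frames per second)."""
    frames = batch_length // stride
    seconds = batch_length / fs
    frame_rate = fs / stride
    return frames, seconds, frame_rate


print(batch_stats(48000, 300, 9600))  # (32, 0.2, 160.0)
print(batch_stats(8000, 200, 6400))   # (32, 0.8, 40.0)
```

Both settings give 32 frames per item, but each frame in the 8 kHz setting spans four times as much audio (40 frames per second instead of 160).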

a897456 commented on August 17, 2024

I found something interesting in your config file: fft_sizes=2048. From what I understand, hop_size should be one quarter of fft_sizes, which is 512.
However, you set hop_size=300, so you seem to have set hop_size=enc_strides intentionally.
Why set hop_size to the same value as enc_strides instead of 512?

bigpon commented on August 17, 2024

Hi,

For the batch length in the 1st-stage training, you can use a much longer one since the training is so fast. I use 2 seconds in the 1st stage in my current settings (0.2 seconds for the 2nd stage).
Theoretically, hop_length and fft_sizes should satisfy the COLA requirement for perfect reconstruction (STFT followed by iSTFT); see https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.check_COLA.html. However, the PyTorch STFT implementation already does some optimizations, so we can use almost any values and still get almost perfect reconstruction. Furthermore, the STFT here is only for the auxiliary STFT loss that checks the reconstruction quality of the codec, so the difference between 300 and 512 will not matter much.
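
The COLA condition mentioned here can be checked by hand: overlap-add the window at every hop and see whether the sum is constant. The sketch below does this in plain Python, assuming a periodic Hann window whose length equals the FFT size of 2048; the window actually used in the AudioDec config may differ, and `hann` and `cola_deviation` are hypothetical helpers, not library functions.

```python
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def cola_deviation(window, hop):
    """Largest deviation of the overlap-added window sum from its mean.

    For each offset r within one hop, window[r::hop] collects the samples of
    all shifted windows that overlap at that position. A result near zero
    means the window/hop pair satisfies COLA.
    """
    sums = [sum(window[r::hop]) for r in range(hop)]
    mean = sum(sums) / len(sums)
    return max(abs(s - mean) for s in sums)


window = hann(2048)
for hop in (512, 300):
    print(hop, cola_deviation(window, hop))
```

Under these assumptions, hop 512 gives a deviation at floating-point rounding level, while hop 300 gives a small but nonzero ripple (a fraction of a percent of the mean), so it is not strictly COLA but is close to it.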

a897456 commented on August 17, 2024

Nice, thanks.
