projector(e)? about audiodec HOT 7 CLOSED

facebookresearch avatar facebookresearch commented on August 17, 2024
projector(e)?

from audiodec.

Comments (7)

bigpon avatar bigpon commented on August 17, 2024
  1. No. The projector only reduces the dimensionality of the latent before it goes into the VQ (RVQ).
  2. 32 is the temporal resolution, that is, the number of frames (or codes) for this batch.
  3. The input channel is the number of channels of the input audio. encode_channel is the initial number of feature maps, and it grows according to enc_ratios, which is the same mechanism as in SoundStream. See the SoundStream paper for details.

a897456 avatar a897456 commented on August 17, 2024
Regarding "32 is the temporal resolution, which means how many frames (or how many codes) are for this batch":

input_tensor [16, 1, 9600] -> encoder (stride 300) -> [16, 512, 32] -> projector -> [16, 64, 32] -> transpose -> [16, 32, 64] -> VQ -> [16, 32, 64].
So 32 is the temporal resolution, but when quantized it is matched against a codebook_size of 1024. Is the gap between the two too large?
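
The shape flow above can be traced with plain arithmetic. This is a minimal sketch, not AudioDec code: `trace_shapes` is a hypothetical helper, and the final `indices` shape assumes one integer code per frame for a single codebook. It shows that 32 counts frames along time while 1024 is the number of entries each frame's code is chosen from, so the two numbers measure different things.

```python
def trace_shapes(batch=16, samples=9600, stride=300, enc_dim=512, code_dim=64):
    """Return (stage, shape) pairs for the flow quoted in this thread."""
    frames = samples // stride  # temporal resolution: 9600 // 300 = 32
    return [
        ("input", (batch, 1, samples)),
        ("encoder", (batch, enc_dim, frames)),
        ("projector", (batch, code_dim, frames)),
        ("transpose", (batch, frames, code_dim)),
        ("quantized", (batch, frames, code_dim)),
        # Assumption: one index in [0, codebook_size) per frame per codebook.
        ("indices", (batch, frames)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes():
        print(f"{stage:>10}: {shape}")
```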

bigpon avatar bigpon commented on August 17, 2024

9600 samples is only 0.2 s of 48 kHz speech, and we use it only as the batch length in the training stage.

The codebook(s) should be comprehensive enough to cover most audio (or at least most audio in the training data). The codebook size should therefore be large enough to cover most cases in the training audio, but not so large that codebook usage drops.
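
One way to make "codebook usage" concrete is the fraction of codebook entries that are selected at least once over some set of frames. A minimal sketch, assuming the quantizer's indices are available as a flat sequence of integers; `codebook_usage` is a hypothetical helper, not part of AudioDec.

```python
def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    used = {i for i in indices if 0 <= i < codebook_size}
    return len(used) / codebook_size


if __name__ == "__main__":
    # One training batch here has 16 * 32 = 512 frames, so even if every
    # frame picked a different entry, at most half of 1024 could be used.
    best_case_single_batch = list(range(16 * 32))
    print(codebook_usage(best_case_single_batch, codebook_size=1024))
```

Since a single 16 x 32 batch can touch at most 512 entries, usage is more meaningful when accumulated over many batches.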

a897456 avatar a897456 commented on August 17, 2024

I set fs=8kHz, stride=200, batch_length=6400, keeping batch_length/stride=32, just like your 9600/300=32.
But 6400/8000=0.8 sec, so training may take longer. I think that's OK, don't you?
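
The two settings can be compared with the same arithmetic. A small sketch; `batch_stats` is a hypothetical helper, and the frame rate (fs / stride) is derived here only for illustration.

```python
def batch_stats(fs, stride, batch_length):
    """Frames per batch item, batch duration, and codec frame rate."""
    return {
        "frames": batch_length // stride,
        "seconds": batch_length / fs,
        "frame_rate_hz": fs / stride,
    }


if __name__ == "__main__":
    print("48 kHz:", batch_stats(fs=48000, stride=300, batch_length=9600))
    print(" 8 kHz:", batch_stats(fs=8000, stride=200, batch_length=6400))
```

Both settings give 32 frames per item, but each frame spans a longer stretch of time in the 8 kHz setup, which is why the same frame count covers 0.8 s instead of 0.2 s.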

a897456 avatar a897456 commented on August 17, 2024

I found something interesting in your config file: fft_sizes=2048, and as I understand it, hop_size should be one quarter of fft_sizes, which is 512.
However, you set hop_size=300, which seems to intentionally match enc_strides.
Why set hop_size to the same value as enc_strides rather than 512?

bigpon avatar bigpon commented on August 17, 2024

Hi,

  1. For the batch length in the 1st-stage training, you can use a much longer one since training is so fast. I use 2 seconds in the 1st stage in my current settings (0.2 seconds for the second stage).

  2. Theoretically, hop_length and fft_sizes should satisfy the COLA requirement for perfect reconstruction (STFT followed by iSTFT); see https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.check_COLA.html. However, the PyTorch STFT implementation already does some optimizations, so we can use almost any values and still get almost perfect reconstruction. Furthermore, the STFT here is only used for the auxiliary STFT loss that checks the reconstruction quality of the codec, so the difference between 300 and 512 will not matter much.
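
The COLA condition mentioned above can be checked by hand: copies of the window shifted by the hop must sum to a constant. A minimal pure-Python sketch, assuming a periodic Hann window (the thread does not say which window the loss uses); `hann_periodic` and `is_cola` are hypothetical helpers written for illustration, and `scipy.signal.check_COLA` from the linked page is the library route.

```python
import math


def hann_periodic(n):
    """Periodic Hann window of length n (assumed window type)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def is_cola(window, hop, tol=1e-9):
    """True if window copies shifted by `hop` overlap-add to a constant."""
    sums = [0.0] * hop
    for i, w in enumerate(window):
        # Every sample whose index is congruent to p modulo hop lands on
        # the same output position p once all shifted copies are added.
        sums[i % hop] += w
    return max(sums) - min(sums) < tol


if __name__ == "__main__":
    window = hann_periodic(2048)
    for hop in (512, 300):
        print(f"hop={hop}: COLA satisfied = {is_cola(window, hop)}")
```

Under these assumptions a hop of 512 satisfies COLA and a hop of 300 does not, which only matters for exact STFT/iSTFT round trips and not for an STFT used purely as a loss.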

a897456 avatar a897456 commented on August 17, 2024

Nice, thanks.
