nttcslab / byol-a
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
Home Page: https://arxiv.org/abs/2103.06695
License: Other
https://github.com/nttcslab/byol-a/blob/master/evaluate.py#L112
It also needs X_val = scaler.transform(X_val), or the validation accuracy and loss will be invalid.
This may be one of the reasons why I saw lower performance when trying to reproduce the official results...
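A minimal sketch of the proposed fix, assuming the scaler behaves like a standard z-score normalizer fitted on the training embeddings (variable names are illustrative, sketched with NumPy rather than the repo's actual scaler):

```python
import numpy as np

X = np.random.randn(100, 8)      # training embeddings (illustrative)
X_val = np.random.randn(20, 8)   # validation embeddings (illustrative)

# Fit normalization statistics on the training set only...
mu, sigma = X.mean(axis=0), X.std(axis=0)
X = (X - mu) / sigma
# ...and apply the SAME statistics to the validation set (the proposed fix).
# Skipping this step leaves X_val in a different feature space, which is
# why the reported validation accuracy and loss become invalid.
X_val = (X_val - mu) / sigma
```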
Can we create vector representations using the pretrained model only for English, or is it language-independent?
In the byol_a folder there is byol_pytorch.diff instead of byol_pytorch.py.
Hi, is there any inference speed evaluation?
And how do you deal with long audios in production?
Many thanks for your great work.
Traceback (most recent call last):
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2066, in <module>
main()
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2060, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1411, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1418, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/pythonSpace/byol-a/train.py", line 132, in <module>
main(audio_dir=base_path + '1/', epochs=100)
File "E:/pythonSpace/byol-a/train.py", line 112, in main
learner = BYOLALearner(model, cfg.lr, cfg.shape,
File "E:/pythonSpace/byol-a/train.py", line 56, in __init__
self.learner = BYOL(model, image_size=shape, **kwargs)
File "D:\min\envs\torch1_7_1\lib\site-packages\byol_pytorch\byol_pytorch.py", line 211, in __init__
self.forward(torch.randn(2, 3, image_size, image_size, device=device))
TypeError: randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3
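For context: the frame at byol_pytorch.py line 211 in the traceback builds a dummy input as torch.randn(2, 3, image_size, image_size), which assumes image_size is a single integer (a square image); passing the BYOL-A shape list [64, 96] puts a list where an int is expected, hence the TypeError. The byol_pytorch.diff shipped in this repo presumably patches the upstream package for non-square spectrogram inputs, so this error suggests the diff was not applied. A NumPy sketch of the shape handling (torch.randn behaves analogously):

```python
import numpy as np

shape = [64, 96]  # [F, T] from config.yml

# randn(2, 1, shape, shape) would fail: each size argument must be an int,
# not a list. Unpacking the shape into separate ints works:
dummy = np.random.randn(2, 1, *shape)
assert dummy.shape == (2, 1, 64, 96)
```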
Lines 80 to 82 in 60cebdc
If len(wav) > self.unit_length, length_adj will be a negative value, so start will be 0. If wav (before padding) is shorter than the unit length, length_adj == 0 after padding, so start is again 0. Either way, start is always 0, so the code always crops the same region from 0 to self.unit_length (cropped_wav == wav[0:self.unit_length]) rather than performing a random crop.
So I think line 80 should be changed to length_adj = len(wav) - self.unit_length.
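A self-contained sketch of the corrected logic (illustrative names and plain lists; the real code operates on waveform tensors and pads before cropping):

```python
import random

def random_crop(wav, unit_length):
    # Proposed fix: length_adj is positive when the clip is longer
    # than unit_length, so a random start offset becomes possible.
    length_adj = len(wav) - unit_length
    if length_adj <= 0:
        return wav + [0] * (-length_adj)   # pad short clips up to unit_length
    start = random.randint(0, length_adj)  # random offset, no longer always 0
    return wav[start:start + unit_length]
```

With the original length_adj = self.unit_length - len(wav), long clips make length_adj negative, start collapses to 0, and the crop is deterministic.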
Hi there, great repo!
I think I may have misunderstood something about the RunningNorm function. The function expects the size of one epoch, but your implementation passes the size of the entire dataset.
Is this a bug, or is there a problem with my understanding?
Thank you!
Hi,
your paper is super interesting. I have a question regarding the downstream tasks. If I understand the paper correctly, you used a single linear layer for the downstream tasks, whose only input was the sum of the mean and max of the representation over time.
Did you try to fine-tune BYOL-A end-to-end on the downstream tasks after pretraining? In the case of TRILL, they were able to improve performance even further by fine-tuning the whole model end-to-end. Is there a specific reason why this is not possible with BYOL-A?
Hi there,
Section 4, subsection A, part 1 from your paper says:
The number of frames, T, in one segment was 96 in pretraining, which corresponds to 1,014ms.
However, the previous line says the hop size used was 10 ms, so 96 frames should correspond to 960 ms.
Am I misunderstanding something here?
Thank you in advance!
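One possible resolution, assuming the paper's STFT settings (a 64 ms analysis window, i.e. n_fft = 1024 at 16 kHz, with a 10 ms hop): the segment duration counts the full window of the last frame, not just the hops, which reproduces the 1,014 ms figure:

```python
hop_ms, window_ms, T = 10, 64, 96
# 95 hop intervals between 96 frame starts, plus one full analysis window
duration_ms = (T - 1) * hop_ms + window_ms
assert duration_ms == 1014
```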
Hi,
Thank you so much for your contribution. This work is very interesting, and your code is easy for me to follow. But one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?
Thank you again for your time.
Best regards
https://github.com/nttcslab/byol-a/blob/master/train.py
At line 67, there is a comment describing the shape of the input:
paired_inputs = torch.cat(paired_inputs) # [(B,1,T,F), (B,1,T,F)] -> (2*B,1,T,F)
In fact, it should be (B, 1, F, T), e.g. (256, 1, 64, 96), where 64 is the number of mel bins.
The comment is also different from the description in the config.yml file:
# Shape of log-mel spectrogram [F, T].
shape: [64, 96]
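For reference, a sketch of the two orderings in question (illustrative NumPy; which convention the network actually consumes depends on how the spectrogram is transposed before batching):

```python
import numpy as np

F, T = 64, 96                  # config.yml: shape [F, T]
lms = np.zeros((1, F, T))      # one log-mel spectrogram, (1, F, T)
batch = np.stack([lms] * 256)  # -> (B, 1, F, T) = (256, 1, 64, 96)
assert batch.shape == (256, 1, 64, 96)

# Under the code comment's (B, 1, T, F) convention, the same data
# would be laid out as the transpose, i.e. (256, 1, 96, 64):
batch_tf = batch.transpose(0, 1, 3, 2)
assert batch_tf.shape == (256, 1, 96, 64)
```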
Thank you for the fascinating paper and the code to reproduce it!
I think there might be a problem in RunningMean. The current formula (the same in v1 and v2) computes the new mean before incrementing self.n, which is inconsistent with the standard incremental-mean formula (as listed on StackOverflow): m_n = m_{n-1} + (x_n - m_{n-1}) / n.
The problem is that self.n is incremented after the new mean is computed, so the divisor is the old count. Could you please either correct me if I am wrong or correct the code?
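A minimal sketch of an incremental mean with the counter incremented before the update, matching the standard formula m_n = m_{n-1} + (x_n - m_{n-1}) / n (names are illustrative, not the repo's exact API):

```python
class RunningMean:
    """Incremental mean over a stream of scalar values."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1                             # increment first...
        self.mean += (x - self.mean) / self.n   # ...so the divisor is the new count
        return self.mean
```

Incrementing self.n only after the update divides by the old count, which is the inconsistency reported above.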
https://github.com/nttcslab/byol-a/blob/master/byol_a/models.py#L42
The condition if p.requires_grad has to be removed.
Hi, it's a great work, but how should I understand the performance metric? For example, VoxCeleb1 is usually used for speaker verification; shouldn't we measure EER?
Hi,
Thanks for sharing this great work! I tried to reproduce the results following the official guidance, but failed.
After preprocessing the data, I ran the following commands:
CUDA_VISIBLE_DEVICES=0 python -W ignore train.py work/16k/fsd50k/FSD50K.dev_audio
cp lightning_logs/version_4/checkpoints/epoch\=99-step\=16099.ckpt AudioNTT2020-BYOLA-64x96d2048.pth
CUDA_VISIBLE_DEVICES=4 python evaluate.py AudioNTT2020-BYOLA-64x96d2048.pth spcv2
However, the results are far from the reported ones.
Did I miss something important? Thank you very much.
Hi
Thank you for your contribution. It's really interesting work. However, I have one question regarding the downstream evaluation.
In the paper, you mention that "A segment of shape FxT was randomly cropped from each audio clip and encoded for linear evaluation in the downstream tasks."
However, as far as I know, this procedure was not adopted in previous works. Have you tried an experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during the evaluation stage? Is there any performance difference?
Thanks