nttcslab / byol-a
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
Home Page: https://arxiv.org/abs/2103.06695
License: Other
https://github.com/nttcslab/byol-a/blob/master/evaluate.py#L112
It also needs X_val = scaler.transform(X_val), or the validation accuracy and loss will be invalid.
This may be one of the reasons why I saw lower performance when trying to reproduce the official results...
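A minimal sketch of the proposed fix, assuming the scaler behaves like a standard z-score normalizer fitted on the training embeddings (variable names are illustrative, sketched with NumPy rather than the repo's actual scaler):

```python
import numpy as np

X = np.random.randn(100, 8)      # training embeddings (illustrative)
X_val = np.random.randn(20, 8)   # validation embeddings (illustrative)

# Fit normalization statistics on the training set only...
mu, sigma = X.mean(axis=0), X.std(axis=0)
X = (X - mu) / sigma
# ...and apply the SAME statistics to the validation set (the proposed fix).
# Skipping this step leaves X_val in a different feature space, which is
# why the reported validation accuracy and loss become invalid.
X_val = (X_val - mu) / sigma
```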
Can we create vector representations using the pretrained model only for English, or is it language-independent?
In the byol_a folder there is byol_pytorch.diff instead of byol_pytorch.py.
Hi, is there any inference speed evaluation?
And how do you deal with long audios in production?
Many thanks for your great work.
Traceback (most recent call last):
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2066, in <module>
main()
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 2060, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1411, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\pydevd.py", line 1418, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "F:\IntellIDEA\PyCharm 2019.2.2\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/pythonSpace/byol-a/train.py", line 132, in <module>
main(audio_dir=base_path + '1/', epochs=100)
File "E:/pythonSpace/byol-a/train.py", line 112, in main
learner = BYOLALearner(model, cfg.lr, cfg.shape,
File "E:/pythonSpace/byol-a/train.py", line 56, in __init__
self.learner = BYOL(model, image_size=shape, **kwargs)
File "D:\min\envs\torch1_7_1\lib\site-packages\byol_pytorch\byol_pytorch.py", line 211, in __init__
self.forward(torch.randn(2, 3, image_size, image_size, device=device))
TypeError: randn(): argument 'size' must be tuple of ints, but found element of type list at pos 3
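For context: the frame at byol_pytorch.py line 211 in the traceback builds a dummy input as torch.randn(2, 3, image_size, image_size), which assumes image_size is a single integer (a square image); passing the BYOL-A shape list [64, 96] puts a list where an int is expected, hence the TypeError. The byol_pytorch.diff shipped in this repo presumably patches the upstream package for non-square spectrogram inputs, so this error suggests the diff was not applied. A NumPy sketch of the shape handling (torch.randn behaves analogously):

```python
import numpy as np

shape = [64, 96]  # [F, T] from config.yml

# randn(2, 1, shape, shape) would fail: each size argument must be an int,
# not a list. Unpacking the shape into separate ints works:
dummy = np.random.randn(2, 1, *shape)
assert dummy.shape == (2, 1, 64, 96)
```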
Lines 80 to 82 in 60cebdc
If len(wav) > self.unit_length, length_adj will be a negative value, so start will be 0. If wav (before padding) is shorter than the unit length, length_adj == 0 after padding, so start is again 0. Either way, start is always 0, so the code always crops the same region from 0 to self.unit_length (cropped_wav == wav[0:self.unit_length]) rather than performing a random crop.
So I think line 80 should be changed to length_adj = len(wav) - self.unit_length.
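A self-contained sketch of the corrected logic (illustrative names and plain lists; the real code operates on waveform tensors and pads before cropping):

```python
import random

def random_crop(wav, unit_length):
    # Proposed fix: length_adj is positive when the clip is longer
    # than unit_length, so a random start offset becomes possible.
    length_adj = len(wav) - unit_length
    if length_adj <= 0:
        return wav + [0] * (-length_adj)   # pad short clips up to unit_length
    start = random.randint(0, length_adj)  # random offset, no longer always 0
    return wav[start:start + unit_length]
```

With the original length_adj = self.unit_length - len(wav), long clips make length_adj negative, start collapses to 0, and the crop is deterministic.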
Hi there, great repo!
I think I may have misunderstood something about the RunningNorm function. The function expects the size of one epoch, but your implementation passes the size of the entire dataset.
Is this a bug, or is there a problem with my understanding?
Thank you!
Hi,
your paper is super interesting. I have a question regarding the downstream tasks. If I understand the paper correctly, you used a single linear layer for the downstream tasks, whose only input was the sum of the mean and max of the representation over time.
Did you try to fine-tune BYOL-A end-to-end on the downstream tasks after pretraining? In the case of TRILL, they were able to improve performance even further by fine-tuning the whole model end-to-end. Is there a specific reason why this is not possible with BYOL-A?
Hi there,
Section 4, subsection A, part 1 from your paper says:
The number of frames, T, in one segment was 96 in pretraining, which corresponds to 1,014ms.
However, the previous line says the hop size used was 10 ms, so 96 frames should correspond to 960 ms.
Am I misunderstanding something here?
Thank you in advance!
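One possible resolution, assuming the paper's STFT settings (a 64 ms analysis window, i.e. n_fft = 1024 at 16 kHz, with a 10 ms hop): the segment duration counts the full window of the last frame, not just the hops, which reproduces the 1,014 ms figure:

```python
hop_ms, window_ms, T = 10, 64, 96
# 95 hop intervals between 96 frame starts, plus one full analysis window
duration_ms = (T - 1) * hop_ms + window_ms
assert duration_ms == 1014
```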
Hi,
Thank you so much for your contribution. This work is very interesting, and your code is easy for me to follow. But one of the downstream datasets, VoxForge, is missing from preprocess_ds.py. Could you please release the code for that dataset, too?
Thank you again for your time.
Best regards
https://github.com/nttcslab/byol-a/blob/master/train.py
At line 67, there is a comment describing the shape of the input:
paired_inputs = torch.cat(paired_inputs) # [(B,1,T,F), (B,1,T,F)] -> (2*B,1,T,F)
In fact, it should be (B, 1, F, T), e.g. (256, 1, 64, 96), where 64 is the number of mel bins.
The comment is also different from the description in the config.yml file:
# Shape of log-mel spectrogram [F, T].
shape: [64, 96]
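For reference, a sketch of the two orderings in question (illustrative NumPy; which convention the network actually consumes depends on how the spectrogram is transposed before batching):

```python
import numpy as np

F, T = 64, 96                  # config.yml: shape [F, T]
lms = np.zeros((1, F, T))      # one log-mel spectrogram, (1, F, T)
batch = np.stack([lms] * 256)  # -> (B, 1, F, T) = (256, 1, 64, 96)
assert batch.shape == (256, 1, 64, 96)

# Under the code comment's (B, 1, T, F) convention, the same data
# would be laid out as the transpose, i.e. (256, 1, 96, 64):
batch_tf = batch.transpose(0, 1, 3, 2)
assert batch_tf.shape == (256, 1, 96, 64)
```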
Thank you for the fascinating paper and the code to reproduce it!
I think there might be a problem in RunningMean. The current formula (the same in v1 and v2) computes the new mean before incrementing self.n, which is inconsistent with the standard incremental-mean formula (as listed on StackOverflow): m_n = m_{n-1} + (x_n - m_{n-1}) / n.
The problem is that self.n is incremented after the new mean is computed, so the divisor is the old count. Could you please either correct me if I am wrong or correct the code?
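A minimal sketch of an incremental mean with the counter incremented before the update, matching the standard formula m_n = m_{n-1} + (x_n - m_{n-1}) / n (names are illustrative, not the repo's exact API):

```python
class RunningMean:
    """Incremental mean over a stream of scalar values."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1                             # increment first...
        self.mean += (x - self.mean) / self.n   # ...so the divisor is the new count
        return self.mean
```

Incrementing self.n only after the update divides by the old count, which is the inconsistency reported above.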
https://github.com/nttcslab/byol-a/blob/master/byol_a/models.py#L42
The condition if p.requires_grad has to be removed.
Hi, it's a great work, but how should I understand the performance metric? For example, VoxCeleb1 is usually used for speaker verification; shouldn't we measure EER?
Hi,
Thanks for sharing this great work! I tried to reproduce the results following the official guidance, but failed.
After preprocessing the data, I ran the following commands:
CUDA_VISIBLE_DEVICES=0 python -W ignore train.py work/16k/fsd50k/FSD50K.dev_audio
cp lightning_logs/version_4/checkpoints/epoch\=99-step\=16099.ckpt AudioNTT2020-BYOLA-64x96d2048.pth
CUDA_VISIBLE_DEVICES=4 python evaluate.py AudioNTT2020-BYOLA-64x96d2048.pth spcv2
However, the results are far from the reported ones.
Did I miss something important? Thank you very much.
Hi
Thank you for your contribution. It's really interesting work. However, I have one question regarding the downstream evaluation.
In the paper, you mention that "A segment of shape FxT was randomly cropped from each audio clip and encoded for linear evaluation in the downstream tasks."
However, as far as I know, this procedure was not adopted in previous works. Have you tried an experiment where the complete log-mel spectrogram (without random cropping) is fed to the network during the evaluation stage? Is there any performance difference?
Thanks