I want to use frame-level ssast just for frame-level audio token extraction

yuangongnd commented on August 20, 2024

I want to use frame-level ssast just for frame-level audio token extraction

Comments (3)

YuanGongND commented on August 20, 2024

Hi there,

Thanks for reaching out.

In your ast_models.py, you put cluster True as Default ... But if to use frame-level ssast, cluster should be False. Do I have to turn it off?

You are correct that cluster=True is the default in the model script, but we pass cluster=False for the frame-level model when we instantiate it. Please see here:

ssast/src/run.py

Lines 125 to 130 in a1a3eec

    
    if 'pretrain' in args.task:
        cluster = (args.num_mel_bins != args.fshape)
        if cluster == True:
            print('The num_mel_bins {:d} and fshape {:d} are different, not masking a typical time frame, using cluster masking.'.format(args.num_mel_bins, args.fshape))
        else:
            print('The num_mel_bins {:d} and fshape {:d} are same, masking a typical time frame, not using cluster masking.'.format(args.num_mel_bins, args.fshape))

FYI, you can use cluster=True for frame-level AST, but from my experience, it will lead to a performance drop.
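
As a minimal sketch of the rule in the snippet above (not the repository's code): cluster masking is switched on only when fshape differs from num_mel_bins, i.e. when a patch does not span the whole frequency axis. The helper name `uses_cluster_masking` and the example values 128 and 16 are assumptions for illustration.

```python
def uses_cluster_masking(num_mel_bins: int, fshape: int) -> bool:
    """Mirror the check quoted from run.py: cluster masking is used only
    when a patch does not cover the full frequency axis."""
    return num_mel_bins != fshape


# Frame-level setup (assumed values): each patch covers all mel bins.
print(uses_cluster_masking(num_mel_bins=128, fshape=128))
# Patch-level setup (assumed values): a patch is shorter than a full frame.
print(uses_cluster_masking(num_mel_bins=128, fshape=16))
```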

If I want to use your pretrained frame level ssast for audio token extraction, is the output of self.v.norm(x) except the first one what I have to use in finetuningcls function? because the first one is cls token

You are correct, but please note that cls_token_num is not always 1 for all models.
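
A minimal sketch of that caution, using plain Python lists as a stand-in for the model output: drop the first cls_token_num tokens of each sequence instead of hardcoding 1. The function name `drop_cls_tokens` and the toy batch are hypothetical. On a PyTorch tensor x of shape (batch, tokens, dim), the equivalent slice would be x[:, cls_token_num:, :].

```python
from typing import List, Sequence

Token = Sequence[float]


def drop_cls_tokens(batch: Sequence[Sequence[Token]], cls_token_num: int) -> List[List[Token]]:
    """Return only the frame tokens of each sequence in the batch by
    skipping the leading special (cls / distillation) tokens."""
    if cls_token_num < 0:
        raise ValueError("cls_token_num must be non-negative")
    return [list(sequence[cls_token_num:]) for sequence in batch]


# Toy batch: one sequence with 2 leading special tokens and 3 frame tokens.
batch = [[[0.0], [0.1], [1.0], [2.0], [3.0]]]
frame_tokens = drop_cls_tokens(batch, cls_token_num=2)
print(len(frame_tokens[0]))  # number of frame tokens left in the sequence
```

Passing the value stored with the model, rather than a literal, keeps the slice correct for checkpoints that carry more than one special token.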

One more thing I wonder....Could I get some part of fbank that is corresponding to the video frames? melspectrogram does but I don't know fbank could be....

This involves audio-visual learning while this paper is about pure audio research. But we do use fbank features as input in this paper, see:

ssast/src/dataloader.py

Lines 126 to 127 in a1a3eec

    
    fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                              window_type='hanning', num_mel_bins=self.melbins, dither=0.0, frame_shift=10)
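
Since frame_shift=10 means consecutive fbank frames start 10 ms apart, one rough way to find the fbank frames that belong to a video frame is to convert the video frame's time span into fbank frame indices. The sketch below is not part of this repo. It assumes the audio and video start at the same instant and the video has a constant frame rate, and the function name `fbank_range_for_video_frame` is hypothetical. Each fbank frame is computed from an analysis window that is typically longer than the shift, so the boundaries are approximate.

```python
import math
from fractions import Fraction
from typing import Tuple


def fbank_range_for_video_frame(video_frame_index: int, video_fps: float,
                                frame_shift_ms: float = 10.0) -> Tuple[int, int]:
    """Return the half-open range [start, end) of fbank frame indices whose
    start times fall inside the given video frame.

    Assumes fbank frame i starts at i * frame_shift_ms and that audio and
    video share the same time origin.
    """
    if video_fps <= 0 or frame_shift_ms <= 0:
        raise ValueError("video_fps and frame_shift_ms must be positive")
    # Exact rational arithmetic avoids float rounding at frame boundaries.
    fbank_per_video = Fraction(1000) / (Fraction(video_fps) * Fraction(frame_shift_ms))
    start = math.ceil(video_frame_index * fbank_per_video)
    end = math.ceil((video_frame_index + 1) * fbank_per_video)
    return start, end


# At 25 fps a video frame lasts 40 ms, i.e. 4 fbank frames at a 10 ms shift.
start, end = fbank_range_for_video_frame(video_frame_index=1, video_fps=25)
print(start, end)
```

With a fbank tensor of shape (time, num_mel_bins), the matching slice would then be fbank[start:end].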

Hope these help.

-Yuan

9B8DY6 commented on August 20, 2024

Have you tried other input types like melspectrograms or mfcc? @YuanGongND I am gonna try feeding melspectrograms to SSAST to extract audio feature....Is it okay?
Could I ask you why cls token number is not consistent as 1? It can be 2 because of dist_token? Then, what is dist_token?

YuanGongND commented on August 20, 2024

Have you tried other input types like melspectrograms or mfcc? @YuanGongND I am gonna try feeding melspectrograms to SSAST to extract audio feature....Is it okay?

I have never tried other input features. You can pretrain your own model with other input features, but if you plan to use our pretrained model to extract features/embeddings/tokens, then you have to use the same dataloader as ours (which is fully released in this repo). Any input distribution shift could cause a dramatic performance difference.

Could I ask you why cls token number is not consistent as 1? It can be 2 because of dist_token? Then, what is dist_token?

dist_token stands for distillation token; please read our AST paper for details. SSAST does not need this token, but our code is compatible with old AST models.

-Yuan
