
process the audioset (CLOSED)

RicherMans commented on July 30, 2024:
process the audioset


Comments (15)

RicherMans commented on July 30, 2024:

Hmm, shouldn't be that hard, since the model itself is only trained on the balanced subset.
The training script for this repo will probably come with the follow-up work.

Did you maybe forget to balance each batch during training? That influence is significant.
Also, for reference: on my AudioSet evaluation subset (only 13k samples), I get an mAP of 0.095.

Also maybe your feature preprocessing is problematic?


LucySha commented on July 30, 2024:

@RicherMans Thanks for your reply. The AudioSet balanced subset contains more than 20k audio clips, but I only want to separate 'Speech' from other audio, so I randomly chose about 4k 'Speech' clips and 4k other clips, giving about 8k clips as training data. Does 'balance each batch during training' mean that for every batch I have to make sure the ratio of 'Speech' to other audio is 1:1? Would that work for me?


LucySha commented on July 30, 2024:

Another question: if I reproduce GPV-f (527 classes), how do I balance each batch when the batch size (64) is smaller than 527?


RicherMans commented on July 30, 2024:

> The AudioSet balanced subset contains more than 20k audio clips, but I only want to separate 'Speech' from other audio, so I randomly chose about 4k 'Speech' clips and 4k other clips, giving about 8k clips as training data.

Oh no, that wasn't the entire point of the work. The problem with using only these 4k speech and 4k non-speech samples is that the "noise" class gets far too large a variance.
That is actually one of the points of the work: common VAD methods are trained with a single noise class, which is likely to be "huge".
Also, your model wouldn't be able to fully distinguish between the individual events either.
So just train on all 20k samples of the AudioSet balanced subset.

In this work, GPV-B simply replaces every non-speech label with "noise/non-speech". After replacing, you will notice that there are only two event label combinations: (Speech, Noise) and (Noise,); Speech never occurs on its own in that dataset.
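A minimal sketch of that label replacement, assuming labels come as lists of AudioSet event names (the function and class names here are illustrative, not code from the repo):

```python
# Collapse every non-speech AudioSet event into a single 'Noise' class,
# as done for GPV-B. Each clip's label set then becomes either
# ('Speech', 'Noise') or ('Noise',).
def to_gpvb_labels(event_labels):
    mapped = {'Speech' if lb == 'Speech' else 'Noise' for lb in event_labels}
    return tuple(sorted(mapped, reverse=True))

print(to_gpvb_labels(['Speech', 'Dog', 'Music']))  # ('Speech', 'Noise')
print(to_gpvb_labels(['Car', 'Siren']))            # ('Noise',)
```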

> Does 'balance each batch during training' mean that for every batch I have to make sure the ratio of 'Speech' to other audio is 1:1? Would that work for me?

Actually, I personally didn't care about balancing 'Speech' directly; just make sure each batch is reasonably balanced.

> Another question: if I reproduce GPV-f (527 classes), how do I balance each batch when the batch size (64) is smaller than 527?

It's a multi-class, multi-output problem, so exact balancing, as for standard classification problems, is impossible.
You will always sample a majority class appearing together with a minority class, so you can't guarantee that each class is seen only once in a batch.
However, what I mean by "balancing" here is to roughly balance the dataset such that, within a batch and across subsequent batches, each class appears at least once.
So say in the first batch classes 1, 2, 3 are trained, while in the following batch it's 4, 5, 6, and so on. Of course you can't 100% guarantee that this will always be the case.

I'll just paste a minimal working example of my own "minimum occupancy sampler".
The input is a 2d np array named `labels` with shape (N, E), where N is the overall number of samples (your dataset size) and E is the number of classes (e.g., 527). On the unbalanced AudioSet that array becomes too large to store densely in memory, so I used sparse matrices, but you can neglect that part.
Have fun.

```python
import torch.utils.data as tdata
import scipy.sparse  # needed for scipy.sparse.issparse below
import numpy as np

class MinimumOccupancySampler(tdata.Sampler):
    """
    Samples at least one instance from each class sequentially,
    so every class is seen within a few consecutive batches.
    """
    def __init__(self, labels, sampling_mode='same', random_state=1):
        self.labels = labels
        data_samples, n_labels = labels.shape
        label_to_idx_list, label_to_length = [], []
        self.random_state = np.random.RandomState(seed=random_state)
        for lb_idx in range(n_labels):
            label_selection = labels[:, lb_idx]
            if scipy.sparse.issparse(label_selection):
                label_selection = label_selection.toarray()
            label_indexes = np.where(label_selection == 1)[0]
            self.random_state.shuffle(label_indexes)
            label_to_length.append(len(label_indexes))
            label_to_idx_list.append(label_indexes)

        self.longest_seq = max(label_to_length)
        self.data_source = np.zeros((self.longest_seq, len(label_to_length)),
                                    dtype=np.uint32)
        # Each column represents one "single instance per class" data piece
        for ix, leng in enumerate(label_to_length):
            # Fill first only "real" samples
            self.data_source[:leng, ix] = label_to_idx_list[ix]

        self.label_to_idx_list = label_to_idx_list
        self.label_to_length = label_to_length

        if sampling_mode == 'same':
            self.data_length = data_samples
        elif sampling_mode == 'over':  # Sample all items
            self.data_length = np.prod(self.data_source.shape)

    def _reshuffle(self):
        # Pad each class column up to the longest class by re-sampling
        # random (duplicate) indexes from that class
        for ix, leng in enumerate(self.label_to_length):
            if leng == 0:
                continue  # class without any samples, nothing to resample
            leftover = self.longest_seq - leng
            random_idxs = self.random_state.randint(leng, size=leftover)
            self.data_source[leng:, ix] = self.label_to_idx_list[ix][random_idxs]

    def __iter__(self):
        # Before each epoch, reshuffle random indices
        self._reshuffle()
        n_samples = len(self.data_source)
        random_indices = self.random_state.permutation(n_samples)
        data = np.concatenate(
            self.data_source[random_indices])[:self.data_length]
        return iter(data)

    def __len__(self):
        return self.data_length
```
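For context, a minimal sketch of plugging the sampler into a `DataLoader`; the random label matrix and placeholder dataset below are illustrative assumptions, not code from the repo:

```python
import numpy as np
import torch
import torch.utils.data as tdata

# Hypothetical multi-hot label matrix: 2000 clips x 527 classes
labels = (np.random.rand(2000, 527) > 0.99).astype(int)
# Placeholder dataset: random features paired with the labels above
dataset = tdata.TensorDataset(torch.randn(2000, 501, 64),
                              torch.as_tensor(labels, dtype=torch.float32))

sampler = MinimumOccupancySampler(labels, sampling_mode='same')
# Passing a sampler replaces shuffle=True
loader = tdata.DataLoader(dataset, batch_size=64, sampler=sampler)
for feats, targets in loader:
    ...  # training step
```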


LucySha commented on July 30, 2024:

@RicherMans When I applied your sampler, the 527-class task worked better, but on the binary task (speech vs. non-speech) it was worse than your GPV-b. For training and testing, I mapped the label (speech, noises) to (1, 1) and (noises) to (0, 1), where the first number stands for the presence of speech and the second for noise. The original 527-class labels were used with the sampler to ensure each batch is balanced.


RicherMans commented on July 30, 2024:

So there are quite a few hyperparameters outside my "direct" control.
I usually do a stratified(-ish) split of the available 20k samples into 90% train and 10% CV.
The split is governed by a predefined seed. For these experiments I ran different seeds, which usually lead to different results.
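A minimal sketch of such a seeded, stratified-ish split; since sklearn's `train_test_split` cannot stratify on multi-label targets directly, stratifying on speech presence is used here as an approximation (the dummy labels, column index, and seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy multi-hot labels (N, 527); assume 'Speech' is column 0
labels = (np.random.rand(20000, 527) > 0.95).astype(int)
speech_present = labels[:, 0]  # coarse stratum: does the clip contain speech?

train_idx, cv_idx = train_test_split(np.arange(labels.shape[0]),
                                     test_size=0.10,      # 10% CV
                                     stratify=speech_present,
                                     random_state=1)      # predefined seed
```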

I just know that you won't be able to reproduce my results 1:1, meaning identical models, but you should definitely get something similar.

Just for reference, my binary results (macro-F1) for four different seeds on the Aurora clean scenario are:

| Run | macro-F1 |
|-----|----------|
| 1   | 91.368   |
| 2   | 86.242   |
| 3   | 85.190   |
| 4   | 94.116   |

I am well aware of the influence of training data and sampling, specifically for a binary classification task.

If your macro-F1 is around 85+, then it's all good.

> it was worse than your GPV-b.

I never published training results? You mean you just evaluated my model on your dataset?
Also, what dataset are you using? Aurora 4? Or some custom one?


LucySha commented on July 30, 2024:

@RicherMans I split the 20k clips into 90% train and 10% CV and used BCE loss as my loss function. I tested the models on my own recorded real data; each input was limited to 20-50 frames to check the results. Your GPV-b and GPV-f pretrained models performed much better than my models at the frame level, but I don't know why. Maybe I should run more tests to reproduce similar results.


RicherMans commented on July 30, 2024:

> I tested the models on my own recorded real data; each input was limited to 20-50 frames to check the results.

Sorry, I don't get that. You perform cross-validation every 20-50 batches?
What's your input feature? 40ms/20ms LMS?
Also, did you use double thresholding as the post-processing method?
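For readers, a minimal sketch of double thresholding over frame-wise speech probabilities; the two threshold values are assumptions, not the repo's exact settings:

```python
import numpy as np

def double_threshold(probs, hi=0.75, lo=0.2):
    """Seed segments where probs > hi, then grow them while probs > lo."""
    seeds = probs > hi
    candidates = probs > lo
    out = np.zeros(len(probs), dtype=bool)
    for s in np.flatnonzero(seeds):
        if out[s]:
            continue  # already covered by a grown segment
        left = s
        while left > 0 and candidates[left - 1]:
            left -= 1
        right = s
        while right < len(probs) - 1 and candidates[right + 1]:
            right += 1
        out[left:right + 1] = True
    return out
```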
Also:

> The original 527-class labels were used with the sampler to ensure each batch is balanced.

That seems wrong: you should always apply the sampling strategy based on your task's labels, not on some other labels. For binary classification, just sample according to (1, 1) and (0, 1).
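A minimal sketch of building that binary label matrix for the sampler, following the (speech, noise) convention above (the dummy labels and column index are assumptions):

```python
import numpy as np

# Dummy AudioSet labels (N, 527); assume 'Speech' is column 0
audioset_labels = (np.random.rand(20000, 527) > 0.95).astype(int)

has_speech = audioset_labels[:, 0]            # 1 if the clip contains speech
noise = np.ones_like(has_speech)              # every clip also contains noise
binary_labels = np.stack([has_speech, noise], axis=1)  # rows: (1, 1) or (0, 1)

sampler = MinimumOccupancySampler(binary_labels)
```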


LucySha commented on July 30, 2024:

@RicherMans I mean that during training the input clip was trimmed to 10s, while during testing the input clip was in the range (20ms, 50ms). When the input clip was 20ms, the input feature was a 20ms LMS.


RicherMans commented on July 30, 2024:

> I mean that during training the input clip was trimmed to 10s, while during testing the input clip was in the range (20ms, 50ms). When the input clip was 20ms, the input feature was a 20ms LMS.

Sorry, I don't get your meaning here.
Do training features differ from testing ones?
Do you input a single frame at a time during testing?


LucySha commented on July 30, 2024:

Yes, the training audio duration differs from the testing duration. Training clips are 10s; with a frame size of 20ms that is about 500 frames, so the input feature is 500 × 64. For testing, if the audio is 1s the input feature is 50 × 64, and if it is 500ms the input feature is 25 × 64.
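As an illustration of those shapes, a minimal log-mel extraction sketch; the sample rate, window size, and the use of librosa are assumptions about the frontend, not details from the thread:

```python
import numpy as np
import librosa

sr = 16000                                   # assumed sample rate
y = np.random.randn(10 * sr)                 # stand-in for a 10 s training clip
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=int(0.04 * sr),       # 40 ms window
                                     hop_length=int(0.02 * sr),  # 20 ms hop
                                     n_mels=64)
lms = np.log(mel + 1e-12).T
print(lms.shape)  # roughly (500, 64) for 10 s; (50, 64) for 1 s of audio
```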


RicherMans commented on July 30, 2024:

> Yes, the training audio duration differs from the testing duration. Training clips are 10s; with a frame size of 20ms that is about 500 frames, so the input feature is 500 × 64. For testing, if the audio is 1s the input feature is 50 × 64, and if it is 500ms the input feature is 25 × 64.

I see; that shouldn't influence performance.
So far I am just unsure about your training.
If you can provide some logs, that would be helpful.


LucySha commented on July 30, 2024:

@RicherMans Can I have your email if it's convenient for you? I'll send you the code.


LucySha commented on July 30, 2024:

For the task with speech labeled (1, 1) and non-speech labeled (0, 1), the first-batch training loss was 0.28 and the validation loss 0.26. After about 8-10 epochs the validation loss began to increase from 0.20, while the training loss continued to decrease.


RicherMans commented on July 30, 2024:

> Can I have your email if it's convenient for you? I'll send you the code.

Umm, if possible just add me on WeChat; the username is the same as here (richermans).

